perf: Optimize SpillingGrouper to avoid unnecessary disk I/O for small spill runs#19439

Open
maytasm wants to merge 5 commits into apache:master from maytasm:spill_file_improvement_v2

Conversation

Contributor

@maytasm maytasm commented May 9, 2026

perf: Optimize SpillingGrouper to avoid unnecessary disk I/O for small spill runs

Description

This is a followup to #19357

The spill batching logic (introduced to avoid creating thousands of tiny disk files) previously had to write to disk first and check the file size afterward. The serialized size isn't known upfront, and if it turned out to be large, buffering it entirely in memory before deciding would risk OOM. The safe path was therefore: always write to a temp file, then read it back into memory only if it was small enough to batch.
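For contrast, the previous safe path looks roughly like the sketch below. The method and class names are illustrative stand-ins, not the actual Druid code:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;

public class OldSpillPathSketch {
  // Illustrative sketch of the pre-PR flow: always write to a temp file,
  // then read it back and delete it if it turned out small enough to batch.
  static byte[] spillThenMaybeReadBack(byte[] serialized, long minSpillFileSize) throws IOException {
    File tmp = File.createTempFile("spill", ".bin");
    try (OutputStream out = new FileOutputStream(tmp)) {
      out.write(serialized); // disk write happens unconditionally
    }
    if (tmp.length() < minSpillFileSize) {
      byte[] bytes = Files.readAllBytes(tmp.toPath()); // read back...
      tmp.delete();                                    // ...and delete
      return bytes;                                    // batch in memory
    }
    return null; // large enough: keep the file on disk
  }

  public static void main(String[] args) throws IOException {
    byte[] small = new byte[100];
    byte[] back = spillThenMaybeReadBack(small, 1 << 20);
    // A small run incurs a full create/write/read/delete cycle
    // just to discover it could have stayed in memory.
    System.out.println(back.length); // 100
  }
}
```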

This is correct but expensive for certain cases. When groupBy queries produce spill runs whose serialized size is much smaller than their in-memory buffer (e.g., HLL sketches in sparse/SET mode serialize to a fraction of their pre-allocated buffer), this creates thousands of unnecessary file create/write/read/delete cycles just to discover the data was small enough to batch in memory.

SpillOutputStream addresses both concerns: it writes to a heap buffer first, and only when the buffer exceeds the threshold does it open a file and flush the accumulated bytes to disk. Large spills still go to disk (no OOM risk), but small spills never touch the filesystem. Peak extra heap usage is bounded by the threshold size (minSpillFileSize, default 1 MB).
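The core idea can be sketched as follows. The class and method names here are illustrative; the real SpillOutputStream also integrates with LimitedTemporaryStorage and enforces its limits:

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Minimal sketch of the threshold-based buffering idea (not the actual
// Druid implementation): buffer on heap until the threshold is crossed,
// then open the file and flush everything accumulated so far.
public class ThresholdSpillSketch extends OutputStream {
  private final long thresholdBytes;
  private final File file;
  private final ByteArrayOutputStream heapBuffer = new ByteArrayOutputStream();
  private OutputStream diskStream; // non-null once we've spilled to disk

  public ThresholdSpillSketch(long thresholdBytes, File file) {
    this.thresholdBytes = thresholdBytes;
    this.file = file;
  }

  public boolean spilledToDisk() {
    return diskStream != null;
  }

  public byte[] heapBytes() {
    return heapBuffer.toByteArray();
  }

  @Override
  public void write(int b) throws IOException {
    write(new byte[]{(byte) b}, 0, 1);
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    if (diskStream == null && heapBuffer.size() + len > thresholdBytes) {
      // Threshold exceeded: open the file now and flush accumulated bytes.
      diskStream = new FileOutputStream(file);
      heapBuffer.writeTo(diskStream);
      heapBuffer.reset();
    }
    if (diskStream != null) {
      diskStream.write(b, off, len);
    } else {
      heapBuffer.write(b, off, len);
    }
  }

  @Override
  public void close() throws IOException {
    if (diskStream != null) {
      diskStream.close();
    }
  }

  public static void main(String[] args) throws IOException {
    File f = File.createTempFile("spill", ".bin");
    f.deleteOnExit();
    try (ThresholdSpillSketch out = new ThresholdSpillSketch(1024, f)) {
      out.write(new byte[100]); // well under the threshold
      System.out.println(out.spilledToDisk()); // false: stayed on heap
    }
  }
}
```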

Key changed/added classes in this PR
  • Introduces SpillOutputStream, an OutputStream that buffers in memory and only spills to disk when the written bytes exceed the minSpillFileSize threshold. This eliminates the previous write-to-file → check-size → read-back → delete round-trip for small spill runs. It integrates with LimitedTemporaryStorage, so all storage limits are still enforced.
  • Refactors SpillingGrouper.spill() to serialize through SpillOutputStream instead of always creating a temp file first. The serialization logic is extracted into serializeToStream() to separate it from file lifecycle management.
  • Adds comprehensive unit tests for SpillOutputStream covering in-memory buffering, disk spillover, single-byte writes, threshold boundary behavior, and error handling.
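The serialization extraction can be sketched like this. serializeToStream here is a simplified stand-in for the real method (which writes grouper entries, not strings); the point is that it targets any OutputStream, leaving file lifecycle decisions to the caller:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class SerializeSketch {
  // Simplified stand-in for the extracted serializeToStream(): it writes
  // entries to whatever OutputStream the caller supplies (heap buffer or
  // file-backed stream), so serialization never touches file lifecycle.
  static void serializeToStream(OutputStream out, List<String> entries) throws IOException {
    for (String entry : entries) {
      out.write(entry.getBytes(StandardCharsets.UTF_8));
      out.write('\n');
    }
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream heap = new ByteArrayOutputStream();
    serializeToStream(heap, List.of("row1", "row2"));
    // The caller decides afterward whether these bytes are batched in
    // memory or already live in a file; the serializer doesn't care.
    System.out.println(heap.size()); // 10
  }
}
```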

Benchmark results

Before this PR:

Benchmark                                                  (initialBuckets)  (numProcessingThreads)  (numSegments)  (queryGranularity)  (rowsPerSegment)  (schemaAndQuery)  (vectorize)  Mode  Cnt       Score       Error  Units
GroupByBenchmark.queryMultiQueryableIndexWithSpilling                         -1                       4              4                 all            100000           basic.A        force  avgt   15  352910.001 ±  7477.535  us/op
GroupByBenchmark.queryMultiQueryableIndexWithSpillingTTFR                     -1                       4              4                 all            100000           basic.A        force  avgt   15  149027.367 ±  4829.306  us/op

New Benchmarks
bufferGrouperMaxSize=100 (spill size ~6 KB)
GroupByBenchmark.queryMultiQueryableIndexWithSmallSpilling                    -1                       4              4                 all            100000           basic.A        force  avgt   15  750775.683 ± 10103.062  us/op
GroupByBenchmark.queryMultiQueryableIndexWithSmallSpillingTTFR                -1                       4              4                 all            100000           basic.A        force  avgt   15  539507.482 ±  7332.067  us/op

bufferGrouperMaxSize=70000 (spill size ~4 MB)
GroupByBenchmark.queryMultiQueryableIndexWithLargeSpilling                    -1                       4              4                 all           1000000           basic.A        force  avgt   15  3529647.635 ±  69419.867  us/op
GroupByBenchmark.queryMultiQueryableIndexWithLargeSpillingTTFR                -1                       4              4                 all           1000000           basic.A        force  avgt   15  1358536.490 ± 139732.304  us/op

After this PR:

Benchmark                                                  (initialBuckets)  (numProcessingThreads)  (numSegments)  (queryGranularity)  (rowsPerSegment)  (schemaAndQuery)  (vectorize)  Mode  Cnt       Score       Error  Units
GroupByBenchmark.queryMultiQueryableIndexWithSpilling                    -1                       4              4                 all            100000           basic.A        force  avgt   15  344787.202 ± 13998.456  us/op
GroupByBenchmark.queryMultiQueryableIndexWithSpillingTTFR                -1                       4              4                 all            100000           basic.A        force  avgt   15  141381.603 ±  3583.413  us/op

New Benchmarks
bufferGrouperMaxSize=100 (spill size ~6 KB)
GroupByBenchmark.queryMultiQueryableIndexWithSmallSpilling                    -1                       4              4                 all            100000           basic.A        force  avgt   15  420431.477 ± 5600.233  us/op
GroupByBenchmark.queryMultiQueryableIndexWithSmallSpillingTTFR                -1                       4              4                 all            100000           basic.A        force  avgt   15   195450.731 ± 2552.290  us/op

bufferGrouperMaxSize=70000 (spill size ~4 MB)
GroupByBenchmark.queryMultiQueryableIndexWithLargeSpilling                    -1                       4              4                 all           1000000           basic.A        force  avgt   15  3569103.927 ± 282616.750  us/op
GroupByBenchmark.queryMultiQueryableIndexWithLargeSpillingTTFR                -1                       4              4                 all           1000000           basic.A        force  avgt   15  1243519.844 ±  80173.130  us/op

The existing queryMultiQueryableIndexWithSpilling/queryMultiQueryableIndexWithSpillingTTFR benchmarks use bufferGrouperMaxSize=4000, which produces reasonably sized spills of ~200 KB. I have also added new benchmarks with the same idea but producing spill files at the extreme ends of the size range. queryMultiQueryableIndexWithSmallSpilling/queryMultiQueryableIndexWithSmallSpillingTTFR set bufferGrouperMaxSize=100, producing spills of ~6 KB each, which results in more batching. queryMultiQueryableIndexWithLargeSpilling/queryMultiQueryableIndexWithLargeSpillingTTFR set bufferGrouperMaxSize=70000, producing spills of ~4 MB each, which skips batching. These new benchmarks are not added to the PR since they are the same as queryMultiQueryableIndexWithSpilling/queryMultiQueryableIndexWithSpillingTTFR, just with different config values.

Default spilling (bufferGrouperMaxSize=4000) — within noise:
                                                                                                   
  ┌───────────┬───────────────┬───────────────┬───────┐
  │ Benchmark │      OLD      │      NEW      │ Delta │
  ├───────────┼───────────────┼───────────────┼───────┤
  │ Spilling  │ 352,910 us/op │ 344,787 us/op │ -2.3% │                                            
  ├───────────┼───────────────┼───────────────┼───────┤
  │ TTFR      │ 149,027 us/op │ 141,382 us/op │ -5.1% │
  └───────────┴───────────────┴───────────────┴───────┘
Error bars overlap, so the result is statistically neutral. Spills here are moderately sized and few in number, which is not the target scenario for this optimization.

Small spills (~6 KB each, bufferGrouperMaxSize=100) — huge win:

  ┌───────────┬───────────────┬───────────────┬───────┐
  │ Benchmark │      OLD      │      NEW      │ Delta │                                            
  ├───────────┼───────────────┼───────────────┼───────┤
  │ Spilling  │ 750,776 us/op │ 420,431 us/op │ -44%  │
  ├───────────┼───────────────┼───────────────┼───────┤
  │ TTFR      │ 539,507 us/op │ 195,451 us/op │ -64%  │
  └───────────┴───────────────┴───────────────┴───────┘
This is the sweet spot for the optimization. Each spill is ~6 KB, well under the 1 MB MIN_SPILL_FILE_BYTES threshold, so they stay entirely in memory — no file create/write/read/delete round-trip. The TTFR improvement is even larger because the first result no longer waits on disk I/O for early spills.

Large spills (~4 MB each, bufferGrouperMaxSize=70000) — neutral:
  
  ┌───────────┬─────────────────┬─────────────────┬──────────────────────────┐
  │ Benchmark │       OLD       │       NEW       │          Delta           │
  ├───────────┼─────────────────┼─────────────────┼──────────────────────────┤
  │ Spilling  │ 3,529,648 us/op │ 3,569,104 us/op │ +1.1% (noise)            │
  ├───────────┼─────────────────┼─────────────────┼──────────────────────────┤                     
  │ TTFR      │ 1,358,536 us/op │ 1,243,520 us/op │ -8.5% (large error bars) │
  └───────────┴─────────────────┴─────────────────┴──────────────────────────┘
Spills exceed the 1 MB threshold and go to disk in both versions, so there is no meaningful difference.


The optimization delivers exactly where designed: many small spills that previously hit disk now stay in memory, eliminating disk I/O and saving ~330,000 us/op overall and ~344,000 us/op to first result in the small-spill case, a 44-64% latency reduction. Large spills exceed the 1 MB threshold and hit disk regardless, so they are unaffected. No regressions.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@maytasm maytasm marked this pull request as draft May 9, 2026 02:35
@maytasm maytasm changed the title Optimize Spill files more v2 perf: Optimizes SpillingGrouper spill logic May 14, 2026
@maytasm maytasm changed the title perf: Optimizes SpillingGrouper spill logic perf: Optimize SpillingGrouper to avoid unnecessary disk I/O for small spill runs May 14, 2026
@maytasm maytasm marked this pull request as ready for review May 14, 2026 05:04
Contributor

Copilot AI left a comment


Pull request overview

This PR optimizes GroupBy spilling by introducing an output stream that buffers small spill runs in memory and only creates disk files when the serialized spill exceeds the configured threshold, reducing unnecessary file I/O for small spills.

Changes:

  • Adds SpillOutputStream to switch from heap buffering to LimitedTemporaryStorage only after the threshold is exceeded.
  • Refactors SpillingGrouper spill serialization to use the new stream while preserving pending-run batching and disk-spill behavior.
  • Adds and updates unit tests for in-memory spill handling, threshold behavior, disk fallback, and storage-limit enforcement.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

Files reviewed:
  • processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/SpillOutputStream.java: Adds the threshold-aware spill output stream.
  • processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/SpillingGrouper.java: Routes grouper spill serialization through SpillOutputStream.
  • processing/src/test/java/org/apache/druid/query/groupby/epinephelinae/SpillOutputStreamTest.java: Adds unit coverage for the new stream behavior.
  • processing/src/test/java/org/apache/druid/query/groupby/epinephelinae/SpillingGrouperTest.java: Updates spilling tests for in-memory small-spill behavior and storage-limit scenarios.


Member

@FrankChen021 FrankChen021 left a comment


I have reviewed the code for correctness, edge cases, concurrency, and integration risks; no issues found.

Reviewed 4 of 4 changed files.


This is an automated review by Codex GPT-5.5

