
fix: Optimizes SpillingGrouper for high cardinality dimension(s) GroupBy with large memory footprint aggregators#19357

Merged
maytasm merged 8 commits into
apache:masterfrom
maytasm:spill_file_improvement
May 7, 2026

Conversation

Contributor

@maytasm maytasm commented Apr 21, 2026

Optimizes SpillingGrouper for high cardinality dimension(s) GroupBy with large memory footprint aggregators

Description

tldr; Batch small spill files in SpillingGrouper to reduce file count

Problem

When aggregators like HLL sketches (or ThetaSketch) are used in groupBy queries, the BufferHashGrouper pre-allocates a large fixed-size buffer per group slot (e.g. ~64KB per slot for HLL with lgK=16). This causes the in-memory grouper to fill up quickly and spill frequently (as the number of unique groups the BufferHashGrouper can store in memory is low). However, when each key has only been seen a few times (i.e. high cardinality dimension(s) GroupBy), the sketch serializes to a compact form of just a handful of bytes (HLL in List mode). The result is thousands of tiny spill files on disk — each one only a few KB despite the grouper buffer being full.

A large number of spill files can cause OOM when GroupBy merges spill files by opening all of them simultaneously. We previously prevented this by adding a guardrail in #19141, which would fail the query. This PR fixes the issue and allows the query to succeed.

Fix

Instead of writing every grouper flush directly to its own spill file, small flushes (serialized size < 1MB) are batched in heap memory. When the accumulated pending bytes reach the 1MB threshold, all pending runs are merge-sorted and written as a single consolidated file. Large flushes (>= 1MB) bypass batching and go directly to disk as before.

The approach avoids holding the full serialized data in heap during the spill by streaming directly to a temp file first, then reading back only small files (< 1MB) into memory for batching. This prevents OOM risk when the serialized size happens to be close to the buffer size.

Key changed/added classes in this PR

All changes are in SpillingGrouper.java:

  • spill(): Streams grouper contents to a temp file first. If the file is small (< MIN_SPILL_FILE_BYTES), reads it back into pendingSpillRuns and deletes the temp file. If large, keeps it on disk. Triggers flushPendingRunsToDisk() when accumulated pending bytes reach the threshold.
  • flushPendingRunsToDisk() (new): Deserializes all pending in-memory runs, merge-sorts them (each run is individually sorted from grouper.iterator(true)), and writes the merged result as a single sorted spill file. Also writes a dictionary file for the accumulated dictionary entries.
  • deserializeIterator() (new / refactor): Extracted the deserialization transform (deserialize aggregator values, Integer→Long coercion) that was previously inlined in iterator(). Now shared between iterator() and flushPendingRunsToDisk().
  • reset() / close(): Clear pending state (pendingSpillRuns, pendingSpillBytes, pendingDictionaryEntries).
  • iterator(): Calls flushPendingRunsToDisk() before iteration to flush any runs that didn't reach the threshold during the spill phase.
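The batching flow described above can be sketched with plain JDK types. This is a hypothetical stand-in (the real SpillingGrouper streams to temp files via LimitedTemporaryStorage and merge-sorts the runs); here the file I/O is reduced to a counter so only the batching decision is visible:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the spill-batching decision in SpillingGrouper.spill().
// MIN_SPILL_FILE_BYTES and the field names come from the PR description;
// everything else is an illustrative simplification.
class SpillBatcherSketch
{
  static final long MIN_SPILL_FILE_BYTES = 1024 * 1024L; // 1MB threshold

  final List<byte[]> pendingSpillRuns = new ArrayList<>();
  long pendingSpillBytes = 0;
  int filesWritten = 0; // stands in for files kept on disk

  // Simulates spill(): a serialized run either joins the pending batch
  // (if small) or is "kept on disk" directly (if large).
  void spill(byte[] serializedRun)
  {
    if (serializedRun.length < MIN_SPILL_FILE_BYTES) {
      pendingSpillRuns.add(serializedRun);
      pendingSpillBytes += serializedRun.length;
      if (pendingSpillBytes >= MIN_SPILL_FILE_BYTES) {
        flushPendingRunsToDisk();
      }
    } else {
      filesWritten++; // large run bypasses batching, one file per flush
    }
  }

  // Simulates flushPendingRunsToDisk(): all pending runs become one
  // merge-sorted file; pending state is cleared.
  void flushPendingRunsToDisk()
  {
    if (pendingSpillRuns.isEmpty()) {
      return;
    }
    filesWritten++;
    pendingSpillRuns.clear();
    pendingSpillBytes = 0;
  }
}
```

With ~6KB runs (the small-spilling benchmark scenario), hundreds of spills collapse into a handful of consolidated files instead of one file each.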

Memory overhead

Peak additional heap usage is bounded by ~2 × MIN_SPILL_FILE_BYTES (2MB): up to 1MB of accumulated pending byte arrays plus one file being read back. The grouper buffer itself is an off-heap direct ByteBuffer and is reset before the read-back.

Optimizations considered but not pursued

Combine duplicate keys during flush (combineByKey)

During flushPendingRunsToDisk(), we tried deserializing all pending runs, combining entries with duplicate keys by merging their aggregator values (using AggregatorFactory.combine()), then serializing the deduplicated result. The idea was to reduce the output file size and the final merge fan-in.

Flame graph analysis showed combineByKey itself consumed only 28 CPU samples — negligible overhead. However, wall-clock time regressed from 180s to 190s (+5.5%). The cause: even though the CPU cost of combine() itself was small, the synchronous deserialize-combine-reserialize pipeline added latency to the processing thread's critical path. The processing thread cannot ingest new rows while flushing, so any additional work during the flush directly extends the stall. In this workload, the keys had low duplication across the small runs, so combineByKey eliminated very few entries while paying the full deserialization cost for every entry.

Reducing merge fan-in with cascaded intermediate merges

Instead of writing all pending runs as a single merged file, we considered a cascaded merge approach — progressively merging spill files on disk (similar to an external merge sort with bounded fan-in) to keep the final iterator() merge across a small number of large files. This was not pursued because the batching approach already achieves the same goal more simply: by accumulating small runs in memory and writing them as one file, the number of files at iterator() time is naturally reduced by orders of magnitude (e.g., from ~2,000 to ~dozens). Cascaded disk merges would add I/O amplification (each entry written to disk multiple times) for marginal additional fan-in reduction.

Potential future improvements

Reducing sketch initialization cost

Flame graph profiling shows that HllSketchMergeBufferAggregatorHelper.createNewUnion() accounts for 50.7% of the processing thread's CPU time — the dominant bottleneck. Every time a new group slot is initialized in the BufferHashGrouper, init() calls createNewUnion() which copies a pre-built empty Union image (~65KB for lgK=16) into the buffer via initializeEmptyUnion(), then wraps it with Union.writableWrap(). During hash table growth (adjustTableWhenFull()), every existing slot is relocated via relocate(), which calls createNewUnion() again. With frequent spills, the grouper repeatedly fills, spills, resets, and re-initializes — multiplying the init cost.

Possible approaches:

  1. Variable-size buffer slots in BufferHashGrouper: Currently, ByteBufferHashTable uses fixed-width buckets (bucketSizeWithHash = HASH_SIZE + keySize + aggregators.spaceNeeded()), where spaceNeeded() is the maximum intermediate size. For HLL with lgK=16, this pre-allocates ~65KB per slot even when the sketch is in List mode (few distinct values, only a handful of bytes). If the hash table supported variable-width slots — starting small and growing on demand as sketches transition from List → Set → HLL mode — far more groups would fit in the buffer before spilling. This would dramatically reduce the number of spills and proportionally reduce createNewUnion() calls. However, this is a fundamental architectural change to ByteBufferHashTable, which relies on fixed-width contiguous buckets for O(1) offset calculation and linear probing.

  2. Lazy sketch initialization: Instead of initializing a full Union object when a new bucket is created, defer the expensive createNewUnion() call until the first aggregation value is actually merged. For groups that are spilled before receiving many values, this could avoid the full initialization cost entirely. The BufferAggregator interface would need to support deferred init — e.g., storing a sentinel in the buffer and lazily initializing on first aggregate() call.
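A minimal sketch of the sentinel idea in approach 2, assuming a hypothetical aggregator class; Druid's actual BufferAggregator interface and the real Union image copy are not modeled:

```java
import java.nio.ByteBuffer;

// Hedged sketch of lazy sketch initialization: init() only writes a one-byte
// "uninitialized" sentinel, and the expensive work (standing in for the
// ~65KB createNewUnion() buffer copy) is deferred to the first aggregate()
// call on that slot. All names here are illustrative.
class LazyInitAggregatorSketch
{
  static final byte UNINITIALIZED = 0;
  static final byte INITIALIZED = 1;

  int expensiveInits = 0; // counts how many full initializations actually ran

  // Cheap: just mark the slot as "not yet materialized".
  void init(ByteBuffer buf, int position)
  {
    buf.put(position, UNINITIALIZED);
  }

  // Expensive work runs only on first use of the slot.
  void aggregate(ByteBuffer buf, int position, long value)
  {
    if (buf.get(position) == UNINITIALIZED) {
      materialize(buf, position);
    }
    // ... merge `value` into the sketch stored after the sentinel byte ...
  }

  private void materialize(ByteBuffer buf, int position)
  {
    buf.put(position, INITIALIZED);
    expensiveInits++; // in the real case: copy the empty Union image here
  }
}
```

Slots that are created but spilled before ever receiving a value would skip the expensive initialization entirely.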

Benchmark results

Before this PR:

Benchmark                                                  (initialBuckets)  (numProcessingThreads)  (numSegments)  (queryGranularity)  (rowsPerSegment)  (schemaAndQuery)  (vectorize)  Mode  Cnt       Score       Error  Units
GroupByBenchmark.queryMultiQueryableIndexTTFR                            -1                       4              4                 all            100000           basic.A        force  avgt   15   42116.148 ±  2627.574  us/op
GroupByBenchmark.queryMultiQueryableIndexWithSerde                       -1                       4              4                 all            100000           basic.A        force  avgt   15  103143.571 ± 19105.285  us/op
GroupByBenchmark.queryMultiQueryableIndexWithSpilling                    -1                       4              4                 all            100000           basic.A        force  avgt   15  388720.771 ± 15675.278  us/op
GroupByBenchmark.queryMultiQueryableIndexWithSpillingTTFR                -1                       4              4                 all            100000           basic.A        force  avgt   15   95759.378 ±  1799.533  us/op
GroupByBenchmark.queryMultiQueryableIndexX                               -1                       4              4                 all            100000           basic.A        force  avgt   15   66064.295 ±  8078.106  us/op

New Benchmarks

bufferGrouperMaxSize=100 (spill size ~6 KB)
GroupByBenchmark.queryMultiQueryableIndexWithSmallSpilling                    -1                       4              4                 all            100000           basic.A        force  avgt   15  1422480.170 ± 73222.415  us/op
GroupByBenchmark.queryMultiQueryableIndexWithSmallSpillingTTFR                -1                       4              4                 all            100000           basic.A        force  avgt   15   938728.684 ± 54411.363  us/op

bufferGrouperMaxSize=70000 (spill size ~4 MB)
GroupByBenchmark.queryMultiQueryableIndexWithLargeSpilling                    -1                       4              4                 all           1000000           basic.A        force  avgt   15  5179514.738 ± 79378.483  us/op
GroupByBenchmark.queryMultiQueryableIndexWithLargeSpillingTTFR                -1                       4              4                 all           1000000           basic.A        force  avgt   15  1647320.687 ± 62434.948  us/op

After this PR:

Benchmark                                                  (initialBuckets)  (numProcessingThreads)  (numSegments)  (queryGranularity)  (rowsPerSegment)  (schemaAndQuery)  (vectorize)  Mode  Cnt       Score       Error  Units
GroupByBenchmark.queryMultiQueryableIndexTTFR                            -1                       4              4                 all            100000           basic.A        force  avgt   15   53740.436 ±  8723.658  us/op
GroupByBenchmark.queryMultiQueryableIndexWithSerde                       -1                       4              4                 all            100000           basic.A        force  avgt   15   99847.941 ±  5631.568  us/op
GroupByBenchmark.queryMultiQueryableIndexWithSpilling                    -1                       4              4                 all            100000           basic.A        force  avgt   15  372782.928 ± 31551.417  us/op
GroupByBenchmark.queryMultiQueryableIndexWithSpillingTTFR                -1                       4              4                 all            100000           basic.A        force  avgt   15  176957.932 ± 24544.399  us/op
GroupByBenchmark.queryMultiQueryableIndexX                               -1                       4              4                 all            100000           basic.A        force  avgt   15   75079.741 ± 23374.447  us/op


New Benchmarks

bufferGrouperMaxSize=100 (spill size ~6 KB)
GroupByBenchmark.queryMultiQueryableIndexWithSmallSpilling                    -1                       4              4                 all            100000           basic.A        force  avgt   15  843467.611 ± 81698.388  us/op
GroupByBenchmark.queryMultiQueryableIndexWithSmallSpillingTTFR                -1                       4              4                 all            100000           basic.A        force  avgt   15  618040.133 ± 64764.729  us/op

bufferGrouperMaxSize=70000 (spill size ~4 MB)
GroupByBenchmark.queryMultiQueryableIndexWithLargeSpilling                    -1                       4              4                 all           1000000           basic.A        force  avgt   15  4997633.779 ± 124961.464  us/op
GroupByBenchmark.queryMultiQueryableIndexWithLargeSpillingTTFR                -1                       4              4                 all           1000000           basic.A        force  avgt   15  2413843.500 ± 856006.565  us/op

The existing queryMultiQueryableIndexWithSpilling/queryMultiQueryableIndexWithSpillingTTFR benchmarks use bufferGrouperMaxSize=4000, which produces reasonable spills of ~200 KB. I have also added new benchmarks with the same idea but producing spill files at both extremes of size. queryMultiQueryableIndexWithSmallSpilling/queryMultiQueryableIndexWithSmallSpillingTTFR set bufferGrouperMaxSize=100, producing spill sizes of ~6 KB; this results in more batching. queryMultiQueryableIndexWithLargeSpilling/queryMultiQueryableIndexWithLargeSpillingTTFR set bufferGrouperMaxSize=70000, producing spill sizes of ~4 MB; this skips batching. These new benchmarks are not added to the PR since they are essentially the same as queryMultiQueryableIndexWithSpilling/queryMultiQueryableIndexWithSpillingTTFR, just with different config values.

Non-spilling benchmarks (unchanged code path):

  ┌───────────────────────────────────┬────────────┬──────────┬───────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────┐   
  │             Benchmark             │   Before   │  After   │ Delta │                                                  Notes                                                   │
  ├───────────────────────────────────┼────────────┼──────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┤   
  │ queryMultiQueryableIndexX         │ 66,064 ±8K │ 67,746   │ +3%   │ No spilling. Within error bars. Noise.                                                                   │
  │                                   │            │ ±8K      │       │                                                                                                          │
  ├───────────────────────────────────┼────────────┼──────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┤   
  │ queryMultiQueryableIndexWithSerde │ 103,144    │ 87,389   │ -15%  │ No spilling. Looks like an improvement but the "before" had huge error bars (±19K). The "after" run just │   
  │                                   │ ±19K       │ ±2K      │       │  had less variance. Not attributable to the change.                                                      │   
  ├───────────────────────────────────┼────────────┼──────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┤   
  │ queryMultiQueryableIndexTTFR      │ 42,116     │ 41,499   │ -1%   │ No spilling. Identical within error bars.                                                                │
  │                                   │ ±2.6K      │ ±3K      │       │                                                                                                          │   
  └───────────────────────────────────┴────────────┴──────────┴───────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────┘
                                                                                                                                                                                     
  No real change, as expected — these don't hit the spill path.                                                                                                                      
  
  ---                                                                                                                                                                                
  Default spilling (bufferGrouperMaxSize=4000, spill size ~250KB — batching applies):
                                                                                                                                                                                     
  ┌──────────────────┬──────────┬──────────┬───────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
  │    Benchmark     │  Before  │  After   │ Delta │                                                            Notes                                                            │   
  ├──────────────────┼──────────┼──────────┼───────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ WithSpilling     │ 388,721  │ 343,351  │ -12%  │ Improvement. Each spill (~250KB) is under 1MB, so batching kicks in — many small spills are accumulated in memory and       │
  │                  │ ±16K     │ ±56K     │       │ merged into fewer disk files, reducing file I/O. Error bars overlap though, so the magnitude is uncertain.                  │
  ├──────────────────┼──────────┼──────────┼───────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤   
  │                  │          │          │       │ TTFR regression. Before: spill files are written individually during spill(), so they're ready to iterate immediately.      │
  │ WithSpillingTTFR │ 95,759   │ 170,717  │ +78%  │ After: small spills are held in memory, and iterator() calls flushPendingRunsToDisk() to merge-sort and write them before   │   
  │                  │ ±1.8K    │ ±24K     │       │ returning the first row. That extra flush delays the first row. Doesn't affect total query time since GroupBy materializes  │
  │                  │          │          │       │ all rows before returning.                                                                                                  │   
  └──────────────────┴──────────┴──────────┴───────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

  ---
  Small spilling (bufferGrouperMaxSize=100, spill size ~6KB — batching's sweet spot):

  ┌───────────────────────┬─────────────┬───────────┬───────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
  │       Benchmark       │   Before    │   After   │ Delta │                                                       Notes                                                        │
  ├───────────────────────┼─────────────┼───────────┼───────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ WithSmallSpilling     │ 1,422,480   │ 843,468   │ -41%  │ Large win. Without batching: thousands of tiny ~6KB files created on disk. With batching: accumulated in memory,   │
  │                       │ ±73K        │ ±82K      │       │ merged into a handful of files. Eliminates massive filesystem overhead (open/write/close/seek per file).           │
  ├───────────────────────┼─────────────┼───────────┼───────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤   
  │                       │ 938,729     │ 618,040   │       │ Also faster. Even with the flushPendingRunsToDisk() cost in iterator(), the total work is far less than writing    │
  │ WithSmallSpillingTTFR │ ±54K        │ ±65K      │ -34%  │ thousands of individual files during spill(). The batching saves so much I/O that TTFR improves too despite the    │   
  │                       │             │           │       │ deferred flush.                                                                                                    │
  └───────────────────────┴─────────────┴───────────┴───────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘   
                                                         
  This is the target scenario. Aggregators like ThetaSketch that pre-allocate large buffers but serialize small produce exactly this pattern.                                        
  
  ---                                                                                                                                                                                
  Large spilling (bufferGrouperMaxSize=70000, spill size ~4MB — bypasses batching):

  ┌───────────────────────┬────────────┬────────────┬───────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
  │       Benchmark       │   Before   │   After    │ Delta │                                                       Notes                                                       │
  ├───────────────────────┼────────────┼────────────┼───────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤    
  │ WithLargeSpilling     │ 5,179,515  │ 4,997,634  │ -4%   │ Within error bars. Each spill is ~4MB (>1MB threshold), so it goes directly to disk — the batching path is        │
  │                       │ ±79K       │ ±125K      │       │ skipped entirely. No change expected.                                                                             │    
  ├───────────────────────┼────────────┼────────────┼───────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤    
  │                       │ 1,647,321  │ 2,103,661  │       │ TTFR regression. Large spills bypass batching, but iterator() still calls flushPendingRunsToDisk() (which is a    │
  │ WithLargeSpillingTTFR │ ±62K       │ ±96K       │ +28%  │ no-op since pendingSpillRuns is empty). This shouldn't cause a 28% regression for a no-op call. More likely       │    
  │                       │            │            │       │ benchmark variance — the ±96K error bars are significant. Could also be environmental noise between runs.         │
  └───────────────────────┴────────────┴────────────┴───────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘    
         

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

try {
  for (final byte[] runBytes : pendingSpillRuns) {
    readers.add(spillMapper.readValues(
        spillMapper.getFactory().createParser(new LZ4BlockInputStream(new ByteArrayInputStream(runBytes))),
@maytasm maytasm changed the title Optimizes SpillingGrouper for high cardinality group by dimensions Optimizes SpillingGrouper for high cardinality group by dimensions with large memory footprint aggregators Apr 21, 2026
@maytasm maytasm changed the title Optimizes SpillingGrouper for high cardinality group by dimensions with large memory footprint aggregators Optimizes SpillingGrouper for high cardinality dimension(s) group by with large memory footprint aggregators Apr 22, 2026
Member

@FrankChen021 FrankChen021 left a comment


The changes LGTM, no correctness issues found.

@maytasm maytasm marked this pull request as draft April 25, 2026 08:23
@maytasm maytasm marked this pull request as ready for review May 4, 2026 19:27
@maytasm maytasm changed the title Optimizes SpillingGrouper for high cardinality dimension(s) group by with large memory footprint aggregators Optimizes SpillingGrouper for high cardinality dimension(s) GroupBy with large memory footprint aggregators May 4, 2026
@maytasm maytasm changed the title Optimizes SpillingGrouper for high cardinality dimension(s) GroupBy with large memory footprint aggregators fix: Optimizes SpillingGrouper for high cardinality dimension(s) GroupBy with large memory footprint aggregators May 4, 2026
@maytasm maytasm requested a review from jtuglu1 May 4, 2026 23:09
@maytasm maytasm force-pushed the spill_file_improvement branch from e832d03 to 343994d Compare May 4, 2026 23:10
@jtuglu1
Contributor

jtuglu1 commented May 4, 2026

Before I review, can we please include benchmarks before/after for this change?

Member

@FrankChen021 FrankChen021 left a comment


Severity findings: P0: 0, P1: 1, P2: 0, P3: 0 (total: 1)

This is an automated review by Codex GPT-5


final long fileSize = file.length();
if (fileSize < MIN_SPILL_FILE_BYTES) {
  pendingSpillRuns.add(Files.readAllBytes(file.toPath()));
Member


[P1] Deleted staging spills still consume the disk quota

This path writes every small spill through LimitedTemporaryStorage, reads it back into heap, then deletes the temp file. LimitedTemporaryStorage.delete only removes the file from the file set; it does not decrement bytesUsed, and LimitedOutputStream.grab has already charged those bytes against maxOnDiskStorage. As a result, high-cardinality small-spill queries can hit TemporaryStorageFullException even though those staging files were deleted and no persistent spill file exists yet, and later flushes double-charge the same data when writing the merged file. This undermines the batching optimization and can fail queries well below their configured on-disk limit; small runs should avoid charging LimitedTemporaryStorage or the accounting needs to refund deleted staging bytes.

Contributor Author


Fixed

Contributor Author

maytasm commented May 5, 2026

Before I review, can we please include benchmarks before/after for this change?

@jtuglu1 Added benchmark results.

Contributor

@jtuglu1 jtuglu1 left a comment


few comments

* and writing to disk only once this threshold is reached, we avoid that explosion in file count without any
* extra disk I/O for small spills.
*/
private static final long MIN_SPILL_FILE_BYTES = 1024 * 1024L; // 1MB
Contributor


let's make this a config. sketch sizes can vary with the columns they cover and the k-values.

Contributor Author


Done

try {
  Files.delete(file.toPath());
}
catch (IOException e) {
  log.warn(e, "Cannot delete file: %s", file);
}
files.remove(file);
bytesUsed.addAndGet(-fileSize);
Contributor

@jtuglu1 jtuglu1 May 5, 2026


if this delete fails, I'd rather not have our accounting also be inaccurate – can we put this in the try, after the delete? IMO, our accounting should always be an overestimate to avoid actual disk space issues where worse exceptions can happen (e.g. I'd rather throw a query error cleanly and debug that than see historicals/peons start crashing due to underestimated statistics leading to disk space errors).

Contributor Author


Done
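The suggested ordering (refund the accounting only after the delete succeeds) might look roughly like this standalone approximation; the class and method names are hypothetical, not the merged Druid code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of overestimate-safe accounting: bytesUsed is decremented only on a
// successful delete, so a failed delete leaves the quota charged rather than
// undercounting actual disk usage.
class SafeDeleteSketch
{
  final AtomicLong bytesUsed = new AtomicLong();

  // Returns true if the file was deleted and its bytes were refunded.
  boolean deleteAndRefund(Path file, long fileSize)
  {
    try {
      Files.delete(file);
      // Refund only on success: accounting never undercounts disk usage.
      bytesUsed.addAndGet(-fileSize);
      return true;
    }
    catch (IOException e) {
      // Delete failed: keep the bytes charged so the limit stays conservative.
      return false;
    }
  }
}
```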

@@ -293,6 +320,22 @@ public void setSpillingAllowed(final boolean spillingAllowed)
@Override
public CloseableIterator<Entry<KeyType>> iterator(final boolean sorted)
{
  // Flush any runs that did not reach MIN_SPILL_FILE_BYTES during the spill phase.
  try {
    flushPendingRunsToDisk();
Contributor


we should see what the overhead of sorted=false is here. If we don't need a sorted run as the end result, we can just do a simple concat to avoid the decompress/re-sort overhead of merge sort. I think we might need to condition this on sortHasNonGroupingFields=false too.

Contributor Author


Overhead breakdown of flushPendingRunsToDisk():

  1. LZ4 decompress — fast (~GB/s)
  2. JSON parse — moderate (dominant cost)
  3. Merge-sort comparison — cheap (O(N log K), K = few pending runs)
  4. JSON serialize — moderate (dominant cost)
  5. LZ4 compress + write — fast

Replacing mergeSorted with concat (step 3) saves very little — the JSON serde in steps 2+4 dominates.

The other approach is to write each pending run's raw byte[] sequentially into one file (each is already a complete LZ4+JSON stream). At read time, create one iterator per sub-stream. The catch with this approach is that LZ4BlockInputStream stops at each stream boundary, so reading N streams from one file requires creating N LZ4BlockInputStream instances on the same underlying FileInputStream. LZ4BlockInputStream allocates a single decompression buffer (default 64KB, matching LZ4BlockOutputStream's default block size). With a lot of spills (the scenario we are trying to fix, with large aggregators + high cardinality group bys), these LZ4BlockInputStream buffers add up, resulting in OOM like before.

The merge-sort serde cost in flushPendingRunsToDisk() is the price we pay for keeping both file count and read-time memory bounded. And as noted earlier, replacing mergeSorted with concat alone saves very little since JSON serde dominates the cost.

Contributor Author


Btw I wanted to benchmark SpillingGrouper.iterator(false) code path but turns out that SpillingGrouper.iterator(false) is never reachable through any production query path without a code change.

The only place SpillingGrouper.iterator() is called is:

  • RowBasedGrouperHelper:634 → iterator(true) (always)
  • ConcurrentGrouper:430/470 → iterator(true) or iterator(sorted) (where sorted comes from ConcurrentGrouper.iterator(sorted), which is always called with true from
    RowBasedGrouperHelper)

Contributor Author


Looks like the reason iterator(true) is hardcoded in RowBasedGrouperHelper:634 is that the merge layer above it relies on sorted input — CombiningIterator in ConcurrentGrouper and the broker merge both detect duplicate keys by comparing consecutive sorted entries. So sorted=true is for merge correctness, not output ordering.
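A toy illustration of why adjacency matters, using plain JDK types rather than Druid's CombiningIterator (the class and method names are hypothetical):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Combining duplicate keys by comparing consecutive entries only works when
// equal keys are adjacent, i.e. when the input iterator is sorted. This is
// the reason iterator(true) is required for merge correctness.
class CombineSortedSketch
{
  // Sums the values of adjacent entries that share a key.
  static List<Map.Entry<String, Long>> combine(List<Map.Entry<String, Long>> entries)
  {
    final List<Map.Entry<String, Long>> out = new ArrayList<>();
    for (Map.Entry<String, Long> e : entries) {
      if (!out.isEmpty() && out.get(out.size() - 1).getKey().equals(e.getKey())) {
        // Same key as the previous output entry: merge into it.
        Map.Entry<String, Long> prev = out.remove(out.size() - 1);
        out.add(new SimpleEntry<>(prev.getKey(), prev.getValue() + e.getValue()));
      } else {
        out.add(new SimpleEntry<>(e.getKey(), e.getValue()));
      }
    }
    return out;
  }
}
```

On sorted input the duplicates collapse; on unsorted input the same key slips through twice, which is exactly the bug sorted=true prevents.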

Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we add a comment for this on the method? Without looking at all the call-sites, it's tough to see why we ignore the sort param.

Contributor Author

Done

Contributor

Now that we are properly tracking bytes during deletion, I wonder if we should also look at deleting dictionaryFiles as well here.

Contributor Author

Right — deleteFiles() only deletes data files but not dictionaryFiles. Those are also tracked by LimitedTemporaryStorage and consume disk quota. Fixed.
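The shape of that fix can be sketched as follows (class and method names here are hypothetical stand-ins, not the actual SpillingGrouper/LimitedTemporaryStorage code): deletion walks both file lists and refunds each file's size to the byte accounting.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Hypothetical sketch of the fix: delete data AND dictionary spill files,
// refunding each file's size to the temporary-storage quota as it goes.
public class SpillFileCleanup {
  private final List<Path> dataFiles = new ArrayList<>();
  private final List<Path> dictionaryFiles = new ArrayList<>();
  private long trackedBytes; // stands in for LimitedTemporaryStorage accounting

  void track(Path file, boolean dictionary) throws IOException {
    (dictionary ? dictionaryFiles : dataFiles).add(file);
    trackedBytes += Files.size(file);
  }

  void deleteFiles() throws IOException {
    for (List<Path> files : List.of(dataFiles, dictionaryFiles)) {
      for (Path file : files) {
        long size = Files.size(file);
        Files.deleteIfExists(file);
        trackedBytes -= size; // refund the quota for the deleted file
      }
      files.clear();
    }
  }

  long trackedBytes() {
    return trackedBytes;
  }
}
```

The key invariant is that tracked bytes return to zero once every spill file, dictionary files included, has been deleted.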

Assert.assertTrue(
"reason should mention disk space",
lastResult.getReason().contains("Not enough disk space")
);
Contributor

Could we maybe add some asserts/another test which verify the statistics on file delete errors? Something that would catch regressions for the bug that was patched in this PR.

Contributor Author

done

pendingDictionaryEntries.addAll(keySerde.getDictionary());
grouper.reset();

final long fileSize = file.length();
Contributor

@jtuglu1 jtuglu1 May 6, 2026

I wonder if there might be an optimization here to reserve entire needed capacity for the array list upfront to prevent it needing to repeatedly double up to sufficient capacity. I think it should noop on subsequent calls too once capacity is allocated. Since we only .clear() and not something like trimToSize() this resizing might just be amortized across all the files, especially if they are of similar size and don't amount to much.

Contributor Author

I don't think it's worth optimizing.

  1. ArrayList.clear() sets size to 0 but keeps the backing array, so after the first batch flushes, the capacity is already there for subsequent batches.
  2. Even on the first batch, the cost is trivial. Say we have ~170 small spills per 1MB batch — that's about 8 grow operations from the default capacity of 10 (ArrayList grows by ~1.5×, not 2×). Each grow copies an array of object references (8 bytes each), so the total copy overhead is on the order of a few KB. Negligible compared to the MB of actual spill data being handled.
  3. Pre-allocating would require estimating how many spills fit in a batch (minSpillFileSize / avgSpillSize), but the spill size isn't known ahead of time — it depends on the data. So the estimate would be a guess anyway.

Contributor

@jtuglu1 jtuglu1 May 6, 2026

That's fine – I mention point #1 already as another reason against this, but I think #2 is really holding the weight here. #3 isn't really too relevant IMO because you can just do pendingSpillRuns.ensureCapacity(fileSize); (with the heuristic that files will be approximately similarly sized).
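For reference, the growth-count arithmetic in point #2 can be checked directly (a standalone illustration, assuming the standard `newCapacity = old + (old >> 1)` policy of `java.util.ArrayList`):

```java
// Illustration of ArrayList's amortized growth: count how many backing-array
// reallocations are needed to reach a target size from the default capacity.
public class ArrayListGrowth {
  static int growsNeeded(int initialCapacity, int targetSize) {
    int capacity = initialCapacity;
    int grows = 0;
    while (capacity < targetSize) {
      capacity = capacity + (capacity >> 1); // ArrayList grows by ~1.5x
      grows++;
    }
    return grows;
  }

  public static void main(String[] args) {
    // ~170 pending spills from the default capacity of 10
    System.out.println(growsNeeded(10, 170)); // 8
  }
}
```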

final Object[] deserializedValues = reusableEntry.getValues();
for (int i = 0; i < deserializedValues.length; i++) {
deserializedValues[i] = aggregatorFactories[i].deserialize(entry.getValues()[i]);
if (deserializedValues[i] instanceof Integer) {
Contributor

@jtuglu1 jtuglu1 May 6, 2026

Curious why are we coercing to long values here?

Contributor Author

@maytasm maytasm May 6, 2026

This is existing code. https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/SpillingGrouper.java#L319
I just refactored it to a common method to avoid repeating code / sharing between flushPendingRunsToDisk() and iterator().

Contributor

@jtuglu1 jtuglu1 left a comment

LGTM once the remaining comments addressed and/or resolved

Contributor

@gianm might be able to comment on why this is not put inside the try as well.

@maytasm
Contributor Author

maytasm commented May 6, 2026

LGTM once the remaining comments addressed and/or resolved

Thanks for the review. All followup comments addressed.

Member

@FrankChen021 FrankChen021 left a comment

I rechecked the follow-up and the latest code addresses the quota issue I raised: deleted staging spill files now refund the LimitedTemporaryStorage byte accounting, temp file naming remains safe after deletes, and the new regression coverage checks tracked bytes against actual disk bytes. No further inline reply from me is needed on that thread.


This is an automated review by Codex GPT-5

@maytasm maytasm merged commit 56a10f5 into apache:master May 7, 2026
63 of 64 checks passed
@maytasm maytasm deleted the spill_file_improvement branch May 7, 2026 17:59
@github-actions github-actions Bot added this to the 38.0.0 milestone May 7, 2026
317brian pushed a commit to 317brian/druid that referenced this pull request May 11, 2026