perf: Optimize SpillingGrouper to avoid unnecessary disk I/O for small spill runs #19439
maytasm wants to merge 5 commits into
Conversation
Contributor
Pull request overview
This PR optimizes GroupBy spilling by introducing an output stream that buffers small spill runs in memory and only creates disk files when the serialized spill exceeds the configured threshold, reducing unnecessary file I/O for small spills.
Changes:
- Adds `SpillOutputStream` to switch from heap buffering to `LimitedTemporaryStorage` only after the threshold is exceeded.
- Refactors `SpillingGrouper` spill serialization to use the new stream while preserving pending-run batching and disk-spill behavior.
- Adds and updates unit tests for in-memory spill handling, threshold behavior, disk fallback, and storage-limit enforcement.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/SpillOutputStream.java | Adds the threshold-aware spill output stream. |
| processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/SpillingGrouper.java | Routes grouper spill serialization through SpillOutputStream. |
| processing/src/test/java/org/apache/druid/query/groupby/epinephelinae/SpillOutputStreamTest.java | Adds unit coverage for the new stream behavior. |
| processing/src/test/java/org/apache/druid/query/groupby/epinephelinae/SpillingGrouperTest.java | Updates spilling tests for in-memory small-spill behavior and storage-limit scenarios. |
Member FrankChen021 left a comment:
I have reviewed the code for correctness, edge cases, concurrency, and integration risks; no issues found.
Reviewed 4 of 4 changed files.
This is an automated review by Codex GPT-5.5
Description
This is a follow-up to #19357.
The spill batching logic (introduced to avoid thousands of tiny disk files) previously had to write to disk first and check the file size afterward, because the serialized size isn't known upfront — and if the serialized data turns out to be large, buffering it entirely in memory before deciding would risk OOM. So the safe path was: always write to a temp file, then read it back into memory only if it was small enough to batch.
This is correct but expensive for certain cases. When groupBy queries produce spill runs whose serialized size is much smaller than their in-memory buffer (e.g., HLL sketches in sparse/SET mode serialize to a fraction of their pre-allocated buffer), this creates thousands of unnecessary file create/write/read/delete cycles just to discover the data was small enough to batch in memory.
SpillOutputStream solves both concerns: it writes to a heap buffer first, and only when the buffer exceeds the threshold does it open a file and flush the accumulated bytes to disk. Large spills still go to disk (no OOM risk), but small spills never touch the filesystem. Peak extra heap is bounded to the threshold size (minSpillFileSize, default 1MB).
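The heap-first, threshold-triggered switch described above can be sketched as a small `OutputStream`. This is a hypothetical simplification for illustration only: `ThresholdOutputStream` and its method names are made up and are not the actual `SpillOutputStream` added by this PR.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/**
 * Illustrative sketch: buffers writes in heap until they exceed a threshold,
 * then opens the file, flushes the accumulated bytes, and streams the rest to disk.
 */
public class ThresholdOutputStream extends OutputStream
{
  private final int threshold;
  private final File fileTarget;
  private ByteArrayOutputStream heapBuffer = new ByteArrayOutputStream();
  private OutputStream fileStream; // non-null once we have spilled to disk

  public ThresholdOutputStream(int threshold, File fileTarget)
  {
    this.threshold = threshold;
    this.fileTarget = fileTarget;
  }

  @Override
  public void write(int b) throws IOException
  {
    if (fileStream == null && heapBuffer.size() + 1 > threshold) {
      switchToFile();
    }
    if (fileStream != null) {
      fileStream.write(b);
    } else {
      heapBuffer.write(b);
    }
  }

  private void switchToFile() throws IOException
  {
    fileStream = new FileOutputStream(fileTarget);
    heapBuffer.writeTo(fileStream); // flush accumulated heap bytes to disk
    heapBuffer = null;              // peak extra heap stays bounded by the threshold
  }

  /** True once the stream has fallen back to disk. */
  public boolean isOnDisk()
  {
    return fileStream != null;
  }

  /** Bytes buffered in heap; only valid while the stream has not spilled to disk. */
  public byte[] heapBytes()
  {
    if (fileStream != null) {
      throw new IllegalStateException("already spilled to disk");
    }
    return heapBuffer.toByteArray();
  }

  @Override
  public void close() throws IOException
  {
    if (fileStream != null) {
      fileStream.close();
    }
  }
}
```

Small spills complete entirely in `heapBytes()` and never touch the filesystem; only once the write volume crosses the threshold does a file get created.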
Benchmark results
Before this PR:
After this PR:
The existing queryMultiQueryableIndexWithSpilling/queryMultiQueryableIndexWithSpillingTTFR benchmarks use bufferGrouperMaxSize=4000, which produces reasonable spills of ~200 KB. I have also run new benchmarks with the same idea but producing spill files at the extreme ends of the size range. queryMultiQueryableIndexWithSmallSpilling/queryMultiQueryableIndexWithSmallSpillingTTFR sets bufferGrouperMaxSize=100, producing spills of ~6 KB; this results in more batching. queryMultiQueryableIndexWithLargeSpilling/queryMultiQueryableIndexWithLargeSpillingTTFR sets bufferGrouperMaxSize=70000, producing spills of ~4 MB; this skips batching. These new benchmarks are not added to the PR since they are essentially the same as queryMultiQueryableIndexWithSpilling/queryMultiQueryableIndexWithSpillingTTFR, just with different config values.
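For illustration, a comparable spill pattern can be exercised on a groupBy query by overriding the grouper size in the query context. This is a sketch: the query below is made up, and only the `bufferGrouperMaxSize` value (mirroring the small-spill benchmark setting above) is the relevant part.

```json
{
  "queryType": "groupBy",
  "dataSource": "example_datasource",
  "granularity": "all",
  "dimensions": ["dimA"],
  "aggregations": [{ "type": "count", "name": "rows" }],
  "intervals": ["2000-01-01/2030-01-01"],
  "context": {
    "bufferGrouperMaxSize": 100
  }
}
```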
Key changed/added classes in this PR
- `SpillingGrouper`
- `SpillOutputStream`

This PR has: