Add spill file count limit for GroupBy query by maytasm · Pull Request #19141 · apache/druid

maytasm · 2026-03-12T07:35:38Z

Add spill file count limit for GroupBy query

Description

GroupBy queries that group on high-cardinality dimensions can create a large number of spill files. This problem is more likely when queries contain many aggregators and/or aggregators with large memory footprints (e.g., DataSketch). This is because GroupBy can only hold a limited number of unique groupings in memory before flushing to disk — the exact limit depends on the size of each row, which is determined by the size of the aggregators. The issue arises when GroupBy attempts to merge all the spill files. Currently, GroupBy merges spill files by opening all of them simultaneously. Opening these files requires memory for objects such as MappingIterator, SmileParser, etc., which can cause historical nodes to OOM.

This PR fixes the issue by introducing a new property: druid.query.groupBy.maxSpillFileCount
The maximum number of spill files allowed per GroupBy query. When the limit is reached, the query fails with a ResourceLimitExceededException. This property can be used to prevent historical nodes from OOMing due to an excessive number of spill files being opened simultaneously during the merge phase. Defaults to Integer.MAX_VALUE (unlimited). Can also be set per query via the query context key maxSpillFileCount.

Note that this new config, maxSpillFileCount, is complementary to the existing maxOnDiskStorage. maxOnDiskStorage limits total bytes across all spill files, but cannot prevent a large number of tiny files — a query can create hundreds of thousands of spill files while staying well under the byte limit. maxSpillFileCount fills this gap by limiting file count directly, which bounds the number of simultaneously open file handles during the merge phase. This situation arises when aggregators like ThetaSketch pre-allocate a large fixed buffer per row in memory (e.g. ~131KB), causing the buffer to flush frequently with only a small number of rows; since each row corresponds to a unique grouping key in a high-cardinality dimension, each sketch has seen very few values at flush time and serializes to only a few bytes on disk using the sketch's compact format.

Release Notes

Added a new GroupBy query configuration property druid.query.groupBy.maxSpillFileCount to limit the maximum number of spill files created per query. When the limit is exceeded, the query fails with a clear error message instead of causing historical nodes to OOM during spill file merging. The limit can also be overridden per query via the query context maxSpillFileCount.

Followup

We can also add maxSpillFileCount to the broker dynamic config
We can fix the root cause, which is the merging of the spill files. Instead of the current single-pass open-all merge, we can do something like a N-way merging with a bounded fan-in.

Key changed/added classes in this PR

processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/LimitedTemporaryStorage.java
processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/SpillingGrouper.java

This PR has:

jtuglu1

LGTM - as we have been running internally in production for a while now, pretty confident this is a good change. Left a few nits RE custom exception (maybe this should extend InternalServerError so we can classify it? I see the other group by exception doesn't do this, but maybe that predates the exception classes) and providing context in docs as to why you would need this vs just the raw byte config.

jtuglu1 · 2026-03-12T08:39:17Z

+  public TemporaryStorageFileLimitException(final int fileCount)
  {
-    return s3Object;
+    super(StringUtils.format("Cannot write to disk, hit file count limit of %,d.", fileCount));


nit: spill file count

jtuglu1 · 2026-03-12T08:40:52Z

 |`druid.query.groupBy.maxSelectorDictionarySize`|Maximum amount of heap space (approximately) to use for per-segment string dictionaries. See [groupBy memory tuning and resource limits](../querying/groupbyquery.md#memory-tuning-and-resource-limits) for details.|100000000|
 |`druid.query.groupBy.maxMergingDictionarySize`|Maximum amount of heap space (approximately) to use for per-query string dictionaries. When the dictionary exceeds this size, a spill to disk will be triggered. See [groupBy memory tuning and resource limits](../querying/groupbyquery.md#memory-tuning-and-resource-limits) for details.|100000000|
 |`druid.query.groupBy.maxOnDiskStorage`|Maximum amount of disk space to use, per-query, for spilling result sets to disk when either the merging buffer or the dictionary fills up. Queries that exceed this limit will fail. Set to zero to disable disk spilling.|0 (disabled)|
+|`druid.query.groupBy.maxSpillFileCount`|Maximum number of spill files allowed per GroupBy query. Queries that exceed this limit will fail.|Integer.MAX_VALUE (unlimited)|


maybe also mention why you might want to use this, either a) to prevent OOM and/or to prevent excessive spill files, while the total spill byte size is under druid.query.groupBy.maxOnDiskStorage.

Im going to say "See groupBy memory tuning and resource limits for details." here and add more details to the memory-tuning-and-resource-limits section (so as to not repeat myself)

maytasm · 2026-03-12T22:46:09Z

maybe this should extend InternalServerError so we can classify it? I see the other group by exception doesn't do this, but maybe that predates the exception classes

That's not needed. SpillingGrouper.aggregate() doesn't throw the Exception. It returns AggregateResult with a failure reason. That will be handled upstream.

maytasm added 3 commits March 12, 2026 00:17

Add spill file count limit for GroupBy query

e01f035

Add spill file count limit for GroupBy query

8d68ea2

Add spill file count limit for GroupBy query

a49ae02

github-actions Bot added the Area - Documentation label Mar 12, 2026

Add spill file count limit for GroupBy query

6937308

jtuglu1 approved these changes Mar 12, 2026

View reviewed changes

maytasm added Area - Querying Release Notes labels Mar 12, 2026

maytasm added 2 commits March 12, 2026 15:59

Add spill file count limit for GroupBy query

e43cbaf

Add spill file count limit for GroupBy query

82da9fb

maytasm merged commit e22986e into apache:master Mar 13, 2026
37 checks passed

github-actions Bot added this to the 37.0.0 milestone Mar 13, 2026

maytasm deleted the spill_file_limit branch March 16, 2026 04:19

maytasm mentioned this pull request May 4, 2026

fix: Optimizes SpillingGrouper for high cardinality dimension(s) GroupBy with large memory footprint aggregators #19357

Merged

10 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add spill file count limit for GroupBy query#19141

Add spill file count limit for GroupBy query#19141
maytasm merged 6 commits into
apache:masterfrom
maytasm:spill_file_limit

maytasm commented Mar 12, 2026 •

edited

Loading

Uh oh!

jtuglu1 left a comment •

edited

Loading

Uh oh!

jtuglu1 Mar 12, 2026

Uh oh!

maytasm Mar 12, 2026

Uh oh!

jtuglu1 Mar 12, 2026 •

edited

Loading

Uh oh!

maytasm Mar 12, 2026 •

edited

Loading

Uh oh!

maytasm commented Mar 12, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

maytasm commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Release Notes

Followup

Key changed/added classes in this PR

Uh oh!

jtuglu1 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jtuglu1 Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

maytasm Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

jtuglu1 Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

maytasm Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

maytasm commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

maytasm commented Mar 12, 2026 •

edited

Loading

jtuglu1 left a comment •

edited

Loading

jtuglu1 Mar 12, 2026 •

edited

Loading

maytasm Mar 12, 2026 •

edited

Loading

maytasm commented Mar 12, 2026 •

edited

Loading