Skip to content

Add spill file count limit for GroupBy query#19141

Merged
maytasm merged 6 commits into
apache:masterfrom
maytasm:spill_file_limit
Mar 13, 2026
Merged

Add spill file count limit for GroupBy query#19141
maytasm merged 6 commits into
apache:masterfrom
maytasm:spill_file_limit

Conversation

@maytasm
Copy link
Copy Markdown
Contributor

@maytasm maytasm commented Mar 12, 2026

Add spill file count limit for GroupBy query

Description

GroupBy queries that group on high-cardinality dimensions can create a large number of spill files. This problem is more likely when queries contain many aggregators and/or aggregators with large memory footprints (e.g., DataSketch). This is because GroupBy can only hold a limited number of unique groupings in memory before flushing to disk — the exact limit depends on the size of each row, which is determined by the size of the aggregators. The issue arises when GroupBy attempts to merge all the spill files. Currently, GroupBy merges spill files by opening all of them simultaneously. Opening these files requires memory for objects such as MappingIterator, SmileParser, etc., which can cause historical nodes to OOM.

This PR fixes the issue by introducing a new property: druid.query.groupBy.maxSpillFileCount
The maximum number of spill files allowed per GroupBy query. When the limit is reached, the query fails with a ResourceLimitExceededException. This property can be used to prevent historical nodes from OOMing due to an excessive number of spill files being opened simultaneously during the merge phase. Defaults to Integer.MAX_VALUE (unlimited). Can also be set per query via the query context key maxSpillFileCount.

Note that this new config, maxSpillFileCount, is complementary to the existing maxOnDiskStorage. maxOnDiskStorage limits total bytes across all spill files, but cannot prevent a large number of tiny files — a query can create hundreds of thousands of spill files while staying well under the byte limit. maxSpillFileCount fills this gap by limiting file count directly, which bounds the number of simultaneously open file handles during the merge phase. This situation arises when aggregators like ThetaSketch pre-allocate a large fixed buffer per row in memory (e.g. ~131KB), causing the buffer to flush frequently with only a small number of rows; since each row corresponds to a unique grouping key in a high-cardinality dimension, each sketch has seen very few values at flush time and serializes to only a few bytes on disk using the sketch's compact format.


Release Notes

  • Added a new GroupBy query configuration property druid.query.groupBy.maxSpillFileCount to limit the maximum number of spill files created per query. When the limit is exceeded, the query fails with a clear error message instead of causing historical nodes to OOM during spill file merging. The limit can also be overridden per query via the query context maxSpillFileCount.

Followup

  • We can also add maxSpillFileCount to the broker dynamic config
  • We can fix the root cause, which is the merging of the spill files. Instead of the current single-pass open-all merge, we can do something like a N-way merging with a bounded fan-in.
Key changed/added classes in this PR
  • processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/LimitedTemporaryStorage.java
  • processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/SpillingGrouper.java

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Copy link
Copy Markdown
Contributor

@jtuglu1 jtuglu1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - as we have been running internally in production for a while now, pretty confident this is a good change. Left a few nits RE custom exception (maybe this should extend InternalServerError so we can classify it? I see the other group by exception doesn't do this, but maybe that predates the exception classes) and providing context in docs as to why you would need this vs just the raw byte config.

public TemporaryStorageFileLimitException(final int fileCount)
{
return s3Object;
super(StringUtils.format("Cannot write to disk, hit file count limit of %,d.", fileCount));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: spill file count

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment thread docs/configuration/index.md Outdated
|`druid.query.groupBy.maxSelectorDictionarySize`|Maximum amount of heap space (approximately) to use for per-segment string dictionaries. See [groupBy memory tuning and resource limits](../querying/groupbyquery.md#memory-tuning-and-resource-limits) for details.|100000000|
|`druid.query.groupBy.maxMergingDictionarySize`|Maximum amount of heap space (approximately) to use for per-query string dictionaries. When the dictionary exceeds this size, a spill to disk will be triggered. See [groupBy memory tuning and resource limits](../querying/groupbyquery.md#memory-tuning-and-resource-limits) for details.|100000000|
|`druid.query.groupBy.maxOnDiskStorage`|Maximum amount of disk space to use, per-query, for spilling result sets to disk when either the merging buffer or the dictionary fills up. Queries that exceed this limit will fail. Set to zero to disable disk spilling.|0 (disabled)|
|`druid.query.groupBy.maxSpillFileCount`|Maximum number of spill files allowed per GroupBy query. Queries that exceed this limit will fail.|Integer.MAX_VALUE (unlimited)|
Copy link
Copy Markdown
Contributor

@jtuglu1 jtuglu1 Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe also mention why you might want to use this, either a) to prevent OOM and/or to prevent excessive spill files, while the total spill byte size is under druid.query.groupBy.maxOnDiskStorage.

Copy link
Copy Markdown
Contributor Author

@maytasm maytasm Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im going to say "See groupBy memory tuning and resource limits for details." here and add more details to the memory-tuning-and-resource-limits section (so as to not repeat myself)

@maytasm
Copy link
Copy Markdown
Contributor Author

maytasm commented Mar 12, 2026

maybe this should extend InternalServerError so we can classify it? I see the other group by exception doesn't do this, but maybe that predates the exception classes

That's not needed. SpillingGrouper.aggregate() doesn't throw the Exception. It returns AggregateResult with a failure reason. That will be handled upstream.

@maytasm maytasm merged commit e22986e into apache:master Mar 13, 2026
37 checks passed
@github-actions github-actions Bot added this to the 37.0.0 milestone Mar 13, 2026
@maytasm maytasm deleted the spill_file_limit branch March 16, 2026 04:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants