Optimize Dictionary groupings#21765
Conversation
|
If your interested in some of the discussions made leading up to this PR please view #21589 |
|
@alamb this PR should resolve the initial regression that was caused (described here) as well as provide performance boost. I would love to hear your thoughts this approach
seeing as how this PR is already large, I think this would be a nice follow up PR since as far as i know this work was never finished since #9017 was closed due to inactivity |
|
Do we have any performance benchmarks for this PR? |
|
(basically it is hard to justify an optimization without benchmark results, even if they need to be run manually at first) |
|
There were also micro benchmarks linked in the previous PR |
837b382 to
6675b68
Compare
|
Removing any extra allocations like .to_vec() and repeated hash look ups into map |
|
speed up : 24 approachnormalize_dict_hash() is the path for arrays with values that arent expected to fit in cpu caches, that gives this implementation a guide but a loose one.
with that being said I think these results show a generally a large improvement over the current approach to dealing with dictionary encoded columns in data-fusion. the update to normalize_dict_hash() is minor compared to the PR as a whole. @alamb |
|
Pre-allocating the values buffer causes a regression |
|
@Rich-T-kid are you sure tpch-1 looks at the data at all in your benchmarks? It looks suspicially fast, I think it might only plan/setup the queries but not look at the data. |
|
reran the benchmarks after generating data. |
|
@adriangb these benchmarks reflect what my design doc mentions the more rows/greater scale the better it should perform. these were all being run on a mac-book maybe running it with the |
e30d6d1 to
5d6a9b0
Compare
|
Are these queries even using dictionary data types? |
|
run benchmarks |
|
Benchmark for this request failed. Last 20 lines of output: Click to expandFile an issue against this benchmark runner |
|
Benchmark for this request failed. Last 20 lines of output: Click to expandFile an issue against this benchmark runner |
1 similar comment
|
Benchmark for this request failed. Last 20 lines of output: Click to expandFile an issue against this benchmark runner |
|
Yea it seems that no benchmarks currently perform a group by using dictionary encodeded columns. I created this PR to address that. @Dandandan could you please take a look? #21860 |
|
run benchmark dictionary_group_values |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing rich-t-kid/optimize-dictionary-grouping (080daa9) to 9b81ff8 (merge-base) diff using: dictionary_group_values File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagedictionary_group_values — base (merge-base)
dictionary_group_values — branch
File an issue against this benchmark runner |
|
@Rich-T-kid we are seeing regression in the high cardinality but low cardinality is excellent |
|
@kumarUjjawal that is expected. In the case where every value is unique we are adding overhead. a regression occurs when the values & keys are 1:1 |
|
This provides a performance increase in every case where dictionary arrays are expected to be used — low to medium cardinality. If the data is extremely high cardinality, dictionary arrays are the wrong data type. I don't think it's possible to get the best of both worlds in this case. |
Is there any way to reduce the overhead so the difference is not as much? |
|
I've experimented with switching between |
|
This looks very cool! I have a background question: how do these dictionary arrays enter the aggregation operator? Do we have a way to read dictionary arrays directly from Parquet, or to preserve them through |
|
When arrow-rs reads RLE-encoded strings from a Parquet file, it decodes them into dictionary arrays. These flow through operators like any other array. For operators like |
080daa9 to
bf62c82
Compare
bf62c82 to
81a5c1b
Compare
|
local results of benchmarks for the 1:1 cardinality case (ran against main) @kumarUjjawal could you run the benchmarks again? |
I think you need to configure the parquet reader with a schema that specifies Dictionary as the data type when reading such a column from parquet files (we use this at InfluxData). However, I don't think there is way to do this via SQL today in DataFusion, though you can do it programmatically (by providing a schema to the table provider) |
|
@alamb could you re-run the micro-benchmarks? |
|
run benchmark dictionary_group_values |
|
Benchmark for this request failed. Last 20 lines of output: Click to expandFile an issue against this benchmark runner |
|
@alamb @adriangbot failed |
|
run benchmark dictionary_group_values |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing rich-t-kid/optimize-dictionary-grouping (d440824) to 96a6096 (merge-base) diff using: dictionary_group_values File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagedictionary_group_values — base (merge-base)
dictionary_group_values — branch
File an issue against this benchmark runner |
|
@kumarUjjawal @alamb with #22078's performance gains, there are now no regressions even in the worst case 🚀 |
|
@alamb This looks like a good improvement to me, what do you think? |
|
FYI @zhuqi-lucas -- perhaps is this part of your work to complete type support of Dictionaries? |
alamb
left a comment
There was a problem hiding this comment.
Thanks for working on this @Rich-T-kid and @kumarUjjawal -- I think this is a valuable direction to be pushing, but I think we should try and get something more holistic in place (e.g. have a plan for nested types in general not just Dictionary)
| .unwrap_or(self.row_buffer.len()); | ||
| Some(&self.row_buffer[start..end]) | ||
| }); | ||
| match &self.value_dt { |
There was a problem hiding this comment.
In general I am worried about needing special code for different element types in the dictionary
In general, this seems like:
- It limits the types of DictionaryArray that can be supported (e.g. this doesn't support Dicts with struct elements)
- It will have substantial amounts of code (b/c each type of dictionary now gets its own branch in this match statement)
What I think @zhuqi-lucas is trying to do as part of
And some related PRs is to make a generic way to handle nested types that doesn't require generating code for all possible type combinations.
There was a problem hiding this comment.
thank you for the review @alamb
I think it makes sense to only support Dictionary<_, Utf8/Utf8View>, this would significantly simplify the code paths while delivering a large performance boost for production-like data shapes (low-cardinality group bys). Similar to other single-column group by specializations, it leaves a viable fallback that loses nothing when the fast path isn't hit.
With that in mind, @zhuqi-lucas's work looks very interesting -- I had a similar idea with #22891 and #21878. The issue was that multi-column group-bys on dictionaries fell through to GroupValuesRows, which was slow. I'm actually working on a multi-column dictionary group by in #22983, but I'm running into similar issues around handling multiple distinct types. For the multi-column case it makes more sense to build on @zhuqi-lucas's work, since the number of types that need to be supported explodes and there are diminishing returns to exploiting dictionary properties across multiple columns and intern calls (key space cache, arc ptr cache, compute keys once and reuse).
As for this PR specifically, I do think it adds significant value. Would restricting the supported value types be a viable path forward? It would keep the type combinations from exploding while still letting the community benefit from the speedup.
| DROP TABLE dict_hash_src; | ||
|
|
||
| query TI | ||
| SELECT |
There was a problem hiding this comment.
there is a bunch of code to handle many more types of dictionaries such as Lists, etc. -- can we please add more tests to cover those cases?
There was a problem hiding this comment.
this PR is already at 1k LOC, I'll open a separate PR







Which issue does this PR close?
This PR make an effort towards #7000 & closing #7647 + #21466
This PR aim to close half of #7647
A separate follow up PR aims to close the multi-column + dictionary column case
Rationale for this change
Issue #7647 (Materialize Dictionaries in Group Keys) identified that DataFusion was not taking advantage of dictionary encoding during hash aggregation, instead of operating on the compact dictionary representation, it was deserializing dictionary arrays into a generic row-based format, throwing away all the encoding benefits. An initial attempt was made to fix this but it caused regressions and was ultimately rolled back.
This PR
This PR takes a different approach. Rather than materializing the dictionary into a generic representation, it introduces GroupValuesDictionary, a specialized implementation that operates directly on the dictionary's structure. The key insight is that dictionary encoded columns have inherently low cardinality by design, meaning the same values repeat frequently across rows. Instead of hashing and comparing every row independently, we maintain a mapping of unique value hashes to group IDs so that repeated values are resolved in O(1) after the first encounter. Null values are similarly cached so that subsequent null rows never pay the lookup cost again. For a more detailed explanation of the implementation see this design doc (needs to be updated but the high level idea is still present).
What changes are included in this PR?
Update the match statement in new_group_values to include a custom dictionary encoding branch that works for single fields that are of type Dictionary array
Are these changes tested?
Yes, about half of the code in this PR is about testing edge cases.
Are there any user-facing changes?
No