Optimize Dictionary groupings by Rich-T-kid · Pull Request #21765 · apache/datafusion

Rich-T-kid · 2026-04-21T14:28:54Z

Which issue does this PR close?

This PR make an effort towards #7000 & closing #7647 + #21466

This PR aim to close half of #7647
A separate follow up PR aims to close the multi-column + dictionary column case

Closes Materialize Dictionaries in Group Keys #7647.

Rationale for this change

Issue #7647 (Materialize Dictionaries in Group Keys) identified that DataFusion was not taking advantage of dictionary encoding during hash aggregation, instead of operating on the compact dictionary representation, it was deserializing dictionary arrays into a generic row-based format, throwing away all the encoding benefits. An initial attempt was made to fix this but it caused regressions and was ultimately rolled back.

This PR

This PR takes a different approach. Rather than materializing the dictionary into a generic representation, it introduces GroupValuesDictionary, a specialized implementation that operates directly on the dictionary's structure. The key insight is that dictionary encoded columns have inherently low cardinality by design, meaning the same values repeat frequently across rows. Instead of hashing and comparing every row independently, we maintain a mapping of unique value hashes to group IDs so that repeated values are resolved in O(1) after the first encounter. Null values are similarly cached so that subsequent null rows never pay the lookup cost again. For a more detailed explanation of the implementation see this design doc (needs to be updated but the high level idea is still present).

What changes are included in this PR?

Update the match statement in new_group_values to include a custom dictionary encoding branch that works for single fields that are of type Dictionary array

Are these changes tested?

Yes, about half of the code in this PR is about testing edge cases.

Are there any user-facing changes?

No

Rich-T-kid · 2026-04-21T14:30:39Z

If your interested in some of the discussions made leading up to this PR please view #21589

Rich-T-kid · 2026-04-21T17:38:52Z

@alamb this PR should resolve the initial regression that was caused (described here) as well as provide performance boost. I would love to hear your thoughts this approach

I think we need to create a benchmark that does aggregation queries on dictionary encoded string columns (I know the existing end to end TPCH, and ClickBench benchmarks do not do this). I will file a ticket shortly about this
source

seeing as how this PR is already large, I think this would be a nice follow up PR since as far as i know this work was never finished since #9017 was closed due to inactivity

alamb · 2026-04-21T18:24:39Z

Do we have any performance benchmarks for this PR?

alamb · 2026-04-21T18:25:49Z

(basically it is hard to justify an optimization without benchmark results, even if they need to be run manually at first)

Rich-T-kid · 2026-04-21T20:40:04Z

There were also micro benchmarks linked in the previous PR
Faster : 25
slower : 17
no-change : 24
I didn't expect to see the tpc-h sf1 have this many regressions. I think I have an idea as to what is causing this. currently working on a fix

Rich-T-kid · 2026-04-22T03:43:29Z

Removing any extra allocations like .to_vec() and repeated hash look ups into map

Rich-T-kid · 2026-04-22T03:57:28Z

speed up : 24
Slower : 6
no change : 36
last optimize that comes to mind is updating normalize_dict_hash() by eliminating per-insert allocations.
Currently, normalize_dict_hash() creates a HashMap<Vec, usize> where each unique key's raw bytes are heap allocated via .to_vec() before insertion. This allocation occurs once per unique value but is unnecessary since the underlying Arrow buffer already owns the bytes.
The plan is to pre-compute a Vec<Option<Cow<[u8]>>> of raw byte slices for all accessed values upfront, allowing the hashmap to store &[u8] references instead of owned Vec keys, eliminating the per-insert allocations entirely.

approach

normalize_dict_hash() is the path for arrays with values that arent expected to fit in cpu caches, that gives this implementation a guide but a loose one.
main three cases

keys outnumber the values & there is some benefit to this pre-allocation step as otherwise we'd be creating exactly d vectors where d is the number of unique elements in the values array that the keys array indexes into. we save by never allocating
the values array has very high cardinality. Here we save alot of compute and space by not creating a new allocation for each value. we only need the &[u8] for comparisons to determine a cache hit we save out by not creating extra vectors that are dropped at the end of the function
Values array has many unreferenced entries (bloated/non-canonicalized dictionary)
would only call get_raw_bytes() for values actually referenced by keys, so a values array with 10,000 entries where keys only reference 50 of them means you only build 50 slices instead of 10,000. The pre-allocation approach wins significantly here since the alternative would have allocated and dropped 50 Vec unnecessarily.
In each case I think its a win, in a perfect world I would benchmark this to see if eliminating the .to_vec() allocations outweighs the cost of the upfront pass over the accessed value indices to build the slice cache. I'll implement both approaches, benchmark them, and push the more optimized version.

with that being said I think these results show a generally a large improvement over the current approach to dealing with dictionary encoded columns in data-fusion. the update to normalize_dict_hash() is minor compared to the PR as a whole. @alamb

Rich-T-kid · 2026-04-22T04:42:58Z

Pre-allocating the values buffer causes a regression

Dandandan · 2026-04-22T08:02:14Z

@Rich-T-kid are you sure tpch-1 looks at the data at all in your benchmarks? It looks suspicially fast, I think it might only plan/setup the queries but not look at the data.
Perhaps the path is wrong or using it from the wrong directory?

Rich-T-kid · 2026-04-22T21:02:06Z

reran the benchmarks after generating data.

Rich-T-kid · 2026-04-22T21:02:37Z

tpch_sft1

Rich-T-kid · 2026-04-22T21:03:20Z

tpch_sf10

Rich-T-kid · 2026-04-22T21:03:36Z

tpch_mem_sf1

Rich-T-kid · 2026-04-22T21:04:06Z

tpch_mem_sf10

Rich-T-kid · 2026-04-22T21:09:00Z

@adriangb these benchmarks reflect what my design doc mentions the more rows/greater scale the better it should perform. these were all being run on a mac-book maybe running it with the run benchmarks command may yield slightly different results

Dandandan · 2026-04-23T04:10:59Z

Are these queries even using dictionary data types?
I think they are string views and primitives

Dandandan · 2026-04-23T04:11:11Z

run benchmarks

adriangbot · 2026-04-23T04:12:02Z

Benchmark for this request failed.

Last 20 lines of output:

Click to expand

 * [new tag]             48.0.0-rc1     -> 48.0.0-rc1
 * [new tag]             48.0.0-rc2     -> 48.0.0-rc2
 * [new tag]             5.0.0          -> 5.0.0
 * [new tag]             5.0.0-rc1      -> 5.0.0-rc1
 * [new tag]             5.0.0-rc3      -> 5.0.0-rc3
 * [new tag]             6.0.0          -> 6.0.0
 * [new tag]             6.0.0-rc0      -> 6.0.0-rc0
 * [new tag]             7.0.0          -> 7.0.0
 * [new tag]             7.0.0-rc2      -> 7.0.0-rc2
 * [new tag]             8.0.0          -> 8.0.0
 * [new tag]             8.0.0-rc1      -> 8.0.0-rc1
 * [new tag]             8.0.0-rc2      -> 8.0.0-rc2
 * [new tag]             9.0.0          -> 9.0.0
 * [new tag]             9.0.0-rc1      -> 9.0.0-rc1
 * [new tag]             ballista-0.5.0 -> ballista-0.5.0
 * [new tag]             ballista-0.6.0 -> ballista-0.6.0
 * [new tag]             ballista-0.7.0 -> ballista-0.7.0
 * [new tag]             python-0.3.0   -> python-0.3.0
 * [new tag]             python-0.4.0   -> python-0.4.0
Switched to branch 'rich-t-kid/optimize-dictionary-grouping'

File an issue against this benchmark runner

Dandandan · 2026-04-23T04:12:23Z

tpch_sf10

Still does not look correct

adriangbot · 2026-04-23T04:12:45Z

Benchmark for this request failed.

Last 20 lines of output:

Click to expand

 * [new tag]             48.0.0-rc1     -> 48.0.0-rc1
 * [new tag]             48.0.0-rc2     -> 48.0.0-rc2
 * [new tag]             5.0.0          -> 5.0.0
 * [new tag]             5.0.0-rc1      -> 5.0.0-rc1
 * [new tag]             5.0.0-rc3      -> 5.0.0-rc3
 * [new tag]             6.0.0          -> 6.0.0
 * [new tag]             6.0.0-rc0      -> 6.0.0-rc0
 * [new tag]             7.0.0          -> 7.0.0
 * [new tag]             7.0.0-rc2      -> 7.0.0-rc2
 * [new tag]             8.0.0          -> 8.0.0
 * [new tag]             8.0.0-rc1      -> 8.0.0-rc1
 * [new tag]             8.0.0-rc2      -> 8.0.0-rc2
 * [new tag]             9.0.0          -> 9.0.0
 * [new tag]             9.0.0-rc1      -> 9.0.0-rc1
 * [new tag]             ballista-0.5.0 -> ballista-0.5.0
 * [new tag]             ballista-0.6.0 -> ballista-0.6.0
 * [new tag]             ballista-0.7.0 -> ballista-0.7.0
 * [new tag]             python-0.3.0   -> python-0.3.0
 * [new tag]             python-0.4.0   -> python-0.4.0
Switched to branch 'rich-t-kid/optimize-dictionary-grouping'

File an issue against this benchmark runner

adriangbot · 2026-04-23T04:13:44Z

Benchmark for this request failed.

Last 20 lines of output:

Click to expand

 * [new tag]             48.0.0-rc1     -> 48.0.0-rc1
 * [new tag]             48.0.0-rc2     -> 48.0.0-rc2
 * [new tag]             5.0.0          -> 5.0.0
 * [new tag]             5.0.0-rc1      -> 5.0.0-rc1
 * [new tag]             5.0.0-rc3      -> 5.0.0-rc3
 * [new tag]             6.0.0          -> 6.0.0
 * [new tag]             6.0.0-rc0      -> 6.0.0-rc0
 * [new tag]             7.0.0          -> 7.0.0
 * [new tag]             7.0.0-rc2      -> 7.0.0-rc2
 * [new tag]             8.0.0          -> 8.0.0
 * [new tag]             8.0.0-rc1      -> 8.0.0-rc1
 * [new tag]             8.0.0-rc2      -> 8.0.0-rc2
 * [new tag]             9.0.0          -> 9.0.0
 * [new tag]             9.0.0-rc1      -> 9.0.0-rc1
 * [new tag]             ballista-0.5.0 -> ballista-0.5.0
 * [new tag]             ballista-0.6.0 -> ballista-0.6.0
 * [new tag]             ballista-0.7.0 -> ballista-0.7.0
 * [new tag]             python-0.3.0   -> python-0.3.0
 * [new tag]             python-0.4.0   -> python-0.4.0
Switched to branch 'rich-t-kid/optimize-dictionary-grouping'

File an issue against this benchmark runner

Rich-T-kid · 2026-04-26T18:54:52Z

Yea it seems that no benchmarks currently perform a group by using dictionary encodeded columns. I created this PR to address that. @Dandandan could you please take a look? #21860

kumarUjjawal · 2026-06-09T17:17:19Z

run benchmark dictionary_group_values

adriangbot · 2026-06-09T17:20:32Z

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4662199962-505-5wtgp 6.12.68+ #1 SMP Sat May 2 07:49:07 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing rich-t-kid/optimize-dictionary-grouping (080daa9) to 9b81ff8 (merge-base) diff using: dictionary_group_values
Results will be posted here when complete

File an issue against this benchmark runner

adriangbot · 2026-06-09T17:38:35Z

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Details

group                                                                             HEAD                                   rich-t-kid_optimize-dictionary-grouping
-----                                                                             ----                                   ---------------------------------------
dict_intern_emit/intern_emit/size_65536_card_1000_null_0.00                       3.39    885.3±8.93µs 70.6 MElem/sec    1.00    261.3±2.07µs 239.2 MElem/sec
dict_intern_emit/intern_emit/size_65536_card_20_null_0.00                         5.81    793.8±6.22µs 78.7 MElem/sec    1.00    136.6±1.03µs 457.6 MElem/sec
dict_intern_emit/intern_emit/size_65536_card_300_null_0.00                        4.83    827.4±8.79µs 75.5 MElem/sec    1.00    171.2±0.82µs 365.0 MElem/sec
dict_intern_emit/intern_emit/size_65536_card_65536_null_0.00                      1.00      6.3±0.01ms 10.0 MElem/sec    2.50     15.6±0.12ms  4.0 MElem/sec
dict_intern_emit/intern_emit/size_65536_card_75_null_0.00                         5.70   817.0±26.55µs 76.5 MElem/sec    1.00    143.4±0.41µs 435.9 MElem/sec
dict_intern_emit/intern_emit/size_8192_card_1000_null_0.00                        1.12    162.2±1.16µs 48.2 MElem/sec    1.00    144.7±2.12µs 54.0 MElem/sec
dict_intern_emit/intern_emit/size_8192_card_20_null_0.00                          5.72   110.4±47.55µs 70.8 MElem/sec    1.00     19.3±0.17µs 404.7 MElem/sec
dict_intern_emit/intern_emit/size_8192_card_300_null_0.00                         2.25    122.8±0.98µs 63.6 MElem/sec    1.00     54.6±0.51µs 143.1 MElem/sec
dict_intern_emit/intern_emit/size_8192_card_75_null_0.00                          4.17    111.3±1.12µs 70.2 MElem/sec    1.00     26.7±0.16µs 292.6 MElem/sec
dict_intern_emit/intern_emit/size_8192_card_8192_null_0.00                        1.00    671.0±2.56µs 11.6 MElem/sec    1.66  1112.2±18.96µs  7.0 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_65536_card_1000_null_0.10     4.54      4.4±0.03ms 56.9 MElem/sec    1.00    967.0±4.56µs 258.5 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_65536_card_20_null_0.10       5.52      4.2±0.03ms 60.1 MElem/sec    1.00    754.1±2.08µs 331.5 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_65536_card_300_null_0.10      5.25      4.2±0.03ms 59.0 MElem/sec    1.00    806.3±1.24µs 310.0 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_65536_card_65536_null_0.10    1.00     17.3±0.07ms 14.5 MElem/sec    1.63     28.2±0.26ms  8.9 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_65536_card_75_null_0.10       5.53      4.2±0.03ms 59.2 MElem/sec    1.00    764.0±1.13µs 327.2 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_8192_card_1000_null_0.10      2.19    611.9±3.09µs 51.1 MElem/sec    1.00    279.8±1.90µs 111.7 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_8192_card_20_null_0.10        5.54    503.4±4.10µs 62.1 MElem/sec    1.00     90.9±0.22µs 343.8 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_8192_card_300_null_0.10       3.75    540.5±5.84µs 57.8 MElem/sec    1.00    144.2±1.07µs 216.7 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_8192_card_75_null_0.10        5.23    518.9±6.50µs 60.2 MElem/sec    1.00     99.2±0.35µs 315.1 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_8192_card_8192_null_0.10      1.00   1478.4±6.62µs 21.1 MElem/sec    1.22  1806.9±20.28µs 17.3 MElem/sec

Resource Usage

dictionary_group_values — base (merge-base)

Metric	Value
Wall time	535.1s
Peak memory	617.0 MiB
Avg memory	74.9 MiB
CPU user	231.1s
CPU sys	15.4s
Peak spill	0 B

dictionary_group_values — branch

Metric	Value
Wall time	535.1s
Peak memory	895.4 MiB
Avg memory	176.9 MiB
CPU user	255.4s
CPU sys	3.9s
Peak spill	0 B

File an issue against this benchmark runner

kumarUjjawal · 2026-06-09T17:45:39Z

@Rich-T-kid we are seeing regression in the high cardinality but low cardinality is excellent

Rich-T-kid · 2026-06-09T17:53:23Z

@kumarUjjawal that is expected. In the case where every value is unique we are adding overhead. a regression occurs when the values & keys are 1:1

Rich-T-kid · 2026-06-09T17:56:46Z

This provides a performance increase in every case where dictionary arrays are expected to be used — low to medium cardinality. If the data is extremely high cardinality, dictionary arrays are the wrong data type. I don't think it's possible to get the best of both worlds in this case.

alamb · 2026-06-10T18:30:11Z

This provides a performance increase in every case where dictionary arrays are expected to be used — low to medium cardinality. If the data is extremely high cardinality, dictionary arrays are the wrong data type. I don't think it's possible to get the best of both worlds in this case.

Is there any way to reduce the overhead so the difference is not as much?

Rich-T-kid · 2026-06-10T18:51:59Z

I've experimented with switching between GroupValuesRows and GroupValuesDictionary at runtime depending on the size of the dictionary values array, but this isn't a good approach. DataFusion is a streaming engine, so the first batch of data we see isn't necessarily an indicator of future batches. If we decide to use GroupValuesRows due to high cardinality in the first incoming RecordBatch and each subsequent batch turns out to be low cardinality, we lose out on the benefits GroupValuesDictionary provides (a possible 2-5x speedup). The inverse is also true.
Switching between the two approaches mid-aggregation wouldn't be effective either. Having to Emit(EmitTo::All) from one approach and then intern() into another would cause more overhead than any savings it could produce.
I think the best case to trust users to use the correct data types to represent their data, with GroupValuesDictionary it give them another option besides GroupValuesRows in case they do have low-medium cardinality.
I'm open to hearing other possible approaches @alamb

2010YOUY01 · 2026-06-11T01:18:36Z

This looks very cool! I have a background question: how do these dictionary arrays enter the aggregation operator?

Do we have a way to read dictionary arrays directly from Parquet, or to preserve them through FilterExec? I might have missed some recent progress here, so I’d appreciate any pointers.

Rich-T-kid · 2026-06-11T03:43:09Z

When arrow-rs reads RLE-encoded strings from a Parquet file, it decodes them into dictionary arrays. These flow through operators like any other array. For operators like AggregateExec, the dictionary is hydrated during grouping: the GroupValuesRows grouping implementation uses RowConverter, which hydrates each row's dictionary key to its pre-encoded value bytes, copying the equivalent of a full expanded array's worth of bytes into the row buffer. On the emit path, the row data must be cast back to a dictionary array. See the blog on the row format for more detail.

Rich-T-kid · 2026-06-16T20:09:02Z

local results of benchmarks for the 1:1 cardinality case (ran against main)

@kumarUjjawal could you run the benchmarks again?

alamb · 2026-06-16T20:34:07Z

When arrow-rs reads RLE-encoded strings from a Parquet file, it decodes them into dictionary arrays.

I think you need to configure the parquet reader with a schema that specifies Dictionary as the data type when reading such a column from parquet files (we use this at InfluxData). However, I don't think there is way to do this via SQL today in DataFusion, though you can do it programmatically (by providing a schema to the table provider)

Rich-T-kid · 2026-06-16T20:41:13Z

@alamb could you re-run the micro-benchmarks?

alamb · 2026-06-16T21:37:50Z

run benchmark dictionary_group_values

adriangbot · 2026-06-16T21:40:46Z

Benchmark for this request failed.

Last 20 lines of output:

Click to expand

 * [new tag]             48.0.0-rc1     -> 48.0.0-rc1
 * [new tag]             48.0.0-rc2     -> 48.0.0-rc2
 * [new tag]             5.0.0          -> 5.0.0
 * [new tag]             5.0.0-rc1      -> 5.0.0-rc1
 * [new tag]             5.0.0-rc3      -> 5.0.0-rc3
 * [new tag]             6.0.0          -> 6.0.0
 * [new tag]             6.0.0-rc0      -> 6.0.0-rc0
 * [new tag]             7.0.0          -> 7.0.0
 * [new tag]             7.0.0-rc2      -> 7.0.0-rc2
 * [new tag]             8.0.0          -> 8.0.0
 * [new tag]             8.0.0-rc1      -> 8.0.0-rc1
 * [new tag]             8.0.0-rc2      -> 8.0.0-rc2
 * [new tag]             9.0.0          -> 9.0.0
 * [new tag]             9.0.0-rc1      -> 9.0.0-rc1
 * [new tag]             ballista-0.5.0 -> ballista-0.5.0
 * [new tag]             ballista-0.6.0 -> ballista-0.6.0
 * [new tag]             ballista-0.7.0 -> ballista-0.7.0
 * [new tag]             python-0.3.0   -> python-0.3.0
 * [new tag]             python-0.4.0   -> python-0.4.0
Switched to branch 'rich-t-kid/optimize-dictionary-grouping'

File an issue against this benchmark runner

Rich-T-kid · 2026-06-16T22:02:50Z

@alamb @adriangbot failed

kumarUjjawal · 2026-06-17T04:51:05Z

run benchmark dictionary_group_values

adriangbot · 2026-06-17T04:58:22Z

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4726105593-579-2xdsl 6.12.68+ #1 SMP Sat May 2 07:49:07 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing rich-t-kid/optimize-dictionary-grouping (d440824) to 96a6096 (merge-base) diff using: dictionary_group_values
Results will be posted here when complete

File an issue against this benchmark runner

adriangbot · 2026-06-17T05:13:14Z

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Details

group                                                                             HEAD                                   rich-t-kid_optimize-dictionary-grouping
-----                                                                             ----                                   ---------------------------------------
dict_intern_emit/intern_emit/size_65536_card_1000_null_0.00                       5.04    875.0±5.77µs 71.4 MElem/sec    1.00    173.7±1.01µs 359.8 MElem/sec
dict_intern_emit/intern_emit/size_65536_card_20_null_0.00                         5.78    788.0±5.09µs 79.3 MElem/sec    1.00    136.3±0.67µs 458.5 MElem/sec
dict_intern_emit/intern_emit/size_65536_card_300_null_0.00                        5.54    817.0±5.09µs 76.5 MElem/sec    1.00    147.4±0.45µs 424.0 MElem/sec
dict_intern_emit/intern_emit/size_65536_card_65536_null_0.00                      1.54      6.3±0.01ms  9.9 MElem/sec    1.00      4.1±0.03ms 15.3 MElem/sec
dict_intern_emit/intern_emit/size_65536_card_75_null_0.00                         5.83    806.4±6.12µs 77.5 MElem/sec    1.00    138.4±0.50µs 451.5 MElem/sec
dict_intern_emit/intern_emit/size_8192_card_1000_null_0.00                        2.91    161.6±1.17µs 48.3 MElem/sec    1.00     55.6±0.39µs 140.6 MElem/sec
dict_intern_emit/intern_emit/size_8192_card_20_null_0.00                          5.59    105.0±1.02µs 74.4 MElem/sec    1.00     18.8±0.14µs 416.3 MElem/sec
dict_intern_emit/intern_emit/size_8192_card_300_null_0.00                         4.14    122.7±1.44µs 63.7 MElem/sec    1.00     29.7±0.22µs 263.3 MElem/sec
dict_intern_emit/intern_emit/size_8192_card_75_null_0.00                          5.32    111.4±1.35µs 70.1 MElem/sec    1.00     20.9±0.21µs 373.1 MElem/sec
dict_intern_emit/intern_emit/size_8192_card_8192_null_0.00                        1.89    667.5±2.09µs 11.7 MElem/sec    1.00    353.7±4.95µs 22.1 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_65536_card_1000_null_0.10     5.22      4.3±0.03ms 57.6 MElem/sec    1.00    832.5±3.83µs 300.3 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_65536_card_20_null_0.10       5.67      4.1±0.03ms 60.3 MElem/sec    1.00    730.8±4.59µs 342.1 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_65536_card_300_null_0.10      5.55      4.2±0.03ms 59.3 MElem/sec    1.00    760.2±4.13µs 328.9 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_65536_card_65536_null_0.10    1.13     17.2±0.05ms 14.5 MElem/sec    1.00     15.2±0.08ms 16.5 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_65536_card_75_null_0.10       5.70      4.2±0.02ms 59.3 MElem/sec    1.00    739.7±4.35µs 338.0 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_8192_card_1000_null_0.10      3.23    605.4±3.21µs 51.6 MElem/sec    1.00    187.4±1.25µs 166.7 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_8192_card_20_null_0.10        5.42    504.4±3.93µs 62.0 MElem/sec    1.00     93.0±0.34µs 336.1 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_8192_card_300_null_0.10       4.61    536.1±3.02µs 58.3 MElem/sec    1.00    116.4±0.70µs 268.5 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_8192_card_75_null_0.10        5.49    516.1±3.25µs 60.5 MElem/sec    1.00     94.0±0.44µs 332.6 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_8192_card_8192_null_0.10      1.67   1472.0±7.21µs 21.2 MElem/sec    1.00   881.4±13.47µs 35.5 MElem/sec

Resource Usage

dictionary_group_values — base (merge-base)

Metric	Value
Wall time	450.1s
Peak memory	598.7 MiB
Avg memory	83.5 MiB
CPU user	230.5s
CPU sys	15.1s
Peak spill	0 B

dictionary_group_values — branch

Metric	Value
Wall time	430.1s
Peak memory	856.6 MiB
Avg memory	50.5 MiB
CPU user	214.0s
CPU sys	19.4s
Peak spill	0 B

File an issue against this benchmark runner

Rich-T-kid · 2026-06-17T12:46:07Z

@kumarUjjawal @alamb with #22078's performance gains, there are now no regressions even in the worst case 🚀

Rich-T-kid · 2026-06-17T14:01:41Z

re-ran the E2E benchmarks as well in benchmarks/src/dict.rs

kumarUjjawal · 2026-06-18T13:53:01Z

@alamb This looks like a good improvement to me, what do you think?

alamb · 2026-06-22T17:53:12Z

FYI @zhuqi-lucas -- perhaps is this part of your work to complete type support of Dictionaries?

EPIC: complete GroupValuesColumn type coverage (nested types + remaining primitives) #22715

alamb

Thanks for working on this @Rich-T-kid and @kumarUjjawal -- I think this is a valuable direction to be pushing, but I think we should try and get something more holistic in place (e.g. have a plan for nested types in general not just Dictionary)

alamb · 2026-06-22T17:58:29Z

+                .unwrap_or(self.row_buffer.len());
+            Some(&self.row_buffer[start..end])
+        });
+        match &self.value_dt {


In general I am worried about needing special code for different element types in the dictionary

In general, this seems like:

It limits the types of DictionaryArray that can be supported (e.g. this doesn't support Dicts with struct elements)

It will have substantial amounts of code (b/c each type of dictionary now gets its own branch in this match statement)

What I think @zhuqi-lucas is trying to do as part of

feat: add GroupColumn support for List / Struct / Map / FixedSizeList in multi-column GROUP BY #22682

And some related PRs is to make a generic way to handle nested types that doesn't require generating code for all possible type combinations.

thank you for the review @alamb

I think it makes sense to only support Dictionary<_, Utf8/Utf8View>, this would significantly simplify the code paths while delivering a large performance boost for production-like data shapes (low-cardinality group bys). Similar to other single-column group by specializations, it leaves a viable fallback that loses nothing when the fast path isn't hit.

With that in mind, @zhuqi-lucas's work looks very interesting -- I had a similar idea with #22891 and #21878. The issue was that multi-column group-bys on dictionaries fell through to GroupValuesRows, which was slow. I'm actually working on a multi-column dictionary group by in #22983, but I'm running into similar issues around handling multiple distinct types. For the multi-column case it makes more sense to build on @zhuqi-lucas's work, since the number of types that need to be supported explodes and there are diminishing returns to exploiting dictionary properties across multiple columns and intern calls (key space cache, arc ptr cache, compute keys once and reuse).

As for this PR specifically, I do think it adds significant value. Would restricting the supported value types be a viable path forward? It would keep the type combinations from exploding while still letting the community benefit from the speedup.

alamb · 2026-06-22T17:59:05Z

 DROP TABLE dict_hash_src;
+
+query TI
+SELECT


there is a bunch of code to handle many more types of dictionaries such as Lists, etc. -- can we please add more tests to cover those cases?

this PR is already at 1k LOC, I'll open a separate PR

Rich-T-kid mentioned this pull request Apr 21, 2026

Rich t kid/dictionary encoding hash optmize #21589

Closed

github-actions Bot added the physical-plan Changes to the physical-plan crate label Apr 21, 2026

Rich-T-kid force-pushed the rich-t-kid/optimize-dictionary-grouping branch 2 times, most recently from 837b382 to 6675b68 Compare April 22, 2026 03:19

Rich-T-kid force-pushed the rich-t-kid/optimize-dictionary-grouping branch 2 times, most recently from e30d6d1 to 5d6a9b0 Compare April 23, 2026 02:54

Rich-T-kid mentioned this pull request Apr 26, 2026

Rich t kid/introduce dict benchmarks #21860

Merged

Rich-T-kid mentioned this pull request Apr 27, 2026

Optimize Hash Aggregation for Multiple Dictionary-Encoded Group Keys #21878

Open

This was referenced Jun 10, 2026

Physical planning: cast low-medium cardinality columns to dictionary arrays before aggregation #22891

Open

multi dictionary benchmarks #22859

Open

Rich-T-kid force-pushed the rich-t-kid/optimize-dictionary-grouping branch from 080daa9 to bf62c82 Compare June 16, 2026 19:56

replace multiple buffers with single flat buffer. closes apache#22078

81a5c1b

Rich-T-kid force-pushed the rich-t-kid/optimize-dictionary-grouping branch from bf62c82 to 81a5c1b Compare June 16, 2026 20:03

Merge branch 'main' into rich-t-kid/optimize-dictionary-grouping

d440824

alamb reviewed Jun 22, 2026

View reviewed changes

Conversation

Rich-T-kid commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

This PR

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

Rich-T-kid commented Apr 21, 2026

Uh oh!

Rich-T-kid commented Apr 21, 2026

Uh oh!

alamb commented Apr 21, 2026

Uh oh!

alamb commented Apr 21, 2026

Uh oh!

Rich-T-kid commented Apr 21, 2026

Uh oh!

Rich-T-kid commented Apr 22, 2026

Uh oh!

Rich-T-kid commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

approach

Uh oh!

Rich-T-kid commented Apr 22, 2026

Uh oh!

Dandandan commented Apr 22, 2026

Uh oh!

Rich-T-kid commented Apr 22, 2026

Uh oh!

Rich-T-kid commented Apr 22, 2026

tpch_sft1

Uh oh!

Rich-T-kid commented Apr 22, 2026

tpch_sf10

Uh oh!

Rich-T-kid commented Apr 22, 2026

tpch_mem_sf1

Uh oh!

Rich-T-kid commented Apr 22, 2026

tpch_mem_sf10

Uh oh!

Rich-T-kid commented Apr 22, 2026

Uh oh!

Dandandan commented Apr 23, 2026

Uh oh!

Dandandan commented Apr 23, 2026

Uh oh!

adriangbot commented Apr 23, 2026

Uh oh!

Dandandan commented Apr 23, 2026

tpch_sf10

Uh oh!

adriangbot commented Apr 23, 2026

Uh oh!

adriangbot commented Apr 23, 2026

Uh oh!

Rich-T-kid commented Apr 26, 2026

Uh oh!

kumarUjjawal commented Jun 9, 2026

Uh oh!

adriangbot commented Jun 9, 2026

Uh oh!

adriangbot commented Jun 9, 2026

Uh oh!

kumarUjjawal commented Jun 9, 2026

Uh oh!

Rich-T-kid commented Jun 9, 2026

Uh oh!

Rich-T-kid commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alamb commented Jun 10, 2026

Uh oh!

Rich-T-kid commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

2010YOUY01 commented Jun 11, 2026

Rich-T-kid commented Apr 21, 2026 •

edited

Loading

Rich-T-kid commented Apr 22, 2026 •

edited

Loading

Rich-T-kid commented Jun 9, 2026 •

edited

Loading

Rich-T-kid commented Jun 10, 2026 •

edited

Loading

Rich-T-kid commented Jun 11, 2026 •

edited

Loading

Rich-T-kid commented Jun 16, 2026 •

edited

Loading