Skip to content

Optimize Dictionary groupings#21765

Open
Rich-T-kid wants to merge 8 commits into
apache:mainfrom
Rich-T-kid:rich-t-kid/optimize-dictionary-grouping
Open

Optimize Dictionary groupings#21765
Rich-T-kid wants to merge 8 commits into
apache:mainfrom
Rich-T-kid:rich-t-kid/optimize-dictionary-grouping

Conversation

@Rich-T-kid

@Rich-T-kid Rich-T-kid commented Apr 21, 2026

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

This PR make an effort towards #7000 & closing #7647 + #21466

This PR aim to close half of #7647
A separate follow up PR aims to close the multi-column + dictionary column case

Rationale for this change

Issue #7647 (Materialize Dictionaries in Group Keys) identified that DataFusion was not taking advantage of dictionary encoding during hash aggregation, instead of operating on the compact dictionary representation, it was deserializing dictionary arrays into a generic row-based format, throwing away all the encoding benefits. An initial attempt was made to fix this but it caused regressions and was ultimately rolled back.

This PR

This PR takes a different approach. Rather than materializing the dictionary into a generic representation, it introduces GroupValuesDictionary, a specialized implementation that operates directly on the dictionary's structure. The key insight is that dictionary encoded columns have inherently low cardinality by design, meaning the same values repeat frequently across rows. Instead of hashing and comparing every row independently, we maintain a mapping of unique value hashes to group IDs so that repeated values are resolved in O(1) after the first encounter. Null values are similarly cached so that subsequent null rows never pay the lookup cost again. For a more detailed explanation of the implementation see this design doc (needs to be updated but the high level idea is still present).

Image 5-18-26 at 2 44 PM

What changes are included in this PR?

Update the match statement in new_group_values to include a custom dictionary encoding branch that works for single fields that are of type Dictionary array

Are these changes tested?

Yes, about half of the code in this PR is about testing edge cases.

Are there any user-facing changes?

No

@github-actions github-actions Bot added the physical-plan Changes to the physical-plan crate label Apr 21, 2026
@Rich-T-kid

Copy link
Copy Markdown
Contributor Author

If your interested in some of the discussions made leading up to this PR please view #21589

@Rich-T-kid

Copy link
Copy Markdown
Contributor Author

@alamb this PR should resolve the initial regression that was caused (described here) as well as provide performance boost. I would love to hear your thoughts this approach

I think we need to create a benchmark that does aggregation queries on dictionary encoded string columns (I know the existing end to end TPCH, and ClickBench benchmarks do not do this). I will file a ticket shortly about this
source

seeing as how this PR is already large, I think this would be a nice follow up PR since as far as i know this work was never finished since #9017 was closed due to inactivity

@alamb

alamb commented Apr 21, 2026

Copy link
Copy Markdown
Contributor

Do we have any performance benchmarks for this PR?

@alamb

alamb commented Apr 21, 2026

Copy link
Copy Markdown
Contributor

(basically it is hard to justify an optimization without benchmark results, even if they need to be run manually at first)

@Rich-T-kid

Copy link
Copy Markdown
Contributor Author

There were also micro benchmarks linked in the previous PR
Faster : 25
slower : 17
no-change : 24
I didn't expect to see the tpc-h sf1 have this many regressions. I think I have an idea as to what is causing this. currently working on a fix

@Rich-T-kid Rich-T-kid force-pushed the rich-t-kid/optimize-dictionary-grouping branch 2 times, most recently from 837b382 to 6675b68 Compare April 22, 2026 03:19
@Rich-T-kid

Copy link
Copy Markdown
Contributor Author

Removing any extra allocations like .to_vec() and repeated hash look ups into map

@Rich-T-kid

Rich-T-kid commented Apr 22, 2026

Copy link
Copy Markdown
Contributor Author

speed up : 24
Slower : 6
no change : 36
last optimize that comes to mind is updating normalize_dict_hash() by eliminating per-insert allocations.
Currently, normalize_dict_hash() creates a HashMap<Vec, usize> where each unique key's raw bytes are heap allocated via .to_vec() before insertion. This allocation occurs once per unique value but is unnecessary since the underlying Arrow buffer already owns the bytes.
The plan is to pre-compute a Vec<Option<Cow<[u8]>>> of raw byte slices for all accessed values upfront, allowing the hashmap to store &[u8] references instead of owned Vec keys, eliminating the per-insert allocations entirely.

approach

normalize_dict_hash() is the path for arrays with values that arent expected to fit in cpu caches, that gives this implementation a guide but a loose one.
main three cases

  1. keys outnumber the values & there is some benefit to this pre-allocation step as otherwise we'd be creating exactly d vectors where d is the number of unique elements in the values array that the keys array indexes into. we save by never allocating
  2. the values array has very high cardinality. Here we save alot of compute and space by not creating a new allocation for each value. we only need the &[u8] for comparisons to determine a cache hit we save out by not creating extra vectors that are dropped at the end of the function
  3. Values array has many unreferenced entries (bloated/non-canonicalized dictionary)
    would only call get_raw_bytes() for values actually referenced by keys, so a values array with 10,000 entries where keys only reference 50 of them means you only build 50 slices instead of 10,000. The pre-allocation approach wins significantly here since the alternative would have allocated and dropped 50 Vec unnecessarily.
    In each case I think its a win, in a perfect world I would benchmark this to see if eliminating the .to_vec() allocations outweighs the cost of the upfront pass over the accessed value indices to build the slice cache. I'll implement both approaches, benchmark them, and push the more optimized version.

with that being said I think these results show a generally a large improvement over the current approach to dealing with dictionary encoded columns in data-fusion. the update to normalize_dict_hash() is minor compared to the PR as a whole. @alamb

@Rich-T-kid

Copy link
Copy Markdown
Contributor Author

Pre-allocating the values buffer causes a regression

@Dandandan

Copy link
Copy Markdown
Contributor

@Rich-T-kid are you sure tpch-1 looks at the data at all in your benchmarks? It looks suspicially fast, I think it might only plan/setup the queries but not look at the data.
Perhaps the path is wrong or using it from the wrong directory?

@Rich-T-kid

Copy link
Copy Markdown
Contributor Author

reran the benchmarks after generating data.

@Rich-T-kid

Copy link
Copy Markdown
Contributor Author

tpch_sft1

Image 4-22-26 at 5 02 PM

@Rich-T-kid

Copy link
Copy Markdown
Contributor Author

tpch_sf10

Image 4-22-26 at 5 02 PM

@Rich-T-kid

Copy link
Copy Markdown
Contributor Author

tpch_mem_sf1

Image 4-22-26 at 5 03 PM

@Rich-T-kid

Copy link
Copy Markdown
Contributor Author

tpch_mem_sf10

Image 4-22-26 at 5 03 PM

@Rich-T-kid

Copy link
Copy Markdown
Contributor Author

@adriangb these benchmarks reflect what my design doc mentions the more rows/greater scale the better it should perform. these were all being run on a mac-book maybe running it with the run benchmarks command may yield slightly different results

@Rich-T-kid Rich-T-kid force-pushed the rich-t-kid/optimize-dictionary-grouping branch 2 times, most recently from e30d6d1 to 5d6a9b0 Compare April 23, 2026 02:54
@Dandandan

Copy link
Copy Markdown
Contributor

Are these queries even using dictionary data types?
I think they are string views and primitives

@Dandandan

Copy link
Copy Markdown
Contributor

run benchmarks

@adriangbot

Copy link
Copy Markdown

Benchmark for this request failed.

Last 20 lines of output:

Click to expand
 * [new tag]             48.0.0-rc1     -> 48.0.0-rc1
 * [new tag]             48.0.0-rc2     -> 48.0.0-rc2
 * [new tag]             5.0.0          -> 5.0.0
 * [new tag]             5.0.0-rc1      -> 5.0.0-rc1
 * [new tag]             5.0.0-rc3      -> 5.0.0-rc3
 * [new tag]             6.0.0          -> 6.0.0
 * [new tag]             6.0.0-rc0      -> 6.0.0-rc0
 * [new tag]             7.0.0          -> 7.0.0
 * [new tag]             7.0.0-rc2      -> 7.0.0-rc2
 * [new tag]             8.0.0          -> 8.0.0
 * [new tag]             8.0.0-rc1      -> 8.0.0-rc1
 * [new tag]             8.0.0-rc2      -> 8.0.0-rc2
 * [new tag]             9.0.0          -> 9.0.0
 * [new tag]             9.0.0-rc1      -> 9.0.0-rc1
 * [new tag]             ballista-0.5.0 -> ballista-0.5.0
 * [new tag]             ballista-0.6.0 -> ballista-0.6.0
 * [new tag]             ballista-0.7.0 -> ballista-0.7.0
 * [new tag]             python-0.3.0   -> python-0.3.0
 * [new tag]             python-0.4.0   -> python-0.4.0
Switched to branch 'rich-t-kid/optimize-dictionary-grouping'

File an issue against this benchmark runner

@Dandandan

Copy link
Copy Markdown
Contributor

tpch_sf10

Image 4-22-26 at 5 02 PM

Still does not look correct

@adriangbot

Copy link
Copy Markdown

Benchmark for this request failed.

Last 20 lines of output:

Click to expand
 * [new tag]             48.0.0-rc1     -> 48.0.0-rc1
 * [new tag]             48.0.0-rc2     -> 48.0.0-rc2
 * [new tag]             5.0.0          -> 5.0.0
 * [new tag]             5.0.0-rc1      -> 5.0.0-rc1
 * [new tag]             5.0.0-rc3      -> 5.0.0-rc3
 * [new tag]             6.0.0          -> 6.0.0
 * [new tag]             6.0.0-rc0      -> 6.0.0-rc0
 * [new tag]             7.0.0          -> 7.0.0
 * [new tag]             7.0.0-rc2      -> 7.0.0-rc2
 * [new tag]             8.0.0          -> 8.0.0
 * [new tag]             8.0.0-rc1      -> 8.0.0-rc1
 * [new tag]             8.0.0-rc2      -> 8.0.0-rc2
 * [new tag]             9.0.0          -> 9.0.0
 * [new tag]             9.0.0-rc1      -> 9.0.0-rc1
 * [new tag]             ballista-0.5.0 -> ballista-0.5.0
 * [new tag]             ballista-0.6.0 -> ballista-0.6.0
 * [new tag]             ballista-0.7.0 -> ballista-0.7.0
 * [new tag]             python-0.3.0   -> python-0.3.0
 * [new tag]             python-0.4.0   -> python-0.4.0
Switched to branch 'rich-t-kid/optimize-dictionary-grouping'

File an issue against this benchmark runner

1 similar comment
@adriangbot

Copy link
Copy Markdown

Benchmark for this request failed.

Last 20 lines of output:

Click to expand
 * [new tag]             48.0.0-rc1     -> 48.0.0-rc1
 * [new tag]             48.0.0-rc2     -> 48.0.0-rc2
 * [new tag]             5.0.0          -> 5.0.0
 * [new tag]             5.0.0-rc1      -> 5.0.0-rc1
 * [new tag]             5.0.0-rc3      -> 5.0.0-rc3
 * [new tag]             6.0.0          -> 6.0.0
 * [new tag]             6.0.0-rc0      -> 6.0.0-rc0
 * [new tag]             7.0.0          -> 7.0.0
 * [new tag]             7.0.0-rc2      -> 7.0.0-rc2
 * [new tag]             8.0.0          -> 8.0.0
 * [new tag]             8.0.0-rc1      -> 8.0.0-rc1
 * [new tag]             8.0.0-rc2      -> 8.0.0-rc2
 * [new tag]             9.0.0          -> 9.0.0
 * [new tag]             9.0.0-rc1      -> 9.0.0-rc1
 * [new tag]             ballista-0.5.0 -> ballista-0.5.0
 * [new tag]             ballista-0.6.0 -> ballista-0.6.0
 * [new tag]             ballista-0.7.0 -> ballista-0.7.0
 * [new tag]             python-0.3.0   -> python-0.3.0
 * [new tag]             python-0.4.0   -> python-0.4.0
Switched to branch 'rich-t-kid/optimize-dictionary-grouping'

File an issue against this benchmark runner

@Rich-T-kid

Copy link
Copy Markdown
Contributor Author

Yea it seems that no benchmarks currently perform a group by using dictionary encodeded columns. I created this PR to address that. @Dandandan could you please take a look? #21860

@kumarUjjawal

Copy link
Copy Markdown
Contributor

run benchmark dictionary_group_values

@adriangbot

Copy link
Copy Markdown

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4662199962-505-5wtgp 6.12.68+ #1 SMP Sat May 2 07:49:07 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing rich-t-kid/optimize-dictionary-grouping (080daa9) to 9b81ff8 (merge-base) diff using: dictionary_group_values
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot

Copy link
Copy Markdown

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected
Details

group                                                                             HEAD                                   rich-t-kid_optimize-dictionary-grouping
-----                                                                             ----                                   ---------------------------------------
dict_intern_emit/intern_emit/size_65536_card_1000_null_0.00                       3.39    885.3±8.93µs 70.6 MElem/sec    1.00    261.3±2.07µs 239.2 MElem/sec
dict_intern_emit/intern_emit/size_65536_card_20_null_0.00                         5.81    793.8±6.22µs 78.7 MElem/sec    1.00    136.6±1.03µs 457.6 MElem/sec
dict_intern_emit/intern_emit/size_65536_card_300_null_0.00                        4.83    827.4±8.79µs 75.5 MElem/sec    1.00    171.2±0.82µs 365.0 MElem/sec
dict_intern_emit/intern_emit/size_65536_card_65536_null_0.00                      1.00      6.3±0.01ms 10.0 MElem/sec    2.50     15.6±0.12ms  4.0 MElem/sec
dict_intern_emit/intern_emit/size_65536_card_75_null_0.00                         5.70   817.0±26.55µs 76.5 MElem/sec    1.00    143.4±0.41µs 435.9 MElem/sec
dict_intern_emit/intern_emit/size_8192_card_1000_null_0.00                        1.12    162.2±1.16µs 48.2 MElem/sec    1.00    144.7±2.12µs 54.0 MElem/sec
dict_intern_emit/intern_emit/size_8192_card_20_null_0.00                          5.72   110.4±47.55µs 70.8 MElem/sec    1.00     19.3±0.17µs 404.7 MElem/sec
dict_intern_emit/intern_emit/size_8192_card_300_null_0.00                         2.25    122.8±0.98µs 63.6 MElem/sec    1.00     54.6±0.51µs 143.1 MElem/sec
dict_intern_emit/intern_emit/size_8192_card_75_null_0.00                          4.17    111.3±1.12µs 70.2 MElem/sec    1.00     26.7±0.16µs 292.6 MElem/sec
dict_intern_emit/intern_emit/size_8192_card_8192_null_0.00                        1.00    671.0±2.56µs 11.6 MElem/sec    1.66  1112.2±18.96µs  7.0 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_65536_card_1000_null_0.10     4.54      4.4±0.03ms 56.9 MElem/sec    1.00    967.0±4.56µs 258.5 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_65536_card_20_null_0.10       5.52      4.2±0.03ms 60.1 MElem/sec    1.00    754.1±2.08µs 331.5 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_65536_card_300_null_0.10      5.25      4.2±0.03ms 59.0 MElem/sec    1.00    806.3±1.24µs 310.0 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_65536_card_65536_null_0.10    1.00     17.3±0.07ms 14.5 MElem/sec    1.63     28.2±0.26ms  8.9 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_65536_card_75_null_0.10       5.53      4.2±0.03ms 59.2 MElem/sec    1.00    764.0±1.13µs 327.2 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_8192_card_1000_null_0.10      2.19    611.9±3.09µs 51.1 MElem/sec    1.00    279.8±1.90µs 111.7 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_8192_card_20_null_0.10        5.54    503.4±4.10µs 62.1 MElem/sec    1.00     90.9±0.22µs 343.8 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_8192_card_300_null_0.10       3.75    540.5±5.84µs 57.8 MElem/sec    1.00    144.2±1.07µs 216.7 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_8192_card_75_null_0.10        5.23    518.9±6.50µs 60.2 MElem/sec    1.00     99.2±0.35µs 315.1 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_8192_card_8192_null_0.10      1.00   1478.4±6.62µs 21.1 MElem/sec    1.22  1806.9±20.28µs 17.3 MElem/sec

Resource Usage

dictionary_group_values — base (merge-base)

Metric Value
Wall time 535.1s
Peak memory 617.0 MiB
Avg memory 74.9 MiB
CPU user 231.1s
CPU sys 15.4s
Peak spill 0 B

dictionary_group_values — branch

Metric Value
Wall time 535.1s
Peak memory 895.4 MiB
Avg memory 176.9 MiB
CPU user 255.4s
CPU sys 3.9s
Peak spill 0 B

File an issue against this benchmark runner

@kumarUjjawal

Copy link
Copy Markdown
Contributor

@Rich-T-kid we are seeing regression in the high cardinality but low cardinality is excellent

@Rich-T-kid

Copy link
Copy Markdown
Contributor Author

@kumarUjjawal that is expected. In the case where every value is unique we are adding overhead. a regression occurs when the values & keys are 1:1

@Rich-T-kid

Rich-T-kid commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

This provides a performance increase in every case where dictionary arrays are expected to be used — low to medium cardinality. If the data is extremely high cardinality, dictionary arrays are the wrong data type. I don't think it's possible to get the best of both worlds in this case.

@alamb

alamb commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

This provides a performance increase in every case where dictionary arrays are expected to be used — low to medium cardinality. If the data is extremely high cardinality, dictionary arrays are the wrong data type. I don't think it's possible to get the best of both worlds in this case.

Is there any way to reduce the overhead so the difference is not as much?

@Rich-T-kid

Rich-T-kid commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

I've experimented with switching between GroupValuesRows and GroupValuesDictionary at runtime depending on the size of the dictionary values array, but this isn't a good approach. DataFusion is a streaming engine, so the first batch of data we see isn't necessarily an indicator of future batches. If we decide to use GroupValuesRows due to high cardinality in the first incoming RecordBatch and each subsequent batch turns out to be low cardinality, we lose out on the benefits GroupValuesDictionary provides (a possible 2-5x speedup). The inverse is also true.
Switching between the two approaches mid-aggregation wouldn't be effective either. Having to Emit(EmitTo::All) from one approach and then intern() into another would cause more overhead than any savings it could produce.
I think the best case to trust users to use the correct data types to represent their data, with GroupValuesDictionary it give them another option besides GroupValuesRows in case they do have low-medium cardinality.
I'm open to hearing other possible approaches @alamb

@2010YOUY01

Copy link
Copy Markdown
Contributor

This looks very cool! I have a background question: how do these dictionary arrays enter the aggregation operator?

Do we have a way to read dictionary arrays directly from Parquet, or to preserve them through FilterExec? I might have missed some recent progress here, so I’d appreciate any pointers.

@Rich-T-kid

Rich-T-kid commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

When arrow-rs reads RLE-encoded strings from a Parquet file, it decodes them into dictionary arrays. These flow through operators like any other array. For operators like AggregateExec, the dictionary is hydrated during grouping: the GroupValuesRows grouping implementation uses RowConverter, which hydrates each row's dictionary key to its pre-encoded value bytes, copying the equivalent of a full expanded array's worth of bytes into the row buffer. On the emit path, the row data must be cast back to a dictionary array. See the blog on the row format for more detail.

@Rich-T-kid Rich-T-kid force-pushed the rich-t-kid/optimize-dictionary-grouping branch from 080daa9 to bf62c82 Compare June 16, 2026 19:56
@Rich-T-kid Rich-T-kid force-pushed the rich-t-kid/optimize-dictionary-grouping branch from bf62c82 to 81a5c1b Compare June 16, 2026 20:03
@Rich-T-kid

Rich-T-kid commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

local results of benchmarks for the 1:1 cardinality case (ran against main)
Image 6-16-26 at 4 06 PM

@kumarUjjawal could you run the benchmarks again?

@alamb

alamb commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

When arrow-rs reads RLE-encoded strings from a Parquet file, it decodes them into dictionary arrays.

I think you need to configure the parquet reader with a schema that specifies Dictionary as the data type when reading such a column from parquet files (we use this at InfluxData). However, I don't think there is way to do this via SQL today in DataFusion, though you can do it programmatically (by providing a schema to the table provider)

@Rich-T-kid

Copy link
Copy Markdown
Contributor Author

@alamb could you re-run the micro-benchmarks?

@alamb

alamb commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

run benchmark dictionary_group_values

@adriangbot

Copy link
Copy Markdown

Benchmark for this request failed.

Last 20 lines of output:

Click to expand
 * [new tag]             48.0.0-rc1     -> 48.0.0-rc1
 * [new tag]             48.0.0-rc2     -> 48.0.0-rc2
 * [new tag]             5.0.0          -> 5.0.0
 * [new tag]             5.0.0-rc1      -> 5.0.0-rc1
 * [new tag]             5.0.0-rc3      -> 5.0.0-rc3
 * [new tag]             6.0.0          -> 6.0.0
 * [new tag]             6.0.0-rc0      -> 6.0.0-rc0
 * [new tag]             7.0.0          -> 7.0.0
 * [new tag]             7.0.0-rc2      -> 7.0.0-rc2
 * [new tag]             8.0.0          -> 8.0.0
 * [new tag]             8.0.0-rc1      -> 8.0.0-rc1
 * [new tag]             8.0.0-rc2      -> 8.0.0-rc2
 * [new tag]             9.0.0          -> 9.0.0
 * [new tag]             9.0.0-rc1      -> 9.0.0-rc1
 * [new tag]             ballista-0.5.0 -> ballista-0.5.0
 * [new tag]             ballista-0.6.0 -> ballista-0.6.0
 * [new tag]             ballista-0.7.0 -> ballista-0.7.0
 * [new tag]             python-0.3.0   -> python-0.3.0
 * [new tag]             python-0.4.0   -> python-0.4.0
Switched to branch 'rich-t-kid/optimize-dictionary-grouping'

File an issue against this benchmark runner

@Rich-T-kid

Copy link
Copy Markdown
Contributor Author

@alamb @adriangbot failed

@kumarUjjawal

Copy link
Copy Markdown
Contributor

run benchmark dictionary_group_values

@adriangbot

Copy link
Copy Markdown

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4726105593-579-2xdsl 6.12.68+ #1 SMP Sat May 2 07:49:07 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing rich-t-kid/optimize-dictionary-grouping (d440824) to 96a6096 (merge-base) diff using: dictionary_group_values
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot

Copy link
Copy Markdown

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected
Details

group                                                                             HEAD                                   rich-t-kid_optimize-dictionary-grouping
-----                                                                             ----                                   ---------------------------------------
dict_intern_emit/intern_emit/size_65536_card_1000_null_0.00                       5.04    875.0±5.77µs 71.4 MElem/sec    1.00    173.7±1.01µs 359.8 MElem/sec
dict_intern_emit/intern_emit/size_65536_card_20_null_0.00                         5.78    788.0±5.09µs 79.3 MElem/sec    1.00    136.3±0.67µs 458.5 MElem/sec
dict_intern_emit/intern_emit/size_65536_card_300_null_0.00                        5.54    817.0±5.09µs 76.5 MElem/sec    1.00    147.4±0.45µs 424.0 MElem/sec
dict_intern_emit/intern_emit/size_65536_card_65536_null_0.00                      1.54      6.3±0.01ms  9.9 MElem/sec    1.00      4.1±0.03ms 15.3 MElem/sec
dict_intern_emit/intern_emit/size_65536_card_75_null_0.00                         5.83    806.4±6.12µs 77.5 MElem/sec    1.00    138.4±0.50µs 451.5 MElem/sec
dict_intern_emit/intern_emit/size_8192_card_1000_null_0.00                        2.91    161.6±1.17µs 48.3 MElem/sec    1.00     55.6±0.39µs 140.6 MElem/sec
dict_intern_emit/intern_emit/size_8192_card_20_null_0.00                          5.59    105.0±1.02µs 74.4 MElem/sec    1.00     18.8±0.14µs 416.3 MElem/sec
dict_intern_emit/intern_emit/size_8192_card_300_null_0.00                         4.14    122.7±1.44µs 63.7 MElem/sec    1.00     29.7±0.22µs 263.3 MElem/sec
dict_intern_emit/intern_emit/size_8192_card_75_null_0.00                          5.32    111.4±1.35µs 70.1 MElem/sec    1.00     20.9±0.21µs 373.1 MElem/sec
dict_intern_emit/intern_emit/size_8192_card_8192_null_0.00                        1.89    667.5±2.09µs 11.7 MElem/sec    1.00    353.7±4.95µs 22.1 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_65536_card_1000_null_0.10     5.22      4.3±0.03ms 57.6 MElem/sec    1.00    832.5±3.83µs 300.3 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_65536_card_20_null_0.10       5.67      4.1±0.03ms 60.3 MElem/sec    1.00    730.8±4.59µs 342.1 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_65536_card_300_null_0.10      5.55      4.2±0.03ms 59.3 MElem/sec    1.00    760.2±4.13µs 328.9 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_65536_card_65536_null_0.10    1.13     17.2±0.05ms 14.5 MElem/sec    1.00     15.2±0.08ms 16.5 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_65536_card_75_null_0.10       5.70      4.2±0.02ms 59.3 MElem/sec    1.00    739.7±4.35µs 338.0 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_8192_card_1000_null_0.10      3.23    605.4±3.21µs 51.6 MElem/sec    1.00    187.4±1.25µs 166.7 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_8192_card_20_null_0.10        5.42    504.4±3.93µs 62.0 MElem/sec    1.00     93.0±0.34µs 336.1 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_8192_card_300_null_0.10       4.61    536.1±3.02µs 58.3 MElem/sec    1.00    116.4±0.70µs 268.5 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_8192_card_75_null_0.10        5.49    516.1±3.25µs 60.5 MElem/sec    1.00     94.0±0.44µs 332.6 MElem/sec
dict_repeated_intern_emit/repeated_intern_emit/size_8192_card_8192_null_0.10      1.67   1472.0±7.21µs 21.2 MElem/sec    1.00   881.4±13.47µs 35.5 MElem/sec

Resource Usage

dictionary_group_values — base (merge-base)

Metric Value
Wall time 450.1s
Peak memory 598.7 MiB
Avg memory 83.5 MiB
CPU user 230.5s
CPU sys 15.1s
Peak spill 0 B

dictionary_group_values — branch

Metric Value
Wall time 430.1s
Peak memory 856.6 MiB
Avg memory 50.5 MiB
CPU user 214.0s
CPU sys 19.4s
Peak spill 0 B

File an issue against this benchmark runner

@Rich-T-kid

Copy link
Copy Markdown
Contributor Author

@kumarUjjawal @alamb with #22078's performance gains, there are now no regressions even in the worst case 🚀

@Rich-T-kid

Copy link
Copy Markdown
Contributor Author

re-ran the E2E benchmarks as well in benchmarks/src/dict.rs
Image 6-17-26 at 10 00 AM

@kumarUjjawal

Copy link
Copy Markdown
Contributor

@alamb This looks like a good improvement to me, what do you think?

@alamb

alamb commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

FYI @zhuqi-lucas -- perhaps is this part of your work to complete type support of Dictionaries?

@alamb alamb left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @Rich-T-kid and @kumarUjjawal -- I think this is a valuable direction to be pushing, but I think we should try and get something more holistic in place (e.g. have a plan for nested types in general not just Dictionary)

.unwrap_or(self.row_buffer.len());
Some(&self.row_buffer[start..end])
});
match &self.value_dt {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I am worried about needing special code for different element types in the dictionary

In general, this seems like:

  1. It limits the types of DictionaryArray that can be supported (e.g. this doesn't support Dicts with struct elements)
  2. It will have substantial amounts of code (b/c each type of dictionary now gets its own branch in this match statement)

What I think @zhuqi-lucas is trying to do as part of

And some related PRs is to make a generic way to handle nested types that doesn't require generating code for all possible type combinations.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for the review @alamb

I think it makes sense to only support Dictionary<_, Utf8/Utf8View>, this would significantly simplify the code paths while delivering a large performance boost for production-like data shapes (low-cardinality group bys). Similar to other single-column group by specializations, it leaves a viable fallback that loses nothing when the fast path isn't hit.

With that in mind, @zhuqi-lucas's work looks very interesting -- I had a similar idea with #22891 and #21878. The issue was that multi-column group-bys on dictionaries fell through to GroupValuesRows, which was slow. I'm actually working on a multi-column dictionary group by in #22983, but I'm running into similar issues around handling multiple distinct types. For the multi-column case it makes more sense to build on @zhuqi-lucas's work, since the number of types that need to be supported explodes and there are diminishing returns to exploiting dictionary properties across multiple columns and intern calls (key space cache, arc ptr cache, compute keys once and reuse).

As for this PR specifically, I do think it adds significant value. Would restricting the supported value types be a viable path forward? It would keep the type combinations from exploding while still letting the community benefit from the speedup.

DROP TABLE dict_hash_src;

query TI
SELECT

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a bunch of code to handle many more types of dictionaries such as Lists, etc. -- can we please add more tests to cover those cases?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this PR is already at 1k LOC, I'll open a separate PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Make DataFusion faster physical-plan Changes to the physical-plan crate review:waiting Ready for an initial review by a committer sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Materialize Dictionaries in Group Keys

8 participants