multi dictionary benchmarks by Rich-T-kid · Pull Request #22859 · apache/datafusion

Rich-T-kid · 2026-06-09T19:59:39Z

Which issue does this PR close?

Works towards closing #21878

Rationale for this change

There are currently no benchmarks that cover GroupValueRow() when dictionary arrays are passed in. #22004 adds benchmarks for the single-column case. This PR adds benchmarks for the multi-column case.

What changes are included in this PR?

Adds a new Criterion harness for GroupValues over multi-column Dictionary<UInt64, Utf8> GROUP BY workloads, covering 4 and 8 group-by columns across batch sizes of 8 × 1024 and 64 × 1024 rows. It tests four realistic cardinalities (20, 100, 500, and 1,000 distinct values) with per-column variance introduced by sampling each column's distinct count from a ±5% window around the target, so columns within the same batch have slightly different cardinalities. Two benchmark variants are included: repeated intern+emit, and a partial-emit pattern that spills half the accumulated groups after each intern. Together they stress both the steady-state and incremental-flush paths of the group values implementation.
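To make the ±5% sampling concrete, here is a minimal sketch of how per-column distinct counts could be drawn around a target. It is not the code from this PR: `Lcg` and `sample_cardinalities` are hypothetical names, and the tiny LCG stands in for whatever seeded RNG the harness actually uses so that the example needs no external crates.

```rust
/// Tiny deterministic generator so the sketch is dependency-free.
/// Assumption: the real harness uses a proper seeded RNG instead.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        // Widely used 64-bit LCG constants; the high bits are returned
        // because the low bits of an LCG are weak.
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

/// Hypothetical helper: one distinct count per column, each sampled from
/// the inclusive window [target * 0.95, target * 1.05] using integer math.
fn sample_cardinalities(target: usize, n_cols: usize, rng: &mut Lcg) -> Vec<usize> {
    let lo = (target * 95 / 100).max(1); // round down, never below 1
    let hi = ((target * 105 + 99) / 100).max(lo); // round up
    let span = (hi - lo + 1) as u64;
    (0..n_cols)
        .map(|_| lo + (rng.next_u64() % span) as usize)
        .collect()
}

fn main() {
    let mut rng = Lcg(42);
    for target in [20usize, 100, 500, 1_000] {
        let cards = sample_cardinalities(target, 4, &mut rng);
        println!("target={target} per-column distinct counts={cards:?}");
    }
}
```

Each column ends up with a slightly different distinct count, which is the per-column variance described above.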

Are these changes tested?

n/a

Are there any user-facing changes?

No

Rich-T-kid

Going to do another review pass tomorrow before pinging a reviewer to take a look!

Rich-T-kid · 2026-06-09T20:01:52Z

+    Arc::new(DictionaryArray::<UInt64Type>::try_new(key_array, values).unwrap())
+}
+
+fn make_batch(


I think we should add a value type as a parameter here. Strings are currently covered; in the future it may make sense to cover other types as well. I left that out of this PR, but I can introduce it if needed.

I think the types that need to be benchmarked are utf8, List<utf8>, and Binary. I don't think their variants (utf8View, List, binaryView, LargeBinary, FixedSizeBinary, etc.) differ enough to produce any meaningful performance difference.
Is there any recommendation on how to do this without multiplying the number of benchmarks we need to run by 3x?

Rich-T-kid · 2026-06-09T20:04:01Z

+/// each column's distinct count is sampled from [target*0.95, target*1.05].
+const CARDINALITY_RANGE: usize = 10;
+
+fn schema_for_cols(n_cols: usize) -> SchemaRef {


If/when we introduce other value data types, this function will need to change.

Rich-T-kid · 2026-06-10T21:03:50Z

+
+const SIZES: [usize; 2] = [8 * 1024, 64 * 1024];
+const N_COLS: [usize; 2] = [4, 8];
+const CARDS: [usize; 4] = [20, 100, 500, 1_000];


Should I also include a 1-1 cardinality ratio similar to #21765?
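For context on how large the grid is, the three constants above multiply out as sketched below. `BenchCase` and `all_cases` are hypothetical names used only for illustration; the `rows_per_group` figure assumes that a "1-1 ratio" means one distinct group per row.

```rust
const SIZES: [usize; 2] = [8 * 1024, 64 * 1024];
const N_COLS: [usize; 2] = [4, 8];
const CARDS: [usize; 4] = [20, 100, 500, 1_000];

/// Hypothetical description of one benchmark case; not a type from the PR.
#[derive(Debug, Clone, Copy, PartialEq)]
struct BenchCase {
    rows: usize,
    n_cols: usize,
    target_distinct: usize,
}

/// Enumerate the full Cartesian product of the three parameter lists.
fn all_cases() -> Vec<BenchCase> {
    let mut cases = Vec::new();
    for &rows in &SIZES {
        for &n_cols in &N_COLS {
            for &target_distinct in &CARDS {
                cases.push(BenchCase { rows, n_cols, target_distinct });
            }
        }
    }
    cases
}

fn main() {
    let cases = all_cases();
    // 2 sizes * 2 column counts * 4 cardinalities = 16 cases per variant.
    println!("{} cases", cases.len());
    for c in &cases {
        // Average rows per group: a value of 1 would be a 1-1 ratio.
        println!("{c:?} rows_per_group={}", c.rows / c.target_distinct);
    }
}
```

Under that reading, a 1-1 case would add one more cardinality entry per batch size rather than a new dimension.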

Rich-T-kid · 2026-06-15T01:51:06Z

@kumarUjjawal could you take a look at this and #22888 when you get a chance? Thank you

kumarUjjawal

Thank you @Rich-T-kid

Left a couple of comments, let me know what you think.

kumarUjjawal · 2026-06-15T03:53:23Z

+        perm.shuffle(&mut rng);
+        perm
+    } else {
+        (0..size)


We should generate shared row group ids first, then map each group id to a multi-column tuple. That keeps the total number of group combinations near target_distinct.

Good catch! Previously, column_count * distinct groups were being generated. I've updated the code to generate the group ids first and then pass them into the creation of the dictionaries. This way, no matter how many columns are produced, they yield the same number of group ids.

I've also added benchmarks for cross-product group-bys to better reflect the worst-case scenario, e.g.

GROUP BY department, city, rank

as opposed to the current benchmarks, which cover group-bys where the row space may be shared, e.g.

GROUP BY employee_id, department, manager_id
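The group-id-first approach discussed in this thread can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `Lcg`, `row_group_ids`, `keys_for_column`, and `distinct_tuples` are hypothetical names, and the per-column rotation is an arbitrary choice of a mapping that keeps distinct group ids distinct.

```rust
use std::collections::HashSet;

/// Tiny deterministic generator so the sketch is dependency-free.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

/// Step 1: one shared group id per row, drawn from 0..target_distinct.
fn row_group_ids(rows: usize, target_distinct: usize, rng: &mut Lcg) -> Vec<usize> {
    (0..rows)
        .map(|_| (rng.next_u64() % target_distinct as u64) as usize)
        .collect()
}

/// Step 2: derive a column's dictionary keys from the shared group ids.
/// Rotating by the column index is injective on 0..target_distinct, so
/// distinct group ids always map to distinct keys in every column.
fn keys_for_column(group_ids: &[usize], col: usize, target_distinct: usize) -> Vec<usize> {
    group_ids.iter().map(|&g| (g + col) % target_distinct).collect()
}

/// Count distinct row tuples across all columns.
fn distinct_tuples(columns: &[Vec<usize>]) -> usize {
    let rows = columns.first().map_or(0, |c| c.len());
    (0..rows)
        .map(|r| columns.iter().map(|c| c[r]).collect::<Vec<usize>>())
        .collect::<HashSet<_>>()
        .len()
}

fn main() {
    let mut rng = Lcg(1);
    let (rows, target) = (8 * 1024, 100);
    let ids = row_group_ids(rows, target, &mut rng);
    for n_cols in [4usize, 8] {
        let cols: Vec<Vec<usize>> = (0..n_cols)
            .map(|c| keys_for_column(&ids, c, target))
            .collect();
        // Never exceeds `target`, regardless of how many columns there are.
        println!("n_cols={n_cols} distinct tuples={}", distinct_tuples(&cols));
    }
}
```

Because every column is a function of the same group id, adding columns widens the key without increasing the number of groups. Sampling each column independently would instead let the number of distinct tuples grow with the column count.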

Rich-T-kid · 2026-06-15T14:26:22Z

Thanks for the review and comments @kumarUjjawal. I've updated the PR; could you take another look?

kumarUjjawal · 2026-06-16T05:42:19Z

@Rich-T-kid I was running this locally and I think this benchmark is not registered anywhere. It only ran after I added it to the Cargo.toml. Could you verify locally?

Rich-T-kid · 2026-06-16T12:29:30Z

I forgot to include the cargo file in the commit. Should be good to go now @kumarUjjawal

Rich-T-kid · 2026-06-21T23:44:47Z

I replaced CARDS = [20, 100, 500, 1_000] (total distinct groups) with PER_COL_CARDS = [3, 4, 5, 6] (values per column), so that cross-products stay below the batch size and groups are actually reused.

I also fixed the trailing columns in make_batch() by setting n_groups = per_col_card^n_cols instead of target_distinct. This ensures that group IDs span the full mixed-radix space, so every column exercises all of its values.
@kumarUjjawal
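The mixed-radix point can be illustrated with a small sketch. It assumes each group id is decoded as a base-`per_col_card` number with one digit per column, least-significant column first; that ordering and the names `decode_group` and `values_seen_per_column` are assumptions for illustration, not taken from make_batch().

```rust
/// Decode a group id into one value index per column, treating the id as
/// a number written in base `per_col_card` (least-significant digit first).
fn decode_group(mut group_id: usize, per_col_card: usize, n_cols: usize) -> Vec<usize> {
    let mut digits = Vec::with_capacity(n_cols);
    for _ in 0..n_cols {
        digits.push(group_id % per_col_card);
        group_id /= per_col_card;
    }
    digits
}

/// For group ids 0..n_groups, count how many distinct values each column sees.
fn values_seen_per_column(n_groups: usize, per_col_card: usize, n_cols: usize) -> Vec<usize> {
    let mut seen = vec![vec![false; per_col_card]; n_cols];
    for g in 0..n_groups {
        for (col, digit) in decode_group(g, per_col_card, n_cols).into_iter().enumerate() {
            seen[col][digit] = true;
        }
    }
    seen.iter()
        .map(|s| s.iter().filter(|&&b| b).count())
        .collect()
}

fn main() {
    let (per_col_card, n_cols) = (3usize, 4usize);
    let full = per_col_card.pow(n_cols as u32); // 3^4 = 81 group ids

    // Too few group ids: the last column never leaves value index 0.
    println!("{:?}", values_seen_per_column(20, per_col_card, n_cols)); // [3, 3, 3, 1]

    // Full mixed-radix space: every column exercises all of its values.
    println!("{:?}", values_seen_per_column(full, per_col_card, n_cols)); // [3, 3, 3, 3]
}
```

With only 20 group ids the highest-order digit is always 0, which matches the trailing-column problem described above. Spanning all per_col_card^n_cols ids removes it.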

Rich-T-kid added 2 commits June 9, 2026 13:31

first iterion for multi-column dictionary benchmarks

c0f60ae

introduce multi-column dictionary group value benchmarks

d9f1af0

github-actions Bot added the physical-plan Changes to the physical-plan crate label Jun 9, 2026

Rich-T-kid commented Jun 9, 2026

Rich-T-kid commented Jun 10, 2026

re-run CI

9fd8d2a

Rich-T-kid changed the title ~~Rich t kid/multi dictionary benchmarks~~ multi dictionary benchmarks Jun 10, 2026

kumarUjjawal reviewed Jun 15, 2026

Rich-T-kid added 2 commits June 15, 2026 10:16

fix distinct_rows calculation and introduce benchmarks to track cross-product aggregations

45005c4

fix github issues

0a2cac4

Rich-T-kid commented Jun 15, 2026

Comment thread datafusion/physical-plan/benches/multi_column_dictionary_group_values.rs

include benchmark harness in cargo.toml

eb46aec

Rich-T-kid force-pushed the rich-T-kid/Multi-dictionary-benchmarks branch from 1333642 to eb46aec Compare June 16, 2026 13:11

Rich-T-kid mentioned this pull request Jun 16, 2026

Improved performance for streaming grouping with single string columns #9195

Open

kumarUjjawal reviewed Jun 18, 2026

Comment thread datafusion/physical-plan/benches/multi_column_dictionary_group_values.rs

Comment thread datafusion/physical-plan/benches/multi_column_dictionary_group_values.rs

change total cardinality count of groups

d20300f

2 participants
