multi dictionary benchmarks #22859

Open
Rich-T-kid wants to merge 7 commits into apache:main from Rich-T-kid:rich-T-kid/Multi-dictionary-benchmarks

@Rich-T-kid Rich-T-kid commented Jun 9, 2026

Contributor

Which issue does this PR close?

Works towards closing #21878

Rationale for this change

There are currently no benchmarks that cover GroupValueRow() when dictionary arrays are passed in. #22004 adds benchmarks for the single-column case. This PR adds benchmarks for the multi-column case.

What changes are included in this PR?

Adds a new Criterion harness for GroupValues over multi-column Dictionary<UInt64, Utf8> GROUP BY workloads, covering 4 and 8 group-by columns across batch sizes of 8 × 1024 (8,192) and 64 × 1024 (65,536) rows. It tests four realistic cardinalities (20, 100, 500, and 1,000 distinct values), with per-column variance introduced by sampling each column's distinct count from a ±5% window around the target, so columns within the same batch have slightly different cardinalities. Two benchmark variants are included: repeated intern+emit, and a partial-emit pattern that spills half the accumulated groups after each intern. Together they stress both the steady-state and incremental-flush paths of the group values implementation.
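The per-column variance described above can be illustrated with a small, dependency-free sketch. It is not the code in this PR: `sample_column_cardinalities` is a hypothetical helper, and the `SplitMix64` generator only stands in for whatever seeded RNG the benchmark actually uses (presumably from the `rand` crate). The window bounds are computed with integer arithmetic, so they are rounded down.

```rust
/// Tiny deterministic generator (SplitMix64). Assumption: it stands in for
/// the seeded RNG used by the real benchmark, which is not shown here.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical helper: pick one distinct count per column from the
/// inclusive window [target * 0.95, target * 1.05], never below 1.
fn sample_column_cardinalities(target: usize, n_cols: usize, seed: u64) -> Vec<usize> {
    let lo = (target * 95 / 100).max(1);
    let hi = (target * 105 / 100).max(lo);
    let span = (hi - lo + 1) as u64;
    let mut rng = SplitMix64(seed);
    (0..n_cols)
        .map(|_| lo + (rng.next_u64() % span) as usize)
        .collect()
}

fn main() {
    // Same targets as the CARDS constant quoted later in this thread.
    for target in [20, 100, 500, 1_000] {
        let cards = sample_column_cardinalities(target, 4, 42);
        println!("target={target} per-column distinct counts={cards:?}");
    }
}
```

Using a fixed seed keeps the sampled cardinalities identical across runs, which matters when comparing benchmark results between commits.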

Are these changes tested?

n/a

Are there any user-facing changes?

No

github-actions bot added the physical-plan label (Changes to the physical-plan crate) Jun 9, 2026

@Rich-T-kid Rich-T-kid left a comment

Contributor Author

I'm going to do another review pass tomorrow before pinging a reviewer to take a look!

Arc::new(DictionaryArray::<UInt64Type>::try_new(key_array, values).unwrap())
}

fn make_batch(

Contributor Author

I think we should add the value type as a parameter here. Strings are currently covered; in the future it may make sense to cover other types as well. I left that out of this PR, but I can introduce it if needed.

Contributor Author

I think the value types that need to be benchmarked are Utf8, List<Utf8>, and Binary. I don't think their variants (Utf8View, List, BinaryView, LargeBinary, FixedSizeBinary, etc.) differ enough to produce any meaningful performance differences.
Is there any recommendation on how to do this without multiplying the number of benchmarks we need to run by 3x?

/// each column's distinct count is sampled from [target*0.95, target*1.05].
const CARDINALITY_RANGE: usize = 10;

fn schema_for_cols(n_cols: usize) -> SchemaRef {

Contributor Author

if/when we introduce other value data types, this function needs to change


const SIZES: [usize; 2] = [8 * 1024, 64 * 1024];
const N_COLS: [usize; 2] = [4, 8];
const CARDS: [usize; 4] = [20, 100, 500, 1_000];

Contributor Author

Should I also include a 1-1 cardinality ratio similar to #21765?

Rich-T-kid changed the title from "Rich t kid/multi dictionary benchmarks" to "multi dictionary benchmarks" Jun 10, 2026
@Rich-T-kid

Contributor Author

@kumarUjjawal could you take a look at this and #22888 when you get a chance? Thank you

@kumarUjjawal kumarUjjawal left a comment

Contributor

Thank you @Rich-T-kid

Left a couple of comments; let me know what you think.

perm.shuffle(&mut rng);
perm
} else {
(0..size)

Contributor

We should generate shared row group ids first, then map each group id to a multi-column tuple. That keeps total group combinations near target_distinct

@Rich-T-kid Rich-T-kid Jun 15, 2026

Contributor Author

Good catch! Previously, column_count * distinct groups were being generated. I've updated the code to generate the group IDs first and then pass them into the creation of the dictionaries. This way, no matter how many columns are produced, they yield the same number of group IDs.

I've also added benchmarks for cross-product group-bys to better reflect the worst-case scenario, e.g.

GROUP BY department, city, rank

as opposed to the current benchmarks, which cover group-bys where the row space may be shared, e.g.

GROUP BY employee_id, department, manager_id
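A minimal sketch of the "shared group IDs first" idea discussed above, using plain `Vec<u64>` key columns rather than Arrow dictionary arrays. All function names here (`row_group_ids`, `column_key`, `make_key_columns`, `count_distinct_rows`) are hypothetical and are not taken from the PR; the round-robin ID assignment is an assumption made to avoid an RNG dependency.

```rust
use std::collections::HashSet;

/// Hypothetical helper: assign every row a group id in 0..target_distinct.
/// A real benchmark would likely shuffle or sample these with a seeded RNG;
/// round-robin keeps this sketch dependency-free.
fn row_group_ids(n_rows: usize, target_distinct: usize) -> Vec<usize> {
    (0..n_rows).map(|row| row % target_distinct).collect()
}

/// Hypothetical helper: derive one column's dictionary key from the shared
/// group id. Each column applies a different rotation, which is a bijection
/// on 0..target_distinct, so distinct ids always give distinct tuples.
fn column_key(group_id: usize, col: usize, target_distinct: usize) -> u64 {
    ((group_id + col * 7) % target_distinct) as u64
}

/// Build `n_cols` key columns that all follow the same per-row group ids.
fn make_key_columns(n_rows: usize, n_cols: usize, target_distinct: usize) -> Vec<Vec<u64>> {
    let ids = row_group_ids(n_rows, target_distinct);
    (0..n_cols)
        .map(|col| {
            ids.iter()
                .map(|&g| column_key(g, col, target_distinct))
                .collect()
        })
        .collect()
}

/// Count the distinct multi-column tuples, i.e. the number of groups a
/// GROUP BY over all columns would produce.
fn count_distinct_rows(columns: &[Vec<u64>]) -> usize {
    let n_rows = columns.first().map_or(0, |c| c.len());
    (0..n_rows)
        .map(|row| columns.iter().map(|c| c[row]).collect::<Vec<u64>>())
        .collect::<HashSet<_>>()
        .len()
}

fn main() {
    for n_cols in [4, 8] {
        let columns = make_key_columns(8 * 1024, n_cols, 100);
        println!("n_cols={n_cols} distinct groups={}", count_distinct_rows(&columns));
    }
}
```

Because every column is a function of the same per-row ID, the number of distinct tuples is bounded by `target_distinct` regardless of how many columns are generated, which is the property the review comment asks for.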

Comment thread datafusion/physical-plan/benches/multi_column_dictionary_group_values.rs Outdated
@Rich-T-kid

Contributor Author

Thanks for the review and comments @kumarUjjawal. I've updated the PR; could you take another look?

@kumarUjjawal

Contributor

@Rich-T-kid I was running this locally, and I think this benchmark is not registered anywhere. It only ran after I added it to the Cargo.toml. Could you verify locally?

@Rich-T-kid

Contributor Author

I forgot to include the Cargo.toml file in the commit. It should be good to go now @kumarUjjawal

@Rich-T-kid

Rich-T-kid commented Jun 21, 2026

Contributor Author

I replaced CARDS = [20, 100, 500, 1_000] (total distinct groups) with PER_COL_CARDS = [3, 4, 5, 6] (values per column), so that cross-products stay below the batch size and groups are actually reused across rows.

I also fixed the trailing columns in make_batch() by setting n_groups = per_col_card^n_cols instead of target_distinct. This ensures group IDs span the full mixed-radix space, so every column exercises all of its values.
@kumarUjjawal
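To make the mixed-radix point concrete, here is a small sketch of how a group ID could be decoded into per-column keys, and why too few group IDs leave trailing columns under-exercised. The names `mixed_radix_keys` and `distinct_values_per_column` are hypothetical; the PR's make_batch() may decode IDs differently.

```rust
use std::collections::HashSet;

/// Hypothetical helper: decode a group id into one key per column by
/// reading it as an `n_cols`-digit number in base `per_col_card`
/// (column 0 is the least significant digit).
fn mixed_radix_keys(group_id: usize, per_col_card: usize, n_cols: usize) -> Vec<u64> {
    let mut rest = group_id;
    (0..n_cols)
        .map(|_| {
            let digit = rest % per_col_card;
            rest /= per_col_card;
            digit as u64
        })
        .collect()
}

/// Count how many distinct keys each column sees when group ids
/// 0..n_groups are decoded.
fn distinct_values_per_column(n_groups: usize, per_col_card: usize, n_cols: usize) -> Vec<usize> {
    let mut seen: Vec<HashSet<u64>> = vec![HashSet::new(); n_cols];
    for group_id in 0..n_groups {
        let keys = mixed_radix_keys(group_id, per_col_card, n_cols);
        for (col, key) in keys.into_iter().enumerate() {
            seen[col].insert(key);
        }
    }
    seen.iter().map(|s| s.len()).collect()
}

fn main() {
    let per_col_card = 3usize;
    let n_cols = 4usize;

    // Full space: per_col_card^n_cols ids, so every column sees every value.
    let full = per_col_card.pow(n_cols as u32);
    println!("n_groups={full}: {:?}", distinct_values_per_column(full, per_col_card, n_cols));

    // Too few ids: the high-order (trailing) columns never change.
    println!("n_groups=20: {:?}", distinct_values_per_column(20, per_col_card, n_cols));
}
```

With only 20 group IDs, base 3, and 4 columns, the last column stays at a single value because no ID reaches 27; spanning all 81 IDs gives each column its full 3 values.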

Labels

physical-plan Changes to the physical-plan crate
