Skip to content

Rich t kid/implement multi dictionary aggr#22983

Draft
Rich-T-kid wants to merge 5 commits into
apache:mainfrom
Rich-T-kid:rich-T-kid/implement-multi-dictionary-aggr
Draft

Rich t kid/implement multi dictionary aggr#22983
Rich-T-kid wants to merge 5 commits into
apache:mainfrom
Rich-T-kid:rich-T-kid/implement-multi-dictionary-aggr

Conversation

@Rich-T-kid

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

  • Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions Bot added the physical-plan Changes to the physical-plan crate label Jun 16, 2026

@Rich-T-kid Rich-T-kid left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alot of minor things to clean up.
notably most of the logic here isn't too specific to dictionary arrays besides caching the values pointer. we need benchmarks to determine if this is worth approaching over a generic implementation like GroupValueRows

Comment thread datafusion/physical-plan/src/aggregates/group_values/multi_group_by/dict.rs Outdated
Comment thread datafusion/physical-plan/src/aggregates/group_values/multi_group_by/dict.rs Outdated
Comment thread datafusion/physical-plan/src/aggregates/group_values/multi_group_by/dict.rs Outdated
Comment thread datafusion/physical-plan/src/aggregates/group_values/multi_group_by/dict.rs Outdated
Comment thread datafusion/physical-plan/src/aggregates/group_values/multi_group_by/dict.rs Outdated
Comment thread datafusion/physical-plan/src/aggregates/group_values/multi_group_by/dict.rs Outdated
@Rich-T-kid

Copy link
Copy Markdown
Contributor Author

going through this now

@Rich-T-kid Rich-T-kid force-pushed the rich-T-kid/implement-multi-dictionary-aggr branch from c6c4c5b to d03f712 Compare June 18, 2026 19:12
@Rich-T-kid Rich-T-kid force-pushed the rich-T-kid/implement-multi-dictionary-aggr branch from d03f712 to ef4ba19 Compare June 22, 2026 17:12
@Rich-T-kid

Rich-T-kid commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

Still iterating on the PR, these are loose ideas for optimizations (intern path)

use VecDeque instead of Vec

  • This should avoid the O(n) that happens for Vec<T>.drain()

adjust the row-tuple cache to be configurable

  • currently set to 10,000, once it reaches this size the elements that were written to remain there but are never read again.
  • need to strike a balance between having a cache for per batch keys as well as avoiding ever growing memory in the worst case
  • Stop inserting after sparse-cache cap, but keep looking up
  • Mabey a heap of the top K elements? K can be set as through SessionConfig?

avoid current double allocation spent for option

  • options that wrap non pointer types generally take up more space. Option takes up 16 bytes instead of just 8
  • mabey use Option<NonZeroUsize> 🤔

bitpack key-tuples cache

  • instead of using a slice of usize (u64) bit back the combination of tuples into one u64 or 128
  • need to do more research on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant