Skip to content

ARROW-11733: [Rust][DataFusion] Implement hash partitioning#9548

Closed
Dandandan wants to merge 11 commits into
apache:masterfrom
Dandandan:hash_repartition
Closed

ARROW-11733: [Rust][DataFusion] Implement hash partitioning#9548
Dandandan wants to merge 11 commits into
apache:masterfrom
Dandandan:hash_repartition

Conversation

@Dandandan

@Dandandan Dandandan commented Feb 22, 2021

Copy link
Copy Markdown
Contributor

This PR implements a first version of hash repartition.
This can used to (further) parallelize hash joins / aggregates or to implement distributed algorithms like ShuffleHashJoin(https://github.com/ballista-compute/ballista/issues/595 )

I didn't yet optimize for speed, as I think it makes sense to implement it and look for improvements later.

FYI @andygrove

@github-actions

Copy link
Copy Markdown

@Dandandan Dandandan changed the title ARROW-11733: [Rust][DataFusion] Hash repartition ARROW-11733: [Rust][DataFusion] Hash repartition [WIP] Feb 22, 2021
@Dandandan Dandandan changed the title ARROW-11733: [Rust][DataFusion] Hash repartition [WIP] ARROW-11733: [Rust][DataFusion] Hash repartition Feb 23, 2021
@Dandandan Dandandan changed the title ARROW-11733: [Rust][DataFusion] Hash repartition ARROW-11733: [Rust][DataFusion] Implement hash repartition Feb 23, 2021
@Dandandan Dandandan changed the title ARROW-11733: [Rust][DataFusion] Implement hash repartition ARROW-11733: [Rust][DataFusion] Implement hash partitioning Feb 23, 2021
@codecov-io

Copy link
Copy Markdown

Codecov Report

Merging #9548 (bbba43f) into master (39b23b7) will increase coverage by 0.00%.
The diff coverage is 81.13%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #9548   +/-   ##
=======================================
  Coverage   82.28%   82.29%           
=======================================
  Files         244      244           
  Lines       55616    55659   +43     
=======================================
+ Hits        45766    45804   +38     
- Misses       9850     9855    +5     
Impacted Files Coverage Δ
rust/datafusion/src/physical_plan/repartition.rs 77.52% <80.76%> (+2.71%) ⬆️
rust/datafusion/src/physical_plan/hash_join.rs 83.52% <100.00%> (ø)
rust/datafusion/src/physical_plan/mod.rs 88.00% <0.00%> (+2.00%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 39b23b7...bbba43f. Read the comment docs.

@andygrove andygrove left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great @Dandandan. I agree that it makes sense to get the functionality working first and optimize later.

@alamb

alamb commented Feb 23, 2021

Copy link
Copy Markdown
Contributor

The integration failure looks like https://issues.apache.org/jira/browse/ARROW-11717

@alamb alamb left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good -- nice work @Dandandan

Comment thread rust/datafusion/src/physical_plan/repartition.rs Outdated

let total_rows: usize = output_partitions.iter().map(|x| x.len()).sum();

assert_eq!(8, output_partitions.len());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make sense here also to assert on the distribution of rows (e.g. ensure that each batch has ~ 50*3 rows?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, but not sure how to do that currently, as it depends on random state (it could happen that all of them end up on same hash / partition in a very rare case).

@Dandandan

Copy link
Copy Markdown
Contributor Author

@alamb from my side the PR is good to go

@alamb alamb closed this in d731c91 Feb 26, 2021
@alamb

alamb commented Feb 26, 2021

Copy link
Copy Markdown
Contributor

Thanks @Dandandan -- looks great.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants