Skip to content

Implement join reordering of fact-dimension joins#1027

Merged
charlesbluca merged 19 commits into
dask-contrib:mainfrom
sarahyurick:join_reorder
Apr 5, 2023
Merged

Implement join reordering of fact-dimension joins#1027
charlesbluca merged 19 commits into
dask-contrib:mainfrom
sarahyurick:join_reorder

Conversation

@sarahyurick

Copy link
Copy Markdown
Collaborator

Reopening #950 here.

This PR implements a new logical plan optimization rule based on the paper Improving Join Reordering for Large Scale Distributed Computing.

@codecov-commenter

codecov-commenter commented Feb 2, 2023

Copy link
Copy Markdown

Codecov Report

Merging #1027 (1b96d7b) into main (dc5c6fe) will increase coverage by 0.11%.
The diff coverage is n/a.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

@@            Coverage Diff             @@
##             main    #1027      +/-   ##
==========================================
+ Coverage   81.75%   81.87%   +0.11%     
==========================================
  Files          78       78              
  Lines        4380     4380              
  Branches      788      788              
==========================================
+ Hits         3581     3586       +5     
+ Misses        626      617       -9     
- Partials      173      177       +4     

see 1 file with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@sarahyurick sarahyurick marked this pull request as ready for review March 1, 2023 21:07

@sarahyurick sarahyurick left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this PR does not work for all cases, it implements basic join reordering logic. I've opened #1069 to address additional features in the future.

For now, though, I'll be focusing on dynamic partition pruning, so I wanted to open this PR up for review.

Self {
max_fact_tables: 2,
// FIXME: fact_dimension_ratio should be 0.3
fact_dimension_ratio: 0.7,

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, this should be 0.3, but queries 17, 25, 29, and 85 currently fail without a stricter ratio. This suggests that more work needs to be done with reordering fact-to-fact joins.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

roger

right: LogicalPlan,
join_type: JoinType,
join_keys: (Vec<impl Into<Column>>, Vec<impl Into<Column>>),
filter: Option<Expr>,

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably the biggest improvement we need is to support join filters correctly. This should allow us to run query 72, which is expected to have the largest performance gain with join reordering. Other queries affected include queries 75 and 93.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interested in what the blockers here are? Is there an issue or upstream PR we could link here (and in other related FIXMEs)

@sarahyurick sarahyurick Mar 31, 2023

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There aren't any upstream issues that I'm aware of. Mostly just personal preference to push this to a later iteration of the rule since DPP is currently higher priority, and I think this change would require a decent amount of refactoring.

It's currently listed in #1069

@jdye64 jdye64 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great PR. This is a complicated bit of logic but was well written and easy to follow/understand. I would like to see a user facing warning when the statistics defaults to 100 for the row count just so users don't get blindsided by unexpected optimizations. Otherwise this is great.

Self {
max_fact_tables: 2,
// FIXME: fact_dimension_ratio should be 0.3
fact_dimension_ratio: 0.7,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

roger

Comment thread dask_planner/src/sql/optimizer/join_reorder.rs
Comment thread dask_planner/src/sql/optimizer/join_reorder.rs Outdated

@charlesbluca charlesbluca left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sarahyurick! Just a few initial comments before I dig into the bulk of the algorithm itself:

Comment thread dask_planner/src/sql/optimizer/join_reorder.rs
Comment thread dask_planner/src/sql/optimizer/join_reorder.rs Outdated

@ayushdg ayushdg left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a couple of test cases here with different table sizes to assert things are re-ordering as expected? No strong preference on whether it should be python or a rust test.

Comment thread dask_planner/src/sql/optimizer/join_reorder.rs Outdated
@sarahyurick

Copy link
Copy Markdown
Collaborator Author

Thanks @charlesbluca and @ayushdg ! Updated from your suggestions, lmk what you think.

@randerzander randerzander left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I compared Dask-SQL main vs this PR on a set of internal benchmark queries. This PR is about 3% faster overall. Any query where perf was worse looks within typical noise range.

}
}

if facts.is_empty() || dims.is_empty() {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice, is facts.is_empty() ever true? The only case I could think of where this would be the case is if rels.is_empty(), but not sure if I'm missing something

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried removing facts.is_empty() and ended up with a couple of PyTest failures. You're right that it's serving basically the same purpose as having rels.is_empty() would.

Comment thread dask_planner/src/sql/optimizer/join_reorder.rs Outdated
Comment thread dask_planner/src/sql/optimizer/join_reorder.rs Outdated
Comment thread dask_planner/src/sql/optimizer/join_reorder.rs Outdated
Comment thread dask_planner/src/sql/optimizer/join_reorder.rs Outdated
right: LogicalPlan,
join_type: JoinType,
join_keys: (Vec<impl Into<Column>>, Vec<impl Into<Column>>),
filter: Option<Expr>,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interested in what the blockers here are? Is there an issue or upstream PR we could link here (and in other related FIXMEs)

Comment thread dask_planner/src/sql/optimizer/join_reorder.rs Outdated
assert_eq(result_df, expected_df)


def test_join_reorder(c):

@charlesbluca charlesbluca Mar 31, 2023

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add some tests around other potential join cases we would run into? In particular maybe some with:

  • unsupported join types / conditions
  • joins that would result in more than 2 dominant fact tables

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic for joins involving different combinations of numbers of fact tables versus numbers of dimension tables is a little shaky (especially depending on the number of fact tables and fact-fact joins), so I'd prefer to hold off there until I have a better solution for multiple fact tables. I have this listed in #1069

I can work on examples for unsupported join types and conditions, though.

sarahyurick and others added 2 commits March 31, 2023 14:15
Co-authored-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>
@sarahyurick

Copy link
Copy Markdown
Collaborator Author

Thanks @charlesbluca ! I've updated from your suggestions. I also added an additional test: by default, join reordering should only be done with filtered tables, so we demonstrate that the plan remains unchanged with unfiltered tables. Let me know if there's any additional conditions we should check.

@charlesbluca charlesbluca left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing all my comments @sarahyurick! Overall this LGTM, though would also like to get a thumbs up from @jdye64 and/or @ayushdg here who have a little more experience with optimizers

@charlesbluca charlesbluca merged commit ce55082 into dask-contrib:main Apr 5, 2023
@sarahyurick sarahyurick deleted the join_reorder branch May 26, 2023 22:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants