Implement join reordering of fact-dimension joins#1027
Conversation
Codecov Report
📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more @@ Coverage Diff @@
## main #1027 +/- ##
==========================================
+ Coverage 81.75% 81.87% +0.11%
==========================================
Files 78 78
Lines 4380 4380
Branches 788 788
==========================================
+ Hits 3581 3586 +5
+ Misses 626 617 -9
- Partials 173 177 +4 see 1 file with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
sarahyurick
left a comment
There was a problem hiding this comment.
While this PR does not work for all cases, it implements basic join reordering logic. I've opened #1069 to address additional features in the future.
For now, though, I'll be focusing on dynamic partition pruning, so I wanted to open this PR up for review.
| Self { | ||
| max_fact_tables: 2, | ||
| // FIXME: fact_dimension_ratio should be 0.3 | ||
| fact_dimension_ratio: 0.7, |
There was a problem hiding this comment.
Ideally, this should be 0.3, but queries 17, 25, 29, and 85 currently fail without a stricter ratio. This suggests that more work needs to be done with reordering fact-to-fact joins.
| right: LogicalPlan, | ||
| join_type: JoinType, | ||
| join_keys: (Vec<impl Into<Column>>, Vec<impl Into<Column>>), | ||
| filter: Option<Expr>, |
There was a problem hiding this comment.
Probably the biggest improvement we need is to support join filters correctly. This should allow us to run query 72, which is expected to have the largest performance gain with join reordering. Other queries affected include queries 75 and 93.
There was a problem hiding this comment.
Interested in what the blockers here are? Is there an issue or upstream PR we could link here (and in other related FIXMEs)
There was a problem hiding this comment.
There aren't any upstream issues that I'm aware of. Mostly just personal preference to push this to a later iteration of the rule since DPP is currently higher priority, and I think this change would require a decent amount of refactoring.
It's currently listed in #1069
jdye64
left a comment
There was a problem hiding this comment.
Great PR. This is a complicated bit of logic but was well written and easy to follow/understand. I would like to see a user facing warning when the statistics defaults to 100 for the row count just so users don't get blindsided by unexpected optimizations. Otherwise this is great.
| Self { | ||
| max_fact_tables: 2, | ||
| // FIXME: fact_dimension_ratio should be 0.3 | ||
| fact_dimension_ratio: 0.7, |
charlesbluca
left a comment
There was a problem hiding this comment.
Thanks @sarahyurick! Just a few initial comments before I dig into the bulk of the algorithm itself:
ayushdg
left a comment
There was a problem hiding this comment.
Could we add a couple of test cases here with different table sizes to assert things are re-ordering as expected? No strong preference on whether it should be python or a rust test.
|
Thanks @charlesbluca and @ayushdg ! Updated from your suggestions, lmk what you think. |
randerzander
left a comment
There was a problem hiding this comment.
I compared Dask-SQL main vs this PR on a set of internal benchmark queries. This PR is about 3% faster overall. Any query where perf was worse looks within typical noise range.
| } | ||
| } | ||
|
|
||
| if facts.is_empty() || dims.is_empty() { |
There was a problem hiding this comment.
In practice, is facts.is_empty() ever true? The only case I could think of where this would be the case is if rels.is_empty(), but not sure if I'm missing something
There was a problem hiding this comment.
I tried removing facts.is_empty() and ended up with a couple of PyTest failures. You're right that it's serving basically the same purpose as having rels.is_empty() would.
| right: LogicalPlan, | ||
| join_type: JoinType, | ||
| join_keys: (Vec<impl Into<Column>>, Vec<impl Into<Column>>), | ||
| filter: Option<Expr>, |
There was a problem hiding this comment.
Interested in what the blockers here are? Is there an issue or upstream PR we could link here (and in other related FIXMEs)
| assert_eq(result_df, expected_df) | ||
|
|
||
|
|
||
| def test_join_reorder(c): |
There was a problem hiding this comment.
Could we add some tests around other potential join cases we would run into? In particular maybe some with:
- unsupported join types / conditions
- joins that would result in more than 2 dominant fact tables
There was a problem hiding this comment.
The logic for joins involving different combinations of numbers of fact tables versus numbers of dimension tables is a little shaky (especially depending on the number of fact tables and fact-fact joins), so I'd prefer to hold off there until I have a better solution for multiple fact tables. I have this listed in #1069
I can work on examples for unsupported join types and conditions, though.
Co-authored-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>
|
Thanks @charlesbluca ! I've updated from your suggestions. I also added an additional test: by default, join reordering should only be done with filtered tables, so we demonstrate that the plan remains unchanged with unfiltered tables. Let me know if there's any additional conditions we should check. |
charlesbluca
left a comment
There was a problem hiding this comment.
Thanks for addressing all my comments @sarahyurick! Overall this LGTM, though would also like to get a thumbs up from @jdye64 and/or @ayushdg here who have a little more experience with optimizers
Reopening #950 here.
This PR implements a new logical plan optimization rule based on the paper Improving Join Reordering for Large Scale Distributed Computing.