Implement join reordering of fact-dimension joins by sarahyurick · Pull Request #1027 · dask-contrib/dask-sql

sarahyurick · 2023-02-02T23:36:08Z

Reopening #950 here.

This PR implements a new logical plan optimization rule based on the paper "Improving Join Reordering for Large Scale Distributed Computing".

codecov-commenter · 2023-02-02T23:52:01Z

Codecov Report

Merging #1027 (1b96d7b) into main (dc5c6fe) will increase coverage by 0.11%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main    #1027      +/-   ##
==========================================
+ Coverage   81.75%   81.87%   +0.11%     
==========================================
  Files          78       78              
  Lines        4380     4380              
  Branches      788      788              
==========================================
+ Hits         3581     3586       +5     
+ Misses        626      617       -9     
- Partials      173      177       +4

see 1 file with indirect coverage changes

sarahyurick

While this PR does not work for all cases, it implements basic join reordering logic. I've opened #1069 to address additional features in the future.

For now, though, I'll be focusing on dynamic partition pruning, so I wanted to open this PR up for review.

sarahyurick · 2023-03-01T21:10:31Z

+        Self {
+            max_fact_tables: 2,
+            // FIXME: fact_dimension_ratio should be 0.3
+            fact_dimension_ratio: 0.7,


Ideally, this should be 0.3, but queries 17, 25, 29, and 85 currently fail without a stricter ratio. This suggests that more work needs to be done with reordering fact-to-fact joins.
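To make the role of fact_dimension_ratio concrete, here is a minimal Python sketch of ratio-based classification. It is not the PR's Rust code: the helper name classify_tables, the dict input, the row counts, and the exact comparison (a table counts as a fact table when its row count divided by the largest row count exceeds the ratio) are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def classify_tables(
    row_counts: Dict[str, int], fact_dimension_ratio: float = 0.7
) -> Tuple[List[str], List[str]]:
    """Split tables into (facts, dims) by size relative to the largest table.

    Hypothetical helper: a table is treated as a fact table when
    rows / largest_rows > fact_dimension_ratio, otherwise as a dimension.
    """
    if not row_counts:
        return [], []
    largest = max(row_counts.values())
    facts: List[str] = []
    dims: List[str] = []
    for name, rows in row_counts.items():
        if largest > 0 and rows / largest > fact_dimension_ratio:
            facts.append(name)
        else:
            dims.append(name)
    return facts, dims


# Made-up row counts; the table names are only illustrative.
sizes = {"store_sales": 1000, "store_returns": 500, "date_dim": 50}
strict = classify_tables(sizes, fact_dimension_ratio=0.7)
loose = classify_tables(sizes, fact_dimension_ratio=0.3)
```

With these made-up sizes, a ratio of 0.7 yields a single fact table (store_sales), while 0.3 also classifies store_returns as a fact table. Under the sketch's assumptions, that is how a lower ratio brings more fact-to-fact joins into play.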

sarahyurick · 2023-03-01T21:15:00Z

+                right: LogicalPlan,
+                join_type: JoinType,
+                join_keys: (Vec<impl Into<Column>>, Vec<impl Into<Column>>),
+                filter: Option<Expr>,


Probably the biggest improvement we need is to support join filters correctly. This should allow us to run query 72, which is expected to have the largest performance gain with join reordering. Other queries affected include queries 75 and 93.

Interested in what the blockers here are? Is there an issue or upstream PR we could link here (and in other related FIXMEs)?

There aren't any upstream issues that I'm aware of. Mostly just personal preference to push this to a later iteration of the rule since DPP is currently higher priority, and I think this change would require a decent amount of refactoring.

It's currently listed in #1069

jdye64

Great PR. This is a complicated bit of logic but was well written and easy to follow/understand. I would like to see a user-facing warning when the statistics default to 100 for the row count, just so users don't get blindsided by unexpected optimizations. Otherwise this is great.
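As a rough illustration of the requested behavior, the sketch below warns whenever a missing row count falls back to 100. It is written in Python rather than the planner's Rust, and the function name row_count_or_default, the constant name, and the message wording are assumptions, not the code added in the PR.

```python
import warnings
from typing import Optional

DEFAULT_ROW_COUNT = 100  # fallback value mentioned in the review comment


def row_count_or_default(table_name: str, row_count: Optional[int]) -> int:
    """Return the known row count, or warn and fall back to the default."""
    if row_count is not None:
        return row_count
    # Hypothetical message; the real wording may differ.
    warnings.warn(
        f"No statistics for table '{table_name}'; assuming "
        f"{DEFAULT_ROW_COUNT} rows, which may affect join reordering.",
        UserWarning,
        stacklevel=2,
    )
    return DEFAULT_ROW_COUNT
```

Using the standard warnings module keeps the message visible by default while still letting users filter it.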

jdye64 · 2023-03-13T17:48:29Z

+        Self {
+            max_fact_tables: 2,
+            // FIXME: fact_dimension_ratio should be 0.3
+            fact_dimension_ratio: 0.7,


charlesbluca

Thanks @sarahyurick! Just a few initial comments before I dig into the bulk of the algorithm itself:

ayushdg

Could we add a couple of test cases here with different table sizes to assert things are re-ordering as expected? No strong preference on whether it should be a Python or a Rust test.

sarahyurick · 2023-03-30T21:01:35Z

Thanks @charlesbluca and @ayushdg ! Updated from your suggestions, lmk what you think.

randerzander

I compared Dask-SQL main vs this PR on a set of internal benchmark queries. This PR is about 3% faster overall. Any query where perf was worse looks within typical noise range.

charlesbluca · 2023-03-31T14:27:50Z

+        }
+    }
+
+    if facts.is_empty() || dims.is_empty() {


In practice, is facts.is_empty() ever true? The only case I could think of where this would happen is if rels.is_empty(), but I'm not sure if I'm missing something.

I tried removing facts.is_empty() and ended up with a couple of PyTest failures. You're right that it's serving basically the same purpose as having rels.is_empty() would.

charlesbluca · 2023-03-31T15:26:25Z

+                right: LogicalPlan,
+                join_type: JoinType,
+                join_keys: (Vec<impl Into<Column>>, Vec<impl Into<Column>>),
+                filter: Option<Expr>,


Interested in what the blockers here are? Is there an issue or upstream PR we could link here (and in other related FIXMEs)?

charlesbluca · 2023-03-31T17:02:18Z

    assert_eq(result_df, expected_df)


+def test_join_reorder(c):


Could we add some tests around other potential join cases we would run into? In particular maybe some with:

unsupported join types / conditions

joins that would result in more than 2 dominant fact tables

The logic for joins involving different combinations of numbers of fact tables versus numbers of dimension tables is a little shaky (especially depending on the number of fact tables and fact-fact joins), so I'd prefer to hold off there until I have a better solution for multiple fact tables. I have this listed in #1069.

I can work on examples for unsupported join types and conditions, though.

Co-authored-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>

sarahyurick · 2023-03-31T22:20:17Z

Thanks @charlesbluca! I've updated from your suggestions. I also added an additional test: by default, join reordering should only be done with filtered tables, so we demonstrate that the plan remains unchanged with unfiltered tables. Let me know if there are any additional conditions we should check.
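The default described above can be pictured with a toy Python model. This is only a sketch: the Relation class, the has_filter flag, the require_filter parameter, and the choice to reorder when at least one table is filtered are assumptions, since the exact condition used by the rule is not spelled out in this thread.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Relation:
    name: str
    has_filter: bool  # hypothetical flag: a filter is applied to this table


def should_reorder(relations: List[Relation], require_filter: bool = True) -> bool:
    """Decide whether a join is a candidate for reordering.

    Assumption: with the default setting, reordering is attempted only
    when at least one joined table has a filter; otherwise the plan is
    left unchanged.
    """
    if not require_filter:
        return True
    return any(rel.has_filter for rel in relations)
```

A test in this style would check that a join over unfiltered tables returns False here, mirroring the "plan remains unchanged" expectation.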

charlesbluca

Thanks for addressing all my comments @sarahyurick! Overall this LGTM, though would also like to get a thumbs up from @jdye64 and/or @ayushdg here who have a little more experience with optimizers

add changes from dask-contrib#950

a2b9f67

sarahyurick added 2 commits February 7, 2023 12:33

add changes from apache/datafusion#4620

3308bf1

minor changes

6a729a7

This was referenced Feb 8, 2023

[ENH] Collect and track table statistics when applicable #994

Closed

Table Statistics Support #1037

Merged

sarahyurick and others added 7 commits February 14, 2023 10:56

Merge branch 'main' into join_reorder

1e04dde

save df 17 progress

a7a596d

better save

60624a8

fix optimize_children logic

c93fe30

add FIXMEs

0984a60

style fix

c1af624

remove bracket

f82c37b

sarahyurick marked this pull request as ready for review March 1, 2023 21:07

sarahyurick requested review from andygrove, ayushdg, charlesbluca, galipremsagar and jdye64 as code owners March 1, 2023 21:07

sarahyurick mentioned this pull request Mar 1, 2023

[ENH] Join reordering improvements #1069

Open

5 tasks

sarahyurick commented Mar 1, 2023

sarahyurick mentioned this pull request Mar 7, 2023

Add filepath fields in Python and Rust #1074

Merged

Merge branch 'main' into join_reorder

52ca539

jdye64 requested changes Mar 13, 2023

sarahyurick and others added 3 commits March 13, 2023 16:41

add warning

530b81b

Merge branch 'main' into join_reorder

d805cfa

Merge branch 'main' into join_reorder

7de80d7

sarahyurick mentioned this pull request Mar 29, 2023

Bump datafusion -> 21.0.0 and add dyn and dyn_hash functions to custo… #1094

Merged

charlesbluca reviewed Mar 30, 2023

Comment thread dask_planner/src/sql/optimizer/join_reorder.rs

Comment thread dask_planner/src/sql/optimizer/join_reorder.rs Outdated

ayushdg requested changes Mar 30, 2023

Comment thread dask_planner/src/sql/optimizer/join_reorder.rs Outdated

address reviews

ce83b71

randerzander approved these changes Mar 31, 2023

charlesbluca reviewed Mar 31, 2023

sarahyurick and others added 2 commits March 31, 2023 14:15

Apply suggestions from code review

8c5c42b

Co-authored-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>

add more suggestions

69d9eb1

charlesbluca approved these changes Apr 4, 2023

Merge branch 'main' into join_reorder

e3018ea

ayushdg approved these changes Apr 5, 2023

Merge branch 'main' into join_reorder

1b96d7b

charlesbluca merged commit ce55082 into dask-contrib:main Apr 5, 2023

sarahyurick deleted the join_reorder branch May 26, 2023 22:24

6 participants
