Skip to content

fix: split "union" and "interleave"#6045

Merged
alamb merged 3 commits into
apache:mainfrom
crepererum:crepererum/issue5970_b
Apr 21, 2023
Merged

fix: split "union" and "interleave"#6045
alamb merged 3 commits into
apache:mainfrom
crepererum:crepererum/issue5970_b

Conversation

@crepererum

@crepererum crepererum commented Apr 18, 2023

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Fixes #5970.

This is an alternative to #6009. Closes #6009

Rationale for this change

UnionExec shall be a simple node and not perform some optimizations on its own. If repartition / partition concatenation is desired, then there should be an optimizer pass for that. Being a simple node avoids confusion and optimizer bugs like #5970.

What changes are included in this PR?

Remove the entire "partition aware" logic from UnionExec

Are these changes tested?

  1. Existing tests pass
  2. Test for UNION ALL with ORDER BY results are inconsistent #5970 passes.
  3. Query mentioned in his comment is correctly optimized (i.e. does not require top-level repartition)

Are there any user-facing changes?

Less bugs.

@github-actions github-actions Bot added the core Core DataFusion crate label Apr 18, 2023
Silently interleaving partitions is confusing and causes bugs.

Fixes apache#5970.
@crepererum crepererum force-pushed the crepererum/issue5970_b branch from 846a867 to 6c0ae5f Compare April 18, 2023 15:04
@mingmwang

Copy link
Copy Markdown
Contributor

Could you please add a UT for the rule itself ?
For example for the case there are multiple partition-aware UnionExecs, finally they are all replaced.

@mingmwang

Copy link
Copy Markdown
Contributor

I'm afraid the current approach will not work in below case, additional Hash Repartition will still be added:

select count(*) from {
select distinct name from table t1
union all
select distinct name from table t2
) group by name

@crepererum

Copy link
Copy Markdown
Contributor Author

will add a unit test for the optimizer. will also try to add a test for the query listed above. I think this can be fixed by pushing the Interleave optimizer rule before Repartition.

@crepererum

Copy link
Copy Markdown
Contributor Author

I have this working locally by changing the optimizer. Instead of having an extra optimizer, this has to be integrated into the "dist enforcement" transform-up system. I'll need to add more tests though and clean up the code, but I think the result is satisfying.

@mingmwang

Copy link
Copy Markdown
Contributor

I have this working locally by changing the optimizer. Instead of having an extra optimizer, this has to be integrated into the "dist enforcement" transform-up system. I'll need to add more tests though and clean up the code, but I think the result is satisfying.

Yes, this should works, thanks!

Comment thread datafusion/core/src/physical_plan/union.rs
Comment thread datafusion/core/src/physical_plan/union.rs
Comment thread datafusion/core/src/physical_plan/union.rs Outdated
Comment thread datafusion/core/src/physical_plan/union.rs Outdated
Comment thread datafusion/core/src/physical_plan/union.rs Outdated
@mingmwang

Copy link
Copy Markdown
Contributor

LGTM, except for some wrong comments.

@crepererum crepererum force-pushed the crepererum/issue5970_b branch from 4b0e2ba to 5a03a96 Compare April 19, 2023 09:47
This avoids a top-level repartition in the following query:

```sql
SELECT c1, c9 FROM aggregate_test_100
UNION ALL
SELECT c1, c3 FROM aggregate_test_100
ORDER BY c9 DESC LIMIT 5
```
@crepererum crepererum force-pushed the crepererum/issue5970_b branch from 5a03a96 to 7818c18 Compare April 19, 2023 09:47
Comment thread datafusion/core/src/physical_plan/union.rs Outdated
@ozankabak

Copy link
Copy Markdown
Contributor

Does the new InterleaveExec behave similar to UnionExec in terms of sort enforcement? There are a couple places in sort_enforcement.rs where it makes a call to is_union to optimize/enforce sorts efficiently. Do we need to make any changes there?

@mustafasrepo, PTAL

@mingmwang

Copy link
Copy Markdown
Contributor

InterleaveExec

InterleaveExec can not keep the ordering(maintains_input_order is false), I think the logic in the rules do not need to update.

@mingmwang

Copy link
Copy Markdown
Contributor

I think this PR can be merged to main.

@alamb alamb merged commit ab09536 into apache:main Apr 21, 2023
@alamb

alamb commented Apr 21, 2023

Copy link
Copy Markdown
Contributor

Thanks @mingmwang

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

UNION ALL with ORDER BY results are inconsistent

4 participants