Gate new ScalarSubqueryExec node behind session property#22530
LiaCastaneda wants to merge 7 commits into
gabotechs
left a comment
Nice, thanks @LiaCastaneda. Is there a chance you could take a look at this one @neilconway?
I also ran the TPC-H queries locally with the flag turned off (old path), and all results match. Maybe it's worth adding this to the regular checks. Edit: added in d1b9dad.
neilconway
left a comment
I think this makes sense as an interim measure if it will be too difficult to adapt df-distributed and/or ballista in the short-term, but long-term I'd prefer not to have a config option that silently produces incorrect query results. Can we add a note that disabling this is not recommended, and that we plan to remove the config option in the future -- say in a few DF releases from now?
Thank you @LiaCastaneda, @gabotechs & @neilconway for driving this.
Makes sense. I will also create an issue to keep track of this so we don't forget.
Looks like it is
alamb
left a comment
I went over this PR carefully and it looks good to me. Thank you @LiaCastaneda and @neilconway
/// physical execution. When set to false, all scalar subqueries
/// (including uncorrelated ones) are rewritten to left joins by the
/// `ScalarSubqueryToJoin` optimizer rule.
///
For my understanding, if this flag is disabled (set to false), does it:
- restore DataFusion 53 behavior (which can be wrong in some cases)
- Introduce some new ways for incorrect results?
restore DataFusion 53 behavior (which can be wrong in some cases)
yes
Introduce some new ways for incorrect results?
It just introduces the incorrect results/limitations that DataFusion 53 already had:
- no support for scalar subqueries in ORDER BY and JOIN ON expressions.
- when the subquery returns more than one row, DF 53 does not throw an error and instead returns wrong results.
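To make the second limitation concrete, here is a minimal toy sketch in plain Rust of one plausible way a join rewrite can return wrong results where scalar-subquery semantics would raise an error. It is not DataFusion's implementation: the function names `scalar_subquery` and `left_join_rewrite` are hypothetical, rows are modeled as plain integers, and the join is assumed to have no filtering condition.

```rust
/// Scalar-subquery semantics: zero rows yields NULL (None), exactly one row
/// yields its value, and more than one row is an error.
/// Hypothetical helper for illustration only.
fn scalar_subquery(rows: &[i64]) -> Result<Option<i64>, String> {
    match rows {
        [] => Ok(None),
        [v] => Ok(Some(*v)),
        _ => Err(format!("scalar subquery returned {} rows", rows.len())),
    }
}

/// Toy model of a join rewrite (assumed: a left join with no filtering
/// condition). Every outer row is paired with every subquery row, so extra
/// subquery rows silently multiply the output instead of raising an error.
fn left_join_rewrite(outer: &[i64], sub: &[i64]) -> Vec<(i64, Option<i64>)> {
    let mut out = Vec::new();
    for &o in outer {
        if sub.is_empty() {
            // Left-join padding: keep the outer row with a NULL on the right.
            out.push((o, None));
        } else {
            for &s in sub {
                out.push((o, Some(s)));
            }
        }
    }
    out
}

fn main() {
    let outer = [1, 2];
    let sub = [10, 20]; // the "scalar" subquery unexpectedly has two rows

    // Strict scalar semantics reject the input.
    assert!(scalar_subquery(&sub).is_err());

    // The join model returns rows without complaint: two outer rows become four.
    let joined = left_join_rewrite(&outer, &sub);
    assert_eq!(joined.len(), 4);
    println!("{:?}", joined);
}
```

In this model the error case is indistinguishable from a valid result, which is why the wrong answer goes unnoticed.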
Thank you for opening this pull request! Reviewer note: cargo-semver-checks reported that the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).
gabotechs
left a comment
+1 from me, unless @neilconway has any other concern, I think we can pull this in.
Which issue does this PR close?
Related to discussion on #21240 and #21080 (comment).
PR #21240 introduced `ScalarSubqueryExec`/`ScalarSubqueryExpr` to execute uncorrelated scalar subqueries during physical execution. The two communicate via shared in-process state (a slot in `ExecutionProps`), which breaks distributed execution that may split execution across a network boundary between the producer (`ScalarSubqueryExec`) and the consumer expression (`ScalarSubqueryExpr`). See datafusion-contrib/datafusion-distributed#460 for more details.
What changes are included in this PR?
Adds a new optimizer config option `datafusion.optimizer.enable_physical_uncorrelated_scalar_subquery` (default true). When true, behavior is unchanged from current main; when false, all scalar subqueries are rewritten to left joins by `ScalarSubqueryToJoin` and `ScalarSubqueryExec` is never constructed (which was the previous behavior).
Are these changes tested?
Yes, all tests pass, and `uncorrelated_scalar_subquery_rewritten_when_flag_off` was added to test the negative case.
Are there any user-facing changes?
Yes, a new config option `datafusion.optimizer.physical_uncorrelated_scalar_subquery` (this changes only the way the query is executed, not the results).
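For readers unfamiliar with why shared in-process state breaks across a network boundary, the problem described at the top of this description can be sketched with the Rust standard library alone. This is an illustrative model, not DataFusion code: the `Slot` alias and the `produce`, `consume`, and `ship_to_remote` functions are hypothetical names, and "shipping" is simulated by rebuilding an empty slot, on the assumption that only the plan (not the slot's contents) travels to the remote side.

```rust
use std::sync::{Arc, OnceLock};

/// Hypothetical stand-in for the shared slot; not DataFusion's actual type.
type Slot = Arc<OnceLock<i64>>;

/// Producer side: runs the subquery and publishes its single value.
fn produce(slot: &Slot, value: i64) {
    // A second write is ignored; the first published value wins.
    let _ = slot.set(value);
}

/// Consumer side: reads the published value, if any.
fn consume(slot: &Slot) -> Option<i64> {
    slot.get().copied()
}

/// Models sending the consumer to another process. Assumption: only the plan
/// is serialized, so the remote side ends up with a fresh, empty slot.
fn ship_to_remote(_local: &Slot) -> Slot {
    Arc::new(OnceLock::new())
}

fn main() {
    let slot: Slot = Arc::new(OnceLock::new());
    let same_process = Arc::clone(&slot);
    let remote = ship_to_remote(&slot);

    produce(&slot, 42);

    // In one process, producer and consumer share the same memory.
    assert_eq!(consume(&same_process), Some(42));
    // Across the simulated boundary, the consumer never sees the value.
    assert_eq!(consume(&remote), None);
}
```

With the flag off, no producer/consumer pair is constructed, so a distributed planner never has to transport the slot.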