fix: handle duplicate WindowFunction expressions in Substrait consumer#15211
Conversation
having two times the same windowfunction expr fails planning with
`Error: SchemaError(DuplicateUnqualifiedField { name: "row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW" }, Some(""))`
| ); | ||
|
|
||
| // Trigger execution to ensure plan validity | ||
| DataFrame::new(ctx.state(), plan).show().await?; |
There was a problem hiding this comment.
I just added these for all of the cases, since some of my iterations passed logical plan creation but failed at physical planning stage.
| pub fn window_plan( | ||
| input: LogicalPlan, | ||
| window_exprs: Vec<Expr>, | ||
| window_exprs: impl IntoIterator<Item = Expr>, |
There was a problem hiding this comment.
this allows feeding in a HashSet
|
Fyi @vbarua @westonpace @alamb, don't remember now if there were others to tag :) |
this is the best list I know of ! |
| } | ||
|
|
||
| let input = if !window_exprs.is_empty() { | ||
| LogicalPlanBuilder::window_plan(input, window_exprs)? |
There was a problem hiding this comment.
This has logic built in to separate the expressions by their windows, so it's kinda nice.
Does this error come from Datafusion or from Substrait? I wouldn't think Substrait would care about duplicate names (since it doesn't use names). If the error is coming from Datafusion then maybe it's not a valid logical plan? I'm not opposed to the change but do we want to have the Substrait serializer be robust to invalid input? That seems like it could be a slippery slope to more work. |
DataFusion. Substrait indeed doesn't care.
Exactly. It's valid Substrait (produced by Isthmus, and why would it not be?), but not valid DF since DF has extra conditions on the logical plan.
By "serializer", do you mean "consumer" or "producer"? This PR is for the part that turns Substrait into DF plans, so as far as I know the plan is valid Substrait and this PR helps make it valid DF as well. Not sure if I understood the question though? |
It was my mistake. I misunderstood the PR. I see now this is a change in the consumer and it makes perfect sense now. This is just normalizing some set of Substrait input plans that aren't valid in Substrait if translated literally but they have a valid (and semantically equivalent) normalized form. For some reason I thought this was a change to the producer and that is why I was confused. |
alamb
left a comment
There was a problem hiding this comment.
THanks @Blizzara and @westonpace
I have a question about order of window functions, but otherwise this PR looks good to me
| let mut explicit_exprs: Vec<Expr> = vec![]; | ||
| // For WindowFunctions, we need to wrap them in a Window relation. If there are duplicates, | ||
| // we can do the window'ing only once, then the project will duplicate the result. | ||
| let mut window_exprs: HashSet<Expr> = HashSet::new(); |
There was a problem hiding this comment.
Doesn't order of the expressions matter? If so, I think you could use an IndexSet rather than an HashSet to preserve the input order too
There was a problem hiding this comment.
It seems like an improvement over what is on main, so I think we could merge this PR as is, but this seems the potential reordering might cause potentially confusing intermittently variable output
There was a problem hiding this comment.
Hm, I don't think it matters per se, since the project below then puts things into right places based on the names. However, for consistency it might be nice for the order to stay. Let me see if I can quickly do that!
There was a problem hiding this comment.
Actually no, the ordering we use here doesn't matter, since the LogicalPlanBuilder::window_plan calls group_window_expr_by_sort_keys which sorts the expressions anyways. At least I think so. So I just added a comment to note that: 696bf5d
…licate-window-expr
|
I merged this PR up from main to resolve a conflict and plan to merge it when the CI passes |
Thanks @alamb , should be good to go! (once CI is done) |
Which issue does this PR close?
Currently duplicate WindowFunction calls would fail due to duplicate unqualified name.
Rationale for this change
What changes are included in this PR?
De-duplicate WindowFunction expressions before creating a window node to wrap them. This also optimizes the window wrapping to only do it once per unique Window, with all provided WindowFunctions, rather than creating one window rel per function.
Are these changes tested?
Added two new tests Substrait JSON tests, since DF itself wouldn't normally produce these plans (the DF optimizer handles deduplication and separating the distinct windows.
Are there any user-facing changes?