Skip to content

fix: handle duplicate WindowFunction expressions in Substrait consumer#15211

Merged
alamb merged 8 commits into
apache:mainfrom
Blizzara:avo/fix-substrait-duplicate-window-expr
Mar 19, 2025
Merged

fix: handle duplicate WindowFunction expressions in Substrait consumer#15211
alamb merged 8 commits into
apache:mainfrom
Blizzara:avo/fix-substrait-duplicate-window-expr

Conversation

@Blizzara

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Currently duplicate WindowFunction calls would fail due to duplicate unqualified name.

Rationale for this change

What changes are included in this PR?

De-duplicate WindowFunction expressions before creating a window node to wrap them. This also optimizes the window wrapping to only do it once per unique Window, with all provided WindowFunctions, rather than creating one window rel per function.

Are these changes tested?

Added two new tests Substrait JSON tests, since DF itself wouldn't normally produce these plans (the DF optimizer handles deduplication and separating the distinct windows.

Are there any user-facing changes?

Arttu Voutilainen added 4 commits March 13, 2025 13:49
having two times the same windowfunction expr fails planning with
`Error: SchemaError(DuplicateUnqualifiedField { name: "row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW" }, Some(""))`
);

// Trigger execution to ensure plan validity
DataFrame::new(ctx.state(), plan).show().await?;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just added these for all of the cases, since some of my iterations passed logical plan creation but failed at physical planning stage.

pub fn window_plan(
input: LogicalPlan,
window_exprs: Vec<Expr>,
window_exprs: impl IntoIterator<Item = Expr>,

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this allows feeding in a HashSet

@github-actions github-actions Bot added logical-expr Logical plan and expressions substrait Changes to the substrait crate labels Mar 13, 2025
@Blizzara

Copy link
Copy Markdown
Contributor Author

Fyi @vbarua @westonpace @alamb, don't remember now if there were others to tag :)

@alamb

alamb commented Mar 13, 2025

Copy link
Copy Markdown
Contributor

Fyi @vbarua @westonpace @alamb, don't remember now if there were others to tag :)

this is the best list I know of !

}

let input = if !window_exprs.is_empty() {
LogicalPlanBuilder::window_plan(input, window_exprs)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has logic built in to separate the expressions by their windows, so it's kinda nice.

@westonpace

Copy link
Copy Markdown
Member

Currently duplicate WindowFunction calls would fail due to duplicate unqualified name.

Does this error come from Datafusion or from Substrait? I wouldn't think Substrait would care about duplicate names (since it doesn't use names).

If the error is coming from Datafusion then maybe it's not a valid logical plan?

I'm not opposed to the change but do we want to have the Substrait serializer be robust to invalid input? That seems like it could be a slippery slope to more work.

@Blizzara

Copy link
Copy Markdown
Contributor Author

Does this error come from Datafusion or from Substrait? I wouldn't think Substrait would care about duplicate names (since it doesn't use names).

DataFusion. Substrait indeed doesn't care.

If the error is coming from Datafusion then maybe it's not a valid logical plan?

Exactly. It's valid Substrait (produced by Isthmus, and why would it not be?), but not valid DF since DF has extra conditions on the logical plan.

I'm not opposed to the change but do we want to have the Substrait serializer be robust to invalid input? That seems like it could be a slippery slope to more work.

By "serializer", do you mean "consumer" or "producer"? This PR is for the part that turns Substrait into DF plans, so as far as I know the plan is valid Substrait and this PR helps make it valid DF as well. Not sure if I understood the question though?

@westonpace

westonpace commented Mar 16, 2025

Copy link
Copy Markdown
Member

By "serializer", do you mean "consumer" or "producer"? This PR is for the part that turns Substrait into DF plans, so as far as I know the plan is valid Substrait and this PR helps make it valid DF as well. Not sure if I understood the question though?

It was my mistake. I misunderstood the PR. I see now this is a change in the consumer and it makes perfect sense now. This is just normalizing some set of Substrait input plans that aren't valid in Substrait if translated literally but they have a valid (and semantically equivalent) normalized form.

For some reason I thought this was a change to the producer and that is why I was confused.

@alamb alamb left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THanks @Blizzara and @westonpace

I have a question about order of window functions, but otherwise this PR looks good to me

let mut explicit_exprs: Vec<Expr> = vec![];
// For WindowFunctions, we need to wrap them in a Window relation. If there are duplicates,
// we can do the window'ing only once, then the project will duplicate the result.
let mut window_exprs: HashSet<Expr> = HashSet::new();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't order of the expressions matter? If so, I think you could use an IndexSet rather than an HashSet to preserve the input order too

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like an improvement over what is on main, so I think we could merge this PR as is, but this seems the potential reordering might cause potentially confusing intermittently variable output

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I don't think it matters per se, since the project below then puts things into right places based on the names. However, for consistency it might be nice for the order to stay. Let me see if I can quickly do that!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually no, the ordering we use here doesn't matter, since the LogicalPlanBuilder::window_plan calls group_window_expr_by_sort_keys which sorts the expressions anyways. At least I think so. So I just added a comment to note that: 696bf5d

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Blizzara

@alamb

alamb commented Mar 19, 2025

Copy link
Copy Markdown
Contributor

I merged this PR up from main to resolve a conflict and plan to merge it when the CI passes

@Blizzara

Blizzara commented Mar 19, 2025

Copy link
Copy Markdown
Contributor Author

I merged this PR up from main to resolve a conflict and plan to merge it when the CI passes

Thanks @alamb , should be good to go! (once CI is done)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

logical-expr Logical plan and expressions substrait Changes to the substrait crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants