Skip to content

[SPARK-49352][SQL][3.4] Avoid redundant array transform for identical expression#47862

Closed
viirya wants to merge 1 commit into
apache:branch-3.4from
viirya:fix_redundant_array_transform_3.4
Closed

[SPARK-49352][SQL][3.4] Avoid redundant array transform for identical expression#47862
viirya wants to merge 1 commit into
apache:branch-3.4from
viirya:fix_redundant_array_transform_3.4

Conversation

@viirya

@viirya viirya commented Aug 24, 2024

Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

This patch avoids ArrayTransform in resolveArrayType function if the resolution expression is the same as input param.

Why are the changes needed?

Our customer encounters significant performance regression when migrating from Spark 3.2 to Spark 3.4 on a Insert Into query which is analyzed as a AppendData on an Iceberg table.
We found that the root cause is in Spark 3.4, TableOutputResolver resolves the query with additional ArrayTransform on an ArrayType field. The ArrayTransform's lambda function is actually an identical function, i.e., the transformation is redundant.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test and manual e2e test

Was this patch authored or co-authored using generative AI tooling?

No

@viirya viirya changed the title [SPARK-49352][SQL] Avoid redundant array transform for identical expression [SPARK-49352][SQL][3.4] Avoid redundant array transform for identical expression Aug 24, 2024
@github-actions github-actions Bot added the SQL label Aug 24, 2024
@viirya

viirya commented Aug 24, 2024

Copy link
Copy Markdown
Member Author

cc @dongjoon-hyun

@dongjoon-hyun dongjoon-hyun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM (Pending CIs). Thank you, @viirya .

dongjoon-hyun pushed a commit that referenced this pull request Aug 24, 2024
… expression

### What changes were proposed in this pull request?

This patch avoids `ArrayTransform` in `resolveArrayType` function if the resolution expression is the same as input param.

### Why are the changes needed?

Our customer encounters significant performance regression when migrating from Spark 3.2 to Spark 3.4 on a `Insert Into` query which is analyzed as a `AppendData` on an Iceberg table.
We found that the root cause is in Spark 3.4, `TableOutputResolver` resolves the query with additional `ArrayTransform` on an `ArrayType` field. The `ArrayTransform`'s lambda function is actually an identical function, i.e., the transformation is redundant.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test and manual e2e test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47862 from viirya/fix_redundant_array_transform_3.4.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun

Copy link
Copy Markdown
Member

Merged to branch-3.4.

@viirya

viirya commented Aug 24, 2024

Copy link
Copy Markdown
Member Author

Thank you @dongjoon-hyun

@viirya viirya deleted the fix_redundant_array_transform_3.4 branch August 24, 2024 05:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants