[SPARK-38138][SQL] Materialize QueryPlan subqueries#35438

Closed
pan3793 wants to merge 2 commits into apache:master from pan3793:subquery

@pan3793 pan3793 commented Feb 8, 2022

Member

What changes were proposed in this pull request?

This PR proposes to materialize QueryPlan#subqueries (as a lazy val) and to prune the search by the PLAN_EXPRESSION tree pattern, in order to improve SQL compilation performance.
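
To illustrate the idea outside of Spark, here is a minimal sketch. Every name in it (Expr, Lit, Add, SubqueryExpr, Plan, containsPlanExpr, traversals) is a hypothetical stand-in and not Catalyst's actual API. It assumes only what the description says: the result is cached on first access, and expression subtrees known to contain no plan expression are skipped during the search.

```scala
object SubqueryDemo {
  // Hypothetical, simplified stand-ins for Catalyst's Expression and QueryPlan.
  sealed trait Expr {
    def children: Seq[Expr]
    def isPlanExpr: Boolean = false
    // Cached per node: does this subtree contain any plan expression?
    // This plays the role of a PLAN_EXPRESSION pattern check in this sketch.
    lazy val containsPlanExpr: Boolean =
      isPlanExpr || children.exists(_.containsPlanExpr)
  }
  final case class Lit(value: Int) extends Expr {
    def children: Seq[Expr] = Nil
  }
  final case class Add(left: Expr, right: Expr) extends Expr {
    def children: Seq[Expr] = Seq(left, right)
  }
  final case class SubqueryExpr(plan: Plan) extends Expr {
    def children: Seq[Expr] = Nil
    override def isPlanExpr: Boolean = true
  }

  // Counts how often the collection over the expressions actually runs.
  var traversals = 0

  final case class Plan(name: String, expressions: Seq[Expr]) {
    private def collect(e: Expr): Seq[Plan] =
      if (!e.containsPlanExpr) Nil // prune: nothing to find in this subtree
      else e match {
        case SubqueryExpr(p) => Seq(p)
        case other           => other.children.flatMap(collect)
      }

    // With `def`, every call would repeat the traversal;
    // with `lazy val`, it runs once and the result is reused.
    lazy val subqueries: Seq[Plan] = {
      traversals += 1
      expressions.flatMap(collect)
    }
  }

  def main(args: Array[String]): Unit = {
    val inner = Plan("inner", Nil)
    val outer = Plan("outer",
      Seq(Add(Lit(1), Lit(2)), Add(Lit(3), SubqueryExpr(inner))))
    println(outer.subqueries.map(_.name)) // first access computes the result
    println(outer.subqueries.map(_.name)) // second access reuses the cache
    println(s"traversals = $traversals")
  }
}
```

In this sketch the first expression, Add(Lit(1), Lit(2)), is skipped after a single cached check, and repeated calls to subqueries do not traverse the expressions again.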

Why are the changes needed?

We found a query in production that spends a long time in the optimization phase (including the AQE optimization phase) when DPP is enabled. The SQL pattern looks like:

select <cols...>
from a
left join b on a.<col> = b.<col>
left join c on b.<col> = c.<col>
left join d on c.<col> = d.<col>
left join e on d.<col> = e.<col>
left join f on e.<col> = f.<col>
left join g on f.<col> = g.<col>
left join h on g.<col> = h.<col>
...

SPARK-36444 significantly reduces the optimization time (excluding the AQE phase); see #35431 for details. However, InsertAdaptiveSparkPlan still accounts for a lot of time in the AQE optimization phase.

Before this change, the query took 658s; after this change, it takes only 65s.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UTs.

@github-actions github-actions Bot added the SQL label Feb 8, 2022
@pan3793

pan3793 commented Feb 8, 2022

Member Author

cc @wangyum @cloud-fan @yaooqinn

@HyukjinKwon

Member

cc @maryannxue @allisonwang-db @sigmod FYI

   */
-  def subqueries: Seq[PlanType] = {
-    expressions.flatMap(_.collect {
+  lazy val subqueries: Seq[PlanType] = {

Contributor

please add @transient

Member Author

Thanks for the tip, updated.

Just for my own education: why is @transient useful here?

@pan3793 pan3793 Feb 9, 2022

Member Author

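SparkPlan is a subclass of QueryPlan and is Serializable because it needs to be sent to executors. Marking the lazy val as @transient keeps the cached subqueries out of the serialized plan, which reduces memory usage on the executors.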

abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializable
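
As a rough illustration of that point, the sketch below compares the Java-serialized size of two hypothetical classes after their cached value has been forced. CachedNode, TransientNode, serializedSize and roundTrip are made-up names for this example, and a byte array stands in for the cached subqueries; it is not how Spark measures or ships plans.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.io.{ObjectInputStream, ObjectOutputStream}

object TransientDemo {
  // Hypothetical node whose cached value is serialized once it is computed.
  class CachedNode(val n: Int) extends Serializable {
    lazy val cached: Array[Byte] = new Array[Byte](n)
  }

  // Same node, but the cached value is excluded from serialization.
  class TransientNode(val n: Int) extends Serializable {
    @transient lazy val cached: Array[Byte] = new Array[Byte](n)
  }

  // Serializes the object with Java serialization and returns the bytes.
  def serialize(obj: AnyRef): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.toByteArray
  }

  def serializedSize(obj: AnyRef): Int = serialize(obj).length

  // Serializes and then deserializes the object, as shipping it would.
  def roundTrip[T <: AnyRef](obj: T): T = {
    val in = new ObjectInputStream(new ByteArrayInputStream(serialize(obj)))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val plain = new CachedNode(1 << 20)
    val trans = new TransientNode(1 << 20)
    // Force both lazy vals so the cached arrays exist before serializing.
    plain.cached
    trans.cached
    println(s"without @transient: ${serializedSize(plain)} bytes")
    println(s"with @transient:    ${serializedSize(trans)} bytes")
    // The transient value is recomputed on first access after deserialization.
    println(roundTrip(trans).cached.length)
  }
}
```

The trade-off in this sketch is that the receiving side recomputes the value if it ever accesses it, in exchange for a smaller serialized object.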

@amaliujia


cc @amaliujia

@AmplabJenkins


Can one of the admins verify this patch?

@pan3793

pan3793 commented Feb 16, 2022

Member Author

@cloud-fan would you please take a look? thanks

@cloud-fan

Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 0fcb560 Feb 18, 2022
@pan3793 pan3793 deleted the subquery branch April 4, 2022 09:20

7 participants