Skip to content

Spark: Analyze but don't optimize view body during creation#14681

Merged
nastra merged 5 commits into
apache:mainfrom
jbewing:analyze-dont-optimize-view-creation
Dec 3, 2025
Merged

Spark: Analyze but don't optimize view body during creation#14681
nastra merged 5 commits into
apache:mainfrom
jbewing:analyze-dont-optimize-view-creation

Conversation

@jbewing

@jbewing jbewing commented Nov 25, 2025

Copy link
Copy Markdown
Contributor

What

This PR updates view creation from Spark 4, 3.5, & 3.4 to analyze, but not optimize the view body when creating a view. Previously, the view body would be optimized which could result in long view creation times with larger tables. When creating views over a larger table (hundreds of TBs), creating a small number of views (say just a couple thousand) takes about ~12 hours and requires a moderately sized Spark cluster (~100 CPUs). Without running optimization over a view body, the view body is still analyzed for invalid syntax or references.

How

Upon looking upstream at Spark, we can see that for similar pieces of the view creation logic that views have some explicit code that enables the view body to be analyzed, but not optimized.

In Iceberg, we hijack the regular upstream Spark code and run our own variants of view creation that don't pull in this optimization. Given that a table scan planning is both redundant and slow in this case, we should update the internal Iceberg view creation code to only be used in Spark analysis, but not optimization phases.

Testing

I've run the existing test suite locally for Spark 3.4, 3.5, & 4 to verify that they still pass. Additionally, I've run this iceberg patch on an fork of Iceberg 1.10.0 on a fork of Spark 3.5 an observed in a staging environment that a task which creates some views over a smaller (~10TB) table that used to take 2 hours now takes 14 minutes consistently. Additionally, no errors or bugs were observed with the created views when testing in this staging environment.

Issue: #14680

@github-actions github-actions Bot added the spark label Nov 25, 2025
@nastra nastra requested review from huaxingao and nastra November 25, 2025 08:30
rewritten: Boolean = false) extends BinaryCommand {
override def left: LogicalPlan = child
rewritten: Boolean = false,
isAnalyzed: Boolean = false) extends AnalysisOnlyCommand {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can we add a comment to explain why we want to use AnalysisOnlyCommand?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call! See db21f91 for added comments

@huaxingao

Copy link
Copy Markdown
Contributor

LGTM. This aligns Iceberg CreateIcebergView with Spark’s CreateViewCommand by extending AnalysisOnlyCommand. The command’s children are analyzed then hidden, so the optimizer/planner won’t traverse the view body.

@jbewing jbewing requested a review from huaxingao November 25, 2025 22:23

@huaxingao huaxingao left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jbewing

jbewing commented Nov 26, 2025

Copy link
Copy Markdown
Contributor Author

🙇 Thank you for the review @huaxingao ! Any timeline on when this can be merged?

@nastra

nastra commented Nov 26, 2025

Copy link
Copy Markdown
Contributor

@jbewing I'll also take a look at this PR this week

@jbewing

jbewing commented Nov 26, 2025

Copy link
Copy Markdown
Contributor Author

Sounds good! Thank you @nastra !

override protected def withNewChildrenInternal(
newLeft: LogicalPlan, newRight: LogicalPlan): LogicalPlan =
copy(child = newLeft, query = newRight)
def markAsAnalyzed(analysisContext: AnalysisContext): LogicalPlan = {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

override def markAsAnalyzed(analysisContext: AnalysisContext): LogicalPlan = {
    copy(isAnalyzed = true)
  }

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please also fix the other spark versions?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 1a09db6!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbewing I think this is missing the override for markAsAnalyzed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're absolutely right @nastra! I've addressed that in f77a893!

rewritten: Boolean = false) extends BinaryCommand {
override def left: LogicalPlan = child
rewritten: Boolean = false,
// Align Iceberg CreateIcebergView with Spark’s CreateViewCommand by extending AnalysisOnlyCommand.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Align Iceberg CreateIcebergView with Spark’s CreateViewCommand by extending AnalysisOnlyCommand.
// Align Iceberg's CreateIcebergView with Spark’s CreateViewCommand by extending AnalysisOnlyCommand.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also can we move the comment right above case class CreateIcebergView?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 1a09db6

- Improve + move CreateIcebergView comment for clarity
- Remove excessive indentation in `markAsAnalyzed` scala function
@jbewing jbewing requested a review from nastra December 1, 2025 18:34

@nastra nastra left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one final comment

@jbewing

jbewing commented Dec 2, 2025

Copy link
Copy Markdown
Contributor Author

Thank you for the review @nastra! I've addressed your final comment in f77a893

@jbewing

jbewing commented Dec 2, 2025

Copy link
Copy Markdown
Contributor Author

Actually, hold for a re-review as a recent PR #8023 has been merged which creates a ton of conflicts here that I need to resolve

@jbewing jbewing requested a review from nastra December 2, 2025 19:33
@jbewing

jbewing commented Dec 2, 2025

Copy link
Copy Markdown
Contributor Author

Alright, I've rebased on master!

@nastra nastra merged commit fac485c into apache:main Dec 3, 2025
29 checks passed
thomaschow pushed a commit to thomaschow/iceberg that referenced this pull request Jan 19, 2026
talatuyarer pushed a commit to talatuyarer/iceberg that referenced this pull request Apr 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants