[SPARK-54003][SQL] Use the staging directory as the output path then move to final path. #52720

Closed
zhengchenyu wants to merge 1 commit into apache:master from zhengchenyu:SPARK-54003

@zhengchenyu zhengchenyu commented Oct 24, 2025

Contributor

What changes were proposed in this pull request?

The key modifications are as follows:

  • Force SQLHadoopMapReduceCommitProtocol to use a staging directory (PathOutputCommitProtocol is excepted), and rename the output from the staging directory to the destination directory during the job commit phase. Dynamic partition overwrite and custom partition paths are integrated into this process, and paths are handled according to SaveMode.

  • Avoid deleting partitions before tasks run, and implement dynamic overwrite in SQLHadoopMapReduceCommitProtocol. To maintain compatibility with static mode, the corresponding partition files need to be deleted after the job is done.
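
To make the flow described above concrete, here is a minimal sketch on a local filesystem. It is not the code in this PR and does not use any Spark or Hadoop API; the function names `write_task_output` and `commit_job`, the `.staging-job-1` directory name, and the partition names are all hypothetical. It only illustrates the idea: tasks write under a staging directory, and existing partition files are replaced during job commit rather than before the tasks run.

```python
import shutil
import tempfile
from pathlib import Path


def write_task_output(staging: Path, partition: str, name: str, data: str) -> None:
    """Task phase (hypothetical helper): write only under the staging directory."""
    part_dir = staging / partition
    part_dir.mkdir(parents=True, exist_ok=True)
    (part_dir / name).write_text(data)


def commit_job(staging: Path, final: Path, overwrite: bool) -> None:
    """Job commit (hypothetical helper): move staged partitions to the final path."""
    final.mkdir(parents=True, exist_ok=True)
    for staged_part in sorted(p for p in staging.iterdir() if p.is_dir()):
        target = final / staged_part.name
        if overwrite and target.exists():
            # Old partition files are removed only here, after every task
            # has finished, instead of before the tasks run.
            shutil.rmtree(target)
        target.mkdir(parents=True, exist_ok=True)
        for staged_file in staged_part.iterdir():
            staged_file.replace(target / staged_file.name)
    shutil.rmtree(staging)


# Set up a table that already has two partitions.
root = Path(tempfile.mkdtemp())
final = root / "table"
for part in ("dt=2025-10-23", "dt=2025-10-24"):
    (final / part).mkdir(parents=True)
    (final / part / "old.txt").write_text("old")

# The job writes one partition into staging, then commits with overwrite.
staging = root / ".staging-job-1"
write_task_output(staging, "dt=2025-10-24", "part-0.txt", "new")
commit_job(staging, final, overwrite=True)
```

In this sketch only the partition that the job actually wrote is replaced; the untouched partition keeps its old file, and a job that fails before `commit_job` leaves the final path unchanged.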

Note: For ease of review, some code in HadoopMapReduceCommitProtocol has been retained. In fact, I think the parameter dynamicPartitionOverwrite and the code for renaming partition directories during the commit job phase are no longer meaningful and should be removed.

Why are the changes needed?

SparkSQL uses the partition location or table location as the commit path (except in dynamic partition overwrite mode and custom partition path mode). This has at least the following issues:

  • As described in SPARK-37210, conflicts can occur when multiple jobs that write to partitions of the same table run concurrently. Using a staging directory avoids this issue.
  • As described in SPARK-53937, using a staging directory allows for near-atomic operations.
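
As a rough illustration of the first point, the sketch below gives each job its own staging directory, so committing one job does not disturb another job that is still in flight. It runs on a local filesystem and is not Spark code; `new_staging_dir`, `commit_partition`, the `.staging-<id>` naming, and the partition names are hypothetical.

```python
import shutil
import tempfile
import uuid
from pathlib import Path


def new_staging_dir(table: Path) -> Path:
    """Create a staging directory unique to one job (hypothetical naming)."""
    staging = table / f".staging-{uuid.uuid4().hex}"
    staging.mkdir(parents=True)
    return staging


def commit_partition(staging: Path, table: Path, partition: str) -> None:
    """Move one staged partition into the table, then drop this job's staging."""
    target = table / partition
    target.mkdir(parents=True, exist_ok=True)
    for staged_file in (staging / partition).iterdir():
        staged_file.replace(target / staged_file.name)
    # Only this job's own staging directory is cleaned up.
    shutil.rmtree(staging)


table = Path(tempfile.mkdtemp()) / "table"
table.mkdir()

# Two concurrent jobs write to different partitions of the same table.
staging_a = new_staging_dir(table)
staging_b = new_staging_dir(table)
(staging_a / "dt=1").mkdir()
(staging_a / "dt=1" / "part-a.txt").write_text("a")
(staging_b / "dt=2").mkdir()
(staging_b / "dt=2" / "part-b.txt").write_text("b")

# Job A commits while job B is still in flight.
commit_partition(staging_a, table, "dt=1")
b_survived = (staging_b / "dt=2" / "part-b.txt").exists()

# Job B commits later and still finds its staged output.
commit_partition(staging_b, table, "dt=2")
```

Because cleanup in `commit_partition` is limited to the committing job's own directory, job B's staged file is still present after job A commits.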

Dynamic partition overwrite mode and custom partition path mode already use a staging directory, but they are implemented differently and can be simplified into a unified process. In addition, #29000 reset the staging directory as the output directory of FileOutputCommitter. That approach is safer, and the commit path should be changed to work the same way.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing unit tests and newly added unit tests

@zhengchenyu zhengchenyu marked this pull request as draft October 24, 2025 07:18
@zhengchenyu zhengchenyu deleted the SPARK-54003 branch November 12, 2025 11:01
@zhengchenyu zhengchenyu reopened this Nov 12, 2025
@zhengchenyu zhengchenyu marked this pull request as ready for review November 13, 2025 01:33
@zhengchenyu

Contributor Author

@cloud-fan Can you please review this PR? I think that using the staging directory directly is a safer approach. Can you give me some advice?

@github-actions


We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions Bot added the Stale label Feb 22, 2026
@github-actions github-actions Bot closed this Feb 22, 2026

1 participant