[SPARK-56773][CORE][TESTS] Add more fetch-failure injection knobs for INJECT_SHUFFLE_FETCH_FAILURES by juliuszsompolski · Pull Request #55738 · apache/spark

juliuszsompolski · 2026-05-07T15:47:54Z

What changes were proposed in this pull request?

Extends the test-only fetch-failure injection in DAGScheduler with three new knobs and changes the semantics of the existing master switch. Four flags total (all in org.apache.spark.internal.config.Tests):

INJECT_SHUFFLE_FETCH_FAILURES (existing). Semantics changed: previously corrupted every map task of stage attempt 0 (so only leaf shuffle map stages were ever affected, since non-leaf stages typically fail-fetch on attempt 0). Now corrupts the partition-0 task of the first SUCCESSFUL attempt of every shuffle map stage, including non-leaf stages whose attempt 0 fails on fetch from upstream.
INJECT_SHUFFLE_FETCH_FAILURES_DOWNSTREAM_DELAY (int, default 1, new). Defers the producer's mapper-0 corruption until N task-success events have arrived from ShuffleMapStage consumers of the shuffle. The DAGScheduler event loop processes task-completion events serially, so this guarantees N consumer tasks fully completed BEFORE the FetchFailed cascade kicks in. Subsequent tasks dispatched to free slots after the corruption see the invalid MapStatus and FetchFailed. With spark.sql.shuffle.partitions much larger than executor cores, this gives a deterministic "partial first-attempt + recompute" shape. Set to 0 to corrupt inline at registration.
INJECT_SHUFFLE_FETCH_FAILURES_RESULT_STAGE_DELAY (int, default 0, new). Counterpart of the above for ResultStage consumers. With the default 0, when a ResultStage is the consumer of a pending corruption it is corrupted before the result tasks dispatch, so the result stage has zero finished tasks when INJECT_SHUFFLE_FORCE_CHECKSUM_MISMATCH_ON_RECOMPUTE later triggers rollbackSucceedingStages (the rollback path would otherwise abort a partially-finished result stage, since OSS Spark does not support rolling them back). Set to N > 0 to defer until N result-stage tasks have succeeded - the only way to actually exercise the result-stage abort path.
INJECT_SHUFFLE_FORCE_CHECKSUM_MISMATCH_ON_RECOMPUTE (boolean, default false, new). After a downstream FetchFailed forces the producer's partition 0 to be recomputed, the recomputed task's MapStatus registration is artificially flagged as a checksum mismatch. The DAGScheduler then runs rollbackSucceedingStages, which clears the downstream ShuffleMapStage's outputs and forces a full retry of that stage. ResultStage downstreams are aborted (OSS Spark does not support rolling them back, SPARK-25342).

Why are the changes needed?

The existing INJECT_SHUFFLE_FETCH_FAILURES flag corrupts attempt 0 of every shuffle map stage, but only leaf shuffle map stages succeed on attempt 0. Non-leaf stages fail-fetch from corrupted upstream on attempt 0 and are never themselves corrupted, so unit tests cannot exercise the full range of stage-retry shapes that production hits (metric stability under non-determinism, rollback for indeterminate producers, SLAM semantics across retries). The new knobs let tests deterministically reach those shapes.

Does this PR introduce any user-facing change?

No. All flags are test-only (gated by Utils.isTesting) and are under spark.testing.*.

How was this patch tested?

MetricsFailureInjectionSuite (12 tests). Existing tests continue to pass with the new default semantics. The non-deterministic-stage test sees stage-2 raw metric overcount under the new default (because checksum-mismatch rollback now fires on the non-determinism); SLAM remains stable, which is the point of that test.
New test Three stage metrics force-checksum-mismatch with delayed corruption: with shuffle.partitions=20 (much greater than the test's local[2] cores) and delay=1, the rollback re-plays at least one already-completed stage-2 partition on top of the full re-run, putting the raw metric strictly above the recompute-only baseline.
New test Force checksum mismatch aborts a downstream ResultStage: 2-stage groupBy().count() query where stage 2 is a ResultStage. With RESULT_STAGE_DELAY=1 (opted-in) and INJECT_SHUFFLE_FORCE_CHECKSUM_MISMATCH_ON_RECOMPUTE, one result task succeeds before the FetchFailed cascade; the forced checksum mismatch on stage 1's mapper-0 recompute then fires rollbackSucceedingStages, which sees numMissingPartitions < numTasks on the result stage and aborts. The query throws a SparkException with "indeterminate" in the message.
SQLLastAttemptMetricPlanShapesSuite (220 tests, was previously parameterised on stageRetries: Boolean, now on a tri-valued FailureMode: NoFailure, FetchFailure, ChecksumMismatch). Each plan shape is now also exercised under forced checksum-mismatch rollback; SLAM values remain stable.
SQLLastAttemptMetricIntegrationSuiteWithChecksumMismatch (new subclass): runs the full integration suite with INJECT_SHUFFLE_FETCH_FAILURES + INJECT_SHUFFLE_FORCE_CHECKSUM_MISMATCH_ON_RECOMPUTE enabled.
core/scalastyle, sql/scalastyle, and 149 DAGSchedulerSuite tests pass.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code, Opus 4.7.

juliuszsompolski · 2026-05-07T15:50:41Z

@cloud-fan I added some refinements to the stage retry testing I added with #55371. This makes it exercise a bigger variety of scenarios. I also want to use it for more correctness testing in presence of retries and recomputes.

juliuszsompolski · 2026-05-11T19:41:02Z

[info] org.apache.spark.util.collection.SorterSuite *** ABORTED *** (48 seconds, 440 milliseconds)
[info]   java.lang.OutOfMemoryError: Java heap space

ugh...

juliuszsompolski · 2026-05-11T19:42:02Z

@cloud-fan @gengliangwang could you help review this? I would like to use this infra to test #55711, targeting 4.2.

juliuszsompolski · 2026-05-11T22:19:52Z

And now sql - other tests https://github.com/juliuszsompolski/apache-spark/actions/runs/25693038491/job/75434352332 timed out.
How many times for a clean CI?

gengliangwang

Summary

Extends the test-only fetch-failure injection in DAGScheduler so unit tests can deterministically reach scheduler shapes the existing inline corruption could never produce — most importantly, partial-progress-on-downstream followed by upstream recompute with a forced checksum mismatch driving rollbackSucceedingStages.

Prior state and problem. INJECT_SHUFFLE_FETCH_FAILURES corrupted every mapper of stage attempt 0 inline (before registerMapOutput). Because non-leaf shuffle map stages always fetch-fail on attempt 0 from a corrupted upstream, they're never themselves observed to succeed an attempt that gets corrupted — so the existing flag only ever exercises the leaf-stage corruption path. Several scheduler behaviors are untestable as a result: metric stability under non-determinism for non-leaf stages, rollback of partially-finished downstream stages on checksum mismatch, and the result-stage abort path (OSS Spark can't roll back result stages with completed tasks, SPARK-25342).

Design approach. Two structural changes plus three layered sub-flags:

The master switch now latches the first successful attempt per shuffle (per-shuffleId computeIfAbsent) and corrupts only partition 0 on that attempt. Non-leaf stages eventually corrupt too — once their attempt 0 recovers from upstream FetchFailed and successfully completes partition 0. Recomputes of partition 0 are left clean so the consumer can make progress.
INJECT_SHUFFLE_FETCH_FAILURES_DOWNSTREAM_DELAY (default 1) and INJECT_SHUFFLE_FETCH_FAILURES_RESULT_STAGE_DELAY (default 0) defer corruption until N consumer task successes have been observed. The DAGScheduler event loop is single-threaded, so this guarantees N downstream completions before the FetchFailed cascade. The asymmetric defaults are forced by the result-stage rollback restriction: with RESULT_STAGE_DELAY=0 preemptive corruption (in submitMissingTasks) means the result stage has zero completed tasks when rollback runs and filterAndAbortUnrollbackableStages lets it through; with >0, the abort path is the thing under test.
INJECT_SHUFFLE_FORCE_CHECKSUM_MISMATCH_ON_RECOMPUTE ORs a synthetic mismatch bit into the recompute's addMapOutput result, which (with the SQL default checksumMismatchFullRetryEnabled=true) drives rollbackSucceedingStages on the producer.

Key design decisions made by this PR.

Partition 0 only: the latch and the forced-mismatch are partition-0-scoped, which keeps the deterministic shape simple but means real checksum-mismatch behaviors on non-zero partitions aren't exercised by these flags.
Per-shuffle (not per-stage) state across all attempts, so the latch follows the shuffle as the stage retries.
Test-only state lives directly on DAGScheduler rather than behind an injectable hook — about 30 lines of fields and 90 lines of helper methods. The original inline implementation was already in-class; this preserves that pattern at the cost of more test-only code in production source.

Implementation sketch. Five new private methods on DAGScheduler (shouldCorruptShuffleOutputForTest, corruptShuffleOutputForTest, isForcedChecksumMismatchForTest, maybeApplyDelayedCorruptionForTest, maybePreemptiveCorruptionForResultStage). Three new ConcurrentHashMap fields hold the latch, pending corruptions, and consumer-success counters. Cleanup happens in cleanupStateForJobAndIndependentStages. The original inline corruption block in the ShuffleMapTask Success branch is replaced by calls to shouldCorrupt/corruptShuffleOutputForTest, and the local isChecksumMismatched is split into realChecksumMismatched (tracker result) plus the OR with the new forced bit.

OSS-applicable as-is — all changes live under org/apache/spark paths.

A few comments inline.

gengliangwang · 2026-05-11T23:19:31Z

+        if (newCount >= delay) {
+          val mapId = injectShuffleFetchFailuresPendingDelayedCorruption.remove(shuffleId)
+          mapOutputTracker.updateMapOutput(
+            shuffleId, mapId, injectShuffleFetchFailuresInvalidBlockManagerId)


Both delayed-corruption call sites (here and the one in maybePreemptiveCorruptionForResultStage) reach the executor only through MapOutputTracker.updateMapOutput, which calls invalidateSerializedMapOutputStatusCache() but does not call incrementEpoch(). Compare unregisterMapOutput (MapOutputTracker.scala:874) which DOES bump the epoch precisely so executors invalidate their MapOutputTrackerWorker.mapStatuses cache and re-fetch.

In local[2] (where these tests run) env.mapOutputTracker is MapOutputTrackerMaster shared with the driver (Executor.scala:865-872 documents this), so the update is immediately visible. In any non-local deployment, executors that already fetched this shuffle's statuses would keep their cached (valid) location and the FetchFailed cascade would not fire. The original inline-corruption code (status.updateLocation before registerMapOutput) didn't have this limitation because it mutated the status before workers ever fetched. Suggest calling mapOutputTracker.incrementEpoch() after each updateMapOutput here, or noting the local-mode-only constraint in the method docstrings.

gengliangwang · 2026-05-11T23:19:31Z

+                val realChecksumMismatched = mapOutputTracker.registerMapOutput(
                  shuffleStage.shuffleDep.shuffleId, smt.partitionId, status)
+                val isChecksumMismatched = realChecksumMismatched ||
+                  (Utils.isTesting &&
+                    isForcedChecksumMismatchForTest(shuffleStage.shuffleDep.shuffleId, task))


The original local was isChecksumMismatched. Now the tracker result is realChecksumMismatched and the OR'd combined value reuses the old name on the next line — the original name now refers to a different concept, and "real" implies a "fake" version exists (which is what the test forcing produces, but the reader has no setup for that). Either name the tracker result something neutral, or just inline the OR and drop the intermediate:

Suggested change

val realChecksumMismatched = mapOutputTracker.registerMapOutput(

shuffleStage.shuffleDep.shuffleId, smt.partitionId, status)

val isChecksumMismatched = realChecksumMismatched ||

(Utils.isTesting &&

isForcedChecksumMismatchForTest(shuffleStage.shuffleDep.shuffleId, task))

val isChecksumMismatched = mapOutputTracker.registerMapOutput(

shuffleStage.shuffleDep.shuffleId, smt.partitionId, status) ||

(Utils.isTesting &&

isForcedChecksumMismatchForTest(shuffleStage.shuffleDep.shuffleId, task))

gengliangwang · 2026-05-11T23:19:31Z

-        "location for the first stage attempt. Testing only flag!")
+      .doc("Corrupt the registered MapStatus of partition 0 on the first successful attempt " +
+        "of every shuffle map stage, to induce downstream FetchFailed and stage retry. " +
+        "Testing only.")


This says corruption happens "on the first successful attempt of every shuffle map stage" — true for the latch, but the visible timing on the master switch alone is now governed by the new DOWNSTREAM_DELAY=1 default, which defers the actual updateMapOutput until after 1 consumer task succeeds. A reader who only reads this flag's doc would think setting it alone gives inline corruption (matching the old semantics), but the new default behavior is materially different. Worth either pointing at INJECT_SHUFFLE_FETCH_FAILURES_DOWNSTREAM_DELAY here ("actual timing also depends on the *_DELAY flags below") or briefly noting the deferred-by-default behavior.

gengliangwang · 2026-05-11T23:19:31Z

        }

+        if (Utils.isTesting) {
+          maybeApplyDelayedCorruptionForTest(stage)


Minor: this fires before the task match { case rt: ResultTask ... case smt: ShuffleMapTask ... } block, so it runs for every Success event including those for which the rest of the handler is skipped via ignoreOldTaskAttempts. A late completion from a rolled-back consumer attempt still ticks the per-shuffle success counter. Probably never matters for the deterministic local-mode tests this targets — but if so, worth a one-line note in maybeApplyDelayedCorruptionForTest's docstring; otherwise gate this call on !ignoreOldTaskAttempts.

juliuszsompolski · 2026-05-12T06:26:39Z

Thanks @gengliangwang , applied the remarks.

juliuszsompolski · 2026-05-12T10:02:36Z

sql - slow tests and sql - other tests timed out again... let me push one more retrigger...

juliuszsompolski · 2026-06-08T09:13:20Z

@gengliangwang @Ngone51 could you take another look / merge?

cloud-fan

0 blocking, 1 non-blocking, 0 nits.
Clean, well-tested test-infra extension; all prior review comments are addressed. One non-blocking dead-code cleanup.

Suggestions (1)

SQLLastAttemptMetricPlanShapesSuite.scala:122: hasChecksumMismatch is defined but never used anywhere in the repo — see inline

cloud-fan · 2026-06-08T22:32:27Z

    def hasStageRetries: Boolean = spark.sparkContext.conf
      .getOption(config.Tests.INJECT_SHUFFLE_FETCH_FAILURES.key).contains("true")

+    def hasChecksumMismatch: Boolean = spark.sparkContext.conf


hasChecksumMismatch is defined here but never referenced anywhere in the repo, whereas its sibling hasAQEReplans is used (line 369). Looks added by analogy with the also-unused, pre-existing hasStageRetries (line 119). Consider removing it (and optionally the unused hasStageRetries), or wiring it into a plan-shape match if a use was intended. Non-blocking.

Drop unused PhysicalPlan helpers in SQLLastAttemptMetricPlanShapesSuite: hasChecksumMismatch (added in this PR by analogy) and the pre-existing hasStageRetries. Neither is referenced anywhere - only hasAQEReplans is used. Per apache#55738 (comment). Co-authored-by: Isaac

juliuszsompolski · 2026-06-09T17:49:56Z

Removed both the unused defs.

Drop unused PhysicalPlan helpers in SQLLastAttemptMetricPlanShapesSuite: hasChecksumMismatch (added in this PR by analogy) and the pre-existing hasStageRetries. Neither is referenced anywhere - only hasAQEReplans is used. Per apache#55738 (comment). Co-authored-by: Isaac

cloud-fan

1 addressed, 0 remaining, 3 new. (3 new = 0 newly introduced, 3 late catches — my misses from earlier rounds.)
0 blocking, 0 non-blocking, 3 nits.
The delta since my prior approval checks out: the unused defs are removed, and the new withBindingPolicy(NOT_APPLICABLE) annotations are correct for these scheduler-level test-only configs. The nits are optional comment polish I missed last round.

Nits: 3 minor items (see inline comments).

cloud-fan · 2026-06-11T20:06:14Z

+  // resolves to a real host, only the executorId is INVALID_EXECUTOR_ID so any fetch from this
+  // location FetchFailed.


Comma splice, and FetchFailed reads as a verb with no verb in the clause:

Suggested change

// resolves to a real host, only the executorId is INVALID_EXECUTOR_ID so any fetch from this

// location FetchFailed.

// resolves to a real host; only the executorId is INVALID_EXECUTOR_ID, so any fetch from

// this location fails with FetchFailed.

cloud-fan · 2026-06-11T20:06:14Z

+    // the result stage's findMissingPartitions is strictly less than numTasks - which OSS
+    // Spark cannot roll back, so the job aborts. With the default RESULT_STAGE_DELAY=0 the


findMissingPartitions returns a partition list, so it can't be "less than numTasks" — the actual gate in filterAndAbortUnrollbackableStages is findMissingPartitions().length < numTasks:

Suggested change

// the result stage's findMissingPartitions is strictly less than numTasks - which OSS

// Spark cannot roll back, so the job aborts. With the default RESULT_STAGE_DELAY=0 the

// the result stage's findMissingPartitions().length is strictly less than numTasks, and

// OSS Spark cannot roll back a partially-finished result stage, so the job aborts. With

// the default RESULT_STAGE_DELAY=0 the

cloud-fan · 2026-06-11T20:06:14Z

+    // and shuffle.partitions=20 (much greater than the test's local[2] cores). The DAGScheduler
+    // event loop is single-threaded for completion events, so deferring the producer's
+    // mapper-0 corruption until after one consumer success guarantees AT LEAST ONE consumer
+    // task fully completed before the FetchFailed cascade kicks in. With mode 3's rollback,


This "mode 3" numbering (also at lines 542 and 567, and "mode-2" in the assert message at 548) is never defined in this file, so a reader can't map the numbers to injection setups. Spell the modes out — e.g. "the force-checksum-mismatch injection" for mode 3 and "the recompute-only baseline" for mode 2.

juliuszsompolski · 2026-06-17T16:28:54Z

Last run hit a flake on PythonUDFWorkerSpecificationSuite

2026-06-12T22:52:19.3343130Z �[0m[�[0m�[31merror�[0m] �[0m�[0mFailed: Total 18936, Failed 1, Errors 0, Passed 18935, Ignored 38, Canceled 1�[0m
2026-06-12T22:52:19.3413066Z �[0m[�[0m�[31merror�[0m] �[0m�[0mFailed tests:�[0m
2026-06-12T22:52:19.3414234Z �[0m[�[0m�[31merror�[0m] �[0m�[0m	org.apache.spark.sql.execution.externalUDF.PythonUDFWorkerSpecificationSuite�[0m
2026-06-12T22:52:20.1582109Z �[0m[�[0m�[31merror�[0m] �[0m�[0m(sql / Test / �[31mtest�[0m) sbt.TestsFailedException: Tests unsuccessful�[0m
2026-06-12T22:52:20.1611999Z �[0m[�[0m�[31merror�[0m] �[0m�[0mTotal time: 6112 s (01:41:52.0), completed Jun 12, 2026, 10:52:20 PM�[0m

Let me rebase and retry once more.

…ing stage retries and rollback Extends the test-only INJECT_SHUFFLE_FETCH_FAILURES injection in the DAGScheduler with two orthogonal knobs that together let unit tests force the full range of stage-retry shapes that ordinary fetch-failure injection cannot reach: - INJECT_SHUFFLE_FETCH_FAILURES (existing master switch). Semantics extended: it now corrupts the partition-0 task of the FIRST SUCCESSFUL attempt of every shuffle map stage, including non-leaf stages (whose attempt 0 typically fails on fetch from upstream and is never corrupted under the previous semantics). - INJECT_SHUFFLE_FETCH_FAILURES_DOWNSTREAM_DELAY (int, default 1). Defers the producer's mapper-0 corruption until N task-success events have arrived from stages that consume the shuffle. Because the DAGScheduler event loop processes task-completion events serially, this guarantees that N consumer tasks fully completed BEFORE the FetchFailed cascade kicks in -- the realistic "lost shuffle on recompute" shape rather than corrupting at producer registration time. Subsequent consumer tasks dispatched to free slots after the corruption see the invalid MapStatus and FetchFailed. With shuffle.partitions much larger than executor cores, this gives a deterministic "partial first-attempt + recompute" shape. Set to 0 to corrupt inline at registration. - INJECT_SHUFFLE_FORCE_CHECKSUM_MISMATCH_ON_RECOMPUTE (boolean, default false). After a downstream FetchFailed forces the producer's partition 0 to be recomputed, the recomputed task's MapStatus registration is artificially flagged as a checksum mismatch. The DAGScheduler then runs rollbackSucceedingStages, which clears the downstream ShuffleMapStage's outputs and forces a full retry of that stage (every previously-finished partition runs again). Mirrors the production path where a recomputed shuffle output has a different checksum from a non-deterministic producer. ResultStage downstreams still abort because OSS Spark does not support rolling them back (SPARK-25342). Adds tests in MetricsFailureInjectionSuite covering the recompute injection (with and without rollback) and the delayed-corruption timing guarantee. Co-authored-by: Isaac

The bogus BlockManagerId used by INJECT_SHUFFLE_FETCH_FAILURES was using host="invalid" port=1, which is not a real cluster host. Reducer tasks whose preferred location resolved to that fake host got stuck in locality wait: local mode's LocalSchedulerBackend.reviveOffers is purely event-driven, so once the few tasks whose preferred location was the real driver host finished, no further `reviveOffers` fired to let the TaskSetManager escalate from NODE_LOCAL to ANY, and the job hung forever. Restore the original-host invariant from the inline-corruption path in apache/master: keep the producer's host/port/topology, only swap executorId to INVALID_EXECUTOR_ID so any fetch from this location FetchFailed. Fixes the SQLLastAttemptMetricIntegrationSuiteWithStageRetries "multi stage rdd updates" hang. Co-authored-by: Isaac

Only allocate the three test-only INJECT_SHUFFLE_FETCH_FAILURES ConcurrentHashMaps when Utils.isTesting; in production they stay null. All access paths to these maps are already gated on Utils.isTesting at the call site, so the null is never touched. Co-authored-by: Isaac

Drop unused PhysicalPlan helpers in SQLLastAttemptMetricPlanShapesSuite: hasChecksumMismatch (added in this PR by analogy) and the pre-existing hasStageRetries. Neither is referenced anywhere - only hasAQEReplans is used. Per apache#55738 (comment). Co-authored-by: Isaac

Doc-string nits from PR review: - DAGScheduler.scala: fix comma splice / verb-less FetchFailed clause in the injectShuffleFetchFailuresInvalidBlockManagerId comment. - MetricsFailureInjectionSuite.scala: drop undefined "mode 2"/"mode 3" shorthand throughout (the comments are the only place those numbers appear); spell them out as "force-checksum-mismatch injection" and "recompute-only baseline". Also fix "findMissingPartitions is strictly less than numTasks" -> ".length is strictly less than numTasks", since findMissingPartitions returns a partition list. Co-authored-by: Isaac

cloud-fan · 2026-06-17T21:58:08Z

the test failure is unrelated, thanks, merging to master/4.x!

…INJECT_SHUFFLE_FETCH_FAILURES ### What changes were proposed in this pull request? Extends the test-only fetch-failure injection in `DAGScheduler` with three new knobs and changes the semantics of the existing master switch. Four flags total (all in `org.apache.spark.internal.config.Tests`): - `INJECT_SHUFFLE_FETCH_FAILURES` (existing). Semantics changed: previously corrupted every map task of stage attempt 0 (so only leaf shuffle map stages were ever affected, since non-leaf stages typically fail-fetch on attempt 0). Now corrupts the partition-0 task of the first SUCCESSFUL attempt of every shuffle map stage, including non-leaf stages whose attempt 0 fails on fetch from upstream. - `INJECT_SHUFFLE_FETCH_FAILURES_DOWNSTREAM_DELAY` (int, default `1`, new). Defers the producer's mapper-0 corruption until N task-success events have arrived from `ShuffleMapStage` consumers of the shuffle. The DAGScheduler event loop processes task-completion events serially, so this guarantees N consumer tasks fully completed BEFORE the FetchFailed cascade kicks in. Subsequent tasks dispatched to free slots after the corruption see the invalid `MapStatus` and `FetchFailed`. With `spark.sql.shuffle.partitions` much larger than executor cores, this gives a deterministic "partial first-attempt + recompute" shape. Set to `0` to corrupt inline at registration. - `INJECT_SHUFFLE_FETCH_FAILURES_RESULT_STAGE_DELAY` (int, default `0`, new). Counterpart of the above for `ResultStage` consumers. With the default `0`, when a `ResultStage` is the consumer of a pending corruption it is corrupted *before* the result tasks dispatch, so the result stage has zero finished tasks when `INJECT_SHUFFLE_FORCE_CHECKSUM_MISMATCH_ON_RECOMPUTE` later triggers `rollbackSucceedingStages` (the rollback path would otherwise abort a partially-finished result stage, since OSS Spark does not support rolling them back). Set to N > 0 to defer until N result-stage tasks have succeeded - the only way to actually exercise the result-stage abort path. - `INJECT_SHUFFLE_FORCE_CHECKSUM_MISMATCH_ON_RECOMPUTE` (boolean, default false, new). After a downstream `FetchFailed` forces the producer's partition 0 to be recomputed, the recomputed task's `MapStatus` registration is artificially flagged as a checksum mismatch. The DAGScheduler then runs `rollbackSucceedingStages`, which clears the downstream `ShuffleMapStage`'s outputs and forces a full retry of that stage. `ResultStage` downstreams are aborted (OSS Spark does not support rolling them back, SPARK-25342). ### Why are the changes needed? The existing `INJECT_SHUFFLE_FETCH_FAILURES` flag corrupts attempt 0 of every shuffle map stage, but only leaf shuffle map stages succeed on attempt 0. Non-leaf stages fail-fetch from corrupted upstream on attempt 0 and are never themselves corrupted, so unit tests cannot exercise the full range of stage-retry shapes that production hits (metric stability under non-determinism, rollback for indeterminate producers, SLAM semantics across retries). The new knobs let tests deterministically reach those shapes. ### Does this PR introduce _any_ user-facing change? No. All flags are test-only (gated by `Utils.isTesting`) and are under `spark.testing.*`. ### How was this patch tested? - `MetricsFailureInjectionSuite` (12 tests). Existing tests continue to pass with the new default semantics. The non-deterministic-stage test sees stage-2 raw metric overcount under the new default (because checksum-mismatch rollback now fires on the non-determinism); SLAM remains stable, which is the point of that test. - New test `Three stage metrics force-checksum-mismatch with delayed corruption`: with `shuffle.partitions=20` (much greater than the test's `local[2]` cores) and `delay=1`, the rollback re-plays at least one already-completed stage-2 partition on top of the full re-run, putting the raw metric strictly above the recompute-only baseline. - New test `Force checksum mismatch aborts a downstream ResultStage`: 2-stage `groupBy().count()` query where stage 2 is a `ResultStage`. With `RESULT_STAGE_DELAY=1` (opted-in) and `INJECT_SHUFFLE_FORCE_CHECKSUM_MISMATCH_ON_RECOMPUTE`, one result task succeeds before the FetchFailed cascade; the forced checksum mismatch on stage 1's mapper-0 recompute then fires `rollbackSucceedingStages`, which sees `numMissingPartitions < numTasks` on the result stage and aborts. The query throws a `SparkException` with `"indeterminate"` in the message. - `SQLLastAttemptMetricPlanShapesSuite` (220 tests, was previously parameterised on `stageRetries: Boolean`, now on a tri-valued `FailureMode`: `NoFailure`, `FetchFailure`, `ChecksumMismatch`). Each plan shape is now also exercised under forced checksum-mismatch rollback; SLAM values remain stable. - `SQLLastAttemptMetricIntegrationSuiteWithChecksumMismatch` (new subclass): runs the full integration suite with `INJECT_SHUFFLE_FETCH_FAILURES + INJECT_SHUFFLE_FORCE_CHECKSUM_MISMATCH_ON_RECOMPUTE` enabled. - `core/scalastyle`, `sql/scalastyle`, and 149 `DAGSchedulerSuite` tests pass. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code, Opus 4.7. Closes #55738 from juliuszsompolski/retry-injection-infra. Lead-authored-by: Juliusz Sompolski <julek@databricks.com> Co-authored-by: Juliusz Sompolski <Juliusz Sompolski> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit c101701) Signed-off-by: Wenchen Fan <wenchen@databricks.com>

…TCH_FAILURES injection maps ### What changes were proposed in this pull request? Followup to #55738. The three `INJECT_SHUFFLE_FETCH_FAILURES` injection-state maps in `DAGScheduler` (`injectShuffleFetchFailuresCorruptedAttempt`, `injectShuffleFetchFailuresPendingDelayedCorruption`, `injectShuffleFetchFailuresDownstreamSuccessCount`) are initialized with `if (Utils.isTesting) new ConcurrentHashMap else null`. This patch allocates them unconditionally instead. ### Why are the changes needed? The cleanup site in `cleanupStateForJobAndIndependentStages` (and the other use-sites) re-evaluate `if (Utils.isTesting)` to decide whether to dereference these maps. `Utils.isTesting` reads the mutable `spark.testing` system property, so it can return a different value when the `DAGScheduler` is constructed than at the later use-sites. If it does, a `null` map is dereferenced and the `DAGScheduler` event loop crashes with a `NullPointerException`. Allocating the maps unconditionally makes the construction guard and the use-site guards consistent, so they can never disagree. ### Does this PR introduce _any_ user-facing change? No. The maps are only ever populated inside the config-gated `INJECT_SHUFFLE_FETCH_FAILURES` test paths, so in production they stay empty and there is no behavioral change beyond allocating an empty map. ### How was this patch tested? Existing tests. This is a defensive initialization change; the maps stay empty unless the existing `INJECT_SHUFFLE_FETCH_FAILURES` test paths populate them, so the existing coverage exercises them unchanged. ### Was this patch authored or co-authored using generative AI tooling? Yes, drafted with Claude Code (Anthropic). Closes #56631 from cloud-fan/SPARK-56773-followup. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

…TCH_FAILURES injection maps ### What changes were proposed in this pull request? Followup to #55738. The three `INJECT_SHUFFLE_FETCH_FAILURES` injection-state maps in `DAGScheduler` (`injectShuffleFetchFailuresCorruptedAttempt`, `injectShuffleFetchFailuresPendingDelayedCorruption`, `injectShuffleFetchFailuresDownstreamSuccessCount`) are initialized with `if (Utils.isTesting) new ConcurrentHashMap else null`. This patch allocates them unconditionally instead. ### Why are the changes needed? The cleanup site in `cleanupStateForJobAndIndependentStages` (and the other use-sites) re-evaluate `if (Utils.isTesting)` to decide whether to dereference these maps. `Utils.isTesting` reads the mutable `spark.testing` system property, so it can return a different value when the `DAGScheduler` is constructed than at the later use-sites. If it does, a `null` map is dereferenced and the `DAGScheduler` event loop crashes with a `NullPointerException`. Allocating the maps unconditionally makes the construction guard and the use-site guards consistent, so they can never disagree. ### Does this PR introduce _any_ user-facing change? No. The maps are only ever populated inside the config-gated `INJECT_SHUFFLE_FETCH_FAILURES` test paths, so in production they stay empty and there is no behavioral change beyond allocating an empty map. ### How was this patch tested? Existing tests. This is a defensive initialization change; the maps stay empty unless the existing `INJECT_SHUFFLE_FETCH_FAILURES` test paths populate them, so the existing coverage exercises them unchanged. ### Was this patch authored or co-authored using generative AI tooling? Yes, drafted with Claude Code (Anthropic). Closes #56631 from cloud-fan/SPARK-56773-followup. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 41cc304) Signed-off-by: Wenchen Fan <wenchen@databricks.com>

juliuszsompolski force-pushed the retry-injection-infra branch from 16fb596 to eeaa228 Compare May 7, 2026 16:18

juliuszsompolski mentioned this pull request May 7, 2026

[SPARK-56743][SQL] Use SQLLastAttemptMetric for DSv2 UPDATE/DELETE/MERGE metrics #55711

Closed

juliuszsompolski force-pushed the retry-injection-infra branch from eeaa228 to dcb314a Compare May 7, 2026 20:33

gengliangwang reviewed May 11, 2026

View reviewed changes

gengliangwang requested a review from Ngone51 May 12, 2026 23:16

gengliangwang approved these changes May 12, 2026

View reviewed changes

cloud-fan approved these changes Jun 8, 2026

View reviewed changes

juliuszsompolski force-pushed the retry-injection-infra branch from 95d1e70 to d576f6c Compare June 11, 2026 11:54

juliuszsompolski force-pushed the retry-injection-infra branch from d576f6c to 270b26d Compare June 11, 2026 14:13

cloud-fan approved these changes Jun 11, 2026

View reviewed changes

Juliusz Sompolski and others added 6 commits June 17, 2026 16:34

review remarks

8db809e

review remarks

1350ba9

Drop unused PhysicalPlan helpers in SQLLastAttemptMetricPlanShapesSuite: hasChecksumMismatch (added in this PR by analogy) and the pre-existing hasStageRetries. Neither is referenced anywhere - only hasAQEReplans is used. Per apache#55738 (comment). Co-authored-by: Isaac

juliuszsompolski force-pushed the retry-injection-infra branch from 3bb1ac9 to 00a9c45 Compare June 17, 2026 16:34

cloud-fan closed this in c101701 Jun 17, 2026

cloud-fan mentioned this pull request Jun 20, 2026

[SPARK-56773][CORE][TESTS][FOLLOWUP] Always allocate INJECT_SHUFFLE_FETCH_FAILURES injection maps #56631

Closed

juliuszsompolski mentioned this pull request Jun 22, 2026

[SPARK-56743][SPARK-56773][SQL][CORE][TESTS] Exercise writer-stage retries in DSv2 DML metric tests and fix injection-state cleanup under AQE #56597

Closed

		// resolves to a real host, only the executorId is INVALID_EXECUTOR_ID so any fetch from this
		// location FetchFailed.

		// the result stage's findMissingPartitions is strictly less than numTasks - which OSS
		// Spark cannot roll back, so the job aborts. With the default RESULT_STAGE_DELAY=0 the

-    // the result stage's findMissingPartitions is strictly less than numTasks - which OSS
-    // Spark cannot roll back, so the job aborts. With the default RESULT_STAGE_DELAY=0 the
+    // the result stage's findMissingPartitions().length is strictly less than numTasks, and
+    // OSS Spark cannot roll back a partially-finished result stage, so the job aborts. With
+    // the default RESULT_STAGE_DELAY=0 the

Uh oh!

Conversation

juliuszsompolski commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

juliuszsompolski commented May 7, 2026

Uh oh!

juliuszsompolski commented May 11, 2026

Uh oh!

juliuszsompolski commented May 11, 2026

Uh oh!

juliuszsompolski commented May 11, 2026

Uh oh!

gengliangwang left a comment

Choose a reason for hiding this comment

Summary

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

juliuszsompolski commented May 12, 2026

Uh oh!

juliuszsompolski commented May 12, 2026

Uh oh!

juliuszsompolski commented Jun 8, 2026

Uh oh!

cloud-fan left a comment

Choose a reason for hiding this comment

Suggestions (1)

Uh oh!

Choose a reason for hiding this comment

Uh oh!

juliuszsompolski commented Jun 9, 2026

Uh oh!

cloud-fan left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

juliuszsompolski commented Jun 17, 2026

Uh oh!

cloud-fan commented Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

juliuszsompolski commented May 7, 2026 •

edited

Loading