Skip to content

[SPARK-56773][CORE][TESTS] Add more fetch-failure injection knobs for INJECT_SHUFFLE_FETCH_FAILURES#55738

Closed
juliuszsompolski wants to merge 6 commits into
apache:masterfrom
juliuszsompolski:retry-injection-infra
Closed

[SPARK-56773][CORE][TESTS] Add more fetch-failure injection knobs for INJECT_SHUFFLE_FETCH_FAILURES#55738
juliuszsompolski wants to merge 6 commits into
apache:masterfrom
juliuszsompolski:retry-injection-infra

Conversation

@juliuszsompolski

@juliuszsompolski juliuszsompolski commented May 7, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Extends the test-only fetch-failure injection in DAGScheduler with three new knobs and changes the semantics of the existing master switch. Four flags total (all in org.apache.spark.internal.config.Tests):

  • INJECT_SHUFFLE_FETCH_FAILURES (existing). Semantics changed: previously corrupted every map task of stage attempt 0 (so only leaf shuffle map stages were ever affected, since non-leaf stages typically fail-fetch on attempt 0). Now corrupts the partition-0 task of the first SUCCESSFUL attempt of every shuffle map stage, including non-leaf stages whose attempt 0 fails on fetch from upstream.

  • INJECT_SHUFFLE_FETCH_FAILURES_DOWNSTREAM_DELAY (int, default 1, new). Defers the producer's mapper-0 corruption until N task-success events have arrived from ShuffleMapStage consumers of the shuffle. The DAGScheduler event loop processes task-completion events serially, so this guarantees N consumer tasks fully completed BEFORE the FetchFailed cascade kicks in. Subsequent tasks dispatched to free slots after the corruption see the invalid MapStatus and FetchFailed. With spark.sql.shuffle.partitions much larger than executor cores, this gives a deterministic "partial first-attempt + recompute" shape. Set to 0 to corrupt inline at registration.

  • INJECT_SHUFFLE_FETCH_FAILURES_RESULT_STAGE_DELAY (int, default 0, new). Counterpart of the above for ResultStage consumers. With the default 0, when a ResultStage is the consumer of a pending corruption it is corrupted before the result tasks dispatch, so the result stage has zero finished tasks when INJECT_SHUFFLE_FORCE_CHECKSUM_MISMATCH_ON_RECOMPUTE later triggers rollbackSucceedingStages (the rollback path would otherwise abort a partially-finished result stage, since OSS Spark does not support rolling them back). Set to N > 0 to defer until N result-stage tasks have succeeded - the only way to actually exercise the result-stage abort path.

  • INJECT_SHUFFLE_FORCE_CHECKSUM_MISMATCH_ON_RECOMPUTE (boolean, default false, new). After a downstream FetchFailed forces the producer's partition 0 to be recomputed, the recomputed task's MapStatus registration is artificially flagged as a checksum mismatch. The DAGScheduler then runs rollbackSucceedingStages, which clears the downstream ShuffleMapStage's outputs and forces a full retry of that stage. ResultStage downstreams are aborted (OSS Spark does not support rolling them back, SPARK-25342).

Why are the changes needed?

The existing INJECT_SHUFFLE_FETCH_FAILURES flag corrupts attempt 0 of every shuffle map stage, but only leaf shuffle map stages succeed on attempt 0. Non-leaf stages fail-fetch from corrupted upstream on attempt 0 and are never themselves corrupted, so unit tests cannot exercise the full range of stage-retry shapes that production hits (metric stability under non-determinism, rollback for indeterminate producers, SLAM semantics across retries). The new knobs let tests deterministically reach those shapes.

Does this PR introduce any user-facing change?

No. All flags are test-only (gated by Utils.isTesting) and are under spark.testing.*.

How was this patch tested?

  • MetricsFailureInjectionSuite (12 tests). Existing tests continue to pass with the new default semantics. The non-deterministic-stage test sees stage-2 raw metric overcount under the new default (because checksum-mismatch rollback now fires on the non-determinism); SLAM remains stable, which is the point of that test.

  • New test Three stage metrics force-checksum-mismatch with delayed corruption: with shuffle.partitions=20 (much greater than the test's local[2] cores) and delay=1, the rollback re-plays at least one already-completed stage-2 partition on top of the full re-run, putting the raw metric strictly above the recompute-only baseline.

  • New test Force checksum mismatch aborts a downstream ResultStage: 2-stage groupBy().count() query where stage 2 is a ResultStage. With RESULT_STAGE_DELAY=1 (opted-in) and INJECT_SHUFFLE_FORCE_CHECKSUM_MISMATCH_ON_RECOMPUTE, one result task succeeds before the FetchFailed cascade; the forced checksum mismatch on stage 1's mapper-0 recompute then fires rollbackSucceedingStages, which sees numMissingPartitions < numTasks on the result stage and aborts. The query throws a SparkException with "indeterminate" in the message.

  • SQLLastAttemptMetricPlanShapesSuite (220 tests, was previously parameterised on stageRetries: Boolean, now on a tri-valued FailureMode: NoFailure, FetchFailure, ChecksumMismatch). Each plan shape is now also exercised under forced checksum-mismatch rollback; SLAM values remain stable.

  • SQLLastAttemptMetricIntegrationSuiteWithChecksumMismatch (new subclass): runs the full integration suite with INJECT_SHUFFLE_FETCH_FAILURES + INJECT_SHUFFLE_FORCE_CHECKSUM_MISMATCH_ON_RECOMPUTE enabled.

  • core/scalastyle, sql/scalastyle, and 149 DAGSchedulerSuite tests pass.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code, Opus 4.7.

@juliuszsompolski

Copy link
Copy Markdown
Contributor Author

@cloud-fan I added some refinements to the stage retry testing I added with #55371. This makes it exercise a bigger variety of scenarios. I also want to use it for more correctness testing in presence of retries and recomputes.

@juliuszsompolski

Copy link
Copy Markdown
Contributor Author
[info] org.apache.spark.util.collection.SorterSuite *** ABORTED *** (48 seconds, 440 milliseconds)
[info]   java.lang.OutOfMemoryError: Java heap space

ugh...

@juliuszsompolski

Copy link
Copy Markdown
Contributor Author

@cloud-fan @gengliangwang could you help review this? I would like to use this infra to test #55711, targeting 4.2.

@juliuszsompolski

Copy link
Copy Markdown
Contributor Author

And now sql - other tests https://github.com/juliuszsompolski/apache-spark/actions/runs/25693038491/job/75434352332 timed out.
How many times for a clean CI?

@gengliangwang gengliangwang left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

Extends the test-only fetch-failure injection in DAGScheduler so unit tests can deterministically reach scheduler shapes the existing inline corruption could never produce — most importantly, partial-progress-on-downstream followed by upstream recompute with a forced checksum mismatch driving rollbackSucceedingStages.

Prior state and problem. INJECT_SHUFFLE_FETCH_FAILURES corrupted every mapper of stage attempt 0 inline (before registerMapOutput). Because non-leaf shuffle map stages always fetch-fail on attempt 0 from a corrupted upstream, they're never themselves observed to succeed an attempt that gets corrupted — so the existing flag only ever exercises the leaf-stage corruption path. Several scheduler behaviors are untestable as a result: metric stability under non-determinism for non-leaf stages, rollback of partially-finished downstream stages on checksum mismatch, and the result-stage abort path (OSS Spark can't roll back result stages with completed tasks, SPARK-25342).

Design approach. Two structural changes plus three layered sub-flags:

  • The master switch now latches the first successful attempt per shuffle (per-shuffleId computeIfAbsent) and corrupts only partition 0 on that attempt. Non-leaf stages eventually corrupt too — once their attempt 0 recovers from upstream FetchFailed and successfully completes partition 0. Recomputes of partition 0 are left clean so the consumer can make progress.
  • INJECT_SHUFFLE_FETCH_FAILURES_DOWNSTREAM_DELAY (default 1) and INJECT_SHUFFLE_FETCH_FAILURES_RESULT_STAGE_DELAY (default 0) defer corruption until N consumer task successes have been observed. The DAGScheduler event loop is single-threaded, so this guarantees N downstream completions before the FetchFailed cascade. The asymmetric defaults are forced by the result-stage rollback restriction: with RESULT_STAGE_DELAY=0 preemptive corruption (in submitMissingTasks) means the result stage has zero completed tasks when rollback runs and filterAndAbortUnrollbackableStages lets it through; with >0, the abort path is the thing under test.
  • INJECT_SHUFFLE_FORCE_CHECKSUM_MISMATCH_ON_RECOMPUTE ORs a synthetic mismatch bit into the recompute's addMapOutput result, which (with the SQL default checksumMismatchFullRetryEnabled=true) drives rollbackSucceedingStages on the producer.

Key design decisions made by this PR.

  • Partition 0 only: the latch and the forced-mismatch are partition-0-scoped, which keeps the deterministic shape simple but means real checksum-mismatch behaviors on non-zero partitions aren't exercised by these flags.
  • Per-shuffle (not per-stage) state across all attempts, so the latch follows the shuffle as the stage retries.
  • Test-only state lives directly on DAGScheduler rather than behind an injectable hook — about 30 lines of fields and 90 lines of helper methods. The original inline implementation was already in-class; this preserves that pattern at the cost of more test-only code in production source.

Implementation sketch. Five new private methods on DAGScheduler (shouldCorruptShuffleOutputForTest, corruptShuffleOutputForTest, isForcedChecksumMismatchForTest, maybeApplyDelayedCorruptionForTest, maybePreemptiveCorruptionForResultStage). Three new ConcurrentHashMap fields hold the latch, pending corruptions, and consumer-success counters. Cleanup happens in cleanupStateForJobAndIndependentStages. The original inline corruption block in the ShuffleMapTask Success branch is replaced by calls to shouldCorrupt/corruptShuffleOutputForTest, and the local isChecksumMismatched is split into realChecksumMismatched (tracker result) plus the OR with the new forced bit.

OSS-applicable as-is — all changes live under org/apache/spark paths.

A few comments inline.

if (newCount >= delay) {
val mapId = injectShuffleFetchFailuresPendingDelayedCorruption.remove(shuffleId)
mapOutputTracker.updateMapOutput(
shuffleId, mapId, injectShuffleFetchFailuresInvalidBlockManagerId)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both delayed-corruption call sites (here and the one in maybePreemptiveCorruptionForResultStage) reach the executor only through MapOutputTracker.updateMapOutput, which calls invalidateSerializedMapOutputStatusCache() but does not call incrementEpoch(). Compare unregisterMapOutput (MapOutputTracker.scala:874) which DOES bump the epoch precisely so executors invalidate their MapOutputTrackerWorker.mapStatuses cache and re-fetch.

In local[2] (where these tests run) env.mapOutputTracker is MapOutputTrackerMaster shared with the driver (Executor.scala:865-872 documents this), so the update is immediately visible. In any non-local deployment, executors that already fetched this shuffle's statuses would keep their cached (valid) location and the FetchFailed cascade would not fire. The original inline-corruption code (status.updateLocation before registerMapOutput) didn't have this limitation because it mutated the status before workers ever fetched. Suggest calling mapOutputTracker.incrementEpoch() after each updateMapOutput here, or noting the local-mode-only constraint in the method docstrings.

Comment on lines +2497 to +2501
val realChecksumMismatched = mapOutputTracker.registerMapOutput(
shuffleStage.shuffleDep.shuffleId, smt.partitionId, status)
val isChecksumMismatched = realChecksumMismatched ||
(Utils.isTesting &&
isForcedChecksumMismatchForTest(shuffleStage.shuffleDep.shuffleId, task))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original local was isChecksumMismatched. Now the tracker result is realChecksumMismatched and the OR'd combined value reuses the old name on the next line — the original name now refers to a different concept, and "real" implies a "fake" version exists (which is what the test forcing produces, but the reader has no setup for that). Either name the tracker result something neutral, or just inline the OR and drop the intermediate:

Suggested change
val realChecksumMismatched = mapOutputTracker.registerMapOutput(
shuffleStage.shuffleDep.shuffleId, smt.partitionId, status)
val isChecksumMismatched = realChecksumMismatched ||
(Utils.isTesting &&
isForcedChecksumMismatchForTest(shuffleStage.shuffleDep.shuffleId, task))
val isChecksumMismatched = mapOutputTracker.registerMapOutput(
shuffleStage.shuffleDep.shuffleId, smt.partitionId, status) ||
(Utils.isTesting &&
isForcedChecksumMismatchForTest(shuffleStage.shuffleDep.shuffleId, task))

"location for the first stage attempt. Testing only flag!")
.doc("Corrupt the registered MapStatus of partition 0 on the first successful attempt " +
"of every shuffle map stage, to induce downstream FetchFailed and stage retry. " +
"Testing only.")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This says corruption happens "on the first successful attempt of every shuffle map stage" — true for the latch, but the visible timing on the master switch alone is now governed by the new DOWNSTREAM_DELAY=1 default, which defers the actual updateMapOutput until after 1 consumer task succeeds. A reader who only reads this flag's doc would think setting it alone gives inline corruption (matching the old semantics), but the new default behavior is materially different. Worth either pointing at INJECT_SHUFFLE_FETCH_FAILURES_DOWNSTREAM_DELAY here ("actual timing also depends on the *_DELAY flags below") or briefly noting the deferred-by-default behavior.

}

if (Utils.isTesting) {
maybeApplyDelayedCorruptionForTest(stage)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: this fires before the task match { case rt: ResultTask ... case smt: ShuffleMapTask ... } block, so it runs for every Success event including those for which the rest of the handler is skipped via ignoreOldTaskAttempts. A late completion from a rolled-back consumer attempt still ticks the per-shuffle success counter. Probably never matters for the deterministic local-mode tests this targets — but if so, worth a one-line note in maybeApplyDelayedCorruptionForTest's docstring; otherwise gate this call on !ignoreOldTaskAttempts.

@juliuszsompolski

Copy link
Copy Markdown
Contributor Author

Thanks @gengliangwang , applied the remarks.

@juliuszsompolski

Copy link
Copy Markdown
Contributor Author

sql - slow tests and sql - other tests timed out again... let me push one more retrigger...

@gengliangwang gengliangwang requested a review from Ngone51 May 12, 2026 23:16
@juliuszsompolski

Copy link
Copy Markdown
Contributor Author

@gengliangwang @Ngone51 could you take another look / merge?

@cloud-fan cloud-fan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 blocking, 1 non-blocking, 0 nits.
Clean, well-tested test-infra extension; all prior review comments are addressed. One non-blocking dead-code cleanup.

Suggestions (1)

  • SQLLastAttemptMetricPlanShapesSuite.scala:122: hasChecksumMismatch is defined but never used anywhere in the repo — see inline

def hasStageRetries: Boolean = spark.sparkContext.conf
.getOption(config.Tests.INJECT_SHUFFLE_FETCH_FAILURES.key).contains("true")

def hasChecksumMismatch: Boolean = spark.sparkContext.conf

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hasChecksumMismatch is defined here but never referenced anywhere in the repo, whereas its sibling hasAQEReplans is used (line 369). Looks added by analogy with the also-unused, pre-existing hasStageRetries (line 119). Consider removing it (and optionally the unused hasStageRetries), or wiring it into a plan-shape match if a use was intended. Non-blocking.

juliuszsompolski pushed a commit to juliuszsompolski/apache-spark that referenced this pull request Jun 9, 2026
Drop unused PhysicalPlan helpers in SQLLastAttemptMetricPlanShapesSuite:
hasChecksumMismatch (added in this PR by analogy) and the pre-existing
hasStageRetries. Neither is referenced anywhere - only hasAQEReplans is
used. Per apache#55738 (comment).

Co-authored-by: Isaac
@juliuszsompolski

Copy link
Copy Markdown
Contributor Author

Removed both the unused defs.

juliuszsompolski pushed a commit to juliuszsompolski/apache-spark that referenced this pull request Jun 11, 2026
Drop unused PhysicalPlan helpers in SQLLastAttemptMetricPlanShapesSuite:
hasChecksumMismatch (added in this PR by analogy) and the pre-existing
hasStageRetries. Neither is referenced anywhere - only hasAQEReplans is
used. Per apache#55738 (comment).

Co-authored-by: Isaac
juliuszsompolski pushed a commit to juliuszsompolski/apache-spark that referenced this pull request Jun 11, 2026
Drop unused PhysicalPlan helpers in SQLLastAttemptMetricPlanShapesSuite:
hasChecksumMismatch (added in this PR by analogy) and the pre-existing
hasStageRetries. Neither is referenced anywhere - only hasAQEReplans is
used. Per apache#55738 (comment).

Co-authored-by: Isaac

@cloud-fan cloud-fan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 addressed, 0 remaining, 3 new. (3 new = 0 newly introduced, 3 late catches — my misses from earlier rounds.)
0 blocking, 0 non-blocking, 3 nits.
The delta since my prior approval checks out: the unused defs are removed, and the new withBindingPolicy(NOT_APPLICABLE) annotations are correct for these scheduler-level test-only configs. The nits are optional comment polish I missed last round.

Nits: 3 minor items (see inline comments).

Comment on lines +205 to +206
// resolves to a real host, only the executorId is INVALID_EXECUTOR_ID so any fetch from this
// location FetchFailed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comma splice, and FetchFailed reads as a verb with no verb in the clause:

Suggested change
// resolves to a real host, only the executorId is INVALID_EXECUTOR_ID so any fetch from this
// location FetchFailed.
// resolves to a real host; only the executorId is INVALID_EXECUTOR_ID, so any fetch from
// this location fails with FetchFailed.

Comment on lines +568 to +569
// the result stage's findMissingPartitions is strictly less than numTasks - which OSS
// Spark cannot roll back, so the job aborts. With the default RESULT_STAGE_DELAY=0 the

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

findMissingPartitions returns a partition list, so it can't be "less than numTasks" — the actual gate in filterAndAbortUnrollbackableStages is findMissingPartitions().length < numTasks:

Suggested change
// the result stage's findMissingPartitions is strictly less than numTasks - which OSS
// Spark cannot roll back, so the job aborts. With the default RESULT_STAGE_DELAY=0 the
// the result stage's findMissingPartitions().length is strictly less than numTasks, and
// OSS Spark cannot roll back a partially-finished result stage, so the job aborts. With
// the default RESULT_STAGE_DELAY=0 the

// and shuffle.partitions=20 (much greater than the test's local[2] cores). The DAGScheduler
// event loop is single-threaded for completion events, so deferring the producer's
// mapper-0 corruption until after one consumer success guarantees AT LEAST ONE consumer
// task fully completed before the FetchFailed cascade kicks in. With mode 3's rollback,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This "mode 3" numbering (also at lines 542 and 567, and "mode-2" in the assert message at 548) is never defined in this file, so a reader can't map the numbers to injection setups. Spell the modes out — e.g. "the force-checksum-mismatch injection" for mode 3 and "the recompute-only baseline" for mode 2.

@juliuszsompolski

Copy link
Copy Markdown
Contributor Author

Last run hit a flake on PythonUDFWorkerSpecificationSuite

2026-06-12T22:52:19.3343130Z �[0m[�[0m�[31merror�[0m] �[0m�[0mFailed: Total 18936, Failed 1, Errors 0, Passed 18935, Ignored 38, Canceled 1�[0m
2026-06-12T22:52:19.3413066Z �[0m[�[0m�[31merror�[0m] �[0m�[0mFailed tests:�[0m
2026-06-12T22:52:19.3414234Z �[0m[�[0m�[31merror�[0m] �[0m�[0m	org.apache.spark.sql.execution.externalUDF.PythonUDFWorkerSpecificationSuite�[0m
2026-06-12T22:52:20.1582109Z �[0m[�[0m�[31merror�[0m] �[0m�[0m(sql / Test / �[31mtest�[0m) sbt.TestsFailedException: Tests unsuccessful�[0m
2026-06-12T22:52:20.1611999Z �[0m[�[0m�[31merror�[0m] �[0m�[0mTotal time: 6112 s (01:41:52.0), completed Jun 12, 2026, 10:52:20 PM�[0m

Let me rebase and retry once more.

Juliusz Sompolski and others added 6 commits June 17, 2026 16:34
…ing stage retries and rollback

Extends the test-only INJECT_SHUFFLE_FETCH_FAILURES injection in the
DAGScheduler with two orthogonal knobs that together let unit tests
force the full range of stage-retry shapes that ordinary fetch-failure
injection cannot reach:

- INJECT_SHUFFLE_FETCH_FAILURES (existing master switch). Semantics
  extended: it now corrupts the partition-0 task of the FIRST
  SUCCESSFUL attempt of every shuffle map stage, including non-leaf
  stages (whose attempt 0 typically fails on fetch from upstream and
  is never corrupted under the previous semantics).

- INJECT_SHUFFLE_FETCH_FAILURES_DOWNSTREAM_DELAY (int, default 1).
  Defers the producer's mapper-0 corruption until N task-success
  events have arrived from stages that consume the shuffle. Because
  the DAGScheduler event loop processes task-completion events
  serially, this guarantees that N consumer tasks fully completed
  BEFORE the FetchFailed cascade kicks in -- the realistic
  "lost shuffle on recompute" shape rather than corrupting at
  producer registration time. Subsequent consumer tasks dispatched to
  free slots after the corruption see the invalid MapStatus and
  FetchFailed. With shuffle.partitions much larger than executor
  cores, this gives a deterministic "partial first-attempt + recompute"
  shape. Set to 0 to corrupt inline at registration.

- INJECT_SHUFFLE_FORCE_CHECKSUM_MISMATCH_ON_RECOMPUTE (boolean,
  default false). After a downstream FetchFailed forces the
  producer's partition 0 to be recomputed, the recomputed task's
  MapStatus registration is artificially flagged as a checksum
  mismatch. The DAGScheduler then runs rollbackSucceedingStages,
  which clears the downstream ShuffleMapStage's outputs and forces a
  full retry of that stage (every previously-finished partition runs
  again). Mirrors the production path where a recomputed shuffle
  output has a different checksum from a non-deterministic producer.
  ResultStage downstreams still abort because OSS Spark does not
  support rolling them back (SPARK-25342).

Adds tests in MetricsFailureInjectionSuite covering the recompute
injection (with and without rollback) and the delayed-corruption
timing guarantee.

Co-authored-by: Isaac
The bogus BlockManagerId used by INJECT_SHUFFLE_FETCH_FAILURES was using
host="invalid" port=1, which is not a real cluster host. Reducer tasks
whose preferred location resolved to that fake host got stuck in
locality wait: local mode's LocalSchedulerBackend.reviveOffers is purely
event-driven, so once the few tasks whose preferred location was the
real driver host finished, no further `reviveOffers` fired to let the
TaskSetManager escalate from NODE_LOCAL to ANY, and the job hung
forever.

Restore the original-host invariant from the inline-corruption path in
apache/master: keep the producer's host/port/topology, only swap
executorId to INVALID_EXECUTOR_ID so any fetch from this location
FetchFailed. Fixes the SQLLastAttemptMetricIntegrationSuiteWithStageRetries
"multi stage rdd updates" hang.

Co-authored-by: Isaac
Only allocate the three test-only INJECT_SHUFFLE_FETCH_FAILURES
ConcurrentHashMaps when Utils.isTesting; in production they stay
null. All access paths to these maps are already gated on Utils.isTesting
at the call site, so the null is never touched.

Co-authored-by: Isaac
Drop unused PhysicalPlan helpers in SQLLastAttemptMetricPlanShapesSuite:
hasChecksumMismatch (added in this PR by analogy) and the pre-existing
hasStageRetries. Neither is referenced anywhere - only hasAQEReplans is
used. Per apache#55738 (comment).

Co-authored-by: Isaac
Doc-string nits from PR review:
- DAGScheduler.scala: fix comma splice / verb-less FetchFailed clause
  in the injectShuffleFetchFailuresInvalidBlockManagerId comment.
- MetricsFailureInjectionSuite.scala: drop undefined "mode 2"/"mode 3"
  shorthand throughout (the comments are the only place those numbers
  appear); spell them out as "force-checksum-mismatch injection" and
  "recompute-only baseline". Also fix
  "findMissingPartitions is strictly less than numTasks" -> ".length is
  strictly less than numTasks", since findMissingPartitions returns a
  partition list.

Co-authored-by: Isaac
@cloud-fan

Copy link
Copy Markdown
Contributor

the test failure is unrelated, thanks, merging to master/4.x!

@cloud-fan cloud-fan closed this in c101701 Jun 17, 2026
cloud-fan pushed a commit that referenced this pull request Jun 17, 2026
…INJECT_SHUFFLE_FETCH_FAILURES

### What changes were proposed in this pull request?

Extends the test-only fetch-failure injection in `DAGScheduler` with three new knobs and changes the semantics of the existing master switch. Four flags total (all in `org.apache.spark.internal.config.Tests`):

- `INJECT_SHUFFLE_FETCH_FAILURES` (existing). Semantics changed: previously corrupted every map task of stage attempt 0 (so only leaf shuffle map stages were ever affected, since non-leaf stages typically fail-fetch on attempt 0). Now corrupts the partition-0 task of the first SUCCESSFUL attempt of every shuffle map stage, including non-leaf stages whose attempt 0 fails on fetch from upstream.

- `INJECT_SHUFFLE_FETCH_FAILURES_DOWNSTREAM_DELAY` (int, default `1`, new). Defers the producer's mapper-0 corruption until N task-success events have arrived from `ShuffleMapStage` consumers of the shuffle. The DAGScheduler event loop processes task-completion events serially, so this guarantees N consumer tasks fully completed BEFORE the FetchFailed cascade kicks in. Subsequent tasks dispatched to free slots after the corruption see the invalid `MapStatus` and `FetchFailed`. With `spark.sql.shuffle.partitions` much larger than executor cores, this gives a deterministic "partial first-attempt + recompute" shape. Set to `0` to corrupt inline at registration.

- `INJECT_SHUFFLE_FETCH_FAILURES_RESULT_STAGE_DELAY` (int, default `0`, new). Counterpart of the above for `ResultStage` consumers. With the default `0`, when a `ResultStage` is the consumer of a pending corruption it is corrupted *before* the result tasks dispatch, so the result stage has zero finished tasks when `INJECT_SHUFFLE_FORCE_CHECKSUM_MISMATCH_ON_RECOMPUTE` later triggers `rollbackSucceedingStages` (the rollback path would otherwise abort a partially-finished result stage, since OSS Spark does not support rolling them back). Set to N > 0 to defer until N result-stage tasks have succeeded - the only way to actually exercise the result-stage abort path.

- `INJECT_SHUFFLE_FORCE_CHECKSUM_MISMATCH_ON_RECOMPUTE` (boolean, default false, new). After a downstream `FetchFailed` forces the producer's partition 0 to be recomputed, the recomputed task's `MapStatus` registration is artificially flagged as a checksum mismatch. The DAGScheduler then runs `rollbackSucceedingStages`, which clears the downstream `ShuffleMapStage`'s outputs and forces a full retry of that stage. `ResultStage` downstreams are aborted (OSS Spark does not support rolling them back, SPARK-25342).

### Why are the changes needed?

The existing `INJECT_SHUFFLE_FETCH_FAILURES` flag corrupts attempt 0 of every shuffle map stage, but only leaf shuffle map stages succeed on attempt 0. Non-leaf stages fail-fetch from corrupted upstream on attempt 0 and are never themselves corrupted, so unit tests cannot exercise the full range of stage-retry shapes that production hits (metric stability under non-determinism, rollback for indeterminate producers, SLAM semantics across retries). The new knobs let tests deterministically reach those shapes.

### Does this PR introduce _any_ user-facing change?

No. All flags are test-only (gated by `Utils.isTesting`) and are under `spark.testing.*`.

### How was this patch tested?

- `MetricsFailureInjectionSuite` (12 tests). Existing tests continue to pass with the new default semantics. The non-deterministic-stage test sees stage-2 raw metric overcount under the new default (because checksum-mismatch rollback now fires on the non-determinism); SLAM remains stable, which is the point of that test.

- New test `Three stage metrics force-checksum-mismatch with delayed corruption`: with `shuffle.partitions=20` (much greater than the test's `local[2]` cores) and `delay=1`, the rollback re-plays at least one already-completed stage-2 partition on top of the full re-run, putting the raw metric strictly above the recompute-only baseline.

- New test `Force checksum mismatch aborts a downstream ResultStage`: 2-stage `groupBy().count()` query where stage 2 is a `ResultStage`. With `RESULT_STAGE_DELAY=1` (opted-in) and `INJECT_SHUFFLE_FORCE_CHECKSUM_MISMATCH_ON_RECOMPUTE`, one result task succeeds before the FetchFailed cascade; the forced checksum mismatch on stage 1's mapper-0 recompute then fires `rollbackSucceedingStages`, which sees `numMissingPartitions < numTasks` on the result stage and aborts. The query throws a `SparkException` with `"indeterminate"` in the message.

- `SQLLastAttemptMetricPlanShapesSuite` (220 tests, was previously parameterised on `stageRetries: Boolean`, now on a tri-valued `FailureMode`: `NoFailure`, `FetchFailure`, `ChecksumMismatch`). Each plan shape is now also exercised under forced checksum-mismatch rollback; SLAM values remain stable.

- `SQLLastAttemptMetricIntegrationSuiteWithChecksumMismatch` (new subclass): runs the full integration suite with `INJECT_SHUFFLE_FETCH_FAILURES + INJECT_SHUFFLE_FORCE_CHECKSUM_MISMATCH_ON_RECOMPUTE` enabled.

- `core/scalastyle`, `sql/scalastyle`, and 149 `DAGSchedulerSuite` tests pass.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code, Opus 4.7.

Closes #55738 from juliuszsompolski/retry-injection-infra.

Lead-authored-by: Juliusz Sompolski <julek@databricks.com>
Co-authored-by: Juliusz Sompolski <Juliusz Sompolski>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit c101701)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan added a commit that referenced this pull request Jun 22, 2026
…TCH_FAILURES injection maps

### What changes were proposed in this pull request?

Followup to #55738.

The three `INJECT_SHUFFLE_FETCH_FAILURES` injection-state maps in `DAGScheduler` (`injectShuffleFetchFailuresCorruptedAttempt`, `injectShuffleFetchFailuresPendingDelayedCorruption`, `injectShuffleFetchFailuresDownstreamSuccessCount`) are initialized with `if (Utils.isTesting) new ConcurrentHashMap else null`. This patch allocates them unconditionally instead.

### Why are the changes needed?

The cleanup site in `cleanupStateForJobAndIndependentStages` (and the other use-sites) re-evaluate `if (Utils.isTesting)` to decide whether to dereference these maps. `Utils.isTesting` reads the mutable `spark.testing` system property, so it can return a different value when the `DAGScheduler` is constructed than at the later use-sites. If it does, a `null` map is dereferenced and the `DAGScheduler` event loop crashes with a `NullPointerException`. Allocating the maps unconditionally makes the construction guard and the use-site guards consistent, so they can never disagree.

### Does this PR introduce _any_ user-facing change?

No. The maps are only ever populated inside the config-gated `INJECT_SHUFFLE_FETCH_FAILURES` test paths, so in production they stay empty and there is no behavioral change beyond allocating an empty map.

### How was this patch tested?

Existing tests. This is a defensive initialization change; the maps stay empty unless the existing `INJECT_SHUFFLE_FETCH_FAILURES` test paths populate them, so the existing coverage exercises them unchanged.

### Was this patch authored or co-authored using generative AI tooling?

Yes, drafted with Claude Code (Anthropic).

Closes #56631 from cloud-fan/SPARK-56773-followup.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan added a commit that referenced this pull request Jun 22, 2026
…TCH_FAILURES injection maps

### What changes were proposed in this pull request?

Followup to #55738.

The three `INJECT_SHUFFLE_FETCH_FAILURES` injection-state maps in `DAGScheduler` (`injectShuffleFetchFailuresCorruptedAttempt`, `injectShuffleFetchFailuresPendingDelayedCorruption`, `injectShuffleFetchFailuresDownstreamSuccessCount`) are initialized with `if (Utils.isTesting) new ConcurrentHashMap else null`. This patch allocates them unconditionally instead.

### Why are the changes needed?

The cleanup site in `cleanupStateForJobAndIndependentStages` (and the other use-sites) re-evaluate `if (Utils.isTesting)` to decide whether to dereference these maps. `Utils.isTesting` reads the mutable `spark.testing` system property, so it can return a different value when the `DAGScheduler` is constructed than at the later use-sites. If it does, a `null` map is dereferenced and the `DAGScheduler` event loop crashes with a `NullPointerException`. Allocating the maps unconditionally makes the construction guard and the use-site guards consistent, so they can never disagree.

### Does this PR introduce _any_ user-facing change?

No. The maps are only ever populated inside the config-gated `INJECT_SHUFFLE_FETCH_FAILURES` test paths, so in production they stay empty and there is no behavioral change beyond allocating an empty map.

### How was this patch tested?

Existing tests. This is a defensive initialization change; the maps stay empty unless the existing `INJECT_SHUFFLE_FETCH_FAILURES` test paths populate them, so the existing coverage exercises them unchanged.

### Was this patch authored or co-authored using generative AI tooling?

Yes, drafted with Claude Code (Anthropic).

Closes #56631 from cloud-fan/SPARK-56773-followup.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 41cc304)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants