Skip to content

[MINOR] Reference-count shared Spark context to prevent concurrent shutdown#2516

Merged
Baunsgaard merged 1 commit into
apache:mainfrom
Baunsgaard:refcount-spark-context
Jun 27, 2026
Merged

[MINOR] Reference-count shared Spark context to prevent concurrent shutdown#2516
Baunsgaard merged 1 commit into
apache:mainfrom
Baunsgaard:refcount-spark-context

Conversation

@Baunsgaard

Copy link
Copy Markdown
Contributor

The JVM-wide singleton SparkContext in SparkExecutionContext was stopped unconditionally at the end of every DMLScript.execute(). With surefire parallel tests (parallel=classes, threadCount=2), one finishing execution would stop the context while a concurrent execution still had an in-flight Spark job.

Change

  • Make close() reference-counted. SparkExecutionContext.enterSparkExecution() is called from DMLScript.execute() when the execution context is a SparkExecutionContext, and close() only stops the shared context once the active-execution count reaches zero.

Notes

  • Single-run behaviour is unchanged (enter -> 1, close -> 0 -> stop).
  • The counter is balanced: enter and close run under the same condition at the same single site, with close() always in the finally block.
  • Only the DMLScript.execute() path that owns the context is covered; MLContext manages an externally-owned context separately and is unaffected.

@codecov

codecov Bot commented Jun 26, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 83.33333% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.57%. Comparing base (d4e7def) to head (e51b6eb).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
.../controlprogram/context/SparkExecutionContext.java 85.00% 2 Missing and 1 partial ⚠️
src/main/java/org/apache/sysds/api/DMLScript.java 75.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##               main    #2516   +/-   ##
=========================================
  Coverage     71.56%   71.57%           
+ Complexity    49125    49123    -2     
=========================================
  Files          1575     1575           
  Lines        189784   189814   +30     
  Branches      37232    37239    +7     
=========================================
+ Hits         135823   135853   +30     
+ Misses        43470    43467    -3     
- Partials      10491    10494    +3     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Baunsgaard Baunsgaard force-pushed the refcount-spark-context branch 3 times, most recently from 4c40609 to 530be01 Compare June 26, 2026 16:56
The singleton JVM-wide SparkContext in SparkExecutionContext was stopped
unconditionally at the end of every DMLScript.execute() run. When two DML
script executions run concurrently in the same JVM (e.g. surefire parallel
tests with parallel=classes and threadCount=2), one execution finishing its
script would call close() and stop the context while the other execution
still had an in-flight spark job. The wedged job never receives a completion
event from the now-stopped DAGScheduler, so it hangs until the 1000s test
watchdog fires, surfacing as "Job N cancelled because SparkContext was shut
down" and intermittently blowing the suite's CI time budget.

Reference-count the active users of the shared context and separate counting
from teardown:
- enterSparkExecution()/exitSparkExecution() maintain a static count of active
  DML executions (DMLScript.execute registers after creating a
  SparkExecutionContext and releases in the finally block).
- close() never changes the count; it only stops the context when no registered
  execution remains. This way an unpaired close() (a caller that borrows the
  shared context but never registered, e.g. tests or parfor children) can no
  longer tear down a context another execution is still using, and a finishing
  execution cannot stop the context mid-job.
Single-run behavior is unchanged (enter -> 1, exit -> 0, close -> stop).

When the context is force-discarded outside close() (resetSparkContextStatic(),
and the re-init path in getSparkContextStatic() when the previous context was
stopped), the count is reset to zero so a stale registration cannot skip a
future legitimate stop. The keep-alive branch logs at debug level for
diagnosability.

The invariant covers only the DMLScript.execute() path that owns the context;
MLContext manages an externally-owned context separately and is unaffected.

SparkContextReferenceCountTest covers two scenarios that both fail on the old
unconditional-stop code and pass now: a finishing execution keeps the context
alive while another is active, and an unpaired close() does not stop a context
another execution still uses.
@Baunsgaard Baunsgaard force-pushed the refcount-spark-context branch from 530be01 to e51b6eb Compare June 27, 2026 16:46
@Baunsgaard Baunsgaard merged commit 80022a2 into apache:main Jun 27, 2026
50 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in SystemDS PR Queue Jun 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

1 participant