[SPARK-53209][YARN] Add ActiveProcessorCount JVM option to YARN executor and AM by jzhuge · Pull Request #51948 · apache/spark

jzhuge · 2025-08-09T06:50:08Z

What changes were proposed in this pull request?

When the Spark driver and executors start on YARN, the JVM process can discover all CPU cores on the node and size thread pools or GC thread counts based on that value. We should limit the core count the JVM sees to the number of cores set by the user, using -XX:ActiveProcessorCount, which was introduced in Java 8u191.

Adds three boolean config flags (default false):

spark.yarn.am.limitActiveProcessorCount.enabled: sets -XX:ActiveProcessorCount=<spark.yarn.am.cores> in the YARN AM JVM (client mode).
spark.driver.limitActiveProcessorCount.enabled: sets -XX:ActiveProcessorCount=<spark.driver.cores> in the YARN AM JVM (cluster mode).
spark.executor.limitActiveProcessorCount.enabled: sets -XX:ActiveProcessorCount=<spark.executor.cores> in executor JVMs on YARN.
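
As a rough illustration of the behavior described above (not the code in this PR), the sketch below builds the JVM option only when the corresponding flag is enabled. The class name `ActiveProcessorOption` and the method `jvmOption` are hypothetical; the real logic lives in the YARN module's Client and ExecutorRunnable classes and may differ.

```java
import java.util.Optional;

public class ActiveProcessorOption {

    // Hypothetical helper, assumed for illustration only: returns the JVM
    // option to append when the limit flag is enabled, or empty when it is
    // disabled (the default per the PR description).
    static Optional<String> jvmOption(boolean enabled, int cores) {
        if (!enabled) {
            return Optional.empty();
        }
        if (cores < 1) {
            throw new IllegalArgumentException("cores must be >= 1, got " + cores);
        }
        return Optional.of("-XX:ActiveProcessorCount=" + cores);
    }

    public static void main(String[] args) {
        // Flag enabled with 4 cores configured: the option is emitted.
        System.out.println(jvmOption(true, 4).orElse("(no option added)"));
        // Flag disabled: the JVM command line is left untouched.
        System.out.println(jvmOption(false, 4).orElse("(no option added)"));
    }
}
```

In this sketch the `cores` argument would come from spark.yarn.am.cores, spark.driver.cores, or spark.executor.cores, depending on which JVM is being launched.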

Why are the changes needed?

Without this change, the JVM discovers all CPU cores on the YARN node rather than the cores allocated to the container. Users have assigned driver and executors a number of cores and we should honor that. A simple test would be:
Runtime.getRuntime().availableProcessors()
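
The check above can be wrapped in a small standalone program to compare the value with and without the JVM option. The class name `AvailableProcessorsCheck` is illustrative and not part of the patch.

```java
public class AvailableProcessorsCheck {

    // Number of processors the JVM believes it can use. Without
    // -XX:ActiveProcessorCount this normally reflects the cores visible
    // to the process, not the cores requested from YARN.
    static int visibleCores() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("availableProcessors = " + visibleCores());
    }
}
```

After compiling with javac, running it as `java -XX:ActiveProcessorCount=2 AvailableProcessorsCheck` should report 2 regardless of how many cores the host has.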

Does this PR introduce any user-facing change?

Yes — three new public configuration keys.

How was this patch tested?

New unit tests in ClientSuite and ExecutorRunnableSuite.

Co-authored-by: Shanyu Zhao shzhao@microsoft.com

pan3793 · 2025-08-09T17:02:38Z

what about the driver JVM options in YARN client mode?

jzhuge · 2025-08-11T04:34:37Z

3 tests failed with the same error:

- SPARK-53209: ActiveProcessorCount defaults to 1 in cluster mode when driver cores not set *** FAILED *** (53 milliseconds)
[info]   java.lang.IllegalStateException: Library directory '/home/runner/work/spark/spark/resource-managers/yarn/assembly/target/scala-2.13/jars' does not exist; make sure Spark is built.
[info]   at org.apache.spark.launcher.CommandBuilderUtils.checkState(CommandBuilderUtils.java:230)

WweiL

Thank you for picking this up!

jzhuge · 2025-08-11T17:06:18Z

> what about the driver JVM options in YARN client mode?

Client mode is out of scope for this JIRA.

Several things to consider for client mode:

Driver runs inside client JVM
No Spark option or conf to set driver cores
If necessary, users can use env var to set this Java option
Possible code changes will reside in launcher, instead of yarn module, so a follow-up JIRA is preferred.

pan3793 · 2025-08-11T17:27:35Z

> what about the driver JVM options in YARN client mode?
>
> Client mode is out of scope for this JIRA.
>
> Several things to consider for client mode:
>
> Driver runs inside client JVM
> No Spark option or conf to set driver cores
> If necessary, users can use env var to set this Java option
> Possible code changes will reside in launcher, instead of yarn module, so a follow-up JIRA is preferred.

Sounds reasonable.

nit: maybe the title should be "... to YARN executor and AM" instead of "... to YARN executor and driver"?

jzhuge · 2025-08-11T19:38:24Z

> nit: maybe the title should be "... to YARN executor and AM" instead of "... to YARN executor and driver"?

How about "Add ActiveProcessorCount JVM option to Spark driver and executor in YARN mode"

pan3793 · 2025-08-12T02:36:53Z

@jzhuge, my point is that "AM" is more consistent with your change than "driver": in client mode, the AM container's -XX:ActiveProcessorCount respects spark.yarn.am.cores; in cluster mode, the AM container is the driver, though it respects spark.driver.cores.

HyukjinKwon · 2025-08-13T00:15:32Z

Seems fine. cc @mridulm FYI

jzhuge · 2025-08-16T21:03:47Z

@mridulm @HyukjinKwon Just wanted to check in on the review for this, thanks!

mridulm

While technically correct as a change, let us flag-guard it. There are too many cases where applications implicitly assume they can flex cores (for example during GC, in Netty, etc.), and this would cause a regression.

jzhuge · 2025-08-24T19:27:54Z

> While technically correct as a change, let us flag-guard it. There are too many cases where applications implicitly assume they can flex cores (for example during GC, in Netty, etc.), and this would cause a regression.

Will add a flag after my 2-week vacation.

pan3793 · 2025-09-28T10:17:30Z

hi @jzhuge, do you have time to address the comment to move this PR forward?

jzhuge · 2025-09-28T16:52:22Z

> hi @jzhuge, do you have time to address the comment to move this PR forward?

Ah, it fell through the cracks :-(
Working on it ...

jzhuge · 2025-09-28T21:24:53Z

@pan3793 @mridulm @HyukjinKwon Added a feature flag, default to false.

Let me know whether I need to rebase or squash WIP commits.

jzhuge · 2025-09-28T21:27:07Z

Question: do we need separate flags for driver and executor?

jzhuge · 2025-09-29T07:33:53Z

Many tests failed, let me rebase.

jzhuge · 2025-09-30T17:53:41Z

Hmm, 2 unrelated test failures in these modules:

kafka

[info] *** 1 TEST FAILED ***
[error] Failed tests:
[error] 	org.apache.spark.sql.kafka010.KafkaMicroBatchV1SourceWithConsumerSuite
[error] (sql-kafka-0-10 / Test / test) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 2715 s (45:15), completed Sep 30, 2025, 9:25:25 AM
[error] running /home/runner/work/spark/spark/build/sbt -Phadoop-3 -Pvolcano -Pkubernetes -Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud sql-kafka-0-10/test protobuf/test streaming/test streaming-kinesis-asl/test streaming-kafka-0-10/test token-provider-kafka-0-10/test connect/test connect-client-jvm/test kubernetes/test hadoop-cloud/test ; received return code 1

sparkr

[warn] 	Note: Unresolved dependencies path:
[error] sbt.librarymanagement.ResolveException: Error downloading org.scalaz:scalaz-effect_2.12:7.2.35
[error]   Not found
[error]   Not found
[error]   download error: Caught java.io.IOException (Server returned HTTP response code: 500 for URL: https://repo1.maven.org/maven2/org/scalaz/scalaz-effect_2.12/7.2.35/scalaz-effect_2.12-7.2.35.pom) while downloading https://repo1.maven.org/maven2/org/scalaz/scalaz-effect_2.12/7.2.35/scalaz-effect_2.12-7.2.35.pom

jzhuge · 2025-09-30T17:53:57Z

Retest please

jzhuge · 2026-03-23T16:11:38Z

Thanks @pan3793 for the review! Looking ...

jzhuge · 2026-03-23T17:06:08Z

> I wouldn't be worried about local here; I don't see it impacting it.
>
> @holdenk, this PR does not cover that case, but it's indeed a case that we can support in a follow-up. I do see some users running Spark in local mode on bare-metal machines, which usually have hundreds of cores, where each Spark app allocates only a few cores. Without such a limit, the CPU is easily exhausted by the global fork-join pool or GC threads.

Created SPARK-56157 for standalone and SPARK-56158 for local.
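
To make the fork-join concern concrete, here is a minimal sketch (class and method names are illustrative) that prints the core count the JVM sees next to the parallelism of the common fork-join pool. By default that parallelism is derived from availableProcessors(), so on a host with hundreds of cores an unconstrained JVM gets a correspondingly large pool.

```java
import java.util.concurrent.ForkJoinPool;

public class PoolSizing {

    // Cores the JVM believes it can use.
    static int visibleCores() {
        return Runtime.getRuntime().availableProcessors();
    }

    // Target parallelism of the JVM-wide common fork-join pool.
    static int commonPoolParallelism() {
        return ForkJoinPool.getCommonPoolParallelism();
    }

    public static void main(String[] args) {
        System.out.println("availableProcessors   = " + visibleCores());
        System.out.println("commonPool parallelism = " + commonPoolParallelism());
    }
}
```

Running this with and without `-XX:ActiveProcessorCount=<n>` shows how the option shrinks the pool along with the reported core count, assuming the `java.util.concurrent.ForkJoinPool.common.parallelism` system property is not set explicitly.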

jzhuge · 2026-03-24T00:43:01Z

@pan3793 Thanks for the feedback! The changes are cleaner. Please take another look.

jzhuge · 2026-03-27T01:12:33Z

The SQL test failure seems unrelated.

jzhuge · 2026-03-27T07:41:23Z

Unrelated test failures in sql - other tests

pan3793 · 2026-03-30T01:25:24Z

> Unrelated test failures in sql - other tests

Since the UT runs on your forked repo, you have permission to rerun the single failed job.

pan3793 · 2026-03-30T01:26:32Z

I'm going to merge this if there are no further comments in 24 hours.

jzhuge · 2026-03-30T18:11:49Z

This seems unrelated:

[info] *** 1 SUITE ABORTED ***
[error] Error during tests:
[error] 	org.apache.spark.sql.jdbc.v2.join.OracleJoinPushdownIntegrationSuite
[error] (docker-integration-tests / Test / test) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 4795 s (01:19:55.0), completed Mar 30, 2026, 5:52:10 AM

Error:  running /home/runner/work/spark/spark/build/sbt -Phadoop-3 -Pdocker-integration-tests -Dtest.include.tags=org.apache.spark.tags.DockerTest docker-integration-tests/test ; received return code 1
Error: Process completed with exit code 18.

pan3793 · 2026-03-31T05:17:50Z

merged to master, thank you, @jzhuge and all reviewers

jzhuge · 2026-03-31T07:44:05Z

Thanks @pan3793 for reviewing and merging the pr, @mridulm @HyukjinKwon @holdenk @WweiL for the review!

github-actions Bot added the YARN label Aug 9, 2025

WweiL reviewed Aug 10, 2025

Comment thread resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala

WweiL reviewed Aug 10, 2025

Comment thread resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala Outdated

WweiL reviewed Aug 11, 2025

Comment thread resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ExecutorRunnableSuite.scala

WweiL approved these changes Aug 11, 2025

jzhuge changed the title ~~[SPARK-53209][YARN] Add ActiveProcessorCount JVM option to YARN executor and driver~~ [SPARK-53209][YARN] Add ActiveProcessorCount JVM option to Spark driver and executor in Yarn mode Aug 11, 2025

jzhuge changed the title ~~[SPARK-53209][YARN] Add ActiveProcessorCount JVM option to Spark driver and executor in Yarn mode~~ [SPARK-53209][YARN] Add ActiveProcessorCount JVM option to Spark driver and executor in YARN mode Aug 11, 2025

jzhuge changed the title ~~[SPARK-53209][YARN] Add ActiveProcessorCount JVM option to Spark driver and executor in YARN mode~~ [SPARK-53209][YARN] Add ActiveProcessorCount JVM option to YARN executor and AM Aug 12, 2025

pan3793 previously approved these changes Aug 12, 2025

mridulm reviewed Aug 24, 2025

github-actions Bot added the CORE label Sep 28, 2025

jzhuge force-pushed the SPARK-53209 branch from 2303e50 to 33a3b9f Compare September 29, 2025 07:46

jzhuge force-pushed the SPARK-53209 branch 2 times, most recently from 63afcaf to 62839fb Compare October 5, 2025 07:33

jzhuge marked this pull request as draft March 23, 2026 16:51

jzhuge force-pushed the SPARK-53209 branch 2 times, most recently from 8ffc12e to f41dd72 Compare March 23, 2026 20:52

jzhuge marked this pull request as ready for review March 24, 2026 00:42

pan3793 reviewed Mar 24, 2026

Comment thread core/src/main/scala/org/apache/spark/internal/config/package.scala Outdated

pan3793 reviewed Mar 24, 2026

Comment thread resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala Outdated

pan3793 approved these changes Mar 26, 2026

pan3793 reviewed Mar 26, 2026

Comment thread resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ExecutorRunnableSuite.scala Outdated

pan3793 reviewed Mar 26, 2026

Comment thread core/src/main/scala/org/apache/spark/internal/config/package.scala Outdated

pan3793 reviewed Mar 26, 2026

Comment thread core/src/main/scala/org/apache/spark/internal/config/package.scala Outdated

pan3793 reviewed Mar 26, 2026

Comment thread core/src/main/scala/org/apache/spark/internal/config/package.scala Outdated

jzhuge force-pushed the SPARK-53209 branch from 8536a51 to a4f765b Compare March 27, 2026 01:13

[SPARK-53209][YARN] Add ActiveProcessorCount JVM option to YARN execu…

5c88276

…tor and AM Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

jzhuge force-pushed the SPARK-53209 branch from a4f765b to 5c88276 Compare March 30, 2026 04:15

pan3793 closed this in 6e8c690 Mar 31, 2026

jzhuge deleted the SPARK-53209 branch April 1, 2026 04:12

This was referenced Apr 5, 2026

[SPARK-56157][CORE] Support limitActiveProcessorCount in standalone mode #55190

Open

[SPARK-56158][CORE] Support limitActiveProcessorCount in local mode #55132

Open

dongjoon-hyun mentioned this pull request Jun 11, 2026

[SPARK-57377][INFRA] Add CI check to prevent new entries in the config binding policy exceptions file #56437

Closed
