[SPARK-53209][YARN] Add ActiveProcessorCount JVM option to YARN executor and AM#51948

Closed
jzhuge wants to merge 1 commit into apache:master from jzhuge:SPARK-53209

Conversation

@jzhuge

@jzhuge jzhuge commented Aug 9, 2025

Member

What changes were proposed in this pull request?

When the Spark driver and executors start on YARN, the JVM process can discover all CPU cores on the node and size its thread pools or GC thread counts from that value. We should limit the core count the JVM sees to the number of cores set by the user, using -XX:ActiveProcessorCount, which was introduced in Java 8u191.

Adds three boolean config flags (default false):

  • spark.yarn.am.limitActiveProcessorCount.enabled: sets -XX:ActiveProcessorCount=<spark.yarn.am.cores> in the YARN AM JVM (client mode).
  • spark.driver.limitActiveProcessorCount.enabled: sets -XX:ActiveProcessorCount=<spark.driver.cores> in the YARN AM JVM (cluster mode).
  • spark.executor.limitActiveProcessorCount.enabled: sets -XX:ActiveProcessorCount=<spark.executor.cores> in executor JVMs on YARN.
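
As a rough illustration of how the three flags above map to a JVM option, here is a minimal Java sketch. It is not the code merged in this PR: the class `ActiveProcessorCountOption`, the `Role` enum, and the `build` method are hypothetical names, and the fallback of 1 core when a cores key is unset is an assumption (suggested for cluster mode by a test name later in this conversation).

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper for illustration only; not the code merged in this PR. */
final class ActiveProcessorCountOption {

    /** Which JVM is being launched on YARN. */
    enum Role { AM_CLIENT_MODE, AM_CLUSTER_MODE, EXECUTOR }

    /**
     * Returns the JVM option for the given role, or empty when the
     * corresponding flag is absent or false (the default).
     */
    static Optional<String> build(Map<String, String> conf, Role role) {
        String flagKey;
        String coresKey;
        switch (role) {
            case AM_CLIENT_MODE:
                flagKey = "spark.yarn.am.limitActiveProcessorCount.enabled";
                coresKey = "spark.yarn.am.cores";
                break;
            case AM_CLUSTER_MODE:
                flagKey = "spark.driver.limitActiveProcessorCount.enabled";
                coresKey = "spark.driver.cores";
                break;
            default:
                flagKey = "spark.executor.limitActiveProcessorCount.enabled";
                coresKey = "spark.executor.cores";
        }
        if (!Boolean.parseBoolean(conf.getOrDefault(flagKey, "false"))) {
            return Optional.empty();
        }
        // Assumption: fall back to 1 core when the cores key is unset.
        int cores = Integer.parseInt(conf.getOrDefault(coresKey, "1"));
        return Optional.of("-XX:ActiveProcessorCount=" + cores);
    }

    public static void main(String[] args) {
        Map<String, String> conf = Map.of(
            "spark.executor.limitActiveProcessorCount.enabled", "true",
            "spark.executor.cores", "4");
        // Executor flag is enabled, so an option is produced.
        System.out.println(build(conf, Role.EXECUTOR));
        // Driver flag is not set, so nothing is added for the cluster-mode AM.
        System.out.println(build(conf, Role.AM_CLUSTER_MODE));
    }
}
```

Each flag is independent in this sketch, so enabling the executor flag alone leaves the AM JVM options untouched.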

Why are the changes needed?

Without this change, the JVM discovers all CPU cores on the YARN node rather than the cores allocated to the container. Users have assigned driver and executors a number of cores and we should honor that. A simple test would be:
Runtime.getRuntime().availableProcessors()
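
The one-line check above can be wrapped in a small standalone class to compare a JVM started with and without the option. This is a sketch; the class name `ShowProcessors` is made up for illustration.

```java
/** Prints the processor count that the current JVM reports. */
final class ShowProcessors {

    /** The value thread pools and GC heuristics typically size themselves from. */
    static int visibleProcessors() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("availableProcessors = " + visibleProcessors());
    }
}
```

Started as `java -XX:ActiveProcessorCount=2 ShowProcessors`, the class should report 2 regardless of how many cores the host has; without the option it reports whatever the JVM detects.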

Does this PR introduce any user-facing change?

Yes — three new public configuration keys.

How was this patch tested?

New unit tests in ClientSuite and ExecutorRunnableSuite.

Co-authored-by: Shanyu Zhao shzhao@microsoft.com

@github-actions github-actions Bot added the YARN label Aug 9, 2025
@pan3793

pan3793 commented Aug 9, 2025

Member

what about the driver JVM options in YARN client mode?

@jzhuge

jzhuge commented Aug 11, 2025

Member Author

3 tests failed with the same error:

- SPARK-53209: ActiveProcessorCount defaults to 1 in cluster mode when driver cores not set *** FAILED *** (53 milliseconds)
[info]   java.lang.IllegalStateException: Library directory '/home/runner/work/spark/spark/resource-managers/yarn/assembly/target/scala-2.13/jars' does not exist; make sure Spark is built.
[info]   at org.apache.spark.launcher.CommandBuilderUtils.checkState(CommandBuilderUtils.java:230)

@WweiL WweiL left a comment

Contributor

Thank you for picking this up!

@jzhuge

jzhuge commented Aug 11, 2025

Member Author

> what about the driver JVM options in YARN client mode?

Client mode is out of scope for this JIRA.

Several things to consider for client mode:

  • Driver runs inside client JVM
  • No Spark option or conf to set driver cores
  • If necessary, users can use env var to set this Java option
  • Possible code changes will reside in launcher, instead of yarn module, so a follow-up JIRA is preferred.
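
For the environment-variable route mentioned in the list above, one possibility is `JAVA_TOOL_OPTIONS`, which HotSpot JVMs read at startup. Whether that is the variable the author had in mind is an assumption, and the class and method names below are hypothetical. The sketch only prepares a child-process environment and does not start anything.

```java
import java.util.Map;

/** Hypothetical launcher sketch for a client-mode driver JVM. */
final class ClientModeLauncherEnv {

    /** Appends the processor-count option to JAVA_TOOL_OPTIONS, keeping any existing value. */
    static void limitProcessors(Map<String, String> env, int cores) {
        String option = "-XX:ActiveProcessorCount=" + cores;
        String existing = env.get("JAVA_TOOL_OPTIONS");
        env.put("JAVA_TOOL_OPTIONS",
            existing == null || existing.isEmpty() ? option : existing + " " + option);
    }

    public static void main(String[] args) {
        // The command and its arguments are placeholders; the process is never started here.
        ProcessBuilder pb = new ProcessBuilder("spark-submit", "--master", "yarn", "app.jar");
        limitProcessors(pb.environment(), 2);
        System.out.println(pb.environment().get("JAVA_TOOL_OPTIONS"));
    }
}
```

Calling `pb.start()` would launch the child with the modified environment, so any JVM it starts would pick up the option.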

@pan3793

pan3793 commented Aug 11, 2025

Member

> > what about the driver JVM options in YARN client mode?
>
> Client mode is out of scope for this JIRA.
>
> Several things to consider for client mode:
>
>   • Driver runs inside client JVM
>   • No Spark option or conf to set driver cores
>   • If necessary, users can use env var to set this Java option
>   • Possible code changes will reside in launcher, instead of yarn module, so a follow-up JIRA is preferred.

Sounds reasonable.

nit: maybe the title should be "... to YARN executor and AM" instead of "... to YARN executor and driver"?

@jzhuge jzhuge changed the title [SPARK-53209][YARN] Add ActiveProcessorCount JVM option to YARN executor and driver [SPARK-53209][YARN] Add ActiveProcessorCount JVM option to Spark driver and executor in Yarn mode Aug 11, 2025
@jzhuge jzhuge changed the title [SPARK-53209][YARN] Add ActiveProcessorCount JVM option to Spark driver and executor in Yarn mode [SPARK-53209][YARN] Add ActiveProcessorCount JVM option to Spark driver and executor in YARN mode Aug 11, 2025
@jzhuge

jzhuge commented Aug 11, 2025

Member Author

> nit: maybe the title should be "... to YARN executor and AM" instead of "... to YARN executor and driver"?

How about "Add ActiveProcessorCount JVM option to Spark driver and executor in YARN mode"

@pan3793

pan3793 commented Aug 12, 2025

Member

@jzhuge, my point is, "AM" is more consistent with your change than "driver" - in client mode, AM container -XX:ActiveProcessorCount respects spark.yarn.am.cores; in cluster mode, AM container is driver, though it respects spark.driver.cores

@jzhuge jzhuge changed the title [SPARK-53209][YARN] Add ActiveProcessorCount JVM option to Spark driver and executor in YARN mode [SPARK-53209][YARN] Add ActiveProcessorCount JVM option to YARN executor and AM Aug 12, 2025
pan3793 previously approved these changes Aug 12, 2025
@HyukjinKwon

Member

Seems fine. cc @mridulm FYI

@jzhuge

jzhuge commented Aug 16, 2025

Member Author

@mridulm @HyukjinKwon Just wanted to check in on the review for this, thanks!

@mridulm mridulm left a comment

Contributor

While technically correct as a change, let us flag-guard it: there are too many cases where applications implicitly assume they can flex cores (for example during GC, in Netty, etc.), and this will cause a regression.

@jzhuge

jzhuge commented Aug 24, 2025

Member Author

> While technically correct as a change, let us flag-guard it: there are too many cases where applications implicitly assume they can flex cores (for example during GC, in Netty, etc.), and this will cause a regression.

Will add a flag after my two-week vacation.

@pan3793

pan3793 commented Sep 28, 2025

Member

hi @jzhuge, do you have time to address the comment to move this PR forward?

@jzhuge

jzhuge commented Sep 28, 2025

Member Author

> hi @jzhuge, do you have time to address the comment to move this PR forward?

Ah, it fell through the cracks :-(
Working on it ...

@github-actions github-actions Bot added the CORE label Sep 28, 2025
@jzhuge

jzhuge commented Sep 28, 2025

Member Author

@pan3793 @mridulm @HyukjinKwon Added a feature flag, default to false.

Let me know whether I need to rebase or squash WIP commits.

@jzhuge

jzhuge commented Sep 28, 2025

Member Author

Question: do we need separate flags for driver and executor?

@jzhuge

jzhuge commented Sep 29, 2025

Member Author

Many tests failed, let me rebase.

@jzhuge

jzhuge commented Sep 30, 2025

Member Author

Hmm, 2 unrelated test failures in these modules:

kafka

[info] *** 1 TEST FAILED ***
[error] Failed tests:
[error] 	org.apache.spark.sql.kafka010.KafkaMicroBatchV1SourceWithConsumerSuite
[error] (sql-kafka-0-10 / Test / test) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 2715 s (45:15), completed Sep 30, 2025, 9:25:25 AM
[error] running /home/runner/work/spark/spark/build/sbt -Phadoop-3 -Pvolcano -Pkubernetes -Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud sql-kafka-0-10/test protobuf/test streaming/test streaming-kinesis-asl/test streaming-kafka-0-10/test token-provider-kafka-0-10/test connect/test connect-client-jvm/test kubernetes/test hadoop-cloud/test ; received return code 1

sparkr

[warn] 	Note: Unresolved dependencies path:
[error] sbt.librarymanagement.ResolveException: Error downloading org.scalaz:scalaz-effect_2.12:7.2.35
[error]   Not found
[error]   Not found
[error]   download error: Caught java.io.IOException (Server returned HTTP response code: 500 for URL: https://repo1.maven.org/maven2/org/scalaz/scalaz-effect_2.12/7.2.35/scalaz-effect_2.12-7.2.35.pom) while downloading https://repo1.maven.org/maven2/org/scalaz/scalaz-effect_2.12/7.2.35/scalaz-effect_2.12-7.2.35.pom

@jzhuge

jzhuge commented Sep 30, 2025

Member Author

Retest please

@jzhuge jzhuge force-pushed the SPARK-53209 branch 2 times, most recently from 63afcaf to 62839fb Compare October 5, 2025 07:33
@jzhuge

jzhuge commented Mar 23, 2026

Member Author

Thanks @pan3793 for the review! Looking ...

@jzhuge jzhuge marked this pull request as draft March 23, 2026 16:51
@jzhuge

jzhuge commented Mar 23, 2026

Member Author

I wouldn't be worried about local here I don't see it impacting it.

@holdenk, this PR does not cover that case, but it's indeed a case that we can support in a follow-up. I do see some users running Spark in local mode on bare-metal machines, which usually have hundreds of cores, while each Spark app only allocates a few cores. Without such a limit, the CPU is easily exhausted by the global fork-join pool or GC threads.

Created SPARK-56157 for standalone and SPARK-56158 for local.
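
To make the fork-join concern concrete, the sketch below prints how the JDK's common pool and a typical hand-sized executor relate to the processor count the JVM reports. The class and method names are illustrative only, and the sizing of the second pool is a common pattern rather than anything taken from Spark.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;

/** Shows pool sizes that are derived from the reported processor count. */
final class PoolSizing {

    /** Parallelism of the JVM-wide common pool (used by parallel streams, for example). */
    static int commonPoolParallelism() {
        return ForkJoinPool.getCommonPoolParallelism();
    }

    /** A common application pattern: one worker thread per reported processor. */
    static ExecutorService perCorePool() {
        return Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    }

    public static void main(String[] args) {
        System.out.println("availableProcessors   = " + Runtime.getRuntime().availableProcessors());
        System.out.println("commonPool parallelism = " + commonPoolParallelism());
        ExecutorService pool = perCorePool();
        pool.shutdown(); // created only to illustrate the sizing; no tasks are submitted
    }
}
```

On a host with hundreds of cores both numbers follow the host size unless the JVM is started with `-XX:ActiveProcessorCount`, which is the situation the comment above describes.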

@jzhuge jzhuge force-pushed the SPARK-53209 branch 2 times, most recently from 8ffc12e to f41dd72 Compare March 23, 2026 20:52
@jzhuge jzhuge marked this pull request as ready for review March 24, 2026 00:42
@jzhuge

jzhuge commented Mar 24, 2026

Member Author

@pan3793 Thanks for the feedback! The changes are cleaner. Please take another look.

Comment thread core/src/main/scala/org/apache/spark/internal/config/package.scala Outdated
@jzhuge

jzhuge commented Mar 27, 2026

Member Author

SQL test failures seem unrelated.

@jzhuge

jzhuge commented Mar 27, 2026

Member Author

Unrelated test failures in sql - other tests

@pan3793

pan3793 commented Mar 30, 2026

Member

> Unrelated test failures in sql - other tests

Since the UT runs on your forked repo, you have permission to rerun the single failed job.

@pan3793

pan3793 commented Mar 30, 2026

Member

I'm going to merge this if there are no further comments in 24 hours.

…tor and AM

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jzhuge

jzhuge commented Mar 30, 2026

Member Author

This seems unrelated:

[info] *** 1 SUITE ABORTED ***
[error] Error during tests:
[error] 	org.apache.spark.sql.jdbc.v2.join.OracleJoinPushdownIntegrationSuite
[error] (docker-integration-tests / Test / test) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 4795 s (01:19:55.0), completed Mar 30, 2026, 5:52:10 AM

Error:  running /home/runner/work/spark/spark/build/sbt -Phadoop-3 -Pdocker-integration-tests -Dtest.include.tags=org.apache.spark.tags.DockerTest docker-integration-tests/test ; received return code 1
Error: Process completed with exit code 18.

@pan3793 pan3793 closed this in 6e8c690 Mar 31, 2026
@pan3793

pan3793 commented Mar 31, 2026

Member

merged to master, thank you, @jzhuge and all reviewers

@jzhuge

jzhuge commented Mar 31, 2026

Member Author

Thanks @pan3793 for reviewing and merging the PR, and @mridulm @HyukjinKwon @holdenk @WweiL for the review!

7 participants