[SPARK-33748][K8S] Respect environment variables and configurations for Python executables by HyukjinKwon · Pull Request #30735 · apache/spark

HyukjinKwon · 2020-12-11T13:03:10Z

What changes were proposed in this pull request?

This PR proposes:

Respect PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables, or spark.pyspark.python and spark.pyspark.driver.python configurations, in Kubernetes just like other cluster types in Spark.
Deprecate spark.kubernetes.pyspark.pythonVersion and guide users to set the environment variables and configurations for Python executables.
NOTE that spark.kubernetes.pyspark.pythonVersion is already a no-op configuration without this PR. The default is 3 and other values are disallowed.
In order for Python executable settings to be used consistently, fix the spark.archives option to unpack into the current working directory in the driver in Kubernetes cluster mode. This behaviour is identical to YARN's cluster mode. By doing this, users can leverage Conda or virtualenv in cluster mode as below:
```
 conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas conda-pack
 conda activate pyspark_conda_env
 conda pack -f -o pyspark_conda_env.tar.gz
 PYSPARK_PYTHON=./environment/bin/python spark-submit --archives pyspark_conda_env.tar.gz#environment app.py
```
Remove several pieces of unused code, such as extractS3Key and renameResourcesToLocalFS.
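For readers who prefer configurations over environment variables, the Conda command above can be expressed with spark.pyspark.python instead of PYSPARK_PYTHON. The following is only a sketch: `build_submit_args` and its parameters are hypothetical names introduced for illustration, and it assumes cluster mode, where the archive is unpacked under the alias given after `#`.

```python
from typing import List, Optional


def build_submit_args(app: str, archive: str, alias: str = "environment",
                      driver_python: Optional[str] = None) -> List[str]:
    """Build a spark-submit argument list (hypothetical helper, not part of Spark)."""
    # The archive is unpacked into ./<alias>, so the interpreter lives below it.
    python_exec = f"./{alias}/bin/python"
    args = [
        "spark-submit",
        "--conf", f"spark.pyspark.python={python_exec}",
        "--archives", f"{archive}#{alias}",
    ]
    # Optionally use a different interpreter for the driver only.
    if driver_python is not None:
        args += ["--conf", f"spark.pyspark.driver.python={driver_python}"]
    args.append(app)
    return args
```

Calling `build_submit_args("app.py", "pyspark_conda_env.tar.gz")` yields the configuration-based counterpart of the command shown in the description.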

Why are the changes needed?

To provide consistent support for PySpark via the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables, or the spark.pyspark.python and spark.pyspark.driver.python configurations.
To provide Conda and virtualenv support via the spark.archives option.

Does this PR introduce any user-facing change?

Yes:

spark.kubernetes.pyspark.pythonVersion is deprecated.
PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables, and spark.pyspark.python and spark.pyspark.driver.python configurations are respected.

How was this patch tested?

Manually tested via:

minikube delete
minikube start --cpus 12 --memory 16384
kubectl create namespace spark-integration-test
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: spark-integration-test
EOF
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=spark-integration-test:spark --namespace=spark-integration-test
dev/make-distribution.sh --pip --tgz -Pkubernetes
resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh --spark-tgz `pwd`/spark-3.2.0-SNAPSHOT-bin-3.2.0.tgz  --service-account spark --namespace spark-integration-test

Unit tests were also added.

HyukjinKwon · 2020-12-11T13:03:41Z

cc @dongjoon-hyun and @erikerlandson FYI

HyukjinKwon · 2020-12-11T13:07:21Z

I read the code multiple times to be sure, and I think this code does duplicate work.
If I am not horribly wrong somewhere:

localResources are always local files, and resources will always be replaced with localResources.

If resources is null, then localResources will be null too

cc @skonto from 5e74570

HyukjinKwon · 2020-12-11T13:08:28Z

PYTHON_VERSION and pyv3 are apparently not used anywhere in the Spark code base.

HyukjinKwon · 2020-12-11T13:08:46Z

extractS3Key is not used either.

erikerlandson · 2020-12-11T14:20:29Z

We should not remove the setting spark.kubernetes.pyspark.pythonVersion, as that will break backward compatibility. IMO, this setting should override environment variables, if it is specified.

HyukjinKwon · 2020-12-11T14:29:05Z

@erikerlandson, spark.kubernetes.pyspark.pythonVersion cannot be set to any value other than 3, which is the default. Even if we remove spark.kubernetes.pyspark.pythonVersion, Spark still allows users to set a non-existent configuration. Therefore I believe there's no breaking change here. In addition, Kubernetes support is experimental so far, and experimental features generally accept breaking changes between minor releases.

erikerlandson · 2020-12-11T18:20:57Z

@HyukjinKwon is the issue that spark.kubernetes.pyspark.pythonVersion is purely a version, but PYSPARK_PYTHON and friends allow path names? Because either way Spark currently supports only Python 3.

dongjoon-hyun

@erikerlandson and @HyukjinKwon .
This PR seems to contain three orthogonal things.

First of all, can we discuss explicitly deprecating spark.kubernetes.pyspark.pythonVersion as a no-op in Apache Spark 3.1? Since it's effectively a no-op for now and @HyukjinKwon 's new approach may supersede it, I believe it's okay.
Second, python -> python3 as a new follow-up PR for SPARK-32447.
Finally, @HyukjinKwon 's proposal.

HyukjinKwon · 2020-12-11T23:49:28Z

Yes @erikerlandson, and thanks for the clarification @dongjoon-hyun. So, if users set the configuration to 2 in a Spark 2.4 application, that already wouldn't work because Python 2 was dropped. If users set it to 3, nothing breaks and it works as expected because the default Python is already 3. Additionally, Spark 3.1 now respects the Python environment variables if set, as it is supposed to and as Spark does in other places. I believe it's more of a bug fix.
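The argument above (only 3 is accepted, so the setting is effectively a no-op that can be deprecated with a warning) can be illustrated with a small sketch. This is not Spark's actual implementation: `check_python_version_conf` is a hypothetical helper, and the messages are invented for illustration.

```python
import warnings
from typing import Mapping

DEPRECATED_KEY = "spark.kubernetes.pyspark.pythonVersion"


def check_python_version_conf(conf: Mapping[str, str]) -> None:
    """Validate the deprecated setting (hypothetical helper, not Spark code)."""
    value = conf.get(DEPRECATED_KEY)
    if value is None:
        return  # Not set: nothing to validate and nothing to warn about.
    if value != "3":
        # Any value other than 3 was already rejected before the deprecation.
        raise ValueError(f"{DEPRECATED_KEY} only accepts 3, got {value!r}")
    # Setting it to 3 keeps working but has no effect beyond this warning.
    warnings.warn(
        f"{DEPRECATED_KEY} is deprecated; set PYSPARK_PYTHON or "
        "spark.pyspark.python instead",
        DeprecationWarning,
        stacklevel=2,
    )
```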

SparkQA · 2020-12-13T03:03:06Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37311/

SparkQA · 2020-12-13T04:25:40Z

Test build #132708 has finished for PR 30735 at commit d3e52f4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-12-14T01:07:17Z

I believe this is a known flaky test (SPARK-33276), and it's unlikely to be related:

- Test basic decommissioning with shuffle cleanup *** FAILED ***
  The code passed to eventually never returned normally. Attempted 182 times over 3.0065354390166665 minutes. Last failure message: "++ id -u

and all other tests passed.

The Thriftserver test failure in GitHub Actions is also a known issue (SPARK-33705).

I believe I addressed all outstanding comments and it's ready for a review or possibly ready to go.

HyukjinKwon · 2020-12-14T01:13:30Z

cc @tgravescs too FYI

tgravescs · 2020-12-14T14:54:04Z

The overall approach looks fine to me, though I didn't have time to do a detailed review.
I agree with deprecating the config even if it is not used, so that users see a warning.

I see we changed the driver env settings here; how is the executor's PySpark Python being set? I took a very quick look and didn't see it in BasicExecutorFeatureStep. Just wondering if that needs to be updated at all, especially if the driver Python path is different from the executor's.

HyukjinKwon · 2020-12-14T19:35:01Z

Thanks @tgravescs. If guiding users is the concern, we can still do that by updating the migration guide, or even by showing a warning that the configuration was removed.

Given that this is experimental, I wanted to treat it a bit differently, since removals and behaviour changes are expected, as noted in the tags and as discussed on the mailing list in the past.

I feel like we're treating GA features and experimental features similarly.

In any event, I agree that this is a more conservative and possibly smoother approach. I'm okay with deprecating.

For setting environment variables on executors, I believe PYSPARK_DRIVER_PYTHON is for the driver Python and PYSPARK_PYTHON is for the executor Python (and for the driver if PYSPARK_DRIVER_PYTHON is not set). So, if different Python executables are used, users can set both environment variables (or the equivalent Spark configurations such as spark.pyspark.python).
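The fallback described above can be sketched as a pair of lookups. This is a simplified model, not Spark's code: the function names are hypothetical, and the assumption that a configuration takes precedence over the matching environment variable is my reading of Spark's documented behaviour rather than something stated in this thread.

```python
from typing import Mapping

DEFAULT_PYTHON = "python3"


def resolve_driver_python(conf: Mapping[str, str], env: Mapping[str, str]) -> str:
    """Pick the driver interpreter (hypothetical helper; ordering is an assumption)."""
    candidates = (
        conf.get("spark.pyspark.driver.python"),
        conf.get("spark.pyspark.python"),
        env.get("PYSPARK_DRIVER_PYTHON"),
        env.get("PYSPARK_PYTHON"),
    )
    # The first value that is actually set wins; otherwise use the default.
    return next((c for c in candidates if c), DEFAULT_PYTHON)


def resolve_executor_python(conf: Mapping[str, str], env: Mapping[str, str]) -> str:
    """Pick the executor interpreter; driver-only settings are ignored here."""
    candidates = (conf.get("spark.pyspark.python"), env.get("PYSPARK_PYTHON"))
    return next((c for c in candidates if c), DEFAULT_PYTHON)
```

With only PYSPARK_PYTHON set, both functions return the same path; setting PYSPARK_DRIVER_PYTHON as well changes the driver result only.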

tgravescs · 2020-12-14T20:04:53Z

Sorry if I missed it in the PR or existing code, but I want to make sure the executor side of spark.pyspark.python or PYSPARK_PYTHON is working with K8s. I only saw changes in this PR for the driver side but haven't had time to look in great detail.

HyukjinKwon · 2020-12-14T20:10:18Z

Oh, actually PYSPARK_PYTHON (or spark.pyspark.python) is passed through from the driver to the executor side via:

spark/python/pyspark/context.py

Line 230 in e2cdfce

self.pythonExec = os.environ.get("PYSPARK_PYTHON", 'python3')

spark/python/pyspark/rdd.py

Line 2533 in 3959f0d

return sc._jvm.PythonFunction(bytearray(pickled_command), env, includes, sc.pythonExec,

spark/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala

Lines 64 to 65 in 4851453

val runner = PythonRunner(func)
runner.compute(firstParent.iterator(split, context), split.index, context)

spark/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala

Line 92 in 3959f0d

protected val pythonExec: String = funcs.head.funcs.head.pythonExec

spark/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala

Line 144 in 3959f0d

val worker: Socket = env.createPythonWorker(pythonExec, envVars.asScala.toMap)

I also added a test to make sure this works in K8S here: https://github.com/apache/spark/pull/30735/files#diff-78ba045f393bcf6ffaa3dfe85bc7682cacf0bef69d447a2346e201279cc0bc5bR179-R197
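The hand-off traced through those files can be modelled in a few lines. This is a toy sketch under stated assumptions, not Spark's implementation: `PythonFunction` here is a plain dataclass standing in for the JVM-side object, and `driver_build_function` and `executor_start_worker` are hypothetical names.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class PythonFunction:
    """Toy stand-in for the JVM-side PythonFunction; not Spark's real class."""
    command: bytes
    python_exec: str


def driver_build_function(command: bytes, env: Mapping[str, str]) -> PythonFunction:
    # Driver side: read the interpreter once, as context.py does, and attach
    # it to the function that will be shipped to the executors.
    python_exec = env.get("PYSPARK_PYTHON", "python3")
    return PythonFunction(command, python_exec)


def executor_start_worker(func: PythonFunction) -> str:
    # Executor side: launch the worker with the interpreter carried by the
    # function, not with anything read from the executor's own environment.
    result = subprocess.run(
        [func.python_exec, "-c", "print('worker started')"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Pretend the driver was launched with PYSPARK_PYTHON pointing at this interpreter.
func = driver_build_function(b"pickled-command", {"PYSPARK_PYTHON": sys.executable})
```

The point of the model is that the executor never consults its own PYSPARK_PYTHON; whatever the driver resolved travels with the function.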

tgravescs · 2020-12-14T20:16:46Z

great thanks!

dongjoon-hyun · 2020-12-14T22:43:36Z

+        // Each entry does not keep the file permission from the input file.
+        // Setting permissions in the input file do not work. Just simply set
+        // to 777.
+        tarEntry.setMode(0x81ff)


This looks like another orthogonal issue to me. Could you spin-off this one?

It's slightly orthogonal but related to the current PR. The custom Python executable in the tests needs executable permission, and the tgz should keep that permission. Without this change, the test fails because the file loses its executable permission and cannot be run as a Python executable afterwards. FWIW this util is test-only, and this PR adds the first test case that needs to keep file permissions.
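The permission issue is easy to reproduce outside Spark. The sketch below uses Python's standard tarfile module rather than the Java test utility discussed in the diff, so it only illustrates the idea; `pack_executable` and `entry_mode` are hypothetical helper names. Note that 0x81ff in the diff is the regular-file type bit combined with permission bits 777.

```python
import io
import stat
import tarfile


def pack_executable(name: str, payload: bytes, mode: int = 0o777) -> bytes:
    """Write one entry into an in-memory .tar.gz with explicit permission bits."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        # A fresh TarInfo defaults to 0o644, which drops the executable bit,
        # so the mode has to be set explicitly.
        info.mode = mode
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def entry_mode(archive: bytes, name: str) -> int:
    """Read back the permission bits stored for one entry."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
        return tar.getmember(name).mode
```

Packing a script with the default `mode` and reading it back shows the executable bits surviving the round trip, which is what the test in this PR relies on.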

dongjoon-hyun

+1, LGTM (two comments: #30735 (comment) and #30735 (comment)).
I'll leave the decision for those comments to @HyukjinKwon .

erikerlandson · 2020-12-14T22:47:26Z

FWIW I also agree about separating the things not specifically about PYSPARK_PYTHON.
Otherwise LGTM

dongjoon-hyun · 2020-12-14T22:47:46Z

@erikerlandson and @viirya . I want to know your opinions. Please let us know if you have concerns still.

dongjoon-hyun · 2020-12-14T22:48:20Z

Oh, thanks, @erikerlandson !

HyukjinKwon · 2020-12-14T23:34:03Z

Thanks @erikerlandson and @dongjoon-hyun for your review and approval. I replied in the comments.

HyukjinKwon · 2020-12-14T23:36:49Z

To clarify, I believe the current change contains only related changes. Without them, either the tests fail or the Python-related environment variables and configurations would not work as they do in other places.

HyukjinKwon · 2020-12-14T23:56:19Z

Merged to master and branch-3.1 (for K8S GA preparation).

Thanks again @erikerlandson, @dongjoon-hyun and @tgravescs for reviewing and bearing with me :-).

Closes #30735 from HyukjinKwon/SPARK-33748.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit a99a47c)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

erikerlandson · 2020-12-15T12:25:52Z

@HyukjinKwon thanks for standardizing the env-var support!

HyukjinKwon · 2020-12-15T14:36:23Z

Thank you @erikerlandson.

github-actions Bot added CORE DOCS KUBERNETES labels Dec 11, 2020
