Skip to content

[SPARK-33748][K8S] Respect environment variables and configurations for Python executables#30735

Closed
HyukjinKwon wants to merge 1 commit into
apache:masterfrom
HyukjinKwon:SPARK-33748
Closed

[SPARK-33748][K8S] Respect environment variables and configurations for Python executables#30735
HyukjinKwon wants to merge 1 commit into
apache:masterfrom
HyukjinKwon:SPARK-33748

Conversation

@HyukjinKwon

@HyukjinKwon HyukjinKwon commented Dec 11, 2020

Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

This PR proposes:

  • Respect PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables, or spark.pyspark.python and spark.pyspark.driver.python configurations in Kubernates just like other cluster types in Spark.

  • Depreate spark.kubernetes.pyspark.pythonVersion and guide users to set the environment variables and configurations for Python executables.
    NOTE that spark.kubernetes.pyspark.pythonVersion is already a no-op configuration without this PR. Default is 3 and other values are disallowed.

  • In order for Python executable settings to be consistently used, fix spark.archives option to unpack into the current working directory in the driver of Kubernates' cluster mode. This behaviour is identical with Yarn's cluster mode. By doing this, users can leverage Conda or virtuenenv in cluster mode as below:

     conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas conda-pack
     conda activate pyspark_conda_env
     conda pack -f -o pyspark_conda_env.tar.gz
     PYSPARK_PYTHON=./environment/bin/python spark-submit --archives pyspark_conda_env.tar.gz#environment app.py
  • Removed several unused or useless codes such as extractS3Key and renameResourcesToLocalFS

Why are the changes needed?

  • To provide a consistent support of PySpark by using PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables, or spark.pyspark.python and spark.pyspark.driver.python configurations.
  • To provide Conda and virtualenv support via spark.archives options.

Does this PR introduce any user-facing change?

Yes:

  • spark.kubernetes.pyspark.pythonVersion is deprecated.
  • PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables, and spark.pyspark.python and spark.pyspark.driver.python configurations are respected.

How was this patch tested?

Manually tested via:

minikube delete
minikube start --cpus 12 --memory 16384
kubectl create namespace spark-integration-test
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: spark-integration-test
EOF
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=spark-integration-test:spark --namespace=spark-integration-test
dev/make-distribution.sh --pip --tgz -Pkubernetes
resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh --spark-tgz `pwd`/spark-3.2.0-SNAPSHOT-bin-3.2.0.tgz  --service-account spark --namespace spark-integration-test

Unittests were also added.

@HyukjinKwon

Copy link
Copy Markdown
Member Author

cc @dongjoon-hyun and @erikerlandson FYI

@HyukjinKwon HyukjinKwon Dec 11, 2020

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read the codes multiple times for sure, and I think this code does a duplicated job.
If I am not horribly wrong somewhere:

  • localResources itself are always local files, and resources will always be replaced to localResources.
  • If resources is null, then localResources will be null too

cc @skonto from 5e74570

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PYTHON_VERSION and pyv3 apparently not used anywhere in Spark code base.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extractS3Key is not used too.

Comment thread core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala Outdated
@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

Comment thread core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala Outdated
@erikerlandson

Copy link
Copy Markdown
Contributor

We should not remove the setting spark.kubernetes.pyspark.pythonVersion, as that will break backward compatibility. IMO, this setting should override environment variables, if it is specified.

@HyukjinKwon

Copy link
Copy Markdown
Member Author

@erikerlandson, spark.kubernetes.pyspark.pythonVersion has no other values able to set except 3, which is the default. Even if we remove spark.kubernetes.pyspark.pythonVersion, Spark still allows to set the non-existent configuration. Therefore I believe there's no breaking change here. In addition, Kubernates is experimental so far which generally accepts breaking changes between minor releases.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@erikerlandson

Copy link
Copy Markdown
Contributor

@HyukjinKwon is the issue that spark.kubernetes.pyspark.pythonVersion is purely a version, but PYSPARK_PYTHON and friends allow path names? Because either way Spark currently supports only python 3.

@dongjoon-hyun dongjoon-hyun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@erikerlandson and @HyukjinKwon .
This PR seems to contain 3 othorognal stuffs.

  1. First of all, can we discuss of explicit deprecating spark.kubernetes.pyspark.pythonVersion as no-op at Apache Spark 3.1? Since it's effectively no-op for now and @HyukjinKwon 's new approach may supersede it, I believe it's okay.
  2. python -> python3 as new follow-up PR for SPARK-32447.
  3. Finally, @HyukjinKwon 's proposal.

Comment thread launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java Outdated
Comment thread python/pyspark/context.py Outdated
@HyukjinKwon

Copy link
Copy Markdown
Member Author

Yes @erikerlandson thanks for clarification @dongjoon-hyun. So, if users set the configuration as 2 in 2.4 Spark application, that already wouldn't work because Python 2 was dropped. If users set it as 3, it will not be broken and work as expected because the default Python is already 3. Additionally now Spark 3.1 respects the Python environment variables if set, which is supposed to be, like Spark does in other places. I believe its more a bug fix.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

SparkQA commented Dec 13, 2020

Copy link
Copy Markdown

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37311/

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

SparkQA commented Dec 13, 2020

Copy link
Copy Markdown

Test build #132708 has finished for PR 30735 at commit d3e52f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon

HyukjinKwon commented Dec 14, 2020

Copy link
Copy Markdown
Member Author

I believe this is a known flaky test (SPARK-33276), and it's hardly related:

- Test basic decommissioning with shuffle cleanup *** FAILED ***
  The code passed to eventually never returned normally. Attempted 182 times over 3.0065354390166665 minutes. Last failure message: "++ id -u

and all other tests passed.

Thriftserver test in GitHub Actions is also a known test failure at SPARK-33705

I believe I addressed all standing commands and it's ready for a review or possibly ready to go.

@HyukjinKwon

Copy link
Copy Markdown
Member Author

cc @tgravescs too FYI

@tgravescs

Copy link
Copy Markdown
Contributor

overall approach looks fine to me, I didn't have time to do detailed review.
I'm agree with deprecated config even if not used so users see warning.

I see we changed the driver env settings here, how is executor pyspark being set? I took a very quick look and didn't see it in BasicExecutorFeatureStep. Just wondering if that needs to be updated at all. Especially if driver python path was different from executor

@HyukjinKwon

HyukjinKwon commented Dec 14, 2020

Copy link
Copy Markdown
Member Author

Thanks @tgravescs. If guiding users is a matter, we can still do by updating migration guide or even show a warning that the configuration was removed.

Given that this is experimental, I wanted to treat a bit differently as the removal and behaviour changes are expected as noted in the tags and as discussed in the mailing list in the past.

I feel like we're treating GA-ed feature and experimental feature similarly.

In any event, I agree that this is more conservative and possibly smoother approach. I'm okay with deprecating.

For setting envioronment variables on executors, I believe PYSPARK_DRIVER_PYTHON is for driver python and PYSPARK_PYTHON is for executor python (and driver if PYSPARK_DRIVER_PYTHON is not set). So, if different python executables are used, users can set both environment variables (or equivalent Spark configurations such as spark.pyspark.python).

@tgravescs

Copy link
Copy Markdown
Contributor

sorry if I missed it in the PR or existing code, I want to make sure the executor side of spark.pyspark.python or PYSPARK_PYTHON is working with k8s? I only saw changes in this PR for the driver side but haven't had time to look in great detail.

@HyukjinKwon

HyukjinKwon commented Dec 14, 2020

Copy link
Copy Markdown
Member Author

Oh, actually PYSPARK_PYTHON (or spark.pyspark.python) is being passed though from driver to executor side via:

  1. self.pythonExec = os.environ.get("PYSPARK_PYTHON", 'python3')
  2. return sc._jvm.PythonFunction(bytearray(pickled_command), env, includes, sc.pythonExec,
  3. val runner = PythonRunner(func)
    runner.compute(firstParent.iterator(split, context), split.index, context)
  4. protected val pythonExec: String = funcs.head.funcs.head.pythonExec
  5. val worker: Socket = env.createPythonWorker(pythonExec, envVars.asScala.toMap)

I also added a test to make sure this works in K8S here: https://github.com/apache/spark/pull/30735/files#diff-78ba045f393bcf6ffaa3dfe85bc7682cacf0bef69d447a2346e201279cc0bc5bR179-R197

@tgravescs

Copy link
Copy Markdown
Contributor

great thanks!

// Each entry does not keep the file permission from the input file.
// Setting permissions in the input file do not work. Just simply set
// to 777.
tarEntry.setMode(0x81ff)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like another orthogonal issue to me. Could you spin-off this one?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its slightly orthogonal but related to the current PR. The custom python in the tests should have executable permission and tgz should keep the permission. Without this change, the test fails because it loses executable permission and cannot be used as a python alone later. FWIW this util is for test-only, and this PR adds the first case in the test where it needs to keep the file permissions.

@dongjoon-hyun dongjoon-hyun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM (two comments: #30735 (comment) and #30735 (comment)).
I'll leave the decision for those comments to @HyukjinKwon .

@erikerlandson

Copy link
Copy Markdown
Contributor

FWIW I also agree about separating the things not specifically about PYSPARK_PYTHON.
Otherwise LGTM

@dongjoon-hyun

Copy link
Copy Markdown
Member

@erikerlandson and @viirya . I want to know your opinions. Please let us know if you have concerns still.

@dongjoon-hyun

Copy link
Copy Markdown
Member

Oh, thanks, @erikerlandson !

@HyukjinKwon

Copy link
Copy Markdown
Member Author

Thanks @erikerlandson and @dongjoon-hyun for your review and approval. I replied in the comments.

@HyukjinKwon

Copy link
Copy Markdown
Member Author

To clarify, I believe the current change contains all related changes. Without them, either test fails or python related env or configurations would not work as expected like other places.

@HyukjinKwon

HyukjinKwon commented Dec 14, 2020

Copy link
Copy Markdown
Member Author

Merged to master and branch-3.1 (for K8S GA preparation).

Thanks again @erikerlandson, @dongjoon-hyun and @tgravescs for reviewing and bearing with me :-).

HyukjinKwon added a commit that referenced this pull request Dec 14, 2020
…or Python executables

### What changes were proposed in this pull request?

This PR proposes:

- Respect `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` environment variables, or `spark.pyspark.python` and `spark.pyspark.driver.python` configurations in Kubernates just like other cluster types in Spark.

- Depreate `spark.kubernetes.pyspark.pythonVersion` and guide users to set the environment variables and configurations for Python executables.
    NOTE that `spark.kubernetes.pyspark.pythonVersion` is already a no-op configuration without this PR. Default is `3` and other values are disallowed.

- In order for Python executable settings to be consistently used, fix `spark.archives` option to unpack into the current working directory in the driver of Kubernates' cluster mode. This behaviour is identical with Yarn's cluster mode. By doing this, users can leverage Conda or virtuenenv in cluster mode as below:

   ```python
    conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas conda-pack
    conda activate pyspark_conda_env
    conda pack -f -o pyspark_conda_env.tar.gz
    PYSPARK_PYTHON=./environment/bin/python spark-submit --archives pyspark_conda_env.tar.gz#environment app.py
   ```

- Removed several unused or useless codes such as `extractS3Key` and `renameResourcesToLocalFS`

### Why are the changes needed?

- To provide a consistent support of PySpark by using `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` environment variables, or `spark.pyspark.python` and `spark.pyspark.driver.python` configurations.
- To provide Conda and virtualenv support via `spark.archives` options.

### Does this PR introduce _any_ user-facing change?

Yes:

- `spark.kubernetes.pyspark.pythonVersion` is deprecated.
- `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` environment variables, and `spark.pyspark.python` and `spark.pyspark.driver.python` configurations are respected.

### How was this patch tested?

Manually tested via:

```bash
minikube delete
minikube start --cpus 12 --memory 16384
kubectl create namespace spark-integration-test
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: spark-integration-test
EOF
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=spark-integration-test:spark --namespace=spark-integration-test
dev/make-distribution.sh --pip --tgz -Pkubernetes
resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh --spark-tgz `pwd`/spark-3.2.0-SNAPSHOT-bin-3.2.0.tgz  --service-account spark --namespace spark-integration-test
```

Unittests were also added.

Closes #30735 from HyukjinKwon/SPARK-33748.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit a99a47c)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@erikerlandson

Copy link
Copy Markdown
Contributor

@HyukjinKwon thanks for standardizing the env-var support!

@HyukjinKwon

Copy link
Copy Markdown
Member Author

Thank you @erikerlandson.

@HyukjinKwon HyukjinKwon deleted the SPARK-33748 branch January 4, 2022 00:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants