[SPARK-33748][K8S] Respect environment variables and configurations for Python executables #30735
HyukjinKwon wants to merge 1 commit into
cc @dongjoon-hyun and @erikerlandson FYI
I read the code multiple times to be sure, and I think this code does duplicated work.
Unless I am badly wrong somewhere:
- `localResources` itself is always local files, and `resources` will always be replaced with `localResources`.
- If `resources` is `null`, then `localResources` will be `null` too.
`PYTHON_VERSION` and `pyv3` are apparently not used anywhere in the Spark code base.
`extractS3Key` is not used either.
We should not remove the setting |
|
@erikerlandson, |
@HyukjinKwon is the issue that |
@erikerlandson and @HyukjinKwon.
This PR seems to contain three orthogonal things.
- First of all, can we discuss explicitly deprecating `spark.kubernetes.pyspark.pythonVersion` as a no-op in Apache Spark 3.1? Since it's effectively a no-op for now and @HyukjinKwon's new approach may supersede it, I believe it's okay.
- `python` -> `python3` as a new follow-up PR for SPARK-32447.
- Finally, @HyukjinKwon's proposal.
Yes @erikerlandson, and thanks for the clarification @dongjoon-hyun. So, if users set the configuration to 2 in a Spark 2.4 application, that already wouldn't work because Python 2 was dropped. If users set it to 3, it will not break and will work as expected because the default Python is already 3. Additionally, Spark 3.1 now respects the Python environment variables if they are set, as it is supposed to and as Spark does in other places. I believe it's more of a bug fix.
Kubernetes integration test status failure |
Test build #132708 has finished for PR 30735 at commit
I believe this is a known flaky test (SPARK-33276), and it's hardly related; all other tests passed. The Thriftserver test failure in GitHub Actions is also a known failure, tracked at SPARK-33705. I believe I addressed all outstanding comments, and it's ready for a review or possibly ready to go.
|
cc @tgravescs too FYI |
The overall approach looks fine to me; I didn't have time to do a detailed review. I see we changed the driver env settings here. How is the executor PySpark Python being set? I took a very quick look and didn't see it in `BasicExecutorFeatureStep`. Just wondering if that needs to be updated at all, especially if the driver Python path is different from the executor's.
Thanks @tgravescs. If guiding users is the concern, we can still do so by updating the migration guide or even showing a warning that the configuration was removed. Given that this is experimental, I wanted to treat it a bit differently, as removals and behaviour changes are expected, as noted in the tags and as discussed on the mailing list in the past. I feel like we're treating a GA-ed feature and an experimental feature similarly. In any event, I agree that this is a more conservative and possibly smoother approach. I'm okay with deprecating. For setting environment variables on executors, I believe
Sorry if I missed it in the PR or the existing code. I want to make sure the executor side of `spark.pyspark.python` or `PYSPARK_PYTHON` works with K8s. I only saw changes in this PR for the driver side but haven't had time to look in great detail.
|
Oh, actually
I also added a test to make sure this works in K8S here: https://github.com/apache/spark/pull/30735/files#diff-78ba045f393bcf6ffaa3dfe85bc7682cacf0bef69d447a2346e201279cc0bc5bR179-R197 |
|
great thanks! |
// Each entry does not keep the file permission from the input file.
// Setting permissions in the input file do not work. Just simply set
// to 777.
tarEntry.setMode(0x81ff)
This looks like another orthogonal issue to me. Could you spin this one off?
It's slightly orthogonal but related to the current PR. The custom Python in the tests should have executable permission, and the tgz should keep that permission. Without this change, the test fails because the file loses its executable permission and cannot be used as a standalone Python later. FWIW, this util is test-only, and this PR adds the first test case that needs to keep the file permissions.
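To make the permission point concrete, here is a minimal Python sketch. It is not the Scala test utility touched by this PR, and the helper name `add_executable_entry` and the entry content are hypothetical. It writes a tar.gz entry with an explicit mode and reads the mode back. Note that `0x81ff` is octal `0100777`, that is, the regular-file type bit plus `777` permissions.

```python
import io
import tarfile


def add_executable_entry(tar: tarfile.TarFile, name: str, data: bytes) -> None:
    """Add `data` as a regular-file entry with rwxrwxrwx permissions."""
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    # Python's TarInfo.mode holds only the permission bits; the file type is
    # stored separately, so 0o777 corresponds to the low bits of 0x81ff.
    info.mode = 0o777
    tar.addfile(info, io.BytesIO(data))


# Build a small archive in memory (hypothetical entry name and content).
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
    add_executable_entry(tar, "python", b"#!/bin/sh\nexec python3 \"$@\"\n")

# Read the archive back and look at the mode stored in the entry header.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r:gz") as tar:
    stored_mode = tar.getmember("python").mode
```

If the mode were left at the library default instead, the entry would be stored without the executable bit, which is the kind of failure described in the comment above.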
+1, LGTM (two comments: #30735 (comment) and #30735 (comment)).
I'll leave the decision for those comments to @HyukjinKwon.
|
FWIW I also agree about separating the things not specifically about |
@erikerlandson and @viirya, I'd like to know your opinions. Please let us know if you still have concerns.
|
Oh, thanks, @erikerlandson ! |
|
Thanks @erikerlandson and @dongjoon-hyun for your review and approval. I replied in the comments. |
To clarify, I believe the current change contains all related changes. Without them, either the test fails or the Python-related environment variables or configurations would not work as expected, as they do elsewhere.
|
Merged to master and branch-3.1 (for K8S GA preparation). Thanks again @erikerlandson, @dongjoon-hyun and @tgravescs for reviewing and bearing with me :-). |
…or Python executables
### What changes were proposed in this pull request?
This PR proposes:
- Respect `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` environment variables, or `spark.pyspark.python` and `spark.pyspark.driver.python` configurations in Kubernetes just like other cluster types in Spark.
- Deprecate `spark.kubernetes.pyspark.pythonVersion` and guide users to set the environment variables and configurations for Python executables.
  NOTE that `spark.kubernetes.pyspark.pythonVersion` is already a no-op configuration without this PR. The default is `3` and other values are disallowed.
- In order for Python executable settings to be used consistently, fix the `spark.archives` option to unpack into the current working directory in the driver in Kubernetes cluster mode. This behaviour is identical to YARN's cluster mode. By doing this, users can leverage Conda or virtualenv in cluster mode as below:
```bash
conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas conda-pack
conda activate pyspark_conda_env
conda pack -f -o pyspark_conda_env.tar.gz
PYSPARK_PYTHON=./environment/bin/python spark-submit --archives pyspark_conda_env.tar.gz#environment app.py
```
- Remove unused or redundant code such as `extractS3Key` and `renameResourcesToLocalFS`.
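As a rough illustration of the archive behaviour described in the list above, the following Python sketch unpacks a `path#alias` archive spec into a directory named after the alias under a working directory. It is not Spark's implementation. The function names are hypothetical, and falling back to the archive file name when no `#alias` is given is an assumption made for this example.

```python
import io
import os
import tarfile
import tempfile
from typing import Tuple


def split_archive_spec(spec: str) -> Tuple[str, str]:
    """Split 'path#alias' into (path, alias).

    Assumption: without a '#alias' fragment, the archive's file name is used.
    """
    path, _, alias = spec.partition("#")
    if not alias:
        alias = os.path.basename(path)
    return path, alias


def unpack_archive_spec(spec: str, working_dir: str) -> str:
    """Unpack the archive into <working_dir>/<alias> and return that directory."""
    path, alias = split_archive_spec(spec)
    target = os.path.join(working_dir, alias)
    os.makedirs(target, exist_ok=True)
    with tarfile.open(path, mode="r:*") as tar:
        tar.extractall(target)
    return target


# Demo with a tiny stand-in archive (not a real Conda environment).
with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "pyspark_conda_env.tar.gz")
    payload = b"#!/bin/sh\n"
    with tarfile.open(archive, mode="w:gz") as tar:
        entry = tarfile.TarInfo("bin/python")
        entry.size = len(payload)
        entry.mode = 0o755
        tar.addfile(entry, io.BytesIO(payload))

    work_dir = os.path.join(tmp, "work")
    os.makedirs(work_dir)
    env_dir = unpack_archive_spec(archive + "#environment", work_dir)
    # Relative to work_dir, the interpreter now sits at ./environment/bin/python.
    interpreter_exists = os.path.isfile(os.path.join(env_dir, "bin", "python"))
    relative_env_dir = os.path.relpath(env_dir, work_dir)
```

This is why a relative `PYSPARK_PYTHON=./environment/bin/python` can resolve once the archive is unpacked into the driver's current working directory.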
### Why are the changes needed?
- To provide consistent support for PySpark by using `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` environment variables, or `spark.pyspark.python` and `spark.pyspark.driver.python` configurations.
- To provide Conda and virtualenv support via the `spark.archives` option.
### Does this PR introduce _any_ user-facing change?
Yes:
- `spark.kubernetes.pyspark.pythonVersion` is deprecated.
- `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` environment variables, and `spark.pyspark.python` and `spark.pyspark.driver.python` configurations are respected.
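The sketch below shows one way the four settings above could be resolved to a single Python executable. It is illustrative only and is not code from this PR. The function name `resolve_python` is hypothetical, and both the precedence order (configurations before environment variables, driver-specific before general) and the `python3` fallback are assumptions, since the description does not spell them out.

```python
from typing import Mapping


def resolve_python(conf: Mapping[str, str], env: Mapping[str, str],
                   for_driver: bool) -> str:
    """Return the first non-empty Python executable setting.

    Assumed order: driver config, general config, driver env var,
    general env var, then a 'python3' fallback. Driver-specific settings
    are only consulted when for_driver is True.
    """
    candidates = []
    if for_driver:
        candidates.append(conf.get("spark.pyspark.driver.python"))
    candidates.append(conf.get("spark.pyspark.python"))
    if for_driver:
        candidates.append(env.get("PYSPARK_DRIVER_PYTHON"))
    candidates.append(env.get("PYSPARK_PYTHON"))
    for candidate in candidates:
        if candidate:
            return candidate
    return "python3"  # assumed default, not taken from the PR


# Example: only the environment variable from the Conda example is set.
driver_python = resolve_python(
    conf={}, env={"PYSPARK_PYTHON": "./environment/bin/python"}, for_driver=True
)
```

With only `PYSPARK_PYTHON` set, both driver and executors would resolve to the same interpreter under these assumptions, which matches the Conda usage shown in the description.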
### How was this patch tested?
Manually tested via:
```bash
minikube delete
minikube start --cpus 12 --memory 16384
kubectl create namespace spark-integration-test
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: spark-integration-test
EOF
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=spark-integration-test:spark --namespace=spark-integration-test
dev/make-distribution.sh --pip --tgz -Pkubernetes
resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh --spark-tgz `pwd`/spark-3.2.0-SNAPSHOT-bin-3.2.0.tgz --service-account spark --namespace spark-integration-test
```
Unit tests were also added.
Closes #30735 from HyukjinKwon/SPARK-33748.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit a99a47c)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
|
@HyukjinKwon thanks for standardizing the env-var support! |
|
Thank you @erikerlandson. |