From 8c766ab7451c84e3f629f8f2f308d6e7da233b84 Mon Sep 17 00:00:00 2001
From: "H. Vetinari"
Date: Mon, 18 Oct 2021 23:08:59 +1100
Subject: [PATCH 1/5] update conda installation instructions

---
 .../docs/source/getting_started/install.rst | 58 +++++++++++--------
 1 file changed, 33 insertions(+), 25 deletions(-)

diff --git a/python/docs/source/getting_started/install.rst b/python/docs/source/getting_started/install.rst
index 63a7ecaa27ca6..e21c4d550fa79 100644
--- a/python/docs/source/getting_started/install.rst
+++ b/python/docs/source/getting_started/install.rst
@@ -83,46 +83,54 @@ Note that this installation way of PySpark with/without a specific Hadoop versio
 Using Conda
 -----------
 
-Conda is an open-source package management and environment management system which is a part of
-the `Anaconda `_ distribution. It is both cross-platform and
-language agnostic. In practice, Conda can replace both `pip `_ and
-`virtualenv `_.
-
-Create new virtual environment from your terminal as shown below:
-
-.. code-block:: bash
-
-    conda create -n pyspark_env
-
-After the virtual environment is created, it should be visible under the list of Conda environments
-which can be seen using the following command:
+Conda is an open-source package management and environment management system (developed by
+`Anaconda` `_), which is best installed through
+`Miniconda `_ or `Miniforge `_.
+The tool is both cross-platform and language agnostic, and in practice, conda can replace both
+`pip `_ and `virtualenv `_.
+
+Conda uses so-called channels to distribute packages, and together with the default channels by
+Anaconda itself, the most important channel is `conda-forge `_, which
+is the community-driven packaging effort that is the most extensive & the most current (and also
+serves as the upstream for the Anaconda channels in most cases).
+
+Generally, it is recommended to use _as few channels as possible_. Conda-forge & Anaconda put a
+lot of effort in guaranteeing binary compatibility between packages (e.g. by using compatible
+compilers for all packages and tracking which packages are ABI-relevant). Needlessly mixing in
+other channels can end up breaking those guarantees, which is why conda-forge even recommends
+so-called "strict channel priority":
 
 .. code-block:: bash
 
-    conda env list
+    conda config --add channels conda-forge
+    conda config --set channel_priority strict
 
-Now activate the newly created environment with the following command:
+To create a new conda environment from your terminal and activate it, proceed as shown below:
 
 .. code-block:: bash
 
+    conda create -n pyspark_env
     conda activate pyspark_env
 
-You can install pyspark by `Using PyPI <#using-pypi>`_ to install PySpark in the newly created
-environment, for example as below. It will install PySpark under the new virtual environment
-``pyspark_env`` created above.
+After activating the environment, use the following command to install pyspark,
+a python version of your choice, as well as other packages you want to use in
+the same session as pyspark (you can install in several steps too).
 
 .. code-block:: bash
 
-    pip install pyspark
+    conda install -c conda-forge pyspark python [other packages] # can also use python=3.8, etc.
 
-Alternatively, you can install PySpark from Conda itself as below:
-
-.. code-block:: bash
+Whenever possible, avoid using ``pip`` within conda-environments. Conda and pip "do not speak
+the same language" - while conda will detect and try to respect packages installed by pip,
+pip might install over existing packages installed by conda and consequently break the
+functionality of the environment.
 
-    conda install pyspark
+Note that `PySpark at Conda `_ is maintained
+separately by the community; while new versions generally get packaged quickly, the
+availability through conda(-forge) is not directly in sync with the PySpark release cycle.
 
-However, note that `PySpark at Conda `_ is not necessarily
-synced with PySpark release cycle because it is maintained by the community separately.
+For a short summary about useful conda commands, see their
+`cheat sheet `_.
 
 
 Manually Downloading

From 8336c5afd45b44c6bf14ebb5896a0f13aeed2f37 Mon Sep 17 00:00:00 2001
From: "H. Vetinari"
Date: Tue, 19 Oct 2021 10:13:05 +1100
Subject: [PATCH 2/5] fixes from review

---
 python/docs/source/getting_started/install.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/python/docs/source/getting_started/install.rst b/python/docs/source/getting_started/install.rst
index e21c4d550fa79..7c10544c4c8aa 100644
--- a/python/docs/source/getting_started/install.rst
+++ b/python/docs/source/getting_started/install.rst
@@ -84,7 +84,7 @@ Using Conda
 -----------
 
 Conda is an open-source package management and environment management system (developed by
-`Anaconda` `_), which is best installed through
+`Anaconda `_), which is best installed through
 `Miniconda `_ or `Miniforge `_.
 The tool is both cross-platform and language agnostic, and in practice, conda can replace both
 `pip `_ and `virtualenv `_.
@@ -94,7 +94,7 @@ Anaconda itself, the most important channel is `conda-forge
Date: Tue, 19 Oct 2021 14:51:23 +1100
Subject: [PATCH 3/5] remove some paragraphs (review feedback)

---
 python/docs/source/getting_started/install.rst | 18 +-----------------
 1 file changed, 1 insertion(+), 17 deletions(-)

diff --git a/python/docs/source/getting_started/install.rst b/python/docs/source/getting_started/install.rst
index 7c10544c4c8aa..5359c0f762bda 100644
--- a/python/docs/source/getting_started/install.rst
+++ b/python/docs/source/getting_started/install.rst
@@ -94,17 +94,6 @@ Anaconda itself, the most important channel is `conda-forge `_ is maintained
+Note that `PySpark for conda `_ is maintained
 separately by the community; while new versions generally get packaged quickly, the
 availability through conda(-forge) is not directly in sync with the PySpark release cycle.
 

From 45dc1384b03e4fe01a571466b7b6f4c383e3d02d Mon Sep 17 00:00:00 2001
From: "H. Vetinari"
Date: Wed, 20 Oct 2021 21:27:12 +1100
Subject: [PATCH 4/5] add a note about using pip in conda environments (review feedback)

---
 python/docs/source/getting_started/install.rst | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/python/docs/source/getting_started/install.rst b/python/docs/source/getting_started/install.rst
index 5359c0f762bda..9a6e9edfe4bcd 100644
--- a/python/docs/source/getting_started/install.rst
+++ b/python/docs/source/getting_started/install.rst
@@ -113,6 +113,11 @@ Note that `PySpark for conda `_ is mai
 separately by the community; while new versions generally get packaged quickly, the
 availability through conda(-forge) is not directly in sync with the PySpark release cycle.
 
+While using pip in a conda environment is technically feasible (with the same command as
+`above <#using-pypi>`_), this approach is `discouraged `_,
+because pip does not interoperate with conda. In particular, pip might install over existing
+(conda-installed) packages and consequently break the functionality of the environment.
+
 For a short summary about useful conda commands, see their
 `cheat sheet `_.
 

From e7b891f5e5f9072037105c6a2dac41a6262df464 Mon Sep 17 00:00:00 2001
From: h-vetinari
Date: Thu, 21 Oct 2021 14:32:17 +1100
Subject: [PATCH 5/5] Apply suggestions from code review

Co-authored-by: Hyukjin Kwon
---
 python/docs/source/getting_started/install.rst | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/python/docs/source/getting_started/install.rst b/python/docs/source/getting_started/install.rst
index 9a6e9edfe4bcd..13c6f8f3a28e2 100644
--- a/python/docs/source/getting_started/install.rst
+++ b/python/docs/source/getting_started/install.rst
@@ -107,7 +107,7 @@ the same session as pyspark (you can install in several steps too).
 
 .. code-block:: bash
 
-    conda install -c conda-forge pyspark python [other packages] # can also use python=3.8, etc.
+    conda install -c conda-forge pyspark # can also add "python=3.8 some_package [etc.]" here
 
 Note that `PySpark for conda `_ is maintained
 separately by the community; while new versions generally get packaged quickly, the
@@ -115,8 +115,7 @@ availability through conda(-forge) is not directly in sync with the PySpark rele
 
 While using pip in a conda environment is technically feasible (with the same command as
 `above <#using-pypi>`_), this approach is `discouraged `_,
-because pip does not interoperate with conda. In particular, pip might install over existing
-(conda-installed) packages and consequently break the functionality of the environment.
+because pip does not interoperate with conda.
 
 For a short summary about useful conda commands, see their
 `cheat sheet `_.
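
For reference, the workflow that the documentation describes after the last revision (create an environment, then install pyspark from conda-forge) can be sketched as a small shell script. This is a minimal sketch and not part of the patch series: the `run` helper, the `setup_pyspark_env` function, and the `ENV_NAME` and `DRY_RUN` variables are hypothetical names introduced here for illustration. The script targets the environment with `-n` instead of calling `conda activate`, on the assumption that it runs non-interactively, where activation would first need conda's shell hook. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: ENV_NAME, DRY_RUN, run and setup_pyspark_env are hypothetical
# names for this example; they do not appear in the patch or in conda itself.

ENV_NAME="${ENV_NAME:-pyspark_env}"   # environment name used in the docs' example
DRY_RUN="${DRY_RUN:-1}"               # 1 = print commands, anything else = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

setup_pyspark_env() {
    # Create an empty environment (-y skips the confirmation prompt).
    run conda create -y -n "$ENV_NAME"
    # Install pyspark from the conda-forge channel into that environment;
    # -n targets the environment by name, so no prior activation is needed.
    run conda install -y -n "$ENV_NAME" -c conda-forge pyspark
}

setup_pyspark_env
```

Running the script as-is prints the two conda commands; setting `DRY_RUN=0` executes them, which assumes `conda` is already installed and on `PATH`.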