[SPARK-47683][PYTHON][BUILD] Decouple PySpark core API to pyspark.core package#45053

Closed
HyukjinKwon wants to merge 15 commits into apache:master from HyukjinKwon:refactoring-core

@HyukjinKwon HyukjinKwon commented Feb 7, 2024

Member

What changes were proposed in this pull request?

This PR proposes to release a separate pyspark-connect package. See also SPIP: Pure Python Package in PyPI (Spark Connect).

Today's PySpark package is roughly as follows:

pyspark
├── *.py               # *Core / No Spark Connect support*
├── mllib              # MLlib / No Spark Connect support
├── resource           # Resource profile API / No Spark Connect support
├── streaming          # DStream (deprecated) / No Spark Connect support
├── ml                 # ML 
│   └── connect            # Spark Connect for ML
├── pandas             # API on Spark with/without Spark Connect support
└── sql                # SQL
    └── connect            # Spark Connect for SQL
        └── streaming      # Spark Connect for Structured Streaming

There will be two packages available, pyspark and pyspark-connect.

pyspark

Same as today's PySpark, except that the Core module is factored out to pyspark.core.*. The user-facing interface stays the same at pyspark.*.

pyspark
├── core               # *Core / No Spark Connect support*
├── mllib              # MLlib / No Spark Connect support
├── resource           # Resource profile API / No Spark Connect support
├── streaming          # DStream (deprecated) / No Spark Connect support
├── ml                 # ML 
│   └── connect            # Spark Connect for ML
├── pandas             # API on Spark with/without Spark Connect support
└── sql                # SQL
    └── connect            # Spark Connect for SQL
        └── streaming      # Spark Connect for Structured Streaming
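As a rough illustration of how an implementation can move under a `core` subpackage while the public import path stays the same, the sketch below builds a tiny in-memory package. The names `demo_pkg` and `Context` are hypothetical stand-ins and are not part of PySpark; the actual re-exports in this PR may be organized differently.

```python
import sys
import types

# Hypothetical stand-ins for a package laid out as demo_pkg/core/context.py.
core_pkg = types.ModuleType("demo_pkg.core")
core_ctx = types.ModuleType("demo_pkg.core.context")


class Context:
    """Placeholder for a core class whose implementation lives under core."""

    def __init__(self, app_name):
        self.app_name = app_name


# Record the new home of the implementation.
Context.__module__ = "demo_pkg.core.context"
core_ctx.Context = Context
core_pkg.context = core_ctx
core_pkg.__path__ = []  # mark as a package

# Stand-in for the top-level __init__.py: re-export the moved name so that
# existing user code importing from the top-level package keeps working.
top = types.ModuleType("demo_pkg")
top.__path__ = []  # mark as a package
top.core = core_pkg
top.Context = core_ctx.Context
top.__all__ = ["Context"]

sys.modules.update(
    {
        "demo_pkg": top,
        "demo_pkg.core": core_pkg,
        "demo_pkg.core.context": core_ctx,
    }
)

# Both import paths resolve to the same object.
from demo_pkg import Context as PublicContext
from demo_pkg.core.context import Context as CoreContext
```

User code that imports from the top-level path is unaffected by the move, because both names refer to one class object.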

pyspark-connect

The package after excluding the modules that do not support Spark Connect. Jars are also excluded, that is, ml ships without jars:

pyspark
├── ml
│   └── connect
├── pandas
└── sql
    └── connect
        └── streaming
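One way to see which of the two layouts is installed is to probe for the submodules listed in the trees above. This is a minimal sketch; the helper names `has_module` and `summarize` are hypothetical, the module lists are copied from the description rather than from the released packages, and probing a dotted name imports its parent package as a side effect.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if `name` can be located in the current environment."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or fails to import.
        return False


# Module lists taken from the trees in this description (assumed, not verified
# against a released package).
JVM_ONLY = ["pyspark.core", "pyspark.mllib", "pyspark.resource", "pyspark.streaming"]
CONNECT = ["pyspark.sql.connect", "pyspark.ml.connect", "pyspark.pandas"]


def summarize(names):
    """Map each module name to whether it is available."""
    return {name: has_module(name) for name in names}
```

Calling `summarize(JVM_ONLY)` in an environment with only pyspark-connect installed would be expected to report those modules as unavailable, if the package matches the layout described here.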

Why are the changes needed?

To provide a pure Python library that does not depend on the JVM.

See also SPIP: Pure Python Package in PyPI (Spark Connect).

Does this PR introduce any user-facing change?

Yes, users can install a pure Python library via pip install pyspark-connect.

How was this patch tested?

Manually tested the basic set of tests.

./sbin/start-connect-server.sh --jars `ls connector/connect/server/target/**/spark-connect*SNAPSHOT.jar`
cd python
python packaging/connect/setup.py sdist
cd dist
conda create -y -n clean-py-3.11 python=3.11
conda activate clean-py-3.11
pip install pyspark-connect-4.0.0.dev0.tar.gz
python
>>> import pyspark
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
>>> spark.range(10).show()
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+

The tests will be added separately and set up as a scheduled job in CI.

Was this patch authored or co-authored using generative AI tooling?

No.

Comment thread python/pyspark/core/broadcast.py Outdated
Comment thread python/pyspark/util.py Outdated
@HyukjinKwon

Member Author

I restored the references for our internal API. Explicitly private attributes starting with _ are not restored.

@HyukjinKwon

Member Author

Merged to master.

HyukjinKwon added a commit that referenced this pull request May 2, 2024
…spark-connect` package

### What changes were proposed in this pull request?

This PR is a followup of #45053, which currently includes `lib/py4j*zip` in the package; this followup excludes it. The file is being picked up by https://github.com/apache/spark/blob/master/python/MANIFEST.in#L26. For other files, we don't create the `deps` directory in `setup.py` for `pyspark-connect`, so they are not included, but `lib` is.

### Why are the changes needed?

To exclude unrelated files.

### Does this PR introduce _any_ user-facing change?

No, the main change has not been released yet.

### How was this patch tested?

Manually packaged, and checked the contents via `vi`.
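The same check could also be scripted rather than done by eye. The sketch below lists the members of a source distribution that match a pattern; the helper name `find_members` and the default pattern are assumptions for illustration, and the real archive layout may differ.

```python
import fnmatch
import tarfile


def find_members(sdist_path, pattern="*lib/py4j*zip"):
    """Return archive member names in a .tar.gz sdist that match `pattern`."""
    with tarfile.open(sdist_path, "r:gz") as tar:
        # fnmatch's "*" also matches "/" so the pattern can span directories.
        return [name for name in tar.getnames() if fnmatch.fnmatch(name, pattern)]
```

An empty result for the `pyspark-connect` tarball would indicate that no py4j zip was packaged, under the assumption that the pattern matches where such a file would appear.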

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46331 from HyukjinKwon/SPARK-47683-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>