[breaking] Make pyspark-client as default and pyspark package optional #60031

Merged: eladkal merged 4 commits into apache:main from raphaelauv:feat/spark_provider_only_spark-connect-client-by-default on Mar 15, 2026
@raphaelauv
Contributor

@raphaelauv raphaelauv commented Jan 1, 2026

The current apache.spark provider requires the pyspark package (more than 400 MB),

whereas I would like to use only pyspark-client (Spark Connect, 1.5 MB) to trigger a Spark job:

    @task.pyspark(conn_id="spark_connect")
    def my_job(spark, sc):
        df = spark.range(100).filter("id % 2 = 0")
        print(df.count())

# or 

    def my_pyspark_job(spark):
        df = spark.range(100).filter("id % 2 = 0")
        print(df.count())

    PySparkOperator(
        python_callable=my_pyspark_job, conn_id="spark_connect", task_id="spark_pyspark_job"
    )
        

This is a breaking change, but it will make things lighter by default. WDYT? Thanks.

(BTW, pyspark-client is only available since Spark 4.0 and needs "grpcio >= 1.67.0", which conflicts with apache-beam.)
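To make the grpcio constraint mentioned above concrete, here is a minimal sketch of checking an installed version against the "1.67.0" minimum. The helper names (`parse_version`, `meets_minimum`, `installed_grpcio_ok`) are hypothetical and not part of the provider; the parsing is deliberately simplistic (numeric dotted parts only) and is not a full version-specifier implementation.

```python
from importlib import metadata


def parse_version(text):
    """Turn '1.67.0' into (1, 67, 0); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum="1.67.0"):
    """Compare two dotted versions after padding them to the same length."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_grpcio_ok(minimum="1.67.0"):
    """Return True/False for the installed grpcio, or None if it is absent."""
    try:
        return meets_minimum(metadata.version("grpcio"), minimum)
    except metadata.PackageNotFoundError:
        return None
```

Calling `installed_grpcio_ok()` in an environment that also pins apache-beam would show whether the resolver kept grpcio below the minimum.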

@raphaelauv raphaelauv changed the title feat: add pyspark-client as default and make pyspark package optional Make pyspark-client as default and pyspark package optional Jan 1, 2026
@potiuk
Member

potiuk commented Jan 3, 2026

Good idea, but we need to wait until apache-beam fixes its grpcio limit. Maybe check whether there is an existing issue, or open a new issue with them?

@raphaelauv
Contributor Author

raphaelauv commented Jan 4, 2026

Yes, we need apache-beam to support grpcio >= 1.67.

you already created it -> apache/beam#34081

@potiuk
Member

potiuk commented Jan 4, 2026

you already created it -> apache/beam#34081

Heh... almost a year ago. Maybe it's time to follow up?

@potiuk
Member

potiuk commented Jan 4, 2026

I raised my comments in the discussion grpc/grpc#37710 (comment)

@github-actions
Contributor

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions Bot added the "stale" label (Stale PRs per the .github/workflows/stale.yml policy file) Feb 19, 2026
@raphaelauv
Contributor Author

Not stale -> #61926

@github-actions github-actions Bot removed the "stale" label (Stale PRs per the .github/workflows/stale.yml policy file) Feb 20, 2026
@raphaelauv raphaelauv force-pushed the feat/spark_provider_only_spark-connect-client-by-default branch from e25155b to da01129 Compare February 25, 2026 10:44
@raphaelauv raphaelauv force-pushed the feat/spark_provider_only_spark-connect-client-by-default branch 3 times, most recently from cd4c01b to 624e3c3 Compare March 11, 2026 09:09
Comment thread providers/apache/spark/pyproject.toml
Contributor

@eladkal eladkal left a comment

This needs an entry at the top of the provider changelog explaining to users the reasoning and what they should do if they want to keep the current behavior.
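For readers who want to see which of the two distributions an environment ends up with after such a change, here is a minimal sketch using only the standard library. It assumes the distribution names "pyspark" and "pyspark-client" taken from the PR title; the helper names (`installed_spark_distributions`, `describe`) are hypothetical, and this is not how the provider itself selects a package.

```python
from importlib import metadata


def installed_spark_distributions(names=("pyspark", "pyspark-client")):
    """Map each distribution name to its installed version, or None."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def describe(found):
    """Summarize which Spark Python distribution is present."""
    if found.get("pyspark"):
        return "full pyspark"
    if found.get("pyspark-client"):
        return "spark connect client only"
    return "no pyspark distribution"


if __name__ == "__main__":
    print(describe(installed_spark_distributions()))
```

The summary checks the full package first, since an environment that has it installed does not depend on the lightweight client.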

@potiuk
Member

potiuk commented Mar 11, 2026

This is a breaking change, but it will make things lighter by default. WDYT? Thanks.

Yeah. Just add an entry to the changelog :)

@potiuk potiuk changed the title Make pyspark-client as default and pyspark package optional [breaking] Make pyspark-client as default and pyspark package optional Mar 11, 2026
@potiuk
Member

potiuk commented Mar 11, 2026

I also added "[breaking]" to the title of the PR. I think that might be one of the ways we mark breaking changes, @eladkal? We discussed it before. I think adding "[breaking]" to the title is a very nice way to do it: the RM can then remove it when preparing release notes. But we can also detect it automatically.

I am also going to employ an LLM (optionally, following the experience with auto-triage) to prepare the release notes, so this might naturally work when we use an LLM to do it.

@eladkal
Contributor

eladkal commented Mar 14, 2026

@raphaelauv can you please add the entry to the top of the changelog? I will merge the PR afterwards.

@raphaelauv raphaelauv force-pushed the feat/spark_provider_only_spark-connect-client-by-default branch from 2698101 to aec8858 Compare March 14, 2026 10:13
Comment thread providers/apache/spark/docs/changelog.rst
@raphaelauv raphaelauv force-pushed the feat/spark_provider_only_spark-connect-client-by-default branch from bae87da to c0a8e91 Compare March 14, 2026 11:13
@raphaelauv raphaelauv force-pushed the feat/spark_provider_only_spark-connect-client-by-default branch from f69e62e to 24bb523 Compare March 14, 2026 12:33
@raphaelauv
Contributor Author

raphaelauv commented Mar 15, 2026

Hey @eladkal, I rebased and added the comment on the "why".

Thanks

@eladkal eladkal merged commit b0207a9 into apache:main Mar 15, 2026
105 checks passed
@raphaelauv raphaelauv deleted the feat/spark_provider_only_spark-connect-client-by-default branch March 15, 2026 10:33
abhijeets25012-tech pushed a commit to abhijeets25012-tech/airflow that referenced this pull request Apr 9, 2026
apache#60031)

* feat: add pyspark-client as default and make pyspark package optional

---------

Co-authored-by: raphaelauv <raphaelauv@users.noreply.github.com>