[SPARK-57247][SQL][CONNECT] Support DataFrame.zip in Spark Connect by zhengruifeng · Pull Request #56300 · apache/spark

zhengruifeng · 2026-06-03T12:20:39Z

What changes were proposed in this pull request?

This is the follow-up to #54976 ([SPARK-55886]) which implemented DataFrame.zip for the classic path and deferred Spark Connect support. This PR wires up the Connect path end-to-end.

- Protocol (relations.proto): adds a Zip message with left and right Relation fields (field 48 in the Relation oneof). Python stubs regenerated via the connect-gen-protos Docker image (buf 1.66.1 + mypy 1.19.1 + mypy-protobuf 3.3.0 + ruff 0.14.8).
- Server (SparkConnectPlanner): adds transformZip, which directly constructs the unresolved logical.Zip(left, right) plan and is dispatched via RelTypeCase.ZIP. ResolveZip then runs during analysis, same as the classic path.
- Scala Connect Dataset: replaces the UnsupportedOperationException stub with sparkSession.newDataFrame { builder => builder.getZipBuilder.setLeft(...).setRight(...) }, following the crossJoin/buildJoin pattern.
- Python Connect plan.py: adds class Zip(LogicalPlan) following the NearestByJoin pattern.
- Python Connect dataframe.py: replaces the PySparkNotImplementedError stub with a plan.Zip call; removes the doctest suppression (del DataFrame.zip.__doc__) that was added when Connect was unsupported.
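
The client-to-server flow described above can be sketched in plain Python. This is an illustrative model only, not the code in this PR: nested dicts stand in for the protobuf `Relation` message, and `Read`, `UnresolvedZip`, and `transform_relation` are hypothetical names chosen for the example. It shows a client-side `Zip` plan node emitting a relation with `left` and `right` children, and a planner-style dispatch that turns that relation into an unresolved node for a later analysis step to resolve.

```python
from dataclasses import dataclass
from typing import Any, Dict


class LogicalPlan:
    """Stand-in for a client-side plan node (hypothetical, not pyspark's class)."""

    def plan(self) -> Dict[str, Any]:
        raise NotImplementedError


@dataclass
class Read(LogicalPlan):
    """Hypothetical leaf node that reads a named table."""

    table: str

    def plan(self) -> Dict[str, Any]:
        return {"read": {"named_table": self.table}}


@dataclass
class Zip(LogicalPlan):
    """Client-side node with two child plans, mirroring a message with left/right fields."""

    left: LogicalPlan
    right: LogicalPlan

    def plan(self) -> Dict[str, Any]:
        # Exactly one key is set per dict, imitating a protobuf oneof.
        return {"zip": {"left": self.left.plan(), "right": self.right.plan()}}


@dataclass
class UnresolvedZip:
    """Stand-in for the server-side unresolved logical node (hypothetical)."""

    left: Any
    right: Any


def transform_relation(rel: Dict[str, Any]) -> Any:
    """Dispatch on which oneof-like field is set and build the server-side plan."""
    ((rel_type, body),) = rel.items()
    if rel_type == "read":
        return ("table", body["named_table"])
    if rel_type == "zip":
        # Construct the unresolved node directly; resolution happens later.
        return UnresolvedZip(
            transform_relation(body["left"]),
            transform_relation(body["right"]),
        )
    raise ValueError(f"unsupported relation type: {rel_type}")


proto = Zip(Read("t1"), Read("t2")).plan()
logical = transform_relation(proto)
```

In this sketch the planner does no validation of the two children; it only mirrors the structure, which matches the description's point that the unresolved plan is built directly and resolved during analysis.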

Why are the changes needed?

DataFrame.zip was merged (#54976) with Connect deferred. This PR completes the implementation so Connect users can use zip on equal footing with the classic path.

Does this PR introduce any user-facing change?

Yes. DataFrame.zip now works on the Spark Connect path. Previously it raised PySparkNotImplementedError: [NOT_IMPLEMENTED] zip is not implemented.

How was this patch tested?

- test_parity_zip.py: runs the full DataFrameZipTestsMixin (basic projections, expressions, one-sided base, withColumn, chained withColumn, longer chains, parent-with-chained-child, withColumnRenamed, scalar Python UDF, pandas UDF, and two error cases) against a Connect session.
- test_connect_plan.py: asserts that the proto plan for left.zip(right) has the zip field set with the expected left/right sources.
- PlanGenerationTestSuite: serializes a zip plan to proto and compares against a new golden file (zip.proto.bin).
- ProtoToParsedPlanTestSuite: deserializes the proto golden file, runs it through SparkConnectPlanner + Analyzer, and compares the explained plan against zip.explain.
- DataFrameSuite (Connect): end-to-end test that zips two projections over a Connect session and asserts the resulting columns and values.
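
The golden-file checks listed above follow a common pattern that can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not the Spark test suites themselves: the plan is a plain dict serialized as canonical JSON rather than protobuf, and `serialize`, `check_golden`, and the file name `zip.golden.json` are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def serialize(plan: dict) -> bytes:
    # Deterministic encoding so a byte-for-byte comparison is meaningful.
    return json.dumps(plan, sort_keys=True, separators=(",", ":")).encode("utf-8")


def check_golden(plan: dict, golden: Path, regenerate: bool = False) -> bool:
    """Return True if the serialized plan matches the golden file.

    With regenerate=True the golden file is (re)written instead of compared.
    """
    data = serialize(plan)
    if regenerate:
        golden.write_bytes(data)
        return True
    return golden.read_bytes() == data


zip_plan = {"zip": {"left": {"read": "t1"}, "right": {"read": "t2"}}}
# Same shape with the children swapped, to show that a plan change is detected.
swapped = {"zip": {"left": zip_plan["zip"]["right"], "right": zip_plan["zip"]["left"]}}

with tempfile.TemporaryDirectory() as tmp:
    golden_path = Path(tmp) / "zip.golden.json"
    first_run = check_golden(zip_plan, golden_path, regenerate=True)
    unchanged = check_golden(zip_plan, golden_path)
    detects_change = not check_golden(swapped, golden_path)
```

The value of the pattern is that any unintended change to how a plan is serialized shows up as a golden-file mismatch, while intended changes are recorded by regenerating the file.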

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code

Implements the `DataFrame.zip` operation on the Spark Connect path, following up on apache#54976 which deferred Connect support.

**Protocol (relations.proto)**
Adds a `Zip` message with `left` and `right` `Relation` fields (field number 48 in the `Relation` oneof). Python stubs regenerated via the connect-gen-protos Docker image (buf 1.66.1 + mypy 1.19.1 + mypy-protobuf 3.3.0 + ruff 0.14.8).

**Server (SparkConnectPlanner)**
Adds `transformZip` that directly constructs the unresolved `logical.Zip(left, right)` plan, dispatched via `RelTypeCase.ZIP`. `ResolveZip` then runs during analysis, same as the classic path.

**Scala Connect Dataset**
Replaces the `UnsupportedOperationException` stub with `sparkSession.newDataFrame { builder => builder.getZipBuilder... }`, following the `crossJoin`/`buildJoin` pattern.

**Python Connect**
- `plan.py`: adds `class Zip(LogicalPlan)` following the `NearestByJoin` pattern.
- `dataframe.py`: replaces the `PySparkNotImplementedError` stub with a `plan.Zip` call; removes the doctest suppression.
- `test_parity_zip.py`: runs the full `DataFrameZipTestsMixin` against Connect instead of asserting `NOT_IMPLEMENTED`.

Generated-by: Claude Code

- DataFrameSuite: end-to-end test that zips two projections of the same DataFrame and asserts the column names and collected values.
- PlanGenerationTestSuite: "zip" test serializes the plan to proto and compares against the new golden file (zip.proto.bin / zip.json).
- ProtoToParsedPlanTestSuite: "zip" test deserializes the proto.bin golden file, runs it through SparkConnectPlanner + Analyzer, and compares the explained plan against the new zip.explain golden file.
- test_connect_plan.py: test_zip asserts that the proto plan for `left.zip(right)` has the `zip` field set with the expected left/right read sources.

Generated-by: Claude Code

Closes #56300 from zhengruifeng/spark-dev-2-df-zip-connect-dev2.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
(cherry picked from commit f3f5677)
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>

zhengruifeng · 2026-06-06T01:40:35Z

thanks, merged to master/4.x

zhengruifeng changed the title ~~[WIP][SQL][PYTHON][CONNECT] Support DataFrame.zip in Spark Connect~~ [SPARK-57247][SQL][CONNECT] Support DataFrame.zip in Spark Connect Jun 3, 2026

zhengruifeng force-pushed the spark-dev-2-df-zip-connect-dev2 branch from 976d13c to c83b782 Compare June 3, 2026 12:48

HyukjinKwon approved these changes Jun 3, 2026

zhengruifeng force-pushed the spark-dev-2-df-zip-connect-dev2 branch from a7a6066 to 104b48d Compare June 4, 2026 01:02

zhengruifeng added 2 commits June 4, 2026 10:55

zhengruifeng force-pushed the spark-dev-2-df-zip-connect-dev2 branch from 104b48d to aaddf3a Compare June 4, 2026 10:55

zhengruifeng marked this pull request as ready for review June 4, 2026 13:06

zhengruifeng requested a review from cloud-fan June 5, 2026 09:51

cloud-fan approved these changes Jun 5, 2026

zhengruifeng closed this in f3f5677 Jun 6, 2026

zhengruifeng deleted the spark-dev-2-df-zip-connect-dev2 branch June 6, 2026 01:40

dongjoon-hyun mentioned this pull request Jun 18, 2026

[SPARK-57522] Update Spark Connect-generated Swift source code with 4.2.0-rc3 apache/spark-connect-swift#422

Closed

zhengruifeng wants to merge 2 commits into apache:master from zhengruifeng:spark-dev-2-df-zip-connect-dev2
