[SPARK-55950][PYTHON][CONNECT] Add PySpark support for CDC changes() API by gengliangwang · Pull Request #54746 · apache/spark

gengliangwang · 2026-03-10T22:38:41Z

What changes were proposed in this pull request?

Add changes() method to PySpark DataFrameReader and DataStreamReader for both classic and Spark Connect modes.

Classic PySpark:

DataFrameReader.changes(tableName) — delegates to self._jreader.changes(tableName)
DataStreamReader.changes(tableName) — delegates to self._jreader.changes(tableName) with type checking
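
To illustrate the delegation pattern described above, here is a minimal sketch that runs without Spark. `FakeJavaReader` and `SketchStreamReader` are hypothetical stand-ins, and the error type and return value are assumptions for illustration only. This is not the code added by the PR.

```python
from typing import Dict, Tuple


class FakeJavaReader:
    """Hypothetical stand-in for the Py4J-backed JVM reader (self._jreader)."""

    def __init__(self) -> None:
        self.options: Dict[str, str] = {}

    def option(self, key: str, value: str) -> "FakeJavaReader":
        self.options[key] = value
        return self

    def changes(self, table_name: str) -> Tuple[str, Dict[str, str]]:
        # A real JVM reader would return a Java DataFrame; this stand-in
        # simply echoes back what was requested.
        return (table_name, dict(self.options))


class SketchStreamReader:
    """Illustrative reader that validates its argument, then delegates."""

    def __init__(self) -> None:
        self._jreader = FakeJavaReader()

    def option(self, key: str, value: str) -> "SketchStreamReader":
        self._jreader = self._jreader.option(key, value)
        return self

    def changes(self, tableName: str) -> Tuple[str, Dict[str, str]]:
        # Check the type before crossing into the JVM, so a bad argument
        # fails with a Python error rather than a Py4J one.
        if not isinstance(tableName, str):
            raise TypeError(
                f"tableName must be a str, got {type(tableName).__name__}"
            )
        return self._jreader.changes(tableName)
```

The up-front type check is the only logic on the Python side; everything else is forwarded to the underlying reader unchanged.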

Spark Connect PySpark:

New RelationChanges plan class in plan.py that serializes to the RelationChanges protobuf message
DataFrameReader.changes(tableName) — creates RelationChanges plan (batch)
DataStreamReader.changes(tableName) — creates RelationChanges plan with is_streaming=True
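
The plan-class idea above can be sketched roughly as follows. A dataclass stands in for the RelationChanges protobuf message so the example is self-contained. Only `is_streaming` comes from the description; the other field names (`table_name`, `options`) and the class and method names are assumptions, not the actual proto schema or the code in plan.py.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ChangesMessageSketch:
    """Hypothetical stand-in for the RelationChanges protobuf message."""

    table_name: str = ""
    options: Dict[str, str] = field(default_factory=dict)
    is_streaming: bool = False


class ChangesPlanSketch:
    """Illustrative logical-plan node that serializes to the message above."""

    def __init__(
        self,
        table_name: str,
        options: Optional[Dict[str, str]] = None,
        is_streaming: bool = False,
    ) -> None:
        self._table_name = table_name
        self._options = dict(options or {})
        self._is_streaming = is_streaming

    def to_message(self) -> ChangesMessageSketch:
        # Copy the plan's state into the message; batch and streaming reads
        # differ only in the is_streaming flag.
        return ChangesMessageSketch(
            table_name=self._table_name,
            options=dict(self._options),
            is_streaming=self._is_streaming,
        )


# Batch plan, as a DataFrameReader might build it (assumed usage).
batch_msg = ChangesPlanSketch("my_table", {"startingVersion": "1"}).to_message()

# Streaming plan, as a DataStreamReader might build it (assumed usage).
stream_msg = ChangesPlanSketch("my_table", is_streaming=True).to_message()
```

Using one plan class with a boolean flag, rather than separate batch and streaming classes, keeps the two readers symmetric: they differ only in the value passed for `is_streaming`.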

Why are the changes needed?

To expose the CDC changes() API added in #54739 to Python users.

Does this PR introduce any user-facing change?

Yes. PySpark users can now use:

# Batch
df = spark.read.option("startingVersion", "1").changes("my_table")

# Streaming
df = spark.readStream.option("startingVersion", "1").changes("my_table")

How was this patch tested?

7 plan generation tests in test_connect_plan.py covering:

Batch read with version/timestamp options
No-options and multi-part table names
Proto oneof discriminator verification
Streaming via direct plan and via DataStreamReader
print() debug output

Was this patch authored or co-authored using generative AI tooling?

Yes.

zhengruifeng · 2026-03-20T07:07:02Z

@@ -0,0 +1,610 @@
+#


new test files should be added in modules.py

zhengruifeng · 2026-03-20T07:07:23Z

+        return conf
+
+    def _jvm(self):
+        return PySparkSession._instantiatedSession._jvm


how could this work in spark connect?

Added comment explaining why JVM access is needed and how it works through the underlying classic session

If you need both connect session and classic session, you may want to use ReusedMixedTestCase

@zhengruifeng thanks, updated

zhengruifeng · 2026-03-25T03:46:08Z

+    def _make_change_row(self, id, data, change_type, commit_version, commit_timestamp):
+        jvm = self._jvm()
+        UTF8String = jvm.org.apache.spark.unsafe.types.UTF8String
+        row = jvm.org.apache.spark.sql.catalyst.expressions.GenericInternalRow(5)


it seems the tests are heavily relying on the JVM side methods in InMemoryChangelogCatalog?

is it possible to just add basic E2E tests for the new changes API? I don't know.

@cloud-fan @HyukjinKwon do you have any ideas?

I just updated the PR to make the tests lighter. We can't test end to end without the InMemoryChangelogCatalog.

zhengruifeng · 2026-03-26T06:13:28Z

merged to master

gengliangwang marked this pull request as draft March 10, 2026 22:40

gengliangwang and others added 2 commits March 19, 2026 21:19

[SPARK-55950][PYTHON] Add PySpark support for CDC changes() API

fb74922

Add changes() method to PySpark DataFrameReader and DataStreamReader for both classic and Spark Connect modes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

add cdc tests

8ca8612

gengliangwang force-pushed the cdc-pyspark branch from fe1c145 to 8ca8612 March 20, 2026 04:56

update comments

9541f6b

gengliangwang marked this pull request as ready for review March 20, 2026 05:40

gengliangwang requested review from HyukjinKwon, aokolnychyi and zhengruifeng March 20, 2026 05:40

zhengruifeng reviewed Mar 20, 2026

gengliangwang added 2 commits March 20, 2026 11:16

address comments

78afae1

reformat

027d399

gengliangwang requested a review from zhengruifeng March 23, 2026 05:58

gengliangwang added 2 commits March 23, 2026 13:53

try fixing tests

58d41a5

try fixing tests

5899bb9

zhengruifeng reviewed Mar 25, 2026

gengliangwang added 3 commits March 25, 2026 11:51

address new comments

6d8c339

simplify tests

ea8dd91

update tests

7ada3fc

zhengruifeng approved these changes Mar 26, 2026

zhengruifeng changed the title from "[SPARK-55950][PYTHON] Add PySpark support for CDC changes() API" to "[SPARK-55950][PYTHON][CONNECT] Add PySpark support for CDC changes() API" Mar 26, 2026

zhengruifeng closed this in 91e21ee Mar 26, 2026

gengliangwang wants to merge 10 commits into apache:master from gengliangwang:cdc-pyspark
