[SPARK-55950][PYTHON][CONNECT] Add PySpark support for CDC changes() API #54746
gengliangwang wants to merge 10 commits into
Add changes() method to PySpark DataFrameReader and DataStreamReader for both classic and Spark Connect modes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fe1c145 to 8ca8612
@@ -0,0 +1,610 @@
#
new test files should be added in modules.py
        return conf

    def _jvm(self):
        return PySparkSession._instantiatedSession._jvm
how could this work in spark connect?
Added comment explaining why JVM access is needed and how it works through the underlying classic session
If you need both connect session and classic session, you may want to use ReusedMixedTestCase
    def _make_change_row(self, id, data, change_type, commit_version, commit_timestamp):
        jvm = self._jvm()
        UTF8String = jvm.org.apache.spark.unsafe.types.UTF8String
        row = jvm.org.apache.spark.sql.catalyst.expressions.GenericInternalRow(5)
it seems the tests are heavily relying on the JVM side methods in InMemoryChangelogCatalog?
is it possible to just add basic E2E tests for the new changes API? I don't know.
@cloud-fan @HyukjinKwon do you have any ideas?
I just updated the PR to make the tests lighter. We can't test end-to-end without the InMemoryChangelogCatalog.
Merged to master.
What changes were proposed in this pull request?
Add a `changes()` method to PySpark `DataFrameReader` and `DataStreamReader` for both classic and Spark Connect modes.

Classic PySpark:
- `DataFrameReader.changes(tableName)`: delegates to `self._jreader.changes(tableName)`
- `DataStreamReader.changes(tableName)`: delegates to `self._jreader.changes(tableName)` with type checking

Spark Connect PySpark:
- `RelationChanges` plan class in `plan.py` that serializes to the `RelationChanges` protobuf message
- `DataFrameReader.changes(tableName)`: creates a `RelationChanges` plan (batch)
- `DataStreamReader.changes(tableName)`: creates a `RelationChanges` plan with `is_streaming=True`

Why are the changes needed?
To expose the CDC `changes()` API added in #54739 to Python users.

Does this PR introduce any user-facing change?
Yes. PySpark users can now use:
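A minimal sketch of such calls, assuming a Spark build that includes this change. `read_changes` is a hypothetical helper (not part of PySpark) that only picks the batch or streaming reader and calls `changes(tableName)` on it. The table name in the usage comment is a placeholder, and reader options are left out because none are stated here.

```python
def read_changes(spark, table_name, streaming=False):
    """Hypothetical helper: return the change feed of `table_name`.

    Assumes `spark` is a SparkSession from a build that includes this PR,
    so that both readers expose `changes(tableName)`.
    """
    # spark.read is a DataFrameReader; spark.readStream is a DataStreamReader.
    reader = spark.readStream if streaming else spark.read
    return reader.changes(table_name)


# Usage, given an active session (placeholder table name):
#   batch_df = read_changes(spark, "catalog.db.events")
#   stream_df = read_changes(spark, "catalog.db.events", streaming=True)
```

The same call shape is meant to work in both classic and Spark Connect sessions, since the PR adds `changes()` to the readers of both.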
How was this patch tested?
7 plan generation tests in `test_connect_plan.py` covering:
- `oneof` discriminator verification
- `DataStreamReader`
- `print()` debug output

Was this patch authored or co-authored using generative AI tooling?
Yes.
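
The Spark Connect side listed under "What changes were proposed" can be pictured with a toy model. The sketch below is not the code from `plan.py`: `ToyRelationChanges`, `ToyReader`, `to_message()`, and the dict keys are all hypothetical stand-ins, used only to show how one plan class might serve both readers through an `is_streaming` flag and what a plan generation test could assert.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyRelationChanges:
    """Hypothetical stand-in for the RelationChanges plan class."""

    table_name: str
    is_streaming: bool = False

    def __post_init__(self):
        # Loosely mirrors the type check mentioned for the classic streaming reader.
        if not isinstance(self.table_name, str):
            raise TypeError("tableName must be a string")

    def to_message(self):
        # A plain dict stands in for the protobuf message; its single key
        # plays the role of the oneof discriminator (key names are assumed).
        return {
            "relation_changes": {
                "table_name": self.table_name,
                "is_streaming": self.is_streaming,
            }
        }


class ToyReader:
    """Hypothetical reader: batch and streaming differ only in the flag."""

    def __init__(self, is_streaming):
        self._is_streaming = is_streaming

    def changes(self, table_name):
        return ToyRelationChanges(table_name, is_streaming=self._is_streaming)


batch_plan = ToyReader(is_streaming=False).changes("events")
stream_plan = ToyReader(is_streaming=True).changes("events")
```

In this toy form, a plan generation test would check which key is set in the message and whether `is_streaming` matches the reader that built the plan.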