[SPARK-52463][SDP] Add support for cluster_by in Python Pipelines APIs by sryza · Pull Request #52831 · apache/spark

sryza · 2025-11-02T01:55:05Z

What changes were proposed in this pull request?

In the @table and @materialized_view decorators, accept a cluster_by argument that determines the clustering columns.

Why are the changes needed?

Parity with the clusterBy argument accepted by DataStreamReader and DataFrameWriter.

Does this PR introduce any user-facing change?

Adds a new parameter to public APIs.

How was this patch tested?

Unit tests and integration tests.

Was this patch authored or co-authored using generative AI tooling?

gengliangwang · 2025-11-05T01:14:36Z

+                     |
+                     |spark = SparkSession.active()
+                     |
+                     |@dp.materialized_view(cluster_by = ["cluster_col1"])


let's also test non-existent columns

I also wonder what will happen if the array is empty

👍 Adding a non-existent-column test to MaterializeTablesSuite and an empty-array test to PythonPipelineSuite (with an empty array, the table gets no clustering columns).
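The two cases raised here can be sketched in plain Python. This is a hypothetical helper, not code from the PR: `resolve_cluster_columns` is an invented name, and raising on an unknown column is an assumption about the desired behavior. The empty-list case mirrors the reply above, where the table ends up with no clustering columns.

```python
from typing import List, Optional, Sequence


def resolve_cluster_columns(schema_columns: Sequence[str],
                            cluster_by: Optional[Sequence[str]]) -> List[str]:
    """Return the clustering columns to apply, or [] when none are requested."""
    if not cluster_by:
        # Covers both None and an empty list: no clustering columns.
        return []
    known = set(schema_columns)
    missing = [c for c in cluster_by if c not in known]
    if missing:
        # Assumption: an unknown column is treated as an error.
        raise ValueError(f"cluster_by references unknown columns: {missing}")
    return list(cluster_by)
```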

gengliangwang · 2025-11-05T01:14:48Z

@@ -885,4 +886,228 @@ abstract class MaterializeTablesSuite extends BaseCoreExecutionTest {
      storageRoot = storageRoot
    )
  }
+
+  test("cluster columns with user schema") {


let's also test non-existent columns

👍 adding

cloud-fan · 2025-11-05T03:49:06Z

@@ -266,22 +266,35 @@ object DatasetManager extends Logging {
    )
    val mergedProperties = resolveTableProperties(table, identifier)
    val partitioning = table.partitionCols.toSeq.flatten.map(Expressions.identity)
+    val clustering = table.clusterCols.map(cols =>
+      ClusterByTransform(cols.map(col => FieldReference(col)).toSeq)


nit: although identical, let's be consistent with partition col handling, and use Expressions.column

dongjoon-hyun · 2025-11-05T06:07:19Z

@@ -104,6 +104,9 @@ message PipelineCommand {
        spark.connect.DataType schema_data_type = 4;
        string schema_string = 5;
      }
+
+      // Optional cluster columns for the table.
+      repeated string cluster_cols = 6;


Can we use the existing name, clustering_columns, consistently instead of adding a new variant, cluster_cols, @sryza, @cloud-fan, @gengliangwang?

spark/sql/connect/common/src/main/protobuf/spark/connect/commands.proto

Lines 144 to 145 in ada1908

// (Optional) Columns used for clustering the table.
repeated string clustering_columns = 10;

In the proto file layer, I hope we can be consistent although this can be different in the API layer (Scala/Java/Python/...)

Makes sense – I just updated the PR to reflect this.
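The split between the API-layer name and the proto-layer name can be pictured with a small Python sketch. The `DefineOutputRequest` dataclass, its `table_name` field, and `build_request` are hypothetical stand-ins, not the generated protobuf classes; only the field name `clustering_columns` and the Python keyword `cluster_by` come from this thread.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class DefineOutputRequest:
    """Hypothetical stand-in for a wire message with a repeated string field."""
    table_name: str
    clustering_columns: List[str] = field(default_factory=list)


def build_request(table_name: str,
                  cluster_by: Optional[Sequence[str]] = None) -> DefineOutputRequest:
    # The Python keyword stays `cluster_by`; the wire field reuses the
    # existing proto name `clustering_columns`.
    return DefineOutputRequest(table_name, list(cluster_by or []))
```

Keeping the translation in one place like this lets each language binding pick an idiomatic parameter name while the wire format stays uniform.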

dongjoon-hyun

Thank you. Could you fix the failures?

[info] *** 50 TESTS FAILED ***
[error] Failed tests:
[error] 	org.apache.spark.sql.connect.pipelines.PythonPipelineSuite
[error] 	org.apache.spark.sql.connect.pipelines.EndToEndAPISuite
[error] 	org.apache.spark.sql.connect.service.SparkConnectSessionHolderSuite

sryza · 2025-11-06T18:43:25Z

@dongjoon-hyun 👍 failures fixed with a rebase – it looks like there was a race condition with a Python protobuf library version bump.

dongjoon-hyun

+1, LGTM. Thank you, @sryza.

### What changes were proposed in this pull request?

In the `table` and `materialized_view` decorators, accept a `cluster_by` argument that determines the clustering columns.

### Why are the changes needed?

Parity with the `clusterBy` argument accepted by `DataStreamReader` and `DataFrameWriter`.

### Does this PR introduce _any_ user-facing change?

Adds a new parameter to public APIs.

### How was this patch tested?

Unit tests and integration tests.

### Was this patch authored or co-authored using generative AI tooling?

Closes #52831 from sryza/cluster-by.

Authored-by: Sandy Ryza <sandy.ryza@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit a927a14)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

dongjoon-hyun · 2025-11-06T18:45:41Z

Merged to master/4.1.

…ift` source code

### What changes were proposed in this pull request?

This PR aims to update Spark Connect-generated Swift source code with Apache Spark `4.1.0`.

### Why are the changes needed?

To use the latest bug fixes and new messages to develop for new features of `4.1.0`.

- apache/spark#53024
- apache/spark#52894
- apache/spark#52890
- apache/spark#52872
- apache/spark#52746
- apache/spark#52831

```
$ git clone -b v4.1.0 https://github.com/apache/spark.git
$ cd spark/sql/connect/common/src/main/protobuf/
$ protoc --swift_out=. spark/connect/*.proto
$ protoc --grpc-swift_out=. spark/connect/*.proto
// Remove empty GRPC files
$ cd spark/connect
$ grep 'This file contained no services' * | awk -F: '{print $1}' | xargs rm
```

### Does this PR introduce _any_ user-facing change?

Pass the CIs.

### How was this patch tested?

Pass the CIs. I manually tested with `Apache Spark 4.1.0`.

```
$ swift test --no-parallel
...
Test run with 203 tests in 21 suites passed after 33.163 seconds.
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #271 from dongjoon-hyun/SPARK-54811.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

github-actions Bot added SQL PYTHON CONNECT labels Nov 2, 2025

sryza changed the title ~~[SDP] Add support for cluster_by in Python Pipelines APIs~~ [SDP][SPARK-52463] Add support for cluster_by in Python Pipelines APIs Nov 2, 2025

sryza marked this pull request as ready for review November 4, 2025 03:18

sryza requested review from cloud-fan and gengliangwang November 4, 2025 04:25

sryza force-pushed the cluster-by branch from d041b61 to ee326da Compare November 5, 2025 00:59

gengliangwang reviewed Nov 5, 2025

cloud-fan reviewed Nov 5, 2025

cloud-fan approved these changes Nov 5, 2025

gengliangwang approved these changes Nov 5, 2025

dongjoon-hyun reviewed Nov 5, 2025

sryza requested review from cloud-fan and gengliangwang November 5, 2025 17:54

dongjoon-hyun requested changes Nov 6, 2025

sryza added 10 commits November 5, 2025 19:10

Add support for cluster_by in Python pipelines APIs

e3ca39f

use error class

c192514

add missing

b518826

protos

1a39ef0

fix

d99c6d0

use Expressions.column

1c98e94

gengliang: add more tests

6f42b43

fix test

3cf344c

clustering_cols

b992ad9

format

edc6bdf

sryza force-pushed the cluster-by branch from 90dbc09 to edc6bdf Compare November 6, 2025 03:11

sryza requested a review from dongjoon-hyun November 6, 2025 18:42

dongjoon-hyun approved these changes Nov 6, 2025

dongjoon-hyun closed this in a927a14 Nov 6, 2025

dongjoon-hyun mentioned this pull request Dec 22, 2025

[SPARK-54811] Use Spark 4.1.0 to regenerate Spark Connect-based Swift source code apache/spark-connect-swift#271

Closed

dongjoon-hyun changed the title ~~[SDP][SPARK-52463] Add support for cluster_by in Python Pipelines APIs~~ [SPARK-52463][SDP] Add support for cluster_by in Python Pipelines APIs Dec 22, 2025
