[SPARK-52223][CONNECT] Add SDP Spark Connect Protos by aakash-db · Pull Request #50942 · apache/spark

aakash-db · 2025-05-19T18:46:17Z

What changes were proposed in this pull request?

Adds the Spark Connect API for Spark Declarative Pipelines: https://issues.apache.org/jira/browse/SPARK-51727.

This adds the following protos:

CreateDataflowGraph creates a new graph in the registry.
DefineDataset and DefineFlow register elements to the created graph. Datasets are the nodes of the dataflow graph, and are either tables or views, and flows are the edges connecting them.
StartRun starts a run, which is a single execution of a graph.
StopRun stops an existing run, while DropPipeline stops any current runs and drops the pipeline.
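As a rough illustration of the graph model described above (datasets as nodes, flows as edges), here is a minimal in-memory sketch. It is not the Spark implementation: `GraphRegistry`, its method names, and the explicit `sources` list are hypothetical stand-ins for the proto commands, and in the real API a flow's inputs would presumably be derived from its relation rather than passed in.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Flow:
    name: str
    target_dataset: str  # dataset the flow writes to
    sources: List[str]   # datasets the flow reads from (assumption: passed explicitly)


@dataclass
class DataflowGraph:
    datasets: Dict[str, str] = field(default_factory=dict)  # name -> "table" or "view"
    flows: List[Flow] = field(default_factory=list)


class GraphRegistry:
    """Hypothetical in-memory registry, for illustration only."""

    def __init__(self) -> None:
        self._graphs: Dict[str, DataflowGraph] = {}

    def create_dataflow_graph(self) -> str:
        # Mirrors CreateDataflowGraph: register an empty graph and return its id.
        graph_id = str(uuid.uuid4())
        self._graphs[graph_id] = DataflowGraph()
        return graph_id

    def define_dataset(self, graph_id: str, name: str, dataset_type: str) -> None:
        # Mirrors DefineDataset: datasets are either tables or views.
        if dataset_type not in ("table", "view"):
            raise ValueError(f"unknown dataset type: {dataset_type}")
        self._graphs[graph_id].datasets[name] = dataset_type

    def define_flow(self, graph_id: str, name: str, target: str, sources: List[str]) -> None:
        # Mirrors DefineFlow: a flow must write to a dataset that already exists.
        graph = self._graphs[graph_id]
        if target not in graph.datasets:
            raise KeyError(f"flow {name!r} targets undefined dataset {target!r}")
        graph.flows.append(Flow(name, target, sources))

    def edges(self, graph_id: str) -> List[Tuple[str, str]]:
        # Each (source, target) pair is one edge of the dataflow graph.
        graph = self._graphs[graph_id]
        return [(src, f.target_dataset) for f in graph.flows for src in f.sources]


registry = GraphRegistry()
gid = registry.create_dataflow_graph()
registry.define_dataset(gid, "raw_events", "table")
registry.define_dataset(gid, "daily_counts", "view")
registry.define_flow(gid, "count_events", "daily_counts", ["raw_events"])
print(registry.edges(gid))  # [('raw_events', 'daily_counts')]
```
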

It also adds the new PipelineCommand object to the ExecutePlanRequest and the PipelineCommand.Response to the ExecutePlanResponse object.

Why are the changes needed?

Base API of Spark Declarative Pipelines. Implementation coming in future PRs.

Does this PR introduce any user-facing change?

Yes - creates new proto API within Spark Connect.

How was this patch tested?

N/A

Was this patch authored or co-authored using generative AI tooling?

No.

sryza

Awesome. A few comments – mostly cosmetic.

sryza · 2025-05-19T21:40:47Z

+    // An unresolved relation that defines the dataset's flow.
+    spark.connect.Relation plan = 4;
+
+    // Default SQL configurations set when running this flow.


Nitpick: is the word "Default" relevant here? There's nothing more specific, right?

How is this related to the session in which the flow is defined? Is this an additional way to set configurations? I assume this takes precedence over what the session has configured?

Yeah, no need to say default - there is no more specific mechanism to set confs.

How is this related to the session in which the flow is defined? Is this an additional way to set configurations? I assume this takes precedence over what the session has configured?

For now, this is not supported. Users have to set confs directly in the table / flow decorators for them to be applied to the pipeline.

sryza · 2025-05-19T21:44:51Z

+
+  message DefineSqlGraphElements {
+    optional string dataflow_graph_id = 1;
+    optional string sql_file_name = 2;


Something that occurred to me recently is that there could be SQL files with the same name in different subdirs. Should this be sql_file_path?

I think this is a filepath in implementation, actually. Let me confirm.

Where is this path pointing to?

Changed to file_path. We'll rename this in the implementation too.

Where is this path pointing to?

@hvanhovell this path is the local path to the SQL file. It's mostly used for disambiguation in our observability.
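The name-versus-path concern raised in this thread can be shown with a small sketch. The directory layout below is invented for illustration; the point is only that keying by file name collapses two distinct files into one entry, while keying by path keeps them apart.

```python
from pathlib import PurePosixPath

# Hypothetical project layout: two SQL files share the same base name.
sql_files = [
    PurePosixPath("pipelines/bronze/ingest.sql"),
    PurePosixPath("pipelines/silver/ingest.sql"),
]

# Keyed by file name: the second file overwrites the first.
by_name = {p.name: p for p in sql_files}

# Keyed by full path: both files remain distinguishable.
by_path = {str(p): p for p in sql_files}

print(len(by_name))  # 1
print(len(by_path))  # 2
```
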

hvanhovell · 2025-05-20T01:51:55Z

+    map<string, string> sql_conf = 5;
+
+    // If true, this flow will only be run once per execution.
+    bool once = 6;


Care to elaborate? Is this a synonym for batch?

This corresponds to Trigger.Once in Spark - the flow runs once per update. This is similar to batch in triggered updates, but not in continuous ones (which we will add eventually).

hvanhovell · 2025-05-20T02:00:23Z

+
+// A response containing events emitted during the run of a pipeline.
+message PipelineEventsResult {
+  repeated PipelineEvent events = 1;


Batching events should not be needed. gRPC server side streaming can return multiple 'events' at the same time, provided it can fit them in a single window (~30k).

That's fair. But I think the repeated field adds more flexibility in general. We can group events logically, rather than just to avoid network latency.

Per further feedback from @grundprinzip and @hvanhovell, I'm going to take this batching out. We can always add it back in the future if we come up with a use case for logical grouping.
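To make the trade-off in this thread concrete, here is a sketch that uses plain Python generators as a stand-in for a server-side stream. It does not use gRPC, and the `PipelineEvent` class and its `message` field are assumptions made for illustration. Both shapes deliver the same events; the one-event-per-message shape simply needs no batching logic on the producer and no inner loop on the consumer.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class PipelineEvent:
    # Hypothetical field for illustration; the real proto may differ.
    message: str


def stream_single(events: Iterable[PipelineEvent]) -> Iterator[PipelineEvent]:
    """One event per response message."""
    for event in events:
        yield event


def stream_batched(events: Iterable[PipelineEvent], size: int) -> Iterator[List[PipelineEvent]]:
    """Up to `size` events per response message (the shape that was taken out)."""
    batch: List[PipelineEvent] = []
    for event in events:
        batch.append(event)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Flush the final, possibly partial, batch.
        yield batch


events = [PipelineEvent(f"flow {i} completed") for i in range(3)]

# Consumer for the single-event shape: one loop.
single = [e.message for e in stream_single(events)]

# Consumer for the batched shape: an extra inner loop over each batch.
batched = [e.message for batch in stream_batched(events, size=2) for e in batch]
```
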

hvanhovell · 2025-05-20T02:02:10Z

+  repeated PipelineEvent events = 1;
+}
+
+message PipelineEvent {


Is this also supposed to include errors? If so, it'd be nice to understand what has failed... In that case, adding the flow/dataset name would be nice.

Yeah, I can see the value in adding the dataset and flow name. But two things:

OTOH, we wanted to keep PipelineEvent as a generic event bus rather than a structured logging format.

It's possible an error happens that isn't scoped to a dataset/flow, making this field unpredictably empty.

But at the very least, the dataset/flow name will be in the error message.

To add on to what @aakash-db said, our main use case for these events is to print out to the console, and the string messages will include all the context that's needed for that. Once we have a use case that involves consuming the dataset/flow name programmatically, I'd be supportive of adding more structure to this.

Btw, errors should flow the regular way through the exception process and the error details. If we were to do it differently it would just create issues later.

@grundprinzip I actually agree with you. If the pipeline fails we should fail in the normal way. However, that failure can originate from multiple places. As a user, I would like to be able to figure out what failed. We could embed that failure information in these events.

sryza

Just a few remaining comments

sryza · 2025-05-21T14:58:48Z

+// Parses the SQL file and registers all datasets and flows.
+message DefineSqlGraphElements {
+  // The graph to attach this dataset to.
+  optional string dataflow_graph_id = 1;


I noticed that this is marked optional, but that the corresponding field in DefineDataset is not. How should we decide when to use optional?

cc @hvanhovell if there's a general recommendation on this.

What optional does is generate a has<FIELD> method in Java. We can use that to throw an exception when a field isn't present. Else, the field always has an empty string value.

So really, all of our primitives should have an optional designation. I will change that.

made all of these optional.
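The presence semantics discussed here can be sketched without the generated protobuf classes. In proto3, a string field that is not marked `optional` has no presence tracking, so an unset field reads back as `""`; marking it `optional` lets the generated code report whether it was set. The dataclass below only emulates that distinction (`None` for "not set"), and `has_dataflow_graph_id` and `require_graph_id` are hypothetical names, not generated API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DefineSqlGraphElements:
    # None models "field not set"; "" models an explicitly set empty string.
    dataflow_graph_id: Optional[str] = None

    def has_dataflow_graph_id(self) -> bool:
        # Stand-in for the generated has<FIELD> presence check.
        return self.dataflow_graph_id is not None


def require_graph_id(cmd: DefineSqlGraphElements) -> str:
    """Reject commands whose graph id was never set, instead of treating it as ""."""
    if not cmd.has_dataflow_graph_id():
        raise ValueError("dataflow_graph_id is required but was not set")
    return cmd.dataflow_graph_id
```

With this shape, `require_graph_id(DefineSqlGraphElements("g1"))` returns `"g1"`, while `require_graph_id(DefineSqlGraphElements())` raises `ValueError`.
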

sryza · 2025-05-21T15:01:08Z

+  repeated PipelineEvent events = 1;
+}
+
+message PipelineEvent {


To add on to what @aakash-db said, our main use case for these events is to print out to the console, and the string messages will include all the context that's needed for that. Once we have a use case that involves consuming the dataset/flow name programmatically, I'd be supportive of adding more structure to this.

sryza

LGTM! Will of course defer to @hvanhovell on any Spark Connect / proto / gRPC conventions.

grundprinzip

Mostly nits, but looks good.

grundprinzip · 2025-05-26T20:24:39Z

+  repeated PipelineEvent events = 1;
+}
+
+message PipelineEvent {


Btw, errors should flow the regular way through the exception process and the error details. If we were to do it differently it would just create issues later.

grundprinzip · 2025-05-26T20:27:47Z

+
+// A response containing events emitted during the run of a pipeline.
+message PipelineEventsResult {
+  repeated PipelineEvent events = 1;


The doc should be more explicit about how "complete" the set of events is that you receive here. Are these all events or just some? How do you know if more are coming or not?

Generally, I'd stand with Herman that if you don't expect to emit thousands of events per second, your code will be easier and simpler if you don't use a repeated field here and simply emit one event per message.

grundprinzip · 2025-05-27T10:14:36Z

+// The type of dataset.
+enum DatasetType {
+  // Safe default value. Should not be used.
+  DATASET_UNSPECIFIED = 0;


Linter rule should say: DATASET_TYPE_UNSPECIFIED

https://protobuf.dev/programming-guides/style/#enums
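The naming convention referenced here (zero value named after the enum type, with an `UNSPECIFIED` suffix) can be expressed as a small helper. This is only a sketch of the convention, not the linter's actual rule; `expected_zero_value` is a hypothetical name, and the simple regex treats every capital letter as a word boundary, so it would mis-split acronyms.

```python
import re


def expected_zero_value(enum_name: str) -> str:
    """Return the conventional zero-value name for a proto enum type name."""
    # Insert "_" before each capital letter except the first, then upper-case.
    # Assumption: the type name is plain CamelCase without acronyms.
    snake = re.sub(r"(?<!^)(?=[A-Z])", "_", enum_name).upper()
    return f"{snake}_UNSPECIFIED"


print(expected_zero_value("DatasetType"))  # DATASET_TYPE_UNSPECIFIED
```
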

sryza · 2025-05-27T14:27:16Z

One more thing - should we regenerate the Python stubs as part of this PR?

grundprinzip · 2025-05-27T14:28:41Z

yes please run dev/generate-connect-protos.sh (or similar ;) )

### What changes were proposed in this pull request?

Adds the Spark Connect API for Spark Declarative Pipelines: https://issues.apache.org/jira/browse/SPARK-51727.

This adds the following protos:

1. `CreateDataflowGraph` creates a new graph in the registry.
2. `DefineDataset` and `DefineFlow` register elements to the created graph. Datasets are the nodes of the dataflow graph, and are either tables or views, and flows are the edges connecting them.
3. `StartRun` starts a run, which is a single execution of a graph.
4. `StopRun` stops an existing run, while `DropPipeline` stops any current runs and drops the pipeline.

It also adds the new `PipelineCommand` object to the `ExecutePlanRequest` and the `PipelineCommand.Response` to the `ExecutePlanResponse` object.

### Why are the changes needed?

Base API of Spark Declarative Pipelines. Implementation coming in future PRs.

### Does this PR introduce _any_ user-facing change?

Yes - creates new proto API within Spark Connect.

### How was this patch tested?

N/A

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#50942 from aakash-db/pipeline-spark-connect-api.

Lead-authored-by: Aakash Japi <aakash.japi@databricks.com>
Co-authored-by: Sandy Ryza <sandy.ryza@databricks.com>
Signed-off-by: Sandy Ryza <sandyryza@gmail.com>

dongjoon-hyun

Is there any reason for us to have an exception here? Otherwise, do you think we can rename DefineFlow.plan to DefineFlow.relation for consistency, @aakash-db , @sryza , @HyukjinKwon ?

dongjoon-hyun · 2025-07-10T07:57:44Z

+    optional string target_dataset_name = 3;
+
+    // An unresolved relation that defines the dataset's flow.
+    optional spark.connect.Relation plan = 4;


This looks like a typo. To be consistent with other Apache Spark code, can we rename plan to relation? This instance seems to be the only exception.

$ git grep 'plan: Spark_Connect_Plan' | wc -l
16

$ git grep 'relation: Spark_Connect_Relation' | wc -l
10

Thanks for catching this @dongjoon-hyun – I filed an issue to track: https://issues.apache.org/jira/browse/SPARK-52757.

Thank you so much, @sryza !

I've opened a PR: #51442

Thank you, @peter-toth .

### What changes were proposed in this pull request?

This is a minor follow-up change to #50942 (comment)

### Why are the changes needed?

Naming consistency.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #51442 from peter-toth/SPARK-52757-rename-plan-to-relation.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

aakash-db added 2 commits May 16, 2025 15:27

1

137f060

2

dcc3b68

github-actions Bot added SQL CONNECT labels May 19, 2025

Merge branch 'master' of github.com:apache/spark into pipeline-spark-…

ac564b5

…connect-api

aakash-db changed the title ~~[WIP] Add SDP Spark Connect Protos~~ [SPARK-52223] Add SDP Spark Connect Protos May 19, 2025

sryza reviewed May 19, 2025

HyukjinKwon changed the title ~~[SPARK-52223] Add SDP Spark Connect Protos~~ [SPARK-52223][CONNECT] Add SDP Spark Connect Protos May 20, 2025

hvanhovell reviewed May 20, 2025

aakash-db added 2 commits May 20, 2025 13:03

1

2b2f637

license

ad5b039

aakash-db requested review from hvanhovell and sryza May 20, 2025 20:04

sryza reviewed May 21, 2025

aakash-db added 2 commits May 21, 2025 12:18

2

988a304

comments

f855349

aakash-db requested a review from sryza May 21, 2025 19:27

sryza approved these changes May 22, 2025

sryza self-assigned this May 25, 2025

grundprinzip reviewed May 26, 2025

grundprinzip reviewed May 27, 2025

hvanhovell approved these changes May 27, 2025

grundprinzip approved these changes May 27, 2025

sryza added 2 commits May 27, 2025 08:48

fix once docstring

5042b88

take out event batching

4f86648

sryza force-pushed the pipeline-spark-connect-api branch from c1f1f49 to 4f86648 Compare May 27, 2025 15:53

Generate python bindings

b1c2a4b

github-actions Bot added the PYTHON label May 27, 2025

sryza closed this in da68d10 May 27, 2025

dongjoon-hyun reviewed Jul 10, 2025

peter-toth mentioned this pull request Jul 10, 2025

[SPARK-52757][CONNECT] Rename "plan" field in DefineFlow to "relation" #51442

Closed

JiaqiWang18 mentioned this pull request Jul 10, 2025

[SPARK-52757][CONNECT] Rename DefineFlow.plan to relation #51443

Closed
