Skip to content

[SPARK-44971][PYTHON] StreamingQueryProgress event fromJson bug fix#42686

Closed
WweiL wants to merge 4 commits into
apache:branch-3.5from
WweiL:SPARK-44971-fromJson-bugfix
Closed

[SPARK-44971][PYTHON] StreamingQueryProgress event fromJson bug fix#42686
WweiL wants to merge 4 commits into
apache:branch-3.5from
WweiL:SPARK-44971-fromJson-bugfix

Conversation

@WweiL

@WweiL WweiL commented Aug 25, 2023

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

The fromJson method for StreamingQueryProgress excepts the field batchDuration is in the dict.
That method is used internally for converting a json representation of StreamingQueryProgress into python object, commonly created in the Scala side json method of the same object.

But the batchDuration field is not there before #42077, which is only merged to 4.0. Therefore we add a catch there to prevent this method from failing.

Why are the changes needed?

Necessary bug fix

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit test

Was this patch authored or co-authored using generative AI tooling?

No

@dongjoon-hyun

Copy link
Copy Markdown
Member

Thank you, @WweiL . SPARK-44971 seems to be filed as a new feature (maybe by default value). You can switch to Bug issue type.

Screenshot 2023-08-25 at 11 51 54 AM

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-44971][BUG Fix][Python] StreamingQueryProgress event fromJson bug fix [SPARK-44971][Python] StreamingQueryProgress event fromJson bug fix Aug 25, 2023
timestamp=j["timestamp"],
batchId=j["batchId"],
batchDuration=j["batchDuration"],
batchDuration=j["batchDuration"] if "batchDuration" in j else None,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering when Spark receive a JSON file without batchDuration? We are inside the same Spark versions, aren't we? Is this listener able to listen old Spark?

@WweiL WweiL Aug 25, 2023

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is currently how this method is used: in spark connect, the way the listener works, is that

  1. The user's listener code is serialized and sent to spark server
  2. The server starts a scala listener, in which starts a python process (essentially as another connect client), that runs the user's code
  3. Each time a new event comes in, the event on java side is serialized to json and passed to server python process, which calls this fromJson method to convert it back to the actual StreamingQueryProgress object

But before #42077 in 3.5, that field is not added in the jvm json method of StreamingQueryProgress. Here it excepts that to always be presented, hence we get an error

@HyukjinKwon HyukjinKwon changed the title [SPARK-44971][Python] StreamingQueryProgress event fromJson bug fix [SPARK-44971][PYTHON] StreamingQueryProgress event fromJson bug fix Aug 28, 2023
Comment thread python/pyspark/sql/streaming/listener.py Outdated
@WweiL

WweiL commented Aug 30, 2023

Copy link
Copy Markdown
Contributor Author

Hi @HyukjinKwon can we merge this : )

@HyukjinKwon

HyukjinKwon commented Aug 31, 2023

Copy link
Copy Markdown
Member

Merged to master and branch-3.5.

HyukjinKwon pushed a commit that referenced this pull request Aug 31, 2023
### What changes were proposed in this pull request?

The `fromJson` method for `StreamingQueryProgress` excepts the field `batchDuration` is in the dict.
That method is used internally for converting a json representation of `StreamingQueryProgress` into python object, commonly created in the Scala side `json` method of the same object.

But the `batchDuration` field is not there before #42077, which is only merged to 4.0. Therefore we add a catch there to prevent this method from failing.

### Why are the changes needed?

Necessary bug fix

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #42686 from WweiL/SPARK-44971-fromJson-bugfix.

Lead-authored-by: Wei Liu <wei.liu@databricks.com>
Co-authored-by: Wei Liu <z920631580@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@WweiL

WweiL commented Aug 31, 2023

Copy link
Copy Markdown
Contributor Author

@HyukjinKwon Can we just merge this to 3.5

@WweiL

WweiL commented Aug 31, 2023

Copy link
Copy Markdown
Contributor Author

ah nvm I set the merge target to 3.5, should be fine

@HyukjinKwon

Copy link
Copy Markdown
Member

Yup, merged to 3.5.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants