[WIP][BEAM-10670] Make SparkRunner opt-out for using an SDF powered Read transform.#12603
lukecwik wants to merge 3 commits into apache:master
Conversation
Run Spark ValidatesRunner

Run Spark StructuredStreaming ValidatesRunner
CC: @iemejia I got this passing all the tests, but what is the state of streaming pipelines? I see that there is code for evaluating UnboundedSources but I don't see any test coverage.
Run Spark Runner Nexmark Tests
iemejia left a comment
You are right, streaming pipelines are not well tested on the Spark runner. At the moment there is no full run of the ValidatesRunner suite for streaming on the classic runner (the only one supporting streaming at the moment). I remember there were issues with test pipelines never stopping when we tried to enable them ~2.5 years ago.
The 'consistency' of watermark handling is validated using a Spark-specific transform called CreateStream that precedes TestStream, but that's probably not really useful for this use case, where I suppose we intend to validate that Read is not broken for both the direct translation and the SDF-based one.
I don't immediately have a suggestion for how to do so. Maybe try to enable the Read VR test only for streaming, but I doubt it will work easily out of the box; otherwise maybe add a specific test for the runner temporarily.
runners/spark/build.gradle

cc: @ibzib @annaqin418
Force-pushed from 920da35 to 92b6e67
@iemejia I have updated the code and added a …
@iemejia I figured out that the issue is that watermark holds aren't implemented for Spark, so the first batch completes and computes new watermarks, meaning the watermark hold that was set by the splittable DoFn implementation is ignored. This leads to timers being dropped and hence only some of the results being produced. This is also the likely cause of the PAssert dropping the elements that were produced, but I haven't validated this yet. Can you explain how the GlobalWatermarkHolder works? Can I register anything as a …? Since watermark holds don't seem to be implemented, does the GroupAlsoViaWindowSet hold back the watermark for elements that it currently has buffered?
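To illustrate the failure mode described above, here is a minimal self-contained sketch (hypothetical names, not Spark runner or Beam code): the output watermark should be capped at the earliest hold, and ignoring holds lets it advance past timestamps that pending event-time timers still depend on.

```java
// Hypothetical sketch, not Beam or Spark runner code: shows why ignoring
// watermark holds causes event-time timers to be dropped.
import java.util.List;

public class WatermarkHoldSketch {

    // With holds honored, the output watermark is capped by the earliest hold.
    static long outputWatermark(long inputWatermark, List<Long> holds) {
        long result = inputWatermark;
        for (long hold : holds) {
            result = Math.min(result, hold);
        }
        return result;
    }

    public static void main(String[] args) {
        long inputWatermark = 100L; // the first batch completed; input advanced to 100

        // Honoring a hold at 40 keeps the watermark at 40, so an event-time
        // timer set for 50 can still fire later.
        System.out.println(outputWatermark(inputWatermark, List.of(40L))); // 40

        // Ignoring holds (the behavior described above) lets the watermark jump
        // to 100, so a timer for 50 is considered past due and gets dropped.
        System.out.println(outputWatermark(inputWatermark, List.of())); // 100
    }
}
```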
I noticed the phenomenon of microbatches producing results early too, back when I was trying to enable the Read.Unbounded tests. I could not understand why, and I thought it was probably due to some glitch in Spark's implementation, or us interfering with their scheduling, but I struggled to debug the issue properly. At least that may explain some of the inconsistencies.
In all honesty I am not so familiar with watermark handling on the Spark runner. I took a look at the GlobalWatermarkHolder class and tried to figure it out, but it was not really evident. My impression is that the sourceId is aligned somehow with Spark's assigned streamId, but I might be misinterpreting it. I wish I could help more, but that part of the code is also not well documented. I doubt that the original authors of the code still remember the details, but maybe they remember at least the intentions of …
The Java based trigger implementation relies on this to produce correct results. Implementing this would likely enable a bunch of streaming use cases.
That would be great if someone could give guidance here.
@iemejia Since streaming is effectively broken due to the lack of support for watermark holds, what do you think about enabling SDF for Spark with it only working in batch? I'll see what I can do about watermark holds, but decoupling the two would be convenient.
Can you be more explicit about what you mean by 'it only working in batch'? Isn't that the current case for Bounded SDF?
A BoundedSource converted to an SDF will work just as well as the current BoundedSource implementation, since neither relies on watermark holds. The current implementation of UnboundedSource and the new implementation using UnboundedSource as an SDF both set the watermark, but triggers don't honor it since there is no support for watermark holds.
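The difference can be sketched with a toy microbatch loop (hypothetical names, not runner code): with holds honored, the event-time timer for the end of a window fires only once the hold set for still-pending elements is released; without them, the reader watermark alone drives firing and the pane is emitted early with only part of the data.

```java
// Hypothetical microbatch sketch, not Beam or Spark runner code: contrasts
// firing a window's pane with and without watermark holds.
public class MicrobatchSketch {

    // batches[i] = {elementTimestamp, holdAfterBatch, readerWatermarkAfterBatch}.
    // Returns how many elements are buffered when the window [0, windowEnd)
    // fires its pane.
    static int paneSizeAtFiring(long windowEnd, long[][] batches, boolean honorHolds) {
        int buffered = 0;
        for (long[] batch : batches) {
            buffered++; // buffer this batch's element into the window
            long readerWatermark = batch[2];
            // With holds, the watermark is capped by the hold set for pending data.
            long watermark = honorHolds ? Math.min(readerWatermark, batch[1]) : readerWatermark;
            if (watermark >= windowEnd) {
                return buffered; // event-time timer fires; pane is emitted now
            }
        }
        return buffered;
    }

    public static void main(String[] args) {
        long windowEnd = 50L;
        // Two batches; after the first, the reader watermark is already 60 but
        // an element with timestamp 30 is still pending, so a hold is set at 30.
        long[][] batches = {
            {10L, 30L, 60L},             // read ts=10, hold at pending ts=30, reader wm=60
            {30L, Long.MAX_VALUE, 70L},  // read ts=30, nothing pending, reader wm=70
        };
        System.out.println(paneSizeAtFiring(windowEnd, batches, true));  // 2 (complete pane)
        System.out.println(paneSizeAtFiring(windowEnd, batches, false)); // 1 (fires early)
    }
}
```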
I see, so it is the full switch from Read.Bounded/Unbounded to SDF by default. Can you get this one green so we can test it and then merge it? I would like to see if there is some perf impact, and probably we should document how to get the previous … If I understood correctly, you might intend to tackle watermark holds in the 'future'? Just out of learning curiosity, I assume this will be done in …
I'll try to see what I can get working with the GlobalWatermarkHolder implementation that exists. I think we should be able to use arbitrary ids in it; it just might be really slow, since the readers/writers should really only care about their upstream watermarks (main and side input), so having a global broadcast seems less than desirable. For now let's break up this change into multiple PRs (Spark already supports bounded SDFs via the SplittableParDoNaiveBounded.OverrideFactory):
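As a rough picture of what 'arbitrary ids' could look like, here is a toy stand-in for a GlobalWatermarkHolder-style registry (hypothetical class and method names, not the actual Beam API): each source id maps to a monotonically advancing watermark, and the global watermark is the minimum across all registered sources, which is also why broadcasting the whole map to every reader is costly.

```java
// Toy stand-in for a GlobalWatermarkHolder-style registry; all names are
// hypothetical and this is not the Beam Spark runner implementation.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class WatermarkRegistrySketch {

    // Arbitrary string ids map to the latest watermark reported by that source.
    private final Map<String, Long> watermarks = new ConcurrentHashMap<>();

    void update(String sourceId, long watermark) {
        // Watermarks only advance; a stale report never moves one backwards.
        watermarks.merge(sourceId, watermark, Math::max);
    }

    // The global watermark is held back by the slowest registered source.
    long globalWatermark() {
        return watermarks.values().stream()
            .mapToLong(Long::longValue)
            .min()
            .orElse(Long.MIN_VALUE);
    }

    public static void main(String[] args) {
        WatermarkRegistrySketch registry = new WatermarkRegistrySketch();
        registry.update("unbounded-read", 100L);
        registry.update("sdf-read", 40L);
        System.out.println(registry.globalWatermark()); // 40: held by the slower source
        registry.update("sdf-read", 120L);
        System.out.println(registry.globalWatermark()); // 100: now held by the other source
    }
}
```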
Excellent idea to break the PR into smaller ones. First one merged, waiting for the next!
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.
This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
- Choose reviewer(s) and mention them in a comment (R: @username).
- Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
- Update CHANGES.md with noteworthy changes.

See the Contributor Guide for more tips on how to make the review process smoother.
Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.