Re-add iceberg bounded source; test splitting#30805
kennknowles merged 3 commits into apache:master
R: @chamikaramj
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control.
The failure is in the basic reading of data: none is read. I'll try to grok things and see about that. Still interested in commentary on the approaches here.
Noting that this also includes Write changes, but I'm ignoring those for this review.
Thanks for noticing. I will revert any added write stuff in this PR.
Seems like we are missing the Read transform?
Will do soon. I realize all the tests use a vanilla Read.from(IcebergBoundedSource).
Ideally we should use SDF, but I think it would suffice to add a TODO to convert this to an SDF in the future.
String.valueOf(desiredBundleSizeBytes)
We should fail here to prevent data loss.
Update/remove comment?
Actually I added this comment to explain why the type was fake. I reworded it.
Why are we skipping data tasks? Was this supposed to be if (!fileTask.isDataTask())?
Seems like DataTasks contain actual data: https://iceberg.apache.org/javadoc/0.11.0/org/apache/iceberg/DataTask.html
I think this was obsolete code. The issue is that ScanTask is a oneof, and this checks for a variant that the task is not.
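A minimal sketch of the kind of dispatch being discussed, assuming a task can be one of several variants and that each variant must either be read or rejected loudly. The types below (Task, FileTask, InMemoryTask) are hypothetical stand-ins invented for illustration; they are not the Iceberg ScanTask, FileScanTask, or DataTask interfaces, and this is not the code in the PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT the Iceberg interfaces.
class ScanTaskDispatchSketch {
  interface Task {}

  /** A task whose rows live in a file (reading is faked here). */
  static final class FileTask implements Task {
    final String path;

    FileTask(String path) {
      this.path = path;
    }
  }

  /** A task that already carries its rows in memory. */
  static final class InMemoryTask implements Task {
    final List<String> rows;

    InMemoryTask(List<String> rows) {
      this.rows = rows;
    }
  }

  /** Reads every task; an unrecognized variant fails instead of being skipped. */
  static List<String> readAll(List<Task> tasks) {
    List<String> out = new ArrayList<>();
    for (Task task : tasks) {
      if (task instanceof InMemoryTask) {
        out.addAll(((InMemoryTask) task).rows);
      } else if (task instanceof FileTask) {
        // Placeholder for a real file read: emit one marker row per file.
        out.add("row-from:" + ((FileTask) task).path);
      } else {
        // Skipping silently here would drop data, so fail instead.
        throw new UnsupportedOperationException(
            "Unsupported task kind: " + task.getClass().getName());
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<Task> tasks =
        Arrays.asList(new FileTask("a.parquet"), new InMemoryTask(Arrays.asList("r1", "r2")));
    System.out.println(readAll(tasks));
  }
}
```

The point of the sketch is only that each variant is handled explicitly and the fallthrough branch throws, so a task kind nobody anticipated cannot disappear from the output.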
sdks/java/io/iceberg/src/main/java/org/apache/beam/io/iceberg/CombinedScanReader.java
Haha, it means that this was a fake implementation. I added conversions.
Force-pushed from 21e0a47 to 177aea2.
Codecov Report: All modified and coverable lines are covered by tests ✅
@@            Coverage Diff             @@
##           master   #30805      +/-   ##
==========================================
+ Coverage   70.96%   71.47%   +0.51%
==========================================
  Files        1257      710     -547
  Lines      140931   104815   -36116
  Branches     4307        0    -4307
==========================================
- Hits       100007    74915   -25092
+ Misses      37444    28268    -9176
+ Partials     3480     1632    -1848
Force-pushed from 177aea2 to e26cacd.
Force-pushed from b53e022 to 0b5e391.
Looks like something is broken in the GHA workflow. FYI, just so you don't wait for that to go green: I pushed some possible fixes for that too, but I'm not sure whether in this case the workflow comes from master or the PR.
Force-pushed from 0b5e391 to 37b23f7.
What will happen if the sources produced at line 104 above get re-split by the runner?
If such sources cannot be re-split, we should have a trivial case where we just return the original source to prevent data loss.
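A rough sketch of the trivial re-split case, assuming a source hierarchy in which a table-level source splits into per-task leaves and a leaf returns itself when asked to split again. Source, TableSource, and TaskSource are hypothetical names for illustration; this is not the Beam BoundedSource API and not the implementation in this PR.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical types for illustration; this is not the Beam BoundedSource API.
class ResplitSketch {
  interface Source {
    List<Source> split(long desiredBundleSizeBytes);

    List<String> read();
  }

  /** Covers the whole scan; splits into one leaf per task. */
  static final class TableSource implements Source {
    private final List<List<String>> tasks;

    TableSource(List<List<String>> tasks) {
      this.tasks = tasks;
    }

    @Override
    public List<Source> split(long desiredBundleSizeBytes) {
      // The size hint is ignored in this sketch: one leaf per task.
      List<Source> leaves = new ArrayList<>();
      for (List<String> task : tasks) {
        leaves.add(new TaskSource(task));
      }
      return leaves;
    }

    @Override
    public List<String> read() {
      List<String> rows = new ArrayList<>();
      for (List<String> task : tasks) {
        rows.addAll(task);
      }
      return rows;
    }
  }

  /** A leaf that cannot be divided further. */
  static final class TaskSource implements Source {
    private final List<String> rows;

    TaskSource(List<String> rows) {
      this.rows = rows;
    }

    @Override
    public List<Source> split(long desiredBundleSizeBytes) {
      // Trivial case: return the original source so a re-split loses nothing.
      return Collections.<Source>singletonList(this);
    }

    @Override
    public List<String> read() {
      return new ArrayList<>(rows);
    }
  }

  /** Splits the root, re-splits every result, then reads all pieces. */
  static List<String> readAfterSplittingTwice(Source root, long desiredBundleSizeBytes) {
    List<String> rows = new ArrayList<>();
    for (Source first : root.split(desiredBundleSizeBytes)) {
      for (Source second : first.split(desiredBundleSizeBytes)) {
        rows.addAll(second.read());
      }
    }
    return rows;
  }
}
```

With this shape, splitting twice yields the same rows as reading the root directly, which is the property a runner-initiated re-split needs to preserve.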
Is it possible to also implement getFractionConsumed to support progress reporting in a meaningful way?
I think it will be very useful for autoscaling when using this with Dataflow.
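One possible way to report progress, sketched under the assumption that a reader works through a fixed list of tasks and can estimate progress as completed tasks over total tasks. ProgressSketchReader is a hypothetical class for illustration; it only mirrors the idea of a fraction in [0, 1] and is not Beam's BoundedReader.

```java
import java.util.List;

// Hypothetical reader for illustration; not Beam's BoundedSource.BoundedReader.
final class ProgressSketchReader {
  private final List<List<String>> tasks;
  private int taskIndex = 0; // task currently being read
  private int rowIndex = -1; // row within the current task; -1 before the first advance
  private boolean done = false;

  ProgressSketchReader(List<List<String>> tasks) {
    this.tasks = tasks;
  }

  /** Moves to the next row, skipping empty tasks; returns false once everything is read. */
  boolean advance() {
    if (done) {
      return false;
    }
    rowIndex++;
    while (taskIndex < tasks.size() && rowIndex >= tasks.get(taskIndex).size()) {
      taskIndex++;
      rowIndex = 0;
    }
    if (taskIndex >= tasks.size()) {
      done = true;
      return false;
    }
    return true;
  }

  /** Valid only after advance() has returned true. */
  String getCurrent() {
    return tasks.get(taskIndex).get(rowIndex);
  }

  /** Coarse estimate: fraction of tasks fully consumed so far. */
  double getFractionConsumed() {
    if (done || tasks.isEmpty()) {
      return 1.0;
    }
    return (double) taskIndex / tasks.size();
  }
}
```

Counting tasks rather than bytes keeps the estimate cheap; weighting each task by its size would be a natural refinement if task sizes vary widely.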
This can be in a future PR since we want to get this in by release cut.
Let's also add a test for splitting using SourceTestUtils.assertSourcesEqualReferenceSource.
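The property such a splitting test checks can be sketched without Beam: the records read from all splits, taken together, should equal the records read from the unsplit reference, with nothing lost and nothing duplicated. The helper below is a hypothetical stand-in written for illustration; it is not SourceTestUtils and makes no claim about how that utility is implemented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; not Beam's SourceTestUtils.
class SplitCheckSketch {
  /** Fails unless the splits, combined, hold exactly the reference records (order ignored). */
  static void assertSplitsMatchReference(List<String> reference, List<List<String>> splits) {
    List<String> expected = new ArrayList<>(reference);
    List<String> actual = new ArrayList<>();
    for (List<String> split : splits) {
      actual.addAll(split);
    }
    // Sorting makes the comparison order-insensitive while still catching duplicates.
    Collections.sort(expected);
    Collections.sort(actual);
    if (!expected.equals(actual)) {
      throw new AssertionError("Splits differ from reference: " + actual + " vs " + expected);
    }
  }
}
```

Comparing sorted lists rather than sets matters here: a set comparison would hide a record that was read twice.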
sdks/java/io/iceberg/src/main/java/org/apache/beam/io/iceberg/SchemaAndRowConversions.java
Force-pushed from 19037fb to a5b995a.
Also fail on the default path to prevent data loss (or just return the original source if that can be read directly).
LGTM other than handling the re-splitting case above.
Force-pushed from a5b995a to a169e6a.
kennknowles left a comment
Added a test for double-splitting. To make splitting behavior clearer, I factored it into two different kinds of sources.
Currently there are failures in the exhaustive splitting, but I would also like feedback early.
This is exactly #30797 plus one commit, so if you click on just the last commit you should be able to see just the read/source diff.