[BEAM-5191] Support for BigQuery clustering #7061
wscheep wants to merge 1 commit into apache:master from
Conversation
Run JavaPortabilityApi PreCommit
cc: @reuvenlax
@chamikaramj @reuvenlax any update on this?
chamikaramj left a comment
Thanks. Added some comments.
Please add an integration test for the new feature. I know that we did not add this for time-based partitioning but we recently added some BQ integration tests and you can probably follow that (it's great if you can add a test for time partitioning as well :) )
https://github.com/apache/beam/blob/master/examples/java/src/test/java/org/apache/beam/examples/cookbook/BigQueryTornadoesIT.java
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIOReadIT.java
static class ClusteringToJson implements SerializableFunction<Clustering, String> {
Please document why this is needed.
return withJsonClustering(NestedValueProvider.of(clustering, new ClusteringToJson()));
}
public Write<T> withJsonClustering(ValueProvider<String> clustering) {
Any idea why we would need 'withJsonClustering'? (I understand that you are following time partitioning here.)
You're right, I don't need to expose it. Removed it.
}
if (getJsonClustering() != null) {
  checkArgument(
      getJsonTimePartitioning() != null,
So that the combinations (getJsonClustering and getTimePartitioning) and (getClustering and getJsonTimePartitioning) are not accepted?
It should be covered now, if I'm not mistaken.
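For reference, the combination check under discussion can be sketched as a standalone snippet. This is a hypothetical, simplified version (the class `WriteValidation`, the `validate` method, and the plain-Java `checkArgument` are illustrative, not the actual BigQueryIO code): since the non-JSON setters funnel into the JSON form via `NestedValueProvider`, checking only the JSON-form getters covers both mixed combinations.

```java
// Hypothetical sketch of the validation being discussed; field names mirror
// the PR, but this is not the real BigQueryIO.Write code.
public class WriteValidation {

  // Minimal stand-in for Guava's Preconditions.checkArgument.
  static void checkArgument(boolean condition, String message) {
    if (!condition) {
      throw new IllegalArgumentException(message);
    }
  }

  // Clustering is only valid together with time partitioning. Because
  // withClustering converts its argument to the JSON form, validating the
  // JSON getters is sufficient to reject all invalid combinations.
  static void validate(String jsonTimePartitioning, String jsonClustering) {
    if (jsonClustering != null) {
      checkArgument(
          jsonTimePartitioning != null,
          "Clustering fields can only be set when time partitioning is also set.");
    }
  }

  public static void main(String[] args) {
    // Valid: clustering together with time partitioning.
    validate("{\"type\":\"DAY\"}", "{\"fields\":[\"a\"]}");
    try {
      // Invalid: clustering without time partitioning.
      validate(null, "{\"fields\":[\"a\"]}");
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: clustering without time partitioning");
    }
  }
}
```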
dynamicDestinations =
    new ConstantTimePartitioningDestinations(
        dynamicDestinations, getJsonTimePartitioning());
if (getJsonClustering() != null) {
How about adding a single class instead of forking here?
static class ConstantClusteringDestinations<T> extends ConstantTimePartitioningDestinations<T> {
Possibly cleaner to rename and update the existing class instead of sub-classing here.
Force-pushed from 9186b55 to 2df3611
@chamikaramj Thanks for the review. Sorry it took a while; I lost track of it. I made some changes and added an IT covering time partitioning and clustering tests.
@chamikaramj @wscheep Created
Force-pushed from 08cb56b to 5c3320d
Run Java PreCommit
Run Java PostCommit
Run Dataflow ValidatesRunner
Thanks. LGTM. I'll merge after post-commit tests pass.
Force-pushed from 685df30 to 3f728c9
Run Java PreCommit
Run Java PostCommit
Sorry, can you please resolve the conflict?
.setCoder(
    KvCoder.of(
-       VoidCoder.of(), KvCoder.of(TableDestinationCoderV2.of(), StringUtf8Coder.of())))
+       VoidCoder.of(), KvCoder.of(TableDestinationCoderV3.of(), StringUtf8Coder.of())))
I think this will break some Dataflow users who expect to be able to update their pipelines, and can't if these coders change. @chamikaramj what do you think here?
I'm a bit confused by this comment (and probably missing something). Isn't this similar to the TableDestinationCoder to TableDestinationCoderV2 change we made in the following commit to preserve update compatibility?
I'm also not sure; I based this on the commit referenced above. Why doesn't this preserve update compatibility?
@reuvenlax in retrospect, looking at both TableDestinationCoder and TableDestinationCoderV2, I see that they are completely incompatible and can understand requiring the V2 coder. Am I correct that, instead, adding another Nullable field (clustering) to TableDestinationCoderV2 would not break upgradeability? I think coder compatibility on upgrades is not a well-understood topic.
Coder compatibility is unfortunately not well defined. The best assumption today is that any change to a coder is incompatible (Beam needs to define a better story around updating coders).
I don't think a Nullable field helps here. The nullable coder will still expect something there, and there will be nothing there.
We need to either:
1. Decide that this is important enough that we are willing to break update.
2. Make the use of the new coder somehow contingent on the user using clustering.
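To illustrate why an appended Nullable field still breaks update compatibility, here is a hedged sketch using a toy coder (the `CoderCompat` class and its encode/decode methods are illustrative, not Beam's actual TableDestinationCoder classes): bytes produced by the old format simply end before the new presence byte, so the new decoder runs off the end of the stream.

```java
import java.io.*;

// Hypothetical minimal coder illustrating the update-compatibility problem.
// "V1" writes (tableSpec, description); "V2" appends a nullable clustering
// field, encoded as a presence byte plus the value when present.
public class CoderCompat {

  static byte[] encodeV1(String tableSpec, String description) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bos);
    out.writeUTF(tableSpec);
    out.writeUTF(description);
    return bos.toByteArray();
  }

  // The V2 decoder expects a presence byte for the nullable clustering field
  // after the two original fields.
  static String[] decodeV2(byte[] bytes) throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
    String tableSpec = in.readUTF();
    String description = in.readUTF();
    String clustering = null;
    if (in.readBoolean()) { // throws EOFException on V1 bytes: nothing is there
      clustering = in.readUTF();
    }
    return new String[] {tableSpec, description, clustering};
  }

  public static void main(String[] args) throws IOException {
    byte[] v1Bytes = encodeV1("project:dataset.table", "desc");
    try {
      decodeV2(v1Bytes);
      System.out.println("decoded");
    } catch (EOFException e) {
      // An in-flight value written by the old coder cannot be read by the
      // new one, which is why even a @Nullable addition breaks update.
      System.out.println("EOFException: old bytes are unreadable by the new coder");
    }
  }
}
```

This is the situation Reuven describes: the nullable coder still expects a marker in the stream, and values encoded before the upgrade do not contain one.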
How can we make that decision?
Also, what did we do previously? Did we cluster changes that will unavoidably break update compatibility?
@wscheep is approach (2) described by Reuven viable here? I.e., can we only use the new coder when the user uses BQ clustering?
For this particular instance, I think we can safely continue to use TableDestinationCoderV2 since none of the subsequent processing steps in this method reference the clustering configuration. The TableDestinations we code here are used to configure a copy job, which does not need to know about clustering.
/**
 * A {@link Coder} for {@link TableDestination} that includes time partitioning information. This is
 * a new coder (instead of extending the old {@link TableDestinationCoder}) for compatibility
 * reasons. The old coder is kept around for the same compatibility reasons.
I'm not sure I understand the logic. We are replacing the old coder with the new coder, which has the same compatibility issues. Keeping the old coder around in the codebase doesn't really change much.
I think the biggest confusion started when the V2 coder was created. @wscheep thought this was the pattern for updating coders. If you can confirm that adding the field (as Nullable) would not break upgrades, the V3 coder can be removed and the field added to V2.
Force-pushed from 3f728c9 to 21e5c0b
Run Python PreCommit
Sorry for the delay here, I was out of the office for a few days. I'm working on a way to do this compatibly. I'll update you once I have something.
How is this PR going? Soon to be completed?
Getting this PR to completion would be very useful, as table clustering is no longer in beta and likely to become more popular.
tempTables
    .apply("ReifyRenameInput", new ReifyAsIterable<>())
-   .setCoder(IterableCoder.of(KvCoder.of(TableDestinationCoderV2.of(), StringUtf8Coder.of())))
+   .setCoder(IterableCoder.of(KvCoder.of(TableDestinationCoderV3.of(), StringUtf8Coder.of())))
Same here. The coded value is only passed to WriteRename, which creates a copy job and never references the clustering configuration.
Since this PR seems to have stalled out, I've taken the commits here and posted a new PR #8945 that is rebased on master and addresses the coder evolution issue by providing an interface for users to opt in to the new coder.
The contents of this PR were incorporated into #8945, which is now merged, so this issue can be closed. This feature should be available in the 2.15 release.
Closing since this was included in #8945.
Implemented BigQuery clustering: https://cloud.google.com/bigquery/docs/clustered-tables.
As this is related to BigQuery TimePartitioning, I based my implementation on this commit:
b0e03a3
As far as I know, there are no integration tests covering time partitioning, so I did not add tests for clustering. If needed, I can write some if someone points me in the right direction.
This is my first feature PR, so I'm eager to get some proper feedback.
@jkff, @reuvenlax as you committed & authored time partitioning, can you have a look?
Thanks,
Wout