
[BEAM-5191] Support for BigQuery clustering #7061

Closed

wscheep wants to merge 1 commit into apache:master from wscheep:bq_clustering

Conversation

@wscheep wscheep commented Nov 16, 2018

Implemented BigQuery clustering: https://cloud.google.com/bigquery/docs/clustered-tables.
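For context, a clustered BigQuery table stores its data sorted by an ordered list of up to four columns, given in the table definition as a small JSON object. A minimal sketch of that shape (plain Java, no Beam dependency; the column names are illustrative):

```java
import java.util.List;
import java.util.stream.Collectors;

public class ClusteringJsonSketch {

    // Builds the JSON fragment BigQuery expects for a clustering spec,
    // e.g. {"fields":["country","user_id"]}. Real code would use a JSON
    // library; plain string assembly keeps this sketch dependency-free.
    static String clusteringToJson(List<String> fields) {
        return "{\"fields\":["
                + fields.stream()
                        .map(f -> "\"" + f + "\"")
                        .collect(Collectors.joining(","))
                + "]}";
    }

    public static void main(String[] args) {
        // prints {"fields":["country","user_id"]}
        System.out.println(clusteringToJson(List.of("country", "user_id")));
    }
}
```

This is roughly the payload that any JSON serialization of the clustering configuration has to produce so the value can travel through a ValueProvider.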

As this is related to BigQuery TimePartitioning, I based my implementation on this commit:
b0e03a3

As far as I know, there are no integration tests covering time partitioning, so I did not add tests for clustering. If needed I can write some if someone points me in the right direction.

This is my first feature PR, so I'm eager to get some proper feedback.
@jkff, @reuvenlax as you committed & authored time partitioning, can you have a look?

Thanks,
Wout


Follow this checklist to help us incorporate your contribution quickly and easily:

  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

It will help us expedite review of your Pull Request if you tag someone (e.g. @username) to look at it.

Post-Commit Tests Status (on master branch)

[Build-status badge matrix (Go, Java, Python SDKs across the Apex, Dataflow, Flink, Gearpump, Samza, and Spark runners) omitted]

@mxm
Contributor

mxm commented Nov 20, 2018

Run JavaPortabilityApi PreCommit

@chamikaramj chamikaramj self-requested a review November 28, 2018 18:27
@chamikaramj
Contributor

cc: @reuvenlax

@robertwb
Contributor

@chamikaramj @reuvenlax any update on this?

Contributor

@chamikaramj chamikaramj left a comment

Thanks. Added some comments.

Please add an integration test for the new feature. I know that we did not add this for time-based partitioning but we recently added some BQ integration tests and you can probably follow that (it's great if you can add a test for time partitioning as well :) )
https://github.com/apache/beam/blob/master/examples/java/src/test/java/org/apache/beam/examples/cookbook/BigQueryTornadoesIT.java
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIOReadIT.java

}
}

static class ClusteringToJson implements SerializableFunction<Clustering, String> {
Contributor

Please document why this is needed.

Author

It wasn't, removed it.

return withJsonClustering(NestedValueProvider.of(clustering, new ClusteringToJson()));
}

public Write<T> withJsonClustering(ValueProvider<String> clustering) {
Contributor

Any idea why we would need 'withJsonClustering' ? (I understand that you are following time partitioning here).

Author

You're right, I don't need to expose it. Removed it.

}
if (getJsonClustering() != null) {
  checkArgument(
      getJsonTimePartitioning() != null,
Contributor

So that combinations (getJsonClustering and getTimePartitioning) and (getClustering and getJsonTimePartitioning) are not accepted ?

Author

It should be covered now, if I'm not mistaken.

dynamicDestinations =
    new ConstantTimePartitioningDestinations(
        dynamicDestinations, getJsonTimePartitioning());
if (getJsonClustering() != null) {
Contributor

How about adding a single class instead of forking here ?

Author

done.

}
}

static class ConstantClusteringDestinations<T> extends ConstantTimePartitioningDestinations<T> {
Contributor

Possibly cleaner rename and update existing class instead of sub-classing here.

Author

done

@wscheep
Author

wscheep commented Jan 28, 2019

@chamikaramj Thanks for the review. Sorry it took a while; I lost track of it. I made some changes and added an integration test covering time partitioning and clustering.

@lgajowy
Contributor

lgajowy commented Jan 28, 2019

@chamikaramj @wscheep Created BigQueryTimePartitioningIT dataset for the test in the apache-beam-testing project.

@wscheep wscheep force-pushed the bq_clustering branch 2 times, most recently from 08cb56b to 5c3320d on January 28, 2019
@wscheep
Author

wscheep commented Jan 28, 2019

Run Java PreCommit

@chamikaramj
Contributor

Run Java PostCommit

@chamikaramj
Contributor

Run Dataflow ValidatesRunner

@chamikaramj
Contributor

Thanks. LGTM.

I'll merge after post-commit tests pass.

@wscheep wscheep force-pushed the bq_clustering branch 2 times, most recently from 685df30 to 3f728c9 on February 3, 2019
@wscheep
Author

wscheep commented Feb 3, 2019

Run Java PreCommit

@wscheep
Author

wscheep commented Feb 3, 2019

Run Java PostCommit

@chamikaramj
Contributor

Sorry, can you please resolve the merge conflict?

.setCoder(
    KvCoder.of(
-       VoidCoder.of(), KvCoder.of(TableDestinationCoderV2.of(), StringUtf8Coder.of())))
+       VoidCoder.of(), KvCoder.of(TableDestinationCoderV3.of(), StringUtf8Coder.of())))
Contributor

I think this will break some Dataflow users who expect to be able to update their pipelines, and can't if these coders change. @chamikaramj what do you think here?

Contributor

I'm a bit confused by this comment (and probably missing something). Isn't this similar to the TableDestinationCoder to TableDestinationCoderV2 change we did in the following commit to preserve update compatibility?

b0e03a3#diff-b2706c94bc268b2bc2b78820ab23b0fe

Author

I'm also not sure; I based this on the commit referenced above. Why doesn't this preserve update compatibility?

Contributor

@reuvenlax in retrospect, looking at both TableDestinationCoder and TableDestinationCoderV2, I see that they are completely incompatible and can understand requiring the V2 coder. Am I correct that instead adding another Nullable field (clustering) to TableDestinationCoderV2 will not break upgradeability? I think coder compatibility on upgrades is not a well understood topic.

Contributor

Coder compatibility is unfortunately not well defined. The best assumption today is that any change to a coder is incompatible (Beam needs to define a better story around updating coders).

I don't think a Nullable field helps here. The nullable coder will still expect something there, and there will be nothing there.

We need to either:

  1. Decide that this is important enough that we are willing to break update.
  2. Make the use of the new coder somehow contingent on the user using clustering.
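
The incompatibility Reuven describes can be shown with a toy pair of coders (plain Java streams, not the real Beam TableDestination coders): bytes written by the old format carry no presence byte for the new nullable field, so the new decoder runs off the end of the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;

public class CoderCompat {

    // "Old" coder: encodes only the table spec.
    static byte[] encodeOld(String table) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeUTF(table);
        return bos.toByteArray();
    }

    // "New" coder: the same string plus a presence byte and an optional
    // clustering value. Decoding old bytes hits EOF at the presence byte.
    static String[] decodeNew(byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        String table = in.readUTF();
        boolean hasClustering = in.readBoolean(); // old bytes end before this
        String clustering = hasClustering ? in.readUTF() : null;
        return new String[] {table, clustering};
    }

    public static void main(String[] args) throws IOException {
        byte[] oldBytes = encodeOld("project:dataset.table");
        try {
            decodeNew(oldBytes);
        } catch (EOFException e) {
            // In-flight state encoded with the old coder cannot be read by
            // the new one, which is exactly what breaks pipeline update.
            System.out.println("old bytes are not decodable by the new coder");
        }
    }
}
```

A nullable field does not help for the same reason: the presence marker itself is a byte the old format never wrote.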

Contributor

How can we make that decision?

Also, what did we do previously? Did we make changes that unavoidably break update compatibility?

Contributor

@wscheep is approach (2) described by Reuven viable here? I.e., can we use the new coder only when the user uses BQ clustering?

Contributor

For this particular instance, I think we can safely continue to use TableDestinationCoderV2 since none of the subsequent processing steps in this method reference the clustering configuration. The TableDestinations we code here are used to configure a copy job, which does not need to know about clustering.

/**
* A {@link Coder} for {@link TableDestination} that includes time partitioning information. This is
* a new coder (instead of extending the old {@link TableDestinationCoder}) for compatibility
* reasons. The old coder is kept around for the same compatibility reasons.
Contributor

I'm not sure I understand the logic. We are replacing the old coder with the new coder, which has the same compatibility issues. Keeping the old coder around in the codebase doesn't really change much.

Contributor

I think the biggest confusion started when the V2 coder was created. @wscheep thought this was the pattern for updating coders. If you can confirm that adding the field (as Nullable) would not break upgrades, the V3 coder can be removed and the field added to the V2.

@wscheep
Author

wscheep commented Feb 6, 2019

Run Python PreCommit

@reuvenlax
Contributor

Sorry for the delay here, I was out of the office for a few days.

I'm working on a way to do this compatibly. I'll update you once I have something.

@juancho088

How is this PR going? Will it be completed soon?

@jklukas
Contributor

jklukas commented May 20, 2019

Getting this PR to completion would be very useful, as table clustering is no longer in beta and likely to become more popular.

tempTables
    .apply("ReifyRenameInput", new ReifyAsIterable<>())
-   .setCoder(IterableCoder.of(KvCoder.of(TableDestinationCoderV2.of(), StringUtf8Coder.of())))
+   .setCoder(IterableCoder.of(KvCoder.of(TableDestinationCoderV3.of(), StringUtf8Coder.of())))
Contributor

Same here. The coded value is only passed to WriteRename which creates a copy job and never references the clustering configuration.

@jklukas
Contributor

jklukas commented Jun 26, 2019

Since this PR seems to have stalled out, I've taken the commits here and posted a new PR #8945 that is rebased on master and addresses the coder evolution issue by providing an interface for users to opt in to the new coder.
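
The opt-in route taken in #8945 can be sketched as follows (hypothetical names, plain Java, not the actual Beam API): the new coder is selected only when the user has configured clustering, so existing pipelines keep the old byte format and stay update-compatible.

```java
public class CoderSelection {

    enum CoderVersion { V2, V3 }

    // Pick the destination coder based on whether clustering is configured.
    // Pipelines that never opt in keep the V2 byte format, so an in-place
    // update sees no coder change; only clustering users get the new coder.
    static CoderVersion destinationCoderFor(boolean clusteringEnabled) {
        return clusteringEnabled ? CoderVersion.V3 : CoderVersion.V2;
    }

    public static void main(String[] args) {
        System.out.println(destinationCoderFor(false)); // V2
        System.out.println(destinationCoderFor(true));  // V3
    }
}
```

This trades a small API surface (the opt-in) for never changing the wire format underneath an already-running pipeline.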

@jklukas
Contributor

jklukas commented Jul 22, 2019

The contents of this PR were incorporated into #8945 which is now merged, so this issue can be closed. This feature should be available in the 2.15 release.

@chamikaramj
Contributor

Closing since this was included in #8945

9 participants