[BEAM-5191] Support for BigQuery clustering (#8945)
Force-pushed eb18664 to 5acc4bf.
R: @juancho088 @reuvenlax @alexvanboxel @chamikaramj, who were reviewers on #7061. Also cc @wscheep, who authored #7061, which is still the bulk of the code here.
Run JavaPortabilityApi PreCommit
```java
.withTestServices(fakeBqServices)
.withMethod(BigQueryIO.Write.Method.FILE_LOADS)
.withSchema(schema)
.enableClustering()
```
Notably, this test fails if enableClustering() is not called because the default TableDestinationCoderV2 is used and clustering information is dropped before the table is created. This is exactly the behavior we want for backwards compatibility.
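For intuition about why the default coder silently loses clustering, here is a minimal, hypothetical sketch (plain stand-ins, not Beam's actual TableDestination or coder classes): a V2-style round trip simply never serializes the clustering field, so it decodes as null.

```java
// Hypothetical stand-ins for TableDestination and its coders; the real Beam
// coders serialize to byte streams, but the dropped-field behavior is the same.
public class CoderCompatSketch {
  static class Dest {
    final String table;
    final String clustering; // null when no clustering is set

    Dest(String table, String clustering) {
      this.table = table;
      this.clustering = clustering;
    }
  }

  // V2-style round trip: the clustering field is never encoded, so it
  // always comes back null after decoding.
  static Dest roundTripV2(Dest d) {
    return new Dest(d.table, null);
  }

  // V3-style round trip: clustering survives encode/decode.
  static Dest roundTripV3(Dest d) {
    return new Dest(d.table, d.clustering);
  }

  public static void main(String[] args) {
    Dest in = new Dest("project:dataset.table", "station_number");
    System.out.println(roundTripV2(in).clustering); // null: silently dropped
    System.out.println(roundTripV3(in).clustering); // station_number
  }
}
```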
```java
Coder<DestinationT> getDestinationCoderWithDefault(CoderRegistry registry)
    throws CannotProvideCoderException {
  return inner.getDestinationCoderWithDefault(registry);
Coder<DestinationT> destinationCoder = getDestinationCoder();
```
This seems to be a behavior change for the non-clustering case? Previously we returned inner.getDestinationCoderWithDefault(registry); now we return TableDestinationCoderV2.of().
DynamicDestinations#getDestinationCoderWithDefault is commented as:
// Gets the destination coder. If the user does not provide one, try to find one in the coder
// registry. If no coder can be found, throws CannotProvideCoderException.
This code is written with potentially multiple layers of delegation, and I think the correct behavior here is to return the first non-delegated implementation of getDestinationCoder() that appears as we move down the delegation chain.
I would argue that the existing behavior is incorrect. Currently, if an implementing class defines a custom return value for getDestinationCoder, that value is ignored when you call getDestinationCoderWithDefault. My expectation is that getDestinationCoderWithDefault would always return the same value as getDestinationCoder except in the null case, in which getDestinationCoderWithDefault would then attempt to look up a coder in the registry.
So the change here is intended to fix broken behavior.
It's possible that a user has written a custom class that extends DelegatingDynamicDestinations and relies on the incorrect behavior, but it feels unlikely to me.
For the scope of the coders provided here, I don't believe this change affects behavior (the method was already returning TableDestinationCoderV2 in all cases).
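The intended semantics can be sketched as follows. This is a hypothetical simplification, not Beam's actual code: Coder and the registry are replaced by plain Strings and a Supplier so the sketch stays self-contained and runnable.

```java
// Hypothetical sketch of the delegation semantics argued for above; Coder and
// the coder registry are replaced by Strings and a Supplier to keep it runnable.
import java.util.function.Supplier;

public class DelegationSketch {
  abstract static class Destinations {
    // May return null when the implementation provides no custom coder.
    abstract String getDestinationCoder();

    // The fixed behavior: honor getDestinationCoder() first, and fall back
    // to the registry only when every layer returned null.
    String getDestinationCoderWithDefault(Supplier<String> registryLookup) {
      String coder = getDestinationCoder();
      return coder != null ? coder : registryLookup.get();
    }
  }

  static class Delegating extends Destinations {
    final Destinations inner;

    Delegating(Destinations inner) {
      this.inner = inner;
    }

    // Delegation happens in getDestinationCoder, not in ...WithDefault, so a
    // custom coder anywhere down the chain is honored.
    @Override
    String getDestinationCoder() {
      return inner.getDestinationCoder();
    }
  }

  static Destinations fixed(final String coder) {
    return new Destinations() {
      @Override
      String getDestinationCoder() {
        return coder;
      }
    };
  }

  public static void main(String[] args) {
    // A custom coder in the inner layer wins over the registry default.
    System.out.println(new Delegating(fixed("TableDestinationCoderV3"))
        .getDestinationCoderWithDefault(() -> "fromRegistry"));
    // No custom coder anywhere: fall back to the registry.
    System.out.println(new Delegating(fixed(null))
        .getDestinationCoderWithDefault(() -> "fromRegistry"));
  }
}
```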
It may actually be better to remove this override altogether. getDestinationCoderWithDefault is defined as package-private and probably shouldn't be overridden. Instead, we let getDestinationCoder handle delegation, and the base implementation of getDestinationCoderWithDefault will make the appropriate getDestinationCoder call before falling back to the registry.
I've removed the override getDestinationCoderWithDefault completely.
Looks like that caused some regressions where the destination coder couldn't be found from the type. I'm looking into it.
The override turns out to be necessary because the DynamicDestinations instance itself is eventually passed into extractFromTypeParameters, and the inner class potentially has richer type information than the delegating class.
I've put the override method back in place, including the modification of first checking whether getDestinationCoder() returns a non-null value.
```java
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

p.run().waitUntilFinish();
```
Thanks for adding these ITs :)
I believe these ITs are the work of @wscheep in the previous PR, and I am going to borrow liberally from these examples to add the dynamic destinations ITs.
```java
p.apply(BigQueryIO.readTableRows().from(options.getInput()))
    .apply(ParDo.of(new KeepStationNumberAndConvertDate()))
    .apply(
        BigQueryIO.writeTableRows()
```
Can we also add one with dynamic destinations and clustering?
Added. It doesn't look like the ITs are run by Jenkins by default, and I can't find a command to kick off the BigQueryIO integration tests. Do you have any advice about how we can get results for these?
Post-commit test suite should capture these. Just triggered it.
```java
 * option, since {@link TableDestinationCoderV3} will not be able to read state written with a
 * previous version.
 */
public Write<T> enableClustering() {
```
Having these two methods seems to make the API pretty brittle. How about having a single withClustering() method that optionally takes a Clustering object? In the dynamic destinations case, the Clustering object can be skipped/null and the method will behave like enableClustering().
Implemented. enableClustering() is now withClustering() and it sets clustering to a default Clustering instance with no fields set, which we use as a flag for enabling clustering on dynamic destinations.
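The shape of that API can be sketched like this. This is a hypothetical simplification: the real BigQueryIO.Write is an immutable AutoValue builder, but the flag semantics are the same.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the single withClustering() API described above;
// a mutable builder stands in for Beam's AutoValue-based BigQueryIO.Write.
public class WithClusteringSketch {
  static class Clustering {
    final List<String> fields;

    Clustering(List<String> fields) {
      this.fields = fields;
    }
  }

  static class Write {
    Clustering clustering; // null means clustering is disabled

    // The no-arg form plays the role of the old enableClustering(): an empty
    // Clustering instance acts as the "enabled for dynamic destinations" flag.
    Write withClustering() {
      return withClustering(new Clustering(Collections.emptyList()));
    }

    Write withClustering(Clustering c) {
      this.clustering = c;
      return this;
    }

    boolean clusteringEnabled() {
      return clustering != null;
    }
  }

  public static void main(String[] args) {
    System.out.println(new Write().withClustering().clusteringEnabled()); // true
    System.out.println(new Write().clusteringEnabled()); // false
  }
}
```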
Force-pushed bb77061 to 328d14b.

Run Java PostCommit
chamikaramj left a comment:
Thanks. Looks great.
Just a couple of comments.
(Resolved comment on ...d-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/DynamicDestinationsHelpers.java)
```java
}

@Test
public void testClusteringStreamingInserts() throws Exception {
```
Probably also add a test that confirms that we fail if withClustering() is set without withTimePartitioning().
Added:

```java
@Test(expected = IllegalArgumentException.class)
public void testClusteringThrowsWithoutPartitioning() throws Exception {
```
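The precondition that test exercises can be sketched as follows. This is a hypothetical stand-in: in Beam the check lives in BigQueryIO.Write's validation, not in a standalone class.

```java
// Hypothetical sketch of the precondition the test above exercises: BigQuery
// clustering requires a partitioned table, so reject clustering without
// time partitioning up front.
public class ClusteringValidationSketch {
  static void validate(Object timePartitioning, Object clustering) {
    if (clustering != null && timePartitioning == null) {
      throw new IllegalArgumentException(
          "withClustering() is set but withTimePartitioning() is not");
    }
  }

  public static void main(String[] args) {
    validate(new Object(), new Object()); // ok: both set
    validate(null, null); // ok: neither set
    try {
      validate(null, new Object()); // clustering without partitioning
    } catch (IllegalArgumentException expected) {
      System.out.println("rejected: " + expected.getMessage());
    }
  }
}
```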
FYI: I'm OOO till 7/9, so please add another reviewer if you need to get this merged early. LGTM from my side, other than the two comments I mentioned and the integration tests passing.
Run Java PostCommit
Addressed those comments, integration tests passed, and I've now squashed and rebased on master.

I'm also OOO next week, so if any further changes need to be made, I'll be back on 7/15. We have several weeks before the branch cut date for 2.15, so I don't see any point in rushing. Thanks for the thoughtful review, @chamikaramj.
Run Java PostCommit

Run Java PostCommit
```java
 * A {@link Coder} for {@link TableDestination} that includes time partitioning and clustering
 * information. Users must opt in to this version of the coder by setting one of the clustering
 * options on {@link BigQueryIO.Write}, otherwise {@link TableDestinationCoderV2} will be used and
 * clustering information will be discarded.
```
If a user forgets to set a clustering option but creates a TableDestination with clustering information, will we warn them or just silently discard it?
There is no warning at this point. For the case of loading to an existing table, the operation will throw exceptions anyway if the clustering is mismatched. Perhaps there are compile-time cases we could warn about. I will have to think more about whether we want to block on that when I'm back in the office next week.
A "good enough" solution might be to always print a warning when clustering is not enabled. We will want to wordsmith the warning so it's not too frightening, and so it's clear that it only applies when trying to specify clustering.
Thanks. I agree that a compile-time (or even run-time) failure is better, but a warning should be OK if that is not possible for some reason. I'll wait till this is addressed before merging.
Run Java PostCommit

Run Java PreCommit

Run JavaPortabilityApi PreCommit
```java
+ " to use TableDestinationCoderV2. Set withClustering() on BigQueryIO.write() and, "
+ " if you provided a custom DynamicDestinations instance, override"
+ " getDestinationCoder() to return TableDestinationCoderV3.",
dynamicDestinations);
```
We cannot know at compile-time whether a custom table function or DynamicDestinations instance will produce any destination with clustering enabled, so we have to check the produced destinations at runtime.
This check will cause a runtime failure for StreamingInserts if a clustered destination is produced without the relevant configuration to use the newer destination coder.
```java
+ " to use TableDestinationCoderV2. Set withClustering() on BigQueryIO.write() and, "
+ " if you provided a custom DynamicDestinations instance, override"
+ " getDestinationCoder() to return TableDestinationCoderV3.",
dynamicDestinations);
```
This check will cause a runtime failure for BatchLoads if a clustered destination is produced without the relevant configuration to use the newer destination coder.
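The shape of this guard can be sketched as follows. This is a hypothetical simplification of the runtime check, using booleans in place of inspecting the actual destination and coder objects.

```java
// Hypothetical sketch of the runtime guard described above: when a produced
// destination carries clustering but the configured coder would drop it,
// fail fast instead of silently losing the clustering fields.
public class CoderGuardSketch {
  static void checkDestination(boolean destinationHasClustering, boolean usingCoderV3) {
    if (destinationHasClustering && !usingCoderV3) {
      throw new IllegalArgumentException(
          "Destination specifies clustering but the write is configured to use "
              + "TableDestinationCoderV2, which would drop it. Set withClustering() on "
              + "BigQueryIO.write() and, for a custom DynamicDestinations instance, "
              + "override getDestinationCoder() to return TableDestinationCoderV3.");
    }
  }

  public static void main(String[] args) {
    checkDestination(false, false); // fine: no clustering involved
    checkDestination(true, true); // fine: the V3 coder preserves clustering
    try {
      checkDestination(true, false); // clustering would be silently dropped
    } catch (IllegalArgumentException expected) {
      System.out.println("failed fast as intended");
    }
  }
}
```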
R: @chamikaramj I've added runtime checks for both the streaming-insert and batch-load cases: the transform will fail with an exception that includes instructions on how to configure clustering if a clustered destination is produced without the right coder in place. Does this look ready to merge?
LGTM. Thanks, Jeff.
Run Java PostCommit

Run Dataflow ValidatesRunner

Run Java PostCommit

Run Java PostCommit

Run Dataflow ValidatesRunner
@chamikaramj All the tests here are finally passing again after transient issues. |
|
Thanks. Merging. |
This takes the commits from #7061, rebases on master, and adds an enableClustering method to allow users to opt in to the updated coder when using dynamic destinations. This follows the pattern of #6914, where we added a new version of MetadataCoder, documenting how a user could opt in and why they might want to. This should solve the coder compatibility concerns that were the blockers to merging #7061.