[BEAM-3566] Replace apply_* hooks in DirectRunner with PTransformOverrides#4529
[BEAM-3566] Replace apply_* hooks in DirectRunner with PTransformOverrides#4529robertwb merged 2 commits intoapache:masterfrom
Conversation
39be828 to
27878b6
Compare
|
R: @robertwb This is ready for review. |
robertwb
left a comment
There was a problem hiding this comment.
Nice, thanks! Just a couple of comments.
|
|
||
| import hamcrest as hc | ||
|
|
||
| from apache_beam import Map |
There was a problem hiding this comment.
Use import apache_beam as beam, and then beam.Map for consistency with the rest of the codebase.
| overrides = _get_transform_overrides(p.options) | ||
| p.replace_all(overrides) | ||
|
|
||
| # Note that the direct output of ReadStringsFromPubSub will be replaced |
There was a problem hiding this comment.
These tests seem rather brittle. Is there a better way to test this transform application than grabbing the internal source and verifying a couple of properties on it. https://beam.apache.org/documentation/pipelines/test-your-pipeline/
There was a problem hiding this comment.
I've addressed this here: #4529 (comment)
| ProcessKeyedElementsViaKeyedWorkItemsOverride()] | ||
|
|
||
| class CombinePerKeyOverride(PTransformOverride): | ||
| def get_matcher(self): |
There was a problem hiding this comment.
@chamikaramj I'm curious why there isn't just a is_match() method on PTransformOverride. (Or, even simpler, perhpas get_replacement_transform would return None for non-matches.)
There was a problem hiding this comment.
I think this promotes developing simple (and possibly lightweight) matchers that might be shared between different overrides. We apply the matchers to every transform in the tree. get_replacement_transform() might be more heavy weight. This is inspired by the Java design.
There was a problem hiding this comment.
Java (at least Java 7) didn't let you pass methods around as first class objects, thus forcing the design. Even with methods, sharding can happen (e.g. is_match = some_common_utility). We should make the common case the simple one (but probably not this PR). Could you file a bug?
There was a problem hiding this comment.
Bug to replace get_matcher() by is_match() ? I think it's still good to keep is_match() and get_replacement_transform() separate to promote lightweight matching.
There was a problem hiding this comment.
Yes. (The advantage of not having separate methods is that you can get rid of the whole need for a class, and just let them be callables (again, not possible in Java), but I'm fine with is_match (or matches).
There was a problem hiding this comment.
|
|
||
| overrides = [SplittableParDoOverride(), | ||
| ProcessKeyedElementsViaKeyedWorkItemsOverride(), | ||
| CombinePerKeyOverride(),] |
There was a problem hiding this comment.
Nit: only leave trailing commas if the closing is on its own line.
| from apache_beam.io.gcp import pubsub | ||
| except ImportError: | ||
| pass | ||
| if pubsub: |
There was a problem hiding this comment.
Factor this out into a method, and invoke it from the try block.
| def curry_combine_fn(fn, args, kwargs): | ||
| if not args and not kwargs: | ||
| return fn | ||
| class _CurriedFn(core.CombineFn): |
There was a problem hiding this comment.
This may be a performance regression.
There was a problem hiding this comment.
Given that we're calling fn(..., *args, **kwargs) anyways, this shouldn't matter. Please ignore.
| return fn | ||
|
|
||
| return CurriedFn() | ||
| return _CurriedFn(fn, args, kwargs) |
There was a problem hiding this comment.
Put this in an else for symmetry.
27878b6 to
e1a96eb
Compare
charlesccychen
left a comment
There was a problem hiding this comment.
Thanks! PTAL.
| overrides = _get_transform_overrides(p.options) | ||
| p.replace_all(overrides) | ||
|
|
||
| # Note that the direct output of ReadStringsFromPubSub will be replaced |
There was a problem hiding this comment.
robertwb wrote:
These tests seem rather brittle. Is there a better way to test this transform application than grabbing the internal source and verifying a couple of properties on it. https://beam.apache.org/documentation/pipelines/test-your-pipeline/
Thanks, and I agree, but the issue is that there really isn't anything to test.
The test here isn't a test of the transform; rather, it was (and still is) testing the behavior of replacing the transform with the correct DirectRunner replacement. The transform itself is just a wrapper with runner-specific overrides.
I changed the test name to reflect the intended target of the test.
There was a problem hiding this comment.
Sorry, I meant https://beam.apache.org/contribute/ptransform-style-guide/#testing-transform-construction-and-validation
If this is about the direct runner, we should put it into the direct runner tests. Best is if we could create a mock/in memory PubSub and make sure this works end-to-end (on any runner).
There was a problem hiding this comment.
But, as mentioned, fixing these existing tests should not block this PR. Please file a JIRA.
There was a problem hiding this comment.
Thanks, filed https://issues.apache.org/jira/browse/BEAM-3619.
| ProcessKeyedElementsViaKeyedWorkItemsOverride()] | ||
|
|
||
| class CombinePerKeyOverride(PTransformOverride): | ||
| def get_matcher(self): |
There was a problem hiding this comment.
robertwb wrote:
Yes. (The advantage of not having separate methods is that you can get rid of the whole need for a class, and just let them be callables (again, not possible in Java), but I'm fine with is_match (or matches).
Done.
|
|
||
| import hamcrest as hc | ||
|
|
||
| from apache_beam import Map |
There was a problem hiding this comment.
robertwb wrote:
Useimport apache_beam as beam, and thenbeam.Mapfor consistency with the rest of the codebase.
Done.
|
|
||
| overrides = [SplittableParDoOverride(), | ||
| ProcessKeyedElementsViaKeyedWorkItemsOverride(), | ||
| CombinePerKeyOverride(),] |
There was a problem hiding this comment.
robertwb wrote:
Nit: only leave trailing commas if the closing is on its own line.
Done.
| def curry_combine_fn(fn, args, kwargs): | ||
| if not args and not kwargs: | ||
| return fn | ||
| class _CurriedFn(core.CombineFn): |
There was a problem hiding this comment.
robertwb wrote:
Given that we're calling fn(..., *args, **kwargs) anyways, this shouldn't matter. Please ignore.
Acknowledged.
| from apache_beam.io.gcp import pubsub | ||
| except ImportError: | ||
| pass | ||
| if pubsub: |
There was a problem hiding this comment.
robertwb wrote:
Factor this out into a method, and invoke it from the try block.
Done.
| return fn | ||
|
|
||
| return CurriedFn() | ||
| return _CurriedFn(fn, args, kwargs) |
There was a problem hiding this comment.
robertwb wrote:
Put this in an else for symmetry.
Done.
| overrides = _get_transform_overrides(p.options) | ||
| p.replace_all(overrides) | ||
|
|
||
| # Note that the direct output of ReadStringsFromPubSub will be replaced |
There was a problem hiding this comment.
But, as mentioned, fixing these existing tests should not block this PR. Please file a JIRA.
In the Python DirectRunner, we currently use apply_* overrides to override the operation of the default .expand() operation for certain transforms. For example, GroupByKey has a special implementation in the DirectRunner, so we use an apply_* override hook to replace the implementation of GroupByKey.expand().
However, this strategy has drawbacks. Because this override operation happens eagerly during graph construction, the pipeline graph is specialized and modified before a specific runner is bound to the pipeline's execution. This makes the pipeline graph non-portable and blocks full migration to using the Runner API pipeline representation in the DirectRunner.
By contrast, the SDK's PTransformOverride mechanism allows the expression of matchers that operate on the unspecialized graph, replacing PTransforms as necessary to produce a DirectRunner-specialized pipeline graph for execution.
We therefore want to replace these eager apply_* overrides with PTransformOverrides that operate on the completely constructed graph.
https://issues.apache.org/jira/browse/BEAM-3566