[FLINK-35237] Allow Sink to Choose HashFunction in PrePartitionOperator #3414
lvyanquan
left a comment
Thanks for this great contribution, LGTM. I left some minor comments about the Javadoc.
lvyanquan
left a comment
LGTM.
And CC @PatrickRen @leonardBang
What's more, considering that the number of buckets and parallelism may not be consistent, should we remove the constraint on EventPartitioner?
Although the number of buckets and parallelism will differ, we can only distribute based on parallelism rather than buckets, right? We have already distributed the hash values to the various parallelisms here, so I think there's no need to change anything here.
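The point above, that events are routed by subtask rather than by bucket, can be illustrated with a minimal sketch. It assumes the operator reduces the sink-provided hash modulo the downstream parallelism; `PartitionSketch` and `targetSubtask` are hypothetical names for illustration, not code from this PR.

```java
// Hypothetical sketch: maps a sink-provided hash onto a downstream subtask.
// It assumes routing is "hash modulo parallelism"; it is not the PR's code.
final class PartitionSketch {

    /** Returns a subtask index in [0, parallelism), also for negative hashes. */
    static int targetSubtask(int hash, int parallelism) {
        if (parallelism <= 0) {
            throw new IllegalArgumentException("parallelism must be positive");
        }
        // floorMod never returns a negative value for a positive divisor.
        return Math.floorMod(hash, parallelism);
    }

    public static void main(String[] args) {
        int buckets = 8;      // bucket count chosen by the sink (assumed)
        int parallelism = 3;  // downstream parallelism (assumed)
        for (int bucket = 0; bucket < buckets; bucket++) {
            System.out.println(
                    "bucket " + bucket + " -> subtask " + targetSubtask(bucket, parallelism));
        }
    }
}
```

With this shape, the bucket count only influences the hash value; the number of output channels stays tied to the parallelism, so several buckets may share one subtask.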
Got it. There is indeed no need for adjustment.
Can this PR be merged before the other PR? Both PRs are marked for inclusion in version 3.2, but the other PR depends on this one. I will need some time to make the necessary adjustments.
@leonardBang @PatrickRen can you help review and merge this?
@yuxiqian CC.
yuxiqian
left a comment
Thanks for @dingxin-tech's contribution. I wonder if the DefaultHashFunctionProvider implementation could be improved when migrating from the existing PrePartitionOperator#HashFunction.
// --------------------------------------------------------------------------------------------
default void open() {}

default void close() {}
There was a problem hiding this comment.
Why does a simple Provider interface need these lifecycle methods? Do we use them in any implementation classes?
There was a problem hiding this comment.
Since we provided the getHashFunction method based on TableId, connector implementers might use TableId to obtain the actual schema of the database and perform some caching operations. We can establish connections and initialize caches in the open method. These two lifecycle methods were added following @lvyanquan's suggestion, and he might respond with further additions.
There was a problem hiding this comment.
Usually, it will need to obtain partition or bucket information based on TableId.
I wonder if it is possible to cache the catalog or connection here so that objects or connections can be reused.
There was a problem hiding this comment.
A function with lifecycle management makes sense to me, but why does the factory need to care about resource management? Could we push the logic that uses TableId to obtain partition or bucket info into the HashFunction itself? Having the HashFunction own open/close may be better. What do you think?
There was a problem hiding this comment.
Actually, HashFunctionProvider is similar to the role of a catalog, and HashFunction is similar to the role of a table; the open/close methods are called on the catalog role.
There was a problem hiding this comment.
Well, a Catalog is used as a metadata manager instance rather than a table factory; tables are not constructed by the instance. A metadata manager having its own lifecycle makes sense to me, but a factory owning a lifecycle confuses me a little. Coming back to the function itself: Flink also has many functions as well as function factories, and I don't see why a function factory needs to care about the resources required at runtime, whereas a function managing its own resources is pretty common.
There was a problem hiding this comment.
You can also look at the Flink code for:
- RichFunction & SourceFunctionProvider
- InputFormat & InputFormatProvider
There was a problem hiding this comment.
So the key point is whether what we need is a metadata manager or a function factory. I agree that what we need is a function factory, though it may need the assistance of a specific metadata manager.
And worrying about resource reuse is indeed overthinking, because we can release the database connection after extracting the parameters for calculation.
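A minimal sketch of that idea, under stated assumptions: the factory opens a short-lived connection, reads the parameters it needs (here a bucket count), releases the connection, and returns a function that holds only plain values. `FakeMetadataClient`, `BucketHashProviderSketch`, and the fixed bucket count are hypothetical stand-ins, not classes from Flink CDC or this PR.

```java
import java.util.Objects;
import java.util.function.ToIntFunction;

/** Hypothetical stand-in for a catalog or database client; not a Flink CDC class. */
final class FakeMetadataClient implements AutoCloseable {
    static int openClients = 0; // tracks clients that have not been closed yet

    FakeMetadataClient() {
        openClients++;
    }

    /** Pretends to look up the bucket count of a table; fixed value for illustration. */
    int bucketCount(String tableId) {
        return 4;
    }

    @Override
    public void close() {
        openClients--;
    }
}

/** Hypothetical factory without open()/close(): resources live only inside one call. */
final class BucketHashProviderSketch {

    /** Reads the bucket count and releases the client before returning. */
    private static int readBucketCount(String tableId) {
        try (FakeMetadataClient client = new FakeMetadataClient()) {
            return client.bucketCount(tableId);
        }
    }

    /** Returns a hash function that captures only the extracted parameter. */
    static ToIntFunction<String> getHashFunction(String tableId) {
        final int buckets = readBucketCount(tableId);
        return primaryKey -> Math.floorMod(Objects.hashCode(primaryKey), buckets);
    }
}
```

Because the returned function captures only an `int`, nothing remains to be closed later, which is why neither the factory nor the function needs lifecycle methods in this sketch.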
There was a problem hiding this comment.
Yeah, that's what I want to propose
There was a problem hiding this comment.
I believe that HashFunction should not have lifecycle functions. It is merely a hash function, not a Flink operator, and it will be cached and disappear automatically. Its lifecycle can be perceived as a kind of "constant" after creation, so we should not concern ourselves with its lifecycle.
Based on this viewpoint, whether HashFunctionProvider should have lifecycle functions depends solely on whether we need to reuse some resources created at runtime when creating a HashFunction.
In fact, I also think that currently, the connectors do not have any tasks that need to be done within lifecycle functions. Therefore, I have removed the lifecycle functions for now.
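The "created once, then cached" view of a hash function can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `SimpleHashFunctionProvider` and `HashFunctionCacheSketch` are hypothetical names, and table ids and keys are plain strings for brevity.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

/** Hypothetical, simplified provider with no open()/close() methods. */
interface SimpleHashFunctionProvider {
    ToIntFunction<String> getHashFunction(String tableId);
}

/** Hypothetical caller that asks the provider once per table and reuses the result. */
final class HashFunctionCacheSketch {
    private final SimpleHashFunctionProvider provider;
    private final Map<String, ToIntFunction<String>> cache = new HashMap<>();
    int providerCalls = 0; // counts how often the provider was actually invoked

    HashFunctionCacheSketch(SimpleHashFunctionProvider provider) {
        this.provider = provider;
    }

    /** Hashes a key with the function cached for the given table. */
    int hash(String tableId, String key) {
        ToIntFunction<String> fn =
                cache.computeIfAbsent(
                        tableId,
                        id -> {
                            providerCalls++;
                            return provider.getHashFunction(id);
                        });
        return fn.applyAsInt(key);
    }
}
```

In this shape the function is effectively immutable after creation and is dropped together with the cache, so only the provider call itself would ever need access to external resources.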
Hi @leonardBang, can you help to review again and merge this?
9084d60 to 5222dfa
5222dfa to 7ac39fd
I appended one commit to polish the interface and package. Could you also take a look?
Sure, and it looks good to me.
https://issues.apache.org/jira/browse/FLINK-35237