change mongo scans to filter client-side#347

Merged
agavra merged 1 commit into main from new_mongo_scans
Sep 11, 2024


@agavra agavra commented Sep 11, 2024

Turns out this was a bit more of a refactoring nightmare than I wanted it to be! The strategy is to pass in, alongside the TablePartitioner, a method that determines whether or not a key belongs to a given Kafka partition. We can consider refactoring that to return a specific partition instead of whether or not a key "belongs" in a given one, but this makes it simpler to bypass the method in cases where it doesn't need to be implemented.

90% of this PR is just adding generics in the places where they were necessary to get the TablePartitioner piped into the tables instead of being created in the flush manager.
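
For readers skimming the thread, here is a minimal sketch of what client-side scan filtering with a `belongs`-style hook could look like. Everything in it is an assumption made for illustration: `KeyOwnership`, `ClientSideScanSketch`, and `hashMod` are hypothetical names, keys are raw `byte[]` rather than Kafka `Bytes`, and the hash-mod rule stands in for whatever partitioning the changelog really uses. It is not the code from this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the ownership hook described above. The method in
// this PR takes a Kafka `Bytes` key; a raw byte[] keeps the sketch standalone.
interface KeyOwnership {
  boolean belongs(byte[] key, int kafkaPartition);
}

final class ClientSideScanSketch {

  // Assumed hash-mod partitioning, for illustration only. Kafka's default
  // partitioner uses a different hash, which is not reproduced here.
  static KeyOwnership hashMod(final int numPartitions) {
    return (key, kafkaPartition) ->
        Math.floorMod(Arrays.hashCode(key), numPartitions) == kafkaPartition;
  }

  // Read every key the store returns and keep only those owned by the
  // requested Kafka partition. The filtering happens on the client.
  static List<byte[]> scan(
      final List<byte[]> allKeys, final KeyOwnership ownership, final int kafkaPartition) {
    final List<byte[]> owned = new ArrayList<>();
    for (final byte[] key : allKeys) {
      if (ownership.belongs(key, kafkaPartition)) {
        owned.add(key);
      }
    }
    return owned;
  }
}
```

Under this rule every key is owned by exactly one partition, so scanning each partition in turn visits each key once.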

@agavra agavra requested a review from ableegoldman September 11, 2024 01:47

@rodesai rodesai left a comment


Mostly LGTM, one structural comment inline.

public Integer metadataTablePartition(final int kafkaPartition) {
  return kafkaPartition;
}
boolean belongs(final Bytes key, final int kafkaPartition);
Contributor


Seems cleaner to just have a method that returns the partition for a key, or the number of partitions. Then we don't have a bunch of impls that throw UnsupportedOperationException, which is better than having those impls understand the context of why the method is being called and throw accordingly.

Contributor Author


Discussed offline: the exceptions are less about the API and more a shortcut to avoid needing to pass numPartitions into all the other implementations. Eventually it would make sense to implement this method in those as well.
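
One way the partition-returning shape suggested in this thread might look is sketched below. The names `KeyPartitioner`, `partitionFor`, and `HashModPartitioner` are hypothetical, keys are raw `byte[]` instead of Kafka `Bytes`, and the hash-mod rule is an assumption for illustration. The point is only that `belongs` can be derived once from a partition lookup, at the cost of every implementation needing the partition count.

```java
import java.util.Arrays;

// Hypothetical alternative to implementing `belongs` directly: report which
// partition owns a key, and derive `belongs` from that in one place.
interface KeyPartitioner {
  int partitionFor(byte[] key);

  default boolean belongs(final byte[] key, final int kafkaPartition) {
    return partitionFor(key) == kafkaPartition;
  }
}

// Assumed hash-mod implementation. It needs the partition count up front,
// which is the plumbing the reply above says the PR avoided for now.
final class HashModPartitioner implements KeyPartitioner {
  private final int numPartitions;

  HashModPartitioner(final int numPartitions) {
    if (numPartitions <= 0) {
      throw new IllegalArgumentException("numPartitions must be positive");
    }
    this.numPartitions = numPartitions;
  }

  @Override
  public int partitionFor(final byte[] key) {
    // floorMod keeps the result in [0, numPartitions) even for negative hashes.
    return Math.floorMod(Arrays.hashCode(key), numPartitions);
  }
}
```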

@agavra agavra merged commit 8e9b0d8 into main Sep 11, 2024
@agavra agavra deleted the new_mongo_scans branch September 11, 2024 20:22
rodesai added a commit that referenced this pull request Sep 15, 2024
This reverts commit 8e9b0d8, which
removed the partition from the mongo value schema and used client-side
filtering that computed the partition from the mongo key.

It turns out that this approach won't actually work because the mongo
key may not be the original record key that was used to compute the
changelog partition. We therefore cannot use the mongo key to compute
the partition:
- some DSL operators include a timestamp in the key
- a user writing their own PAPI processor is free to construct their
  own key, which we cannot predict.
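
The failure mode described in this revert can be illustrated with a small sketch. It assumes a hypothetical stored-key layout (record key bytes followed by an 8-byte timestamp) and a hash-mod partitioning rule; neither is taken from the codebase, and `StoredKeyMismatchSketch` is an illustrative name. Under those assumptions, hashing the stored key frequently lands on a different partition than hashing the original record key, so a client-side filter would drop rows it should keep.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class StoredKeyMismatchSketch {

  // Hypothetical stored-key layout: original key bytes plus a timestamp suffix.
  static byte[] withTimestamp(final byte[] recordKey, final long timestampMs) {
    return ByteBuffer.allocate(recordKey.length + Long.BYTES)
        .put(recordKey)
        .putLong(timestampMs)
        .array();
  }

  // Assumed hash-mod partitioning, for illustration only.
  static int partitionFor(final byte[] key, final int numPartitions) {
    return Math.floorMod(Arrays.hashCode(key), numPartitions);
  }

  // Counts the timestamps in [0, timestamps) for which the stored key maps to
  // a different partition than the record key that chose the changelog partition.
  static int countMismatches(
      final byte[] recordKey, final int numPartitions, final int timestamps) {
    final int expected = partitionFor(recordKey, numPartitions);
    int mismatches = 0;
    for (long ts = 0; ts < timestamps; ts++) {
      if (partitionFor(withTimestamp(recordKey, ts), numPartitions) != expected) {
        mismatches++;
      }
    }
    return mismatches;
  }
}
```

Any nonzero count means a filter computed from the stored key would reject rows that belong to the partition being scanned.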
rodesai added a commit that referenced this pull request Sep 20, 2024