Support Embeddings in mltransform #29564
Merged
AnandInguva merged 58 commits into apache:master on Dec 11, 2023
Fix tox.ini
Fix pydoc
Fix indent in pydoc
R: @damccorm This is ready for review.
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control.
damccorm reviewed on Dec 1, 2023
Chunking the review a bit. Just reviewed base.py thus far, though I imagine that's the place I'll have the most comments
AnandInguva commented on Dec 4, 2023
sdks/python/apache_beam/ml/transforms/embeddings/tensorflow_hub_test.py (outdated, resolved)
damccorm reviewed on Dec 4, 2023
sdks/python/apache_beam/ml/transforms/embeddings/sentence_transformer.py (outdated, resolved)
sdks/python/apache_beam/ml/transforms/embeddings/tensorflow_hub.py (outdated, resolved)
sdks/python/apache_beam/ml/transforms/embeddings/sentence_transformer.py (resolved)
sdks/python/apache_beam/ml/transforms/embeddings/sentence_transformer_test.py (outdated, resolved)
… on different machines
This reverts commit cfb1883.
damccorm reviewed on Dec 8, 2023
This mostly looks good, had a few more comments though
damccorm reviewed on Dec 8, 2023
    raise FileExistsError(
        "The artifact location %s already exists and contains %s. Please "
        "specify a different location." %
        (artifact_location, _ATTRIBUTE_FILE_NAME))
Good call - one possible future enhancement would be to support an overwrite argument that allows users to do this
sdks/python/apache_beam/ml/transforms/embeddings/vertex_ai_test.py (outdated, resolved)
damccorm reviewed on Dec 11, 2023
damccorm reviewed on Dec 11, 2023
damccorm approved these changes on Dec 11, 2023
One last nit, otherwise this LGTM. Feel free to make the change and merge once checks pass
Co-authored-by: Danny McCormick <dannymccormick@google.com>
Each config defines its own model handler so that the model handler is compatible with the inputs passed to MLTransform.
This PR supports vertex_ai and Hugging Face (sentence-transformers).
Changes made:
- ProcessHandler is a PTransform.
- _TextEmbeddingHandler, which takes a ModelHandler but is responsible for working on Dict[str, Any] inputs.
- MLTransform is the container that holds the list of transforms. This list is passed to _MLTransformToPTransformMapper, which maps each data processing transform to its PTransform.
- jsonpickle is used to store and load the PTransform instances. This closes the gap between training and inference. jsonpickle is compatible across Python versions and backward compatible with its own older versions.
This PR will throw if the input is a Dict[str, List[str]]. It works for Dict[str, str].
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
- Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
- Update CHANGES.md with noteworthy changes.
See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.
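As a rough illustration of the Dict[str, Any] handling mentioned in the PR description, the sketch below adapts a text-only embedding function to dictionary rows and rejects non-string values such as List[str]. It is plain Python, not the Beam implementation: `embed_columns`, `toy_embed`, and the choice of `TypeError` are assumptions made for the example.

```python
from typing import Any, Callable, Dict, List, Sequence


def embed_columns(
    batch: Sequence[Dict[str, Any]],
    columns: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
) -> List[Dict[str, Any]]:
    """Replace the text in `columns` with embeddings; other keys pass through."""
    results = [dict(row) for row in batch]
    for column in columns:
        texts = []
        for row in batch:
            value = row[column]
            if not isinstance(value, str):
                # Assumed behavior: only Dict[str, str] style values are accepted.
                raise TypeError(
                    "Column %r must hold a str, got %s." %
                    (column, type(value).__name__))
            texts.append(value)
        # Embed the whole column in one call, then write vectors back per row.
        for row, vector in zip(results, embed_fn(texts)):
            row[column] = vector
    return results


def toy_embed(texts: List[str]) -> List[List[float]]:
    # Placeholder "model": a one-dimensional vector holding the text length.
    return [[float(len(t))] for t in texts]
```

For example, `embed_columns([{"text": "hello", "id": 1}], ["text"], toy_embed)` embeds only the `text` key and leaves `id` unchanged, while a row such as `{"text": ["a", "b"]}` raises.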