feature: MLFlow E2E Example Notebook (5513) #5701

Closed
aviruthen wants to merge 2 commits into aws:master from
aviruthen:feature/mlflow-e2e-example-notebook-5513-2

Conversation

@aviruthen
Collaborator

Description

The issue requests adding a new MLflow E2E example notebook. The notebook already exists at v3-examples/ml-ops-examples/v3-mlflow-train-inference-e2e-example.ipynb but has several issues that need to be fixed:

  1. Session usage bug (Step 1): Session.boto_region_name is accessed as a class attribute, but it's an instance property. Should be Session().boto_region_name.
  2. Inference pattern in Step 7: The notebook calls boto3.client('sagemaker-runtime') directly for inference testing instead of using the core_endpoint.invoke() pattern established in other V3 example notebooks (see train-inference-e2e-example.ipynb Step 7).
  3. Inconsistent MLflow version pinning: The %pip install cell, the training code's requirements.txt, and the ModelBuilder dependencies all hard-pin versions (mlflow==3.4.0, plus sagemaker==3.3.1 and numpy==2.4.1 in ModelBuilder). These pins are consistent today but will go stale quickly; the notebook should use minimum version constraints or reference a shared variable.
  4. Missing sagemaker_session parameter in ModelTrainer: Unlike the existing train-inference-e2e-example.ipynb which explicitly passes sagemaker_session, the MLflow notebook omits it.
  5. Custom translators may be unnecessary for MLflow deployment: When deploying from MLflow model registry with MLFLOW_MODEL_PATH, the MLflow model already includes signature information. The custom PyTorchInputTranslator/PyTorchOutputTranslator may conflict with MLflow's built-in serialization. The SchemaBuilder should use plain JSON sample_input/output without custom translators for MLflow models.
  6. Step 5 API usage: registered_model.latest_versions is deprecated in MLflow 3.x. Should use client.search_model_versions() instead.
  7. Missing initial import block: Unlike the pattern in train-inference-e2e-example.ipynb, the MLflow notebook scatters imports across cells instead of collecting key imports upfront.
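The Session bug in item 1 comes down to how Python properties behave: accessed on the class, a property yields the property descriptor itself rather than a value. A minimal, self-contained sketch (a toy class standing in for sagemaker.Session, not the real one):

```python
# Toy class illustrating the bug: boto_region_name is an instance
# property, so it must be read off an instance, not the class.
class Session:
    @property
    def boto_region_name(self):
        return "us-west-2"  # placeholder value for the sketch

# Class access returns the property object, not a region string:
assert isinstance(Session.boto_region_name, property)

# Instance access evaluates the property as intended:
assert Session().boto_region_name == "us-west-2"
```

This is why the fix in the notebook is the one-character-class change from Session.boto_region_name to Session().boto_region_name.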

Related Issue

Related issue: 5513

Changes Made

  • v3-examples/ml-ops-examples/v3-mlflow-train-inference-e2e-example.ipynb

AI-Generated PR

This PR was automatically generated by the PySDK Issue Agent.

  • Confidence score: 85%
  • Classification: type: feature request
  • SDK version target: V3

Merge Checklist

  • Changes are backward compatible
  • Commit message follows prefix: description format
  • Unit tests added/updated
  • Integration tests added (if applicable)
  • Documentation updated (if applicable)

Collaborator

@sagemaker-bot sagemaker-bot left a comment


🤖 AI Code Review

This PR fixes several issues in the MLflow E2E example notebook: correcting Session instantiation, using search_model_versions instead of deprecated API, removing unnecessary custom translators, using core_endpoint.invoke() instead of direct boto3 calls, and adding sagemaker_session to ModelTrainer. The changes are well-motivated and align with V3 SDK conventions. A few minor issues remain.

"\n",
"# Training on SageMaker managed infrastructure\n",
"model_trainer = ModelTrainer(\n",
" sagemaker_session=sagemaker_session,\n",
Collaborator


Good addition of sagemaker_session=sagemaker_session. However, per SDK conventions, note that sagemaker_session is typically placed as the last parameter in constructor calls (matching the convention that optional session parameters come at the end). Consider moving it after the other parameters for consistency with other V3 example notebooks, though this is a minor style point for a notebook.

"# Use search_model_versions (compatible with MLflow 3.x)\n",
"model_versions = client.search_model_versions(\n",
" filter_string=f\"name='{MLFLOW_REGISTERED_MODEL_NAME}'\",\n",
" order_by=['version_number DESC'],\n",
Collaborator


The order_by=['version_number DESC'] parameter — please verify this is the correct field name for MLflow 3.x's search_model_versions. In some MLflow versions, the field is version_number, but in others it may be creation_timestamp or just version. If this is incorrect, the notebook will fail at runtime. The MLflow 3.x docs indicate order_by supports "version_number DESC" but it's worth double-checking against the version pinned in the install cell.
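One way to sidestep the field-name uncertainty entirely (a sketch, not part of the PR) is to sort the returned versions client-side, since ModelVersion objects expose a version string regardless of which order_by fields the server supports:

```python
# Sketch: pick the latest version client-side instead of relying on a
# server-side order_by field name. `versions` stands in for the list
# returned by client.search_model_versions(); real ModelVersion objects
# expose .version, mocked here as dicts for a self-contained example.
versions = [{"version": "2"}, {"version": "10"}, {"version": "1"}]

# Compare numerically, so "10" correctly outranks "2".
latest = max(versions, key=lambda v: int(v["version"]))
assert latest["version"] == "10"
```

The trade-off is an unordered fetch of all matching versions, which is fine for an example notebook with a handful of versions.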

@aviruthen
Collaborator Author

🤖 Iteration #1 — Review Comments Addressed

Description

Address reviewer feedback on the MLflow E2E example notebook:

  1. Remove unused boto3 import: The previous iteration added boto3 to the top-level imports cell but it's no longer used since inference was switched to core_endpoint.invoke().

  2. Use minimum version constraints: Changed all hard-pinned versions (mlflow==3.4.0, etc.) to minimum version constraints (mlflow>=3.4.0) in the pip install cell, ModelBuilder dependencies, and training requirements.txt to avoid the notebook breaking when users install it later.

  3. Move sagemaker_session to end of ModelTrainer constructor: Per SDK conventions, optional session parameters are placed last in constructor calls, matching the pattern in other V3 example notebooks.

  4. Use creation_timestamp for order_by in search_model_versions: Changed from version_number DESC to creation_timestamp DESC which is more broadly supported across MLflow versions, with an explanatory comment.

  5. Handle invoke() response body robustly: The response body from core_endpoint.invoke() may be a streaming object or bytes depending on the implementation. Added defensive handling that checks for both .read() method and bytes type to ensure the notebook works correctly.
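The defensive handling in item 5 can be sketched as a small helper (the name read_invoke_body is illustrative; the actual return type depends on the sagemaker-core implementation):

```python
import io
import json

def read_invoke_body(body):
    # Hypothetical helper: normalize an invoke() response body that may
    # arrive as a streaming object, raw bytes, or an already-decoded str.
    if hasattr(body, "read"):                 # streaming object
        body = body.read()
    if isinstance(body, (bytes, bytearray)):  # raw bytes
        body = body.decode("utf-8")
    return json.loads(body)                   # str at this point

# All three shapes decode to the same prediction dict:
assert read_invoke_body(b'{"score": 0.9}') == {"score": 0.9}
assert read_invoke_body(io.BytesIO(b'{"score": 0.9}')) == {"score": 0.9}
assert read_invoke_body('{"score": 0.9}') == {"score": 0.9}
```

Handling str as well covers the edge case a later review comment raises about the body possibly arriving pre-decoded.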

Related Issue

Related issue: 5513

Changes Made

  • v3-examples/ml-ops-examples/v3-mlflow-train-inference-e2e-example.ipynb

Merge Checklist

  • Changes are backward compatible
  • Commit message follows prefix: description format
  • Unit tests added/updated
  • Integration tests added (if applicable)
  • Documentation updated (if applicable)

Comments reviewed: 5
Files modified: v3-examples/ml-ops-examples/v3-mlflow-train-inference-e2e-example.ipynb

  • v3-examples/ml-ops-examples/v3-mlflow-train-inference-e2e-example.ipynb: Address all reviewer comments: remove unused boto3 import, use minimum version constraints for mlflow, move sagemaker_session to end of ModelTrainer params, use creation_timestamp for order_by in sear

Collaborator

@sagemaker-bot sagemaker-bot left a comment


🤖 AI Code Review

This PR fixes several real issues in the MLflow E2E example notebook: Session instantiation bug, deprecated MLflow API usage, unnecessary custom translators, direct boto3 usage replaced with core_endpoint.invoke(), and version pinning improvements. The changes are well-motivated and align with V3 SDK conventions. A few minor concerns around dependency pinning and robustness.

" \"MLFLOW_TRACKING_ARN\": MLFLOW_TRACKING_ARN\n",
" },\n",
" dependencies={\"auto\": False, \"custom\": [\"mlflow==3.4.0\", \"sagemaker==3.3.1\", \"numpy==2.4.1\", \"cloudpickle==3.1.2\"]},\n",
" dependencies={\"auto\": False, \"custom\": [\"mlflow>=3.4.0\", \"sagemaker>=3.3.1\", \"numpy>=2.4.1\", \"cloudpickle>=3.1.2\"]},\n",
Collaborator


Using >= without an upper bound for deployment dependencies is risky — these packages are installed at inference time inside the container, and a future major version bump of mlflow, sagemaker, numpy, or cloudpickle could break the deployed endpoint silently. Consider using compatible-release constraints (e.g., mlflow>=3.4.0,<4, numpy>=2.4.1,<3, cloudpickle>=3.1.2,<4) to balance freshness with stability. This is especially important for sagemaker which had a major v2→v3 transition.

"# Note: order_by field name may vary across MLflow versions;\n",
"# 'creation_timestamp' is broadly supported.\n",
"model_versions = client.search_model_versions(\n",
" filter_string=f\"name='{MLFLOW_REGISTERED_MODEL_NAME}'\",\n",
Collaborator


Minor: trailing closing paren on max_results=1 line is missing a trailing comma, and the list ['creation_timestamp DESC'] uses single quotes inside double-quoted JSON string — this is fine for Python but worth noting for consistency. More importantly, consider adding a guard for the case where model_versions is empty (no versions registered yet), e.g.:

if not model_versions:
    raise RuntimeError(f"No model versions found for '{MLFLOW_REGISTERED_MODEL_NAME}'")

This would give users a clear error message instead of an IndexError.

@@ -35,7 +35,8 @@
"outputs": [],
"source": [
Collaborator


The %pip install cell only installs mlflow>=3.4.0, but the training code's requirements.txt (referenced in Step 4) presumably still pins mlflow==3.4.0. The PR description mentions this inconsistency but the diff only shows changes to the notebook cells and the dependencies dict in ModelBuilder. If requirements.txt is a separate file in the repo, it should also be updated to use >= constraints for consistency. Could you confirm whether requirements.txt is part of this PR or needs a separate change?

")\n",
"\n",
"prediction = json.loads(response['Body'].read().decode('utf-8'))\n",
"# The invoke() response body may be a streaming object or bytes;\n",
Collaborator


The response body handling logic is reasonable for robustness, but it would be good to add a brief comment or reference to the core_endpoint.invoke() return type from sagemaker-core so future maintainers know what to expect. Also, consider whether result.body could be str already — the current code handles bytes and stream but not str, though in practice it's likely always one of the first two.

@aviruthen aviruthen closed this Mar 31, 2026