
[ML] Add bypass for graph validation #3013

Open

edsavage wants to merge 4 commits into elastic:main from edsavage:feature/model-validation-kill-switch

Conversation


@edsavage edsavage commented Mar 26, 2026

Summary

  • Adds a --skipModelValidation command-line flag to pytorch_inference to bypass TorchScript model graph validation
  • When the flag is passed, the allowlist check is skipped and a warning is logged
  • This can be wired to an Elasticsearch cluster setting (e.g. xpack.ml.model_graph_validation.enabled) so that operators can disable validation without infrastructure access, covering all deployment types including serverless
  • Default behaviour (validation enabled) is unchanged
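Illustratively, the flag check amounts to scanning the process arguments for the bypass option. This is a hypothetical sketch (the function name is invented, and the real binary parses options through its own command-line machinery rather than a raw argv scan); it shows only the intended default-off semantics of --skipModelValidation:

```cpp
#include <cstring>

// Hypothetical sketch: scan the argument vector for the bypass flag.
// The actual pytorch_inference uses its own option parser; this only
// illustrates the intended semantics.
bool skipModelValidationRequested(int argc, const char* const* argv) {
    for (int i = 1; i < argc; ++i) {
        if (std::strcmp(argv[i], "--skipModelValidation") == 0) {
            return true; // bypass requested: caller should log a warning
        }
    }
    return false; // default: graph validation stays enabled
}
```

Because the flag defaults to absent, existing launch configurations are unaffected unless an operator opts in.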

Test plan

  • Built and ran CModelGraphValidatorTest suite locally — all tests pass
  • Integration test: --skipModelValidation bypasses validation for a malicious model (PASS)
  • Integration test: without the flag, validation runs normally (PASS)
  • Integration test: benign model passes validation as before (PASS)
  • CI passes
  • ES-side (follow-up, separate PR): add a cluster setting that passes --skipModelValidation to the native process

Provides an emergency escape hatch to bypass TorchScript model graph
validation without requiring a code change or rebuild. When
ML_SKIP_MODEL_VALIDATION is set (to any value), the pytorch_inference
process skips the graph validator and logs a warning.

Elasticsearch can set this environment variable for the native
process via its ML settings, allowing operators to unblock model
deployments immediately if the validator incorrectly rejects a
legitimate model.

Made-with: Cursor
prodsecmachine commented Mar 26, 2026

Snyk checks have passed. No issues have been found so far.

Scan Engine             Critical  High  Medium  Low  Total
Open Source Security    0         0     0       0    0 issues
Licenses                0         0     0       0    0 issues


Extends the evil model integration test to verify that:
- ML_SKIP_MODEL_VALIDATION=true bypasses graph validation (with
  warning logged)
- ML_SKIP_MODEL_VALIDATION=false still validates (only exact "true"
  activates the bypass)

Made-with: Cursor
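A minimal sketch of the exact-match check this test exercises; illustrative only (the helper name envBypassEnabled is invented, and this env-var approach was later replaced by the --skipModelValidation CLI flag):

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical sketch of the (since replaced) environment-variable check:
// only the exact value "true" activates the bypass, so e.g.
// ML_SKIP_MODEL_VALIDATION=false still validates.
bool envBypassEnabled() {
    const char* value = std::getenv("ML_SKIP_MODEL_VALIDATION");
    return value != nullptr && std::strcmp(value, "true") == 0;
}
```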
@edsavage edsavage requested review from Copilot and valeriy42 and removed request for Copilot March 26, 2026 03:47
@edsavage edsavage changed the title [ML] Add ML_SKIP_MODEL_VALIDATION kill switch for graph validation [ML] Add ML_SKIP_MODEL_VALIDATION bypass for graph validation Mar 26, 2026
@edsavage edsavage requested a review from Copilot March 26, 2026 21:17

Copilot AI left a comment


Pull request overview

Adds an environment-variable “kill switch” to bypass TorchScript model graph validation in pytorch_inference, plus a Python integration script intended to exercise validator behavior (including the bypass).

Changes:

  • Add ML_SKIP_MODEL_VALIDATION=true env-var check to skip verifySafeModel() and emit a warning.
  • Add a standalone Python script that generates known-malicious TorchScript models and runs pytorch_inference to confirm rejection/bypass behavior.

Reviewed changes

Copilot reviewed 1 out of 2 changed files in this pull request and generated 3 comments.

File: bin/pytorch_inference/Main.cc
    Adds the ML_SKIP_MODEL_VALIDATION env-var bypass around verifySafeModel() with warning logging.
File: test/test_pytorch_inference_evil_models.py
    Adds a standalone integration script to generate “evil” models and validate expected pytorch_inference behavior (including bypass).


        generate_model(spec["class"], model_path)
        print(f" Model generated: {model_path.name} ({model_path.stat().st_size} bytes)")
    except Exception as e:
        print(f" SKIP: could not generate model: {e}")

Copilot AI Mar 26, 2026


If TorchScript scripting fails for a model (e.g., due to Torch version differences), this test currently prints SKIP and continues, which can result in an overall PASS without having exercised the validator at all. For a security regression test, it would be safer to treat model-generation failures as a test failure (or at least fail when the expected-rejected models can’t be generated).

Suggested change:
-        print(f" SKIP: could not generate model: {e}")
+        print(f" FAIL: could not generate model: {e}")
+        all_passed = False

Comment on lines +216 to +219
    raise FileNotFoundError(
        "Could not find pytorch_inference binary. "
        "Build from the feature/harden_pytorch_inference branch, or pass --binary."
    )

Copilot AI Mar 26, 2026


This script’s requirements/error message still references building from the "feature/harden_pytorch_inference" branch. That’s likely to become stale/confusing once this change is on main; consider updating the wording to refer to a built pytorch_inference binary (or a minimum version) rather than a specific branch name.

Comment on lines +24 to +25
Requires: torch, a built pytorch_inference binary with graph validation
(feature/harden_pytorch_inference branch or later).

Copilot AI Mar 26, 2026


The docstring says this requires a binary built from the "feature/harden_pytorch_inference" branch. Since this file is being added to the mainline repo, consider updating this to a stable requirement (e.g., “a pytorch_inference binary built from this repo at/after ”) to avoid confusion for future readers.

Suggested change:
-Requires: torch, a built pytorch_inference binary with graph validation
-(feature/harden_pytorch_inference branch or later).
+Requires: torch, and a built pytorch_inference binary from this repository
+with graph validation enabled (i.e., including the
+CModelGraphValidator checks).


@valeriy42 valeriy42 left a comment


I see the reason for wanting an escape hatch, but setting an environment variable is not a practical solution. You need a cluster setting and a --skipValidation flag on the pytorch_inference process.

Adds a command-line flag to bypass TorchScript model graph validation.
When --skipModelValidation is passed to pytorch_inference, the
allowlist check is skipped and a warning is logged.

This can be wired to an Elasticsearch cluster setting
(e.g. xpack.ml.model_graph_validation.enabled) so that operators
can disable validation without infrastructure access, covering all
deployment types including serverless.

Made-with: Cursor
@edsavage edsavage force-pushed the feature/model-validation-kill-switch branch from 242ddfd to e49d6c2 Compare March 29, 2026 21:32
@edsavage
Contributor Author

Updated per Valeriy's review — replaced the ML_SKIP_MODEL_VALIDATION environment variable with a --skipModelValidation CLI flag on the pytorch_inference process.

This is the better approach because:

  • Elasticsearch already passes CLI args to native processes (--numThreadsPerAllocation, --validElasticLicenseKeyConfirmed, etc.)
  • It can be wired to a dynamic cluster setting (changeable at runtime without restart)
  • Works for all deployment types (self-managed, Cloud, serverless) once the ES-side setting is added
  • No infrastructure access needed — operators can toggle it via the ES API

The ES-side change (adding a cluster setting like xpack.ml.model_graph_validation.enabled that passes the flag) would be a separate PR in the elasticsearch repo.

edsavage added a commit to elastic/elasticsearch that referenced this pull request Mar 29, 2026
Adds a dynamic node-scope setting to control TorchScript model graph
validation. When set to false, the pytorch_inference process is
launched with --skipModelValidation, bypassing the operation
allowlist/forbidden list check.

This provides an operator-accessible escape hatch for all deployment
types (self-managed, Cloud, serverless) via the cluster settings API,
without requiring infrastructure access or a rebuild.

The setting is dynamic — changes take effect on the next model
deployment without restarting the node.

Companion to elastic/ml-cpp#3013 which adds the --skipModelValidation
CLI flag to the pytorch_inference binary.

Made-with: Cursor
@edsavage edsavage changed the title [ML] Add ML_SKIP_MODEL_VALIDATION bypass for graph validation [ML] Add bypass for graph validation Mar 29, 2026