design: Machine Validation Reliability and Control Redesign#2070

Open
sunilkumar-nvidia wants to merge 1 commit into NVIDIA:main from sunilkumar-nvidia:mv_2.0

@sunilkumar-nvidia
Contributor

Description

Adds a Machine Validation design proposal focused on preventing stuck validation runs and improving operator visibility. The design scopes the first implementation slice to durable run items, attempts, heartbeats, stale-run reconciliation, and compatibility with existing run/result APIs, while deferring live logs, richer controls, active cancellation, and retry workflows to later milestones.

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

#454
#453

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@ajf
Collaborator

ajf commented Jun 2, 2026

Anything in docs/ will wind up on docs.nvidia.com. I've been told that that's not a great place for design docs. However I do really think we need a place to publish them; how about for now we just create a designs/ directory, and put it in there so we can publish them separately at some point.

@sunilkumar-nvidia
Contributor Author

> Anything in docs/ will wind up on docs.nvidia.com. I've been told that that's not a great place for design docs. However I do really think we need a place to publish them; how about for now we just create a designs/ directory, and put it in there so we can publish them separately at some point.

Done. Moved to designs/ dir. Thanks

@ajf ajf requested review from a015758 and nvcoop June 2, 2026 18:15
@ajf
Collaborator

ajf commented Jun 2, 2026

@nvcoop do you want to provide feedback here?

@ajf ajf removed this from the v2.0 milestone Jun 3, 2026

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

2 participants