fix: Issue all queued nmxm operations even if we get an error. #273

Merged: tmcroberts97 merged 1 commit into NVIDIA:main from tmcroberts97:fix/nvl-monitor-early-exit on Feb 16, 2026

Conversation

@tmcroberts97

Description

Do not exit early from execute_nmx_m_operations() if an operation cannot be issued to NMX-M. Instead, add only the successfully enqueued operations to the pending list and ignore any that errored out. Previously, exiting execute_nmx_m_operations() early with an error skipped the db updates even for operations that had been successfully enqueued and completed by NMX-M.
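The behaviour described above can be sketched roughly as follows. This is a minimal illustration in Python, not the code from this PR: `Operation`, `IssueError`, `execute_operations`, and `flaky_issue` are hypothetical names, and the real execute_nmx_m_operations() may be structured quite differently. The sketch only shows the control flow: a failed issue is logged and skipped, and the loop keeps going so that every successfully enqueued operation still lands in the pending list.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Iterable, List

log = logging.getLogger("nvl_monitor_sketch")


@dataclass(frozen=True)
class Operation:
    """Hypothetical stand-in for one queued NMX-M operation."""
    op_id: str


class IssueError(Exception):
    """Hypothetical error raised when an operation cannot be issued."""


def execute_operations(
    queued: Iterable[Operation],
    issue: Callable[[Operation], None],
) -> List[Operation]:
    """Try to issue every queued operation and return the ones that were enqueued.

    A failure on one operation is logged and skipped rather than returned
    to the caller, so later operations are still attempted.
    """
    pending: List[Operation] = []
    for op in queued:
        try:
            issue(op)
        except IssueError as err:
            # The early-exit variant would return or re-raise here, losing
            # track of everything already appended to `pending`.
            log.warning("could not issue %s: %s", op.op_id, err)
            continue
        pending.append(op)
    return pending


def flaky_issue(op: Operation) -> None:
    """Hypothetical issuer that rejects one specific operation."""
    if op.op_id == "op-2":
        raise IssueError("rejected")


ops = [Operation("op-1"), Operation("op-2"), Operation("op-3")]
pending = execute_operations(ops, flaky_issue)
print([op.op_id for op in pending])  # prints ['op-1', 'op-3']
```

With the pending list intact, the caller can go on to poll those operations and apply the db updates for the ones that complete, which is the step the early exit was skipping.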

Type of Change

- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

- [ ] This PR contains breaking changes

Testing

- [ ] Unit tests added/updated
- [x] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

Additional Notes

Signed-off-by: Thomas McRoberts <tmcroberts@nvidia.com>
tmcroberts97 merged commit 2c57e4d into NVIDIA:main on Feb 16, 2026
33 of 34 checks passed
jd-nv pushed a commit that referenced this pull request Feb 19, 2026
## Description
<!-- Describe what this PR does -->
Do not exit early from execute_nmx_m_operations() if we cannot issue an
operation to NMX-M. Instead add only successfully enqueued operations to
pending list and ignore any that errored out. Exiting
execute_nmx_m_operations() early with an error was resulting in skipped
db updates even for those operations that were successfully enqueued and
completed by NMX-M.

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality  
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [x] Integration tests added/updated  
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Thomas McRoberts <tmcroberts@nvidia.com>
Co-authored-by: Roopesh Tamma <rtamma@nvidia.com>
tmcroberts97 added a commit to tmcroberts97/infra-controller-core that referenced this pull request Mar 12, 2026
…A#273)
tmcroberts97 added a commit to tmcroberts97/infra-controller-core that referenced this pull request Mar 12, 2026
…A#273)
jd-nv pushed a commit that referenced this pull request Mar 12, 2026
nvcoop pushed a commit to nvcoop/bare-metal-manager-core that referenced this pull request Mar 12, 2026
…A#273)