Skip to content

Add fix for instances pending network config and add back missing nvlink changes to v0.3.0#540

Closed
tmcroberts97 wants to merge 5 commits into
NVIDIA:release/v0.3.0from
tmcroberts97:thomas/release/v0.3.0-including-hotfixes
Closed

Add fix for instances pending network config and add back missing nvlink changes to v0.3.0#540
tmcroberts97 wants to merge 5 commits into
NVIDIA:release/v0.3.0from
tmcroberts97:thomas/release/v0.3.0-including-hotfixes

Conversation

@tmcroberts97
Copy link
Copy Markdown
Contributor

@tmcroberts97 tmcroberts97 commented Mar 12, 2026

Description

This PR adds #538, a formatting fix for DPU_TO_EVPN_DROP_PREFIX_LIST that was causing stuck instances, to v0.3.0. It also adds back 4 NVLink partitioning commits that were present in the v0.2.0 branch but missing in v0.3.0.

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@tmcroberts97 tmcroberts97 requested a review from a team March 12, 2026 17:00
@tmcroberts97 tmcroberts97 requested a review from a team as a code owner March 12, 2026 17:00
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Mar 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

tmcroberts97 and others added 5 commits March 12, 2026 17:06
…NVIDIA#276)

## Description
<!-- Describe what this PR does -->
Don't log when nvl partition monitor encounters a GPU in the default
partition when the machine has no instance on it. Also, properly log the NMX-M error.

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

---------

Signed-off-by: Thomas McRoberts <tmcroberts@nvidia.com>
## Description
<!-- Describe what this PR does -->
This PR fixes two issues with nvlink partitioning metrics:
- Always record metrics, even if the partition monitor encounters an
error.
- populate the operation name correctly in metrics' applied changes
tracker.

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

---------

Signed-off-by: Thomas McRoberts <tmcroberts@nvidia.com>
## Description
<!-- Describe what this PR does -->
Limit partition names to 240 characters so that we are under NMX-M API's
limit of 244.

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Thomas McRoberts <tmcroberts@nvidia.com>
…A#273)

## Description
<!-- Describe what this PR does -->
Do not exit early from execute_nmx_m_operations() if we cannot issue an
operation to NMX-M. Instead add only successfully enqueued operations to
pending list and ignore any that errored out. Exiting
execute_nmx_m_operations() early with an error was resulting in skipped
db updates even for those operations that were successfully enqueued and
completed by NMX-M.

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [x] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Thomas McRoberts <tmcroberts@nvidia.com>
Co-authored-by: Roopesh Tamma <rtamma@nvidia.com>
Signed-off-by: Thomas McRoberts <tmcroberts@nvidia.com>
…EFIX_LIST (NVIDIA#538)

## Description
- Rule 65535 in the DPU_TO_EVPN_DROP_PREFIX_LIST prefix-list was using
route-map object syntax (action: {deny: {}}) instead of the string
syntax required by prefix-list rules (action: deny), causing NVUE schema
validation to reject the entire network config with the error: Config
invalid at
router.policy.prefix-list.DPU_TO_EVPN_DROP_PREFIX_LIST.rule.65535.action:
{'deny': {}} is not of type 'string'
- This rule was added in NVIDIA#457 to handle the case where no VPCs are
configured (making rule 10 conditional), but was copy-pasted from the
adjacent route-map section which uses a different schema for action
- Fix updates the template and 4 test expected files

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Laura Gray <lgray@nvidia.com>
Signed-off-by: Thomas McRoberts <tmcroberts@nvidia.com>
@tmcroberts97 tmcroberts97 force-pushed the thomas/release/v0.3.0-including-hotfixes branch from 00f2dfc to af4b26a Compare March 12, 2026 17:07
@ajf
Copy link
Copy Markdown
Collaborator

ajf commented Mar 18, 2026

There's three reverts on the 0.3.0 branch from Friday that we never deployed. This change seems to be re-reverting only one of the reverts?

'65535':
action:
deny: {}
action: deny
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, @ajf is correct. this won't work without the follow-up fix

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commit 0a4f3b8

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and then you need 56d6a51 (which is not related to this file, but was a bug in the initial fix)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wminckler Is 56d6a51 the actual fix to the original change that we worked around by restarting nl2doca regularly?

I guess I propose we delete my reverts out of 0.3.0 (or revert the revert) to match up 0.3.0 with main, and then cherry-pick your change onto 0.3.0? I feel like deploying 0.3.0 now is actually unsafe.

Copy link
Copy Markdown
Contributor

@wminckler wminckler Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for the confusion. there were 4 commits.

  1. e508409 was the original change for the admin IP issue, which broke the nvue config
  2. 875ab94 fixed 1 issue with the nvue config
  3. 29fae19 fixed another issue with the nvue config
  4. 56d6a51 fixes reprovisioning and dpu ingestion

1,2 and 3 were put in 3.0 and then reverted when 4 was discovered
4 was never in 3.0

2, 3 and 4 were bugs introduced in 1
I don't think anything was reverted in main. all 4 are needed to solve the original problem.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's switch to declarative, what do we want the end-state on 0.3.0 to be?

Seems like we want the original change to the multi-DPU logic, the two fixes fixing the NVUE syntax, the nl2doca restart, and 56d6a51?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but this PR also has changes for NVLink. do we want those too?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imagine so, but hopefully as a single clean commit rebased on the other changes we're talking about.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I made #634 but I dont know why lines 449 to 464 are getting deleted. I'll ping you over there.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when you look at all 4 commits, its all new

@ajf
Copy link
Copy Markdown
Collaborator

ajf commented Mar 19, 2026

@tmcroberts97 we merged #634 which brings 0.3.0 release branch back into sync with main. Please rebase on top of that for just the change you want in 0.3.0

@github-actions
Copy link
Copy Markdown

This PR has been inactive for 30 days and will be closed soon.
Please push commits or comment to keep it open.

@github-actions
Copy link
Copy Markdown

This pull request was automatically closed due to inactivity after being marked as stale.

If you would like to continue this work:

  • Reopen and push new commits or add a comment with an update
  • Add the lifecycle/frozen label to prevent it from being closed again

@github-actions github-actions Bot closed this Apr 26, 2026
ajf pushed a commit to ajf/infra-controller that referenced this pull request Jun 1, 2026
)

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
chet added a commit to chet/bare-metal-manager-core that referenced this pull request Jun 2, 2026
Brings `InfiniBandPartition` in line with the proto-modeling changes. The handler keeps its REST surface; Create and Update route through `ToProto` on the API request types and the entity gains the full round-trip plus a delete-request builder. The entity also picks up the typed `Labels` field per the `InstanceType` reference in NVIDIA#540 -- it reaches `Labels.ToProto()` / `Labels.FromProto(...)` directly.

Primary callouts are:
- API request side: `APIInfiniBandPartitionCreateRequest.ToProto(ibp)` and `APIInfiniBandPartitionUpdateRequest.ToProto(ibp)`; both source corresponding fields via `ibp.ToProto()`.
- Entity side: `InfiniBandPartition.ToProto`, `FromProto`, and `ToDeletionRequestProto`. There's also a new `InfiniBandPartition.Validate()` so site-driven `FromProto` callers can do `FromProto` + `Validate` -- mirroring the MachineCapability pattern (which went in at NVIDIA#569).
- `InfiniBandPartitionStatus` is now a typed string with its own `.FromProto(state)` and `.Message()` methods, replacing the `InfiniBandPartitionStatusFromProto(state)` helper that split a single proto state into `(status, message)`. The workflow activity caller is just a two-liner now.
- Handler simplified to one-liners, and the inline `cwssaws` / `model/util` imports drop.

As part of the new typed `InfiniBandPartitionStatus` (and the already-typed `Labels`), the typing propagates through the API and DAO layers, but nothing changes on the wire -- JSON serialization of `type X string` is identical to `string`, the OpenAPI spec still just says "string", and the SDK is unaffected.

Tests added!

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
chet added a commit to chet/bare-metal-manager-core that referenced this pull request Jun 2, 2026
Brings `InfiniBandPartition` in line with the proto-modeling changes. The handler keeps its REST surface; Create and Update route through `ToProto` on the API request types and the entity gains the full round-trip plus a delete-request builder. The entity also picks up the typed `Labels` field per the `InstanceType` reference in NVIDIA#540 -- it reaches `Labels.ToProto()` / `Labels.FromProto(...)` directly.

Primary callouts are:
- API request side: `APIInfiniBandPartitionCreateRequest.ToProto(ibp)` and `APIInfiniBandPartitionUpdateRequest.ToProto(ibp)`; both source corresponding fields via `ibp.ToProto()`.
- Entity side: `InfiniBandPartition.ToProto`, `FromProto`, and `ToDeletionRequestProto`. There's also a new `InfiniBandPartition.Validate()` so site-driven `FromProto` callers can do `FromProto` + `Validate` -- mirroring the MachineCapability pattern (which went in at NVIDIA#569).
- `InfiniBandPartitionStatus` is now a typed string with its own `.FromProto(state)` and `.Message()` methods, replacing the `InfiniBandPartitionStatusFromProto(state)` helper that split a single proto state into `(status, message)`. The workflow activity caller is just a two-liner now.
- Handler simplified to one-liners, and the inline `cwssaws` / `model/util` imports drop.

As part of the new typed `InfiniBandPartitionStatus` (and the already-typed `Labels`), the typing propagates through the API and DAO layers, but nothing changes on the wire -- JSON serialization of `type X string` is identical to `string`, the OpenAPI spec still just says "string", and the SDK is unaffected.

Tests added!

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
chet added a commit that referenced this pull request Jun 3, 2026
…rtition (#2116)

## Description

This is NVIDIA/infra-controller-rest#594,
re-opened over here (and had already gone through @coderabbitai feedback
loops, curious to see what happens here).

Brings `InfiniBandPartition` in line with the proto-modeling changes.
The handler keeps its REST surface; Create and Update route through
`ToProto` on the API request types and the entity gains the full
round-trip plus a delete-request builder. The entity also picks up the
typed `Labels` field per the `InstanceType` reference in #540 -- it
reaches `Labels.ToProto()` / `Labels.FromProto(...)` directly.

Primary callouts are:
- API request side: `APIInfiniBandPartitionCreateRequest.ToProto(ibp)`
and `APIInfiniBandPartitionUpdateRequest.ToProto(ibp)`; both source
corresponding fields via `ibp.ToProto()`.
- Entity side: `InfiniBandPartition.ToProto`, `FromProto`, and
`ToDeletionRequestProto`. There's also a new
`InfiniBandPartition.Validate()` so site-driven `FromProto` callers can
do `FromProto` + `Validate` -- mirroring the MachineCapability pattern
(which went in at #569).
- `InfiniBandPartitionStatus` is now a typed string with its own
`.FromProto(state)` and `.Message()` methods, replacing the
`InfiniBandPartitionStatusFromProto(state)` helper that split a single
proto state into `(status, message)`. The workflow activity caller is
just a two-liner now.
- Handler simplified to one-liners, and the inline `cwssaws` /
`model/util` imports drop.

As part of the new typed `InfiniBandPartitionStatus` (and the
already-typed `Labels`), the typing propagates through the API and DAO
layers, but nothing changes on the wire -- JSON serialization of `type X
string` is identical to `string`, the OpenAPI spec still just says
"string", and the SDK is unaffected.

Tests added!

Signed-off-by: Chet Nichols III <chetn@nvidia.com>

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality  
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [x] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [ ] Integration tests added/updated  
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
  * Corrected InfiniBand partition status handling in test scenarios.

* **Refactor**
* Improved internal type safety for InfiniBand partition status
management.
* Enhanced validation for InfiniBand partition operations and request
parameters.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
chet added a commit to chet/bare-metal-manager-core that referenced this pull request Jun 4, 2026
Brings `NetworkSecurityGroup` in line with the proto-modeling changes. Create and Update route through `ToProto` on the API request types, the entity gets `ToProto` / `FromProto` plus `ToDeletionRequestProto`, and a new rule-level `Validate` gates every request before any `ToProto` runs. Also picks up the typed `Labels` field per the `InstanceType` reference (NVIDIA#540).

Primary callouts are:
- API request side: `APINetworkSecurityGroupCreateRequest.ToProto(nsg, id)` and `APINetworkSecurityGroupUpdateRequest.ToProto(nsg)`, plus `APINetworkSecurityGroupRule.ToProto` / `FromProto` as pure mappers.
- Entity side: `NetworkSecurityGroup.ToProto()`, `FromProto`, and `ToDeletionRequestProto()`.
- New `APINetworkSecurityGroupRule.Validate()` lifts the priority / enum / port-protocol / CIDR / required-prefix checks out of `ToProto`. Both Create and Update request validators call it per rule.
- Handler simplified to one-liners through the new receivers, and the corresponding error-handling branches drop.

Tests added!

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
chet added a commit to chet/bare-metal-manager-core that referenced this pull request Jun 4, 2026
Brings `NetworkSecurityGroup` in line with the proto-modeling changes. Create and Update route through `ToProto` on the API request types, the entity gets `ToProto` / `FromProto` plus `ToDeletionRequestProto`, and a new rule-level `Validate` gates every request before any `ToProto` runs. Also picks up the typed `Labels` field per the `InstanceType` reference (NVIDIA#540).

Primary callouts are:
- API request side: `APINetworkSecurityGroupCreateRequest.ToProto(nsg, id)` and `APINetworkSecurityGroupUpdateRequest.ToProto(nsg)`, plus `APINetworkSecurityGroupRule.ToProto` / `FromProto` as pure mappers.
- Entity side: `NetworkSecurityGroup.ToProto()`, `FromProto`, and `ToDeletionRequestProto()`.
- New `APINetworkSecurityGroupRule.Validate()` lifts the priority / enum / port-protocol / CIDR / required-prefix checks out of `ToProto`. Both Create and Update request validators call it per rule.
- Handler simplified to one-liners through the new receivers, and the corresponding error-handling branches drop.

Tests added!

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants