Add fix for instances pending network config and add back missing nvlink changes to v0.3.0 by tmcroberts97 · Pull Request #540 · NVIDIA/infra-controller

tmcroberts97 · 2026-03-12T17:00:45Z

Description

This PR adds #538, a formatting fix for DPU_TO_EVPN_DROP_PREFIX_LIST that was causing stuck instances, to v0.3.0. It also adds back 4 NVLink partitioning commits that were present in the v0.2.0 branch but missing in v0.3.0.

Type of Change

Add - New feature or capability
Change - Changes in existing functionality
Fix - Bug fixes
Remove - Removed features or deprecated functionality
Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

This PR contains breaking changes

Testing

Unit tests added/updated
Integration tests added/updated
Manual testing performed
No testing required (docs, internal refactor, etc.)

Additional Notes

copy-pr-bot · 2026-03-12T17:00:50Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

…NVIDIA#276)

## Description

Don't log when nvl partition monitor encounters a GPU in the default partition when the machine has no instance on it. Also, properly log the NMX-M error.

## Type of Change

- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)

## Breaking Changes

- [ ] This PR contains breaking changes

## Testing

- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes

---------

Signed-off-by: Thomas McRoberts <tmcroberts@nvidia.com>

## Description

This PR fixes two issues with nvlink partitioning metrics:

- Always record metrics, even if the partition monitor encounters an error.
- Populate the operation name correctly in metrics' applied changes tracker.

## Type of Change

- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)

## Breaking Changes

- [ ] This PR contains breaking changes

## Testing

- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes

---------

Signed-off-by: Thomas McRoberts <tmcroberts@nvidia.com>
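The "always record metrics" behavior in the commit above can be illustrated with a minimal sketch. This is not the project's code: `Metrics`, `run_iteration`, and the operation names are hypothetical and stand in for whatever the partition monitor actually uses. The idea is only that the recording step sits in a `finally` clause, so it runs on both the success and the error path.

```python
from dataclasses import dataclass, field


@dataclass
class Metrics:
    """Hypothetical metrics sink: one (operation, ok) entry per iteration."""
    recorded: list = field(default_factory=list)

    def record(self, operation: str, ok: bool) -> None:
        self.recorded.append((operation, ok))


def run_iteration(metrics: Metrics, operation: str, work) -> None:
    """Run one monitor iteration and record metrics whether or not it fails."""
    ok = False
    try:
        work()
        ok = True
    finally:
        # Executes on success and on error, so a failing iteration
        # still shows up in the metrics with the right operation name.
        metrics.record(operation, ok)
```

The exception still propagates to the caller; the sketch only guarantees that the metric is written first.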

## Description

Limit partition names to 240 characters so that we are under NMX-M API's limit of 244.

## Type of Change

- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)

## Breaking Changes

- [ ] This PR contains breaking changes

## Testing

- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes

Signed-off-by: Thomas McRoberts <tmcroberts@nvidia.com>
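As a rough illustration of the 240-character cap described above, the sketch below truncates a name to the limit. The commit message does not say whether the real change truncates or rejects over-long names, nor whether the limit counts characters or bytes, so both choices here are assumptions, and `limit_partition_name` is a hypothetical helper.

```python
# 240 comes from the commit message; the NMX-M API limit is stated there as 244.
MAX_PARTITION_NAME_LEN = 240


def limit_partition_name(name: str, max_len: int = MAX_PARTITION_NAME_LEN) -> str:
    """Return the name cut to at most max_len characters (assumed: truncation)."""
    return name[:max_len]
```

Names already within the limit are returned unchanged.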

…A#273)

## Description

Do not exit early from execute_nmx_m_operations() if we cannot issue an operation to NMX-M. Instead add only successfully enqueued operations to pending list and ignore any that errored out. Exiting execute_nmx_m_operations() early with an error was resulting in skipped db updates even for those operations that were successfully enqueued and completed by NMX-M.

## Type of Change

- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)

## Breaking Changes

- [ ] This PR contains breaking changes

## Testing

- [ ] Unit tests added/updated
- [x] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes

Signed-off-by: Thomas McRoberts <tmcroberts@nvidia.com>
Co-authored-by: Roopesh Tamma <rtamma@nvidia.com>
Signed-off-by: Thomas McRoberts <tmcroberts@nvidia.com>
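The control-flow change described in this commit can be sketched as follows. This is a simplified illustration, not the body of `execute_nmx_m_operations()`: the `enqueue` callable, the operation values, and the returned `(pending, failed)` pair are hypothetical. It shows one failed enqueue being set aside while the loop continues, so operations that were accepted still reach the pending list and can be followed up with database updates.

```python
def execute_operations(operations, enqueue):
    """Enqueue every operation; keep the successes, set aside the failures.

    `enqueue` is a hypothetical callable that returns a handle on success
    and raises on failure.
    """
    pending = []  # (operation, handle) pairs accepted by the remote service
    failed = []   # (operation, error) pairs, kept for logging
    for op in operations:
        try:
            handle = enqueue(op)
        except Exception as err:
            # Do not return here: an early exit would drop the operations
            # that were already enqueued successfully.
            failed.append((op, err))
            continue
        pending.append((op, handle))
    return pending, failed
```

Returning the failures alongside the pending list is a design choice of this sketch; the commit message only says the errored operations are ignored.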

…EFIX_LIST (NVIDIA#538)

## Description

- Rule 65535 in the DPU_TO_EVPN_DROP_PREFIX_LIST prefix-list was using route-map object syntax (action: {deny: {}}) instead of the string syntax required by prefix-list rules (action: deny), causing NVUE schema validation to reject the entire network config with the error: Config invalid at router.policy.prefix-list.DPU_TO_EVPN_DROP_PREFIX_LIST.rule.65535.action: {'deny': {}} is not of type 'string'
- This rule was added in NVIDIA#457 to handle the case where no VPCs are configured (making rule 10 conditional), but was copy-pasted from the adjacent route-map section which uses a different schema for action
- Fix updates the template and 4 test expected files

## Type of Change

- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)

## Breaking Changes

- [ ] This PR contains breaking changes

## Testing

- [x] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes

Signed-off-by: Laura Gray <lgray@nvidia.com>
Signed-off-by: Thomas McRoberts <tmcroberts@nvidia.com>
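To make the schema mismatch concrete, here is a small sketch of the type check that the quoted error message implies. It is not NVUE's validator and does not use any NVUE API; `check_prefix_list_rule` is a hypothetical helper and the two rule dictionaries mirror only the `action` shapes quoted in the commit message.

```python
def check_prefix_list_rule(rule_id: str, rule: dict):
    """Return an error string if the rule's action is not a plain string."""
    action = rule.get("action")
    if not isinstance(action, str):
        return f"rule.{rule_id}.action: {action!r} is not of type 'string'"
    return None


# Route-map style (object) action, as copy-pasted into the prefix-list rule.
broken_rule = {"action": {"deny": {}}}
# Prefix-list style (string) action, as used by the fix.
fixed_rule = {"action": "deny"}
```

Calling the helper on `broken_rule` yields an error message, while `fixed_rule` passes and returns `None`.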

ajf · 2026-03-18T00:05:49Z

There are three reverts on the 0.3.0 branch from Friday that we never deployed. This change seems to be re-reverting only one of the reverts?

wminckler · 2026-03-18T14:21:29Z

              '65535':
-                action:
-                  deny: {}
+                action: deny


yea, @ajf is correct. this won't work without the follow-up fix

commit 0a4f3b8

and then you need 56d6a51 (which is not related to this file, but was a bug in the initial fix)

@wminckler Is 56d6a51 the actual fix to the original change that we worked around by restarting nl2doca regularly?

I guess I propose we delete my reverts out of 0.3.0 (or revert the revert) to match up 0.3.0 with main, and then cherry-pick your change onto 0.3.0? I feel like deploying 0.3.0 now is actually unsafe.

sorry for the confusion. there were 4 commits.

e508409 was the original change for the admin IP issue, which broke the nvue config

875ab94 fixed 1 issue with the nvue config

29fae19 fixed another issue with the nvue config

56d6a51 fixes reprovisioning and dpu ingestion

1, 2, and 3 (e508409, 875ab94, 29fae19) were put in 0.3.0 and then reverted when 4 (56d6a51) was discovered
4 was never in 0.3.0

2, 3, and 4 fix bugs introduced in 1
I don't think anything was reverted in main. all 4 are needed to solve the original problem.

Ok, let's switch to declarative, what do we want the end-state on 0.3.0 to be?

Seems like we want the original change to the multi-DPU logic, the two fixes fixing the NVUE syntax, the nl2doca restart, and 56d6a51?

yes, but this PR also has changes for NVLink. do we want those too?

I imagine so, but hopefully as a single clean commit rebased on the other changes we're talking about.

Ok, I made #634 but I don't know why lines 449 to 464 are getting deleted. I'll ping you over there.

when you look at all 4 commits, it's all new

ajf · 2026-03-19T16:30:03Z

@tmcroberts97 we merged #634 which brings 0.3.0 release branch back into sync with main. Please rebase on top of that for just the change you want in 0.3.0

github-actions · 2026-04-19T06:35:18Z

This PR has been inactive for 30 days and will be closed soon.
Please push commits or comment to keep it open.

github-actions · 2026-04-26T06:38:49Z

This pull request was automatically closed due to inactivity after being marked as stale.

If you would like to continue this work:

Reopen and push new commits or add a comment with an update
Add the lifecycle/frozen label to prevent it from being closed again


…rtition (#2116)

## Description

This is NVIDIA/infra-controller-rest#594, re-opened over here (and had already gone through @coderabbitai feedback loops, curious to see what happens here).

Brings `InfiniBandPartition` in line with the proto-modeling changes. The handler keeps its REST surface; Create and Update route through `ToProto` on the API request types and the entity gains the full round-trip plus a delete-request builder. The entity also picks up the typed `Labels` field per the `InstanceType` reference in #540 -- it reaches `Labels.ToProto()` / `Labels.FromProto(...)` directly.

Primary callouts are:

- API request side: `APIInfiniBandPartitionCreateRequest.ToProto(ibp)` and `APIInfiniBandPartitionUpdateRequest.ToProto(ibp)`; both source corresponding fields via `ibp.ToProto()`.
- Entity side: `InfiniBandPartition.ToProto`, `FromProto`, and `ToDeletionRequestProto`. There's also a new `InfiniBandPartition.Validate()` so site-driven `FromProto` callers can do `FromProto` + `Validate` -- mirroring the MachineCapability pattern (which went in at #569).
- `InfiniBandPartitionStatus` is now a typed string with its own `.FromProto(state)` and `.Message()` methods, replacing the `InfiniBandPartitionStatusFromProto(state)` helper that split a single proto state into `(status, message)`. The workflow activity caller is just a two-liner now.
- Handler simplified to one-liners, and the inline `cwssaws` / `model/util` imports drop.

As part of the new typed `InfiniBandPartitionStatus` (and the already-typed `Labels`), the typing propagates through the API and DAO layers, but nothing changes on the wire -- JSON serialization of `type X string` is identical to `string`, the OpenAPI spec still just says "string", and the SDK is unaffected.

Tests added!

Signed-off-by: Chet Nichols III <chetn@nvidia.com>

## Type of Change

- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [x] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)

## Breaking Changes

- [ ] This PR contains breaking changes

## Testing

- [x] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes

## Summary by CodeRabbit

* **Bug Fixes**
  * Corrected InfiniBand partition status handling in test scenarios.
* **Refactor**
  * Improved internal type safety for InfiniBand partition status management.
  * Enhanced validation for InfiniBand partition operations and request parameters.

Signed-off-by: Chet Nichols III <chetn@nvidia.com>

Brings `NetworkSecurityGroup` in line with the proto-modeling changes. Create and Update route through `ToProto` on the API request types, the entity gets `ToProto` / `FromProto` plus `ToDeletionRequestProto`, and a new rule-level `Validate` gates every request before any `ToProto` runs. Also picks up the typed `Labels` field per the `InstanceType` reference (NVIDIA#540).

Primary callouts are:

- API request side: `APINetworkSecurityGroupCreateRequest.ToProto(nsg, id)` and `APINetworkSecurityGroupUpdateRequest.ToProto(nsg)`, plus `APINetworkSecurityGroupRule.ToProto` / `FromProto` as pure mappers.
- Entity side: `NetworkSecurityGroup.ToProto()`, `FromProto`, and `ToDeletionRequestProto()`.
- New `APINetworkSecurityGroupRule.Validate()` lifts the priority / enum / port-protocol / CIDR / required-prefix checks out of `ToProto`. Both Create and Update request validators call it per rule.
- Handler simplified to one-liners through the new receivers, and the corresponding error-handling branches drop.

Tests added!

Signed-off-by: Chet Nichols III <chetn@nvidia.com>

tmcroberts97 requested a review from a team March 12, 2026 17:00

tmcroberts97 requested a review from a team as a code owner March 12, 2026 17:00

tmcroberts97 and others added 5 commits March 12, 2026 17:06

tmcroberts97 force-pushed the thomas/release/v0.3.0-including-hotfixes branch from 00f2dfc to af4b26a Compare March 12, 2026 17:07

nvcoop approved these changes Mar 12, 2026


wminckler reviewed Mar 18, 2026


github-actions Bot added the lifecycle/stale label Apr 19, 2026

github-actions Bot closed this Apr 26, 2026

ajf pushed a commit to ajf/infra-controller that referenced this pull request Jun 1, 2026

chore: Apply the ToProto modeling convention to InstanceType (NVIDIA#540)

Signed-off-by: Chet Nichols III <chetn@nvidia.com>

598b4fa

chet mentioned this pull request Jun 2, 2026

chore(rest-api): Apply our FromProto/ToProto modeling to InfiniBandPartition #2116

Merged

10 tasks


5 participants
