Add fix for instances pending network config and add back missing nvlink changes to v0.3.0#540
Conversation
…NVIDIA#276) ## Description <!-- Describe what this PR does --> Don't log when nvl partition monitor encounters a GPU in the default partition when the machine has no instance on it. Also, properly log the NMX-M error. ## Type of Change <!-- Check one that best describes this PR --> - [ ] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [x] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) <!-- If applicable, provide GitHub Issue. --> ## Breaking Changes - [ ] This PR contains breaking changes <!-- If checked above, describe the breaking changes and migration steps --> ## Testing <!-- How was this tested? Check all that apply --> - [ ] Unit tests added/updated - [ ] Integration tests added/updated - [ ] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes <!-- Any additional context, deployment notes, or reviewer guidance --> --------- Signed-off-by: Thomas McRoberts <tmcroberts@nvidia.com>
## Description <!-- Describe what this PR does --> This PR fixes two issues with nvlink partitioning metrics: - Always record metrics, even if the partition monitor encounters an error. - populate the operation name correctly in metrics' applied changes tracker. ## Type of Change <!-- Check one that best describes this PR --> - [ ] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [x] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) <!-- If applicable, provide GitHub Issue. --> ## Breaking Changes - [ ] This PR contains breaking changes <!-- If checked above, describe the breaking changes and migration steps --> ## Testing <!-- How was this tested? Check all that apply --> - [ ] Unit tests added/updated - [ ] Integration tests added/updated - [ ] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes <!-- Any additional context, deployment notes, or reviewer guidance --> --------- Signed-off-by: Thomas McRoberts <tmcroberts@nvidia.com>
## Description <!-- Describe what this PR does --> Limit partition names to 240 characters so that we are under NMX-M API's limit of 244. ## Type of Change <!-- Check one that best describes this PR --> - [ ] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [x] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) <!-- If applicable, provide GitHub Issue. --> ## Breaking Changes - [ ] This PR contains breaking changes <!-- If checked above, describe the breaking changes and migration steps --> ## Testing <!-- How was this tested? Check all that apply --> - [ ] Unit tests added/updated - [ ] Integration tests added/updated - [ ] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes <!-- Any additional context, deployment notes, or reviewer guidance --> Signed-off-by: Thomas McRoberts <tmcroberts@nvidia.com>
…A#273) ## Description <!-- Describe what this PR does --> Do not exit early from execute_nmx_m_operations() if we cannot issue an operation to NMX-M. Instead add only successfully enqueued operations to pending list and ignore any that errored out. Exiting execute_nmx_m_operations() early with an error was resulting in skipped db updates even for those operations that were successfully enqueued and completed by NMX-M. ## Type of Change <!-- Check one that best describes this PR --> - [ ] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [x] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) <!-- If applicable, provide GitHub Issue. --> ## Breaking Changes - [ ] This PR contains breaking changes <!-- If checked above, describe the breaking changes and migration steps --> ## Testing <!-- How was this tested? Check all that apply --> - [ ] Unit tests added/updated - [x] Integration tests added/updated - [ ] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes <!-- Any additional context, deployment notes, or reviewer guidance --> Signed-off-by: Thomas McRoberts <tmcroberts@nvidia.com> Co-authored-by: Roopesh Tamma <rtamma@nvidia.com> Signed-off-by: Thomas McRoberts <tmcroberts@nvidia.com>
…EFIX_LIST (NVIDIA#538) ## Description - Rule 65535 in the DPU_TO_EVPN_DROP_PREFIX_LIST prefix-list was using route-map object syntax (action: {deny: {}}) instead of the string syntax required by prefix-list rules (action: deny), causing NVUE schema validation to reject the entire network config with the error: Config invalid at router.policy.prefix-list.DPU_TO_EVPN_DROP_PREFIX_LIST.rule.65535.action: {'deny': {}} is not of type 'string' - This rule was added in NVIDIA#457 to handle the case where no VPCs are configured (making rule 10 conditional), but was copy-pasted from the adjacent route-map section which uses a different schema for action - Fix updates the template and 4 test expected files ## Type of Change <!-- Check one that best describes this PR --> - [ ] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [x] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) <!-- If applicable, provide GitHub Issue. --> ## Breaking Changes - [ ] This PR contains breaking changes <!-- If checked above, describe the breaking changes and migration steps --> ## Testing <!-- How was this tested? Check all that apply --> - [x] Unit tests added/updated - [ ] Integration tests added/updated - [ ] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes <!-- Any additional context, deployment notes, or reviewer guidance --> Signed-off-by: Laura Gray <lgray@nvidia.com> Signed-off-by: Thomas McRoberts <tmcroberts@nvidia.com>
00f2dfc to
af4b26a
Compare
|
There's three reverts on the 0.3.0 branch from Friday that we never deployed. This change seems to be re-reverting only one of the reverts? |
| '65535': | ||
| action: | ||
| deny: {} | ||
| action: deny |
There was a problem hiding this comment.
yea, @ajf is correct. this won't work without the follow-up fix
There was a problem hiding this comment.
and then you need 56d6a51 (which is not related to this file, but was a bug in the initial fix)
There was a problem hiding this comment.
@wminckler Is 56d6a51 the actual fix to the original change that we worked around by restarting nl2doca regularly?
I guess I propose we delete my reverts out of 0.3.0 (or revert the revert) to match up 0.3.0 with main, and then cherry-pick your change onto 0.3.0? I feel like deploying 0.3.0 now is actually unsafe.
There was a problem hiding this comment.
sorry for the confusion. there were 4 commits.
- e508409 was the original change for the admin IP issue, which broke the nvue config
- 875ab94 fixed 1 issue with the nvue config
- 29fae19 fixed another issue with the nvue config
- 56d6a51 fixes reprovisioning and dpu ingestion
1,2 and 3 were put in 3.0 and then reverted when 4 was discovered
4 was never in 3.0
2, 3 and 4 were bugs introduced in 1
I don't think anything was reverted in main. all 4 are needed to solve the original problem.
There was a problem hiding this comment.
Ok, let's switch to declarative, what do we want the end-state on 0.3.0 to be?
Seems like we want the original change to the multi-DPU logic, the two fixes fixing the NVUE syntax, the nl2doca restart, and 56d6a51?
There was a problem hiding this comment.
yes, but this PR also has changes for NVLink. do we want those too?
There was a problem hiding this comment.
I imagine so, but hopefully as a single clean commit rebased on the other changes we're talking about.
There was a problem hiding this comment.
Ok, I made #634 but I dont know why lines 449 to 464 are getting deleted. I'll ping you over there.
There was a problem hiding this comment.
when you look at all 4 commits, its all new
|
@tmcroberts97 we merged #634 which brings 0.3.0 release branch back into sync with main. Please rebase on top of that for just the change you want in 0.3.0 |
|
This PR has been inactive for 30 days and will be closed soon. |
|
This pull request was automatically closed due to inactivity after being marked as stale. If you would like to continue this work:
|
Brings `InfiniBandPartition` in line with the proto-modeling changes. The handler keeps its REST surface; Create and Update route through `ToProto` on the API request types and the entity gains the full round-trip plus a delete-request builder. The entity also picks up the typed `Labels` field per the `InstanceType` reference in NVIDIA#540 -- it reaches `Labels.ToProto()` / `Labels.FromProto(...)` directly. Primary callouts are: - API request side: `APIInfiniBandPartitionCreateRequest.ToProto(ibp)` and `APIInfiniBandPartitionUpdateRequest.ToProto(ibp)`; both source corresponding fields via `ibp.ToProto()`. - Entity side: `InfiniBandPartition.ToProto`, `FromProto`, and `ToDeletionRequestProto`. There's also a new `InfiniBandPartition.Validate()` so site-driven `FromProto` callers can do `FromProto` + `Validate` -- mirroring the MachineCapability pattern (which went in at NVIDIA#569). - `InfiniBandPartitionStatus` is now a typed string with its own `.FromProto(state)` and `.Message()` methods, replacing the `InfiniBandPartitionStatusFromProto(state)` helper that split a single proto state into `(status, message)`. The workflow activity caller is just a two-liner now. - Handler simplified to one-liners, and the inline `cwssaws` / `model/util` imports drop. As part of the new typed `InfiniBandPartitionStatus` (and the already-typed `Labels`), the typing propagates through the API and DAO layers, but nothing changes on the wire -- JSON serialization of `type X string` is identical to `string`, the OpenAPI spec still just says "string", and the SDK is unaffected. Tests added! Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Brings `InfiniBandPartition` in line with the proto-modeling changes. The handler keeps its REST surface; Create and Update route through `ToProto` on the API request types and the entity gains the full round-trip plus a delete-request builder. The entity also picks up the typed `Labels` field per the `InstanceType` reference in NVIDIA#540 -- it reaches `Labels.ToProto()` / `Labels.FromProto(...)` directly. Primary callouts are: - API request side: `APIInfiniBandPartitionCreateRequest.ToProto(ibp)` and `APIInfiniBandPartitionUpdateRequest.ToProto(ibp)`; both source corresponding fields via `ibp.ToProto()`. - Entity side: `InfiniBandPartition.ToProto`, `FromProto`, and `ToDeletionRequestProto`. There's also a new `InfiniBandPartition.Validate()` so site-driven `FromProto` callers can do `FromProto` + `Validate` -- mirroring the MachineCapability pattern (which went in at NVIDIA#569). - `InfiniBandPartitionStatus` is now a typed string with its own `.FromProto(state)` and `.Message()` methods, replacing the `InfiniBandPartitionStatusFromProto(state)` helper that split a single proto state into `(status, message)`. The workflow activity caller is just a two-liner now. - Handler simplified to one-liners, and the inline `cwssaws` / `model/util` imports drop. As part of the new typed `InfiniBandPartitionStatus` (and the already-typed `Labels`), the typing propagates through the API and DAO layers, but nothing changes on the wire -- JSON serialization of `type X string` is identical to `string`, the OpenAPI spec still just says "string", and the SDK is unaffected. Tests added! Signed-off-by: Chet Nichols III <chetn@nvidia.com>
…rtition (#2116) ## Description This is NVIDIA/infra-controller-rest#594, re-opened over here (and had already gone through @coderabbitai feedback loops, curious to see what happens here). Brings `InfiniBandPartition` in line with the proto-modeling changes. The handler keeps its REST surface; Create and Update route through `ToProto` on the API request types and the entity gains the full round-trip plus a delete-request builder. The entity also picks up the typed `Labels` field per the `InstanceType` reference in #540 -- it reaches `Labels.ToProto()` / `Labels.FromProto(...)` directly. Primary callouts are: - API request side: `APIInfiniBandPartitionCreateRequest.ToProto(ibp)` and `APIInfiniBandPartitionUpdateRequest.ToProto(ibp)`; both source corresponding fields via `ibp.ToProto()`. - Entity side: `InfiniBandPartition.ToProto`, `FromProto`, and `ToDeletionRequestProto`. There's also a new `InfiniBandPartition.Validate()` so site-driven `FromProto` callers can do `FromProto` + `Validate` -- mirroring the MachineCapability pattern (which went in at #569). - `InfiniBandPartitionStatus` is now a typed string with its own `.FromProto(state)` and `.Message()` methods, replacing the `InfiniBandPartitionStatusFromProto(state)` helper that split a single proto state into `(status, message)`. The workflow activity caller is just a two-liner now. - Handler simplified to one-liners, and the inline `cwssaws` / `model/util` imports drop. As part of the new typed `InfiniBandPartitionStatus` (and the already-typed `Labels`), the typing propagates through the API and DAO layers, but nothing changes on the wire -- JSON serialization of `type X string` is identical to `string`, the OpenAPI spec still just says "string", and the SDK is unaffected. Tests added! Signed-off-by: Chet Nichols III <chetn@nvidia.com> ## Type of Change <!-- Check one that best describes this PR --> - [ ] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [ ] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [x] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) <!-- If applicable, provide GitHub Issue. --> ## Breaking Changes - [ ] This PR contains breaking changes <!-- If checked above, describe the breaking changes and migration steps --> ## Testing <!-- How was this tested? Check all that apply --> - [x] Unit tests added/updated - [ ] Integration tests added/updated - [ ] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes <!-- Any additional context, deployment notes, or reviewer guidance --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Corrected InfiniBand partition status handling in test scenarios. * **Refactor** * Improved internal type safety for InfiniBand partition status management. * Enhanced validation for InfiniBand partition operations and request parameters. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Brings `NetworkSecurityGroup` in line with the proto-modeling changes. Create and Update route through `ToProto` on the API request types, the entity gets `ToProto` / `FromProto` plus `ToDeletionRequestProto`, and a new rule-level `Validate` gates every request before any `ToProto` runs. Also picks up the typed `Labels` field per the `InstanceType` reference (NVIDIA#540). Primary callouts are: - API request side: `APINetworkSecurityGroupCreateRequest.ToProto(nsg, id)` and `APINetworkSecurityGroupUpdateRequest.ToProto(nsg)`, plus `APINetworkSecurityGroupRule.ToProto` / `FromProto` as pure mappers. - Entity side: `NetworkSecurityGroup.ToProto()`, `FromProto`, and `ToDeletionRequestProto()`. - New `APINetworkSecurityGroupRule.Validate()` lifts the priority / enum / port-protocol / CIDR / required-prefix checks out of `ToProto`. Both Create and Update request validators call it per rule. - Handler simplified to one-liners through the new receivers, and the corresponding error-handling branches drop. Tests added! Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Brings `NetworkSecurityGroup` in line with the proto-modeling changes. Create and Update route through `ToProto` on the API request types, the entity gets `ToProto` / `FromProto` plus `ToDeletionRequestProto`, and a new rule-level `Validate` gates every request before any `ToProto` runs. Also picks up the typed `Labels` field per the `InstanceType` reference (NVIDIA#540). Primary callouts are: - API request side: `APINetworkSecurityGroupCreateRequest.ToProto(nsg, id)` and `APINetworkSecurityGroupUpdateRequest.ToProto(nsg)`, plus `APINetworkSecurityGroupRule.ToProto` / `FromProto` as pure mappers. - Entity side: `NetworkSecurityGroup.ToProto()`, `FromProto`, and `ToDeletionRequestProto()`. - New `APINetworkSecurityGroupRule.Validate()` lifts the priority / enum / port-protocol / CIDR / required-prefix checks out of `ToProto`. Both Create and Update request validators call it per rule. - Handler simplified to one-liners through the new receivers, and the corresponding error-handling branches drop. Tests added! Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Description
This PR adds #538, a formatting fix for DPU_TO_EVPN_DROP_PREFIX_LIST that was causing stuck instances, to v0.3.0. It also adds back 4 NVLink partitioning commits that were present in the v0.2.0 branch but missing in v0.3.0.
Type of Change
Related Issues (Optional)
Breaking Changes
Testing
Additional Notes