Skip to content

Remove config of pf0hpf_if instead of setting it to down after configuration#457

Merged
wminckler merged 14 commits into
NVIDIA:mainfrom
wminckler:minckler/disable-2nd-dpu
Mar 10, 2026
Merged

Remove config of pf0hpf_if instead of setting it to down after configuration#457
wminckler merged 14 commits into
NVIDIA:mainfrom
wminckler:minckler/disable-2nd-dpu

Conversation

@wminckler

@wminckler wminckler commented Mar 5, 2026

Copy link
Copy Markdown
Contributor

When applying a new nvue config, it sets the interface to up and for the non-primary DPUs, agent was going back and bringing the interface down. This leaves a small window where the host can get a dhcp response, leaking the admin IP address assigned to the host.

  • removes the interface and vpcs from the nvue config
    -- bridge was always include in the non-fnn case. this was removed if not needed
  • disables dhcp server (it fails without the vpc)
  • ignores the dhcp server health when it is disabled
  • updates tests

Tested both etv and fnn configs in the dev env.

Description

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

nvbug # 5931831

@wminckler wminckler requested a review from a team as a code owner March 5, 2026 16:00

@bcavnvidia bcavnvidia left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic around port_configs is a little funny and probably needs clean up. Could be now or could be later. No strong preference from me. 😄

@wminckler wminckler changed the title DRAFT: Remove config of pf0hpf_if instead of setting it to down after configuration Remove config of pf0hpf_if instead of setting it to down after configuration Mar 7, 2026
@wminckler wminckler force-pushed the minckler/disable-2nd-dpu branch from 3ad59a4 to 8c7a078 Compare March 7, 2026 00:15
@wminckler wminckler enabled auto-merge (squash) March 7, 2026 10:27
Comment thread crates/agent/templates/nvue_startup_fnn.conf Outdated

@abvarshney-nv abvarshney-nv left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked from dhcp-server pov, lgtm.

@wminckler wminckler disabled auto-merge March 9, 2026 14:39
@wminckler wminckler force-pushed the minckler/disable-2nd-dpu branch from 65a898a to d181762 Compare March 9, 2026 14:54
// Note that the nvue config handles the blocking of traffic on the interface. This is only so that the host link reflects the correct state.
if let Err(err) = ethernet_virtualization::update_interface_state(&conf).await {
tracing::error!(error = format!("{err:#}"), "Updating interface state.");
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brian showed me that if we down the pf0hpf interface (in the dpu os itself) it will be reflected by the host, so instead of removing all of it, I removed the hbn stuff and targeted the new interface.

Also, I removed the cached state. I didn't feel like running ip link show was a big deal now that its not in the container and its way easier to test and the "is it down already" check is always right.

@wminckler wminckler requested a review from bcavnvidia March 9, 2026 19:32
@wminckler wminckler force-pushed the minckler/disable-2nd-dpu branch from 156a575 to 04538bf Compare March 9, 2026 20:27

@bcavnvidia bcavnvidia left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes in this crate are always a little 😅 , but these changes look ok.

@wminckler wminckler merged commit 443d318 into NVIDIA:main Mar 10, 2026
36 checks passed
yoks pushed a commit to yoks/bare-metal-manager-core that referenced this pull request Mar 12, 2026
…uration (NVIDIA#457)

When applying a new nvue config, it sets the interface to up and for the
non-primary DPUs, agent was going back and bringing the interface down.
This leaves a small window where the host can get a dhcp response,
leaking the admin IP address assigned to the host.

- removes the interface and vpcs from the nvue config
-- bridge was always include in the non-fnn case. this was removed if
not needed
- disables dhcp server (it fails without the vpc)
- ignores the dhcp server health when it is disabled
- instead of bringing down the interface in the container, it brings down the interface on the dpu os, which is reflected in the host. (this is not required, but is a nice to have)
- updates tests

Tested both etv and fnn configs in the dev env.

## Description
<!-- Describe what this PR does -->

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality  
- [X] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [X] Unit tests added/updated
- [ ] Integration tests added/updated  
- [X] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->
nvbug # 5931831
jd-nv pushed a commit that referenced this pull request Mar 12, 2026
…uration (#457)

When applying a new nvue config, it sets the interface to up and for the
non-primary DPUs, agent was going back and bringing the interface down.
This leaves a small window where the host can get a dhcp response,
leaking the admin IP address assigned to the host.

- removes the interface and vpcs from the nvue config
-- bridge was always include in the non-fnn case. this was removed if
not needed
- disables dhcp server (it fails without the vpc)
- ignores the dhcp server health when it is disabled
- instead of bringing down the interface in the container, it brings down the interface on the dpu os, which is reflected in the host. (this is not required, but is a nice to have)
- updates tests

Tested both etv and fnn configs in the dev env.

## Description
<!-- Describe what this PR does -->

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality  
- [X] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [X] Unit tests added/updated
- [ ] Integration tests added/updated  
- [X] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->
nvbug # 5931831
nvcoop pushed a commit that referenced this pull request Mar 12, 2026
…EFIX_LIST (#538)

## Description
- Rule 65535 in the DPU_TO_EVPN_DROP_PREFIX_LIST prefix-list was using
route-map object syntax (action: {deny: {}}) instead of the string
syntax required by prefix-list rules (action: deny), causing NVUE schema
validation to reject the entire network config with the error: Config
invalid at
router.policy.prefix-list.DPU_TO_EVPN_DROP_PREFIX_LIST.rule.65535.action:
{'deny': {}} is not of type 'string'
- This rule was added in #457 to handle the case where no VPCs are
configured (making rule 10 conditional), but was copy-pasted from the
adjacent route-map section which uses a different schema for action
- Fix updates the template and 4 test expected files

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality  
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [ ] Integration tests added/updated  
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Laura Gray <lgray@nvidia.com>
tmcroberts97 pushed a commit to tmcroberts97/infra-controller-core that referenced this pull request Mar 12, 2026
…EFIX_LIST (NVIDIA#538)

## Description
- Rule 65535 in the DPU_TO_EVPN_DROP_PREFIX_LIST prefix-list was using
route-map object syntax (action: {deny: {}}) instead of the string
syntax required by prefix-list rules (action: deny), causing NVUE schema
validation to reject the entire network config with the error: Config
invalid at
router.policy.prefix-list.DPU_TO_EVPN_DROP_PREFIX_LIST.rule.65535.action:
{'deny': {}} is not of type 'string'
- This rule was added in NVIDIA#457 to handle the case where no VPCs are
configured (making rule 10 conditional), but was copy-pasted from the
adjacent route-map section which uses a different schema for action
- Fix updates the template and 4 test expected files

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality  
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [ ] Integration tests added/updated  
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Laura Gray <lgray@nvidia.com>
tmcroberts97 pushed a commit to tmcroberts97/infra-controller-core that referenced this pull request Mar 12, 2026
…EFIX_LIST (NVIDIA#538)

## Description
- Rule 65535 in the DPU_TO_EVPN_DROP_PREFIX_LIST prefix-list was using
route-map object syntax (action: {deny: {}}) instead of the string
syntax required by prefix-list rules (action: deny), causing NVUE schema
validation to reject the entire network config with the error: Config
invalid at
router.policy.prefix-list.DPU_TO_EVPN_DROP_PREFIX_LIST.rule.65535.action:
{'deny': {}} is not of type 'string'
- This rule was added in NVIDIA#457 to handle the case where no VPCs are
configured (making rule 10 conditional), but was copy-pasted from the
adjacent route-map section which uses a different schema for action
- Fix updates the template and 4 test expected files

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Laura Gray <lgray@nvidia.com>
Signed-off-by: Thomas McRoberts <tmcroberts@nvidia.com>
jd-nv pushed a commit that referenced this pull request Mar 12, 2026
…EFIX_LIST (#538)

## Description
- Rule 65535 in the DPU_TO_EVPN_DROP_PREFIX_LIST prefix-list was using
route-map object syntax (action: {deny: {}}) instead of the string
syntax required by prefix-list rules (action: deny), causing NVUE schema
validation to reject the entire network config with the error: Config
invalid at
router.policy.prefix-list.DPU_TO_EVPN_DROP_PREFIX_LIST.rule.65535.action:
{'deny': {}} is not of type 'string'
- This rule was added in #457 to handle the case where no VPCs are
configured (making rule 10 conditional), but was copy-pasted from the
adjacent route-map section which uses a different schema for action
- Fix updates the template and 4 test expected files

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Laura Gray <lgray@nvidia.com>
Signed-off-by: Thomas McRoberts <tmcroberts@nvidia.com>
nvcoop pushed a commit to nvcoop/bare-metal-manager-core that referenced this pull request Mar 12, 2026
…EFIX_LIST (NVIDIA#538)

## Description
- Rule 65535 in the DPU_TO_EVPN_DROP_PREFIX_LIST prefix-list was using
route-map object syntax (action: {deny: {}}) instead of the string
syntax required by prefix-list rules (action: deny), causing NVUE schema
validation to reject the entire network config with the error: Config
invalid at
router.policy.prefix-list.DPU_TO_EVPN_DROP_PREFIX_LIST.rule.65535.action:
{'deny': {}} is not of type 'string'
- This rule was added in NVIDIA#457 to handle the case where no VPCs are
configured (making rule 10 conditional), but was copy-pasted from the
adjacent route-map section which uses a different schema for action
- Fix updates the template and 4 test expected files

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality  
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [ ] Integration tests added/updated  
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Laura Gray <lgray@nvidia.com>
ajf pushed a commit that referenced this pull request Mar 12, 2026
## Description

#457 introduced a
new policy rule missing a match-block.

This PR adds that missing block.

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality  
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [ ] Integration tests added/updated  
- [x] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->
jd-nv pushed a commit that referenced this pull request Mar 12, 2026
## Description

#457 introduced a
new policy rule missing a match-block.

This PR adds that missing block.

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality  
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [ ] Integration tests added/updated  
- [x] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->
ajf added a commit to ajf/infra-controller that referenced this pull request Mar 13, 2026
ajf added a commit that referenced this pull request Mar 13, 2026
ajf added a commit to ajf/infra-controller that referenced this pull request Mar 19, 2026
kfelternv added a commit to kfelternv/infra-controller that referenced this pull request May 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants