
fix: disable nvidia-fabricmanager on single-GPU VMs with MIG#8049

Open
ganeshkumarashok wants to merge 4 commits into main from aganeshkumar/fix-fabric-manager-mig

Conversation

@ganeshkumarashok
Contributor

What this PR does / why we need it:

Fixes the Test_Ubuntu2404_NvidiaDevicePluginRunning_MIG e2e test failure caused by the nvidia-fabricmanager service entering a failed state on single-GPU systems.

The NVIDIA driver installation automatically installs and enables the nvidia-fabricmanager systemd service. On single-GPU VM sizes like Standard_NC24ads_A100_v4, fabric manager is not needed and will fail to start. The e2e test's ValidateNoFailedSystemdUnits() check then fails when it detects the service in a failed state.

Root cause:

  • Fabric manager is only required for multi-GPU systems (e.g., ND96asr_v4 with 8 A100 GPUs) where it manages NVLink connections between GPUs
  • Single-GPU systems with MIG partitions (e.g., NC24ads_A100_v4 with 1 A100 GPU) do not need fabric manager
  • The FabricManagerGPUSizes map correctly sets standard_nc24ads_a100_v4: false, but the CSE script didn't handle disabling the service when not needed

The fix:

  • Explicitly disable and stop the nvidia-fabricmanager service when GPU_NEEDS_FABRIC_MANAGER=false
  • Prevents the service from entering a failed state on single-GPU systems
  • Multi-GPU systems continue to have fabric manager enabled and started as before
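
The fix described above can be sketched as a small gate in the CSE script. This is illustrative only: the function name configure_fabric_manager is hypothetical, and it assumes GPU_NEEDS_FABRIC_MANAGER arrives as the literal string "true" or "false" from the FabricManagerGPUSizes lookup.

```shell
# Hypothetical sketch of the gating decision, not the PR's exact code.
# Assumes GPU_NEEDS_FABRIC_MANAGER is the string "true" or "false".
configure_fabric_manager() {
    if [ "${GPU_NEEDS_FABRIC_MANAGER}" = "true" ]; then
        # Multi-GPU SKUs (e.g., ND96asr_v4): fabric manager handles NVLink.
        echo "enable-and-start nvidia-fabricmanager"
    else
        # Single-GPU SKUs (e.g., NC24ads_A100_v4), including MIG: not needed.
        echo "disable-and-stop nvidia-fabricmanager"
    fi
}
```

On a Standard_NC24ads_A100_v4 node the map lookup yields false, so the disable branch runs and the unit never gets a chance to enter a failed state.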

Which issue(s) this PR fixes:

Fixes #

Testing:

  • Regenerated all test snapshots with make generate
  • The fix ensures single-GPU VMs don't have a failed fabricmanager service

Contributor

Copilot AI left a comment


Pull request overview

This PR adjusts Linux CSE GPU driver configuration to prevent nvidia-fabricmanager from causing e2e failures on single-GPU MIG VM sizes by disabling the service when GPU_NEEDS_FABRIC_MANAGER=false, and regenerates snapshot testdata accordingly.

Changes:

  • Disable/stop nvidia-fabricmanager when fabric manager isn’t required.
  • Regenerate pkg/agent/testdata/** CustomData snapshots (make generate) to reflect the new CSE output.

Reviewed changes

Copilot reviewed 66 out of 72 changed files in this pull request and generated 1 comment.

Changed files:
  • parts/linux/cloud-init/artifacts/cse_main.sh: Adds logic to disable/stop nvidia-fabricmanager when not needed.
  • pkg/agent/testdata/MarinerV2+Kata/CustomData: Regenerated snapshot CustomData payload.
  • pkg/agent/testdata/MarinerV2+CustomCloud/CustomData: Regenerated snapshot CustomData payload.
  • pkg/agent/testdata/Flatcar/CustomData.inner: Regenerated snapshot CustomData payload.
  • pkg/agent/testdata/Flatcar+CustomCloud/CustomData.inner: Regenerated snapshot CustomData payload.
  • pkg/agent/testdata/CustomizedImageLinuxGuard/CustomData: Regenerated snapshot CustomData payload.
  • pkg/agent/testdata/CustomizedImageKata/CustomData: Regenerated snapshot CustomData payload.
  • pkg/agent/testdata/CustomizedImage/CustomData: Regenerated snapshot CustomData payload.
  • pkg/agent/testdata/AzureLinuxV3+Kata/CustomData: Regenerated snapshot CustomData payload.
  • pkg/agent/testdata/AzureLinuxV2+Kata/CustomData: Regenerated snapshot CustomData payload.
  • pkg/agent/testdata/ACL/CustomData.inner: Regenerated snapshot CustomData payload.
  • pkg/agent/testdata/ACL+CustomCloud/CustomData.inner: Regenerated snapshot CustomData payload.


Comment on lines +384 to +385
# but it will fail on single-GPU systems, so we explicitly disable it
systemctlDisableAndStop nvidia-fabricmanager || true

Copilot AI Mar 10, 2026


systemctlDisableAndStop only runs if the unit shows up in systemctl list-units (loaded units). If the driver install merely enables the unit (or it failed earlier and is now inactive), this may not actually disable it, and it also won’t clear an already-failed state—ValidateNoFailedSystemdUnits() uses systemctl list-units --failed, so a previously failed fabricmanager can still trip the e2e even after stop/disable. Consider disabling based on list-unit-files (unit file existence) and resetting the failed state (e.g., systemctl reset-failed) after stopping/disabling so the unit no longer appears as failed.

Suggested change
-# but it will fail on single-GPU systems, so we explicitly disable it
-systemctlDisableAndStop nvidia-fabricmanager || true
+# but it will fail on single-GPU systems, so we explicitly disable it.
+# Use list-unit-files so we catch units that are merely enabled or previously failed,
+# and reset-failed so they no longer show up in `systemctl list-units --failed`.
+if systemctl list-unit-files | grep -q '^nvidia-fabricmanager\.service'; then
+    systemctlDisableAndStop nvidia-fabricmanager || true
+    systemctl reset-failed nvidia-fabricmanager || true
+fi

@ganeshkumarashok
Contributor Author

Updated based on review feedback. The previous approach using systemctlDisableAndStop had issues:

  1. Only checks loaded units: systemctl list-units only shows loaded units, missing unit files that exist but aren't loaded
  2. Doesn't clear failed state: If the service already failed, stop/disable won't clear it from systemctl list-units --failed

New approach:

  • Uses systemctl list-unit-files to detect the unit file based on existence, not load state
  • Explicitly calls systemctl reset-failed to clear any failed state
  • Ensures the service won't appear in the e2e validator's systemctl list-units --failed check

This handles all edge cases:

  • ✅ Unit file exists but not loaded yet
  • ✅ Unit failed and is now inactive
  • ✅ Unit is in failed state but unloaded
  • ✅ Previously failed state is cleared
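
Taken together, the checklist above suggests a shape along these lines. This is a sketch, not the exact commit: the SYSTEMCTL indirection is added here purely so the logic can be exercised without a running systemd.

```shell
# Sketch of the detect-then-clear pattern. "${SYSTEMCTL:-systemctl}" is an
# assumption for testability; the real script calls systemctl directly.
disable_fabric_manager_if_present() {
    sc="${SYSTEMCTL:-systemctl}"
    unit="nvidia-fabricmanager"
    # Check unit *file* existence, not just loaded units.
    if "$sc" list-unit-files 2>/dev/null | grep -q "^${unit}\.service"; then
        "$sc" stop "$unit" 2>/dev/null || true
        "$sc" disable "$unit" 2>/dev/null || true
        # Clear any lingering failed state so the unit no longer appears in
        # 'systemctl list-units --failed', which the e2e validator inspects.
        "$sc" reset-failed "$unit" 2>/dev/null || true
    fi
}
```

If the unit file is absent entirely, the guard short-circuits and no stop/disable/reset calls are made.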

Commit: 43d5ca0

Copilot AI review requested due to automatic review settings March 10, 2026 00:22
ganeshkumarashok force-pushed the aganeshkumar/fix-fabric-manager-mig branch from 43d5ca0 to ee46f7b on March 10, 2026 at 00:22
@ganeshkumarashok
Contributor Author

Rebased on latest main and resolved merge conflicts.

Changes:

  • Rebased onto origin/main (was at 8dbbf475b3, now at cb478da79e)
  • Resolved merge conflicts in 6 generated test data files (ACL and Flatcar variants)
  • Regenerated all test snapshots to ensure consistency

The branch is now up-to-date with main and ready for review.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 66 out of 72 changed files in this pull request and generated 3 comments.



Comment on lines +387 to +388
systemctl stop nvidia-fabricmanager 2>/dev/null || true
systemctl disable nvidia-fabricmanager 2>/dev/null || true

Copilot AI Mar 10, 2026


Consider using the existing systemctl helper wrappers (systemctl_stop/systemctl_disable) with timeouts/retries instead of raw systemctl stop/disable here. That keeps behavior consistent with other service operations in CSE and avoids potential hangs on stop/disable when systemd is slow/unresponsive.

Suggested change
-systemctl stop nvidia-fabricmanager 2>/dev/null || true
-systemctl disable nvidia-fabricmanager 2>/dev/null || true
+systemctl_stop nvidia-fabricmanager || true
+systemctl_disable nvidia-fabricmanager || true

Comment on lines +385 to +390
# Check if the unit file exists (not just if it's loaded) and disable/stop it
if systemctl list-unit-files | grep -q "nvidia-fabricmanager.service"; then
systemctl stop nvidia-fabricmanager 2>/dev/null || true
systemctl disable nvidia-fabricmanager 2>/dev/null || true
# Reset any failed state so it doesn't show up in 'systemctl list-units --failed'
systemctl reset-failed nvidia-fabricmanager 2>/dev/null || true

Copilot AI Mar 10, 2026


GPU_NEEDS_FABRIC_MANAGER is derived from a map lookup and will be "false" for any GPU SKU not explicitly listed. With this change, those unknown SKUs will now have fabric manager forcibly disabled even if it would otherwise start successfully. To reduce risk, consider disabling only when the unit is in a failed state (e.g., systemctl is-failed) or ensuring the FabricManagerGPUSizes allowlist is comprehensive for all multi-GPU SKUs.

Suggested change
-# Check if the unit file exists (not just if it's loaded) and disable/stop it
-if systemctl list-unit-files | grep -q "nvidia-fabricmanager.service"; then
-    systemctl stop nvidia-fabricmanager 2>/dev/null || true
-    systemctl disable nvidia-fabricmanager 2>/dev/null || true
-    # Reset any failed state so it doesn't show up in 'systemctl list-units --failed'
-    systemctl reset-failed nvidia-fabricmanager 2>/dev/null || true
+# Only disable/stop it if the unit exists and is currently in a failed state
+if systemctl list-unit-files | grep -q "nvidia-fabricmanager.service"; then
+    if systemctl is-failed --quiet nvidia-fabricmanager; then
+        systemctl stop nvidia-fabricmanager 2>/dev/null || true
+        systemctl disable nvidia-fabricmanager 2>/dev/null || true
+        # Reset any failed state so it doesn't show up in 'systemctl list-units --failed'
+        systemctl reset-failed nvidia-fabricmanager 2>/dev/null || true
+    fi

# The NVIDIA driver installation may automatically enable this service,
# but it will fail on single-GPU systems, so we explicitly disable it
# Check if the unit file exists (not just if it's loaded) and disable/stop it
if systemctl list-unit-files | grep -q "nvidia-fabricmanager.service"; then

Copilot AI Mar 10, 2026


systemctl list-unit-files | grep -q "nvidia-fabricmanager.service" scans all unit files. Using systemctl list-unit-files --no-pager nvidia-fabricmanager.service (and checking the exit status) is more direct and avoids parsing/grep pitfalls (headers, partial matches).

Suggested change
-if systemctl list-unit-files | grep -q "nvidia-fabricmanager.service"; then
+if systemctl list-unit-files --no-pager nvidia-fabricmanager.service >/dev/null 2>&1; then

@ganeshkumarashok
Contributor Author

Copilot Review Response

Addressed copilot feedback in commit 6b70510:

✅ Implemented:

Comment 2: Use systemctl helper wrappers

  • Changed to systemctl_stop and systemctl_disable with retry/timeout parameters (20 retries, 5s sleep, 25s timeout)
  • Ensures consistent behavior with other CSE service operations

Comment 4: Use more direct systemctl command

  • Changed from systemctl list-unit-files | grep -q "nvidia-fabricmanager.service"
  • To systemctl list-unit-files --no-pager nvidia-fabricmanager.service >/dev/null 2>&1
  • Avoids grep parsing issues (headers, partial matches)

❌ Not Implemented:

Comment 3: Only disable if in failed state

  • Suggestion: Check systemctl is-failed before disabling
  • Why not: This doesn't solve the problem. The unit may not be failed yet when CSE runs. On single-GPU systems:
    1. NVIDIA driver installation enables the service
    2. CSE continues (service hasn't started/failed yet)
    3. Later, systemd tries to start the service → it fails
    4. E2e validator sees the failed service and test fails

We need to proactively disable the service on single-GPU systems (when GPU_NEEDS_FABRIC_MANAGER=false) to prevent it from ever starting and failing.

Regarding unknown GPU SKUs:

Comment 3 raised concern about unknown GPU SKUs. The current behavior is correct:

  • Unknown GPU SKUs → GPU_NEEDS_FABRIC_MANAGER=false (map lookup returns false for missing keys)
  • These are likely single-GPU or non-fabric-manager SKUs
  • If a new multi-GPU SKU requires fabric manager, it should be explicitly added to FabricManagerGPUSizes map in pkg/agent/datamodel/gpu_components.go

This is safer than the alternative (unknown SKUs → fabric manager enabled → might fail).
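
The lookup semantics can be mimicked in shell (the real map is Go, FabricManagerGPUSizes in pkg/agent/datamodel/gpu_components.go; the entries below are illustrative examples, not the full list):

```shell
# Mimics the Go map's zero-value behavior: any SKU not explicitly listed
# falls through to "false". SKU names are lowercased first, matching the
# standard_nc24ads_a100_v4-style keys. Entries here are illustrative only.
needs_fabric_manager() {
    sku=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$sku" in
        standard_nd96asr_v4) echo "true" ;;        # 8x A100 with NVLink
        standard_nc24ads_a100_v4) echo "false" ;;  # single A100, MIG-capable
        *) echo "false" ;;                         # unknown SKU: default false
    esac
}
```

An unknown multi-GPU SKU would therefore get "false" until it is added to the map, which is the fail-safe direction: fabric manager stays off rather than risking a failed unit.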

The NVIDIA driver installation automatically installs and enables the
nvidia-fabricmanager systemd service. On single-GPU VM sizes like
Standard_NC24ads_A100_v4, fabric manager is not needed and will fail
to start, causing e2e tests to fail when checking for failed systemd
units.

Fabric manager is only required for multi-GPU systems (e.g., ND96asr_v4
with 8 GPUs) where it manages NVLink connections between GPUs. Single-GPU
systems with MIG partitions do not need fabric manager.

This fix explicitly disables and stops the nvidia-fabricmanager service
when GPU_NEEDS_FABRIC_MANAGER=false, preventing it from entering a
failed state on single-GPU systems.

Fixes Test_Ubuntu2404_NvidiaDevicePluginRunning_MIG e2e test failure.

Address review feedback: systemctlDisableAndStop only works with loaded
units and doesn't clear failed states. This could miss:
- Unit files that exist but aren't loaded yet
- Units that failed and are now inactive
- Failed states that persist after stop/disable

Changes:
- Check unit file existence with 'systemctl list-unit-files' instead of
  relying on systemctlDisableAndStop which only checks loaded units
- Explicitly call 'systemctl reset-failed' to clear any failed state
- This ensures the unit won't show up in 'systemctl list-units --failed'
  which is what the e2e validator checks

Co-authored-by: reviewer

Addressed copilot review comments:

1. Use systemctl_stop/systemctl_disable helper wrappers (20 retries, 5s
   sleep, 25s timeout) instead of raw systemctl commands for consistent
   retry/timeout behavior across CSE operations

2. Use 'systemctl list-unit-files --no-pager <unit>' instead of
   'systemctl list-unit-files | grep' for more direct unit file check
   that avoids grep parsing pitfalls (headers, partial matches)

Not implemented:
- Comment 3 (only disable if failed): This doesn't solve the problem
  because the unit may not be failed YET when CSE runs. We need to
  proactively disable it on single-GPU systems to prevent future failures.

ganeshkumarashok force-pushed the aganeshkumar/fix-fabric-manager-mig branch from 6b70510 to c061120 on March 10, 2026 at 20:30
Copilot AI review requested due to automatic review settings March 10, 2026 20:30
@ganeshkumarashok
Contributor Author

Rebased on latest main (23cc72a). Conflicts resolved and test snapshots regenerated. Branch is now up-to-date and mergeable.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 66 out of 72 changed files in this pull request and generated 1 comment.



Comment on lines +385 to +392
# Check if the unit file exists using a more direct approach
if systemctl list-unit-files --no-pager nvidia-fabricmanager.service >/dev/null 2>&1; then
# Use systemctl helper wrappers for consistent retry/timeout behavior
systemctl_stop 20 5 25 nvidia-fabricmanager || true
systemctl_disable 20 5 25 nvidia-fabricmanager || true
# Reset any failed state so it doesn't show up in 'systemctl list-units --failed'
systemctl reset-failed nvidia-fabricmanager 2>/dev/null || true
fi

Copilot AI Mar 10, 2026


The systemctl list-unit-files ... >/dev/null guard may not reliably indicate that the unit exists (the command can exit 0 even when no unit files match), which can cause systemctl_stop/systemctl_disable to run and potentially spend a long time retrying on nodes where nvidia-fabricmanager isn’t installed. Consider making the guard check for an actual match in the command output (e.g., using --no-legend and grep -q for the unit name) or using an existence check that fails when the unit file is absent, before invoking the retrying wrappers.

Addressed latest copilot feedback: systemctl list-unit-files may exit 0
even when no unit files match, which could cause unnecessary retries of
systemctl_stop/systemctl_disable on nodes without nvidia-fabricmanager.

Changed from:
  systemctl list-unit-files --no-pager <unit> >/dev/null 2>&1

To:
  systemctl list-unit-files --no-pager --no-legend <unit> | grep -q <unit>

This ensures:
- --no-legend removes header lines that could cause false positives
- grep -q explicitly checks for the unit name in output
- Only runs stop/disable wrappers when unit file actually exists
- Avoids spending time on retries for non-existent units
@ganeshkumarashok
Contributor Author

Latest Copilot Feedback Addressed

Issue: systemctl list-unit-files --no-pager <unit> >/dev/null 2>&1 may exit 0 even when no unit files match, causing unnecessary retries.

Fix (commit d316c73):

# Before: May not reliably detect if unit exists
if systemctl list-unit-files --no-pager nvidia-fabricmanager.service >/dev/null 2>&1; then

# After: Explicitly checks for unit in output
if systemctl list-unit-files --no-pager --no-legend nvidia-fabricmanager.service 2>/dev/null | grep -q "nvidia-fabricmanager.service"; then

Why this is better:

  • --no-legend removes header lines that could cause false positives
  • grep -q explicitly checks for the unit name in the command output
  • Only invokes systemctl_stop/systemctl_disable retry wrappers when unit file actually exists
  • Avoids wasting time on 20 retries when the unit doesn't exist

This combines the robustness of the grep check with the directness of querying a specific unit name.
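
A sketch of the final guard, with a stubbed systemctl so the exit-status behavior can be checked without systemd (the SYSTEMCTL indirection is ours, not part of the PR):

```shell
# Final guard shape: grep the --no-legend output for the unit name, so a
# zero exit status from list-unit-files alone cannot produce a false positive.
fabric_manager_unit_exists() {
    sc="${SYSTEMCTL:-systemctl}"
    "$sc" list-unit-files --no-pager --no-legend nvidia-fabricmanager.service 2>/dev/null \
        | grep -q "nvidia-fabricmanager.service"
}
```

The function succeeds only when the unit name actually appears in the output, so the 20-retry stop/disable wrappers are skipped entirely on nodes without the driver.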

