
fix: disable nvidia-fabricmanager on single-GPU VMs with MIG#8049

Open
ganeshkumarashok wants to merge 4 commits into main from aganeshkumar/fix-fabric-manager-mig

Conversation

@ganeshkumarashok
Contributor

What this PR does / why we need it:

Fixes the Test_Ubuntu2404_NvidiaDevicePluginRunning_MIG e2e test failure caused by the nvidia-fabricmanager service entering a failed state on single-GPU systems.

The NVIDIA driver installation automatically installs and enables the nvidia-fabricmanager systemd service. On single-GPU VM sizes like Standard_NC24ads_A100_v4, fabric manager is not needed and will fail to start. The e2e test's ValidateNoFailedSystemdUnits() check then fails when it detects the service in a failed state.

Root cause:

  • Fabric manager is only required for multi-GPU systems (e.g., ND96asr_v4 with 8 A100 GPUs) where it manages NVLink connections between GPUs
  • Single-GPU systems with MIG partitions (e.g., NC24ads_A100_v4 with 1 A100 GPU) do not need fabric manager
  • The FabricManagerGPUSizes map correctly sets standard_nc24ads_a100_v4: false, but the CSE script didn't handle disabling the service when not needed

The fix:

  • Explicitly disable and stop the nvidia-fabricmanager service when GPU_NEEDS_FABRIC_MANAGER=false
  • Prevents the service from entering a failed state on single-GPU systems
  • Multi-GPU systems continue to have fabric manager enabled and started as before
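
The fix described above can be sketched as a small gate in the CSE script. This is illustrative only: the function name configure_fabric_manager is hypothetical, and it assumes GPU_NEEDS_FABRIC_MANAGER arrives as the literal string "true" or "false" from the FabricManagerGPUSizes lookup.

```shell
# Hypothetical sketch of the gating decision, not the PR's exact code.
# Assumes GPU_NEEDS_FABRIC_MANAGER is the string "true" or "false".
configure_fabric_manager() {
    if [ "${GPU_NEEDS_FABRIC_MANAGER}" = "true" ]; then
        # Multi-GPU SKUs (e.g., ND96asr_v4): fabric manager handles NVLink.
        echo "enable-and-start nvidia-fabricmanager"
    else
        # Single-GPU SKUs (e.g., NC24ads_A100_v4), including MIG: not needed.
        echo "disable-and-stop nvidia-fabricmanager"
    fi
}
```

On a Standard_NC24ads_A100_v4 node the map lookup yields false, so the disable branch runs and the unit never gets a chance to enter a failed state.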

Which issue(s) this PR fixes:

Fixes #

Testing:

  • Regenerated all test snapshots with make generate
  • The fix ensures single-GPU VMs don't have a failed fabricmanager service

Contributor

Copilot AI left a comment


Pull request overview

This PR adjusts Linux CSE GPU driver configuration to prevent nvidia-fabricmanager from causing e2e failures on single-GPU MIG VM sizes by disabling the service when GPU_NEEDS_FABRIC_MANAGER=false, and regenerates snapshot testdata accordingly.

Changes:

  • Disable/stop nvidia-fabricmanager when fabric manager isn’t required.
  • Regenerate pkg/agent/testdata/** CustomData snapshots (make generate) to reflect the new CSE output.

Reviewed changes

Copilot reviewed 66 out of 72 changed files in this pull request and generated 1 comment.

Changed files:
  • parts/linux/cloud-init/artifacts/cse_main.sh: Adds logic to disable/stop nvidia-fabricmanager when not needed.
  • pkg/agent/testdata/MarinerV2+Kata/CustomData: Regenerated snapshot CustomData payload.
  • pkg/agent/testdata/MarinerV2+CustomCloud/CustomData: Regenerated snapshot CustomData payload.
  • pkg/agent/testdata/Flatcar/CustomData.inner: Regenerated snapshot CustomData payload.
  • pkg/agent/testdata/Flatcar+CustomCloud/CustomData.inner: Regenerated snapshot CustomData payload.
  • pkg/agent/testdata/CustomizedImageLinuxGuard/CustomData: Regenerated snapshot CustomData payload.
  • pkg/agent/testdata/CustomizedImageKata/CustomData: Regenerated snapshot CustomData payload.
  • pkg/agent/testdata/CustomizedImage/CustomData: Regenerated snapshot CustomData payload.
  • pkg/agent/testdata/AzureLinuxV3+Kata/CustomData: Regenerated snapshot CustomData payload.
  • pkg/agent/testdata/AzureLinuxV2+Kata/CustomData: Regenerated snapshot CustomData payload.
  • pkg/agent/testdata/ACL/CustomData.inner: Regenerated snapshot CustomData payload.
  • pkg/agent/testdata/ACL+CustomCloud/CustomData.inner: Regenerated snapshot CustomData payload.


Comment on lines +384 to +385
# but it will fail on single-GPU systems, so we explicitly disable it
systemctlDisableAndStop nvidia-fabricmanager || true

Copilot AI Mar 10, 2026


systemctlDisableAndStop only runs if the unit shows up in systemctl list-units (loaded units). If the driver install merely enables the unit (or it failed earlier and is now inactive), this may not actually disable it, and it also won’t clear an already-failed state—ValidateNoFailedSystemdUnits() uses systemctl list-units --failed, so a previously failed fabricmanager can still trip the e2e even after stop/disable. Consider disabling based on list-unit-files (unit file existence) and resetting the failed state (e.g., systemctl reset-failed) after stopping/disabling so the unit no longer appears as failed.

Suggested change
-# but it will fail on single-GPU systems, so we explicitly disable it
-systemctlDisableAndStop nvidia-fabricmanager || true
+# but it will fail on single-GPU systems, so we explicitly disable it.
+# Use list-unit-files so we catch units that are merely enabled or previously failed,
+# and reset-failed so they no longer show up in `systemctl list-units --failed`.
+if systemctl list-unit-files | grep -q '^nvidia-fabricmanager\.service'; then
+    systemctlDisableAndStop nvidia-fabricmanager || true
+    systemctl reset-failed nvidia-fabricmanager || true
+fi

@ganeshkumarashok
Contributor Author

Updated based on review feedback. The previous approach using systemctlDisableAndStop had issues:

  1. Only checks loaded units: systemctl list-units only shows loaded units, missing unit files that exist but aren't loaded
  2. Doesn't clear failed state: If the service already failed, stop/disable won't clear it from systemctl list-units --failed

New approach:

  • Uses systemctl list-unit-files to detect the unit file based on existence, not load state
  • Explicitly calls systemctl reset-failed to clear any failed state
  • Ensures the service won't appear in the e2e validator's systemctl list-units --failed check

This handles all edge cases:

  • ✅ Unit file exists but not loaded yet
  • ✅ Unit failed and is now inactive
  • ✅ Unit is in failed state but unloaded
  • ✅ Previously failed state is cleared
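
Taken together, the checklist above suggests a shape along these lines. This is a sketch, not the exact commit: the SYSTEMCTL indirection is added here purely so the logic can be exercised without a running systemd.

```shell
# Sketch of the detect-then-clear pattern. "${SYSTEMCTL:-systemctl}" is an
# assumption for testability; the real script calls systemctl directly.
disable_fabric_manager_if_present() {
    sc="${SYSTEMCTL:-systemctl}"
    unit="nvidia-fabricmanager"
    # Check unit *file* existence, not just loaded units.
    if "$sc" list-unit-files 2>/dev/null | grep -q "^${unit}\.service"; then
        "$sc" stop "$unit" 2>/dev/null || true
        "$sc" disable "$unit" 2>/dev/null || true
        # Clear any lingering failed state so the unit no longer appears in
        # 'systemctl list-units --failed', which the e2e validator inspects.
        "$sc" reset-failed "$unit" 2>/dev/null || true
    fi
}
```

If the unit file is absent entirely, the guard short-circuits and no stop/disable/reset calls are made.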

Commit: 43d5ca0

Copilot AI review requested due to automatic review settings March 10, 2026 00:22
ganeshkumarashok force-pushed the aganeshkumar/fix-fabric-manager-mig branch from 43d5ca0 to ee46f7b on March 10, 2026 at 00:22
@ganeshkumarashok
Contributor Author

Rebased on latest main and resolved merge conflicts.

Changes:

  • Rebased onto origin/main (was at 8dbbf475b3, now at cb478da79e)
  • Resolved merge conflicts in 6 generated test data files (ACL and Flatcar variants)
  • Regenerated all test snapshots to ensure consistency

The branch is now up-to-date with main and ready for review.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 66 out of 72 changed files in this pull request and generated 3 comments.



Comment on lines +387 to +388
systemctl stop nvidia-fabricmanager 2>/dev/null || true
systemctl disable nvidia-fabricmanager 2>/dev/null || true

Copilot AI Mar 10, 2026


Consider using the existing systemctl helper wrappers (systemctl_stop/systemctl_disable) with timeouts/retries instead of raw systemctl stop/disable here. That keeps behavior consistent with other service operations in CSE and avoids potential hangs on stop/disable when systemd is slow/unresponsive.

Suggested change
-systemctl stop nvidia-fabricmanager 2>/dev/null || true
-systemctl disable nvidia-fabricmanager 2>/dev/null || true
+systemctl_stop nvidia-fabricmanager || true
+systemctl_disable nvidia-fabricmanager || true

Comment on lines +385 to +390
# Check if the unit file exists (not just if it's loaded) and disable/stop it
if systemctl list-unit-files | grep -q "nvidia-fabricmanager.service"; then
systemctl stop nvidia-fabricmanager 2>/dev/null || true
systemctl disable nvidia-fabricmanager 2>/dev/null || true
# Reset any failed state so it doesn't show up in 'systemctl list-units --failed'
systemctl reset-failed nvidia-fabricmanager 2>/dev/null || true

Copilot AI Mar 10, 2026


GPU_NEEDS_FABRIC_MANAGER is derived from a map lookup and will be "false" for any GPU SKU not explicitly listed. With this change, those unknown SKUs will now have fabric manager forcibly disabled even if it would otherwise start successfully. To reduce risk, consider disabling only when the unit is in a failed state (e.g., systemctl is-failed) or ensuring the FabricManagerGPUSizes allowlist is comprehensive for all multi-GPU SKUs.

Suggested change
-# Check if the unit file exists (not just if it's loaded) and disable/stop it
-if systemctl list-unit-files | grep -q "nvidia-fabricmanager.service"; then
-    systemctl stop nvidia-fabricmanager 2>/dev/null || true
-    systemctl disable nvidia-fabricmanager 2>/dev/null || true
-    # Reset any failed state so it doesn't show up in 'systemctl list-units --failed'
-    systemctl reset-failed nvidia-fabricmanager 2>/dev/null || true
+# Only disable/stop it if the unit exists and is currently in a failed state
+if systemctl list-unit-files | grep -q "nvidia-fabricmanager.service"; then
+    if systemctl is-failed --quiet nvidia-fabricmanager; then
+        systemctl stop nvidia-fabricmanager 2>/dev/null || true
+        systemctl disable nvidia-fabricmanager 2>/dev/null || true
+        # Reset any failed state so it doesn't show up in 'systemctl list-units --failed'
+        systemctl reset-failed nvidia-fabricmanager 2>/dev/null || true
+    fi

# The NVIDIA driver installation may automatically enable this service,
# but it will fail on single-GPU systems, so we explicitly disable it
# Check if the unit file exists (not just if it's loaded) and disable/stop it
if systemctl list-unit-files | grep -q "nvidia-fabricmanager.service"; then

Copilot AI Mar 10, 2026


systemctl list-unit-files | grep -q "nvidia-fabricmanager.service" scans all unit files. Using systemctl list-unit-files --no-pager nvidia-fabricmanager.service (and checking the exit status) is more direct and avoids parsing/grep pitfalls (headers, partial matches).

Suggested change
-if systemctl list-unit-files | grep -q "nvidia-fabricmanager.service"; then
+if systemctl list-unit-files --no-pager nvidia-fabricmanager.service >/dev/null 2>&1; then

@ganeshkumarashok
Contributor Author

Copilot Review Response

Addressed copilot feedback in commit 6b70510:

✅ Implemented:

Comment 2: Use systemctl helper wrappers

  • Changed to systemctl_stop and systemctl_disable with retry/timeout parameters (20 retries, 5s sleep, 25s timeout)
  • Ensures consistent behavior with other CSE service operations

Comment 4: Use more direct systemctl command

  • Changed from systemctl list-unit-files | grep -q "nvidia-fabricmanager.service"
  • To systemctl list-unit-files --no-pager nvidia-fabricmanager.service >/dev/null 2>&1
  • Avoids grep parsing issues (headers, partial matches)

❌ Not Implemented:

Comment 3: Only disable if in failed state

  • Suggestion: Check systemctl is-failed before disabling
  • Why not: This doesn't solve the problem. The unit may not be failed yet when CSE runs. On single-GPU systems:
    1. NVIDIA driver installation enables the service
    2. CSE continues (service hasn't started/failed yet)
    3. Later, systemd tries to start the service → it fails
    4. E2e validator sees the failed service and test fails

We need to proactively disable the service on single-GPU systems (when GPU_NEEDS_FABRIC_MANAGER=false) to prevent it from ever starting and failing.

Regarding unknown GPU SKUs:

Comment 3 raised concern about unknown GPU SKUs. The current behavior is correct:

  • Unknown GPU SKUs → GPU_NEEDS_FABRIC_MANAGER=false (map lookup returns false for missing keys)
  • These are likely single-GPU or non-fabric-manager SKUs
  • If a new multi-GPU SKU requires fabric manager, it should be explicitly added to FabricManagerGPUSizes map in pkg/agent/datamodel/gpu_components.go

This is safer than the alternative (unknown SKUs → fabric manager enabled → might fail).
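
The lookup semantics can be mimicked in shell (the real map is Go, FabricManagerGPUSizes in pkg/agent/datamodel/gpu_components.go; the entries below are illustrative examples, not the full list):

```shell
# Mimics the Go map's zero-value behavior: any SKU not explicitly listed
# falls through to "false". SKU names are lowercased first, matching the
# standard_nc24ads_a100_v4-style keys. Entries here are illustrative only.
needs_fabric_manager() {
    sku=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$sku" in
        standard_nd96asr_v4) echo "true" ;;        # 8x A100 with NVLink
        standard_nc24ads_a100_v4) echo "false" ;;  # single A100, MIG-capable
        *) echo "false" ;;                         # unknown SKU: default false
    esac
}
```

An unknown multi-GPU SKU would therefore get "false" until it is added to the map, which is the fail-safe direction: fabric manager stays off rather than risking a failed unit.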

The NVIDIA driver installation automatically installs and enables the
nvidia-fabricmanager systemd service. On single-GPU VM sizes like
Standard_NC24ads_A100_v4, fabric manager is not needed and will fail
to start, causing e2e tests to fail when checking for failed systemd
units.

Fabric manager is only required for multi-GPU systems (e.g., ND96asr_v4
with 8 GPUs) where it manages NVLink connections between GPUs. Single-GPU
systems with MIG partitions do not need fabric manager.

This fix explicitly disables and stops the nvidia-fabricmanager service
when GPU_NEEDS_FABRIC_MANAGER=false, preventing it from entering a
failed state on single-GPU systems.

Fixes Test_Ubuntu2404_NvidiaDevicePluginRunning_MIG e2e test failure.

Address review feedback: systemctlDisableAndStop only works with loaded
units and doesn't clear failed states. This could miss:
- Unit files that exist but aren't loaded yet
- Units that failed and are now inactive
- Failed states that persist after stop/disable

Changes:
- Check unit file existence with 'systemctl list-unit-files' instead of
  relying on systemctlDisableAndStop which only checks loaded units
- Explicitly call 'systemctl reset-failed' to clear any failed state
- This ensures the unit won't show up in 'systemctl list-units --failed'
  which is what the e2e validator checks

Co-authored-by: reviewer

Addressed copilot review comments:

1. Use systemctl_stop/systemctl_disable helper wrappers (20 retries, 5s
   sleep, 25s timeout) instead of raw systemctl commands for consistent
   retry/timeout behavior across CSE operations

2. Use 'systemctl list-unit-files --no-pager <unit>' instead of
   'systemctl list-unit-files | grep' for more direct unit file check
   that avoids grep parsing pitfalls (headers, partial matches)

Not implemented:
- Comment 3 (only disable if failed): This doesn't solve the problem
  because the unit may not be failed YET when CSE runs. We need to
  proactively disable it on single-GPU systems to prevent future failures.

ganeshkumarashok force-pushed the aganeshkumar/fix-fabric-manager-mig branch from 6b70510 to c061120 on March 10, 2026 at 20:30
Copilot AI review requested due to automatic review settings March 10, 2026 20:30
@ganeshkumarashok
Contributor Author

Rebased on latest main (23cc72a). Conflicts resolved and test snapshots regenerated. Branch is now up-to-date and mergeable.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 66 out of 72 changed files in this pull request and generated 1 comment.



Comment on lines +385 to +392
# Check if the unit file exists using a more direct approach
if systemctl list-unit-files --no-pager nvidia-fabricmanager.service >/dev/null 2>&1; then
# Use systemctl helper wrappers for consistent retry/timeout behavior
systemctl_stop 20 5 25 nvidia-fabricmanager || true
systemctl_disable 20 5 25 nvidia-fabricmanager || true
# Reset any failed state so it doesn't show up in 'systemctl list-units --failed'
systemctl reset-failed nvidia-fabricmanager 2>/dev/null || true
fi

Copilot AI Mar 10, 2026


The systemctl list-unit-files ... >/dev/null guard may not reliably indicate that the unit exists (the command can exit 0 even when no unit files match), which can cause systemctl_stop/systemctl_disable to run and potentially spend a long time retrying on nodes where nvidia-fabricmanager isn’t installed. Consider making the guard check for an actual match in the command output (e.g., using --no-legend and grep -q for the unit name) or using an existence check that fails when the unit file is absent, before invoking the retrying wrappers.

Addressed latest copilot feedback: systemctl list-unit-files may exit 0
even when no unit files match, which could cause unnecessary retries of
systemctl_stop/systemctl_disable on nodes without nvidia-fabricmanager.

Changed from:
  systemctl list-unit-files --no-pager <unit> >/dev/null 2>&1

To:
  systemctl list-unit-files --no-pager --no-legend <unit> | grep -q <unit>

This ensures:
- --no-legend removes header lines that could cause false positives
- grep -q explicitly checks for the unit name in output
- Only runs stop/disable wrappers when unit file actually exists
- Avoids spending time on retries for non-existent units
@ganeshkumarashok
Contributor Author

Latest Copilot Feedback Addressed

Issue: systemctl list-unit-files --no-pager <unit> >/dev/null 2>&1 may exit 0 even when no unit files match, causing unnecessary retries.

Fix (commit d316c73):

# Before: May not reliably detect if unit exists
if systemctl list-unit-files --no-pager nvidia-fabricmanager.service >/dev/null 2>&1; then

# After: Explicitly checks for unit in output
if systemctl list-unit-files --no-pager --no-legend nvidia-fabricmanager.service 2>/dev/null | grep -q "nvidia-fabricmanager.service"; then

Why this is better:

  • --no-legend removes header lines that could cause false positives
  • grep -q explicitly checks for the unit name in the command output
  • Only invokes systemctl_stop/systemctl_disable retry wrappers when unit file actually exists
  • Avoids wasting time on 20 retries when the unit doesn't exist

This combines the robustness of the grep check with the directness of querying a specific unit name.
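
A sketch of the final guard, with a stubbed systemctl so the exit-status behavior can be checked without systemd (the SYSTEMCTL indirection is ours, not part of the PR):

```shell
# Final guard shape: grep the --no-legend output for the unit name, so a
# zero exit status from list-unit-files alone cannot produce a false positive.
fabric_manager_unit_exists() {
    sc="${SYSTEMCTL:-systemctl}"
    "$sc" list-unit-files --no-pager --no-legend nvidia-fabricmanager.service 2>/dev/null \
        | grep -q "nvidia-fabricmanager.service"
}
```

The function succeeds only when the unit name actually appears in the output, so the 20-retry stop/disable wrappers are skipped entirely on nodes without the driver.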

