fix: disable nvidia-fabricmanager on single-GPU VMs with MIG #8049
ganeshkumarashok wants to merge 4 commits into main from
Conversation
Pull request overview
This PR adjusts Linux CSE GPU driver configuration to prevent nvidia-fabricmanager from causing e2e failures on single-GPU MIG VM sizes by disabling the service when GPU_NEEDS_FABRIC_MANAGER=false, and regenerates snapshot testdata accordingly.
Changes:
- Disable/stop nvidia-fabricmanager when fabric manager isn't required.
- Regenerate pkg/agent/testdata/**CustomData snapshots (make generate) to reflect the new CSE output.
Reviewed changes
Copilot reviewed 66 out of 72 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| parts/linux/cloud-init/artifacts/cse_main.sh | Adds logic to disable/stop nvidia-fabricmanager when not needed. |
| pkg/agent/testdata/MarinerV2+Kata/CustomData | Regenerated snapshot CustomData payload. |
| pkg/agent/testdata/MarinerV2+CustomCloud/CustomData | Regenerated snapshot CustomData payload. |
| pkg/agent/testdata/Flatcar/CustomData.inner | Regenerated snapshot CustomData payload. |
| pkg/agent/testdata/Flatcar+CustomCloud/CustomData.inner | Regenerated snapshot CustomData payload. |
| pkg/agent/testdata/CustomizedImageLinuxGuard/CustomData | Regenerated snapshot CustomData payload. |
| pkg/agent/testdata/CustomizedImageKata/CustomData | Regenerated snapshot CustomData payload. |
| pkg/agent/testdata/CustomizedImage/CustomData | Regenerated snapshot CustomData payload. |
| pkg/agent/testdata/AzureLinuxV3+Kata/CustomData | Regenerated snapshot CustomData payload. |
| pkg/agent/testdata/AzureLinuxV2+Kata/CustomData | Regenerated snapshot CustomData payload. |
| pkg/agent/testdata/ACL/CustomData.inner | Regenerated snapshot CustomData payload. |
| pkg/agent/testdata/ACL+CustomCloud/CustomData.inner | Regenerated snapshot CustomData payload. |
```shell
# but it will fail on single-GPU systems, so we explicitly disable it
systemctlDisableAndStop nvidia-fabricmanager || true
```
systemctlDisableAndStop only runs if the unit shows up in systemctl list-units (loaded units). If the driver install merely enables the unit (or it failed earlier and is now inactive), this may not actually disable it, and it also won’t clear an already-failed state—ValidateNoFailedSystemdUnits() uses systemctl list-units --failed, so a previously failed fabricmanager can still trip the e2e even after stop/disable. Consider disabling based on list-unit-files (unit file existence) and resetting the failed state (e.g., systemctl reset-failed) after stopping/disabling so the unit no longer appears as failed.
Suggested change:

```diff
-# but it will fail on single-GPU systems, so we explicitly disable it
-systemctlDisableAndStop nvidia-fabricmanager || true
+# but it will fail on single-GPU systems, so we explicitly disable it.
+# Use list-unit-files so we catch units that are merely enabled or previously failed,
+# and reset-failed so they no longer show up in `systemctl list-units --failed`.
+if systemctl list-unit-files | grep -q '^nvidia-fabricmanager\.service'; then
+    systemctlDisableAndStop nvidia-fabricmanager || true
+    systemctl reset-failed nvidia-fabricmanager || true
+fi
```
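A quick illustration (not from the PR; the listing below is a made-up sample of `systemctl list-unit-files` output) of why anchoring the grep pattern to the start of the line matters: an unanchored pattern can also match unrelated unit names that merely contain the target name.

```shell
#!/usr/bin/env bash
# Hypothetical list-unit-files output; "dev-nvidia-fabricmanager.service"
# is an invented unit name used only to show the partial-match pitfall.
sample_output='UNIT FILE                           STATE    PRESET
nvidia-fabricmanager.service        enabled  enabled
dev-nvidia-fabricmanager.service    enabled  enabled'

# Unanchored: matches any line containing the name, including the decoy.
echo "$sample_output" | grep -c 'nvidia-fabricmanager\.service'    # -> 2

# Anchored: matches only the exact unit at the start of the line.
echo "$sample_output" | grep -c '^nvidia-fabricmanager\.service'   # -> 1
```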
Updated based on review feedback. The previous approach using systemctlDisableAndStop only covered loaded units and could leave a failed state behind.

New approach:
- Check unit file existence with systemctl list-unit-files instead of relying on loaded units.
- Call systemctl reset-failed after stop/disable so the unit no longer appears in `systemctl list-units --failed`.

This handles all edge cases: unit files that exist but aren't loaded, units that failed and are now inactive, and failed states that persist after stop/disable.

Commit: 43d5ca0
Force-pushed from 43d5ca0 to ee46f7b.
Rebased on latest main.

The branch is now up-to-date with main and ready for review.
Pull request overview
Copilot reviewed 66 out of 72 changed files in this pull request and generated 3 comments.
```shell
systemctl stop nvidia-fabricmanager 2>/dev/null || true
systemctl disable nvidia-fabricmanager 2>/dev/null || true
```
Consider using the existing systemctl helper wrappers (systemctl_stop/systemctl_disable) with timeouts/retries instead of raw systemctl stop/disable here. That keeps behavior consistent with other service operations in CSE and avoids potential hangs on stop/disable when systemd is slow/unresponsive.
Suggested change:

```diff
-systemctl stop nvidia-fabricmanager 2>/dev/null || true
-systemctl disable nvidia-fabricmanager 2>/dev/null || true
+systemctl_stop nvidia-fabricmanager || true
+systemctl_disable nvidia-fabricmanager || true
```
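For context, a retry-with-timeout wrapper in the spirit of the CSE helpers mentioned above might look like the sketch below. This is a hypothetical illustration, not the actual CSE implementation; the argument order (retries, sleep seconds, per-attempt timeout, then the command) mirrors how the helpers are invoked later in this PR, but the real functions may differ.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a systemctl retry wrapper (NOT the real CSE helper).
# Retries a command with a per-attempt timeout, sleeping between attempts.
retrycmd_systemctl() {
    local retries=$1 wait_sleep=$2 cmd_timeout=$3
    shift 3
    local i
    for ((i = 1; i <= retries; i++)); do
        # coreutils `timeout` kills the attempt if systemd hangs
        timeout "$cmd_timeout" "$@" && return 0
        sleep "$wait_sleep"
    done
    return 1
}

# Usage sketch: stop a unit with 3 retries, 5s sleep, 25s per-attempt timeout
# retrycmd_systemctl 3 5 25 systemctl stop nvidia-fabricmanager
```

Bounding each attempt with `timeout` is what prevents a slow or unresponsive systemd from stalling the whole CSE run, which is the concern the review comment raises about raw `systemctl stop`/`disable`.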
```shell
# Check if the unit file exists (not just if it's loaded) and disable/stop it
if systemctl list-unit-files | grep -q "nvidia-fabricmanager.service"; then
    systemctl stop nvidia-fabricmanager 2>/dev/null || true
    systemctl disable nvidia-fabricmanager 2>/dev/null || true
    # Reset any failed state so it doesn't show up in 'systemctl list-units --failed'
    systemctl reset-failed nvidia-fabricmanager 2>/dev/null || true
```
GPU_NEEDS_FABRIC_MANAGER is derived from a map lookup and will be "false" for any GPU SKU not explicitly listed. With this change, those unknown SKUs will now have fabric manager forcibly disabled even if it would otherwise start successfully. To reduce risk, consider disabling only when the unit is in a failed state (e.g., systemctl is-failed) or ensuring the FabricManagerGPUSizes allowlist is comprehensive for all multi-GPU SKUs.
Suggested change:

```diff
-# Check if the unit file exists (not just if it's loaded) and disable/stop it
-if systemctl list-unit-files | grep -q "nvidia-fabricmanager.service"; then
-    systemctl stop nvidia-fabricmanager 2>/dev/null || true
-    systemctl disable nvidia-fabricmanager 2>/dev/null || true
-    # Reset any failed state so it doesn't show up in 'systemctl list-units --failed'
-    systemctl reset-failed nvidia-fabricmanager 2>/dev/null || true
+# Only disable/stop it if the unit exists and is currently in a failed state
+if systemctl list-unit-files | grep -q "nvidia-fabricmanager.service"; then
+    if systemctl is-failed --quiet nvidia-fabricmanager; then
+        systemctl stop nvidia-fabricmanager 2>/dev/null || true
+        systemctl disable nvidia-fabricmanager 2>/dev/null || true
+        # Reset any failed state so it doesn't show up in 'systemctl list-units --failed'
+        systemctl reset-failed nvidia-fabricmanager 2>/dev/null || true
+    fi
```
```shell
# The NVIDIA driver installation may automatically enable this service,
# but it will fail on single-GPU systems, so we explicitly disable it
# Check if the unit file exists (not just if it's loaded) and disable/stop it
if systemctl list-unit-files | grep -q "nvidia-fabricmanager.service"; then
```
systemctl list-unit-files | grep -q "nvidia-fabricmanager.service" scans all unit files. Using systemctl list-unit-files --no-pager nvidia-fabricmanager.service (and checking the exit status) is more direct and avoids parsing/grep pitfalls (headers, partial matches).
Suggested change:

```diff
-if systemctl list-unit-files | grep -q "nvidia-fabricmanager.service"; then
+if systemctl list-unit-files --no-pager nvidia-fabricmanager.service >/dev/null 2>&1; then
```
Copilot Review Response

Addressed copilot feedback in commit 6b70510.

✅ Implemented:
- Comment 2: Use the systemctl helper wrappers (systemctl_stop/systemctl_disable) for consistent retry/timeout behavior.
- Comment 4: Use the more direct `systemctl list-unit-files --no-pager <unit>` check.

❌ Not implemented:
- Comment 3: Only disable the service if it is in a failed state. This doesn't solve the problem, because the unit may not have failed yet when CSE runs. We need to proactively disable the service on single-GPU systems (when GPU_NEEDS_FABRIC_MANAGER=false) to prevent future failures.

Regarding unknown GPU SKUs: Comment 3 raised a concern about unknown GPU SKUs. The current behavior is correct: unknown SKUs default to GPU_NEEDS_FABRIC_MANAGER=false, so fabric manager is disabled for them. This is safer than the alternative (unknown SKUs → fabric manager enabled → might fail).
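The default-to-false lookup semantics described above can be sketched in shell as follows. This is an illustration only: the real FabricManagerGPUSizes allowlist lives in the Go agent code, and the SKU entries below are just the two sizes discussed in this PR.

```shell
#!/usr/bin/env bash
# Hypothetical shell model of the allowlist lookup (NOT the real Go map).
# SKUs absent from the map fall through to "false" -- the safe default
# this PR relies on for unknown GPU sizes.
declare -A fabric_manager_sizes=(
    [standard_nd96asr_v4]=true        # 8x A100, NVLink -> needs fabric manager
    [standard_nc24ads_a100_v4]=false  # 1x A100 with MIG -> does not
)

gpu_needs_fabric_manager() {
    local sku="${1,,}"  # SKU names are compared case-insensitively
    echo "${fabric_manager_sizes[$sku]:-false}"
}
```

With this shape, `gpu_needs_fabric_manager Standard_ND96asr_v4` yields `true`, while any SKU not in the map yields `false`, so an unknown single-GPU size never ends up with a failing fabricmanager unit.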
The NVIDIA driver installation automatically installs and enables the nvidia-fabricmanager systemd service. On single-GPU VM sizes like Standard_NC24ads_A100_v4, fabric manager is not needed and will fail to start, causing e2e tests to fail when checking for failed systemd units.

Fabric manager is only required for multi-GPU systems (e.g., ND96asr_v4 with 8 GPUs) where it manages NVLink connections between GPUs. Single-GPU systems with MIG partitions do not need fabric manager.

This fix explicitly disables and stops the nvidia-fabricmanager service when GPU_NEEDS_FABRIC_MANAGER=false, preventing it from entering a failed state on single-GPU systems.

Fixes Test_Ubuntu2404_NvidiaDevicePluginRunning_MIG e2e test failure.
Address review feedback: systemctlDisableAndStop only works with loaded units and doesn't clear failed states. This could miss:
- Unit files that exist but aren't loaded yet
- Units that failed and are now inactive
- Failed states that persist after stop/disable

Changes:
- Check unit file existence with 'systemctl list-unit-files' instead of relying on systemctlDisableAndStop, which only checks loaded units
- Explicitly call 'systemctl reset-failed' to clear any failed state
- This ensures the unit won't show up in 'systemctl list-units --failed', which is what the e2e validator checks

Co-authored-by: reviewer
Addressed copilot review comments:

1. Use systemctl_stop/systemctl_disable helper wrappers (20 retries, 5s sleep, 25s timeout) instead of raw systemctl commands for consistent retry/timeout behavior across CSE operations
2. Use 'systemctl list-unit-files --no-pager <unit>' instead of 'systemctl list-unit-files | grep' for a more direct unit file check that avoids grep parsing pitfalls (headers, partial matches)

Not implemented:
- Comment 3 (only disable if failed): This doesn't solve the problem because the unit may not be failed YET when CSE runs. We need to proactively disable it on single-GPU systems to prevent future failures.
Force-pushed from 6b70510 to c061120.
Rebased on latest main (23cc72a). Conflicts resolved and test snapshots regenerated. Branch is now up-to-date and mergeable.
Pull request overview
Copilot reviewed 66 out of 72 changed files in this pull request and generated 1 comment.
```shell
# Check if the unit file exists using a more direct approach
if systemctl list-unit-files --no-pager nvidia-fabricmanager.service >/dev/null 2>&1; then
    # Use systemctl helper wrappers for consistent retry/timeout behavior
    systemctl_stop 20 5 25 nvidia-fabricmanager || true
    systemctl_disable 20 5 25 nvidia-fabricmanager || true
    # Reset any failed state so it doesn't show up in 'systemctl list-units --failed'
    systemctl reset-failed nvidia-fabricmanager 2>/dev/null || true
fi
```
The systemctl list-unit-files ... >/dev/null guard may not reliably indicate that the unit exists (the command can exit 0 even when no unit files match), which can cause systemctl_stop/systemctl_disable to run and potentially spend a long time retrying on nodes where nvidia-fabricmanager isn’t installed. Consider making the guard check for an actual match in the command output (e.g., using --no-legend and grep -q for the unit name) or using an existence check that fails when the unit file is absent, before invoking the retrying wrappers.
Addressed latest copilot feedback: systemctl list-unit-files may exit 0 even when no unit files match, which could cause unnecessary retries of systemctl_stop/systemctl_disable on nodes without nvidia-fabricmanager.

Changed from:
systemctl list-unit-files --no-pager <unit> >/dev/null 2>&1

To:
systemctl list-unit-files --no-pager --no-legend <unit> | grep -q <unit>

This ensures:
- --no-legend removes header lines that could cause false positives
- grep -q explicitly checks for the unit name in the output
- The stop/disable wrappers only run when the unit file actually exists
- No time is spent on retries for non-existent units
Latest Copilot Feedback Addressed

Issue: the list-unit-files guard can exit 0 even when no unit files match, so the retrying wrappers could run on nodes without nvidia-fabricmanager.

Fix (commit d316c73):

```shell
# Before: May not reliably detect if unit exists
if systemctl list-unit-files --no-pager nvidia-fabricmanager.service >/dev/null 2>&1; then

# After: Explicitly checks for unit in output
if systemctl list-unit-files --no-pager --no-legend nvidia-fabricmanager.service 2>/dev/null | grep -q "nvidia-fabricmanager.service"; then
```

Why this is better: it combines the robustness of the grep check with the directness of querying a specific unit name.
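The key point is that the guard's exit status now comes from `grep -q` (the last command in the pipeline), not from systemctl itself. A small self-contained illustration, using a helper function and fake listing strings invented for this example:

```shell
#!/usr/bin/env bash
# Illustration: the pipeline's exit status is grep's, so a listing command
# that exits 0 but prints nothing still fails the guard. `check` stands in
# for the pipeline; the listing strings are hypothetical.
check() {
    if printf '%s' "$1" | grep -q 'nvidia-fabricmanager\.service'; then
        echo present   # guard body would run
    else
        echo absent    # guard body would be skipped
    fi
}

check ''                                              # -> absent
check 'nvidia-fabricmanager.service enabled enabled'  # -> present
```

An empty listing (the `--no-legend` case on a node without the unit) produces no grep match, so the expensive stop/disable wrappers never run there.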
What this PR does / why we need it:

Fixes the Test_Ubuntu2404_NvidiaDevicePluginRunning_MIG e2e test failure caused by the nvidia-fabricmanager service entering a failed state on single-GPU systems.

The NVIDIA driver installation automatically installs and enables the nvidia-fabricmanager systemd service. On single-GPU VM sizes like Standard_NC24ads_A100_v4, fabric manager is not needed and will fail to start. The e2e test's ValidateNoFailedSystemdUnits() check then fails when it detects the service in a failed state.

Root cause:
- Fabric manager is only required for multi-GPU systems (e.g., ND96asr_v4 with 8 A100 GPUs), where it manages NVLink connections between GPUs
- Single-GPU systems (e.g., NC24ads_A100_v4 with 1 A100 GPU) do not need fabric manager
- The FabricManagerGPUSizes map correctly sets standard_nc24ads_a100_v4: false, but the CSE script didn't handle disabling the service when not needed

The fix: explicitly disable and stop the nvidia-fabricmanager service when GPU_NEEDS_FABRIC_MANAGER=false.

Which issue(s) this PR fixes:

Fixes #

Testing:
- make generate