GPUDirect RDMA Install and Config
Alyssa Vu edited this page Aug 12, 2025
```yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.04-0.6.1.0-2
    forcePrecompiled: false
    terminationGracePeriodSeconds: 300
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    upgradePolicy:
      autoUpgrade: true
      maxParallelUpgrades: 1
      safeLoad: false
      drain:
        enable: true
        force: true
        timeoutSeconds: 300
        deleteEmptyDir: true
```
This is how the network operator installs MOFED. We can try installing the doca-driver Docker image using this link. This is the full list of NVIDIA images used in the network operator.
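As a sketch, the NicClusterPolicy above can be applied and checked like this (the file name `nic-cluster-policy.yaml` and the `nvidia-network-operator` namespace are assumptions; adjust to your deployment):

```shell
# Apply the NicClusterPolicy (file name is an example)
kubectl apply -f nic-cluster-policy.yaml

# Watch the policy status; it should eventually report "ready"
kubectl get nicclusterpolicy nic-cluster-policy -o jsonpath='{.status.state}'

# Check the MOFED/DOCA driver pods rolled out by the network operator
# (namespace is an assumption; it depends on how the operator was installed)
kubectl get pods -n nvidia-network-operator
```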
To use GPUDirect RDMA, we need a GPU operator configuration like the one below:

```yaml
driver:
  rdma:
    enabled: true      # Enables RDMA for GPUDirect support
migManager:
  enabled: false       # Multi-Instance GPU not required
vgpuDeviceManager:
  enabled: false       # vGPU support not needed
vfioManager:
  enabled: false       # VFIO-PCI passthrough not used
sandboxDevicePlugin:
  enabled: false       # Sandboxed workloads not enabled
```
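A minimal sketch of installing the GPU operator with these values via Helm (the values file name, release name, and namespace are assumptions):

```shell
# Add the NVIDIA Helm repo and install the GPU operator with the
# RDMA-enabled values above (saved here as values.yaml, an example name)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  -f values.yaml
```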
This creates a set of pods like the one below:

```
NAME READY STATUS RESTARTS AGE LABELS
gpu-feature-discovery-gsgrx 1/1 Running 0 8m32s app.kubernetes.io/managed-by=gpu-operator,app.kubernetes.io/part-of=nvidia-gpu,app=gpu-feature-discovery,controller-revision-hash=65b59b8987,helm.sh/chart=gpu-operator-v25.3.1,pod-template-generation=1
gpu-feature-discovery-qwxgm 1/1 Running 0 8m35s app.kubernetes.io/managed-by=gpu-operator,app.kubernetes.io/part-of=nvidia-gpu,app=gpu-feature-discovery,controller-revision-hash=65b59b8987,helm.sh/chart=gpu-operator-v25.3.1,pod-template-generation=1
gpu-operator-857fc9cf65-h9f2d 1/1 Running 0 8m57s app.kubernetes.io/component=gpu-operator,app.kubernetes.io/instance=gpu-operator,app.kubernetes.io/managed-by=Helm,app.kubernetes.io/name=gpu-operator,app.kubernetes.io/version=v25.3.1,app=gpu-operator,helm.sh/chart=gpu-operator-v25.3.1,nvidia.com/gpu-driver-upgrade-drain.skip=true,pod-template-hash=857fc9cf65
nvidia-container-toolkit-daemonset-bzmjn 1/1 Running 0 8m32s app.kubernetes.io/managed-by=gpu-operator,app=nvidia-container-toolkit-daemonset,controller-revision-hash=56fc4c4686,helm.sh/chart=gpu-operator-v25.3.1,pod-template-generation=1
nvidia-container-toolkit-daemonset-qb7m8 1/1 Running 0 8m35s app.kubernetes.io/managed-by=gpu-operator,app=nvidia-container-toolkit-daemonset,controller-revision-hash=56fc4c4686,helm.sh/chart=gpu-operator-v25.3.1,pod-template-generation=1
nvidia-cuda-validator-cdxxd 0/1 Completed 0 5m18s app=nvidia-cuda-validator
nvidia-cuda-validator-lss2f 0/1 Completed 0 5m29s app=nvidia-cuda-validator
nvidia-dcgm-exporter-4mjph 1/1 Running 0 8m35s app.kubernetes.io/managed-by=gpu-operator,app=nvidia-dcgm-exporter,controller-revision-hash=6f4dfc66ff,helm.sh/chart=gpu-operator-v25.3.1,pod-template-generation=1
nvidia-dcgm-exporter-8twrl 1/1 Running 0 8m32s app.kubernetes.io/managed-by=gpu-operator,app=nvidia-dcgm-exporter,controller-revision-hash=6f4dfc66ff,helm.sh/chart=gpu-operator-v25.3.1,pod-template-generation=1
nvidia-device-plugin-daemonset-4hjjg 1/1 Running 0 8m32s app.kubernetes.io/managed-by=gpu-operator,app=nvidia-device-plugin-daemonset,controller-revision-hash=6f54cb6dbb,helm.sh/chart=gpu-operator-v25.3.1,pod-template-generation=1
nvidia-device-plugin-daemonset-q6cz9 1/1 Running 0 8m35s app.kubernetes.io/managed-by=gpu-operator,app=nvidia-device-plugin-daemonset,controller-revision-hash=6f54cb6dbb,helm.sh/chart=gpu-operator-v25.3.1,pod-template-generation=1
nvidia-driver-daemonset-9582h 2/2 Running 1 (5m58s ago) 8m48s app.kubernetes.io/component=nvidia-driver,app.kubernetes.io/managed-by=gpu-operator,app=nvidia-driver-daemonset,controller-revision-hash=69f84678c9,helm.sh/chart=gpu-operator-v25.3.1,nvidia.com/precompiled=false,pod-template-generation=1
nvidia-driver-daemonset-v26hd 2/2 Running 2 (6m7s ago) 8m48s app.kubernetes.io/component=nvidia-driver,app.kubernetes.io/managed-by=gpu-operator,app=nvidia-driver-daemonset,controller-revision-hash=69f84678c9,helm.sh/chart=gpu-operator-v25.3.1,nvidia.com/precompiled=false,pod-template-generation=1
nvidia-operator-validator-7l68x 1/1 Running 0 8m35s app.kubernetes.io/managed-by=gpu-operator,app.kubernetes.io/part-of=gpu-operator,app=nvidia-operator-validator,controller-revision-hash=7795dd956d,helm.sh/chart=gpu-operator-v25.3.1,pod-template-generation=1
nvidia-operator-validator-hs8h6 1/1 Running 0 8m32s app.kubernetes.io/managed-by=gpu-operator,app.kubernetes.io/part-of=gpu-operator,app=nvidia-operator-validator,controller-revision-hash=7795dd956d,helm.sh/chart=gpu-operator-v25.3.1,pod-template-generation=1
```
Without MOFED, the nvidia-driver-daemonset gets stuck waiting for the MOFED driver to become ready and will not install the NVIDIA driver. We can try downloading the Docker image directly to the node to see whether that resolves the issue.
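A sketch of pre-pulling the doca-driver image on a node, assuming the node runs containerd with `crictl` available (the image repository and version come from the NicClusterPolicy above, but doca-driver tags typically carry an OS/kernel suffix, so the exact tag is an assumption and should be verified against the registry for your node's OS):

```shell
# On the node itself: pull the doca-driver image so the daemonset
# does not have to fetch it (tag suffix is an example and may differ)
sudo crictl pull nvcr.io/nvidia/mellanox/doca-driver:25.04-0.6.1.0-2-ubuntu22.04-amd64

# Confirm the image is now present in the node's image store
sudo crictl images | grep doca-driver
```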