
GPUDirect RDMA Install and Config

Alyssa Vu edited this page Aug 12, 2025 · 3 revisions

Overview

Network Operator

apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.04-0.6.1.0-2
    forcePrecompiled: false
    terminationGracePeriodSeconds: 300
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    upgradePolicy:
      autoUpgrade: true
      maxParallelUpgrades: 1
      safeLoad: false
      drain:
        enable: true
        force: true
        timeoutSeconds: 300
        deleteEmptyDir: true

This is how the Network Operator installs MOFED. We can try installing the doca-driver Docker image using this link. This is the full list of NVIDIA images used by the Network Operator.
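As a rough sketch of using the NicClusterPolicy above, assuming the Network Operator is already running and deploys its driver pods into the nvidia-network-operator namespace (the namespace and file name here are assumptions, not from this page):

```shell
# Apply the NicClusterPolicy shown above (saved locally as nic-cluster-policy.yaml)
kubectl apply -f nic-cluster-policy.yaml

# Watch the DOCA/MOFED driver pods roll out (namespace is an assumption)
kubectl get pods -n nvidia-network-operator -w

# Confirm the policy reaches a ready state
kubectl get nicclusterpolicy nic-cluster-policy -o jsonpath='{.status.state}'
```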

GPU Operator

To use GPUDirect RDMA, we need GPU Operator Helm values like the ones below:

driver:
  rdma:
    enabled: true # Enables RDMA for GPUDirect support
migManager:
  enabled: false # Multi-Instance GPU not required
vgpuDeviceManager:
  enabled: false # vGPU support not needed
vfioManager:
  enabled: false # VFIO-PCI passthrough not used
sandboxDevicePlugin:
  enabled: false # Sandboxed workloads not enabled

This will create a set of pods like the following:

NAME                                       READY   STATUS      RESTARTS        AGE     LABELS
gpu-feature-discovery-gsgrx                1/1     Running     0               8m32s   app.kubernetes.io/managed-by=gpu-operator,app.kubernetes.io/part-of=nvidia-gpu,app=gpu-feature-discovery,controller-revision-hash=65b59b8987,helm.sh/chart=gpu-operator-v25.3.1,pod-template-generation=1
gpu-feature-discovery-qwxgm                1/1     Running     0               8m35s   app.kubernetes.io/managed-by=gpu-operator,app.kubernetes.io/part-of=nvidia-gpu,app=gpu-feature-discovery,controller-revision-hash=65b59b8987,helm.sh/chart=gpu-operator-v25.3.1,pod-template-generation=1
gpu-operator-857fc9cf65-h9f2d              1/1     Running     0               8m57s   app.kubernetes.io/component=gpu-operator,app.kubernetes.io/instance=gpu-operator,app.kubernetes.io/managed-by=Helm,app.kubernetes.io/name=gpu-operator,app.kubernetes.io/version=v25.3.1,app=gpu-operator,helm.sh/chart=gpu-operator-v25.3.1,nvidia.com/gpu-driver-upgrade-drain.skip=true,pod-template-hash=857fc9cf65
nvidia-container-toolkit-daemonset-bzmjn   1/1     Running     0               8m32s   app.kubernetes.io/managed-by=gpu-operator,app=nvidia-container-toolkit-daemonset,controller-revision-hash=56fc4c4686,helm.sh/chart=gpu-operator-v25.3.1,pod-template-generation=1
nvidia-container-toolkit-daemonset-qb7m8   1/1     Running     0               8m35s   app.kubernetes.io/managed-by=gpu-operator,app=nvidia-container-toolkit-daemonset,controller-revision-hash=56fc4c4686,helm.sh/chart=gpu-operator-v25.3.1,pod-template-generation=1
nvidia-cuda-validator-cdxxd                0/1     Completed   0               5m18s   app=nvidia-cuda-validator
nvidia-cuda-validator-lss2f                0/1     Completed   0               5m29s   app=nvidia-cuda-validator
nvidia-dcgm-exporter-4mjph                 1/1     Running     0               8m35s   app.kubernetes.io/managed-by=gpu-operator,app=nvidia-dcgm-exporter,controller-revision-hash=6f4dfc66ff,helm.sh/chart=gpu-operator-v25.3.1,pod-template-generation=1
nvidia-dcgm-exporter-8twrl                 1/1     Running     0               8m32s   app.kubernetes.io/managed-by=gpu-operator,app=nvidia-dcgm-exporter,controller-revision-hash=6f4dfc66ff,helm.sh/chart=gpu-operator-v25.3.1,pod-template-generation=1
nvidia-device-plugin-daemonset-4hjjg       1/1     Running     0               8m32s   app.kubernetes.io/managed-by=gpu-operator,app=nvidia-device-plugin-daemonset,controller-revision-hash=6f54cb6dbb,helm.sh/chart=gpu-operator-v25.3.1,pod-template-generation=1
nvidia-device-plugin-daemonset-q6cz9       1/1     Running     0               8m35s   app.kubernetes.io/managed-by=gpu-operator,app=nvidia-device-plugin-daemonset,controller-revision-hash=6f54cb6dbb,helm.sh/chart=gpu-operator-v25.3.1,pod-template-generation=1
nvidia-driver-daemonset-9582h              2/2     Running     1 (5m58s ago)   8m48s   app.kubernetes.io/component=nvidia-driver,app.kubernetes.io/managed-by=gpu-operator,app=nvidia-driver-daemonset,controller-revision-hash=69f84678c9,helm.sh/chart=gpu-operator-v25.3.1,nvidia.com/precompiled=false,pod-template-generation=1
nvidia-driver-daemonset-v26hd              2/2     Running     2 (6m7s ago)    8m48s   app.kubernetes.io/component=nvidia-driver,app.kubernetes.io/managed-by=gpu-operator,app=nvidia-driver-daemonset,controller-revision-hash=69f84678c9,helm.sh/chart=gpu-operator-v25.3.1,nvidia.com/precompiled=false,pod-template-generation=1
nvidia-operator-validator-7l68x            1/1     Running     0               8m35s   app.kubernetes.io/managed-by=gpu-operator,app.kubernetes.io/part-of=gpu-operator,app=nvidia-operator-validator,controller-revision-hash=7795dd956d,helm.sh/chart=gpu-operator-v25.3.1,pod-template-generation=1
nvidia-operator-validator-hs8h6            1/1     Running     0               8m32s   app.kubernetes.io/managed-by=gpu-operator,app.kubernetes.io/part-of=gpu-operator,app=nvidia-operator-validator,controller-revision-hash=7795dd956d,helm.sh/chart=gpu-operator-v25.3.1,pod-template-generation=1

Without MOFED, the nvidia-driver-daemonset gets stuck waiting for the MOFED driver to become ready and won't install the NVIDIA driver. We can try to see whether downloading the Docker image directly onto the node solves the issue.
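One hedged way to check whether GPUDirect RDMA actually came up on a node is to look for the nvidia-peermem kernel module from inside a driver daemonset pod. The pod name below is taken from the listing above; the namespace and container name are assumptions based on typical GPU Operator defaults:

```shell
# Check for the nvidia_peermem (or legacy nv_peer_mem) module on the node
kubectl exec -n gpu-operator nvidia-driver-daemonset-9582h -c nvidia-driver-ctr -- \
  lsmod | grep -i peermem
```

If the module is missing, the MOFED/doca-driver rollout from the Network Operator is the first thing to revisit.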
