diff --git a/content/en/gpu_monitoring/setup.md b/content/en/gpu_monitoring/setup.md
index a59ce0efde8..c5c98ea5e82 100644
--- a/content/en/gpu_monitoring/setup.md
+++ b/content/en/gpu_monitoring/setup.md
@@ -28,7 +28,6 @@ To begin using Datadog's GPU Monitoring, your environment must meet the followin
 
 - **Datadog Agent**: v7.74
 - **Operating system**: Linux
-  - (Optional) For advanced eBPF metrics, Linux kernel version 5.8
 - **NVIDIA driver**: version 450.51
 
 If using Kubernetes, the following additional requirements must be met:
@@ -121,52 +120,17 @@ The following instructions are the basic steps to set up GPU Monitoring in the f
 
 {{% tab "Docker" %}}
 
-To enable GPU Monitoring in Docker without advanced eBPF metrics, use the following configuration when starting the container Agent:
+To enable GPU Monitoring in Docker, use the following configuration when starting the container Agent:
 
 ```shell
 docker run \
 --pid host \
 --gpus all \
--e DD_GPU_ENABLED=true \
--v /var/run/docker.sock:/var/run/docker.sock:ro \
--v /proc/:/host/proc/:ro \
--v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
-registry.datadoghq.com/agent:latest
-```
-
-To enable advanced eBPF metrics, use the following configuration for the required permissions to run eBPF programs:
-
-```shell
-docker run \
---cgroupns host \
---pid host \
---gpus all \
 -e DD_API_KEY="" \
--e DD_GPU_MONITORING_ENABLED=true \
 -e DD_GPU_ENABLED=true \
--v /:/host/root:ro \
 -v /var/run/docker.sock:/var/run/docker.sock:ro \
 -v /proc/:/host/proc/:ro \
 -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
--v /sys/kernel/debug:/sys/kernel/debug \
--v /lib/modules:/lib/modules:ro \
--v /usr/src:/usr/src:ro \
--v /var/tmp/datadog-agent/system-probe/build:/var/tmp/datadog-agent/system-probe/build \
--v /var/tmp/datadog-agent/system-probe/kernel-headers:/var/tmp/datadog-agent/system-probe/kernel-headers \
--v /etc/apt:/host/etc/apt:ro \
--v /etc/yum.repos.d:/host/etc/yum.repos.d:ro \
--v /etc/zypp:/host/etc/zypp:ro \
--v /etc/pki:/host/etc/pki:ro \
--v /etc/yum/vars:/host/etc/yum/vars:ro \
--v /etc/dnf/vars:/host/etc/dnf/vars:ro \
--v /etc/rhsm:/host/etc/rhsm:ro \
--e HOST_ROOT=/host/root \
---security-opt apparmor:unconfined \
---cap-add=SYS_ADMIN \
---cap-add=SYS_RESOURCE \
---cap-add=SYS_PTRACE \
---cap-add=IPC_LOCK \
---cap-add=CHOWN \
 registry.datadoghq.com/agent:latest
 ```
 
@@ -200,41 +164,6 @@ services:
               capabilities: [gpu]
 ```
 
-To enable advanced eBPF metrics, use the following configuration for the required permissions to run eBPF programs:
-
-```yaml
-version: '3'
-services:
-  datadog:
-    image: "registry.datadoghq.com/agent:latest"
-    environment:
-      - DD_GPU_MONITORING_ENABLED=true # only for advanced eBPF metrics
-      - DD_GPU_ENABLED=true
-      - DD_API_KEY=
-      - HOST_ROOT=/host/root
-    volumes:
-      - /var/run/docker.sock:/var/run/docker.sock:ro
-      - /proc/:/host/proc/:ro
-      - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
-      - /sys/kernel/debug:/sys/kernel/debug
-      - /:/host/root
-    cap_add:
-      - SYS_ADMIN
-      - SYS_RESOURCE
-      - SYS_PTRACE
-      - IPC_LOCK
-      - CHOWN
-    security_opt:
-      - apparmor:unconfined
-    deploy:
-      resources:
-        reservations:
-          devices:
-            - driver: nvidia
-              count: all
-              capabilities: [gpu]
-```
-
 {{% /tab %}}
 
 {{% tab "Linux (non-containerized)" %}}
@@ -245,27 +174,6 @@ gpu:
   enabled: true
 ```
 
-Additionally, to enable advanced eBPF-based metrics such as GPU core utilization (`gpu.process.core.usage`), follow these steps:
-
-1. If `/etc/datadog-agent/system-probe.yaml` does not exist, create it from `system-probe.yaml.example`:
-
-   ```shell
-   sudo -u dd-agent install -m 0640 /etc/datadog-agent/system-probe.yaml.example /etc/datadog-agent/system-probe.yaml
-   ```
-
-2. Edit `/etc/datadog-agent/system-probe.yaml` and enable GPU monitoring in system-probe:
-
-   ```yaml
-   gpu_monitoring:
-     enabled: true
-   ```
-
-3. Restart the Datadog Agent
-
-   ```shell
-   sudo systemctl restart datadog-agent
-   ```
-
 {{% /tab %}}
 
 {{< /tabs >}}
@@ -306,10 +214,7 @@ To set up GPU Monitoring on a mixed cluster, use the Operator's [Agent Profiles]
 
    Then re-deploy the Datadog Operator with: `helm upgrade --install datadog/datadog-operator -f datadog-operator.yaml`.
 
-2. Modify your `DatadogAgent` resource with the following changes:
-
-   1. Add the `agent.datadoghq.com/update-metadata` annotation to the `DatadogAgent` resource.
-   2. If advanced eBPF metrics are wanted, verify that at least one system-probe feature is enabled. Examples of system-probe features are `npm`, `cws`, `usm`. If none is enabled, the `oomKill` feature can be enabled.
+2. Modify your `DatadogAgent` resource by adding the `agent.datadoghq.com/update-metadata` annotation to the `DatadogAgent` resource.
 
    The additions to the `datadog-agent.yaml` file should look like this:
 
@@ -320,12 +225,6 @@ To set up GPU Monitoring on a mixed cluster, use the Operator's [Agent Profiles]
      name: datadog
      annotations:
        agent.datadoghq.com/update-metadata: "true" # Required for the Datadog Agent Internal mode to work.
-   spec:
-     features:
-       oomKill:
-         # Only enable this feature if there is nothing else that requires the system-probe container in all Agent pods
-         # Examples of system-probe features are npm, cws, usm
-         enabled: true
    ```
 
 3. Apply your changes to the `DatadogAgent` resource. These changes are safe to apply to all Datadog Agents, regardless of whether they run on GPU nodes.