Merged
105 changes: 2 additions & 103 deletions content/en/gpu_monitoring/setup.md
@@ -28,7 +28,6 @@ To begin using Datadog's GPU Monitoring, your environment must meet the followin

- **Datadog Agent**: v7.74
- **Operating system**: Linux
- (Optional) For advanced eBPF metrics, Linux kernel version 5.8
- **NVIDIA driver**: version 450.51
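
The operating system and driver requirements above can be checked from a shell before installing anything. The following is a minimal sketch, not part of the documented setup: it assumes that `450.51` is a minimum version, that GNU `sort -V` is available, and that `nvidia-smi` is on the `PATH`. The function names `version_ge` and `check_gpu_prereqs` are hypothetical helpers introduced for illustration.

```shell
#!/usr/bin/env bash
# Assumption: the NVIDIA driver version listed above is treated as a minimum.
MIN_DRIVER_VERSION="450.51"

# Succeeds when version $1 is greater than or equal to version $2.
# Relies on GNU sort's version ordering (-V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Checks the operating system and the installed NVIDIA driver version.
check_gpu_prereqs() {
  if [ "$(uname -s)" != "Linux" ]; then
    echo "unsupported operating system: $(uname -s)" >&2
    return 1
  fi
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found; is the NVIDIA driver installed?" >&2
    return 1
  fi
  local driver
  # With several GPUs, nvidia-smi prints one line per device; use the first.
  driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
  if ! version_ge "$driver" "$MIN_DRIVER_VERSION"; then
    echo "NVIDIA driver $driver is older than $MIN_DRIVER_VERSION" >&2
    return 1
  fi
  echo "OS and NVIDIA driver ($driver) meet the listed requirements"
}
```

After sourcing the script, run `check_gpu_prereqs` on the host. The sketch does not check the Datadog Agent version.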

If using Kubernetes, the following additional requirements must be met:
@@ -121,52 +120,17 @@ The following instructions are the basic steps to set up GPU Monitoring in the f

{{% tab "Docker" %}}

To enable GPU Monitoring in Docker without advanced eBPF metrics, use the following configuration when starting the container Agent:
To enable GPU Monitoring in Docker, use the following configuration when starting the container Agent:

```shell
docker run \
--pid host \
--gpus all \
-e DD_GPU_ENABLED=true \
-v /var/run/docker.sock:/var/run/docker.sock:ro \
-v /proc/:/host/proc/:ro \
-v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
registry.datadoghq.com/agent:latest
```
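
Once the container is started, one way to confirm that it was launched with the GPU flag is to inspect its configured environment. This is a hedged sketch rather than a documented verification step: `gpu_flag_enabled` and `container_env` are hypothetical helper names, and the container name passed to `container_env` is whatever you chose when starting the Agent.

```shell
#!/usr/bin/env bash
# Succeeds when the environment listing on stdin contains the exact line
# DD_GPU_ENABLED=true.
gpu_flag_enabled() {
  grep -qx 'DD_GPU_ENABLED=true'
}

# Prints the environment configured for a container, one VAR=value per line.
# "$1" is the container name or ID (hypothetical; substitute your own).
container_env() {
  docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' "$1"
}
```

For example, `container_env <CONTAINER_NAME> | gpu_flag_enabled && echo "DD_GPU_ENABLED is set"`. This only confirms how the container was configured; it does not confirm that GPU metrics are being collected.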

To enable advanced eBPF metrics, use the following configuration for the required permissions to run eBPF programs:

```shell
docker run \
--cgroupns host \
--pid host \
--gpus all \
-e DD_API_KEY="<DATADOG_API_KEY>" \
-e DD_GPU_MONITORING_ENABLED=true \
-e DD_GPU_ENABLED=true \
-v /:/host/root:ro \
-v /var/run/docker.sock:/var/run/docker.sock:ro \
-v /proc/:/host/proc/:ro \
-v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
-v /sys/kernel/debug:/sys/kernel/debug \
-v /lib/modules:/lib/modules:ro \
-v /usr/src:/usr/src:ro \
-v /var/tmp/datadog-agent/system-probe/build:/var/tmp/datadog-agent/system-probe/build \
-v /var/tmp/datadog-agent/system-probe/kernel-headers:/var/tmp/datadog-agent/system-probe/kernel-headers \
-v /etc/apt:/host/etc/apt:ro \
-v /etc/yum.repos.d:/host/etc/yum.repos.d:ro \
-v /etc/zypp:/host/etc/zypp:ro \
-v /etc/pki:/host/etc/pki:ro \
-v /etc/yum/vars:/host/etc/yum/vars:ro \
-v /etc/dnf/vars:/host/etc/dnf/vars:ro \
-v /etc/rhsm:/host/etc/rhsm:ro \
-e HOST_ROOT=/host/root \
--security-opt apparmor:unconfined \
--cap-add=SYS_ADMIN \
--cap-add=SYS_RESOURCE \
--cap-add=SYS_PTRACE \
--cap-add=IPC_LOCK \
--cap-add=CHOWN \
registry.datadoghq.com/agent:latest
```

@@ -200,41 +164,6 @@ services:
capabilities: [gpu]
```

To enable advanced eBPF metrics, use the following configuration for the required permissions to run eBPF programs:

```yaml
version: '3'
services:
  datadog:
    image: "registry.datadoghq.com/agent:latest"
    environment:
      - DD_GPU_MONITORING_ENABLED=true # only for advanced eBPF metrics
      - DD_GPU_ENABLED=true
      - DD_API_KEY=<DATADOG_API_KEY>
      - HOST_ROOT=/host/root
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /proc/:/host/proc/:ro
      - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
      - /sys/kernel/debug:/sys/kernel/debug
      - /:/host/root
    cap_add:
      - SYS_ADMIN
      - SYS_RESOURCE
      - SYS_PTRACE
      - IPC_LOCK
      - CHOWN
    security_opt:
      - apparmor:unconfined
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

{{% /tab %}}
{{% tab "Linux (non-containerized)" %}}

@@ -245,27 +174,6 @@ gpu:
  enabled: true
```

Additionally, to enable advanced eBPF-based metrics such as GPU core utilization (`gpu.process.core.usage`), follow these steps:

1. If `/etc/datadog-agent/system-probe.yaml` does not exist, create it from `system-probe.yaml.example`:

```shell
sudo -u dd-agent install -m 0640 /etc/datadog-agent/system-probe.yaml.example /etc/datadog-agent/system-probe.yaml
```

2. Edit `/etc/datadog-agent/system-probe.yaml` and enable GPU monitoring in system-probe:

```yaml
gpu_monitoring:
  enabled: true
```

3. Restart the Datadog Agent

```shell
sudo systemctl restart datadog-agent
```

{{% /tab %}}

{{< /tabs >}}
@@ -306,10 +214,7 @@ To set up GPU Monitoring on a mixed cluster, use the Operator's [Agent Profiles]

Then re-deploy the Datadog Operator with: `helm upgrade --install <release-name> datadog/datadog-operator -f datadog-operator.yaml`.

2. Modify your `DatadogAgent` resource with the following changes:

1. Add the `agent.datadoghq.com/update-metadata` annotation to the `DatadogAgent` resource.
2. If advanced eBPF metrics are wanted, verify that at least one system-probe feature is enabled. Examples of system-probe features are `npm`, `cws`, `usm`. If none is enabled, the `oomKill` feature can be enabled.
2. Modify your `DatadogAgent` resource by adding the `agent.datadoghq.com/update-metadata` annotation to the `DatadogAgent` resource.

The additions to the `datadog-agent.yaml` file should look like this:

@@ -320,12 +225,6 @@ To set up GPU Monitoring on a mixed cluster, use the Operator's [Agent Profiles]
  name: datadog
  annotations:
    agent.datadoghq.com/update-metadata: "true" # Required for the Datadog Agent Internal mode to work.
spec:
  features:
    oomKill:
      # Only enable this feature if there is nothing else that requires the system-probe container in all Agent pods
      # Examples of system-probe features are npm, cws, usm
      enabled: true
```

3. Apply your changes to the `DatadogAgent` resource. These changes are safe to apply to all Datadog Agents, regardless of whether they run on GPU nodes.