The benchmark automatically collects comprehensive GPU metrics using NVIDIA DCGM (Data Center GPU Manager) while it runs.
The following DCGM field IDs are sampled at 1-second intervals:
- 155: GPU Power Usage (Watts)
- 150: GPU Temperature (°C)
- 203: GPU Utilization (%)
- 204: Memory Copy Utilization (%)
- 1002: SM (Streaming Multiprocessor) Active
- 1003: SM Occupancy
- 1004: Tensor Core Active
- 1005: DRAM Active
- 1006: FP64 Active
- 1007: FP32 Active
- 1008: FP16 Active
- 1009: PCIe TX Throughput
- 1010: PCIe RX Throughput
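The field list above is ultimately just a set of numeric IDs passed to `dcgmi dmon -e`. A minimal sketch of building that argument programmatically — `FIELDS` and `dmon_field_arg` are illustrative names, not part of the benchmark code:

```python
# Illustrative subset of the monitored DCGM field IDs (labels are for
# readability only; dcgmi consumes the numeric IDs).
FIELDS = {
    155: "GPU Power Usage (W)",
    203: "GPU Utilization (%)",
}

def dmon_field_arg(field_ids):
    """Build the comma-separated list that `dcgmi dmon -e` expects."""
    return ",".join(str(fid) for fid in sorted(field_ids))

print(dmon_field_arg(FIELDS))  # -> 155,203
```

Keeping the IDs in one mapping makes it easy to change the monitored set in a single place.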
GPU metrics are saved to /tmp/gpu_metrics.log on the instance and included in the S3 upload as gpu_metrics.log.gz.
Example log format:

```
#Entity   PWRUSG  GPUTMP  GRUTIL  MMUSG  ...
GPU 0     250.5   72      95      80     ...
GPU 0     252.1   73      96      81     ...
```
The GPU metrics are:
- Collected continuously during the benchmark
- Downloaded to local results directory
- Compressed and uploaded to S3
- Available for post-processing and analysis
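Since the log is uploaded as gpu_metrics.log.gz, post-processing can read it without first decompressing it on disk. A minimal standard-library sketch — `read_metrics_log` is an illustrative helper, and the filename assumes the local copy keeps the S3 object name:

```python
import gzip

def read_metrics_log(path="gpu_metrics.log.gz"):
    """Decompress the downloaded metrics log and return its lines."""
    with gzip.open(path, mode="rt") as f:
        return f.read().splitlines()

# Example: skip dcgmi's '#'-prefixed header lines to get only data rows.
# lines = read_metrics_log()
# samples = [l for l in lines if l and not l.startswith("#")]
```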
To analyze GPU metrics:

```python
import pandas as pd

# dcgmi dmon marks its header row with '#', which comment='#' would drop,
# so recover the column names manually; the entity label ("GPU 0") spans
# two whitespace-separated tokens, so the index gets its own column.
with open('gpu_metrics.log') as f:
    header = f.readline().lstrip('#').split()
names = [header[0], 'EntityId'] + header[1:]
df = pd.read_csv('gpu_metrics.log', sep=r'\s+', comment='#', names=names)

# Calculate statistics
print(f"Average Power: {df['PWRUSG'].mean():.2f}W")
print(f"Peak Temperature: {df['GPUTMP'].max()}°C")
print(f"Average GPU Util: {df['GRUTIL'].mean():.1f}%")
```

GPU telemetry is enabled by default. To disable it, modify the _run_benchmark() method in benchmark_runner.py.
Requirements:
- NVIDIA GPU with DCGM support
- CUDA 12.0+ recommended
- DCGM installed on the instance (automatically installed during setup)
If GPU metrics are not collected:
- Check the DCGM installation: `dcgmi discovery -l`
- Verify the DCGM service: `systemctl status nvidia-dcgm`
- Test collection manually: `dcgmi dmon -e 155,150 -c 5`
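The first of these checks can also be scripted, so a missing or broken DCGM install is reported as a boolean rather than raising. A hedged sketch — `dcgm_available` is an illustrative name, not part of the benchmark:

```python
import subprocess

def dcgm_available(binary="dcgmi"):
    """Return True if `dcgmi discovery -l` runs and exits cleanly.

    Returns False when the binary is missing, hangs, or exits non-zero.
    """
    try:
        result = subprocess.run(
            [binary, "discovery", "-l"],
            capture_output=True, text=True, timeout=30,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

If this returns False, fall back to the manual checks listed above.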