Operator-grade GPU monitor for NVIDIA GPUs with GB10 / DGX Spark–aware unified memory handling.
- GPU utilization, temperature, power, memory — via NVML
- Runtime detection of coherent UMA (GB10 / DGX Spark)
- Memory display uses `vm.total - vm.available` for accurate UMA reporting
- Memory pressure (PSI) — LOW / MOD / HIGH / CRITICAL from `/proc/pressure/memory`
- IO pressure (IO PSI) — LOW / MOD / HIGH / CRITICAL from `/proc/pressure/io`
- Load-gated clock states — IDLE / PASS / LOCKED / THROTTLED
- CPU utilization and active core count
- SWAP monitoring
- TEMP row — current and session peak for GPU and CPU, color-coded green/yellow/red
- INFO row — time, driver version, CUDA version, kernel, uptime
- Process list sorted by GPU memory usage, scales to terminal height
- Anomaly auto-logger — automatically logs to `~/sparkview_logs/` when issues are detected
- Clean exit on Ctrl+C
Note: This tool is not fully validated on GB10 / DGX Spark hardware. If you run it on Spark, please open an issue with your results.
Community discussion and field results: https://forums.developer.nvidia.com/t/sparkview-gpu-monitor-tool-with-gb10-aware-unified-memory-handling/366877
```bash
git clone https://github.com/parallelArchitect/sparkview.git
cd sparkview

# create a virtual environment (recommended on DGX Spark)
python3 -m venv sparkview-venv

# activate it
source ~/sparkview/sparkview-venv/bin/activate

# install dependencies
pip install nvitop psutil rich textual

# run
python3 main.py
```

Add a permanent alias for one-command launch from the terminal:

```bash
echo "alias sparkview='source ~/sparkview/sparkview-venv/bin/activate && python3 ~/sparkview/main.py'" >> ~/.bashrc
source ~/.bashrc
```

Then just type `sparkview` from the terminal to launch.
sparkview automatically starts logging when any of these conditions are detected:
- PSI memory pressure reaches MOD, HIGH, or CRITICAL
- IO pressure reaches MOD, HIGH, or CRITICAL
- GPU clock drops to THROTTLED or LOCKED under load
- Memory > 85% with swap active
- GPU or CPU temperature exceeds 80°C
Logs are saved to `~/sparkview_logs/<timestamp>/`:

- `anomaly.log.gz` — full compressed snapshot log, one entry every 2 seconds
- `summary.json` — machine-readable event summary including trigger, duration, peak temps, driver, CUDA, and kernel version
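The trigger conditions above can be sketched as a single predicate. The thresholds (85% memory, 80 °C) and state names mirror the list; the function signature itself is hypothetical, not sparkview's API.

```python
# Hedged sketch of the anomaly trigger logic described above.
# Thresholds mirror the documented conditions; the signature is illustrative.

PRESSURE_TRIGGERS = {"MOD", "HIGH", "CRITICAL"}
CLOCK_TRIGGERS = {"THROTTLED", "LOCKED"}

def should_log(psi, io_psi, clock_state, mem_pct, swap_active,
               gpu_temp_c, cpu_temp_c):
    """Return True if any documented anomaly condition is met."""
    return (
        psi in PRESSURE_TRIGGERS
        or io_psi in PRESSURE_TRIGGERS
        or clock_state in CLOCK_TRIGGERS
        or (mem_pct > 85 and swap_active)   # memory > 85% with swap active
        or gpu_temp_c > 80
        or cpu_temp_c > 80
    )
```

Note that high memory use alone does not trigger logging — it must coincide with active swap, which keeps full-but-healthy caches from generating noise.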
IO PSI (/proc/pressure/io) measures how much time tasks spend waiting on disk I/O. For ML training workloads this surfaces dataloader bottlenecks before GPU utilization drops.
| PSI | IO PSI | Diagnosis |
|---|---|---|
| LOW | CRITICAL | Pure IO bottleneck — dataloader, checkpoint write, network FS |
| HIGH | CRITICAL | System contention — memory reclaim and disk competing |
| LOW | LOW | Healthy |
On coherent UMA platforms, `nvmlDeviceGetMemoryInfo` may report total ≈ MemTotal (~121 GB). This does not reflect allocatable memory.
sparkview detects this condition at runtime and uses `vm.total - vm.available` for used memory and `vm.total` as the display total — accurate under any workload, including heavy inference loads.
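The detection itself reduces to comparing the NVML-reported GPU total against the host's MemTotal: on coherent UMA they track each other, on a discrete card they don't. A sketch of that comparison — the 5% tolerance and function name are illustrative assumptions, not sparkview's exact heuristic:

```python
# Sketch of runtime UMA detection: on coherent UMA platforms NVML's
# reported GPU "total" roughly equals host MemTotal, so the two values
# agreeing (within a tolerance) flags the condition.
# The 5% tolerance is an illustrative assumption.

def is_coherent_uma(nvml_total_bytes, host_memtotal_bytes, tol=0.05):
    """True if the GPU memory total tracks system RAM (unified memory)."""
    if host_memtotal_bytes == 0:
        return False
    ratio = nvml_total_bytes / host_memtotal_bytes
    return abs(ratio - 1.0) <= tol

# Example with GB10-like numbers (~121 GB reported both ways):
gb = 1024 ** 3
print(is_coherent_uma(121 * gb, 122 * gb))  # unified: totals match
print(is_coherent_uma(24 * gb, 128 * gb))   # discrete 24 GB card
```

In practice the two inputs would come from `nvmlDeviceGetMemoryInfo().total` (via pynvml) and `MemTotal` in `/proc/meminfo`.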
The PSI memory pressure signal (/proc/pressure/memory) provides visibility into memory contention before swap or system freeze.
| State | Meaning |
|---|---|
| IDLE | GPU not under load — not evaluated |
| PASS | Clock healthy under load |
| LOCKED | Clock externally capped via nvidia-smi -lgc |
| THROTTLED | Low clock under load — PD issue suspected |
Current implementation uses a fixed threshold:
- Clock < 1400 MHz under sustained load → THROTTLED
This threshold is derived from field observations on GB10 systems using the spark-gpu-throttle-check tool, where healthy operation reaches ~2400 MHz and degraded systems operate in the ~500–850 MHz range.
Detection is load-gated — evaluation only occurs when GPU utilization confirms active workload.
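The load-gated classification can be sketched as follows. The 1400 MHz threshold matches the text; the 20% utilization gate and the function shape are illustrative assumptions, not sparkview's exact logic.

```python
# Sketch of load-gated clock-state classification: evaluation is skipped
# unless GPU utilization confirms an active workload. THROTTLE_MHZ matches
# the documented threshold; LOAD_GATE_PCT is an illustrative assumption.

THROTTLE_MHZ = 1400
LOAD_GATE_PCT = 20  # assumed utilization gate

def clock_state(util_pct, clock_mhz, externally_locked=False):
    """Return IDLE / LOCKED / THROTTLED / PASS for the current sample."""
    if util_pct < LOAD_GATE_PCT:
        return "IDLE"        # not under load: not evaluated
    if externally_locked:
        return "LOCKED"      # clock capped via nvidia-smi -lgc
    if clock_mhz < THROTTLE_MHZ:
        return "THROTTLED"   # low clock under load: PD issue suspected
    return "PASS"
```

The gate is what prevents false positives: an idle GPU legitimately downclocks well below 1400 MHz, so a low clock is only meaningful once utilization confirms a workload.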
- spark-gpu-throttle-check — point-in-time GPU clock diagnostic, throttle cause identification, baseline drift detection
- cuda-unified-memory-analyzer — UMA fault counts, migration bytes, and memory pressure diagnostics for GB10 and discrete GPUs
- nvidia-uma-fault-probe — cycle-accurate UMA fault latency and bandwidth measurement, C and PTX
- nvml-unified-shim — fixes NVML memory reporting on UMA platforms, MemAvailable instead of MemTotal
- dgx-forensic-collect — targeted forensic data collector for DGX Spark, EFI pstore, rasdaemon, DOE mailbox state
parallelArchitect Human-directed GPU engineering with AI assistance.
MIT