sparkview

Operator-grade GPU monitor for NVIDIA GPUs with GB10 / DGX Spark–aware unified memory handling.

Features

  • GPU utilization, temperature, power, memory — via NVML
  • Runtime detection of coherent UMA (GB10 / DGX Spark)
  • Memory display uses vm.total - vm.available for accurate UMA reporting
  • Memory pressure (PSI) — LOW / MOD / HIGH / CRITICAL from /proc/pressure/memory
  • IO pressure (IO PSI) — LOW / MOD / HIGH / CRITICAL from /proc/pressure/io
  • Load-gated clock states — IDLE / PASS / LOCKED / THROTTLED
  • CPU utilization and active core count
  • SWAP monitoring
  • TEMP row — current and session peak for GPU and CPU, color-coded green/yellow/red
  • INFO row — time, driver version, CUDA version, kernel, uptime
  • Process list sorted by GPU memory usage, scales to terminal height
  • Anomaly auto-logger — automatically logs to ~/sparkview_logs/ when issues are detected
  • Clean exit on Ctrl+C

Note: This tool is not fully validated on GB10 / DGX Spark hardware. If you run it on Spark, please open an issue with your results.

Community discussion and field results: https://forums.developer.nvidia.com/t/sparkview-gpu-monitor-tool-with-gb10-aware-unified-memory-handling/366877

Install

git clone https://github.com/parallelArchitect/sparkview.git
cd sparkview

# create a virtual environment (recommended on DGX Spark)
python3 -m venv sparkview-venv

# activate it
source sparkview-venv/bin/activate

# install dependencies
pip install nvitop psutil rich textual

Run

python3 main.py

Add a permanent alias for one-command launch from terminal:

echo "alias sparkview='source ~/sparkview/sparkview-venv/bin/activate && python3 ~/sparkview/main.py'" >> ~/.bashrc
source ~/.bashrc

Then just type sparkview in a terminal to launch.

Anomaly Logging

sparkview automatically starts logging when any of these conditions is detected:

  • PSI memory pressure reaches MOD, HIGH, or CRITICAL
  • IO pressure reaches MOD, HIGH, or CRITICAL
  • GPU clock drops to THROTTLED or LOCKED under load
  • Memory > 85% with swap active
  • GPU or CPU temperature exceeds 80°C

Logs are saved to ~/sparkview_logs/<timestamp>/:

  • anomaly.log.gz — full compressed snapshot log, one entry every 2 seconds
  • summary.json — machine-readable event summary including trigger, duration, peak temps, driver, CUDA, and kernel version
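The trigger conditions above can be sketched as a single predicate. This is an illustrative sketch only: the function name, signature, and state strings are assumptions, not sparkview's actual internal API.

```python
# Hypothetical sketch of the anomaly trigger logic listed above.
# Names and signature are illustrative, not sparkview's real API.

def should_log(mem_psi: str, io_psi: str, clock_state: str,
               mem_pct: float, swap_used: int,
               gpu_temp: float, cpu_temp: float) -> bool:
    """Return True when any anomaly condition from the list above holds."""
    elevated = {"MOD", "HIGH", "CRITICAL"}
    if mem_psi in elevated or io_psi in elevated:
        return True
    if clock_state in {"THROTTLED", "LOCKED"}:
        return True
    if mem_pct > 85.0 and swap_used > 0:
        return True
    if max(gpu_temp, cpu_temp) > 80.0:
        return True
    return False
```

Once the predicate fires, the monitor opens a new ~/sparkview_logs/&lt;timestamp&gt;/ directory and begins writing snapshots.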

IO PSI — Pipeline Starvation Detection

IO PSI (/proc/pressure/io) measures how much time tasks spend waiting on disk I/O. For ML training workloads this surfaces dataloader bottlenecks before GPU utilization drops.

Memory PSI   IO PSI     Diagnosis
LOW          CRITICAL   Pure IO bottleneck — dataloader, checkpoint write, network FS
HIGH         CRITICAL   System contention — memory reclaim and disk competing
LOW          LOW        Healthy
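Each PSI file exposes "some" and "full" stall averages in a fixed key=value format. A minimal sketch of parsing and bucketing that output follows; the classification thresholds are illustrative assumptions, not sparkview's actual cut-offs.

```python
# Minimal sketch of reading a PSI file such as /proc/pressure/io.
# Threshold values in classify() are assumed examples.

def parse_psi(text: str) -> dict:
    """Parse lines like 'some avg10=1.50 avg60=0.80 avg300=0.20 total=12345'."""
    out = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        out[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return out

def classify(avg10: float) -> str:
    """Bucket the 10-second 'some' average into the display levels."""
    if avg10 >= 50.0:
        return "CRITICAL"
    if avg10 >= 20.0:
        return "HIGH"
    if avg10 >= 5.0:
        return "MOD"
    return "LOW"

# Sample content in the kernel's PSI format
sample = (
    "some avg10=23.10 avg60=10.00 avg300=2.00 total=900000\n"
    "full avg10=4.00 avg60=1.00 avg300=0.10 total=100000\n"
)
psi = parse_psi(sample)
```

In practice the same parser serves both /proc/pressure/memory and /proc/pressure/io, since the kernel uses one format for all PSI resources.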

GB10 / DGX Spark

On coherent UMA platforms, nvmlDeviceGetMemoryInfo may report total ≈ MemTotal (~121 GB). This does not reflect allocatable memory.

sparkview detects this condition at runtime and uses vm.total - vm.available for used memory and vm.total as the display total — accurate under any workload including heavy inference loads.
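The arithmetic is simple but worth pinning down. A sketch of the UMA-aware computation, using psutil-style total/available figures (the helper name and example numbers are illustrative):

```python
# Sketch of the UMA-aware memory display described above.
# uma_memory is a hypothetical helper name; the inputs correspond to
# psutil.virtual_memory().total and .available on a live system.

def uma_memory(vm_total: int, vm_available: int) -> tuple:
    """Return (used, total) as displayed on coherent UMA platforms:
    used = total - available, instead of trusting NVML's per-device figure."""
    return vm_total - vm_available, vm_total

# e.g. a 128 GiB GB10-class system with 96 GiB still available
GIB = 1024 ** 3
used, total = uma_memory(128 * GIB, 96 * GIB)
```

Using "available" rather than "free" matters here: page cache counts as reclaimable, so total - available tracks what a new allocation could actually obtain.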

The PSI memory pressure signal (/proc/pressure/memory) provides visibility into memory contention before swap or system freeze.

Clock States

State       Meaning
IDLE        GPU not under load — not evaluated
PASS        Clock healthy under load
LOCKED      Clock externally capped via nvidia-smi -lgc
THROTTLED   Low clock under load — power-delivery issue suspected

Current implementation uses a fixed threshold:

  • Clock < 1400 MHz under sustained load → THROTTLED

This threshold is derived from field observations on GB10 systems using the spark-gpu-throttle-check tool, where healthy operation reaches ~2400 MHz and degraded systems operate in the ~500–850 MHz range.

Detection is load-gated — evaluation only occurs when GPU utilization confirms active workload.
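The load-gated check can be sketched as below. The 1400 MHz figure is the threshold stated above; the 20% utilization gate and the function shape are assumptions for illustration only.

```python
# Illustrative sketch of the load-gated clock-state classification.
# The 1400 MHz threshold is from the text; the 20% gate is assumed.

def clock_state(util_pct: float, clock_mhz: int, locked: bool = False) -> str:
    """Classify GPU clock health, evaluating only under confirmed load."""
    if util_pct < 20.0:          # assumed load gate: skip evaluation when idle
        return "IDLE"
    if locked:                   # clocks pinned externally, e.g. nvidia-smi -lgc
        return "LOCKED"
    if clock_mhz < 1400:         # below healthy-load threshold
        return "THROTTLED"
    return "PASS"
```

The gate is what keeps a legitimately down-clocked idle GPU from being flagged: 500 MHz at 0% utilization is normal power management, while 500 MHz at 95% utilization is the degraded behavior the threshold targets.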

Related Tools

  • spark-gpu-throttle-check — point-in-time GPU clock diagnostic, throttle cause identification, baseline drift detection
  • cuda-unified-memory-analyzer — UMA fault counts, migration bytes, and memory pressure diagnostics for GB10 and discrete GPUs
  • nvidia-uma-fault-probe — cycle-accurate UMA fault latency and bandwidth measurement, C and PTX
  • nvml-unified-shim — fixes NVML memory reporting on UMA platforms, MemAvailable instead of MemTotal
  • dgx-forensic-collect — targeted forensic data collector for DGX Spark, EFI pstore, rasdaemon, DOE mailbox state

Author

parallelArchitect

Human-directed GPU engineering with AI assistance.

License

MIT
