ruvector-profiler

License: MIT

Memory, power, and latency profiling hooks with CSV emitters — the observability layer for attention benchmarking.

| Dimension | What It Measures | Output |
|-----------|------------------|--------|
| Memory | RSS, KV-cache, activations, temp buffers | MemoryReport + CSV |
| Power | Wattage samples, trapezoidal energy integration | EnergyResult + CSV |
| Latency | p50/p95/p99, mean, std | LatencyStats + CSV |
| Config | SHA-256 fingerprint of all parameters | Reproducibility hash |

Overview

This crate instruments benchmark runs along three profiling dimensions (memory pressure, energy consumption, and latency distribution) and exports the results to CSV files for downstream analysis. It is the observability layer of the ruvector attention benchmarking pipeline, sitting between the attention operators (ruvector-attn-mincut) and the analysis/plotting stage.

Every benchmark run is tagged with a SHA-256 config fingerprint so that results are reproducible and auditable across machines.

Modules

| Module | Purpose |
|--------|---------|
| memory | MemoryTracker with RSS snapshots and peak tracking |
| power | PowerTracker with PowerSource trait and trapezoidal integration |
| latency | LatencyStats computing p50/p95/p99 from LatencyRecord samples |
| csv_emitter | write_results_csv, write_latency_csv, write_memory_csv |
| config_hash | BenchConfig with SHA-256 fingerprinting for reproducibility |

Usage Example: Full Benchmark Loop

use ruvector_profiler::*;

// Tag this run with a reproducible fingerprint
let config = BenchConfig {
    model_commit: "abc1234".into(),
    weights_hash: "def5678".into(),
    lambda: 0.5, tau: 2, eps: 0.01,
    compiler_flags: "-O3".into(),
};
println!("Run fingerprint: {}", config_hash(&config));

// Set up trackers
let mut mem = MemoryTracker::new("mincut_l0.5_t2");
let source = MockPowerSource { watts: 75.0 };
let mut pwr = PowerTracker::new("gpu");
let mut latencies = Vec::new();

for i in 0..1000 {
    mem.snapshot();
    pwr.sample(&source);
    let start = std::time::Instant::now();

    // ... run attention operator ...

    let elapsed = start.elapsed().as_micros() as u64;
    latencies.push(LatencyRecord {
        sample_id: i, wall_time_us: elapsed,
        kernel_time_us: elapsed, seq_len: 128,
    });
}

// Aggregate
let stats = compute_latency_stats(&latencies);
let report = mem.report();
let energy = pwr.energy();

println!("Peak RSS: {} bytes | p95: {} us | Energy: {:.3} J",
    report.peak_rss, stats.p95_us, energy.total_joules);

// Export to CSV
write_latency_csv("results/latency.csv", &latencies).unwrap();
write_memory_csv("results/memory.csv", &mem.snapshots).unwrap();

Memory Profiling

MemoryTracker captures RSS snapshots by reading /proc/self/status on Linux; on other platforms it falls back to reporting zero. Each MemorySnapshot records:

| Field | Description |
|-------|-------------|
| peak_rss_bytes | Resident set size at capture time |
| kv_cache_bytes | Estimated KV-cache allocation |
| activation_bytes | Activation tensor memory |
| temp_buffer_bytes | Temporary working buffers |
| timestamp_us | Microsecond UNIX timestamp |

MemoryTracker::report() aggregates snapshots into a MemoryReport with peak_rss, mean_rss, kv_cache_total, and activation_total.
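The RSS capture described above can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: the function names `parse_vm_rss_bytes` and `current_rss_bytes` are hypothetical, and the choice of the `VmRSS` field (rather than, say, `VmHWM`) is an assumption about which line of `/proc/self/status` is read.

```rust
use std::fs;

/// Extract the `VmRSS` value from the text of `/proc/self/status`, in bytes.
/// The kernel reports the field in kibibytes, e.g. `VmRSS:\t    4096 kB`.
/// (Hypothetical helper; the field choice is an assumption.)
pub fn parse_vm_rss_bytes(status: &str) -> Option<u64> {
    let line = status.lines().find(|l| l.starts_with("VmRSS:"))?;
    // Tokens are: "VmRSS:", "<number>", "kB".
    let kb: u64 = line.split_whitespace().nth(1)?.parse().ok()?;
    Some(kb * 1024)
}

/// Read the current process RSS in bytes, returning 0 when
/// `/proc/self/status` cannot be read or parsed (for example off Linux).
pub fn current_rss_bytes() -> u64 {
    fs::read_to_string("/proc/self/status")
        .ok()
        .and_then(|s| parse_vm_rss_bytes(&s))
        .unwrap_or(0)
}
```

Keeping the parser separate from the file read makes the zero fallback explicit and lets the parsing logic be tested on any platform with a fixed string.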

Power Profiling

PowerTracker collects wattage readings from any PowerSource implementation. Energy is computed via trapezoidal integration over the sample timeline, yielding an EnergyResult with total_joules, mean_watts, peak_watts, and duration_s. A MockPowerSource is provided for deterministic tests.

use ruvector_profiler::PowerSource;

struct NvmlPowerSource { /* device handle */ }
impl PowerSource for NvmlPowerSource {
    fn read_watts(&self) -> f64 { todo!("read from NVML/RAPL") }
}
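To make the trapezoidal integration concrete, here is a small standalone sketch. The `PowerSample` type and `trapezoidal_joules` function are hypothetical names introduced for illustration; how PowerTracker stores its timestamps internally is not specified here, so the sketch assumes each sample carries a time offset in seconds.

```rust
/// One power reading: time offset in seconds and instantaneous watts.
/// (Hypothetical type for illustration only.)
#[derive(Debug, Clone, Copy)]
pub struct PowerSample {
    pub t_s: f64,
    pub watts: f64,
}

/// Integrate power over time with the trapezoidal rule and return joules.
/// Each adjacent pair contributes (w0 + w1) / 2 * (t1 - t0).
/// Fewer than two samples span no time, so the result is 0.0.
pub fn trapezoidal_joules(samples: &[PowerSample]) -> f64 {
    samples
        .windows(2)
        .map(|w| {
            let dt = w[1].t_s - w[0].t_s;
            0.5 * (w[0].watts + w[1].watts) * dt
        })
        .sum()
}
```

With a constant 75 W source sampled over 2 seconds, the sum comes to 150 J, which matches power multiplied by duration.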

Latency Profiling

compute_latency_stats takes a slice of LatencyRecord and returns LatencyStats with p50_us, p95_us, p99_us, mean_us, std_us, and sample count n. Records need not be pre-sorted.
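One way such statistics can be computed is sketched below. The percentile method (nearest rank) and the use of population standard deviation are assumptions, since the text does not say which definitions compute_latency_stats uses; `Summary`, `nearest_rank`, and `summarize` are hypothetical names.

```rust
/// Summary statistics over latency samples in microseconds.
/// (Hypothetical type mirroring the fields listed above.)
#[derive(Debug)]
pub struct Summary {
    pub p50_us: u64,
    pub p95_us: u64,
    pub p99_us: u64,
    pub mean_us: f64,
    pub std_us: f64,
    pub n: usize,
}

/// Nearest-rank percentile on a sorted, non-empty slice:
/// the value at 1-based rank ceil(pct * n / 100).
fn nearest_rank(sorted: &[u64], pct: usize) -> u64 {
    let n = sorted.len();
    let rank = (pct * n + 99) / 100; // integer ceiling of pct * n / 100
    sorted[rank.clamp(1, n) - 1]
}

/// Sort a copy of the input, then compute percentiles, mean, and
/// population standard deviation. Returns None for an empty slice.
pub fn summarize(samples_us: &[u64]) -> Option<Summary> {
    if samples_us.is_empty() {
        return None;
    }
    let mut sorted = samples_us.to_vec();
    sorted.sort_unstable();
    let n = sorted.len();
    let mean = sorted.iter().map(|&v| v as f64).sum::<f64>() / n as f64;
    let var = sorted
        .iter()
        .map(|&v| {
            let d = v as f64 - mean;
            d * d
        })
        .sum::<f64>()
        / n as f64;
    Some(Summary {
        p50_us: nearest_rank(&sorted, 50),
        p95_us: nearest_rank(&sorted, 95),
        p99_us: nearest_rank(&sorted, 99),
        mean_us: mean,
        std_us: var.sqrt(),
        n,
    })
}
```

Sorting a private copy is what allows callers to pass records in any order without having their data reordered.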

CSV Output Formats

write_results_csv -- Aggregate summary

setting,coherence_delta,kv_cache_reduction,peak_mem_reduction,energy_reduction,p95_latency_us,accuracy
mincut_l0.5_t2,-0.003,0.25,0.18,0.12,1150,0.994

write_latency_csv -- Per-sample latency

sample_id,wall_time_us,kernel_time_us,seq_len
0,850,780,128

write_memory_csv -- Per-snapshot memory

timestamp_us,peak_rss_bytes,kv_cache_bytes,activation_bytes,temp_buffer_bytes
1700000000000000,4194304,1048576,2097152,524288
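The per-sample latency schema above can be produced with nothing more than `std::io::Write`. The sketch below is illustrative only: `LatencyRow` and `write_latency_rows` are hypothetical stand-ins and do not reflect how write_latency_csv is actually implemented.

```rust
use std::io::{self, Write};

/// One row of the per-sample latency CSV.
/// (Hypothetical stand-in for the crate's LatencyRecord.)
pub struct LatencyRow {
    pub sample_id: u64,
    pub wall_time_us: u64,
    pub kernel_time_us: u64,
    pub seq_len: u64,
}

/// Write the header line followed by one comma-separated line per row.
/// Accepting any `Write` lets the same code target a file or a buffer.
pub fn write_latency_rows<W: Write>(mut out: W, rows: &[LatencyRow]) -> io::Result<()> {
    writeln!(out, "sample_id,wall_time_us,kernel_time_us,seq_len")?;
    for r in rows {
        writeln!(
            out,
            "{},{},{},{}",
            r.sample_id, r.wall_time_us, r.kernel_time_us, r.seq_len
        )?;
    }
    Ok(())
}
```

Passing a `Vec<u8>` as the writer is a convenient way to check the exact bytes emitted before pointing the function at a real file.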

Config Fingerprinting

BenchConfig captures all parameters defining a benchmark run. config_hash produces a 64-character SHA-256 hex digest of the JSON-serialized config.

use ruvector_profiler::{BenchConfig, config_hash};

let config = BenchConfig {
    model_commit: "abc1234".into(), weights_hash: "def5678".into(),
    lambda: 0.5, tau: 2, eps: 0.01, compiler_flags: "-O3".into(),
};
assert_eq!(config_hash(&config).len(), 64);

Integration with run_mincut_bench.sh

The scripts/run_mincut_bench.sh script orchestrates a full benchmark run:

run_mincut_bench.sh
  +-- cargo build --release (-p attn-mincut, coherence, profiler)
  +-- Baseline softmax run --> baseline.csv
  +-- Grid search (lambda x tau) --> per-setting CSV + witness JSONL
  +-- Aggregate metrics --> results.csv
  +-- Pack witness bundle --> witness.rvf

CSV files follow the schemas above. Use config_hash to link results back to their exact configuration.

Tutorial: Running a Complete Min-Cut Benchmark

Step 1: Set up config and trackers

use ruvector_profiler::*;

let config = BenchConfig {
    model_commit: "abc1234".into(),
    weights_hash: "def5678".into(),
    lambda: 0.5, tau: 2, eps: 0.01,
    compiler_flags: "-O3 -mavx2".into(),
};
println!("Config fingerprint: {}", config_hash(&config));

let mut mem_tracker = MemoryTracker::new("mincut_l0.5_t2");
let power_source = MockPowerSource { watts: 75.0 };
let mut power_tracker = PowerTracker::new("gpu");

Step 2: Run benchmark loop

let mut latencies = Vec::new();
for i in 0..1000 {
    mem_tracker.snapshot();
    power_tracker.sample(&power_source);
    let start = std::time::Instant::now();
    // ... run attn_mincut() ...
    // Read the clock once so both fields record the same measurement.
    let elapsed_us = start.elapsed().as_micros() as u64;
    latencies.push(LatencyRecord {
        sample_id: i,
        wall_time_us: elapsed_us,
        kernel_time_us: elapsed_us, // no separate kernel timer here, so reuse wall time
        seq_len: 128,
    });
}

Step 3: Export results

let stats = compute_latency_stats(&latencies);
let report = mem_tracker.report();
let energy = power_tracker.energy();

write_latency_csv("results/latency.csv", &latencies).unwrap();
write_memory_csv("results/memory.csv", &mem_tracker.snapshots).unwrap();

println!("Peak RSS: {} | p95: {}us | Energy: {:.3}J",
    report.peak_rss, stats.p95_us, energy.total_joules);

Step 4: Use the benchmark script

# Full grid search: 1000 samples x 6 settings
./scripts/run_mincut_bench.sh --samples 1000

# Custom grid
./scripts/run_mincut_bench.sh --lambda "0.3 0.5 0.7" --tau "0 2" --seed 42

Expected output structure

results/mincut-bench/
  csv/
    baseline.csv           # Softmax reference
    mincut_l0.3_t0.csv     # Per-setting results
    mincut_l0.3_t2.csv
    ...
    results.csv            # Aggregate comparison
  witness/
    mincut_l0.3_t0.jsonl   # SHA-256 witness chains
    witness.rvf            # RVF-packed bundle
  figs/                    # Generated plots
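The per-setting file names in this listing appear to follow a `mincut_l<lambda>_t<tau>` pattern. The sketch below enumerates the labels for a lambda-by-tau grid under that assumption; the pattern is inferred from the file names shown, and `setting_labels` is a hypothetical helper, not part of the crate or the script.

```rust
/// Build one label per (lambda, tau) pair, lambda-major, using the
/// `mincut_l<lambda>_t<tau>` pattern inferred from the output listing.
/// (Hypothetical helper; the naming pattern is an assumption.)
pub fn setting_labels(lambdas: &[f64], taus: &[u32]) -> Vec<String> {
    let mut labels = Vec::with_capacity(lambdas.len() * taus.len());
    for &lambda in lambdas {
        for &tau in taus {
            // `{}` prints the shortest decimal form of the float, e.g. 0.3.
            labels.push(format!("mincut_l{}_t{}", lambda, tau));
        }
    }
    labels
}
```

For the custom grid from Step 4 (lambda values 0.3, 0.5, 0.7 and tau values 0, 2) this yields six labels, one per setting.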

Related Crates

| Crate | Role |
|-------|------|
| ruvector-attn-mincut | Attention operators being profiled |
| ruvector-coherence | Quality metrics fed into ResultRow |
| ruvector-solver | Sublinear solvers for graph analytics |

License

Licensed under the MIT License.