Add pixelization imaging profiling: A100 + RTX 2060 + CPU sweep #57

Merged: Jammy2211 merged 1 commit into main from feature/pixelization-profiling-a100 on May 10, 2026

Summary

Adds long-term tracking artifacts for the rectangular pixelization imaging likelihood under jax_profiling/results/jit/imaging/pixelization/ — six configs side-by-side (CPU/GPU × fp64/mp on consumer hardware + A100 fp64/mp). Generated by new tooling in z_projects/profiling/scripts/ (separate local-only commit, no PR target).

Likelihood: Sersic + Isothermal + ExternalShear lens with a RectangularAdaptDensity(28, 28) source mesh + Constant regularization. Mirrors the canonical reference at jax_profiling/jit/imaging/pixelization.py. Companion to the MGE sweep merged in #56 — same harness, different model, an extra three steps (Overlay grid, Regularization matrix H, Regularized reconstruction) on top of the MGE 8-step pipeline.

Headline numbers

| Config | Full pipeline | vmap per call |
| --- | --- | --- |
| hpc_a100_fp64 | 9.7 ms | 12.3 ms |
| hpc_a100_mp | 10.1 ms | 12.4 ms |
| local_gpu_fp64 (RTX 2060) | 212.2 ms | 233.1 ms |
| local_gpu_mp | 192.6 ms | 212.1 ms |
| local_cpu_fp64 | 2379.5 ms | 2157.6 ms |
| local_cpu_mp | 1670.1 ms | 1878.5 ms |
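
The ratios quoted in the findings can be recomputed directly from the headline numbers. The snippet below is only a convenience sketch: the dictionary layout and variable names are illustrative and are not taken from comparison.json.

```python
# Headline timings in ms, copied from the table above: (full pipeline, vmap per call).
timings_ms = {
    "hpc_a100_fp64": (9.7, 12.3),
    "hpc_a100_mp": (10.1, 12.4),
    "local_gpu_fp64": (212.2, 233.1),
    "local_gpu_mp": (192.6, 212.1),
    "local_cpu_fp64": (2379.5, 2157.6),
    "local_cpu_mp": (1670.1, 1878.5),
}

baseline_full_ms = timings_ms["hpc_a100_fp64"][0]

# How many times slower each config is than A100 fp64 on the full pipeline.
slowdown_vs_a100 = {
    config: full_ms / baseline_full_ms for config, (full_ms, _) in timings_ms.items()
}

# vmap per-call time divided by single-JIT time; values above 1 mean vmap is slower.
vmap_over_full = {
    config: vmap_ms / full_ms for config, (full_ms, vmap_ms) in timings_ms.items()
}

for config in timings_ms:
    print(f"{config:16s} {slowdown_vs_a100[config]:7.1f}x  vmap/full = {vmap_over_full[config]:.2f}")
```

A vmap/full value above 1 means the batched path costs more per call than the single-JIT path.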

Key findings

  • A100 fp64 is 22× faster than RTX 2060 fp64 on the full single-JIT pipeline, and 245× faster than CPU fp64. The production-vs-consumer gap is materially wider than for MGE (PR #56 measured 7.7× for A100 vs RTX 2060). Pixelization's dense linear algebra benefits more from the A100's tensor cores and memory bandwidth than MGE's smaller linear-LP solve does.
  • Bottleneck shifts dramatically across device classes.
    • On CPU: Curvature matrix F (1317 ms) + Inversion setup (1228 ms) account for ~90% of step total.
    • On RTX 2060: F (102 ms / 48%), Inversion setup (63 ms / 30%) and Reconstruction NNLS (60 ms / 28%) are the three main contributors, with F the largest.
    • On A100: F construction collapses to 0.53 ms (about 1/190th of the RTX 2060 time and about 1/2500th of the CPU time). NNLS reconstruction (6.8 ms) becomes ~70% of step total. Optimising NNLS is the next throughput lever for production hardware.
  • Mixed precision is a no-op on GPUs (within noise: A100 mp is ~4% slower than fp64; RTX 2060 mp is ~10% faster). On CPU mp gives ~30% full-pipeline speedup, mostly from F construction (1317 → 875 ms). The use_mixed_precision flag remains a CPU lever, not a GPU one — same conclusion as MGE.
  • vmap does not help pixelization (per-call time is 0.9–1.3× the single-JIT time across the six configs). Contrast with MGE, which gets ~2× from vmap. Root cause is the inherently iterative NNLS solve in reconstruction_positive_only_from, which does not batch usefully. Batched pixelization evaluation needs a different reconstruction strategy.
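
For readers unfamiliar with how the "full pipeline" and "vmap per call" columns can be measured, here is a minimal timing sketch in JAX. It is not the profiler used for this PR: toy_log_likelihood is a hypothetical stand-in for the real likelihood, the batch size of 4 is arbitrary, and dividing the batched wall time by the batch size is an assumption about how "vmap per call" is defined.

```python
import time

import jax
import jax.numpy as jnp


def toy_log_likelihood(params):
    # Hypothetical stand-in: a small dense solve, NOT the pixelization likelihood.
    a = jnp.outer(params, params) + jnp.eye(params.shape[0])
    x = jnp.linalg.solve(a, params)
    return -0.5 * jnp.dot(x, x)


def time_call(fn, arg, repeats=5):
    # The first call triggers compilation, so it is excluded from the timing.
    fn(arg).block_until_ready()
    start = time.perf_counter()
    for _ in range(repeats):
        # block_until_ready waits for the asynchronous dispatch to finish.
        fn(arg).block_until_ready()
    return (time.perf_counter() - start) / repeats


params = jnp.linspace(0.1, 1.0, 16)
batch = jnp.stack([params * (1.0 + 0.01 * i) for i in range(4)])

jit_fn = jax.jit(toy_log_likelihood)
vmap_fn = jax.jit(jax.vmap(toy_log_likelihood))

single_s = time_call(jit_fn, params)
per_call_s = time_call(vmap_fn, batch) / batch.shape[0]

print(f"single JIT call: {single_s * 1e3:.3f} ms")
print(f"vmap per call:   {per_call_s * 1e3:.3f} ms")
```

On a toy function like this, vmap will usually amortise well; the finding above is that the real pixelization likelihood does not.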

Caveats

  • A100 JIT log-evidence shows fp32-level truncation (26232.3516 vs eager numpy reference 26232.0686). Same root cause as PR #56: the HPC PyAutoNSS venv does not have jax_enable_x64=True. This doesn't affect timing data here, and the assertion uses rtol=1e-2 for mp paths to absorb it. Worth confirming before quoting A100-served log Z values to high precision.
  • The vmap regression vs single-JIT for pixelization is real, not measurement noise. It reproduces in five of the six configs (local_cpu_fp64 is the exception, with vmap per call ~9% faster) and is consistent with NNLS being serial. The comparison.json headline section captures both numbers explicitly.
  • Local sweep timings vary across sessions due to JAX cache state and GPU thermal state. Cross-platform comparisons (A100 vs RTX 2060 vs CPU) are robust; single-machine cross-session deltas are not.
  • The chart (comparison.png) uses log scale on the y axis to make the A100 / RTX 2060 / CPU classes coexist legibly, since they span ~3 orders of magnitude.
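
The fp32 reading of the first caveat can be sanity-checked with NumPy alone. The sketch below uses only the two values quoted in this PR; it illustrates float32 resolution at this magnitude and does not reproduce the A100 run.

```python
import numpy as np

EXPECTED_LOG_EVIDENCE_HST = 26232.068573757562  # constant quoted in this PR
a100_jit_value = 26232.3516  # A100 JIT log-evidence quoted in the caveat

# Spacing between adjacent representable values near 26232 in each precision.
ulp32 = float(np.spacing(np.float32(EXPECTED_LOG_EVIDENCE_HST)))
ulp64 = float(np.spacing(np.float64(EXPECTED_LOG_EVIDENCE_HST)))

# Nearest float32 to the quoted A100 value.
nearest_fp32 = float(np.float32(a100_jit_value))

# Relative gap between the A100 JIT value and the eager reference.
rel_diff = abs(a100_jit_value - EXPECTED_LOG_EVIDENCE_HST) / EXPECTED_LOG_EVIDENCE_HST

print(f"float32 spacing: {ulp32:.9f}")
print(f"float64 spacing: {ulp64:.3e}")
print(f"nearest float32 to A100 value: {nearest_fp32:.7f}")
print(f"relative difference: {rel_diff:.3e}")
```

The quoted A100 value matches a float32-representable number to the four decimals shown, which is consistent with the computation running in single precision. In JAX, 64-bit mode is normally enabled with `jax.config.update("jax_enable_x64", True)` or the `JAX_ENABLE_X64` environment variable. How the HPC venv should set it is outside what this PR states.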

Generated by

  • z_projects/profiling/scripts/pixelization_profile.py: single-config 11-step JIT profiler (per-step timings + full pipeline + vmap + memory analysis). Argparse-driven; honours PYAUTO_ROOT for worktree-aware canonical writes.
  • z_projects/profiling/scripts/pixelization_aggregate.py: --ingest-pre-fix /tmp (no-op unless artifacts are present); --consolidate-from <staging> to move HPC pulls into this canonical dir; with no flags, emits comparison.json + comparison.png.
  • z_projects/profiling/scripts/_setup_pixelization.py: shared build_dataset / build_model / build_analysis, so the canonical reference's EXPECTED_LOG_EVIDENCE_HST = 26232.068573757562 constant carries through and is asserted on every run.
  • z_projects/profiling/hpc/batch_gpu/submit_pixelization_profile_{fp64,mp}: A100 SLURM submit scripts.
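
Since the tooling itself is not part of this PR, the following is only a guess at how the aggregator's command line could be wired up. The two flag names and PYAUTO_ROOT come from the description above. The helper names, the mode-selection order, and the assumption that the results directory sits directly under PYAUTO_ROOT are all hypothetical.

```python
import argparse
import os
from pathlib import Path


def build_parser():
    # Flag names are taken from the PR description; help texts are paraphrased.
    parser = argparse.ArgumentParser(
        description="Aggregate pixelization profiling results (illustrative sketch)."
    )
    parser.add_argument("--ingest-pre-fix", metavar="DIR", default=None,
                        help="ingest older artifacts from DIR, if any are present")
    parser.add_argument("--consolidate-from", metavar="STAGING", default=None,
                        help="move HPC pulls from STAGING into the canonical dir")
    return parser


def results_dir(env=None):
    # Assumption: the canonical results dir lives directly under PYAUTO_ROOT.
    env = os.environ if env is None else env
    root = Path(env.get("PYAUTO_ROOT", "."))
    return root / "jax_profiling" / "results" / "jit" / "imaging" / "pixelization"


def choose_mode(args):
    # Hypothetical dispatch: with no flags, fall through to emitting the comparison.
    if args.ingest_pre_fix is not None:
        return "ingest"
    if args.consolidate_from is not None:
        return "consolidate"
    return "compare"


args = build_parser().parse_args(["--consolidate-from", "staging"])
print(choose_mode(args), results_dir({"PYAUTO_ROOT": "/tmp/worktree"}))
```

Passing the environment in as a dictionary keeps the path logic testable without touching the real PYAUTO_ROOT.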

The z_projects/profiling/ sources are committed to their own (remote-less) main; only the result artifacts in this PR are version-tracked here.

Test plan

  • All 6 JSON files schema-valid (parsed cleanly by pixelization_aggregate.py)
  • comparison.json + comparison.png regenerated end-to-end
  • No untracked or modified files in jax_profiling/results/jit/imaging/ outside the new pixelization/ subdir
  • Existing legacy per-version flat summary files (pixelization_likelihood_summary_hst_v*.{json,png}) untouched
  • Eager log_evidence regression assertion: 26232.068574 matches EXPECTED_LOG_EVIDENCE_HST (rtol=1e-4)
  • Full-pipeline JIT + vmap log_evidence assertions pass (rtol=1e-4 fp64, rtol=1e-2 mp)

🤖 Generated with Claude Code

…00 + RTX 2060 + CPU sweep

Six configs side-by-side for the rectangular pixelization imaging
likelihood (Sersic + Isothermal + ExternalShear lens with a
RectangularAdaptDensity(28,28) source mesh + Constant regularization)
covering consumer hardware (RTX 2060 Max-Q + i9-10885H), production
A100, and both fp64 + mixed-precision variants. Generated by new
tooling in z_projects/profiling/scripts/ (separate local-only commit;
no PR target).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>