Add MGE imaging profiling: A100 + RTX 2060 + CPU sweep #56

Merged
Jammy2211 merged 1 commit into main from feature/mge-profiling-a100
May 9, 2026

Conversation

@Jammy2211
Contributor

Summary

Adds long-term tracking artifacts for the MGE imaging likelihood under jax_profiling/results/jit/imaging/mge/ — 10 configs side-by-side covering pre-fix vs post-fix FFT precision flow on consumer hardware (RTX 2060 Max-Q + i9-10885H) plus the production A100 baseline. Generated by new tooling in z_projects/profiling/scripts/ (separate local-only commit; no PR).

Headline numbers

| Config | Full pipeline | vmap per call |
|---|---|---|
| hpc_a100_fp64 | 5.7 ms | 2.4 ms |
| hpc_a100_mp | 5.4 ms | 2.3 ms |
| local_gpu_fp64 (RTX 2060) | 43.7 ms | 23.9 ms |
| local_gpu_mp | 43.0 ms | 15.0 ms |
| local_gpu_fp64_pre_fix | 37.1 ms | 17.6 ms |
| local_gpu_mp_pre_fix | 47.0 ms | 12.2 ms |
| local_cpu_fp64 | 308.3 ms | 233.6 ms |
| local_cpu_mp | 322.2 ms | 232.3 ms |
| local_cpu_fp64_pre_fix | 208.6 ms | 128.3 ms |
| local_cpu_mp_pre_fix | 193.3 ms | 120.4 ms |

Key findings

  • A100 fp64 is 7.7× faster than RTX 2060 fp64 on full single-JIT, 10× faster on vmap. Production hardware is in a different regime — the consumer-laptop story we used during the FFT precision audit was not representative.
  • Mixed precision delivers ~zero speedup on A100 (5% noise level). A100's fp64:fp32 throughput ratio is 1:2 vs RTX 2060's 1:32 — fp64 is not punitive on production hardware. The use_mixed_precision flag is a consumer-GPU lever, not a production one.
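  • The FFT precision fix (PyAutoArray#302) is visible in the gap between local_gpu_mp_pre_fix (47.0 ms full pipeline; the bug ran a full fp64 FFT and then narrowed the result) and local_gpu_mp (43.0 ms; a real complex64 FFT). Most of the user-facing win lives in vmap, a figure that is sensitive to JAX cache and thermal state: in this sweep vmap per call measured 12.2 ms pre-fix and 15.0 ms post-fix, captured in different sessions.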
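
The ratios quoted in these findings can be re-derived from the headline table. The snippet below is a minimal sketch of that arithmetic only; the dictionary layout and the `speedup` helper are illustrative and are not part of the profiling scripts.

```python
# Headline timings in milliseconds, copied from the table above.
timings_ms = {
    "hpc_a100_fp64": {"full": 5.7, "vmap": 2.4},
    "hpc_a100_mp": {"full": 5.4, "vmap": 2.3},
    "local_gpu_fp64": {"full": 43.7, "vmap": 23.9},
    "local_gpu_mp": {"full": 43.0, "vmap": 15.0},
}


def speedup(slow: str, fast: str, metric: str) -> float:
    """Return how many times faster `fast` is than `slow` for one metric."""
    return timings_ms[slow][metric] / timings_ms[fast][metric]


# A100 vs RTX 2060, both in fp64.
a100_vs_rtx_full = speedup("local_gpu_fp64", "hpc_a100_fp64", "full")
a100_vs_rtx_vmap = speedup("local_gpu_fp64", "hpc_a100_fp64", "vmap")

# Gain from mixed precision on each GPU.
a100_mp_gain_full = speedup("hpc_a100_fp64", "hpc_a100_mp", "full")
a100_mp_gain_vmap = speedup("hpc_a100_fp64", "hpc_a100_mp", "vmap")
rtx_mp_gain_vmap = speedup("local_gpu_fp64", "local_gpu_mp", "vmap")

print(f"A100 vs RTX 2060 (fp64), full pipeline: {a100_vs_rtx_full:.1f}x")
print(f"A100 vs RTX 2060 (fp64), vmap per call: {a100_vs_rtx_vmap:.1f}x")
print(f"A100 mixed-precision gain, full pipeline: {a100_mp_gain_full:.2f}x")
print(f"A100 mixed-precision gain, vmap per call: {a100_mp_gain_vmap:.2f}x")
print(f"RTX 2060 mixed-precision gain, vmap per call: {rtx_mp_gain_vmap:.2f}x")
```

Because the inputs are single-session measurements rounded to 0.1 ms, the A100 mixed-precision gains in particular should be read as within noise rather than as a measured effect.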

Caveats

  • A100 JIT log-likelihood shows fp32-level truncation (-159734.59375) vs the eager numpy reference (-159736.355042208). Looks like jax_enable_x64 is not set in the HPC PyAutoNSS venv. Doesn't affect timing data here, but worth confirming before quoting A100-served NSS / Nautilus log Z values to ~1e-3 precision. Filing a separate prompt to investigate.
  • Local sweep timings vary across sessions due to JAX cache state and GPU thermal state. Cross-platform comparisons (A100 vs RTX 2060 vs CPU) are robust; single-machine cross-session deltas are not. The pre-fix vs post-fix comparison was captured in different sessions and the apparent post-fix CPU regression is most likely cache / system noise rather than a real code regression — the integration tests in PyAutoArray#302 / autogalaxy_workspace_test#38 stayed within rtol=1e-4.
  • The chart (comparison.png) uses log scale on the y axis to make the A100/RTX 2060/CPU classes coexist legibly.
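
One quick way to sanity-check the "fp32-level truncation" reading of the first caveat is to test whether the A100 value lies exactly on the float32 grid while the eager reference does not. The sketch below uses only the Python standard library; the helper names are illustrative, and it does not establish why the JIT path ran at that precision.

```python
import math
import struct


def to_float32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value, returned as a float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def float32_spacing(x: float) -> float:
    """Gap between adjacent fp32 values near x (normal range only)."""
    _, exponent = math.frexp(abs(x))  # abs(x) = m * 2**exponent with 0.5 <= m < 1
    return 2.0 ** (exponent - 24)     # fp32 carries a 24-bit significand


reference = -159736.355042208  # eager numpy value quoted in the caveat
a100_jit = -159734.59375       # A100 JIT value quoted in the caveat

a100_on_fp32_grid = to_float32(a100_jit) == a100_jit
reference_on_fp32_grid = to_float32(reference) == reference
spacing = float32_spacing(a100_jit)
abs_gap = abs(a100_jit - reference)
rel_gap = abs_gap / abs(reference)

print(f"A100 value exactly representable in fp32: {a100_on_fp32_grid}")
print(f"Reference exactly representable in fp32:  {reference_on_fp32_grid}")
print(f"fp32 spacing at this magnitude: {spacing}")
print(f"absolute gap: {abs_gap:.6f}  relative gap: {rel_gap:.2e}")
```

The gap is many fp32 spacings wide, which would be consistent with rounding error accumulating through an fp32 computation rather than a single final cast. If the venv does turn out to be missing the flag, the usual switches are `jax.config.update("jax_enable_x64", True)` at startup or the `JAX_ENABLE_X64` environment variable; whether that is the cause here is still unconfirmed.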

Generated by

  • z_projects/profiling/scripts/mge_profile.py — single-config step-by-step JIT profiler (per-step timings + full pipeline + vmap + memory analysis).
  • z_projects/profiling/scripts/mge_aggregate.py: --ingest-pre-fix /tmp converts old-schema /tmp logs; --consolidate-from <staging> moves HPC pulls into this canonical dir; with no flags it emits comparison.json + comparison.png.
  • z_projects/profiling/hpc/batch_gpu/submit_mge_profile_{fp64,mp} — A100 SLURM submits.

The z_projects/profiling/ source side commits to its own (remote-less) main; only the result artifacts in this PR are version-tracked.

Test plan

  • All 10 JSON files schema-valid (parsed cleanly by mge_aggregate.py)
  • comparison.json + comparison.png regenerated end-to-end
  • No untracked or modified files in jax_profiling/results/jit/imaging/ outside the new mge/ subdir
  • Existing per-version flat summary files (mge_likelihood_summary_hst_v*.{json,png}) untouched
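
The first check in the test plan could also be reproduced independently of mge_aggregate.py with a small standalone validator. This is a hedged sketch: the field names in `REQUIRED_KEYS` are hypothetical placeholders, since the real schema is defined by the (local-only) aggregation script and is not shown in this PR.

```python
import json
from pathlib import Path

# Hypothetical field names; the real schema is whatever mge_aggregate.py expects.
REQUIRED_KEYS = ("config", "full_pipeline_ms", "vmap_per_call_ms")


def validate_results(results_dir, required_keys=REQUIRED_KEYS):
    """Return (filename, problem) pairs; an empty list means every file passed."""
    problems = []
    for path in sorted(Path(results_dir).glob("*.json")):
        try:
            record = json.loads(path.read_text())
        except json.JSONDecodeError as exc:
            problems.append((path.name, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((path.name, "top-level value is not an object"))
            continue
        missing = [key for key in required_keys if key not in record]
        if missing:
            problems.append((path.name, f"missing keys: {missing}"))
    return problems
```

Calling `validate_results("jax_profiling/results/jit/imaging/mge")` would then list any file that fails to parse or lacks an expected field. Note that the glob also picks up comparison.json, which may need a different key set.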

🤖 Generated with Claude Code

…060 + CPU sweep

Long-term tracking artifacts for the MGE imaging likelihood across 10
configs: 4 post-FFT-fix local (CPU + RTX 2060, fp64 + mp), 4 pre-fix
local (parsed from session /tmp logs), and 2 HPC A100 (fp64 + mp).

Headline numbers (full pipeline single-JIT / vmap per call):

  hpc_a100_fp64        5.7  /  2.4  ms
  hpc_a100_mp          5.4  /  2.3  ms
  local_gpu_fp64      43.7  / 23.9  ms   (RTX 2060 Max-Q)
  local_gpu_mp        43.0  / 15.0  ms
  local_cpu_fp64     308.3  / 233.6 ms   (i9-10885H)
  local_cpu_mp       322.2  / 232.3 ms

Key findings:
  - A100 fp64 is 7.7× faster than RTX 2060 fp64 on full single-JIT,
    10× faster on vmap. Production hardware is in a different regime.
  - Mixed precision delivers ~zero speedup on A100 (5% noise level)
    because A100's fp64:fp32 throughput ratio is 1:2 vs RTX 2060's
    1:32. The mp flag is a consumer-GPU lever, not a production one.
  - The PyAutoArray FFT precision fix (PyAutoArray#302) is visible
    in the gap between local_gpu_mp_pre_fix (47.0 ms full pipeline)
    and local_gpu_mp (43.0 ms) — though most of the win lives in
    vmap and that figure is sensitive to JAX cache state.

Caveats:
  - Eager numpy log_likelihood = -159736.355042208 across all configs;
    A100 JIT path produces -159734.5938 (fp32-level truncation),
    suggesting jax_enable_x64 may not be set in the HPC PyAutoNSS
    venv. Worth verifying — does not affect timing, may matter for
    correctness audits of A100-served NSS / Nautilus runs.
  - Local sweep timings vary across sessions due to JAX cache and GPU
    thermal state. Treat single-machine cross-session deltas with
    care; the cross-platform A100 vs RTX 2060 vs CPU story is robust.

Generated by z_projects/profiling/scripts/mge_profile.py +
mge_aggregate.py against PyAutoArray@9b4df257 (FFT-precision-fix
merged) and AutoLens v2026.5.8.2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>