Add MGE imaging profiling: A100 + RTX 2060 + CPU sweep#56
Merged
Long-term tracking artifacts for the MGE imaging likelihood across 10
configs: 4 post-FFT-fix local (CPU + RTX 2060, fp64 + mp), 4 pre-fix
local (parsed from session /tmp logs), and 2 HPC A100 (fp64 + mp).
Headline numbers (full pipeline single-JIT / vmap per call):
hpc_a100_fp64 5.7 / 2.4 ms
hpc_a100_mp 5.4 / 2.3 ms
local_gpu_fp64 43.7 / 23.9 ms (RTX 2060 Max-Q)
local_gpu_mp 43.0 / 15.0 ms
local_cpu_fp64 308.3 / 233.6 ms (i9-10885H)
local_cpu_mp 322.2 / 232.3 ms
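For readers unfamiliar with the two columns, the sketch below shows one way a "single-JIT" time and a "vmap per call" time can be measured in JAX. It is not the code in mge_profile.py: `toy_log_likelihood`, `time_per_call_ms`, the array sizes and the batch size are all hypothetical, and dividing the batched time by the batch size is an assumption about how "per call" is defined.

```python
import time

import jax
import jax.numpy as jnp


def toy_log_likelihood(params):
    # Hypothetical stand-in for the MGE imaging likelihood: a cheap
    # FFT round trip plus a chi-squared-like sum, so it runs anywhere.
    image = jnp.outer(jnp.sin(params), jnp.cos(params))
    model = jnp.fft.ifft2(jnp.fft.fft2(image)).real
    return -0.5 * jnp.sum((model - image) ** 2)


def time_per_call_ms(fn, arg, repeats=5, calls_per_run=1):
    # Warm up once so compilation time is excluded from the measurement.
    fn(arg).block_until_ready()
    start = time.perf_counter()
    for _ in range(repeats):
        # JAX dispatch is asynchronous: block before reading the clock.
        fn(arg).block_until_ready()
    elapsed = time.perf_counter() - start
    return 1e3 * elapsed / (repeats * calls_per_run)


single = jax.jit(toy_log_likelihood)
batched = jax.jit(jax.vmap(toy_log_likelihood))

params = jnp.linspace(0.0, 1.0, 64)
batch = jnp.stack([params + 0.01 * i for i in range(8)])

single_ms = time_per_call_ms(single, params)
# Assumption: "vmap per call" = batched wall time / batch size.
vmap_ms = time_per_call_ms(batched, batch, calls_per_run=batch.shape[0])
print(f"single-JIT: {single_ms:.3f} ms, vmap per call: {vmap_ms:.3f} ms")
```

The warm-up call and `block_until_ready()` matter more than the arithmetic: without them the numbers mostly reflect compilation and dispatch rather than execution.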
Key findings:
- A100 fp64 is 7.7× faster than RTX 2060 fp64 on full single-JIT,
10× faster on vmap. Production hardware is in a different regime.
- Mixed precision delivers ~zero speedup on A100 (the ~5% gap is at
noise level) because A100's fp64:fp32 throughput ratio is 1:2 vs RTX
2060's 1:32. The mp flag is a consumer-GPU lever, not a production one.
- The PyAutoArray FFT precision fix (PyAutoArray#302) is visible
in the gap between local_gpu_mp_pre_fix (47.0 ms full pipeline)
and local_gpu_mp (43.0 ms) — though most of the win lives in
vmap and that figure is sensitive to JAX cache state.
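The ratios quoted in the findings can be re-derived from the headline numbers with a few lines of plain Python. The dictionary below only restates those timings; the helper name `speedup` and the column constants are illustrative.

```python
# Headline timings in ms, copied from the headline numbers above:
# (full pipeline single-JIT, vmap per call).
timings_ms = {
    "hpc_a100_fp64": (5.7, 2.4),
    "hpc_a100_mp": (5.4, 2.3),
    "local_gpu_fp64": (43.7, 23.9),
    "local_gpu_mp": (43.0, 15.0),
    "local_cpu_fp64": (308.3, 233.6),
    "local_cpu_mp": (322.2, 232.3),
}

FULL, VMAP = 0, 1


def speedup(slow, fast, column):
    # Ratio > 1 means `fast` is quicker than `slow` in that column.
    return timings_ms[slow][column] / timings_ms[fast][column]


# Cross-platform: A100 fp64 vs RTX 2060 fp64.
a100_vs_2060_full = speedup("local_gpu_fp64", "hpc_a100_fp64", FULL)
a100_vs_2060_vmap = speedup("local_gpu_fp64", "hpc_a100_fp64", VMAP)

# Mixed-precision gain on each GPU (fp64 time / mp time).
a100_mp_gain_full = speedup("hpc_a100_fp64", "hpc_a100_mp", FULL)
rtx_mp_gain_vmap = speedup("local_gpu_fp64", "local_gpu_mp", VMAP)

print(f"A100 vs RTX 2060 (full): {a100_vs_2060_full:.1f}x")
print(f"A100 vs RTX 2060 (vmap): {a100_vs_2060_vmap:.1f}x")
print(f"A100 mp gain (full):     {a100_mp_gain_full:.2f}x")
print(f"RTX 2060 mp gain (vmap): {rtx_mp_gain_vmap:.2f}x")
```

On these numbers the mixed-precision gain on the A100 is a few percent, while the RTX 2060 vmap path gains roughly 1.6×, which is the contrast the second finding describes.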
Caveats:
- Eager numpy log_likelihood = -159736.355042208 across all configs;
A100 JIT path produces -159734.5938 (fp32-level truncation),
suggesting jax_enable_x64 may not be set in the HPC PyAutoNSS
venv. Worth verifying — does not affect timing, may matter for
correctness audits of A100-served NSS / Nautilus runs.
- Local sweep timings vary across sessions due to JAX cache and GPU
thermal state. Treat single-machine cross-session deltas with
care; the cross-platform A100 vs RTX 2060 vs CPU story is robust.
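A minimal sketch of how the x64 caveat could be checked inside the HPC venv, assuming nothing about the PyAutoNSS setup beyond JAX being importable. `x64_is_active` is a hypothetical helper; the two log-likelihood values are the ones quoted in the caveat, and the 1e-4 tolerance is the rtol figure mentioned in the PR description.

```python
import math

import jax
import jax.numpy as jnp


def x64_is_active():
    # With jax_enable_x64 unset, JAX creates float32 arrays even from
    # Python floats; with it set, the default float dtype is float64.
    return jnp.asarray(1.0).dtype == jnp.float64


# Enable x64 explicitly, ideally before any other JAX work is done.
jax.config.update("jax_enable_x64", True)
print("x64 active:", x64_is_active())

eager = -159736.355042208  # eager numpy value quoted in the caveat
jit_a100 = -159734.5938    # A100 JIT value quoted in the caveat

# Size of the discrepancy relative to the eager reference.
rel_gap = abs(eager - jit_a100) / abs(eager)
print(f"relative gap: {rel_gap:.2e}")
print("within rtol=1e-4:", math.isclose(eager, jit_a100, rel_tol=1e-4))
```

Running `x64_is_active()` before the `jax.config.update` call, in the environment under suspicion, would show whether the flag is already set there.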
Generated by z_projects/profiling/scripts/mge_profile.py +
mge_aggregate.py against PyAutoArray@9b4df257 (FFT-precision-fix
merged) and AutoLens v2026.5.8.2.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds long-term tracking artifacts for the MGE imaging likelihood under jax_profiling/results/jit/imaging/mge/: 10 configs side-by-side covering pre-fix vs post-fix FFT precision flow on consumer hardware (RTX 2060 Max-Q + i9-10885H) plus the production A100 baseline. Generated by new tooling in z_projects/profiling/scripts/ (separate local-only commit; no PR).

Headline numbers
[table not captured]

Key findings
- … use_mixed_precision flag is a consumer-GPU lever, not a production one.
- … local_gpu_mp_pre_fix (47.0 ms full pipeline → bug: full fp64 FFT then narrow) and local_gpu_mp (43.0 ms → real complex64 FFT). Most of the user-facing win lives in vmap and is sensitive to JAX cache + thermal state.

Caveats
- … jax_enable_x64 is not set in the HPC PyAutoNSS venv. Doesn't affect timing data here, but worth confirming before quoting A100-served NSS / Nautilus log Z values to ~1e-3 precision. Filing a separate prompt to investigate.
- … rtol=1e-4.
- … (comparison.png) uses log scale on the y axis to make the A100/RTX 2060/CPU classes coexist legibly.

Generated by
- z_projects/profiling/scripts/mge_profile.py: single-config step-by-step JIT profiler (per-step timings + full pipeline + vmap + memory analysis).
- z_projects/profiling/scripts/mge_aggregate.py: --ingest-pre-fix /tmp to convert old-schema /tmp logs; --consolidate-from <staging> to move HPC pulls into this canonical dir; default to emit comparison.json+png.
- z_projects/profiling/hpc/batch_gpu/submit_mge_profile_{fp64,mp}: A100 SLURM submits.

The z_projects/profiling/ source side commits to its own (remote-less) main; only the result artifacts in this PR are version-tracked.

Test plan
- … mge_aggregate.py)
- comparison.json + comparison.png regenerated end-to-end
- … jax_profiling/results/jit/imaging/ outside the new mge/ subdir … mge_likelihood_summary_hst_v*.{json,png}) untouched

🤖 Generated with Claude Code