Add MGE imaging profiling: A100 + RTX 2060 + CPU sweep by Jammy2211 · Pull Request #56 · PyAutoLabs/autolens_workspace_developer

Jammy2211 · 2026-05-09T12:46:50Z

Summary

Adds long-term tracking artifacts for the MGE imaging likelihood under jax_profiling/results/jit/imaging/mge/ — 10 configs side-by-side covering pre-fix vs post-fix FFT precision flow on consumer hardware (RTX 2060 Max-Q + i9-10885H) plus the production A100 baseline. Generated by new tooling in z_projects/profiling/scripts/ (separate local-only commit; no PR).

Headline numbers

Config	Full pipeline	vmap per call
hpc_a100_fp64	5.7 ms	2.4 ms
hpc_a100_mp	5.4 ms	2.3 ms
local_gpu_fp64 (RTX 2060)	43.7 ms	23.9 ms
local_gpu_mp	43.0 ms	15.0 ms
local_gpu_fp64_pre_fix	37.1 ms	17.6 ms
local_gpu_mp_pre_fix	47.0 ms	12.2 ms
local_cpu_fp64	308.3 ms	233.6 ms
local_cpu_mp	322.2 ms	232.3 ms
local_cpu_fp64_pre_fix	208.6 ms	128.3 ms
local_cpu_mp_pre_fix	193.3 ms	120.4 ms

Key findings

A100 fp64 is 7.7× faster than RTX 2060 fp64 on full single-JIT, 10× faster on vmap. Production hardware is in a different regime — the consumer-laptop story we used during the FFT precision audit was not representative.
Mixed precision delivers ~zero speedup on A100 (5% noise level). A100's fp64:fp32 throughput ratio is 1:2 vs RTX 2060's 1:32 — fp64 is not punitive on production hardware. The use_mixed_precision flag is a consumer-GPU lever, not a production one.
The FFT precision fix (PyAutoArray#302) is visible in the gap between local_gpu_mp_pre_fix (47.0 ms full pipeline → bug: full fp64 FFT then narrow) and local_gpu_mp (43.0 ms → real complex64 FFT). Most of the user-facing win lives in vmap and is sensitive to JAX cache + thermal state.

Caveats

A100 JIT log-likelihood shows fp32-level truncation (-159734.59375) vs the eager numpy reference (-159736.355042208). Looks like jax_enable_x64 is not set in the HPC PyAutoNSS venv. Doesn't affect timing data here, but worth confirming before quoting A100-served NSS / Nautilus log Z values to ~1e-3 precision. Filing a separate prompt to investigate.
Local sweep timings vary across sessions due to JAX cache state and GPU thermal state. Cross-platform comparisons (A100 vs RTX 2060 vs CPU) are robust; single-machine cross-session deltas are not. The pre-fix vs post-fix comparison was captured in different sessions and the apparent post-fix CPU regression is most likely cache / system noise rather than a real code regression — the integration tests in PyAutoArray#302 / autogalaxy_workspace_test#38 stayed within rtol=1e-4.
The chart (comparison.png) uses log scale on the y axis to make the A100/RTX 2060/CPU classes coexist legibly.

Generated by

z_projects/profiling/scripts/mge_profile.py — single-config step-by-step JIT profiler (per-step timings + full pipeline + vmap + memory analysis).
z_projects/profiling/scripts/mge_aggregate.py — --ingest-pre-fix /tmp to convert old-schema /tmp logs; --consolidate-from <staging> to move HPC pulls into this canonical dir; default to emit comparison.json+png.
z_projects/profiling/hpc/batch_gpu/submit_mge_profile_{fp64,mp} — A100 SLURM submits.

The z_projects/profiling/ source side commits to its own (remote-less) main; only the result artifacts in this PR are version-tracked.

Test plan

All 10 JSON files schema-valid (parsed cleanly by mge_aggregate.py)
comparison.json + comparison.png regenerated end-to-end
No untracked or modified files in jax_profiling/results/jit/imaging/ outside the new mge/ subdir
Existing per-version flat summary files (mge_likelihood_summary_hst_v*.{json,png}) untouched

🤖 Generated with Claude Code

…060 + CPU sweep Long-term tracking artifacts for the MGE imaging likelihood across 10 configs: 4 post-FFT-fix local (CPU + RTX 2060, fp64 + mp), 4 pre-fix local (parsed from session /tmp logs), and 2 HPC A100 (fp64 + mp). Headline numbers (full pipeline single-JIT / vmap per call): hpc_a100_fp64 5.7 / 2.4 ms hpc_a100_mp 5.4 / 2.3 ms local_gpu_fp64 43.7 / 23.9 ms (RTX 2060 Max-Q) local_gpu_mp 43.0 / 15.0 ms local_cpu_fp64 308.3 / 233.6 ms (i9-10885H) local_cpu_mp 322.2 / 232.3 ms Key findings: - A100 fp64 is 7.7× faster than RTX 2060 fp64 on full single-JIT, 10× faster on vmap. Production hardware is in a different regime. - Mixed precision delivers ~zero speedup on A100 (5% noise level) because A100's fp64:fp32 throughput ratio is 1:2 vs RTX 2060's 1:32. The mp flag is a consumer-GPU lever, not a production one. - The PyAutoArray FFT precision fix (PyAutoArray#302) is visible in the gap between local_gpu_mp_pre_fix (47.0 ms full pipeline) and local_gpu_mp (43.0 ms) — though most of the win lives in vmap and that figure is sensitive to JAX cache state. Caveats: - Eager numpy log_likelihood = -159736.355042208 across all configs; A100 JIT path produces -159734.5938 (fp32-level truncation), suggesting jax_enable_x64 may not be set in the HPC PyAutoNSS venv. Worth verifying — does not affect timing, may matter for correctness audits of A100-served NSS / Nautilus runs. - Local sweep timings vary across sessions due to JAX cache and GPU thermal state. Treat single-machine cross-session deltas with care; the cross-platform A100 vs RTX 2060 vs CPU story is robust. Generated by z_projects/profiling/scripts/mge_profile.py + mge_aggregate.py against PyAutoArray@9b4df257 (FFT-precision-fix merged) and AutoLens v2026.5.8.2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Jammy2211 merged commit 5ac6f2a into main May 9, 2026

Jammy2211 deleted the feature/mge-profiling-a100 branch May 9, 2026 15:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add MGE imaging profiling: A100 + RTX 2060 + CPU sweep#56

Add MGE imaging profiling: A100 + RTX 2060 + CPU sweep#56
Jammy2211 merged 1 commit into
mainfrom
feature/mge-profiling-a100

Jammy2211 commented May 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Jammy2211 commented May 9, 2026

Summary

Headline numbers

Key findings

Caveats

Generated by

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant