Re-profile rectangular + Delaunay at full production fiducial (1225/1231 src + MGE-60 lens) #60
Merged
…e fiducial
Bumps mesh sizes:
- Rectangular: 28x28 = 784 → 35x35 = 1225 source pixels
- Delaunay: 26x26 Overlay = 570 vertices → 39x39 Overlay = 1231 vertices
EXPECTED_LOG_EVIDENCE_HST recomputed:
- Rectangular: 25918.02569499014 (was 26232.07)
- Delaunay: 27433.90296505439 (was 29179.95)
Both canonical references in jax_profiling/jit/imaging/{pixelization,
delaunay}.py and the result artifacts under jax_profiling/results/jit/
imaging/ have been updated together so canonical and per-config stay
aligned at the new fiducial.
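To illustrate why the constants had to be recomputed, here is a minimal sketch of a relative-tolerance check. It assumes the assertion has the common form |actual - expected| <= rtol * |expected|, with the rtol=1e-4 quoted in the PR description. The helper name within_rtol is hypothetical and not taken from the profiling scripts.

```python
def within_rtol(actual: float, expected: float, rtol: float = 1e-4) -> bool:
    """Hypothetical helper: relative-tolerance comparison against a reference."""
    return abs(actual - expected) <= rtol * abs(expected)


# Values quoted in this commit message (new reference, previous reference).
NEW_RECT, OLD_RECT = 25918.02569499014, 26232.07
NEW_DELAUNAY, OLD_DELAUNAY = 27433.90296505439, 29179.95

# Relative shift of each reference caused by the mesh bump.
rect_shift = abs(NEW_RECT - OLD_RECT) / abs(OLD_RECT)
delaunay_shift = abs(NEW_DELAUNAY - OLD_DELAUNAY) / abs(OLD_DELAUNAY)

# Would the new values still pass against the old constants?
rect_old_ok = within_rtol(NEW_RECT, OLD_RECT)
delaunay_old_ok = within_rtol(NEW_DELAUNAY, OLD_DELAUNAY)

print(f"rect shift: {rect_shift:.2%}, passes old constant: {rect_old_ok}")
print(f"Delaunay shift: {delaunay_shift:.2%}, passes old constant: {delaunay_old_ok}")
```

Both shifts are at the percent level, far above a 1e-4 tolerance, so under this assumed form of the check the old constants could not have been kept.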
Configs shipped:
- Rectangular: 6 (CPU+GPU x fp64+mp, A100 fp64+mp) — full coverage
- Delaunay: 4 (GPU fp64+mp, A100 fp64+mp) — local CPU configs hang
local_cpu_fp64 + local_cpu_mp for Delaunay at 1231 vertices both hang
indefinitely at full_pipeline_first_call after compile succeeds.
Identical futex_wait_queue_me signature for both precisions, which
extends Task #24 (previously thought mp-specific; now looks size-related).
Rectangular CPU configs at 1225 work fine. PR ships Delaunay without
CPU rows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lens light: single Sersic → MGE-60 (60 linear Gaussians) in both
canonical refs (jax_profiling/jit/imaging/{pixelization,delaunay}.py)
and corresponding result artifacts. The MGE columns enter the
inversion mapping matrix, growing F+H from NxN to (N+60)x(N+60).
EXPECTED_LOG_EVIDENCE_HST recomputed:
- Pixelization (1225 + MGE-60): 24746.105672366088 (was 25918.026 single-Sersic)
- Delaunay (1231 + MGE-60): 26288.321397232066 (was 27433.903)
Configs shipped:
- Rectangular: 6 (CPU+GPU x fp64+mp, A100 fp64+mp) — full coverage
- Delaunay: 4 (GPU fp64+mp, A100 fp64+mp) — local CPU still hangs (Task #24)
Companion z_projects/profiling commit on local main updates
_setup_pixelization.py + _setup_delaunay.py to MGE-60, fixes
pixelization_profile.py to use eager-extracted H from inversion (the
old source-only constant_regularization_matrix_from gave a (N,N) shape
that mismatched F's (N+60,N+60) when lens is linear-MGE), and adds
delaunay_vmap_probe.py + rectangular_vmap_probe.py for per-step
vmap/single decomposition.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
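The shape mismatch described in the commit above can be sketched with NumPy. This is an illustration under stated assumptions, not the project's code: F is modelled as A^T A for a stand-in mapping matrix A, the columns are ordered source pixels first and lens Gaussians second, the source-only H is replaced by an identity matrix, and all array names are hypothetical. The zero-padding shown is only one way to make the shapes agree; the commit itself takes H from the inversion instead.

```python
import numpy as np

N_SOURCE = 1231  # Delaunay source vertices quoted in this PR
N_LENS = 60      # MGE-60 linear Gaussians
N_DATA = 200     # arbitrary stand-in for the number of image pixels

rng = np.random.default_rng(0)

# Stand-in mapping matrix: one column per linear object
# (assumed order: source pixels, then lens Gaussians).
mapping = rng.random((N_DATA, N_SOURCE + N_LENS))

# Curvature-like matrix with one row and column per linear object.
F = mapping.T @ mapping

# Source-only regularization stand-in (identity), shape (N, N).
H_source = np.eye(N_SOURCE)

# Adding the (N, N) block directly to F fails because the shapes differ.
try:
    F + H_source
    shapes_compatible = True
except ValueError:
    shapes_compatible = False

# Embedding H in a zero matrix of F's shape makes the sum well defined.
H_full = np.zeros_like(F)
H_full[:N_SOURCE, :N_SOURCE] = H_source
F_plus_H = F + H_full

print(F.shape, H_source.shape, shapes_compatible, F_plus_H.shape)
```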
Summary
Replaces the rectangular-pixelization (#57) and Delaunay (#58) imaging-likelihood profiling sweeps with the full production fiducial setup — both the ~1250 source pixel mesh AND the MGE-60 lens light (60 linear Gaussians).
This PR went through three iterations of "make it more production-realistic":
The two earlier states are still in this PR's git history, but main lands the final MGE-60 fiducial. Updates canonical references at jax_profiling/jit/imaging/{pixelization,delaunay}.py to match.
35×35 was chosen for rectangular (closest perfect square to 1250; the mesh must be square). 39×39 Overlay was empirically calibrated for Delaunay so post-mask-filtering yields ~1250 vertices (1201 inside + 30 edge = 1231 total).
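The mesh-size arithmetic above can be checked in a few lines. This is only a sketch of the stated reasoning (nearest perfect square to the ~1250 target, and the inside + edge vertex sum); the variable names are illustrative.

```python
import math

TARGET = 1250  # approximate production source-pixel count

# Candidate side lengths on either side of sqrt(TARGET).
low = math.isqrt(TARGET)
high = low + 1

# Pick the side whose square is closest to the target.
best_side = min((low, high), key=lambda side: abs(side * side - TARGET))
rect_pixels = best_side * best_side

# Delaunay vertex count quoted above: vertices inside the mask plus edge vertices.
delaunay_vertices = 1201 + 30

print(best_side, rect_pixels, delaunay_vertices)
```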
Headline at the full fiducial
Delaunay vmap regression on A100 is 8.8× single-call (438 / 49.6 ms) when MGE-60 lens is added. The combined F+H is (1291, 1291) — 1231 source pixels + 60 lens MGE columns — and NNLS doesn't batch usefully at this size.
How the bottleneck picture shifts vs the earlier iterations
A100 vs RTX 2060 widens then narrows for rectangular as the model grows — at the production fiducial it's 21× (vs 22× at 784, 29× at 1225 single-Sersic). The MGE-60 mapping-matrix columns are a cheap GPU operation but the F construction grows faster on CPU/consumer GPU than A100.
For Delaunay, the vmap regression explodes from 4× → 8.8× as the lens MGE columns are added. The combined-size NNLS is the wall.
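The quoted ratios can be reproduced from the raw timings given in this description. The two helper functions are hypothetical conveniences for this check; the numbers are the ones quoted in the headline and in the key findings.

```python
def slowdown(slow_ms: float, fast_ms: float) -> float:
    """Ratio of two wall-clock timings (hypothetical helper)."""
    return slow_ms / fast_ms


def fractional_saving(mp_ms: float, fp64_ms: float) -> float:
    """Fraction of the fp64 runtime saved by mixed precision (hypothetical helper)."""
    return 1.0 - mp_ms / fp64_ms


# A100 Delaunay with MGE-60: vmap timing vs single-call timing.
delaunay_vmap_regression = slowdown(438.0, 49.6)

# Mixed-precision timings as quoted (mp vs fp64); units cancel in the ratio.
rtx2060_rect_saving = fractional_saving(495.0, 537.0)
cpu_rect_saving = fractional_saving(3803.0, 4443.0)

print(f"Delaunay vmap regression: {delaunay_vmap_regression:.1f}x")
print(f"RTX 2060 rect mp saving: {rtx2060_rect_saving:.0%}")
print(f"CPU rect mp saving: {cpu_rect_saving:.0%}")
```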
Key findings at the production fiducial
- NNLS reconstruction is now the dominant cost everywhere under vmap. A100 single-JIT: NNLS ≈ 16 ms of 25 ms rect total (64%), 16 ms of 50 ms Delaunay total (31%). Under vmap the share grows since NNLS doesn't batch and the rest does. Per-step probes on RTX 2060 confirm: scipy.spatial.Delaunay pure_callback scales SUBLINEARLY under vmap (0.85× ratio) while NNLS scales SUPERlinearly (1.27×). Production runs use vmap → NNLS is the lever.
- scipy pure_callback is not the production-relevant Delaunay bottleneck. Single-JIT timings would say it is (Inversion-setup-combined dominates), but production uses vmap and the pure_callback amortises across the batch. The new vmap_probe scripts (in z_projects/profiling/scripts/) make this concrete.
- MGE-60 lens light adds ~10 ms single + ~14 ms vmap on A100 rect, and ~15 ms single + ~340 ms vmap on A100 Delaunay. The Delaunay vmap penalty is grossly out of proportion to the model-size change. Hypothesis: the (1291, 1291) F+H matrix solve under vmap hits an XLA scheduling pathology that the (1231, 1231) single-call path doesn't. Worth a follow-up trace.
- Mixed precision is essentially a no-op on A100 everywhere (rect 24.9 vs 25.1 ms, Delaunay 49.6 vs 49.6 ms). On RTX 2060: rect mp gives 8% off (495 vs 537), Delaunay mp gives 5% off, much smaller than at single-Sersic where Delaunay-mp saved 24%. The MGE-60 inversion-setup work is mp-friendly but NNLS isn't. On CPU: rect-mp saves 14% (3803 vs 4443), still positive at this scale and the opposite of what we saw at single-Sersic 1225 (where CPU-mp regressed).
- Delaunay-on-CPU hangs at the bigger meshes. Both fp64 and mp hang at full_pipeline_first_call after compile (futex_wait_queue_me signature). Reproducible. Specific to (CPU + Delaunay + larger mesh). Task #24 followup. Rectangular CPU at 1225 + MGE-60 works fine.
Caveats
- PyAutoNSS venv lacks jax_enable_x64. rtol=1e-4 assertions still pass.
- -inf in per-config script output for all configs in this PR. Cosmetic bug in the script's hand-rolled compute_log_evidence: slogdet(H) of the regularization matrix padded to full F shape includes the all-zeros lens-MGE block (singular), so the slogdet is -inf. Does not affect any assertion (which all use eager and full-pipeline-JIT paths). Production AnalysisImaging handles this correctly. Will fix the script in a followup.
- pixelization_profile.py fix to HPC. SLURM jobs reported exit 0:0 because the bash epilogue ran even though Python crashed.
Generated by
- z_projects/profiling/scripts/pixelization_profile.py + delaunay_profile.py: the main per-config profilers (now with MGE-60 lens via _setup_*.py).
- z_projects/profiling/scripts/{delaunay,rectangular}_vmap_probe.py: new per-step vmap/single decomposition probes used in the body's analysis.
Test plan
- comparison.json + comparison.png regenerated end-to-end
- EXPECTED_LOG_EVIDENCE_HST (rtol=1e-4) on every config that ran
Followups
- JAX_PLATFORM_NAME=cpu python z_projects/profiling/scripts/delaunay_profile.py --config-name local_cpu_fp64.
- compute_log_evidence should use the source-only H block in slogdet(H) instead of the zero-padded full matrix. Trivial diff once we know it's the only consumer.
🤖 Generated with Claude Code
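The slogdet caveat and its proposed fix can be reproduced in isolation with NumPy. This is a minimal sketch, not the script's compute_log_evidence: the sizes are shrunk for readability, the source block is an arbitrary positive-definite stand-in rather than the real regularization matrix, and the assumption that the source block sits in the top-left corner is illustrative.

```python
import numpy as np

N_SOURCE, N_LENS = 5, 2  # small stand-in sizes; the PR uses 1231 and 60

# Stand-in source regularization block: symmetric and strictly diagonally
# dominant with a positive diagonal, hence positive definite.
H_source = 2.0 * np.eye(N_SOURCE) - 0.5 * (
    np.eye(N_SOURCE, k=1) + np.eye(N_SOURCE, k=-1)
)

# Zero-padded version: the lens rows and columns are all zeros (singular).
n_total = N_SOURCE + N_LENS
H_full = np.zeros((n_total, n_total))
H_full[:N_SOURCE, :N_SOURCE] = H_source

# slogdet of the padded matrix: the determinant is zero.
sign_full, logdet_full = np.linalg.slogdet(H_full)

# slogdet of the source-only block: finite and well defined.
sign_src, logdet_src = np.linalg.slogdet(H_source)

print("padded:     ", sign_full, logdet_full)
print("source-only:", sign_src, logdet_src)
```

Restricting the log-determinant to the source-only block avoids the singular lens rows, which is the change the last followup describes.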