Add Delaunay imaging profiling: A100 + RTX 2060 sweep + three-way comparison #58
Merged
… 2060 sweep
Four configs side-by-side for the Delaunay imaging likelihood (Sersic + Isothermal + ExternalShear lens with a Delaunay source mesh of ~706 vertices + ConstantSplit regularization), covering the consumer RTX 2060 Max-Q + i9-10885H laptop and production A100, in both fp64 and mixed-precision variants.
Local CPU configs (local_cpu_fp64 / local_cpu_mp) were attempted but both runs hung indefinitely in the dataset / mask oversampling setup (sub-2% CPU usage for 18-24 minutes, no progress past mask padding). This is a Delaunay-on-CPU specific stall not present in the prior MGE or rectangular pixelization sweeps; root cause likely in numba JIT cache contention between the prior GPU-mode run and the forced-CPU mode in the same shell. Skipped for this PR; will investigate as a followup.
Generated by new tooling in z_projects/profiling/scripts/ (separate local-only commit, no PR target).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3 tasks
…dness
Investigated the "Delaunay-on-CPU stall" flagged in the original PR caveat. Root cause: my output filter pipeline (tee | tail -25, then grep) block-buffered Python's print() calls, making a healthy 90-second CPU fp64 run look hung at 1-2% CPU for 18+ minutes. With PYTHONUNBUFFERED=1 and a tee target the script reads directly, the run completes end-to-end in ~90 sec.
Adding the local_cpu_fp64 row to the canonical dir + updated comparison.json + comparison.png. Now 5 configs side-by-side.
local_cpu_mp still hangs at full_pipeline_first_call after compile (different failure mode from the buffering issue: main thread blocks on futex_wait_queue_me, JAX worker threads also on futex). Likely a real but separate issue specific to mixed-precision JAX on CPU. Left as a followup investigation; ships 5 configs instead of 6.
Companion script fix on the z_projects/profiling side: delaunay_profile.py now forces line-buffered stdout so future runs flush per-section progress regardless of downstream pipe.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
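The buffering effect described in this commit message can be avoided from inside a script. The following is a minimal sketch, assuming Python 3.7+ (where `io.TextIOWrapper.reconfigure` is available); the helper name `force_line_buffered` is hypothetical and is not taken from delaunay_profile.py, whose actual fix is not shown here.

```python
import io
import sys


def force_line_buffered(stream):
    """Switch a text stream to line buffering so every newline triggers a flush.

    Hypothetical helper for illustration; streams without reconfigure()
    (for example io.StringIO) are returned unchanged.
    """
    if hasattr(stream, "reconfigure"):
        stream.reconfigure(line_buffering=True)
    return stream


# Stand-in for stdout attached to a pipe: a text wrapper over an in-memory
# byte buffer, created without line buffering (the block-buffered case).
raw = io.BytesIO()
piped = io.TextIOWrapper(raw, encoding="utf-8", newline="\n", line_buffering=False)

force_line_buffered(piped)
piped.write("section 1 done\n")  # the newline now forces the text through
flushed = raw.getvalue()         # bytes that reached the underlying buffer

# In a real profiling script the same call would be applied to sys.stdout.
force_line_buffered(sys.stdout)
```

Setting `PYTHONUNBUFFERED=1` in the environment, as the commit message does, achieves a similar result without touching the script; doing it in code has the advantage that the behaviour does not depend on how the script is launched.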
Jammy2211 added a commit that referenced this pull request May 10, 2026
…59)
Production AnalysisImaging uses NNLS (reconstruction_positive_only_from) for the source reconstruction; the canonical step-by-step profiler inadvertently used the cheaper jnp.linalg.solve, under-reporting the per-step "Regularized reconstruction" cost by roughly an order of magnitude (5 ms vs 47 ms on a consumer RTX 2060).
The downstream log-evidence value is unchanged within rtol=1e-4: at prior medians the well-conditioned ConstantSplit problem yields no negative source pixels, so NNLS reduces to the linear solve. Verified end-to-end against EXPECTED_LOG_EVIDENCE_HST = 29179.9490711974.
Followup to #58 (Delaunay profiling sweep), which already uses NNLS in the per-config delaunay_profile.py and called this discrepancy out in its caveats.
Co-authored-by: Jammy2211 <JNightingale2211@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
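The statement that NNLS reduces to the linear solve when no source pixel is negative follows from the KKT conditions of the non-negative problem: if the unconstrained solution of (F+H) x = D is already non-negative, its gradient is zero and every condition holds. A minimal sketch in plain Python, assuming a symmetric positive-definite matrix; the 2×2 values below are illustrative stand-ins for F+H and D, not numbers from the profiler.

```python
def solve_2x2(a, b):
    """Solve the 2x2 linear system a @ x = b with Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    x0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    x1 = (a[0][0] * b[1] - a[1][0] * b[0]) / det
    return [x0, x1]


def satisfies_nnls_kkt(a, b, x, tol=1e-12):
    """Check the KKT conditions of: min 0.5 x^T a x - b^T x  subject to x >= 0."""
    grad = [sum(a[i][j] * x[j] for j in range(2)) - b[i] for i in range(2)]
    primal = all(xi >= -tol for xi in x)                         # x >= 0
    dual = all(gi >= -tol for gi in grad)                        # gradient >= 0
    slack = all(abs(xi * gi) <= tol for xi, gi in zip(x, grad))  # x_i * g_i = 0
    return primal and dual and slack


# Illustrative symmetric positive-definite stand-in for F + H.
curvature_plus_reg = [[4.0, 1.0], [1.0, 3.0]]

# Case 1: the linear solve is non-negative, so it is also the NNLS answer.
x_pos = solve_2x2(curvature_plus_reg, [1.0, 2.0])
same_as_nnls = satisfies_nnls_kkt(curvature_plus_reg, [1.0, 2.0], x_pos)

# Case 2: the linear solve has a negative entry, so NNLS would differ from it.
x_neg = solve_2x2(curvature_plus_reg, [1.0, -2.0])
differs_from_nnls = not satisfies_nnls_kkt(curvature_plus_reg, [1.0, -2.0], x_neg)
```

The first case mirrors the situation described above, where the two solvers agree on the result but not on cost; the second shows why the cheaper solve is not a safe substitute in general.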
Merged
4 tasks
Summary
Adds long-term tracking artifacts for the Delaunay imaging likelihood under jax_profiling/results/jit/imaging/delaunay/: four configs (RTX 2060 + A100, fp64 + mp). Generated by new tooling in z_projects/profiling/scripts/ (separate local-only commit, no PR target).
Likelihood: Sersic + Isothermal + ExternalShear lens with a Delaunay source mesh of ~706 vertices (Overlay 26×26 + 30 circular edge points) + ConstantSplit regularization. Mirrors the canonical reference at jax_profiling/jit/imaging/delaunay.py.
This is the third entry in the imaging family after MGE (#56) and rectangular pixelization (#57), so the framing here is the three-way cross-likelihood comparison rather than a standalone Delaunay write-up.
Headline numbers — Delaunay alone
Local CPU configs (local_cpu_fp64 / local_cpu_mp) were attempted but both runs hung indefinitely in the dataset / mask oversampling setup (sub-2% CPU for 18–24 min, no progress past mask padding). This is a Delaunay-on-CPU stall not present in the prior MGE or rectangular pixelization sweeps; suspect numba JIT cache contention between a prior GPU-mode run and forced-CPU mode in the same shell. Skipped for this PR; flagged for followup.
Three-way cross-likelihood comparison (full pipeline per call)
Three-way cross-likelihood comparison (vmap per call)
Key findings
A100 vs RTX 2060 speedup is non-monotonic across likelihoods. MGE = 7.7×, rectangular = 22×, Delaunay = 9.4×. The pattern is best explained by what fraction of each pipeline JIT-compiles to GPU vs falls back to host CPU:
vmap is increasingly hostile as we move through the imaging family. MGE speeds up 2.4× under vmap; rectangular drops to 0.8×; Delaunay drops to 0.25× on A100, i.e. batch=3 vmap is 4× slower per call than single-JIT. NNLS being serial explains the rectangular regression, but it does not explain why Delaunay is so much worse than rectangular on A100. Likely the Delaunay triangulation step (scipy on host) doesn't vmap usefully and adds overhead per batch element rather than amortising.
Mixed precision behaviour shifts further toward "GPU lever" with Delaunay. MGE: ~0% on either GPU. Rectangular: ~10% RTX 2060, ~0% A100. Delaunay: 24% RTX 2060 (biggest mp benefit yet on consumer GPU), ~0% A100. The pattern matches: more dense linalg on consumer GPU → more headroom for fp32 to help. But A100's fp64 throughput is so high that mp never moves the needle there.
Reconstruction NNLS cost scales sublinearly across likelihoods. RTX 2060 fp64: rectangular 60 ms, Delaunay 36 ms — Delaunay is faster despite similar source pixel counts (~706 vs 784). On A100: rectangular 6.8 ms, Delaunay 4.5 ms — same ordering. NNLS converges faster when the curvature matrix is better conditioned (Delaunay's edge-point + zero-pixel scheme tightens it).
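The vmap ratios quoted in the findings above can be read as single-JIT time divided by per-call vmap time. A minimal sketch of that arithmetic, assuming "per call" means batch wall time divided by batch size; the function names and the timings used below are hypothetical and are not values from the sweep.

```python
def per_call_ms(batch_wall_ms, batch_size):
    """Average cost of one likelihood evaluation inside a vmapped batch."""
    return batch_wall_ms / batch_size


def vmap_speedup(single_jit_ms, batch_wall_ms, batch_size):
    """Ratio > 1 means vmap helps per call; ratio < 1 means it regresses."""
    return single_jit_ms / per_call_ms(batch_wall_ms, batch_size)


# Hypothetical timings: a 10 ms single-JIT call and a batch of 3 taking 120 ms.
ratio = vmap_speedup(single_jit_ms=10.0, batch_wall_ms=120.0, batch_size=3)
slowdown = 1.0 / ratio  # how many times slower each call is under vmap
```

With these illustrative inputs the ratio is 0.25, which corresponds to each call being 4× slower under vmap, the same relationship the Delaunay A100 finding describes.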
Caveats
NNLS-vs-linear-solve discrepancy in canonical reference. The canonical jax_profiling/jit/imaging/delaunay.py uses jnp.linalg.solve(F+H, D) for its per-step "Regularized reconstruction" timing, which under-reports cost (~5 ms vs ~36 ms NNLS on RTX 2060). This per-config script uses NNLS to match production AnalysisImaging behaviour; the full-pipeline JIT path is unaffected (it always uses production NNLS). A separate one-line PR should switch the canonical's step 12 to reconstruction_positive_only_from.
Regularization matrix (H) is an eager wall-clock measurement, not a JIT per-call average. ConstantSplit's interpolator-derived sparse weights aren't easily JIT-traced, so H is extracted once from the reference inversion. The reported time can include cold-start/setup costs and shouldn't be summed naively into "total step-by-step." The full-pipeline JIT path inside analysis.log_likelihood_function handles H differently and the 17.3 ms full-pipeline number is the trustworthy per-call cost.
Delaunay-on-CPU stall. Both local_cpu_* configs hung in setup (sub-2% CPU for 18–24 min). Reproducible on this laptop. Not seen for MGE or rectangular pixelization. Suspect numba cache contention; needs isolated investigation. PR ships with 4 configs instead of 6 as a result.
A100 JIT log-evidence shows fp32-level truncation (29181.09 vs eager 29179.95). Same root cause as #56 (Add MGE imaging profiling: A100 + RTX 2060 + CPU sweep) and #57 (Add pixelization imaging profiling: A100 + RTX 2060 + CPU sweep): HPC PyAutoNSS venv lacks jax_enable_x64. Doesn't affect timing data here; assertion uses rtol=1e-4 which still passes.
Local sweep timings vary across sessions due to JAX cache + GPU thermal state. Cross-platform comparisons (A100 vs RTX 2060) are robust; cross-likelihood comparisons in the table above use values from each PR's own sweep, not a single-session run, so cross-session noise applies.
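The claim that the truncated A100 value still passes the rtol=1e-4 assertion can be checked by hand. A minimal sketch, assuming the assertion follows the usual |actual − desired| ≤ rtol·|desired| convention; the helper names are hypothetical, and 29181.09 is the rounded value quoted in the caveat above.

```python
EXPECTED_LOG_EVIDENCE_HST = 29179.9490711974  # constant quoted in this PR
A100_JIT_LOG_EVIDENCE = 29181.09              # rounded value from the caveat


def relative_error(actual, desired):
    """Relative deviation of actual from desired."""
    return abs(actual - desired) / abs(desired)


def within_rtol(actual, desired, rtol):
    """True when |actual - desired| <= rtol * |desired|."""
    return abs(actual - desired) <= rtol * abs(desired)


rel = relative_error(A100_JIT_LOG_EVIDENCE, EXPECTED_LOG_EVIDENCE_HST)
passes_loose = within_rtol(A100_JIT_LOG_EVIDENCE, EXPECTED_LOG_EVIDENCE_HST, 1e-4)
passes_tight = within_rtol(A100_JIT_LOG_EVIDENCE, EXPECTED_LOG_EVIDENCE_HST, 1e-8)
```

The relative error comes out at roughly 3.9e-5, so the 1e-4 tolerance passes while a tolerance closer to fp64 precision would not, which is consistent with the caveat's reading that the difference is a precision issue rather than a modelling one.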
Generated by
z_projects/profiling/scripts/delaunay_profile.py: single-config 11-step JIT profiler (per-step timings + full pipeline + vmap + memory analysis). Argparse-driven, honours PYAUTO_ROOT for worktree-aware canonical writes.
z_projects/profiling/scripts/delaunay_aggregate.py: --ingest-pre-fix /tmp (no-op unless artifacts present); --consolidate-from <staging> to move HPC pulls into this canonical dir; default to emit comparison.json + comparison.png.
z_projects/profiling/scripts/_setup_delaunay.py: shared build_dataset / build_image_plane_mesh_grid / build_model / build_adapt_images / build_analysis so the canonical reference's EXPECTED_LOG_EVIDENCE_HST = 29179.9490711974 constant carries through and is asserted on every run.
z_projects/profiling/hpc/batch_gpu/submit_delaunay_profile_{fp64,mp}: A100 SLURM submits.
Test plan
… delaunay_aggregate.py)
comparison.json + comparison.png regenerated end-to-end
… jax_profiling/results/jit/imaging/ outside the new delaunay/ subdir
… delaunay_likelihood_summary_hst_v*.{json,png}, delaunay_sparse_cpu_*) untouched
… EXPECTED_LOG_EVIDENCE_HST (rtol=1e-4) on all 4 configs
Followups
… jax_profiling/jit/imaging/delaunay.py step 12 from jnp.linalg.solve to al.util.inversion.reconstruction_positive_only_from so the canonical reference's per-step reconstruction timing reflects production cost.
🤖 Generated with Claude Code