chore: fuse N `add_scaled` into one `parallel_for` #22893
Merged
Replace the N-1 serial add_scaled dispatches in batch_polynomials with a single parallel_for chunked over the destination range. Each worker iterates every source polynomial in turn, intersecting the chunk range with each source's [start_index, end_index) so writes stay disjoint across threads even when sources have different backing ranges. Amortises N-1 parallel_for startup costs into one dispatch and gives the destination a single pass.

Zone-level batch_polynomials wall reduction on HARDWARE_CONCURRENCY=16:

native storage/transfer-sp/transfer-pr: -7% / -26% / -21%
WASM storage/transfer-sp/transfer-pr: -12% / -29% / -24%

Total-wall move is below single-run noise because batch_polynomials is ~1% of total wall on every flow; output is byte-identical (VK pinning unchanged, hypernova_tests and chonk_tests pass).
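The chunk-and-intersect scheme described in this commit message can be sketched as follows. This is a minimal illustration under stated assumptions, not the barretenberg implementation: `SourcePoly`, `add_scaled_one` and `add_scaled_fused` are hypothetical names, `uint64_t` wrap-around arithmetic stands in for field arithmetic, and the chunks are run one after another where a real thread pool would hand each chunk to its own worker.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a polynomial: its coefficients occupy destination
// indices [start_index, end_index()).
struct SourcePoly {
    std::size_t start_index;
    std::vector<std::uint64_t> coeffs;
    std::size_t end_index() const { return start_index + coeffs.size(); }
};

// Baseline: one pass per source (one dispatch per source in the serial scheme).
inline void add_scaled_one(std::vector<std::uint64_t>& dst, const SourcePoly& src, std::uint64_t scalar)
{
    const std::size_t hi = std::min(dst.size(), src.end_index());
    for (std::size_t i = src.start_index; i < hi; ++i) {
        dst[i] += scalar * src.coeffs[i - src.start_index];
    }
}

// Fused: split dst into contiguous chunks; each chunk visits every source and
// touches only the part of that source that falls inside the chunk.
inline void add_scaled_fused(std::vector<std::uint64_t>& dst,
                             const std::vector<SourcePoly>& sources,
                             const std::vector<std::uint64_t>& scalars,
                             std::size_t num_chunks)
{
    assert(sources.size() == scalars.size());
    assert(num_chunks > 0);
    const std::size_t chunk_size = (dst.size() + num_chunks - 1) / num_chunks;
    // Assumption: a thread pool would run each iteration of this loop on its
    // own worker. That is safe because the chunk ranges never overlap.
    for (std::size_t c = 0; c < num_chunks; ++c) {
        const std::size_t chunk_begin = std::min(dst.size(), c * chunk_size);
        const std::size_t chunk_end = std::min(dst.size(), chunk_begin + chunk_size);
        for (std::size_t s = 0; s < sources.size(); ++s) {
            const SourcePoly& src = sources[s];
            // Intersect the chunk with the source's backing range.
            const std::size_t lo = std::max(chunk_begin, src.start_index);
            const std::size_t hi = std::min(chunk_end, src.end_index());
            for (std::size_t i = lo; i < hi; ++i) {
                dst[i] += scalars[s] * src.coeffs[i - src.start_index];
            }
        }
    }
}
```

Because every destination index belongs to exactly one chunk and each chunk applies the sources in the same order as the serial loop, the fused result matches the per-source baseline element by element.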
Hoist the fused parallel add_scaled pattern out of HypernovaFoldingProver into a generic Polynomial helper that issues a single parallel_for over the destination range. Apply it at the two additional sites Sergei flagged on the original PR review:

- PolynomialBatcher::compute_batched (PCS poly batcher used by shplemini / chonk batched honk translator).
- AvmProver::execute_pcs_rounds (both shifted and unshifted batching loops).

The AVM unshifted batch fuses ~2715 add_scaled dispatches into one parallel_for; PCS Mega unshifted fuses ~52; shifted batches fuse smaller N=4-19. In every case the per-call parallel_for startup cost is amortised from N× to 1×.
suyash67 added a commit that referenced this pull request on May 1, 2026
A handful of profiling-driven trims for Chonk client-IVC. Each commit targets one zone, no behaviour change.

## Per-commit gains

Each row is the commit applied **alone** on top of `merge-train/barretenberg`, measured on `ecdsar1+transfer_1_recursions+sponsored_fpc`, 16-thread remote bench, single sample per build.

| Commit | Target zone | Native baseline | Native Δ | WASM baseline | WASM Δ |
|---|---|---:|---:|---:|---:|
| W2: reuse precomputed VK in `Chonk::accumulate_hiding_kernel` | hiding kernel | 121 ms | **−104 ms** | 343 ms | **−301 ms** |
| W3: parallelise `construct_trace_data` over trace blocks¹ | construct_trace_data | 135 ms | **−34 ms** | 373 ms | **−116 ms** |
| W6: parallelise `compute_permutation_mapping` cycle loop | compute_permutation_mapping | 36 ms | **−11 ms** | 61 ms | **−23 ms** |

¹ W3 splits the work into two parallel phases: Phase 1 fans out per-block (wires + copy-cycle node emission), Phase 2 fans out over a flattened `(block, selector)` task list so the threadpool can load-balance selector filling across blocks regardless of per-block size skew. The single-pass-per-block structure that the original W3 used was WASM-only; the flattened selector phase is what unlocks the native gain.

² End-to-end numbers below were measured with the original W4 commit included; subtract W4's per-commit Δ (−17 ms native, −71 ms WASM) for the W4-less stack, or roll #22893 in for the up-to-date total.

## End-to-end (full stack on top of baseline)

| Zone | Native baseline | Native Δ | WASM baseline | WASM Δ |
|---|---:|---:|---:|---:|
| `Chonk::accumulate` (×11 per proof) | 3292 ms | **−103 ms** (3.1%)² | 9172 ms | **−532 ms** (5.8%)² |
| `ChonkAPI::prove` (full E2E) | 6236 ms | **−121 ms** (2%)² | 16728 ms | **−532 ms** (3.2%)² |

A fifth change (fold ECCVM masking poly into wire batch) was prototyped but didn't show measurable impact under the new "masking at top of trace" model, so it was dropped.
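The flattened `(block, selector)` task list mentioned in the W3 footnote can be pictured with a small sketch. It is only an illustration of the flattening idea: `flatten_selector_tasks` and its input (a per-block selector count) are hypothetical names, and the real trace-construction code is not shown in this description.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One task = (block index, selector index within that block).
using SelectorTask = std::pair<std::size_t, std::size_t>;

// Hypothetical helper: selectors_per_block[b] is the number of selector
// columns to fill in block b. The result lists every (block, selector) pair
// once, so a thread pool can hand out tasks individually instead of
// assigning whole blocks to workers.
inline std::vector<SelectorTask> flatten_selector_tasks(const std::vector<std::size_t>& selectors_per_block)
{
    std::size_t total = 0;
    for (std::size_t count : selectors_per_block) {
        total += count;
    }

    std::vector<SelectorTask> tasks;
    tasks.reserve(total);
    for (std::size_t block = 0; block < selectors_per_block.size(); ++block) {
        for (std::size_t selector = 0; selector < selectors_per_block[block]; ++selector) {
            tasks.emplace_back(block, selector);
        }
    }
    assert(tasks.size() == total);
    return tasks;
}
```

With a flat list, a block with many selectors contributes many small tasks rather than one large one, which is what lets the pool balance work when block sizes are skewed.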
The original W4 (fuse N `add_scaled` in `HypernovaFoldingProver::batch_polynomials`) has been split out into #22893, where it's extracted into a shared `Polynomial::add_scaled_batch` helper and applied to the PCS poly batcher and AVM prover.

## Test plan

- [x] `chonk_tests` (33/33), `eccvm_tests` (44/44), `ultra_honk_tests` (283/283), `hypernova_tests` (9/9) green locally
- [x] Profiled native + WASM with `/profile-chonk` on remote bench machine
This was referenced May 1, 2026
danielntmd pushed a commit to danielntmd/aztec-packages that referenced this pull request on May 6, 2026
BEGIN_COMMIT_OVERRIDE
fix(ci): default S3_BUILD_CACHE_AWS_PARAMS in cache_s3_transfer{,_to}
(AztecProtocol#22898)
chore: low-hanging chonk prover fixes from profiling (AztecProtocol#22855)
chore: fuse N `add_scaled` into one `parallel_for` (AztecProtocol#22893)
feat: Delayed merge implementation (AztecProtocol#22775)
chore: numeric audit response (AztecProtocol#22856)
fix: harden BN254 G2 SRS ingress (AztecProtocol#22858)
fix: remove unused hash_challenge variable in batch_merge.test.cpp
(AztecProtocol#22906)
fix(bbup): remove jq dependency (AztecProtocol#22912)
chore: fix g2 test failing on merge-train (AztecProtocol#22920)
fix(ci): error on disabled-cache in CI hash calculation (AztecProtocol#22904)
END_COMMIT_OVERRIDE
Replaces the N sequential `add_scaled` calls in three batch-polynomial accumulation hot spots with a single fused `parallel_for` over the destination range, amortising N× `parallel_for` startup overhead into 1×.

Extracted as a generic `Polynomial<Fr>::add_scaled_batch(dst, sources, scalars)` helper in `polynomials/polynomial.{hpp,cpp}`, explicitly instantiated for `bb::fr` and `grumpkin::fr`. Equivalent to a loop of `dst.add_scaled(sources[i], scalars[i])` calls: same numerical result, single dispatch.

## Where it's applied

- `HypernovaFoldingProver::batch_polynomials`
- `PolynomialBatcher::compute_batched` (`gemini.hpp`)
- `AvmProver::execute_pcs_rounds`

## Measured impact (remote EC2, 16 threads, 3 runs each)

Chonk transfer flow (`ecdsar1+transfer_1_recursions+sponsored_fpc`, full proof): marginal on Chonk because the affected callsites are not Chonk's hot path.

AVM proof (`AvmVerifierTests.GoodPublicInputs`), zone `AvmProver::execute_pcs_rounds`: the AVM `execute_pcs_rounds` time is cut in half because the N≈3000 `parallel_for` startups per call are fused into one.

## Memory cost

Each call adds two transient `std::vector`s of size N (one of `PolynomialSpan<const Fr>` = 24 B, one of `Fr` = 32 B), freed when the function returns. Largest case is AVM unshifted (~170 KB transient); no source polynomial is copied, only views are stored. Confirmed via `/usr/bin/time -v` on WASM Chonk: peak RSS unchanged (within 36 KB of 316 MB).

## Test plan

- `chonk_tests` (33/33), `hypernova_tests` (9/9), `commitment_schemes_tests` (88/88) green
- `vm2_tests` `*Verifier*:*Prover*` (5/5) green; full `vm2_tests` (1764/1764)
- `execute_pcs_rounds` benchmark on remote confirms −37.5% on the targeted function
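The transient-memory figure quoted under "Memory cost" can be checked with a few lines of arithmetic. The sketch below takes the element sizes exactly as stated in the description (24 B per span, 32 B per scalar) as assumptions rather than measuring them, and `transient_bytes` is a hypothetical name.

```cpp
#include <cstddef>

// Sizes as quoted in the description; assumed here, not measured.
constexpr std::size_t kSpanBytes = 24;   // one PolynomialSpan<const Fr> view
constexpr std::size_t kScalarBytes = 32; // one Fr scalar

// Two vectors of N elements each: N views plus N scalars.
constexpr std::size_t transient_bytes(std::size_t n)
{
    return n * (kSpanBytes + kScalarBytes);
}

// N ≈ 3000 (the AVM unshifted batch, rounded) gives 168,000 B, i.e. about 170 KB.
static_assert(transient_bytes(3000) == 168000, "3000 * 56 B");
```

For the ~2715 dispatches reported for the AVM unshifted batch, the same formula gives about 152 KB, which is in line with the rounded ~170 KB estimate.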