chore: fuse N add_scaled into one parallel_for #22893

Merged
suyash67 merged 2 commits into merge-train/barretenberg from sb/add-scaled-batch
May 1, 2026

@suyash67 suyash67 commented May 1, 2026

Contributor

Replaces the N sequential add_scaled calls in three batch-polynomial accumulation hot spots with a single fused parallel_for over the destination range, amortising N× parallel_for startup overhead into 1×.

Extracted as a generic Polynomial<Fr>::add_scaled_batch(dst, sources, scalars) helper in polynomials/polynomial.{hpp,cpp}, explicitly instantiated for bb::fr and grumpkin::fr. Equivalent to a loop of dst.add_scaled(sources[i], scalars[i]) calls — same numerical result, single dispatch.
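The equivalence claimed above can be illustrated with a minimal sketch. Everything here is a stand-in rather than the barretenberg code: `Poly` is a plain vector, arithmetic is modulo a small prime instead of a real field, and `toy_parallel_for` runs its chunks serially and only counts how often it is entered. The function names mirror the PR, but the signatures are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Toy stand-in for a field element: arithmetic modulo a small prime.
constexpr uint64_t P = 1000003;
using Poly = std::vector<uint64_t>;

// Number of times the toy parallel_for has been entered.
static std::size_t dispatch_count = 0;

// Toy parallel_for: splits [0, n) into chunks and runs them one after another.
// A real thread pool would hand each chunk to a worker thread.
void toy_parallel_for(std::size_t n, std::size_t num_chunks,
                      const std::function<void(std::size_t, std::size_t)>& body)
{
    ++dispatch_count;
    const std::size_t chunk = (n + num_chunks - 1) / num_chunks;
    for (std::size_t lo = 0; lo < n; lo += chunk) {
        body(lo, std::min(lo + chunk, n));
    }
}

// One dispatch per call: dst[i] += scalar * src[i].
void add_scaled(Poly& dst, const Poly& src, uint64_t scalar)
{
    assert(src.size() <= dst.size());
    toy_parallel_for(src.size(), 4, [&](std::size_t lo, std::size_t hi) {
        for (std::size_t i = lo; i < hi; ++i) {
            dst[i] = (dst[i] + scalar * src[i]) % P;
        }
    });
}

// One dispatch in total: each chunk of dst walks every source in turn.
void add_scaled_batch(Poly& dst, const std::vector<const Poly*>& sources,
                      const std::vector<uint64_t>& scalars)
{
    assert(sources.size() == scalars.size());
    toy_parallel_for(dst.size(), 4, [&](std::size_t lo, std::size_t hi) {
        for (std::size_t k = 0; k < sources.size(); ++k) {
            const Poly& src = *sources[k];
            const std::size_t end = std::min(hi, src.size());
            for (std::size_t i = lo; i < end; ++i) { // empty if the chunk lies past src
                dst[i] = (dst[i] + scalars[k] * src[i]) % P;
            }
        }
    });
}
```

Because each destination index still receives its contributions in the same source order, the fused version produces the same values as the loop of `add_scaled` calls while entering the parallel region once instead of N times.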

Where it's applied

| Site | N (fused sources) | Notes |
|---|---|---|
| HypernovaFoldingProver::batch_polynomials | 51 unshifted, 4 shifted (Mega) | First commit on this branch |
| PolynomialBatcher::compute_batched (gemini.hpp) | 52 / 5 (Mega), 36 / 5 (Ultra), 66 / 19 (Chonk batched translator+MegaZK) | Used by shplemini |
| AvmProver::execute_pcs_rounds | 2715 unshifted, 363 shifted | Largest N by far; savings amplified |

Measured impact (remote EC2, 16 threads, 3 runs each)

Chonk transfer flow (ecdsar1+transfer_1_recursions+sponsored_fpc, full proof):

| | Before | After | Δ |
|---|---:|---:|---:|
| Native runtime | 6.11 s | 6.07 s | -0.7% |
| WASM runtime | 17.27 s | 17.25 s | -0.1% |
| WASM peak RSS | 323,604 KB | 323,568 KB | ~0 |

Marginal on Chonk because the affected callsites are not Chonk's hot path.

AVM proof (AvmVerifierTests.GoodPublicInputs):

| Function | Before | After | Δ |
|---|---:|---:|---:|
| AvmProver::execute_pcs_rounds | 392.5 ms | 245.4 ms | −37.5% |
| Total proof time | 935 ms | 927 ms | −0.9% |

AvmProver::execute_pcs_rounds drops by over a third because the roughly 3,000 separate parallel_for startups per call (2715 unshifted + 363 shifted) are replaced by one fused dispatch per batching loop.
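As a rough back-of-envelope reading of the AVM numbers, one can estimate the cost per eliminated dispatch. This assumes the entire saving comes from dispatch overhead and that the two batching loops now issue one dispatch each; both are assumptions, not measurements. Since the fused version also gives the destination a single pass, the result is better read as an upper bound on pure startup cost.

```latex
\[
t_{\text{dispatch}} \;\lesssim\; \frac{392.5\,\text{ms} - 245.4\,\text{ms}}{(2715 + 363) - 2}
= \frac{147.1\,\text{ms}}{3076} \approx 48\,\mu\text{s}
\]
```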

Memory cost

Each call adds two transient std::vectors of size N (one of PolynomialSpan<const Fr> = 24 B, one of Fr = 32 B), freed when the function returns. Largest case is AVM unshifted (~170 KB transient) — no source polynomial is copied, only views are stored. Confirmed via /usr/bin/time -v on WASM Chonk: peak RSS unchanged (within 36 KB of 316 MB).
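The transient footprint follows directly from the element sizes quoted above. The sketch below only multiplies those numbers out; the constant and function names are made up for illustration, and vector capacity growth or allocator overhead is not modelled.

```cpp
#include <cassert>
#include <cstddef>

// Element sizes quoted in the PR description.
constexpr std::size_t kSpanBytes = 24;   // PolynomialSpan<const Fr>
constexpr std::size_t kScalarBytes = 32; // Fr

// Transient bytes for one call fusing n sources: one view plus one scalar each.
constexpr std::size_t transient_bytes(std::size_t n)
{
    return n * (kSpanBytes + kScalarBytes);
}

static_assert(transient_bytes(2715) == 152040, "AVM unshifted batch");
static_assert(transient_bytes(363) == 20328, "AVM shifted batch");
static_assert(transient_bytes(2715) + transient_bytes(363) == 172368, "both AVM batches");
```

By this arithmetic the unshifted batch alone needs about 152 KB and the two AVM batches together about 172 KB, so the transient cost stays well under a megabyte either way.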

Test plan

  • chonk_tests (33/33), hypernova_tests (9/9), commitment_schemes_tests (88/88) green
  • vm2_tests *Verifier*:*Prover* (5/5) green; full vm2_tests (1764/1764)
  • Native + WASM Chonk benchmarks on remote EC2 confirm no regression
  • AVM execute_pcs_rounds benchmark on remote confirms −37.5% on the targeted function

suyash67 added 2 commits May 1, 2026 08:09
Replace the N-1 serial add_scaled dispatches in batch_polynomials with a
single parallel_for chunked over the destination range. Each worker
iterates every source polynomial in turn, intersecting the chunk range
with each source's [start_index, end_index) so writes stay disjoint
across threads even when sources have different backing ranges.

Amortises N-1 parallel_for startup costs into one dispatch and gives
the destination a single pass.
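The per-worker range intersection described above can be sketched as follows. This is a minimal illustration, not the code from the commit: `SourceView` and `accumulate_chunk` are hypothetical names, and wrapping `uint64_t` arithmetic stands in for field operations.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical view of a source polynomial: coefficients live on
// [start_index, end_index()) and are implicitly zero elsewhere.
struct SourceView {
    std::size_t start_index;
    std::vector<uint64_t> coeffs;
    std::size_t end_index() const { return start_index + coeffs.size(); }
};

// Work done by one worker for its destination chunk [chunk_lo, chunk_hi).
// Only indices inside the chunk are written, so distinct chunks never overlap.
void accumulate_chunk(std::vector<uint64_t>& dst, std::size_t chunk_lo, std::size_t chunk_hi,
                      const std::vector<SourceView>& sources,
                      const std::vector<uint64_t>& scalars)
{
    assert(sources.size() == scalars.size());
    assert(chunk_lo <= chunk_hi && chunk_hi <= dst.size());
    for (std::size_t k = 0; k < sources.size(); ++k) {
        const SourceView& src = sources[k];
        // Intersect the chunk with this source's backing range.
        const std::size_t lo = std::max(chunk_lo, src.start_index);
        const std::size_t hi = std::min(chunk_hi, src.end_index());
        for (std::size_t i = lo; i < hi; ++i) { // empty when the ranges do not intersect
            dst[i] += scalars[k] * src.coeffs[i - src.start_index];
        }
    }
}
```

A fused dispatch would call `accumulate_chunk` once per worker with disjoint chunks covering the destination; the result does not depend on where the chunk boundaries fall.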

Zone-level batch_polynomials wall reduction on HARDWARE_CONCURRENCY=16:
  native  storage/transfer-sp/transfer-pr: -7% / -26% / -21%
  WASM    storage/transfer-sp/transfer-pr: -12% / -29% / -24%

Total-wall move is below single-run noise because batch_polynomials is
~1% of total wall on every flow; output is byte-identical (VK pinning
unchanged, hypernova_tests and chonk_tests pass).

Hoist the fused parallel add_scaled pattern out of HypernovaFoldingProver
into a generic Polynomial helper that issues a single parallel_for over the
destination range. Apply it at the two additional sites Sergei flagged on
the original PR review:

- PolynomialBatcher::compute_batched (PCS poly batcher used by shplemini /
  chonk batched honk translator).
- AvmProver::execute_pcs_rounds (both shifted and unshifted batching loops).

The AVM unshifted batch fuses ~2715 add_scaled dispatches into one parallel_for;
PCS Mega unshifted fuses ~52; shifted batches fuse smaller N=4-19. In every
case the per-call parallel_for startup cost is amortised from N x to 1 x.
@suyash67 suyash67 changed the title perf: fuse N add_scaled into one parallel_for in hypernova/PCS/AVM provers chore: fuse N add_scaled into one parallel_for in hypernova/PCS/AVM provers May 1, 2026
@suyash67 suyash67 changed the title chore: fuse N add_scaled into one parallel_for in hypernova/PCS/AVM provers chore: fuse N add_scaled into one parallel_for May 1, 2026

@federicobarbacovi federicobarbacovi left a comment

Contributor

Great stuff!

suyash67 added a commit that referenced this pull request May 1, 2026
A handful of profiling-driven trims for Chonk client-IVC. Each commit
targets one zone, no behaviour change.

## Per-commit gains

Each row is the commit applied **alone** on top of
`merge-train/barretenberg`, measured on
`ecdsar1+transfer_1_recursions+sponsored_fpc`, 16-thread remote bench,
single sample per build.

| Commit | Target zone | Native baseline | Native Δ | WASM baseline | WASM Δ |
|---|---|---:|---:|---:|---:|
| W2: reuse precomputed VK in `Chonk::accumulate_hiding_kernel` | hiding kernel | 121 ms | **−104 ms** | 343 ms | **−301 ms** |
| W3: parallelise `construct_trace_data` over trace blocks¹ | construct_trace_data | 135 ms | **−34 ms** | 373 ms | **−116 ms** |
| W6: parallelise `compute_permutation_mapping` cycle loop | compute_permutation_mapping | 36 ms | **−11 ms** | 61 ms | **−23 ms** |

¹ W3 splits the work into two parallel phases: Phase 1 fans out
per-block (wires + copy-cycle node emission), Phase 2 fans out over a
flattened `(block, selector)` task list so the threadpool can
load-balance selector filling across blocks regardless of per-block size
skew. The single-pass-per-block structure that the original W3 used was
WASM-only; the flattened selector phase is what unlocks the native gain.

² End-to-end numbers below were measured with the original W4 commit
included; subtract W4's per-commit Δ (−17 ms native, −71 ms WASM) for
the W4-less stack, or roll #22893 in for the up-to-date total.
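
The flattened `(block, selector)` task list from footnote ¹ can be pictured with a short sketch. It is an illustration only: `flatten_tasks` and its input are hypothetical, and the actual phase-2 code is not shown in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Build one task per (block, selector) pair so a thread pool can balance
// selector filling across blocks of very different sizes.
// selectors_per_block[b] is the number of selectors in block b (hypothetical input).
std::vector<std::pair<std::size_t, std::size_t>> flatten_tasks(
    const std::vector<std::size_t>& selectors_per_block)
{
    std::vector<std::pair<std::size_t, std::size_t>> tasks;
    for (std::size_t block = 0; block < selectors_per_block.size(); ++block) {
        for (std::size_t selector = 0; selector < selectors_per_block[block]; ++selector) {
            tasks.emplace_back(block, selector);
        }
    }
    return tasks;
}
```

Dispatching over `tasks` rather than over blocks means a large block contributes many independent work items instead of a single long one.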

## End-to-end (full stack on top of baseline)

| Zone | Native baseline | Native Δ | WASM baseline | WASM Δ |
|---|---:|---:|---:|---:|
| `Chonk::accumulate` (×11 per proof) | 3292 ms | **−103 ms** (3.1%)² | 9172 ms | **−532 ms** (5.8%)² |
| `ChonkAPI::prove` (full E2E) | 6236 ms | **−121 ms** (2%)² | 16728 ms | **−532 ms** (3.2%)² |

A fifth change (fold ECCVM masking poly into wire batch) was prototyped
but didn't show measurable impact under the new "masking at top of
trace" model, so it was dropped.

The original W4 (fuse N `add_scaled` in
`HypernovaFoldingProver::batch_polynomials`) has been split out into
#22893, where it's extracted into a shared
`Polynomial::add_scaled_batch` helper and applied to the PCS poly
batcher and AVM prover.

## Test plan
- [x] `chonk_tests` (33/33), `eccvm_tests` (44/44), `ultra_honk_tests` (283/283), `hypernova_tests` (9/9) green locally
- [x] Profiled native + WASM with `/profile-chonk` on remote bench machine
@suyash67 suyash67 merged commit 1f87e64 into merge-train/barretenberg May 1, 2026
17 of 18 checks passed
@suyash67 suyash67 deleted the sb/add-scaled-batch branch May 1, 2026 17:49
danielntmd pushed a commit to danielntmd/aztec-packages that referenced this pull request May 6, 2026
BEGIN_COMMIT_OVERRIDE
fix(ci): default S3_BUILD_CACHE_AWS_PARAMS in cache_s3_transfer{,_to}
(AztecProtocol#22898)
chore: low-hanging chonk prover fixes from profiling (AztecProtocol#22855)
chore: fuse N `add_scaled` into one `parallel_for` (AztecProtocol#22893)
feat: Delayed merge implementation (AztecProtocol#22775)
chore: numeric audit response (AztecProtocol#22856)
fix: harden BN254 G2 SRS ingress (AztecProtocol#22858)
fix: remove unused hash_challenge variable in batch_merge.test.cpp
(AztecProtocol#22906)
fix(bbup): remove jq dependency (AztecProtocol#22912)
chore: fix g2 test failing on merge-train (AztecProtocol#22920)
fix(ci): error on disabled-cache in CI hash calculation (AztecProtocol#22904)
END_COMMIT_OVERRIDE