chore: low-hanging chonk prover fixes from profiling #22855
Merged
iakovenkos
reviewed
Apr 29, 2026
// parallel_for startup overhead into 1×. Chunking over the destination range (not per
// source) keeps writes disjoint across threads even when sources have different
// start_index/end_index.
auto fused_add_scaled = [&](Polynomial<FF>& dst) {
Can you please enable it in the PCS where the poly batcher calls add_scaled, and in the AVM prover as @federicobarbacovi pointed out?
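To make the chunking idea in the snippet above concrete, here is a minimal sketch of a fused add_scaled that partitions the destination range across threads. It is not the code from this PR: the `Source` struct, the use of `std::uint64_t` in place of field elements, and the raw `std::thread` fan-out (standing in for `parallel_for`) are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for a source polynomial: its coefficients occupy
// [start_index, start_index + coeffs.size()) in the destination's index space.
struct Source {
    std::size_t start_index;
    std::vector<std::uint64_t> coeffs;
    std::uint64_t scalar;
};

// Computes dst[i] += scalar_k * src_k[i] for every source k in one parallel pass.
// Threads are assigned contiguous slices of dst, so no two threads write the same
// element, whatever the sources' start/end indices are.
void fused_add_scaled(std::vector<std::uint64_t>& dst, const std::vector<Source>& sources, std::size_t num_threads)
{
    assert(num_threads > 0);
    const std::size_t chunk = (dst.size() + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(dst.size(), t * chunk);
        const std::size_t end = std::min(dst.size(), begin + chunk);
        if (begin == end) {
            continue; // nothing left for this thread
        }
        workers.emplace_back([&dst, &sources, begin, end] {
            for (const Source& src : sources) {
                // Intersect this thread's slice with the source's populated range.
                const std::size_t lo = std::max(begin, src.start_index);
                const std::size_t hi = std::min(end, src.start_index + src.coeffs.size());
                for (std::size_t i = lo; i < hi; ++i) {
                    dst[i] += src.scalar * src.coeffs[i - src.start_index];
                }
            }
        });
    }
    for (std::thread& w : workers) {
        w.join();
    }
}
```

Chunking per source instead would need either one thread fan-out per source or some way to stop two threads from touching the same dst element; slicing dst avoids both.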
iakovenkos
approved these changes
Apr 29, 2026
iakovenkos
left a comment
lgtm! left a couple of minor comments/suggestions
Skip the 31 sequential polynomial commits in MegaZKVerificationKey(precomputed) ctor by reusing the caller-supplied precomputed_vk directly as hiding_vk. MegaZKFlavor inherits VerificationKey from MegaFlavor unchanged, so the two types are identical (static_assert enforces this). Falls back to reconstruction when precomputed_vk is null for dev/test paths.

Zone wall drops:
- WASM: 342/300/333 ms -> 46/46/46 ms (-85% to -87%)
- Native: 137/125/137 ms -> 18/17/18 ms (-86% to -87%)

VK pinning short hash d519f639 unchanged; chonk tests pass.
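The reuse-with-fallback pattern in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the actual constructor: the tiny `MegaVerificationKey` struct, the `recompute_vk` and `make_hiding_vk` functions, and the `commit_rounds` counter are hypothetical names introduced here to show the control flow.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <memory>
#include <type_traits>

// Hypothetical stand-ins; the real flavor classes carry far more state.
struct MegaVerificationKey {
    std::array<std::uint64_t, 4> commitments{};
};
struct MegaFlavor {
    using VerificationKey = MegaVerificationKey;
};
struct MegaZKFlavor : MegaFlavor {}; // inherits VerificationKey unchanged

// Reuse is only sound if both flavors name the same key type.
static_assert(std::is_same<MegaFlavor::VerificationKey, MegaZKFlavor::VerificationKey>::value,
              "hiding VK reuse requires identical verification key types");

// Counts how often the expensive path runs (illustration only).
static int commit_rounds = 0;

// Stand-in for the slow path that recommits to every precomputed polynomial.
std::shared_ptr<MegaZKFlavor::VerificationKey> recompute_vk()
{
    ++commit_rounds;
    return std::make_shared<MegaZKFlavor::VerificationKey>();
}

// Reuse the caller-supplied key when present; otherwise fall back to reconstruction.
std::shared_ptr<MegaZKFlavor::VerificationKey> make_hiding_vk(
    const std::shared_ptr<MegaFlavor::VerificationKey>& precomputed_vk)
{
    if (precomputed_vk != nullptr) {
        return precomputed_vk; // no commitments recomputed
    }
    std::shared_ptr<MegaZKFlavor::VerificationKey> vk = recompute_vk();
    assert(vk != nullptr);
    return vk;
}
```

The static_assert is what turns the "types are identical" claim into a compile-time guarantee: if the ZK flavor ever defines its own key type, the reuse path stops compiling instead of silently sharing the wrong key.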
Cycles are disjoint by construction of the generalized permutation argument (every (gate_idx, wire_idx) position belongs to exactly one variable, hence to exactly one cycle), so per-(col, row) writes never alias across cycles. Drop the serial outer loop for parallel_for_heuristic over cycle_idx without any thread-local staging or merge step.

Zone wall drops:
- WASM: compute_permutation_mapping 186/63/129 ms -> 112/40/80 ms (-36% to -40%)
- Native: compute_permutation_mapping 110/33/71 ms -> 73/20/44 ms (-34% to -40%)

VK pinning short hash d519f639 unchanged; chonk + ultra_honk tests pass.
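A small sketch of why no staging is needed: if every (wire_idx, gate_idx) position appears in exactly one cycle, threads that each own a subset of cycles write to disjoint table entries. The `Node`, `Cycle` and `Mapping` types, the "map each node to its successor" convention, and the strided `std::thread` fan-out are assumptions for illustration, not the real compute_permutation_mapping.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical position in the execution trace.
struct Node {
    std::uint32_t wire_idx;
    std::uint32_t gate_idx;
};
using Cycle = std::vector<Node>;
using Mapping = std::vector<std::vector<Node>>; // mapping[wire_idx][gate_idx] = successor in the cycle

// Assumes each (wire_idx, gate_idx) occurs in exactly one cycle, so writes from
// different cycles never target the same mapping entry.
void compute_mapping_parallel(const std::vector<Cycle>& cycles, Mapping& mapping, std::size_t num_threads)
{
    assert(num_threads > 0);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&cycles, &mapping, t, num_threads] {
            // Thread t handles cycles t, t + num_threads, t + 2 * num_threads, ...
            for (std::size_t c = t; c < cycles.size(); c += num_threads) {
                const Cycle& cycle = cycles[c];
                for (std::size_t k = 0; k < cycle.size(); ++k) {
                    const Node& cur = cycle[k];
                    const Node& next = cycle[(k + 1) % cycle.size()];
                    mapping[cur.wire_idx][cur.gate_idx] = next; // direct write, no merge step
                }
            }
        });
    }
    for (std::thread& w : workers) {
        w.join();
    }
}
```

Because the result at each entry depends only on the cycle that owns it, the output is the same for any thread count, which is consistent with the VK hash staying unchanged.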
Replace the serial block loop in populate_wires_and_selectors_and_compute_copy_cycles with parallel_for over blocks.get(). Each worker writes wires and selectors directly (disjoint row ranges per block by construction of trace_offset) and accumulates copy-cycle emissions into a thread-local flat list of (real_var_idx, cycle_node) pairs. A serial concat pass preserves block order so compute_permutation_mapping -> VK bytes stay deterministic.

Zone wall drops:
- WASM: construct_trace_data 1155/405/788 ms -> 886/311/585 ms (-23% to -26%)
- Native: ~flat (dispatch + alloc overhead roughly cancels the parallel win on already-fast memcpy)

Below-prediction outcome (WASM 0.5-1.1% vs predicted 2.0-2.4%, native near-zero vs 2.6%): ceiling capped by Amdahl on unequal block sizes.

VK pinning d519f639 unchanged; chonk + ultra_honk tests pass.
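The "parallel fill, then serial concat in block order" structure can be sketched as below. This is a simplified illustration, not the real function: it uses one thread per block and a per-block list in place of true thread-local lists, a single wire column, and a hypothetical `Emission` pair of (real_var_idx, row) standing in for (real_var_idx, cycle_node).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical trace block: its rows start at trace_offset in the full trace.
struct Block {
    std::size_t trace_offset;
    std::vector<std::uint32_t> var_indices; // one variable index per row
};
using Emission = std::pair<std::uint32_t, std::size_t>; // (real_var_idx, row)

// Fills the wire column in parallel and returns copy-cycle emissions in block order.
// Assumes the blocks' row ranges do not overlap and fit inside `wire`.
std::vector<Emission> populate_blocks(const std::vector<Block>& blocks, std::vector<std::uint32_t>& wire)
{
    std::vector<std::vector<Emission>> per_block(blocks.size());
    std::vector<std::thread> workers;
    for (std::size_t b = 0; b < blocks.size(); ++b) {
        workers.emplace_back([&blocks, &wire, &per_block, b] {
            const Block& block = blocks[b];
            for (std::size_t i = 0; i < block.var_indices.size(); ++i) {
                const std::size_t row = block.trace_offset + i;
                assert(row < wire.size());
                wire[row] = block.var_indices[i];                     // disjoint rows per block
                per_block[b].emplace_back(block.var_indices[i], row); // private to this worker
            }
        });
    }
    for (std::thread& w : workers) {
        w.join();
    }
    // Serial concat in block order: the result does not depend on thread scheduling.
    std::vector<Emission> all;
    for (const std::vector<Emission>& local : per_block) {
        all.insert(all.end(), local.begin(), local.end());
    }
    return all;
}
```

Concatenating by block index rather than by completion order is what keeps the downstream output deterministic.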
danielntmd pushed a commit to danielntmd/aztec-packages that referenced this pull request on May 6, 2026
BEGIN_COMMIT_OVERRIDE
fix(ci): default S3_BUILD_CACHE_AWS_PARAMS in cache_s3_transfer{,_to} (AztecProtocol#22898)
chore: low-hanging chonk prover fixes from profiling (AztecProtocol#22855)
chore: fuse N `add_scaled` into one `parallel_for` (AztecProtocol#22893)
feat: Delayed merge implementation (AztecProtocol#22775)
chore: numeric audit response (AztecProtocol#22856)
fix: harden BN254 G2 SRS ingress (AztecProtocol#22858)
fix: remove unused hash_challenge variable in batch_merge.test.cpp (AztecProtocol#22906)
fix(bbup): remove jq dependency (AztecProtocol#22912)
chore: fix g2 test failing on merge-train (AztecProtocol#22920)
fix(ci): error on disabled-cache in CI hash calculation (AztecProtocol#22904)
END_COMMIT_OVERRIDE
A handful of profiling-driven trims for Chonk client-IVC. Each commit targets one zone, no behaviour change.
Per-commit gains
Each row is the commit applied alone on top of `merge-train/barretenberg`, measured on `ecdsar1+transfer_1_recursions+sponsored_fpc`, 16-thread remote bench, single sample per build.

[Table: per-commit timings. Rows cover `Chonk::accumulate_hiding_kernel`; `construct_trace_data` over trace blocks¹; `compute_permutation_mapping` cycle loop. Values not preserved.]

¹ W3 splits the work into two parallel phases: Phase 1 fans out per-block (wires + copy-cycle node emission), Phase 2 fans out over a flattened `(block, selector)` task list so the threadpool can load-balance selector filling across blocks regardless of per-block size skew. The single-pass-per-block structure that the original W3 used was WASM-only; the flattened selector phase is what unlocks the native gain.

² End-to-end numbers below were measured with the original W4 commit included; subtract W4's per-commit Δ (−17 ms native, −71 ms WASM) for the W4-less stack, or roll #22893 in for the up-to-date total.
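As a rough illustration of Phase 2 in footnote ¹, the sketch below flattens all (block, selector) pairs into one task list and lets worker threads pull tasks from a shared counter, so a large block's selectors can be spread over several threads. The `Block` layout, `fill_selectors`, and the atomic-counter scheduling are assumptions for illustration; the actual code hands the task list to the threadpool.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical trace block: selectors[s] holds this block's values for selector s.
struct Block {
    std::size_t trace_offset;
    std::vector<std::vector<int>> selectors;
};

// Copies every block's selector values into the full-trace selector columns.
// Assumes block row ranges are disjoint and fit within each column.
void fill_selectors(const std::vector<Block>& blocks, std::vector<std::vector<int>>& columns, std::size_t num_threads)
{
    assert(num_threads > 0);
    // Flatten: one task per (block, selector) pair instead of one task per block.
    std::vector<std::pair<std::size_t, std::size_t>> tasks;
    for (std::size_t b = 0; b < blocks.size(); ++b) {
        for (std::size_t s = 0; s < blocks[b].selectors.size(); ++s) {
            tasks.emplace_back(b, s);
        }
    }
    std::atomic<std::size_t> next_task{ 0 };
    auto worker = [&] {
        for (;;) {
            const std::size_t t = next_task.fetch_add(1); // claim the next unprocessed task
            if (t >= tasks.size()) {
                return;
            }
            const Block& block = blocks[tasks[t].first];
            const std::size_t s = tasks[t].second;
            const std::vector<int>& src = block.selectors[s];
            // Distinct (block, selector) pairs write to distinct (column, row range) targets.
            std::copy(src.begin(), src.end(), columns[s].begin() + static_cast<std::ptrdiff_t>(block.trace_offset));
        }
    };
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < num_threads; ++i) {
        workers.emplace_back(worker);
    }
    for (std::thread& w : workers) {
        w.join();
    }
}
```

With per-block tasks, one oversized block bounds the wall time; with per-(block, selector) tasks the largest unit of work is a single selector column of that block.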
End-to-end (full stack on top of baseline)
[Table: end-to-end timings. Rows cover `Chonk::accumulate` (×11 per proof); `ChonkAPI::prove` (full E2E). Values not preserved.]

A fifth change (fold ECCVM masking poly into wire batch) was prototyped but didn't show measurable impact under the new "masking at top of trace" model, so it was dropped.
The original W4 (fuse N `add_scaled` in `HypernovaFoldingProver::batch_polynomials`) has been split out into #22893, where it's extracted into a shared `Polynomial::add_scaled_batch` helper and applied to the PCS poly batcher and AVM prover.

Test plan
- `chonk_tests` (33/33), `eccvm_tests` (44/44), `ultra_honk_tests` (283/283), `hypernova_tests` (9/9) green locally
- `/profile-chonk` on remote bench machine