fix(bb-prover): pool long-lived bb verifier processes instead of spawning per-call #23093

Merged
spalladino merged 6 commits into merge-train/spartan from claudebox/pool-bb-verifier-instances
May 11, 2026

@AztecBot AztecBot commented May 8, 2026

Collaborator

Why

Follow-up to #21564 (bb-prover bb.js migration) addressing the IVC verification perf regression that surfaced in tx_stats_bench.

The migration kept the legacy spawn-per-verification model: every chonk/ultra-honk verification through BBCircuitVerifier spawned a fresh bb process and SIGTERMed it after one proof. BB_NUM_IVC_VERIFIERS=8 only capped concurrency at the queue layer (QueuedIVCVerifier), not the number of bb processes.

That made the bench spawn ~600 bb processes over its 60s 10 TPS phase inside an 8-CPU isolate. Two compounding problems:

  1. ~50–100 ms of bb startup tax on every verification's hot path.
  2. The bind→listen race in NativeUnixSocket: bb's socket file appears after bind() but before listen(). A TS connect() landing in that window gets ECONNREFUSED. Vanishingly rare under low load; reliable flake under contention. Diagnosis at http://ci.aztec-labs.com/735256f13a268733.

What

Make BB_NUM_IVC_VERIFIERS mean what its name says (commits aa99817, 0f4cb77)

Pool of long-lived bb verifier processes instead of fresh-per-call. The factory class is renamed from BBJsProverFactory to BBJsFactory (it's used for both proving and verifying) and given a single getInstance(): Promise<BBJsApi & AsyncDisposable> method:

  • new BBJsFactory(path) → no pool. Every getInstance() spawns a fresh bb that is destroyed on dispose. Same as the previous withFreshInstance behaviour — used by BBNativeRollupProver, the AVM proving tester, and ivc-integration helpers, so their semantics are unchanged.
  • new BBJsFactory(path, { poolSize: N }) → pool of N long-lived bb processes, lazily spawned on first acquire. Used by BBCircuitVerifier with poolSize: numConcurrentIVCVerifiers.

Callers use await using inst = await factory.getInstance() for RAII-style release, matching the codebase's preference for AsyncDisposable. BBCircuitVerifier.stop (already wired through to aztec-node shutdown) tears the pool down.
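
As a rough illustration of the two modes described above, here is a minimal sketch of a pool-or-fresh factory. It is not the BBJsFactory code from this PR: `Instance`, `Lease`, `InstanceFactory` and the injected `spawn` function are hypothetical stand-ins, and the sketch uses an explicit `release()` where the PR ties release to AsyncDisposable and `await using`.

```typescript
// Hypothetical stand-in for a bb process handle (the real BBJsApi is not shown here).
interface Instance {
  id: number;
  destroy(): Promise<void>;
}

// A borrowed instance plus the function that hands it back (or destroys it).
interface Lease {
  instance: Instance;
  release(): Promise<void>;
}

class InstanceFactory {
  private readonly spawn: () => Promise<Instance>;
  private readonly poolSize: number | undefined;
  private idle: Instance[] = [];
  private waiters: Array<(inst: Instance) => void> = [];
  private init: Promise<void> | undefined;

  constructor(spawn: () => Promise<Instance>, poolSize?: number) {
    this.spawn = spawn;
    this.poolSize = poolSize;
  }

  async getInstance(): Promise<Lease> {
    const size = this.poolSize;
    if (!size) {
      // No pool: spawn a fresh instance per call and destroy it on release.
      const fresh = await this.spawn();
      return { instance: fresh, release: () => fresh.destroy() };
    }
    // Pool: spawn all instances lazily on first acquire; concurrent callers share one init.
    this.init ??= this.fill(size);
    await this.init;
    // Take an idle instance, or wait until another caller releases one.
    const instance =
      this.idle.pop() ?? (await new Promise<Instance>(resolve => this.waiters.push(resolve)));
    let released = false;
    return {
      instance,
      release: async () => {
        if (released) return; // releasing twice is a no-op
        released = true;
        const next = this.waiters.shift();
        if (next) next(instance); // hand straight to the longest-waiting caller
        else this.idle.push(instance);
      },
    };
  }

  // Promise.all keeps the sketch short; it does not clean up after a partial spawn failure.
  private async fill(size: number): Promise<void> {
    const spawned = await Promise.all(Array.from({ length: size }, () => this.spawn()));
    this.idle.push(...spawned);
  }

  // Destroys idle instances only; the sketch assumes all leases were released first.
  async stop(): Promise<void> {
    await Promise.all(this.idle.splice(0).map(inst => inst.destroy()));
  }
}
```

In this sketch the number of live processes is bounded by `poolSize` itself, and a caller that arrives while every instance is borrowed simply waits for the next release.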

Close the bind→listen race in bb.js (commit 8e519b0)

barretenberg/ts/src/bb_backends/node/native_socket.ts: retry connect() on ECONNREFUSED with exponential backoff (capped at 50 ms) up to the existing 5 s budget. Other socket errors fail fast as before. Pool startup still spawns N bb processes in parallel, so the race surface is reduced from ~600 to N — the retry handles the residual.
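
The retry policy described above can be sketched as follows. This is an illustrative helper, not the code in native_socket.ts: `connectWithRetry`, `RetryOptions` and the injected `attempt` callback are assumptions made for the example, and the 1 ms starting delay is a guess. Only the 50 ms cap and the 5 s budget come from the description.

```typescript
// Illustrative options; the 50 ms cap and 5 s budget are the values quoted in the PR text.
interface RetryOptions {
  initialDelayMs: number;
  maxDelayMs: number;
  budgetMs: number;
}

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// Node socket errors carry a string `code`; ECONNREFUSED is what the bind-to-listen window produces.
function isConnRefused(err: unknown): boolean {
  return typeof err === "object" && err !== null && (err as { code?: unknown }).code === "ECONNREFUSED";
}

// `attempt` is a hypothetical callback that performs one connect() and rejects on failure.
async function connectWithRetry<T>(
  attempt: () => Promise<T>,
  opts: RetryOptions = { initialDelayMs: 1, maxDelayMs: 50, budgetMs: 5000 },
): Promise<T> {
  const deadline = Date.now() + opts.budgetMs;
  let delay = opts.initialDelayMs;
  for (;;) {
    try {
      return await attempt();
    } catch (err) {
      // Retry only ECONNREFUSED, and only while the next sleep still fits in the budget.
      // Every other error fails fast.
      if (!isConnRefused(err) || Date.now() + delay > deadline) throw err;
      await sleep(delay);
      delay = Math.min(delay * 2, opts.maxDelayMs); // exponential backoff with a cap
    }
  }
}
```

Capping the delay keeps the worst-case added latency per retry small once the server starts listening, while the overall budget still bounds how long a caller can be stuck on a socket that never comes up.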

Server-side Chonk proof split (commit 97577cf)

splitChonkProofToStructured in TS had three hand-maintained constants (MERGE_PROOF_SIZE, ECCVM_PROOF_LENGTH, JOINT_PROOF_LENGTH) duplicating C++ values. When C++ shifted Chonk layout (e.g. databus relation changes shrinking the oink portion in the previous round of regressions), these went stale and verification failed deep in the verifier with an opaque "OinkVerifier: num_public_inputs mismatch with VK".

Add a new ChonkVerifyFromFields bbapi command that takes a flat Vec<bb::fr> and calls ChonkProof::from_field_elements server-side, then runs the verifier. The TS layer now passes flat fields straight through — no layout knowledge, no hand-maintained constants.

  • bbapi_chonk.{hpp,cpp}: new struct + execute().
  • bbapi_execute.hpp: register the variant.
  • bb_js_backend.ts: verifyChonkProof calls the new API; splitChonkProofToStructured and the 3 constants are deleted.

Disposal robustness (commit 5cde220)

The first cut of BBJsFactory had three .catch(() => {}) clauses that silently swallowed bb destroy() errors, and an initPool() that dropped already-spawned bb children if a sibling creation failed (Promise.all short-circuit). Both would manifest as the Jest "worker failed to exit gracefully" warning we hit on one test run.

Now: destroy errors propagate (AggregateError for the pool path); initPool uses allSettled and tears down anything it spawned if any sibling rejects.
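
A minimal sketch of the allSettled teardown pattern, assuming a hypothetical `Child` interface and an injected `create` function (neither is taken from the PR). Throwing an AggregateError from the init path is a choice made for this sketch; the description only states that the pool destroy path aggregates.

```typescript
// Hypothetical handle for a spawned child process.
interface Child {
  destroy(): Promise<void>;
}

async function initPool<T extends Child>(size: number, create: () => Promise<T>): Promise<T[]> {
  // allSettled waits for every creation to finish, so no spawned child is lost
  // when a sibling rejects (Promise.all would reject on the first failure).
  const results = await Promise.allSettled(Array.from({ length: size }, () => create()));
  const spawned: T[] = [];
  const failures: unknown[] = [];
  for (const r of results) {
    if (r.status === "fulfilled") spawned.push(r.value);
    else failures.push(r.reason);
  }
  if (failures.length === 0) return spawned;

  // At least one creation failed: tear down everything that did start,
  // and surface teardown errors alongside the creation errors.
  const teardown = await Promise.allSettled(spawned.map(child => child.destroy()));
  for (const t of teardown) {
    if (t.status === "rejected") failures.push(t.reason);
  }
  throw new AggregateError(failures, `initPool failed with ${failures.length} error(s)`);
}
```

The point of the pattern is that the function either returns a full pool or leaves nothing running behind it.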

Playground bundle size (commit 1681d33)

The new ChonkVerifyFromFields bbapi variant tipped the playground main entrypoint over the 1750 KB hard limit. Bumped to 1800 with a bump-log entry.

Effect

  • tx_stats_bench: 600 bb spawns → 8 bb spawns at boot, then 8 long-lived processes serve every verification. The bind→listen race surface drops 75×, and the residual is handled by the connect retry. Per-call ~50–100 ms bb startup cost disappears from the verifier hot path.
  • Brittle TS Chonk constants are gone — Chonk layout changes in C++ can no longer manifest as opaque verifier errors in TS.
  • Disposal failures surface instead of leaking bb children.
  • Behaviour for proving paths (BBNativeRollupProver, AVM tests, ivc-integration) is unchanged — they still spawn fresh per call.

ClaudeBox log: https://claudebox.work/s/2d65052b0deaeab2?run=3

@AztecBot added the ci-draft (Run CI on draft PRs) and claudebox (Owned by claudebox, it can push to this PR) labels May 8, 2026
charlielye and others added 4 commits May 8, 2026 13:30
The socket file appears after bb's bind() but before its listen(). A connect()
landing in that window returns ECONNREFUSED. The previous code rejected the
first attempt; under contention (e.g. tx_stats_bench spawning ~600 bb processes
on an 8-CPU isolate) the race window stretches and ECONNREFUSED becomes a
reliable flake.

Retry connect() with exponential backoff (capped at 50ms) while ECONNREFUSED
keeps coming, up to the existing 5s budget. Other socket errors fail fast as
before. With this in place the bind→listen race is no longer observable
regardless of how many bb spawns happen in parallel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-or-fresh API

Old API exposed two callback-style methods (withFreshInstance / withVerifierInstance)
plus startVerifierPool / stopVerifierPool, with the prover-vs-verifier distinction
hardcoded into the surface. The class also handled both proving and verifying
despite the "Prover" name, which made it harder to reason about.

New API: a single getInstance() returning BBJsApi & AsyncDisposable.
- No poolSize set: getInstance() spawns a fresh bb that is destroyed on dispose
  (same behaviour as the old withFreshInstance).
- poolSize: N: pool of N long-lived bb processes; getInstance() borrows from it
  and dispose returns it. Pool is lazily spawned on first acquire.

Callers use `await using inst = await factory.getInstance()` for RAII-style
release, matching the codebase's preference for AsyncDisposable.

BBCircuitVerifier passes poolSize: numConcurrentIVCVerifiers (or undefined when
0) — same intent as the previous startVerifierPool path. Every other caller
(BBNativeRollupProver, AvmProvingTester, prove_native.ts) leaves poolSize unset,
preserving the prior fresh-per-call behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mFields

splitChonkProofToStructured had three hand-maintained constants (MERGE_PROOF_SIZE,
ECCVM_PROOF_LENGTH, JOINT_PROOF_LENGTH) duplicating C++ values. When C++ shifted
the Chonk proof layout (e.g. databus relation changes shrinking the oink portion),
these constants went stale and chonkVerify failed deep in the verifier with an
opaque "OinkVerifier: num_public_inputs mismatch with VK" error.

Add a new ChonkVerifyFromFields bbapi command that takes a flat Vec<bb::fr> and
calls ChonkProof::from_field_elements server-side, then runs the verifier.
The TS layer now passes the flat fields straight through — no layout knowledge,
no hand-maintained constants.

- bbapi_chonk.{hpp,cpp}: new ChonkVerifyFromFields struct + execute().
- bbapi_execute.hpp: register the command in the Command and CommandResponse unions.
- bb_js_backend.ts: BBJsApi.verifyChonkProof now calls api.chonkVerifyFromFields;
  splitChonkProofToStructured and the 3 constants are deleted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous BBJsFactory swallowed bb destroy() errors with .catch(() => {}) in
three disposal paths (makeOwned, makeBorrowed, destroy). A bb child that failed
to shut down cleanly would silently linger and Jest would later complain about a
worker that "failed to exit gracefully", which is exactly the symptom we hit on
one test run after the rename.

Also fix a partial-failure leak in initPool: the previous Promise.all rejected
on first createInstance() failure but other already-spawned bb children were
dropped on the floor. Switch to allSettled so we destroy them.

Disposal failures now surface (single bb destroy errors propagate; pool destroy
aggregates via AggregateError) instead of being silently swallowed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@charlielye added the ci-full (Run all master checks) label May 8, 2026
The new ChonkVerifyFromFields bbapi variant added to bb.js bindings tips the
playground main entrypoint just over the 1750 KB hard limit (1750.02 KB).
Bumping to 1800 to give 50 KB of headroom for further bbapi additions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@charlielye charlielye marked this pull request as ready for review May 11, 2026 08:10
@charlielye charlielye requested a review from Thunkar as a code owner May 11, 2026 08:10
@charlielye charlielye requested review from spalladino and removed request for Thunkar May 11, 2026 09:32

@spalladino spalladino left a comment

Contributor

LGTM

@spalladino spalladino merged commit 6338c3f into merge-train/spartan May 11, 2026
25 checks passed
@spalladino spalladino deleted the claudebox/pool-bb-verifier-instances branch May 11, 2026 12:26
rangozd pushed a commit to rangozd/aztec-packages that referenced this pull request May 16, 2026
BEGIN_COMMIT_OVERRIDE
fix(test): warp L1 forward when proposer scan hits EpochNotStable
(AztecProtocol#22967)
test(e2e): fail epochs tests on proposer-rollup-check-failed (AztecProtocol#22965)
fix: grafana switch to aztec_status="proposed" (AztecProtocol#22978)
chore: update benchmark scraper (AztecProtocol#22984)
test(e2e): migrate simple epoch tests to pipelining (AztecProtocol#22973)
chore: remove top-level yarn.lock (AztecProtocol#22987)
refactor(archiver)!: unify L2BlockSource checkpoint lookups via query
objects (AztecProtocol#22933)
fix(sequencer): bounded sweep instead of event scan for governance
proposal check (AztecProtocol#22989)
fix(docs): allow webapp-tutorial yarn install to populate empty lockfile
in CI (AztecProtocol#23000)
test(e2e): enable pipelining in l1-reorgs and mbps redistribution tests
(AztecProtocol#23009)
fix(archiver): restore pending block height metric under pipelining
(AztecProtocol#22994)
chore(p2p): remove skipped validation result option (AztecProtocol#23034)
refactor(p2p)!: remove slow tx collection flow (AztecProtocol#22878)
chore(spartan): add next-net-clone environment config (AztecProtocol#22995)
chore(sequencer): add context to proposer-rollup-check-failed logs
(AztecProtocol#23071)
test(e2e): wait for archiver sync before asserting pipelining (AztecProtocol#22997)
refactor(node-rpc)!: remove deprecated AztecNode methods and
L2BlockSource tip helpers (AztecProtocol#22934)
feat(p2p): detect and track announce IP changes at runtime (AztecProtocol#22405)
test: mark tx_stats_bench 10 TPS as flake-retryable on
merge-train/spartan (AztecProtocol#23083)
fix(sequencer): bind vote-only multicalls to target slot under
pipelining (AztecProtocol#23090)
feat(sequencer): build optimistically across pruning epoch boundary
(AztecProtocol#23056)
fix(sequencer): use chainTipsOverride.pending for log context (AztecProtocol#23098)
test(e2e): relax post-boundary slot assertion in
epochs_proof_at_boundary (AztecProtocol#23108)
fix(bb-prover): pool long-lived bb verifier processes instead of
spawning per-call (AztecProtocol#23093)
fix(sequencer): anchor fee asset price modifier to predicted parent
(AztecProtocol#23113)
chore: error log when L1 head timestamp drifts (AztecProtocol#22947)
fix(sequencer): override full parent checkpoint cell in pipelined
simulation (AztecProtocol#23073)
test(e2e): enable pipelining on missed l1 slot test (AztecProtocol#23068)
fix: more robust metrics reporting in IRM monitor (AztecProtocol#23038)
fix: preserve LMDB slashing protection (AztecProtocol#23145)
test(e2e): enable pipelining on p2p tests (AztecProtocol#23070)
fix(archiver): move L2 tips cache refresh out of write transactions
(AztecProtocol#23110)
test(e2e): fix data_withholding_slash flake by freezing L1 across
restart (AztecProtocol#23162)
fix(validator): include proposed checkpoint out-hashes when validating
checkpoint proposals (AztecProtocol#23119)
refactor(config): drop nested config option, flatten l1Contracts
(AztecProtocol#23143)
test(e2e): bump bash TIMEOUT for e2e_p2p/add_rollup to match jest 20m
(AztecProtocol#23177)
fix(p2p): chunk archive of mined txs on block finalization (A-969)
(AztecProtocol#23085)
fix(p2p): stream tx pool hydration to bound startup memory (A-968)
(AztecProtocol#23086)
chore: remove orphan --archiver flag usages from start invocations
(AztecProtocol#23186)
feat(ci): daily merge-train/spartan stale-PR notifier (AztecProtocol#23189)
fix: preserve contract artifact permissions (AztecProtocol#23174)
fix(ci3): accept slashes in /list/<path:key> for merge-train
history (AztecProtocol#23160)
feat(ci): route merge-train/spartan flake notifications to
#team-alpha-ci (AztecProtocol#23219)
fix(cheat-codes): wait for post-warp L2 block in warpL2TimeAtLeastTo
(AztecProtocol#23213)
feat: slash attesters signing over bad checkpoints (AztecProtocol#23180)
refactor(prover-client): split orchestrator into sub-tree + top-tree
pair (AztecProtocol#22996)
fix(srs): retry transient CRS HTTP downloads with exponential backoff
(AztecProtocol#23244)
refactor(p2p): remove old reqresp mode (AztecProtocol#23158)
docs(sequencer-client): rewrite top-level and timing READMEs (AztecProtocol#23149)
fix(aztec-node): include upcoming checkpoint's L1 to L2 messages in
simulatePublicCalls (AztecProtocol#23163)
END_COMMIT_OVERRIDE