feat: migrate bb-prover from direct binary execution to bb.js API#21564
Conversation
|
Hey Charlie, this looks nice, just one comment for the AVM. We regularly use the following flow for all kinds of debugging, testing, benchmarking and iterating: we run some test (e.g., the avm bulk test). This test writes the avm inputs file and outputs the exact bb-avm command that you'd need to run. After that, we can just use that command (and that file) and operate 100% on the CLI/C++ side. We can then easily change the passed environment variables, concurrency, etc. IIUC, these changes would break that because the files wouldn't be written? |
This should now be covered by setting a |
613e654 to
970f34c
Compare
fcarreiro
left a comment
There was a problem hiding this comment.
LGTM in general except for some changes in vm2/api. I can review again after that is solved!
There was a problem hiding this comment.
I think the additions to this file should instead be in BB's avm_api. In particular the vm2/api should not use or know of the result types defined in api_avm (all that is one level higher).
There was a problem hiding this comment.
Pulled the stats reset/snapshot up into bbapi_avm.cpp (linked it against vm2_sim so it can hit Stats::get() directly).
The byte-based functions (avm_prove_from_bytes etc.) are still in vm2/api_avm.cpp, moveing those needs bbapi to be AVM-conditional since AvmAPI::prove only exists in AVM-on builds.
There was a problem hiding this comment.
Ok I see. I just don't like that vm2/api_avm depends on types and files from a level up (and doesn't even define part of the implementation functions in its own hpp). It's not great but given all that this PR needs to achieve, it's probably ok and can be refactored later if needed.
If avm folks are ok with it I'm ok with it :)
There was a problem hiding this comment.
Agree not ideal. Refactor in follow up PR will be best.
| this.logger.error(`Proof generation failed: ${proofRes.reason}`); | ||
| const { proof, verified, stats } = await this.bbJsFactory.withFreshInstance(i => i.generateAvmProof(inputsBuffer)); | ||
| this.recordProverMetrics(stats, txLabel); | ||
| // bb.js also verifies right after proving; treat a failed inner verification as a prove failure. |
There was a problem hiding this comment.
Interesting. Is this a bb.js thing or is it because our vm2/api actually does prove and verify? We added that some time ago but we thought we might remove it at some point.
There was a problem hiding this comment.
It was just mimicking the avm_prove call which does a verify. But it's redundant because all call sites do their own verify checks. I guess cli did it to provide non zero exit code. Have removed the redundent verify from the api call.
There was a problem hiding this comment.
Yes I don't remember now why we added that verify. The thing is that prove never fails (unless you crash), so you always generate a proof, however it may not verify later. Right now I don't remember which code path needed it. Maybe git blame helps. I'll let @IlyasRidhuan decide
| * The CLI always uses `--scheme ultra_honk`; flavors are expressed via | ||
| * `--oracle_hash` and `--ipa_accumulation`. | ||
| */ | ||
| function getCliFlags(flavor: UltraHonkFlavor): string { |
There was a problem hiding this comment.
Do we have something that incentivizes people to keep this (and other CLI code) in sync if they change the CLI itself?
There was a problem hiding this comment.
Ah nice, the test for this file executes the cli so that should be more than enough.
There was a problem hiding this comment.
Ok I see. I just don't like that vm2/api_avm depends on types and files from a level up (and doesn't even define part of the implementation functions in its own hpp). It's not great but given all that this PR needs to achieve, it's probably ok and can be refactored later if needed.
If avm folks are ok with it I'm ok with it :)
## Summary Migrates all `bb-prover` server-side proving and verification from spawning `bb`/`bb-avm` binaries via `child_process.spawn()` to using the bb.js TypeScript API (`Barretenberg` class via `NativeUnixSocket` backend). This eliminates temporary file I/O (writing/reading bytecode, witnesses, proofs, VKs to disk) and stdout-based result parsing. A fresh bb process is still spawned per proof to maintain parallelism, but communication happens via Unix socket instead of files. ### What was migrated | Operation | Before | After | |---|---|---| | UltraHonk proving | `bb prove` via spawn | `api.circuitProve()` via bb.js | | UltraHonk verification | `bb verify` via spawn | `api.circuitVerify()` via bb.js | | Chonk (IVC) verification | `bb verify --scheme chonk` via spawn + temp files | `api.chonkVerify()` via bb.js | | AVM proving | `bb-avm avm_prove` via spawn + temp files | `api.avmProve()` via bb.js | | AVM verification | `bb-avm avm_verify` via spawn + temp files | `api.avmVerify()` via bb.js | | AVM circuit check | `bb-avm avm_check_circuit` via spawn + temp files | `api.avmCheckCircuit()` via bb.js | | Gate counting | `bb gates` via spawn | `api.circuitStats()` / `api.chonkStats()` via bb.js | | Solidity contract gen | `bb contract` via spawn | `api.circuitWriteSolidityVerifier()` via bb.js | ### Key changes - **New `BBJsInstance` / `BBJsProverFactory`** (`bb_js_backend.ts`): Adapter layer wrapping the `Barretenberg` class with typed methods for all proof operations. - **New `DebugBBJsInstance`** (`bb_js_debug.ts`): Debug wrapper that writes inputs/outputs to disk and logs equivalent CLI commands (see debug section below). - **C++ AVM msgpack commands** (`bbapi_avm.hpp/cpp`): Three bbapi command structs (`AvmProve`, `AvmVerify`, `AvmCheckCircuit`) with in-memory function implementations in `vm2/api_avm.cpp` and stub implementations in `vm2_stub/api_avm.cpp`. - **AVM per-stage timings over the API**: New `AvmStat` serialized struct, plus a `stats: AvmStat[]` field on `AvmProve::Response` and `AvmCheckCircuit::Response`. `Stats::snapshot()` in `barretenberg/vm2/tooling` returns the per-stage map; `avm_prove_from_bytes` resets the singleton at entry and snapshots on exit. This replaces the legacy stdout-scraping used by `avm_proving_tester.ts` for per-stage timings, which now reads the `stats` array directly. - **Chonk proof splitting** (`splitChonkProofToStructured`): Mirrors C++ `ChonkProof::from_field_elements` — derives the `hidingOinkProof` length as the remainder after subtracting the four fixed-size sub-proofs (`MERGE_PROOF_SIZE`, `ECCVM_PROOF_LENGTH`, `IPA_PROOF_LENGTH`, `JOINT_PROOF_LENGTH`). Removes coupling to `HIDING_OINK_LENGTH` / `HIDING_KERNEL_IO_PUBLIC_INPUTS_SIZE` / `numCustomPublicInputs`, so changes in Mega witness entities can't silently misalign the TS split. - **Removed `bb-prover/src/bb/execute.ts`**: replaced by `file_names.ts` containing the three bb-CLI output filename constants (`VK_FILENAME`, `PROOF_FILENAME`, `PUBLIC_INPUTS_FILENAME`) that a few readers still need. The binary spawner (`executeBB`) is inlined into `ivc-integration/src/bb_js_debug.test.ts`, the only remaining consumer — a CLI-parity test that must spawn `bb` to compare its output against bb.js's. - **Test harnesses migrated to bb.js**: `avm_proving_tester.ts` (reads timings from the `stats` response field instead of scraping stdout) and `ivc-integration/src/prove_native.ts::proveAvm` (in-memory proof, no tempdir, no file I/O). ### Debug workflow with `BB_DEBUG_OUTPUT_DIR` Set the `BB_DEBUG_OUTPUT_DIR` environment variable to enable file-based debugging. When set, `DebugBBJsInstance` wraps all bb.js calls and: 1. Writes all inputs (bytecode, witness, VK, proof, etc.) to numbered subdirectories under the debug dir 2. Writes a `command.sh` script with the equivalent `bb` CLI command that can be re-run manually 3. Writes outputs (proof files, verification results) after each operation Example: ```bash BB_DEBUG_OUTPUT_DIR=/tmp/bb-debug yarn workspace @aztec/prover-client test # Then inspect /tmp/bb-debug/0001_prove_MyCircuit/, /tmp/bb-debug/0002_verify_MyCircuit/, etc. ``` Each subdirectory contains the input files and a `command.sh` that reproduces the operation using the `bb` CLI binary, useful for isolating failures outside of the TypeScript stack.
0407d1a to
ea1c9b6
Compare
…artan (#23083) ## Motivation The `verifies transactions at 10 TPS` sub-test of [`yarn-project/end-to-end/src/bench/tx_stats_bench.test.ts`](https://github.com/AztecProtocol/aztec-packages/blob/merge-train/spartan/yarn-project/end-to-end/src/bench/tx_stats_bench.test.ts) is now reliably flaking on the `bench all` step of `merge-train/spartan`. It has fired on at least two different merge-train commits hours apart, with no relation to either commit's diff: | Run | Triggering merge-train commit | CI log | |---|---|---| | [25546251580](https://github.com/AztecProtocol/aztec-packages/actions/runs/25546251580) | #22934 (refactor(node-rpc)! removing deprecated AztecNode methods) | http://ci.aztec-labs.com/1778227975844707 | | [25552992890](https://github.com/AztecProtocol/aztec-packages/actions/runs/25552992890) | #22405 (feat(p2p): detect and track announce IP changes at runtime) | http://ci.aztec-labs.com/1778237470322975 | Both runs hit the same assertion: ``` ● transaction benchmarks › verifies transactions at 10 TPS expect(received).toBe(expected) // Object.is equality Expected: true Received: false at bench/tx_stats_bench.test.ts:268 ``` Sub-test failing log on the latest run: http://ci.aztec-labs.com/ca459ca73d02002c (`bench all` parent: http://ci.aztec-labs.com/90616bad7bf7ebaa). The other three sub-tests in the suite (compression; single private verify x20 serial; single public verify x20 serial) pass cleanly against the same proven txs in both runs. The failure is in the stress sub-test that fires 600 IVC verifications at 10/s with 8 concurrent IVC verifiers (`BB_NUM_IVC_VERIFIERS=8`, `BB_IVC_CONCURRENCY=1`). At least one verification returns `valid: false` under load. ## Cause Neither triggering commit touches the IVC verifier path: - #22934 is a pure node-rpc surface refactor. - #22405 is p2p / discv5 ENR plumbing. The two failures sharing this signature across unrelated diffs is strong evidence that the flake is independent of the merge-train commit and stems from the bench infrastructure itself. The likely culprit is the recent bb-prover migration to the bb.js `NativeUnixSocket` backend (#21564), which spawns a fresh bb subprocess per Chonk verification via `withVerifierInstance`. Under 8x parallel verifications on the CPU-isolated bench host (each verifier requesting 16 threads, 8 × 16 = 128 threads on 56 isolated cores), transient verifier failures appear. The bench-output log shows continuous `bb.js - Received signal 15, shutting down gracefully...` traffic during the 10 TPS phase — verifier instances are being torn down rapidly, and at least one verification slips through with a stale/incomplete response. Because the serial sub-tests (`numIterations = 20` sequential) pass cleanly in both runs, this is a stress-only interaction, not a correctness regression. ## Approach Add `tx_stats_bench` to `.test_patterns.yml` with an `error_regex` anchored to the test file's stack-trace line (`tx_stats_bench.test.ts:<line>:<col>`), and assign `*charlie` as owner (author of the bb.js migration). With this entry, `ci3/run_test_cmd` retries the test once on failure and treats a single retry-pass as a flake instead of a hard fail, unblocking the merge train for unrelated commits while Charlie investigates the underlying concurrency interaction with the bb.js backend. The `error_regex` is intentionally narrow (file + line + column from the stack trace) so other ways tx_stats_bench could fail (timeout, OOM, infra) are still surfaced as hard fails. ## Changes - `.test_patterns.yml`: add a `tx_stats_bench` entry with an error_regex anchored to the test file's stack-trace line and `*charlie` as owner. ClaudeBox logs: - https://claudebox.work/s/6e7853d3a073145f?run=1 (initial diagnosis on #22934 failure) - https://claudebox.work/s/c12a360275f05ad3?run=1 (this update on #22405 recurrence)
…ning per-call (#23093) ## Why Follow-up to #21564 (bb-prover bb.js migration) addressing the IVC verification perf regression that surfaced in `tx_stats_bench`. The migration kept the legacy spawn-per-verification model: every chonk/ultra-honk verification through `BBCircuitVerifier` spawned a fresh `bb` process and SIGTERMed it after one proof. `BB_NUM_IVC_VERIFIERS=8` only capped concurrency at the queue layer (`QueuedIVCVerifier`), not the number of bb processes. That made the bench spawn ~600 bb processes over its 60s 10 TPS phase inside an 8-CPU isolate. Two compounding problems: 1. ~50–100 ms of `bb` startup tax on every verification's hot path. 2. The bind→listen race in `NativeUnixSocket`: bb's socket file appears after `bind()` but before `listen()`. A TS `connect()` landing in that window gets `ECONNREFUSED`. Vanishingly rare under low load; reliable flake under contention. Diagnosis at http://ci.aztec-labs.com/735256f13a268733. ## What ### Make `BB_NUM_IVC_VERIFIERS` mean what its name says (commits aa99817, 0f4cb77) Pool of long-lived bb verifier processes instead of fresh-per-call. The factory class is renamed `BBJsProverFactory` → `BBJsFactory` (it's used for both proving and verifying) and given a single `getInstance(): Promise<BBJsApi & AsyncDisposable>` method: - `new BBJsFactory(path)` → no pool. Every `getInstance()` spawns a fresh bb that is destroyed on dispose. Same as the previous `withFreshInstance` behaviour — used by `BBNativeRollupProver`, the AVM proving tester, and ivc-integration helpers, so their semantics are unchanged. - `new BBJsFactory(path, { poolSize: N })` → pool of N long-lived bb processes, lazily spawned on first acquire. Used by `BBCircuitVerifier` with `poolSize: numConcurrentIVCVerifiers`. Callers use `await using inst = await factory.getInstance()` for RAII-style release, matching the codebase's preference for `AsyncDisposable`. `BBCircuitVerifier.stop` (already wired through to aztec-node shutdown) tears the pool down. ### Close the bind→listen race in bb.js (commit 8e519b0) `barretenberg/ts/src/bb_backends/node/native_socket.ts`: retry `connect()` on `ECONNREFUSED` with exponential backoff (capped at 50 ms) up to the existing 5 s budget. Other socket errors fail fast as before. Pool startup still spawns N bb processes in parallel, so the race surface is reduced from ~600 to N — the retry handles the residual. ### Server-side Chonk proof split (commit 97577cf) `splitChonkProofToStructured` in TS had three hand-maintained constants (`MERGE_PROOF_SIZE`, `ECCVM_PROOF_LENGTH`, `JOINT_PROOF_LENGTH`) duplicating C++ values. When C++ shifted Chonk layout (e.g. databus relation changes shrinking the oink portion in the previous round of regressions), these went stale and verification failed deep in the verifier with an opaque "OinkVerifier: num_public_inputs mismatch with VK". Add a new `ChonkVerifyFromFields` bbapi command that takes a flat `Vec<bb::fr>` and calls `ChonkProof::from_field_elements` server-side, then runs the verifier. The TS layer now passes flat fields straight through — no layout knowledge, no hand-maintained constants. - `bbapi_chonk.{hpp,cpp}`: new struct + `execute()`. - `bbapi_execute.hpp`: register the variant. - `bb_js_backend.ts`: `verifyChonkProof` calls the new API; `splitChonkProofToStructured` and the 3 constants are deleted. ### Disposal robustness (commit 5cde220) The first cut of `BBJsFactory` had three `.catch(() => {})` clauses that silently swallowed bb `destroy()` errors, and an `initPool()` that dropped already-spawned bb children if a sibling creation failed (`Promise.all` short-circuit). Both would manifest as the Jest "worker failed to exit gracefully" warning we hit on one test run. Now: destroy errors propagate (`AggregateError` for the pool path); `initPool` uses `allSettled` and tears down anything it spawned if any sibling rejects. ### Playground bundle size (commit 1681d33) The new `ChonkVerifyFromFields` bbapi variant tipped the playground main entrypoint over the 1750 KB hard limit. Bumped to 1800 with a bump-log entry. ## Effect - `tx_stats_bench`: 600 bb spawns → 8 bb spawns at boot, then 8 long-lived processes serve every verification. The bind→listen race surface drops 75×, *and* the residual is handled by the connect retry. Per-call ~50–100 ms `bb` startup cost disappears from the verifier hot path. - Brittle TS Chonk constants are gone — Chonk layout changes in C++ can no longer manifest as opaque verifier errors in TS. - Disposal failures surface instead of leaking bb children. - Behaviour for proving paths (`BBNativeRollupProver`, AVM tests, ivc-integration) is unchanged — they still spawn fresh per call. ClaudeBox log: https://claudebox.work/s/2d65052b0deaeab2?run=3 --------- Co-authored-by: Charlie <5764343+charlielye@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Migrates all
bb-proverserver-side proving and verification from spawningbb/bb-avmbinaries viachild_process.spawn()to using the bb.js TypeScript API (Barretenbergclass viaNativeUnixSocketbackend). This eliminates temporary file I/O (writing/reading bytecode, witnesses, proofs, VKs to disk) and stdout-based result parsing. A fresh bb process is still spawned per proof to maintain parallelism, but communication happens via Unix socket instead of files.What was migrated
bb provevia spawnapi.circuitProve()via bb.jsbb verifyvia spawnapi.circuitVerify()via bb.jsbb verify --scheme chonkvia spawn + temp filesapi.chonkVerify()via bb.jsbb-avm avm_provevia spawn + temp filesapi.avmProve()via bb.jsbb-avm avm_verifyvia spawn + temp filesapi.avmVerify()via bb.jsbb-avm avm_check_circuitvia spawn + temp filesapi.avmCheckCircuit()via bb.jsbb gatesvia spawnapi.circuitStats()/api.chonkStats()via bb.jsbb contractvia spawnapi.circuitWriteSolidityVerifier()via bb.jsKey changes
BBJsInstance/BBJsProverFactory(bb_js_backend.ts): Adapter layer wrapping theBarretenbergclass with typed methods for all proof operations.DebugBBJsInstance(bb_js_debug.ts): Debug wrapper that writes inputs/outputs to disk and logs equivalent CLI commands (see debug section below).bbapi_avm.hpp/cpp): Three bbapi command structs (AvmProve,AvmVerify,AvmCheckCircuit) with in-memory function implementations invm2/api_avm.cppand stub implementations invm2_stub/api_avm.cpp.AvmStatserialized struct, plus astats: AvmStat[]field onAvmProve::ResponseandAvmCheckCircuit::Response.Stats::snapshot()inbarretenberg/vm2/toolingreturns the per-stage map;avm_prove_from_bytesresets the singleton at entry and snapshots on exit. This replaces the legacy stdout-scraping used byavm_proving_tester.tsfor per-stage timings, which now reads thestatsarray directly.splitChonkProofToStructured): Mirrors C++ChonkProof::from_field_elements— derives thehidingOinkProoflength as the remainder after subtracting the four fixed-size sub-proofs (MERGE_PROOF_SIZE,ECCVM_PROOF_LENGTH,IPA_PROOF_LENGTH,JOINT_PROOF_LENGTH). Removes coupling toHIDING_OINK_LENGTH/HIDING_KERNEL_IO_PUBLIC_INPUTS_SIZE/numCustomPublicInputs, so changes in Mega witness entities can't silently misalign the TS split.bb-prover/src/bb/execute.ts: replaced byfile_names.tscontaining the three bb-CLI output filename constants (VK_FILENAME,PROOF_FILENAME,PUBLIC_INPUTS_FILENAME) that a few readers still need. The binary spawner (executeBB) is inlined intoivc-integration/src/bb_js_debug.test.ts, the only remaining consumer — a CLI-parity test that must spawnbbto compare its output against bb.js's.avm_proving_tester.ts(reads timings from thestatsresponse field instead of scraping stdout) andivc-integration/src/prove_native.ts::proveAvm(in-memory proof, no tempdir, no file I/O).Debug workflow with
BB_DEBUG_OUTPUT_DIRSet the
BB_DEBUG_OUTPUT_DIRenvironment variable to enable file-based debugging. When set,DebugBBJsInstancewraps all bb.js calls and:command.shscript with the equivalentbbCLI command that can be re-run manuallyExample:
Each subdirectory contains the input files and a
command.shthat reproduces the operation using thebbCLI binary, useful for isolating failures outside of the TypeScript stack.