refactor: replace NAPI with IPC for world state, AVM, and contracts DB #21331
Closed
charlielye wants to merge 82 commits into
75b542a to
524eeb2
# Conflicts:
#   barretenberg/cpp/src/barretenberg/nodejs_module/world_state/world_state.cpp
#   yarn-project/aztec-node/src/aztec-node/server.ts
#   yarn-project/prover-node/src/prover-node.test.ts
#   yarn-project/simulator/src/public/public_tx_simulator/public_tx_simulator.test.ts
#   yarn-project/world-state/src/native/native_world_state.ts
#   yarn-project/world-state/src/native/native_world_state_instance.ts
#   yarn-project/world-state/src/synchronizer/factory.ts
The merge of origin/next introduced GenesisData (replacing prefilledPublicData) but the IPC code paths still referenced the old variable name.
Fix:
- Use genesis.prefilledPublicData instead of bare prefilledPublicData
- Pass genesis arg to NativeWorldStateService constructor in correct position
- Fix options.prefilledPublicData -> options.genesis?.prefilledPublicData in server
- Fix .toBuffer() call on already-Buffer blockHeaderHash in IPC instance
The C++ WorldState computes an initial block header using genesis_timestamp, and the TS side does the same in buildInitialHeader(). Without passing the timestamp to the wsdb binary, the C++ side defaults to 0 while TS uses the actual genesis timestamp, causing a hash mismatch and assertion failure during world state initialization. Add --genesis-timestamp CLI flag to aztec-wsdb and thread it through WsdbOptions, WsdbBackend, and all call sites in native_world_state.ts and server.ts.
The fromIpc factory was hardcoding EMPTY_GENESIS_DATA, causing the TS-side buildInitialHeader to use timestamp=0 while the C++ wsdb binary had the real genesis timestamp. This caused an archive tree hash mismatch. Add genesis parameter to fromIpc and forward it from createWorldState.
The merge from next brought a test that calls simulator.simulate(tx) as a
plain Promise, but this branch returns SimulationHandle { result, cancel }.
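To illustrate the shape change this commit message refers to, here is a minimal sketch of a `{ result, cancel }` handle. Only the `SimulationHandle` name and its two fields come from the PR; `startSimulation`, `sendCancelSignal`, `SimulationCancelledError`, and `simulateAsPromise` are hypothetical names introduced for illustration, and the real simulator may be structured quite differently.

```typescript
// Shape named in the PR: a promise plus a way to cancel it.
interface SimulationHandle<T> {
  result: Promise<T>;
  cancel: () => void;
}

// Hypothetical error type used when a simulation is cancelled.
class SimulationCancelledError extends Error {
  constructor() {
    super("simulation cancelled");
    this.name = "SimulationCancelledError";
  }
}

// Hypothetical helper: runs `work` and returns a handle.
// `sendCancelSignal` stands in for whatever notifies the worker process.
function startSimulation<T>(work: () => Promise<T>, sendCancelSignal: () => void): SimulationHandle<T> {
  let cancelled = false;
  let rejectCancelled: (err: Error) => void = () => {};
  const cancellation = new Promise<never>((_resolve, reject) => {
    rejectCancelled = reject;
  });
  // Whichever settles first wins: the work itself or the cancellation.
  const result = Promise.race([work(), cancellation]);
  const cancel = () => {
    if (cancelled) return; // cancel is idempotent
    cancelled = true;
    sendCancelSignal();
    rejectCancelled(new SimulationCancelledError());
  };
  return { result, cancel };
}

// A test written against the old plain-Promise API only needs `.result`.
function simulateAsPromise<T>(handle: SimulationHandle<T>): Promise<T> {
  return handle.result;
}
```

Under this sketch, a test that previously awaited `simulator.simulate(tx)` directly would await the `result` field of the returned handle instead.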
The NAPI path validated map sizes internally, but the IPC path didn't, allowing zero/negative values to either silently succeed or cause a socket timeout. Add early validation in NativeWorldStateService.new().
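A minimal sketch of the kind of early check described above, assuming map sizes arrive as a plain record of numbers. The function name `assertValidMapSizes` and the key names in the usage note are hypothetical and not taken from the PR.

```typescript
// Hypothetical shape: one numeric size per tree/database.
type MapSizes = Record<string, number>;

// Reject zero, negative, fractional, or NaN sizes before anything is sent
// to the child process, so the caller gets an immediate, descriptive error
// rather than a silent success or a socket timeout.
function assertValidMapSizes(sizes: MapSizes): void {
  for (const name of Object.keys(sizes)) {
    const size = sizes[name];
    if (size === undefined || !Number.isInteger(size) || size <= 0) {
      throw new Error(`Invalid map size for ${name}: ${size} (expected a positive integer)`);
    }
  }
}
```

For example, `assertValidMapSizes({ archiveTreeMapSizeKb: 1024 })` passes, while a value of `0` or `-1` throws before any IPC request is made.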
… hang
The test vectors describe block was missing tester.close(), leaving the AvmBackend child process and CdbIpcServer socket alive, which prevented Jest from exiting and caused a CI timeout.
# Conflicts:
#   yarn-project/aztec-node/src/aztec-node/server.ts
The header moved from vm2/common/ to aztec/ on next.
…ctor args
- custom_bc.test.ts: useCppSimulator boolean replaced by PublicSimulatorConfig after upstream refactor. IPC branch always uses C++ simulator, so remove the parameterized describe and use default config.
- server.ts: remove avmPool/cdbServer from constructor call (they're set via properties, not constructor params). Fixes TS2554 (28 args, expected 20-26).
# Conflicts:
#   yarn-project/aztec-node/src/aztec-node/server.ts
#   yarn-project/bb-prover/src/bb/execute.ts
#   yarn-project/prover-node/src/actions/rerun-epoch-proving-job.ts
#   yarn-project/txe/src/oracle/txe_oracle_top_level_context.ts
#   yarn-project/world-state/src/synchronizer/factory.ts
charlielye added a commit that referenced this pull request May 12, 2026
## Summary
Adds the standalone `aztec-wsdb` binary plus all supporting code (C++
client library, TS spawner, IPC adapter) needed to move world state out
of the Node.js process. **This PR is inert: nothing yet uses the new
binary.** A follow-up PR (cl/ipc-3-avm-wsdb-cutover) will cut the NAPI
AVM and the TS world state over to use it.
## What's added
**C++:**
- `barretenberg/cpp/src/barretenberg/wsdb/`: `aztec-wsdb` standalone
binary that runs the world state DB as an IPC server. Same WorldState
surface as the in-process NAPI module, but exposed over msgpack via UDS
or shared memory.
- `barretenberg/cpp/src/barretenberg/wsdb_client/`: `WsdbIpcMerkleDB`
implements `LowLevelMerkleDBInterface` over WSDB IPC. The standalone
AVM (or NAPI AVM after cutover) will use this in place of an in-process
`WorldState` reference.
- `barretenberg/cpp/src/barretenberg/ipc/mpsc_shm_{client,server}.hpp`:
multi-producer single-consumer shared-memory transport. Lower latency
than UDS for the AVM↔WSDB hop.
**TypeScript (bb.js):**
- `barretenberg/ts/src/aztec-wsdb/`: `WsdbBackend` spawns the
`aztec-wsdb` binary and routes msgpack commands via the generated
`AsyncApi`. Implements `IMsgpackBackendAsync`.
- `barretenberg/ts/src/cbind/cpp_codegen.ts`: C++ codegen used by
`aztec-wsdb`'s `generate.ts` to produce
`wsdb_ipc_client_generated.{cpp,hpp}`. Small shared updates to
`schema_visitor` / `typescript_codegen` / `rust_codegen`.
**yarn-project:**
- `yarn-project/world-state/src/native/ipc_world_state_instance.ts`:
`IpcWorldState` implements `NativeWorldStateInstance` over WSDB IPC.
Not yet wired in.
## Why split this way
The full WSDB-out-of-process cutover involves rewiring the NAPI AVM
(which currently dereferences an in-process `WorldState*` pointer) to
talk to `aztec-wsdb` over IPC, plus replacing TS NAPI WorldState with
`IpcWorldState` everywhere it's used. This PR keeps the diff bounded
by landing the binary and supporting code first; the cutover lands
separately and should be a tiny diff.
## Verification
- `aztec-wsdb` builds: `cd barretenberg/cpp/build && ninja aztec-wsdb`
- `wsdb_client` static library builds: `ninja wsdb_client`
- bb.js builds (esm/cjs/browser): `cd barretenberg/ts && yarn build:esm
&& yarn build:cjs && yarn build:browser`
- `@aztec/world-state` typechecks clean (the only TS errors in the
build output are pre-existing on `next` in unrelated packages)
## Stack
This is part of a stack splitting up #21331. Plan:
`/mnt/user-data/charlie/.claude/plans/glittery-snuggling-horizon.md`.
- PR 2a: this PR, binary + supporting code (inert)
- PR 2b (next): NAPI AVM + TS world state cutover (~500 LOC, deletes
NAPI WorldState C++ module)
- PR 3: standalone `aztec-avm` + CDB IPC server, kills NAPI AVM
- PR 4-6: pool, cancellation, optional MPSC SHM transport
This issue was automatically closed because it was referenced in PR #23469, which has been merged to the default branch.
danielntmd pushed a commit to danielntmd/aztec-packages that referenced this pull request Jun 4, 2026
…AztecProtocol#23469)

## Summary
`aztec start --local-network` reliably SIGBUSes a few blocks into a run on macOS arm64 (since `v5.0.0-nightly.20260520`, i.e. after AztecProtocol#21625 shipped the `shared_ptr` use-after-free fix). This is a **different** fault from the one AztecProtocol#21625 fixed: a stack-guard violation (stack overflow) on a `nodejs_module.node` worker thread running AVM-simulation code, not a use-after-free. This pins an explicit, generous stack size on the `ThreadedAsyncOperation` worker thread.

## Root cause
`ThreadedAsyncOperation::Queue()` (introduced in AztecProtocol#21138) runs the AVM simulation (`_fn`) directly on a bare `std::thread(...).detach()`. A `std::thread` uses the OS default stack for non-main threads, which is **512 KB on macOS** versus **8 MB on Linux**. The AVM-simulation call chain is deep enough to overflow 512 KB, so on macOS arm64 the worker writes into its stack-guard page and the process aborts with:

```
EXC_BAD_ACCESS / SIGBUS, KERN_PROTECTION_FAILURE
"Could not determine thread index for stack guard region"
#0   _platform_memmove
#1.. nodejs_module.node  bb::nodejs (AVM simulation path)
```

Linux is unaffected because its 8 MB default is comfortably large. The previous `AsyncOperation` path never hit this either: it ran on the libuv threadpool, whose threads are sized from `RLIMIT_STACK` (8 MB soft on macOS), not the 512 KB raw-thread default.

## Fix
`std::thread` can't set a stack size, so launch the worker via `pthreads` with `pthread_attr_setstacksize` pinned to a generous `WORKER_STACK_SIZE` (32 MB, which is 4× the 8 MB that the libuv path proved sufficient, with headroom for deeper future call chains). Falls back to a default-stack `std::thread` only if pthreads is unavailable (`_WIN32`) or `pthread_create` fails.

The shared_ptr lifetime model from AztecProtocol#21625 is preserved exactly: both the worker lambda and the `BlockingCall` completion callback still capture `self`, so this does not reintroduce the use-after-free. Only the thread-launch mechanism changed.

## Testing
- The full bb build is too heavy to run in this session, so this is **not yet a local end-to-end repro/fix verification**; it relies on CI for compilation and on a macOS arm64 `aztec start --local-network` run to confirm the crash is gone.
- The pthread/`std::function` trampoline was compiled and run standalone under `-std=c++20 -Wall -Wextra -Werror`: the worker thread receives a 32 MB stack (`pthread_get_stacksize_np` reports `33554432`), and the work runs and completes.
- **Requested:** verify against tonight's nightly on macOS arm64 (M3), the reporter's exact repro.

## Notes for reviewers
- Targets `next` (not `merge-train/barretenberg`) to match AztecProtocol#21625's base and to make the nightly, since this is an urgent release-affecting crash. Happy to retarget if you'd prefer it go through the merge train.
- 32 MB is a deliberate over-provision; if you'd rather mirror the libuv path precisely we could instead size from `getrlimit(RLIMIT_STACK)`. The fixed constant is simpler and the virtual reservation only commits pages as touched.
- The longer-term fix is the NAPI→IPC migration (AztecProtocol#21331 / AztecProtocol#23196 / AztecProtocol#23238), which removes this in-process worker entirely. This is a targeted stop-gap for the shipping NAPI path.

Related: AztecProtocol#21138 (introduced the threaded model), AztecProtocol#21625 (use-after-free fix), AztecProtocol#21629 (open alternative).

---
*Created by [claudebox](https://claudebox.work/v2/sessions/4bd36dc505c20254) · group: `slackbot`*
Summary
Replaces the NAPI (Node.js native addon) bindings to the C++ world state database and AVM simulator with standalone IPC processes communicating over Unix domain sockets using the existing msgpack protocol. This decouples C++ subsystems from the Node.js runtime, enabling independent process lifecycle management, better fault isolation, and simpler deployment.
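The IPC transport is described further down as length-prefixed msgpack framing over Unix domain sockets. As a rough sketch of what length-prefixed framing over a byte stream involves, the code below assumes a 4-byte little-endian length header (the PR does not state the header width or byte order) and treats the msgpack payload as opaque bytes. `encodeFrame` and `FrameDecoder` are hypothetical names, not the PR's `ipc::IpcServer` / `ipc::IpcClient` API.

```typescript
// Assumption: 4-byte little-endian length prefix. The real wire format may differ.
const HEADER_BYTES = 4;

// Prepend the payload length so the receiver knows where the message ends.
function encodeFrame(payload: Uint8Array): Uint8Array {
  const frame = new Uint8Array(HEADER_BYTES + payload.length);
  new DataView(frame.buffer).setUint32(0, payload.length, true);
  frame.set(payload, HEADER_BYTES);
  return frame;
}

// A stream socket delivers arbitrary chunks, so the decoder buffers bytes
// until at least one complete frame is available.
class FrameDecoder {
  private buffered = new Uint8Array(0);

  // Feed one chunk; returns every payload completed by this chunk.
  push(chunk: Uint8Array): Uint8Array[] {
    const merged = new Uint8Array(this.buffered.length + chunk.length);
    merged.set(this.buffered, 0);
    merged.set(chunk, this.buffered.length);

    const frames: Uint8Array[] = [];
    let offset = 0;
    while (merged.length - offset >= HEADER_BYTES) {
      const view = new DataView(merged.buffer, merged.byteOffset + offset, HEADER_BYTES);
      const length = view.getUint32(0, true);
      if (merged.length - offset - HEADER_BYTES < length) break; // payload incomplete
      const start = offset + HEADER_BYTES;
      frames.push(merged.slice(start, start + length));
      offset = start + length;
    }
    this.buffered = merged.slice(offset); // keep the partial tail for the next chunk
    return frames;
  }
}
```

In this sketch a sender would write `encodeFrame(bytes)` to the socket, and the receiver would pass each incoming chunk to `push()` and hand every returned payload to its msgpack decoder.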
Architecture changes
World state (WSDB): The `NativeWorldState` NAPI class is replaced by `aztec-wsdb`, a standalone C++ binary that runs the world state database as an IPC server. The TypeScript side communicates via `WsdbBackend` (UDS transport) and `IpcWorldState` (msgpack protocol adapter implementing `NativeWorldStateInstance`).
AVM simulator: The AVM NAPI module is replaced by `aztec-avm`, a standalone C++ binary. An `AvmSimulatorPool` manages multiple AVM process instances for parallel simulation, with fork-ID routing so each simulation operates on its own world state fork.
Contracts database (CDB): A new `CdbIpcServer` runs in the TypeScript process, serving contract data to the C++ AVM over UDS. The C++ side has a generated IPC client (`cdb_ipc_client_generated`) that queries bytecode, class IDs, and other contract metadata during simulation.
IPC transport: Adds `ipc::IpcServer` / `ipc::IpcClient` abstractions over Unix domain sockets with length-prefixed msgpack framing. Also adds an optional MPSC shared-memory transport (`mpsc_shm_client` / `mpsc_shm_server`) for lower-latency communication.
Key new components
- `barretenberg/cpp/src/barretenberg/wsdb/`
- `barretenberg/cpp/src/barretenberg/avm/`
- `barretenberg/cpp/src/barretenberg/cdb/`
- `barretenberg/cpp/src/barretenberg/ipc/`
- `barretenberg/ts/src/aztec-wsdb/`
- `yarn-project/world-state/.../ipc_world_state_instance.ts`
- `yarn-project/simulator/.../avm_backend.ts`
- `yarn-project/simulator/.../avm_simulator_pool.ts`
- `yarn-project/simulator/.../cdb_ipc_server.ts`
Behavioral changes
SimulationHandle: `PublicTxSimulator.simulate()` now returns `{ result: Promise<PublicTxResult>, cancel: () => void }` instead of a plain Promise, enabling per-simulation cancellation via SIGUSR1 to the AVM process.
Genesis data: `GenesisData` (including `genesisTimestamp` and `prefilledPublicData`) is threaded through all world state construction paths (`new()`, `tmp()`, `fromIpc()`, and the factory) and passed to the C++ binary via `--genesis-timestamp` and `--prefilled-public-data` CLI flags so both sides compute matching initial header hashes.
Fork lifecycle: World state forks are created/deleted via IPC. The AVM process receives a `forkId` and uses it to route WSDB queries to the correct fork. CDB queries are routed by fork ID to per-fork `PublicContractsDB` instances registered on the CDB server.
Resource cleanup: All IPC child processes (WSDB, AVM) are properly destroyed on close, with parent-death monitoring (`PR_SET_PDEATHSIG` on Linux, `kqueue` on macOS) as a safety net.
What's removed
- `NativeWorldState` NAPI class and its queue-based request dispatch
- … (`nodejs_module/avm_simulate/`)
- … (`nodejs_module/world_state/`)
- `getHandle()` on `NativeWorldStateInstance` (NAPI external handle)
Testing