refactor: replace NAPI with IPC for world state, AVM, and contracts DB #21331

Closed: charlielye wants to merge 82 commits into next from cl/wsdb_cdb

charlielye commented Mar 10, 2026

Contributor

Summary

Replaces the NAPI (Node.js native addon) bindings to the C++ world state database and AVM simulator with standalone IPC processes communicating over Unix domain sockets using the existing msgpack protocol. This decouples C++ subsystems from the Node.js runtime, enabling independent process lifecycle management, better fault isolation, and simpler deployment.

Architecture changes

  • World state (WSDB): The NativeWorldState NAPI class is replaced by aztec-wsdb, a standalone C++ binary that runs the world state database as an IPC server. The TypeScript side communicates via WsdbBackend (UDS transport) and IpcWorldState (msgpack protocol adapter implementing NativeWorldStateInstance).

  • AVM simulator: The AVM NAPI module is replaced by aztec-avm, a standalone C++ binary. An AvmSimulatorPool manages multiple AVM process instances for parallel simulation, with fork-ID routing so each simulation operates on its own world state fork.

  • Contracts database (CDB): A new CdbIpcServer runs in the TypeScript process, serving contract data to the C++ AVM over UDS. The C++ side has a generated IPC client (cdb_ipc_client_generated) that queries bytecode, class IDs, and other contract metadata during simulation.

  • IPC transport: Adds ipc::IpcServer / ipc::IpcClient abstractions over Unix domain sockets with length-prefixed msgpack framing. Also adds an optional MPSC shared-memory transport (mpsc_shm_client / mpsc_shm_server) for lower-latency communication.
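
To illustrate the length-prefixed framing mentioned in the IPC transport bullet, here is a minimal sketch. It assumes a 4-byte little-endian length header, which this PR does not specify, and treats the msgpack payload as opaque bytes. `encodeFrame` and `FrameDecoder` are hypothetical names, not identifiers from the codebase.

```typescript
// Assumed frame layout: [4-byte little-endian length][payload bytes].
// The real prefix width and byte order are not stated in this PR.
const HEADER_BYTES = 4;

/** Prepends the payload length so the receiver knows where the message ends. */
function encodeFrame(payload: Uint8Array): Uint8Array {
  const frame = new Uint8Array(HEADER_BYTES + payload.length);
  new DataView(frame.buffer).setUint32(0, payload.length, true);
  frame.set(payload, HEADER_BYTES);
  return frame;
}

/** Accumulates stream chunks and returns every payload that is complete so far. */
class FrameDecoder {
  private pending = new Uint8Array(0);

  push(chunk: Uint8Array): Uint8Array[] {
    // Append the new chunk to whatever was left over from earlier reads.
    const merged = new Uint8Array(this.pending.length + chunk.length);
    merged.set(this.pending, 0);
    merged.set(chunk, this.pending.length);
    this.pending = merged;

    const messages: Uint8Array[] = [];
    for (;;) {
      if (this.pending.length < HEADER_BYTES) break; // header not complete yet
      const view = new DataView(this.pending.buffer, this.pending.byteOffset, this.pending.byteLength);
      const end = HEADER_BYTES + view.getUint32(0, true);
      if (this.pending.length < end) break; // payload not complete yet
      messages.push(this.pending.slice(HEADER_BYTES, end));
      this.pending = this.pending.slice(end);
    }
    return messages;
  }
}
```

A stream socket can deliver a message in pieces or several messages in one read, so the decoder buffers bytes until a whole frame is available.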

Key new components

| Component | Language | Role |
|---|---|---|
| barretenberg/cpp/src/barretenberg/wsdb/ | C++ | Standalone WSDB IPC server binary |
| barretenberg/cpp/src/barretenberg/avm/ | C++ | Standalone AVM IPC server binary with WSDB + CDB client |
| barretenberg/cpp/src/barretenberg/cdb/ | C++ | CDB IPC client (generated) for C++ AVM to query contracts |
| barretenberg/cpp/src/barretenberg/ipc/ | C++ | Generic IPC transport (UDS + shared memory) |
| barretenberg/ts/src/aztec-wsdb/ | TS | WsdbBackend: spawns and communicates with aztec-wsdb |
| yarn-project/world-state/.../ipc_world_state_instance.ts | TS | IpcWorldState: msgpack protocol adapter for WSDB |
| yarn-project/simulator/.../avm_backend.ts | TS | AvmBackend: spawns and communicates with aztec-avm |
| yarn-project/simulator/.../avm_simulator_pool.ts | TS | Pool of AVM backends for parallel simulation |
| yarn-project/simulator/.../cdb_ipc_server.ts | TS | CDB server: serves contract data to C++ AVM over UDS |

Behavioral changes

  • SimulationHandle: PublicTxSimulator.simulate() now returns { result: Promise<PublicTxResult>, cancel: () => void } instead of a plain Promise, enabling per-simulation cancellation via SIGUSR1 to the AVM process.

  • Genesis data: GenesisData (including genesisTimestamp and prefilledPublicData) is threaded through all world state construction paths — new(), tmp(), fromIpc(), and the factory — and passed to the C++ binary via --genesis-timestamp and --prefilled-public-data CLI flags so both sides compute matching initial header hashes.

  • Fork lifecycle: World state forks are created/deleted via IPC. The AVM process receives a forkId and uses it to route WSDB queries to the correct fork. CDB queries are routed by fork ID to per-fork PublicContractsDB instances registered on the CDB server.

  • Resource cleanup: All IPC child processes (WSDB, AVM) are properly destroyed on close, with parent-death monitoring (PR_SET_PDEATHSIG on Linux, kqueue on macOS) as a safety net.
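
The SimulationHandle change above can be sketched as follows. This is an illustration under assumptions, not the code in this branch: `KillableProcess` and `makeSimulationHandle` are hypothetical names, and the process is reduced to a `kill(signal)` method so the snippet needs no Node.js types. Only the `{ result, cancel }` shape and the use of SIGUSR1 come from the description.

```typescript
// Shape described in the PR: simulate() returns a handle rather than a bare Promise.
interface SimulationHandle<T> {
  result: Promise<T>;
  cancel: () => void;
}

// Minimal stand-in for a spawned AVM process (hypothetical interface).
interface KillableProcess {
  kill(signal: string): boolean;
}

// Hypothetical helper: pairs an in-flight request with a cancel hook.
function makeSimulationHandle<T>(proc: KillableProcess, request: Promise<T>): SimulationHandle<T> {
  let settled = false;
  const markSettled = (): void => {
    settled = true;
  };
  // Track completion so a late cancel() does not signal a process that has already finished this request.
  request.then(markSettled, markSettled);
  return {
    result: request,
    cancel: () => {
      if (!settled) {
        proc.kill("SIGUSR1");
      }
    },
  };
}
```

Under this shape, a caller that previously awaited the return value of simulate() would await `handle.result` instead, and keep `handle.cancel` around for aborting a run it no longer needs.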

What's removed

  • NativeWorldState NAPI class and its queue-based request dispatch
  • NAPI-based AVM simulator module (nodejs_module/avm_simulate/)
  • NAPI-based world state module (nodejs_module/world_state/)
  • getHandle() on NativeWorldStateInstance (NAPI external handle)
  • Direct C++ ↔ Node.js memory sharing for contract data

Testing

  • All existing world state tests pass against the IPC backend
  • AVM proving tests updated to manage IPC resource lifecycle
  • Map size validation added at the TypeScript layer (was previously in NAPI)
  • E2E tests pass with the IPC architecture

# Conflicts:
#	barretenberg/cpp/src/barretenberg/nodejs_module/world_state/world_state.cpp
#	yarn-project/aztec-node/src/aztec-node/server.ts
#	yarn-project/prover-node/src/prover-node.test.ts
#	yarn-project/simulator/src/public/public_tx_simulator/public_tx_simulator.test.ts
#	yarn-project/world-state/src/native/native_world_state.ts
#	yarn-project/world-state/src/native/native_world_state_instance.ts
#	yarn-project/world-state/src/synchronizer/factory.ts

The merge of origin/next introduced GenesisData (replacing prefilledPublicData)
but the IPC code paths still referenced the old variable name. Fix:
- Use genesis.prefilledPublicData instead of bare prefilledPublicData
- Pass genesis arg to NativeWorldStateService constructor in correct position
- Fix options.prefilledPublicData -> options.genesis?.prefilledPublicData in server
- Fix .toBuffer() call on already-Buffer blockHeaderHash in IPC instance

The C++ WorldState computes an initial block header using genesis_timestamp,
and the TS side does the same in buildInitialHeader(). Without passing the
timestamp to the wsdb binary, the C++ side defaults to 0 while TS uses the
actual genesis timestamp, causing a hash mismatch and assertion failure
during world state initialization.

Add --genesis-timestamp CLI flag to aztec-wsdb and thread it through
WsdbOptions, WsdbBackend, and all call sites in native_world_state.ts
and server.ts.

The fromIpc factory was hardcoding EMPTY_GENESIS_DATA, causing the TS-side
buildInitialHeader to use timestamp=0 while the C++ wsdb binary had the
real genesis timestamp. This caused an archive tree hash mismatch.

Add genesis parameter to fromIpc and forward it from createWorldState.

The merge from next brought a test that calls simulator.simulate(tx) as a
plain Promise, but this branch returns SimulationHandle { result, cancel }.

The NAPI path validated map sizes internally, but the IPC path didn't,
allowing zero/negative values to either silently succeed or cause a
socket timeout. Add early validation in NativeWorldStateService.new().
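
A minimal sketch of the kind of early map size check described above, assuming the sizes arrive as a plain name-to-number record. `MapSizes` and `assertValidMapSizes` are hypothetical names, and the integer check is an extra assumption beyond the zero/negative cases the commit mentions.

```typescript
// Hypothetical config shape: tree name -> map size. The real option names may differ.
type MapSizes = Record<string, number>;

/** Throws if any size is zero, negative, or not an integer, so bad values never reach the IPC layer. */
function assertValidMapSizes(sizes: MapSizes): void {
  for (const name of Object.keys(sizes)) {
    const size = sizes[name];
    if (size === undefined || !Number.isInteger(size) || size <= 0) {
      throw new Error(`Invalid map size for ${name}: ${String(size)} (must be a positive integer)`);
    }
  }
}
```

Failing fast on the TypeScript side turns what would otherwise be a silent success or a socket timeout into an immediate, descriptive error.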
… hang

The test vectors describe block was missing tester.close(), leaving the
AvmBackend child process and CdbIpcServer socket alive, which prevented
Jest from exiting and caused a CI timeout.
charlielye changed the title from "refactor: C++ subsystems as separate processes [do not merge]." to "refactor: replace NAPI with IPC for world state and AVM simulator [do not merge]" Apr 21, 2026
charlielye changed the title from "refactor: replace NAPI with IPC for world state and AVM simulator [do not merge]" to "refactor: replace NAPI with IPC for world state, AVM, and contracts DB [do not merge]" Apr 21, 2026
charlielye changed the title from "refactor: replace NAPI with IPC for world state, AVM, and contracts DB [do not merge]" to "refactor: replace NAPI with IPC for world state, AVM, and contracts DB" Apr 21, 2026
charlielye removed the request for review from a team April 24, 2026 07:17
# Conflicts:
#	yarn-project/aztec-node/src/aztec-node/server.ts

The header moved from vm2/common/ to aztec/ on next.
…ctor args

- custom_bc.test.ts: useCppSimulator boolean replaced by PublicSimulatorConfig
  after upstream refactor. IPC branch always uses C++ simulator, so remove
  the parameterized describe and use default config.
- server.ts: remove avmPool/cdbServer from constructor call (they're set via
  properties, not constructor params). Fixes TS2554 (28 args, expected 20-26).

# Conflicts:
#	yarn-project/aztec-node/src/aztec-node/server.ts
#	yarn-project/bb-prover/src/bb/execute.ts
#	yarn-project/prover-node/src/actions/rerun-epoch-proving-job.ts
#	yarn-project/txe/src/oracle/txe_oracle_top_level_context.ts
#	yarn-project/world-state/src/synchronizer/factory.ts
charlielye requested a review from MirandaWood as a code owner May 6, 2026 15:03
charlielye removed the request for review from MirandaWood May 7, 2026 11:23
charlielye added a commit that referenced this pull request May 12, 2026
## Summary

Adds the standalone `aztec-wsdb` binary plus all supporting code (C++
client library, TS spawner, IPC adapter) needed to move world state out
of the Node.js process. **This PR is inert: nothing yet uses the new
binary.** A follow-up PR (cl/ipc-3-avm-wsdb-cutover) will cut the NAPI
AVM and the TS world state over to use it.

## What's added

**C++:**
- `barretenberg/cpp/src/barretenberg/wsdb/`: `aztec-wsdb` standalone
binary that runs the world state DB as an IPC server. Same WorldState
surface as the in-process NAPI module, but exposed over msgpack via UDS
or shared memory.
- `barretenberg/cpp/src/barretenberg/wsdb_client/`: `WsdbIpcMerkleDB`
implements `LowLevelMerkleDBInterface` over WSDB IPC. The standalone
AVM (or NAPI AVM after cutover) will use this in place of an in-process
`WorldState` reference.
- `barretenberg/cpp/src/barretenberg/ipc/mpsc_shm_{client,server}.hpp`:
multi-producer single-consumer shared-memory transport. Lower latency
than UDS for the AVM↔WSDB hop.

**TypeScript (bb.js):**
- `barretenberg/ts/src/aztec-wsdb/`: `WsdbBackend` spawns the
`aztec-wsdb` binary and routes msgpack commands via the generated
`AsyncApi`. Implements `IMsgpackBackendAsync`.
- `barretenberg/ts/src/cbind/cpp_codegen.ts`: C++ codegen used by
`aztec-wsdb`'s `generate.ts` to produce
`wsdb_ipc_client_generated.{cpp,hpp}`. Small shared updates to
`schema_visitor` / `typescript_codegen` / `rust_codegen`.

**yarn-project:**
- `yarn-project/world-state/src/native/ipc_world_state_instance.ts`:
`IpcWorldState` implements `NativeWorldStateInstance` over WSDB IPC.
Not yet wired in.

## Why split this way

The full WSDB-out-of-process cutover involves rewiring the NAPI AVM
(which currently dereferences an in-process `WorldState*` pointer) to
talk to `aztec-wsdb` over IPC, plus replacing TS NAPI WorldState with
`IpcWorldState` everywhere it's used. This PR keeps the diff bounded
by landing the binary and supporting code first; the cutover lands
separately and should be a tiny diff.

## Verification

- `aztec-wsdb` builds: `cd barretenberg/cpp/build && ninja aztec-wsdb`
- `wsdb_client` static library builds: `ninja wsdb_client`
- bb.js builds (esm/cjs/browser): `cd barretenberg/ts && yarn build:esm && yarn build:cjs && yarn build:browser`
- `@aztec/world-state` typechecks clean (the only TS errors in the
build output are pre-existing on `next` in unrelated packages)

## Stack

This is part of a stack splitting up #21331. Plan:
`/mnt/user-data/charlie/.claude/plans/glittery-snuggling-horizon.md`.

- PR 2a: this PR — binary + supporting code (inert)
- PR 2b (next): NAPI AVM + TS world state cutover (~500 LOC, deletes
NAPI WorldState C++ module)
- PR 3: standalone `aztec-avm` + CDB IPC server, kills NAPI AVM
- PR 4-6: pool, cancellation, optional MPSC SHM transport
@AztecBot

Collaborator

This issue was automatically closed because it was referenced in PR #23469 which has been merged to the default branch.

@AztecBot AztecBot closed this May 22, 2026
danielntmd pushed a commit to danielntmd/aztec-packages that referenced this pull request Jun 4, 2026
…AztecProtocol#23469)

## Summary

`aztec start --local-network` reliably SIGBUSes a few blocks into a run
on macOS arm64 (since `v5.0.0-nightly.20260520`, i.e. after AztecProtocol#21625
shipped the `shared_ptr` use-after-free fix). This is a **different**
fault from the one AztecProtocol#21625 fixed: a stack-guard violation (stack
overflow) on a `nodejs_module.node` worker thread running AVM-simulation
code, not a use-after-free.

This pins an explicit, generous stack size on the
`ThreadedAsyncOperation` worker thread.

## Root cause

`ThreadedAsyncOperation::Queue()` (introduced in AztecProtocol#21138) runs the AVM
simulation (`_fn`) directly on a bare `std::thread(...).detach()`. A
`std::thread` uses the OS default stack for non-main threads, which is
**512 KB on macOS** versus **8 MB on Linux**. The AVM-simulation call
chain is deep enough to overflow 512 KB, so on macOS arm64 the worker
writes into its stack-guard page and the process aborts with:

```
EXC_BAD_ACCESS / SIGBUS, KERN_PROTECTION_FAILURE
"Could not determine thread index for stack guard region"
  #0 _platform_memmove
  #1.. nodejs_module.node  bb::nodejs (AVM simulation path)
```

Linux is unaffected because its 8 MB default is comfortably large. The
previous `AsyncOperation` path never hit this either: it ran on the
libuv threadpool, whose threads are sized from `RLIMIT_STACK` (8 MB soft
on macOS), not the 512 KB raw-thread default.

## Fix

`std::thread` can't set a stack size, so launch the worker via
`pthreads` with `pthread_attr_setstacksize` pinned to a generous
`WORKER_STACK_SIZE` (32 MB — 4× the 8 MB that the libuv path proved
sufficient, with headroom for deeper future call chains). Falls back to
a default-stack `std::thread` only if pthreads is unavailable (`_WIN32`)
or `pthread_create` fails.

The shared_ptr lifetime model from AztecProtocol#21625 is preserved exactly — both
the worker lambda and the `BlockingCall` completion callback still
capture `self`, so this does not reintroduce the use-after-free. Only
the thread-launch mechanism changed.

## Testing

- The full bb build is too heavy to run in this session, so this is
**not yet a local end-to-end repro/fix verification** — it relies on CI
for compilation and on a macOS arm64 `aztec start --local-network` run
to confirm the crash is gone.
- The pthread/`std::function` trampoline was compiled and run standalone
under `-std=c++20 -Wall -Wextra -Werror`: the worker thread receives a
32 MB stack (`pthread_get_stacksize_np` reports `33554432`), and the
work runs and completes.
- **Requested:** verify against tonight's nightly on macOS arm64 (M3) —
the reporter's exact repro.

## Notes for reviewers

- Targets `next` (not `merge-train/barretenberg`) to match AztecProtocol#21625's base
and to make the nightly, since this is an urgent release-affecting
crash. Happy to retarget if you'd prefer it go through the merge train.
- 32 MB is a deliberate over-provision; if you'd rather mirror the libuv
path precisely we could instead size from `getrlimit(RLIMIT_STACK)`. The
fixed constant is simpler and the virtual reservation only commits pages
as touched.
- The longer-term fix is the NAPI→IPC migration (AztecProtocol#21331 / AztecProtocol#23196 /
AztecProtocol#23238), which removes this in-process worker entirely. This is a
targeted stop-gap for the shipping NAPI path.

Related: AztecProtocol#21138 (introduced the threaded model), AztecProtocol#21625 (use-after-free
fix), AztecProtocol#21629 (open alternative).

---
*Created by
[claudebox](https://claudebox.work/v2/sessions/4bd36dc505c20254) ·
group: `slackbot`*