feat: run AVM NAPI simulations on dedicated threads instead of libuv pool#21138

Merged
ludamad merged 4 commits into merge-train/avm from claudebox/avm-napi-async-threads on Mar 5, 2026

Conversation

@AztecBot

@AztecBot AztecBot commented Mar 4, 2026

Collaborator

Summary

AVM simulations previously used Napi::AsyncWorker which runs on the libuv thread pool. During simulation, C++ calls back to JS for contract data via BlockingCall, and those JS callbacks do async I/O (LMDB, WorldState) that also needs libuv threads. With all libuv threads occupied by sims, the callback I/O deadlocks.

This PR introduces ThreadedAsyncOperation which spawns a dedicated std::thread per simulation and signals completion back to the JS event loop via ThreadSafeFunction. This structurally prevents the deadlock — libuv threads are always available for callback I/O.
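
The pattern described above can be sketched in standard C++ without any Node-API dependency. This is a minimal illustration, not the PR's `ThreadedAsyncOperation`: `MainLoop` is a hypothetical stand-in for the JS event loop plus `ThreadSafeFunction`, and `spawn_on_dedicated_thread` is an invented name. The sketch returns the thread so a caller can join it, which is a simplification made here for easy testing.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical stand-in for the JS event loop: a queue of completion
// callbacks that is drained on a single "main" thread.
class MainLoop {
  public:
    // Called from worker threads: enqueue a callback for the main thread.
    void post(std::function<void()> cb)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push(std::move(cb));
        cv_.notify_one();
    }

    // Called on the main thread: wait for and run exactly `count` callbacks.
    void run(int count)
    {
        for (int i = 0; i < count; ++i) {
            std::function<void()> cb;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return !queue_.empty(); });
                cb = std::move(queue_.front());
                queue_.pop();
            }
            cb(); // runs outside the lock, on the main thread
        }
    }

  private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> queue_;
};

// Run `work` on its own thread (not a shared pool) and hand the result to
// `on_done` on the main loop. The caller must keep `loop` alive until the
// returned thread has been joined.
inline std::thread spawn_on_dedicated_thread(MainLoop& loop,
                                             std::function<int()> work,
                                             std::function<void(int)> on_done)
{
    return std::thread([&loop, work = std::move(work), on_done = std::move(on_done)]() mutable {
        int result = work();
        loop.post([on_done = std::move(on_done), result] { on_done(result); });
    });
}
```

Because each operation owns its thread, a shared pool of fixed size is never consumed by long-running work, which is the property the description relies on to rule out the deadlock.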

Changes

  • async_op.hpp: New ThreadedAsyncOperation class — spawns std::thread, uses TSFN for completion, self-destructs after resolving/rejecting the promise
  • avm_simulate_napi.cpp: Both simulate and simulateWithHintedDbs now use ThreadedAsyncOperation instead of AsyncOperation
  • native_module.ts: Deadlock-prevention semaphore removed. Optional resource-limit semaphore available via MAX_CONCURRENT_AVM_SIMULATIONS env var (default: unlimited)

Why this helps multi-sequencer tests

Previously, running multiple sequencers in one process required careful tuning of UV_THREADPOOL_SIZE and concurrent sim limits to avoid deadlock. With this change, AVM sims don't touch the libuv pool at all, so any number of concurrent simulations can run without starving each other's callbacks.

ClaudeBox log: http://ci.aztec-labs.com/884501541464780d-1

…pool

AVM simulations previously used Napi::AsyncWorker which runs on the libuv
thread pool. During simulation, C++ calls back to JS for contract data via
BlockingCall, and those JS callbacks do async I/O that also needs libuv
threads. With all libuv threads occupied by sims, the callback I/O deadlocks.

Introduces ThreadedAsyncOperation which spawns a dedicated std::thread per
simulation and signals completion back via ThreadSafeFunction. This
structurally prevents the deadlock — libuv threads are always available for
callback I/O. The TS-side concurrency semaphore is now optional (off by
default, configurable via MAX_CONCURRENT_AVM_SIMULATIONS env var).
@AztecBot AztecBot added the claudebox Owned by claudebox. it can push to this PR. label Mar 4, 2026
@ludamad ludamad added the ci-draft Run CI on draft PRs. label Mar 4, 2026
AztecBot added 2 commits March 4, 2026 20:30
With simulations on dedicated std::threads, the libuv deadlock is
structurally impossible. Remove the Semaphore entirely — the native
functions already return promises and need no TS-side gating.
Clean configurable concurrency limit for AVM simulations. Each sim
spawns a dedicated OS thread, so this controls resource usage rather
than preventing deadlocks. Set to 0 for unlimited.
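
A resource limit of this kind could look roughly like the following. The PR implements it on the TypeScript side in `native_module.ts`; this C++ sketch only illustrates the "0 means unlimited" semantics, and `read_limit` and `ConcurrencyLimiter` are hypothetical names.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdlib>
#include <mutex>

// Read a non-negative limit from an environment variable.
// Unset, empty, or malformed values are treated as 0, meaning "unlimited".
inline unsigned long read_limit(const char* name)
{
    const char* raw = std::getenv(name);
    if (raw == nullptr || *raw == '\0') {
        return 0;
    }
    char* end = nullptr;
    unsigned long value = std::strtoul(raw, &end, 10);
    return (*end == '\0') ? value : 0;
}

// Counting limiter: at most `limit` holders at once; a limit of 0 disables it.
class ConcurrencyLimiter {
  public:
    explicit ConcurrencyLimiter(unsigned long limit) : limit_(limit) {}

    // Block until a slot is free (returns immediately when unlimited).
    void acquire()
    {
        if (limit_ == 0) return;
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return in_flight_ < limit_; });
        ++in_flight_;
    }

    // Non-blocking variant: returns false if no slot is free.
    bool try_acquire()
    {
        if (limit_ == 0) return true;
        std::lock_guard<std::mutex> lock(mutex_);
        if (in_flight_ >= limit_) return false;
        ++in_flight_;
        return true;
    }

    // Give a slot back and wake one waiter.
    void release()
    {
        if (limit_ == 0) return;
        std::lock_guard<std::mutex> lock(mutex_);
        --in_flight_;
        cv_.notify_one();
    }

  private:
    const unsigned long limit_;
    unsigned long in_flight_ = 0;
    std::mutex mutex_;
    std::condition_variable cv_;
};
```

With dedicated threads the limiter bounds how many OS threads exist at once; it no longer has any role in correctness.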
@ludamad ludamad changed the base branch from next to merge-train/avm March 4, 2026 21:50
@ludamad ludamad marked this pull request as ready for review March 4, 2026 21:51
@ludamad ludamad added ci-merge-queue and removed ci-draft Run CI on draft PRs. claudebox Owned by claudebox. it can push to this PR. labels Mar 4, 2026
@ludamad ludamad enabled auto-merge (squash) March 5, 2026 03:02
@ludamad ludamad merged commit dd3974c into merge-train/avm Mar 5, 2026
28 of 32 checks passed
@ludamad ludamad deleted the claudebox/avm-napi-async-threads branch March 5, 2026 03:41
@AztecBot AztecBot mentioned this pull request Mar 5, 2026
AztecBot added a commit that referenced this pull request Mar 5, 2026
…pool (#21138)

@AztecBot

AztecBot commented Mar 5, 2026

Collaborator Author

✅ Successfully backported to backport-to-v4-staging #21064.

alexghr added a commit that referenced this pull request Mar 5, 2026
BEGIN_COMMIT_OVERRIDE
chore: chonk proof compression poc (#20645)
feat: Update L1 to L2 message APIs (#20913)
fix: adapt chonk proof compression for v4 Translator layout (#21067)
fix: omit bigint priceBumpPercentage from IPC config in testbench worker
(#21086)
feat: standby mode for prover broker (#21098)
fix(p2p): remove default block handler in favor of block handler
(#21105)
chore: prepare barretenberg-rs for crates.io publishing (#20496)
feat: reenable function selectors + additional validation in public
setup allowlist (backport #20909, #21122) (#21129)
chore: remove stale aes comments (#21133)
chore: remove auto-tag job (#21127)
feat: calldata length validation of public setup function allowlist
(#21139)
feat: run AVM NAPI simulations on dedicated threads instead of libuv
pool (#21138)
feat: Remove non-protocol contracts from public setup allowlist (#21154)
END_COMMIT_OVERRIDE

---------

Co-authored-by: ledwards2225 <ledwards2225@users.noreply.github.com>
Co-authored-by: PhilWindle <PhilWindle@users.noreply.github.com>
Co-authored-by: ludamad <adam.domurad@gmail.com>
Co-authored-by: mrzeszutko <mrzeszutko@users.noreply.github.com>
Co-authored-by: spalladino <spalladino@users.noreply.github.com>
Co-authored-by: johnathan79717 <johnathan79717@users.noreply.github.com>
Co-authored-by: nventuro <nventuro@users.noreply.github.com>
Co-authored-by: alexghr <alexghr@users.noreply.github.com>
Co-authored-by: AztecBot <AztecBot@users.noreply.github.com>
Co-authored-by: Martin Verzilli <martin@aztec-labs.com>
github-merge-queue Bot pushed a commit that referenced this pull request Mar 5, 2026
BEGIN_COMMIT_OVERRIDE
fix(avm)!: memory pre-audit (#21058)
fix(avm)!: memory trace changes (#21059)
fix!: AVM was missing range check on remainder for div in ALU (#21074)
feat: run AVM NAPI simulations on dedicated threads instead of libuv
pool (#21138)
feat(avm)!: Unify nullifier, written slots and retrieved bytecodes tree
traces (#20949)
fix(avm)!: public inputs pre-audit (#21162)
END_COMMIT_OVERRIDE
ludamad added a commit that referenced this pull request Mar 10, 2026
BEGIN_COMMIT_OVERRIDE
chore: chonk proof compression poc (#20645)
feat: Update L1 to L2 message APIs (#20913)
fix: adapt chonk proof compression for v4 Translator layout (#21067)
fix: omit bigint priceBumpPercentage from IPC config in testbench worker
(#21086)
feat: standby mode for prover broker (#21098)
fix(p2p): remove default block handler in favor of block handler
(#21105)
chore: prepare barretenberg-rs for crates.io publishing (#20496)
feat: reenable function selectors + additional validation in public
setup allowlist (backport #20909, #21122) (#21129)
chore: remove stale aes comments (#21133)
chore: remove auto-tag job (#21127)
feat: calldata length validation of public setup function allowlist
(#21139)
feat: run AVM NAPI simulations on dedicated threads instead of libuv
pool (#21138)
feat: Remove non-protocol contracts from public setup allowlist (#21154)
feat!: Expose offchain effects when simulating/sending txs (backport
#20563) (#21110)
chore: bump minor version (#21171)
chore: backport #21161 (tally slashing pruning improvements) to v4
(#21166)
chore: More updated Alpha configuration (backport #21155) (#21165)
fix(p2p): report most severe failure in runValidations (#21185)
feat: add ergonomic conversions for Noir's `Option<T>` (#21107)
docs: clarifying Noir fields vs struct fields in event metadata (#21172)
fix: bump lighthouse consensus client v7.1.0 -> v8.0.1 (#21170)
fix: update dependencies (#20997)
chore: New alpha-net environment (#20800) (#21202)
chore: code decuplication + refactor (public setup allowlist) (#21200)
feat: mask all ciphertext fields with Poseidon2-derived values (backport
#21009) (#21140)
chore: disable sponsored FPC in testnet (#21235)
feat!: exposing pub event pagination on wallet (#21197)
refactor(pxe): narrow tryGetPublicKeysAndPartialAddress return type
(backport #21208) (#21236)
feat: orchestrator enqueues via serial queue (#21247)
feat: rollup mana limit gas validation (#21219)
chore: deploy SPONSORED_FPC in test networks (#21254)
fix(sequencer): fix log when not enough txs (#21297)
END_COMMIT_OVERRIDE

---------

Co-authored-by: ledwards2225 <ledwards2225@users.noreply.github.com>
Co-authored-by: PhilWindle <PhilWindle@users.noreply.github.com>
Co-authored-by: ludamad <adam.domurad@gmail.com>
Co-authored-by: mrzeszutko <mrzeszutko@users.noreply.github.com>
Co-authored-by: spalladino <spalladino@users.noreply.github.com>
Co-authored-by: johnathan79717 <johnathan79717@users.noreply.github.com>
Co-authored-by: nventuro <nventuro@users.noreply.github.com>
Co-authored-by: alexghr <alexghr@users.noreply.github.com>
Co-authored-by: AztecBot <AztecBot@users.noreply.github.com>
Co-authored-by: Martin Verzilli <martin@aztec-labs.com>
Co-authored-by: PhilWindle <60546371+PhilWindle@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: mverzilli <mverzilli@users.noreply.github.com>
Co-authored-by: benesjan <benesjan@users.noreply.github.com>
Co-authored-by: danielntmd <danielntmd@users.noreply.github.com>
Co-authored-by: deffrian <deffrian@users.noreply.github.com>
Co-authored-by: benesjan <janbenes1234@gmail.com>
ludamad added a commit that referenced this pull request Mar 11, 2026
BEGIN_COMMIT_OVERRIDE
chore: chonk proof compression poc (#20645)
feat: Update L1 to L2 message APIs (#20913)
fix: adapt chonk proof compression for v4 Translator layout (#21067)
fix: omit bigint priceBumpPercentage from IPC config in testbench worker
(#21086)
feat: standby mode for prover broker (#21098)
fix(p2p): remove default block handler in favor of block handler
(#21105)
chore: prepare barretenberg-rs for crates.io publishing (#20496)
feat: reenable function selectors + additional validation in public
setup allowlist (backport #20909, #21122) (#21129)
chore: remove stale aes comments (#21133)
chore: remove auto-tag job (#21127)
feat: calldata length validation of public setup function allowlist
(#21139)
feat: run AVM NAPI simulations on dedicated threads instead of libuv
pool (#21138)
feat: Remove non-protocol contracts from public setup allowlist (#21154)
feat!: Expose offchain effects when simulating/sending txs (backport
#20563) (#21110)
chore: bump minor version (#21171)
chore: backport #21161 (tally slashing pruning improvements) to v4
(#21166)
chore: More updated Alpha configuration (backport #21155) (#21165)
fix(p2p): report most severe failure in runValidations (#21185)
feat: add ergonomic conversions for Noir's `Option<T>` (#21107)
docs: clarifying Noir fields vs struct fields in event metadata (#21172)
fix: bump lighthouse consensus client v7.1.0 -> v8.0.1 (#21170)
fix: update dependencies (#20997)
chore: New alpha-net environment (#20800) (#21202)
chore: code decuplication + refactor (public setup allowlist) (#21200)
feat: mask all ciphertext fields with Poseidon2-derived values (backport
#21009) (#21140)
chore: disable sponsored FPC in testnet (#21235)
feat!: exposing pub event pagination on wallet (#21197)
refactor(pxe): narrow tryGetPublicKeysAndPartialAddress return type
(backport #21208) (#21236)
feat: orchestrator enqueues via serial queue (#21247)
feat: rollup mana limit gas validation (#21219)
chore: deploy SPONSORED_FPC in test networks (#21254)
fix(sequencer): fix log when not enough txs (#21297)
fix: Simulate gas in n tps test. Set min txs per block to 1 (backport
#21312) (#21329)
fix(log): do not log validation error if unregistered handler (#21111)
fix(node): fix index misalignment in findLeavesIndexes (#21327)
fix: limit parallel blocks in prover to max AVM parallel simulations
(#21320)
fix: use native sha256 to speed up proving job id generation (#21292)
fix(validator): wait for l1 sync before processing block proposals
(#21336)
fix(txpool): cap priority fee with max fees when computing priority
(#21279)
chore: reduce severity of errors due to HA node not acquiring signature
(#21311)
fix: (A-643) add buffer to maxFeePerBlobGas for gas estimation and fix
bump loop truncation (#21323)
END_COMMIT_OVERRIDE
ludamad added a commit that referenced this pull request Mar 16, 2026
…on macOS (#21624)

Fixes SIGBUS crash on macOS in `ThreadedAsyncOperation` (#21138).

`delete op` was inside the `BlockingCall` callback, destroying the
object while `BlockingCall` was still unwinding on the worker thread.
macOS unmaps freed pages aggressively → SIGBUS. Linux keeps them mapped
→ silent use-after-free.

Fix: move `delete this` to after `BlockingCall` returns on the worker
thread.

[Full post mortem with
diagrams](https://gist.github.com/ludamad/443afe321853389a08693c4ff73676f7)
ludamad added a commit that referenced this pull request Mar 16, 2026
…v pool (#21138)

ThreadedAsyncOperation has a use-after-free that causes SIGBUS on macOS
and silent corruption on Linux. Reverting to AsyncOperation (libuv pool)
with the original UV_THREADPOOL_SIZE/2 deadlock-prevention semaphore
until a proper fix lands on next.
ludamad added a commit that referenced this pull request Mar 16, 2026
Reverts #21138 on v4. ThreadedAsyncOperation has a use-after-free that
causes SIGBUS on macOS and silent memory corruption on Linux. Restoring
AsyncOperation (libuv pool) with the original deadlock-prevention
semaphore (UV_THREADPOOL_SIZE / 2) until a proper fix lands on next
(#21625).

[Post
mortem](https://gist.github.com/ludamad/443afe321853389a08693c4ff73676f7)
alexghr added a commit that referenced this pull request Mar 17, 2026
BEGIN_COMMIT_OVERRIDE
fix(aztec-nr): return Option from decode functions and fix event
commitment capacity (backport #21264) (#21360)
fix: backport #21271 — handle bad note lengths on
compute_note_hash_and_nullifier (#21364)
fix: not reusing tags of partially reverted txs (#20817)
chore: revert accidental backport of #20817 (#21583)
feat: Implement commit all and revert all for world state checkpoints
(#21532)
cherry-pick: fix: dependabot alerts (#21531)
fix: dependabot alerts (backport #21531 to v4) (#21592)
fix: backport #21443 — Don't update state if we failed to execute
sufficient transactions (v4) (#21610)
chore: Fix msgpack serialisation (#21612)
fix(p2p): fall back to maxTxsPerCheckpoint for per-block tx validation
(#21605)
chore: merge v4 into backport-to-v4-staging (#21618)
fix(revert): avm sim uses event loop again (#21138) (#21630)
fix(e2e): remove historic/finalized block checks from epochs_multiple
test (#21642)
fix: clamp finalized block to oldest available in world-state (#21643)
fix: skip handleChainFinalized when block is behind oldest available
(#21656)
chore: demote finalized block skip log to trace (#21661)
fix: off-by-1 in getBlockHashMembershipWitness archive snapshot
(backport #21648) (#21663)
fix: capture txs not available error reason in proposal handler (#21670)
chore: add L1 inclusion time to stg public (#21665)
END_COMMIT_OVERRIDE

---------

Co-authored-by: Jan Beneš <janbenes1234@gmail.com>
Co-authored-by: PhilWindle <60546371+PhilWindle@users.noreply.github.com>
Co-authored-by: Phil Windle <philip.windle@gmail.com>
Co-authored-by: Santiago Palladino <santiago@aztecprotocol.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: ludamad <adam.domurad@gmail.com>
Co-authored-by: Alex Gherghisan <alexghr@users.noreply.github.com>
AztecBot pushed a commit that referenced this pull request May 19, 2026
Fixes use-after-free in `ThreadedAsyncOperation` (#21138) that causes SIGBUS on macOS and silent memory corruption on Linux. v4 is handled by reverting: #21630.

**Root cause**: TSFN `BlockingCall` (`napi_tsfn_blocking`) only blocks on *queue insertion*, NOT on callback completion. The callback runs asynchronously on the JS main thread, so `delete this` on the worker thread raced with the callback reading member fields. macOS's magazine malloc aggressively unmaps freed pages, turning this into a consistent SIGBUS. Linux glibc keeps pages mapped, so the race is silent.

**Fix**: manage `ThreadedAsyncOperation` via `shared_ptr` (`enable_shared_from_this`). Both the worker thread lambda and the TSFN callback capture a `shared_ptr`, so the object lives until both are done. Verified clean under ASAN with 1000+ concurrent operations (heap-use-after-free confirmed on buggy code, clean on fix).
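
The lifetime rule in the fix can be shown with plain C++. This is a sketch under stated assumptions, not the actual class: `Operation` is a hypothetical type, the `pending` vector stands in for the TSFN queue that the JS main thread drains later, and the `live` counter exists only so the object's lifetime can be observed.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <memory>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical operation whose lifetime is shared between the worker thread
// and the completion callback that runs later on another thread.
class Operation : public std::enable_shared_from_this<Operation> {
  public:
    Operation(std::atomic<int>& live, std::function<void(int)> on_done)
        : live_(live), on_done_(std::move(on_done))
    {
        ++live_;
    }
    ~Operation() { --live_; }

    // Must be called on an object owned by a shared_ptr.
    // `pending` stands in for the queue of callbacks awaiting the main thread.
    std::thread start(std::vector<std::function<void()>>& pending)
    {
        std::shared_ptr<Operation> self = shared_from_this();
        return std::thread([self = std::move(self), &pending]() mutable {
            self->result_ = 42; // placeholder for the real work
            // The queued callback holds its own reference, so the object
            // outlives the worker even though the callback runs later.
            pending.push_back([self] { self->on_done_(self->result_); });
            self.reset(); // the worker is finished with the object
        });
    }

  private:
    std::atomic<int>& live_;
    std::function<void(int)> on_done_;
    int result_ = 0;
};
```

Enqueueing the callback returns before the callback has run, so neither side can know it is the last user. Reference counting lets whichever side finishes last destroy the object.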

[Full post mortem](https://gist.github.com/ludamad/443afe321853389a08693c4ff73676f7)
github-merge-queue Bot pushed a commit that referenced this pull request May 19, 2026
…cOS (#21625)

danielntmd pushed a commit to danielntmd/aztec-packages that referenced this pull request Jun 4, 2026
…AztecProtocol#23469)

## Summary

`aztec start --local-network` reliably SIGBUSes a few blocks into a run
on macOS arm64 (since `v5.0.0-nightly.20260520`, i.e. after AztecProtocol#21625
shipped the `shared_ptr` use-after-free fix). This is a **different**
fault from the one AztecProtocol#21625 fixed: a stack-guard violation (stack
overflow) on a `nodejs_module.node` worker thread running AVM-simulation
code, not a use-after-free.

This pins an explicit, generous stack size on the
`ThreadedAsyncOperation` worker thread.

## Root cause

`ThreadedAsyncOperation::Queue()` (introduced in AztecProtocol#21138) runs the AVM
simulation (`_fn`) directly on a bare `std::thread(...).detach()`. A
`std::thread` uses the OS default stack for non-main threads, which is
**512 KB on macOS** versus **8 MB on Linux**. The AVM-simulation call
chain is deep enough to overflow 512 KB, so on macOS arm64 the worker
writes into its stack-guard page and the process aborts with:

```
EXC_BAD_ACCESS / SIGBUS, KERN_PROTECTION_FAILURE
"Could not determine thread index for stack guard region"
  #0 _platform_memmove
  #1.. nodejs_module.node  bb::nodejs (AVM simulation path)
```

Linux is unaffected because its 8 MB default is comfortably large. The
previous `AsyncOperation` path never hit this either: it ran on the
libuv threadpool, whose threads are sized from `RLIMIT_STACK` (8 MB soft
on macOS), not the 512 KB raw-thread default.
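
To compare the two figures on a given machine, a small POSIX probe along these lines may help. It is a sketch that assumes `std::thread` is created with default pthread attributes, as common standard library implementations do; both helper names are hypothetical and nothing here comes from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <pthread.h>
#include <sys/resource.h>

// Stack size a thread created with default pthread attributes would get.
// Returns 0 if the attributes cannot be queried.
inline std::size_t default_thread_stack_size()
{
    pthread_attr_t attr;
    if (pthread_attr_init(&attr) != 0) {
        return 0;
    }
    std::size_t size = 0;
    if (pthread_attr_getstacksize(&attr, &size) != 0) {
        size = 0;
    }
    pthread_attr_destroy(&attr);
    return size;
}

// Soft RLIMIT_STACK in bytes, or 0 if it is unavailable or unlimited.
inline std::size_t soft_stack_rlimit()
{
    struct rlimit lim;
    if (getrlimit(RLIMIT_STACK, &lim) != 0 || lim.rlim_cur == RLIM_INFINITY) {
        return 0;
    }
    return static_cast<std::size_t>(lim.rlim_cur);
}
```

Printing both values on macOS and on Linux is a cheap way to check the platform difference described above before reasoning about call-chain depth.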

## Fix

`std::thread` can't set a stack size, so launch the worker via
`pthreads` with `pthread_attr_setstacksize` pinned to a generous
`WORKER_STACK_SIZE` (32 MB — 4× the 8 MB that the libuv path proved
sufficient, with headroom for deeper future call chains). Falls back to
a default-stack `std::thread` only if pthreads is unavailable (`_WIN32`)
or `pthread_create` fails.

The shared_ptr lifetime model from AztecProtocol#21625 is preserved exactly — both
the worker lambda and the `BlockingCall` completion callback still
capture `self`, so this does not reintroduce the use-after-free. Only
the thread-launch mechanism changed.
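
The launch mechanism can be sketched as below. This is an illustration, not the PR's code: `kWorkerStackSize`, `worker_trampoline` and `launch_with_stack` are hypothetical names. Unlike the described fix, the sketch keeps the thread joinable and leaves the `std::thread` fallback to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <pthread.h>
#include <utility>

// Hypothetical constant mirroring the 32 MB figure from the description.
constexpr std::size_t kWorkerStackSize = 32UL * 1024 * 1024;

// pthread entry point: takes ownership of the heap-allocated callable,
// runs it, and frees it when the unique_ptr goes out of scope.
static void* worker_trampoline(void* arg)
{
    std::unique_ptr<std::function<void()>> fn(static_cast<std::function<void()>*>(arg));
    (*fn)();
    return nullptr;
}

// Start `work` on a joinable pthread with an explicit stack size.
// Returns true and sets `out` on success. On failure nothing is started,
// so the caller can fall back to a default-stack thread.
inline bool launch_with_stack(std::function<void()> work, std::size_t stack_size, pthread_t& out)
{
    pthread_attr_t attr;
    if (pthread_attr_init(&attr) != 0) {
        return false;
    }
    bool started = false;
    if (pthread_attr_setstacksize(&attr, stack_size) == 0) {
        auto fn = std::make_unique<std::function<void()>>(std::move(work));
        if (pthread_create(&out, &attr, worker_trampoline, fn.get()) == 0) {
            fn.release(); // ownership has passed to the trampoline
            started = true;
        }
    }
    pthread_attr_destroy(&attr);
    return started;
}
```

The trampoline is needed because `pthread_create` accepts only a plain function pointer and a `void*` argument, so the `std::function` is boxed on the heap and unboxed on the new thread.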

## Testing

- The full bb build is too heavy to run in this session, so this is
**not yet a local end-to-end repro/fix verification** — it relies on CI
for compilation and on a macOS arm64 `aztec start --local-network` run
to confirm the crash is gone.
- The pthread/`std::function` trampoline was compiled and run standalone
under `-std=c++20 -Wall -Wextra -Werror`: the worker thread receives a
32 MB stack (`pthread_get_stacksize_np` reports `33554432`), and the
work runs and completes.
- **Requested:** verify against tonight's nightly on macOS arm64 (M3) —
the reporter's exact repro.

## Notes for reviewers

- Targets `next` (not `merge-train/barretenberg`) to match AztecProtocol#21625's base
and to make the nightly, since this is an urgent release-affecting
crash. Happy to retarget if you'd prefer it go through the merge train.
- 32 MB is a deliberate over-provision; if you'd rather mirror the libuv
path precisely we could instead size from `getrlimit(RLIMIT_STACK)`. The
fixed constant is simpler and the virtual reservation only commits pages
as touched.
- The longer-term fix is the NAPI→IPC migration (AztecProtocol#21331 / AztecProtocol#23196 /
AztecProtocol#23238), which removes this in-process worker entirely. This is a
targeted stop-gap for the shipping NAPI path.

Related: AztecProtocol#21138 (introduced the threaded model), AztecProtocol#21625 (use-after-free
fix), AztecProtocol#21629 (open alternative).

---
*Created by
[claudebox](https://claudebox.work/v2/sessions/4bd36dc505c20254) ·
group: `slackbot`*