fix: use shared_ptr in ThreadedAsyncOperation to prevent SIGBUS on macOS #21625

Merged: dbanks12 merged 1 commit into next from fix/threaded-async-op-sigbus-next on May 19, 2026
ludamad (Collaborator) commented Mar 16, 2026

Fixes use-after-free in `ThreadedAsyncOperation` (#21138) that causes SIGBUS on macOS and silent memory corruption on Linux. v4 is handled by reverting: #21630.

**Root cause**: TSFN `BlockingCall` (`napi_tsfn_blocking`) only blocks on *queue insertion*, NOT on callback completion. The callback runs asynchronously on the JS main thread, so `delete this` on the worker thread raced with the callback reading member fields. macOS's magazine malloc aggressively unmaps freed pages, turning this into a consistent SIGBUS. Linux glibc keeps pages mapped, so the race is silent.

**Fix**: manage `ThreadedAsyncOperation` via `shared_ptr` (`enable_shared_from_this`). Both the worker thread lambda and the TSFN callback capture a `shared_ptr`, so the object lives until both are done. Verified clean under ASAN with 1000+ concurrent operations (heap-use-after-free confirmed on buggy code, clean on fix).
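
The lifetime rule described in the fix can be illustrated without N-API. The following is a minimal sketch, not the code from this PR: `MainLoopQueue` and `Operation` are hypothetical stand-ins for the TSFN queue and `ThreadedAsyncOperation`, and `post()` is assumed to return as soon as the callback is enqueued, which is the behavior the root-cause paragraph attributes to `BlockingCall`.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical stand-in for the JS main-thread queue that a TSFN feeds.
// post() returns once the callback is enqueued, not once it has run.
class MainLoopQueue {
public:
    void post(std::function<void()> cb) {
        std::lock_guard<std::mutex> lock(mutex_);
        callbacks_.push(std::move(cb));
    }
    // Runs every queued callback on the calling ("main") thread.
    void drain() {
        for (;;) {
            std::function<void()> cb;
            {
                std::lock_guard<std::mutex> lock(mutex_);
                if (callbacks_.empty()) return;
                cb = std::move(callbacks_.front());
                callbacks_.pop();
            }
            cb();  // cb (and its captured shared_ptr) is destroyed after this iteration
        }
    }
private:
    std::mutex mutex_;
    std::queue<std::function<void()>> callbacks_;
};

// Hypothetical operation type; not the real ThreadedAsyncOperation.
class Operation : public std::enable_shared_from_this<Operation> {
public:
    Operation(MainLoopQueue& loop, std::atomic<int>& destroyed, int& sink)
        : loop_(loop), destroyed_(destroyed), sink_(sink) {}
    ~Operation() { destroyed_.fetch_add(1); }

    std::thread start() {
        auto self = shared_from_this();  // owner #1: the worker lambda
        return std::thread([self] {
            self->result_ = 42;  // the "work"
            // owner #2: the completion callback reads members later, on another thread
            self->loop_.post([self] { self->sink_ = self->result_; });
        });
    }
private:
    MainLoopQueue& loop_;
    std::atomic<int>& destroyed_;
    int& sink_;
    int result_ = 0;
};

// Returns true if the object outlived the worker and was destroyed only after the callback ran.
bool lifetime_demo() {
    MainLoopQueue loop;
    std::atomic<int> destroyed{0};
    int sink = 0;
    std::thread worker = std::make_shared<Operation>(loop, destroyed, sink)->start();
    worker.join();
    bool alive_after_worker = destroyed.load() == 0;  // queued callback still owns it
    loop.drain();
    return alive_after_worker && sink == 42 && destroyed.load() == 1;
}
```

With a raw pointer and `delete this` at the end of the worker lambda, the queued callback in this sketch would read a destroyed object, which is the race described above.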

[Full post mortem](https://gist.github.com/ludamad/443afe321853389a08693c4ff73676f7)

ludamad added the ci-barretenberg label ("Run all barretenberg/cpp checks.") Mar 16, 2026
ludamad force-pushed the fix/threaded-async-op-sigbus-next branch 2 times, most recently from 310a821 to 1f5c0c4, March 16, 2026 19:03
ludamad force-pushed the fix/threaded-async-op-sigbus-next branch from 1f5c0c4 to 8e2d14f, March 16, 2026 19:32
ludamad changed the title from "fix: use NonBlockingCall in ThreadedAsyncOperation to prevent SIGBUS on macOS" to "fix: use shared_ptr in ThreadedAsyncOperation to prevent SIGBUS on macOS" Mar 16, 2026
ludamad added a commit that referenced this pull request Mar 16, 2026
Reverts #21138 on v4. ThreadedAsyncOperation has a use-after-free that
causes SIGBUS on macOS and silent memory corruption on Linux. Restoring
AsyncOperation (libuv pool) with the original deadlock-prevention
semaphore (UV_THREADPOOL_SIZE / 2) until a proper fix lands on next
(#21625).

[Post mortem](https://gist.github.com/ludamad/443afe321853389a08693c4ff73676f7)
AztecBot force-pushed the fix/threaded-async-op-sigbus-next branch from 75fe0db to e01da06, May 19, 2026 16:33
dbanks12 added this pull request to the merge queue May 19, 2026
Merged via the queue into next with commit 53db7b9 May 19, 2026 (22 checks passed)
dbanks12 deleted the fix/threaded-async-op-sigbus-next branch May 19, 2026 17:28
danielntmd pushed a commit to danielntmd/aztec-packages that referenced this pull request Jun 4, 2026
…AztecProtocol#23469)

## Summary

`aztec start --local-network` reliably SIGBUSes a few blocks into a run
on macOS arm64 (since `v5.0.0-nightly.20260520`, i.e. after AztecProtocol#21625
shipped the `shared_ptr` use-after-free fix). This is a **different**
fault from the one AztecProtocol#21625 fixed: a stack-guard violation (stack
overflow) on a `nodejs_module.node` worker thread running AVM-simulation
code, not a use-after-free.

This pins an explicit, generous stack size on the
`ThreadedAsyncOperation` worker thread.

## Root cause

`ThreadedAsyncOperation::Queue()` (introduced in AztecProtocol#21138) runs the AVM
simulation (`_fn`) directly on a bare `std::thread(...).detach()`. A
`std::thread` uses the OS default stack for non-main threads, which is
**512 KB on macOS** versus **8 MB on Linux**. The AVM-simulation call
chain is deep enough to overflow 512 KB, so on macOS arm64 the worker
writes into its stack-guard page and the process aborts with:

```
EXC_BAD_ACCESS / SIGBUS, KERN_PROTECTION_FAILURE
"Could not determine thread index for stack guard region"
  #0 _platform_memmove
  #1.. nodejs_module.node  bb::nodejs (AVM simulation path)
```

Linux is unaffected because its 8 MB default is comfortably large. The
previous `AsyncOperation` path never hit this either: it ran on the
libuv threadpool, whose threads are sized from `RLIMIT_STACK` (8 MB soft
on macOS), not the 512 KB raw-thread default.
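
To check the platform default on a given machine, something like the following could be used. This is a sketch that assumes `std::thread` is built on pthreads with default attributes, as it commonly is in POSIX standard library implementations; `default_thread_stack_size` is a hypothetical helper name, not something from the codebase.

```cpp
#include <cassert>
#include <cstddef>
#include <pthread.h>

// Returns the stack size a default-attribute pthread would get, or 0 on error.
std::size_t default_thread_stack_size() {
    pthread_attr_t attr;
    if (pthread_attr_init(&attr) != 0) return 0;
    std::size_t size = 0;
    int rc = pthread_attr_getstacksize(&attr, &size);
    pthread_attr_destroy(&attr);
    return rc == 0 ? size : 0;
}
```

Per the figures quoted above, this would be expected to report roughly 512 KB on macOS and roughly 8 MB on a typical Linux configuration; the exact value depends on the platform and its limits.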

## Fix

`std::thread` can't set a stack size, so launch the worker via
`pthreads` with `pthread_attr_setstacksize` pinned to a generous
`WORKER_STACK_SIZE` (32 MB — 4× the 8 MB that the libuv path proved
sufficient, with headroom for deeper future call chains). Falls back to
a default-stack `std::thread` only if pthreads is unavailable (`_WIN32`)
or `pthread_create` fails.

The shared_ptr lifetime model from AztecProtocol#21625 is preserved exactly — both
the worker lambda and the `BlockingCall` completion callback still
capture `self`, so this does not reintroduce the use-after-free. Only
the thread-launch mechanism changed.
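
The launch mechanism described above could look roughly like this. It is a simplified sketch, not the merged code: `run_task_trampoline` and `run_with_stack_size` are hypothetical names, the `_WIN32` branch is omitted, and the sketch joins the thread for determinism where the real `Queue()` detaches.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <pthread.h>
#include <thread>

// Value taken from the description above (32 MB).
constexpr std::size_t WORKER_STACK_SIZE = 32u * 1024u * 1024u;

// pthread entry point: takes ownership of the heap-allocated task and runs it.
extern "C" void* run_task_trampoline(void* arg) {
    std::unique_ptr<std::function<void()>> task(static_cast<std::function<void()>*>(arg));
    (*task)();
    return nullptr;
}

// Runs `work` on a thread with the requested stack size and waits for it to finish.
// Returns true if the explicitly sized pthread was used, false if the std::thread fallback ran.
bool run_with_stack_size(std::function<void()> work, std::size_t stack_size = WORKER_STACK_SIZE) {
    auto task = std::make_unique<std::function<void()>>(std::move(work));
    pthread_attr_t attr;
    if (pthread_attr_init(&attr) == 0) {
        pthread_t tid;
        bool created = pthread_attr_setstacksize(&attr, stack_size) == 0 &&
                       pthread_create(&tid, &attr, run_task_trampoline, task.get()) == 0;
        pthread_attr_destroy(&attr);
        if (created) {
            task.release();  // the trampoline owns the task now
            pthread_join(tid, nullptr);
            return true;
        }
    }
    // Fallback: default-stack thread, used only if the pthread path failed.
    std::thread(std::move(*task)).join();
    return false;
}
```

In the PR's setting, the `work` closure would be the lambda that captures `self`, so the `shared_ptr` ownership is unchanged by the switch from `std::thread` to `pthread_create`.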

## Testing

- The full bb build is too heavy to run in this session, so this is
**not yet a local end-to-end repro/fix verification** — it relies on CI
for compilation and on a macOS arm64 `aztec start --local-network` run
to confirm the crash is gone.
- The pthread/`std::function` trampoline was compiled and run standalone
under `-std=c++20 -Wall -Wextra -Werror`: the worker thread receives a
32 MB stack (`pthread_get_stacksize_np` reports `33554432`), and the
work runs and completes.
- **Requested:** verify against tonight's nightly on macOS arm64 (M3) —
the reporter's exact repro.

## Notes for reviewers

- Targets `next` (not `merge-train/barretenberg`) to match AztecProtocol#21625's base
and to make the nightly, since this is an urgent release-affecting
crash. Happy to retarget if you'd prefer it go through the merge train.
- 32 MB is a deliberate over-provision; if you'd rather mirror the libuv
path precisely we could instead size from `getrlimit(RLIMIT_STACK)`. The
fixed constant is simpler and the virtual reservation only commits pages
as touched.
- The longer-term fix is the NAPI→IPC migration (AztecProtocol#21331 / AztecProtocol#23196 /
AztecProtocol#23238), which removes this in-process worker entirely. This is a
targeted stop-gap for the shipping NAPI path.
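
For comparison, the `getrlimit(RLIMIT_STACK)` alternative mentioned in the second note might be sketched as below. The floor and cap constants and the function name are assumptions made for illustration; the PR itself uses a fixed 32 MB constant. On macOS, `pthread_attr_setstacksize` is documented to reject sizes that are not a multiple of the page size, so a real implementation would likely also need to round the result.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/resource.h>

// Hypothetical floor: the 8 MB the libuv path is described as having run with.
constexpr std::size_t MIN_WORKER_STACK = 8u * 1024u * 1024u;
// Hypothetical cap so a very large soft limit does not reserve an excessive stack.
constexpr std::size_t MAX_WORKER_STACK = 64u * 1024u * 1024u;

// Derives a worker stack size from the soft RLIMIT_STACK, clamped to [MIN, MAX].
// Falls back to the floor if the limit cannot be read or is unlimited.
std::size_t stack_size_from_rlimit() {
    struct rlimit lim;
    if (getrlimit(RLIMIT_STACK, &lim) != 0 || lim.rlim_cur == RLIM_INFINITY) {
        return MIN_WORKER_STACK;
    }
    rlim_t soft = lim.rlim_cur;
    if (soft < MIN_WORKER_STACK) return MIN_WORKER_STACK;
    if (soft > MAX_WORKER_STACK) return MAX_WORKER_STACK;
    return static_cast<std::size_t>(soft);
}
```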

Related: AztecProtocol#21138 (introduced the threaded model), AztecProtocol#21625 (use-after-free
fix), AztecProtocol#21629 (open alternative).

---
*Created by
[claudebox](https://claudebox.work/v2/sessions/4bd36dc505c20254) ·
group: `slackbot`*