fix: use shared_ptr in ThreadedAsyncOperation to prevent SIGBUS on macOS #21625
Merged
310a821 to 1f5c0c4
This was referenced Mar 16, 2026
1f5c0c4 to 8e2d14f
ludamad added a commit that referenced this pull request Mar 16, 2026
Reverts #21138 on v4. ThreadedAsyncOperation has a use-after-free that causes SIGBUS on macOS and silent memory corruption on Linux. Restoring AsyncOperation (libuv pool) with the original deadlock-prevention semaphore (UV_THREADPOOL_SIZE / 2) until a proper fix lands on next (#21625). [Post mortem](https://gist.github.com/ludamad/443afe321853389a08693c4ff73676f7)
dbanks12 approved these changes May 19, 2026
Fixes use-after-free in `ThreadedAsyncOperation` (#21138) that causes SIGBUS on macOS and silent memory corruption on Linux. v4 is handled by reverting: #21630.

**Root cause**: TSFN `BlockingCall` (`napi_tsfn_blocking`) only blocks on *queue insertion*, NOT on callback completion. The callback runs asynchronously on the JS main thread, so `delete this` on the worker thread raced with the callback reading member fields. macOS's magazine malloc aggressively unmaps freed pages, turning this into a consistent SIGBUS. Linux glibc keeps pages mapped, so the race is silent.

**Fix**: manage `ThreadedAsyncOperation` via `shared_ptr` (`enable_shared_from_this`). Both the worker thread lambda and the TSFN callback capture a `shared_ptr`, so the object lives until both are done.

Verified clean under ASAN with 1000+ concurrent operations (heap-use-after-free confirmed on buggy code, clean on fix).

[Full post mortem](https://gist.github.com/ludamad/443afe321853389a08693c4ff73676f7)
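For readers who want to see the lifetime model in isolation, here is a minimal sketch of the pattern described above. It is not the code from this PR: `MainThreadQueue` is a hypothetical stand-in for the TSFN (its `post()` returns once the callback is queued, not once it has run), and `SketchOperation`, `start()`, and `run_sketch()` are illustrative names. The sketch also joins the worker so the example finishes deterministically, which is an assumption of the demo rather than something the description states.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <memory>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>

// Hypothetical stand-in for a thread-safe function: post() returns as soon as
// the callback has been queued, not when it has run.
class MainThreadQueue {
  public:
    void post(std::function<void()> cb) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_.push(std::move(cb));
        }
        ready_.notify_one();
    }

    // Called on the "main" thread: waits for one queued callback and runs it.
    void run_one() {
        std::function<void()> cb;
        {
            std::unique_lock<std::mutex> lock(mutex_);
            ready_.wait(lock, [this] { return !pending_.empty(); });
            cb = std::move(pending_.front());
            pending_.pop();
        }
        cb();
    }

  private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::function<void()>> pending_;
};

class SketchOperation : public std::enable_shared_from_this<SketchOperation> {
  public:
    SketchOperation(MainThreadQueue& queue, std::function<std::string()> work)
        : queue_(queue), work_(std::move(work)) {}

    // Must be called on an object that is already owned by a shared_ptr.
    std::thread start(std::function<void(const std::string&)> on_done) {
        auto self = shared_from_this();
        return std::thread([self, on_done = std::move(on_done)] {
            self->result_ = self->work_();
            // The queued callback captures its own reference, so the object
            // stays alive even after this worker thread has finished.
            self->queue_.post([self, on_done] { on_done(self->result_); });
        });
    }

  private:
    MainThreadQueue& queue_;
    std::function<std::string()> work_;
    std::string result_;
};

// Runs one operation end to end and returns what the completion callback saw.
std::string run_sketch() {
    MainThreadQueue queue;
    std::string seen;
    std::weak_ptr<SketchOperation> watch;
    std::thread worker;
    {
        auto op = std::make_shared<SketchOperation>(queue, [] { return std::string("done"); });
        watch = op;
        worker = op->start([&seen](const std::string& r) { seen = r; });
    } // the caller's reference is dropped here
    worker.join();
    assert(!watch.expired()); // the queued callback still owns the object
    queue.run_one();          // the callback reads result_ from a live object
    assert(watch.expired());  // last owner gone, object destroyed
    return seen;
}
```

Calling `run_sketch()` exercises the ordering that caused the original race: the worker is already gone by the time the callback runs, and the object is released only after the last capturing closure is destroyed.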
75fe0db to e01da06
danielntmd pushed a commit to danielntmd/aztec-packages that referenced this pull request Jun 4, 2026
…AztecProtocol#23469)

## Summary

`aztec start --local-network` reliably SIGBUSes a few blocks into a run on macOS arm64 (since `v5.0.0-nightly.20260520`, i.e. after AztecProtocol#21625 shipped the `shared_ptr` use-after-free fix). This is a **different** fault from the one AztecProtocol#21625 fixed: a stack-guard violation (stack overflow) on a `nodejs_module.node` worker thread running AVM-simulation code, not a use-after-free.

This pins an explicit, generous stack size on the `ThreadedAsyncOperation` worker thread.

## Root cause

`ThreadedAsyncOperation::Queue()` (introduced in AztecProtocol#21138) runs the AVM simulation (`_fn`) directly on a bare `std::thread(...).detach()`. A `std::thread` uses the OS default stack for non-main threads, which is **512 KB on macOS** versus **8 MB on Linux**. The AVM-simulation call chain is deep enough to overflow 512 KB, so on macOS arm64 the worker writes into its stack-guard page and the process aborts with:

```
EXC_BAD_ACCESS / SIGBUS, KERN_PROTECTION_FAILURE
"Could not determine thread index for stack guard region"
#0 _platform_memmove
#1.. nodejs_module.node bb::nodejs (AVM simulation path)
```

Linux is unaffected because its 8 MB default is comfortably large. The previous `AsyncOperation` path never hit this either: it ran on the libuv threadpool, whose threads are sized from `RLIMIT_STACK` (8 MB soft on macOS), not the 512 KB raw-thread default.

## Fix

`std::thread` can't set a stack size, so launch the worker via `pthreads` with `pthread_attr_setstacksize` pinned to a generous `WORKER_STACK_SIZE` (32 MB, which is 4× the 8 MB that the libuv path proved sufficient, with headroom for deeper future call chains). Falls back to a default-stack `std::thread` only if pthreads is unavailable (`_WIN32`) or `pthread_create` fails.

The shared_ptr lifetime model from AztecProtocol#21625 is preserved exactly: both the worker lambda and the `BlockingCall` completion callback still capture `self`, so this does not reintroduce the use-after-free. Only the thread-launch mechanism changed.

## Testing

- The full bb build is too heavy to run in this session, so this is **not yet a local end-to-end repro/fix verification**; it relies on CI for compilation and on a macOS arm64 `aztec start --local-network` run to confirm the crash is gone.
- The pthread/`std::function` trampoline was compiled and run standalone under `-std=c++20 -Wall -Wextra -Werror`: the worker thread receives a 32 MB stack (`pthread_get_stacksize_np` reports `33554432`), and the work runs and completes.
- **Requested:** verify against tonight's nightly on macOS arm64 (M3), the reporter's exact repro.

## Notes for reviewers

- Targets `next` (not `merge-train/barretenberg`) to match AztecProtocol#21625's base and to make the nightly, since this is an urgent release-affecting crash. Happy to retarget if you'd prefer it go through the merge train.
- 32 MB is a deliberate over-provision; if you'd rather mirror the libuv path precisely we could instead size from `getrlimit(RLIMIT_STACK)`. The fixed constant is simpler and the virtual reservation only commits pages as touched.
- The longer-term fix is the NAPI→IPC migration (AztecProtocol#21331 / AztecProtocol#23196 / AztecProtocol#23238), which removes this in-process worker entirely. This is a targeted stop-gap for the shipping NAPI path.

Related: AztecProtocol#21138 (introduced the threaded model), AztecProtocol#21625 (use-after-free fix), AztecProtocol#21629 (open alternative).

---

*Created by [claudebox](https://claudebox.work/v2/sessions/4bd36dc505c20254) · group: `slackbot`*
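The launch mechanism that commit message describes can be sketched roughly as follows. This is an illustration under stated assumptions, not the committed code: `worker_trampoline`, `launch_worker`, and `run_on_worker` are hypothetical names, creating the thread detached via `pthread_attr_setdetachstate` is an assumption made to mirror the earlier `detach()`, and only `WORKER_STACK_SIZE` (32 MB) and the fallback conditions are taken from the message.

```cpp
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <thread>
#include <utility>
#ifndef _WIN32
#include <pthread.h>
#endif

// 32 MB, the value quoted in the commit message.
constexpr std::size_t WORKER_STACK_SIZE = 32UL * 1024 * 1024;

#ifndef _WIN32
// pthread entry point: takes ownership of the heap-allocated closure and runs it.
extern "C" void* worker_trampoline(void* arg) {
    std::unique_ptr<std::function<void()>> work(static_cast<std::function<void()>*>(arg));
    (*work)();
    return nullptr;
}
#endif

// Launches `work` on a detached thread. Returns true if the pinned-stack
// pthread path was used, false if it fell back to a default-stack std::thread.
bool launch_worker(std::function<void()> work) {
#ifndef _WIN32
    pthread_attr_t attr;
    if (pthread_attr_init(&attr) == 0) {
        // Copy the closure so `work` is still usable if we need the fallback.
        auto boxed = std::make_unique<std::function<void()>>(work);
        pthread_t tid;
        const bool ok =
            pthread_attr_setstacksize(&attr, WORKER_STACK_SIZE) == 0 &&
            pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED) == 0 &&
            pthread_create(&tid, &attr, worker_trampoline, boxed.get()) == 0;
        pthread_attr_destroy(&attr);
        if (ok) {
            boxed.release(); // the trampoline owns the closure from here on
            return true;
        }
    }
#endif
    std::thread(std::move(work)).detach();
    return false;
}

// Demo helper: runs a trivial task on the worker and waits for its result.
// The promise is held by shared_ptr so it outlives whichever side finishes last.
int run_on_worker() {
    auto done = std::make_shared<std::promise<int>>();
    std::future<int> result = done->get_future();
    launch_worker([done] { done->set_value(42); });
    return result.get();
}
```

Passing the closure through a raw pointer is what lets a C-style `pthread_create` entry point run arbitrary `std::function` work; in the real operation that closure would be the one capturing `self`, so the lifetime model is untouched by the change of launcher.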