fix: use POSIX pthread keys for thread pool to fix x86_64-macos segfault #21253

Closed
johnathan79717 wants to merge 3 commits into merge-train/barretenberg from jh/fix-tls-segfault

Conversation

@johnathan79717

@johnathan79717 johnathan79717 commented Mar 9, 2026

Contributor

Summary

  • Fixes x86_64-macos segfault in bb prove --scheme chonk when the AVM transpiler (Rust static library) is linked
  • Root cause: Zig's Mach-O linker corrupts C++ thread_local offsets when a Rust static library with __thread_vars sections is linked into the same binary
  • Fix: Replace C++ thread_local with POSIX pthread_key_t-based per-thread storage for the ThreadPool. pthread keys use a runtime hashtable mechanism unaffected by the linker bug

Why pthread_key instead of a global pool with mutex?

  • A global pool with mutex would serialize concurrent parallel_for callers, regressing performance for aztec_process which spawns multiple threads that each call parallel_for
  • pthread_key preserves per-thread pools (same semantics as the original thread_local code) with no efficiency loss
  • The only overhead is one pthread_getspecific call per parallel_for invocation (~10ns)
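
The per-thread storage pattern described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `PoolStub`, `get_thread_pool`, and `pool_address_on_new_thread` are hypothetical names, and the real ThreadPool is replaced by a stub. Only the `pthread_*` calls are real POSIX APIs.

```cpp
#include <pthread.h>

#include <cstddef>
#include <cstdlib>
#include <thread>

// Hypothetical stand-in for the real ThreadPool; only the storage pattern matters here.
struct PoolStub {
    std::size_t num_workers;
};

namespace {
pthread_key_t pool_key;
pthread_once_t pool_key_once = PTHREAD_ONCE_INIT;

// Called by the pthread runtime when a thread that stored a non-null value
// under the key exits (via pthread_exit or by returning from its start routine).
void destroy_pool(void* p)
{
    delete static_cast<PoolStub*>(p);
}

// Creates the key exactly once for the whole process.
void make_pool_key()
{
    if (pthread_key_create(&pool_key, destroy_pool) != 0) {
        std::abort();
    }
}
} // namespace

// Returns the calling thread's pool, creating it on first use.
// num_workers is only used when the pool does not exist yet.
PoolStub& get_thread_pool(std::size_t num_workers)
{
    pthread_once(&pool_key_once, make_pool_key);
    auto* pool = static_cast<PoolStub*>(pthread_getspecific(pool_key));
    if (pool == nullptr) {
        pool = new PoolStub{ num_workers };
        pthread_setspecific(pool_key, pool);
    }
    return *pool;
}

// Helper for illustration: the pool address observed from a freshly spawned thread.
const void* pool_address_on_new_thread()
{
    const void* addr = nullptr;
    std::thread t([&addr] { addr = &get_thread_pool(4); });
    t.join();
    return addr;
}
```

In this sketch, repeated calls on one thread return the same object, while a spawned thread gets its own instance, which mirrors the `thread_local` semantics without relying on Mach-O TLS sections. Each lookup costs one `pthread_getspecific` call.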

Verification

  • Cross-compiled with Zig for x86_64-macos and tested on macOS 14 VM: no segfault (was exit 139, now completes)
  • Confirmed the original code segfaults both from the npm release binary and when built locally
  • Confirmed the crash only happens when the Rust AVM transpiler library is linked (its __thread_vars sections trigger the linker bug)
  • Native thread tests pass including SpawnedThreadsCanUseParallelFor (tests the aztec_process concurrent parallel_for pattern)
  • Native ultra_honk_tests pass (260/260)

Test plan

  • CI passes (barretenberg build + tests)
  • Cross-compilation for macOS x86_64 succeeds
  • No performance regression in CI benchmarks

Closes AztecProtocol/barretenberg#19769

…acos segfault

When the AVM transpiler (Rust static library) is linked into the zig-cross-compiled
bb binary, it introduces a __thread_data Mach-O section that corrupts C++ thread_local
variable offsets on x86_64-macos. This caused a #GP (trap 13) segfault during
parallel_for when accessing the thread_local ThreadPool.

Fix: use a process-global ThreadPool protected by a mutex. This is also more efficient
as thread_local previously created O(N²) threads (each thread got its own pool of N-1
workers), while a global pool creates exactly N-1 workers total.
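
The thread-count claim above can be checked with a little arithmetic. The sketch below assumes the worst case in which all N caller threads invoke parallel_for and each lazily builds its own pool of N-1 workers; the function names are hypothetical.

```cpp
#include <cstddef>

// Worker threads created when each of n caller threads builds its own pool
// of n - 1 workers (the thread_local layout, worst case).
constexpr std::size_t workers_with_per_thread_pools(std::size_t n)
{
    return n == 0 ? 0 : n * (n - 1);
}

// Worker threads created by a single process-global pool.
constexpr std::size_t workers_with_global_pool(std::size_t n)
{
    return n == 0 ? 0 : n - 1;
}

// With 16 caller threads: 240 workers versus 15.
static_assert(workers_with_per_thread_pools(16) == 240);
static_assert(workers_with_global_pool(16) == 15);
```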

Fixes AztecProtocol/barretenberg#1305
johnathan79717 added the ci-barretenberg label (Run all barretenberg/cpp checks.) Mar 9, 2026
@johnathan79717 johnathan79717 requested a review from ludamad March 9, 2026 15:40
@ludamad

ludamad commented Mar 9, 2026

Collaborator

Aw but I want this thread local for future work upcoming

@ludamad

ludamad commented Mar 9, 2026

Collaborator

There shouldn't be an efficiency loss

@johnathan79717

Contributor Author

> Aw but I want this thread local for future work upcoming

Yeah I don't think we're doing anything wrong but the Zig linker messed it up. I wasn't able to produce a minimal repro case to file a Zig bug.

Anyway, this does fix the reported crash. We can merge this first and find a way around this Zig linker bug when we do need to add it back in the future. @ludamad What do you think?

@ludamad

ludamad commented Mar 9, 2026

Collaborator

If you really want to do this, you need to make aztec_process not rely on this being re-entrant

@johnathan79717 johnathan79717 self-assigned this Mar 10, 2026
johnathan79717 added the ci-full label (Run all master checks.) and removed the ci-barretenberg label (Run all barretenberg/cpp checks.) Mar 10, 2026
@johnathan79717

johnathan79717 commented Mar 10, 2026

Contributor Author

Updated approach: instead of a global pool with mutex (which would serialize concurrent callers), this now uses POSIX pthread_key_t for per-thread pool storage. This preserves the original per-thread pool semantics with no efficiency loss.

The root cause is a known Zig Mach-O linker bug (ziglang/zig#19221) where __thread_vars sections from the Rust AVM transpiler library corrupt C++ thread_local offsets. pthread_key uses a different TLS mechanism (runtime hashtable, not Mach-O TLS sections) that is unaffected.

I also tried -femulated-tls but Zig doesn't provide the __emutls_get_address runtime on macOS.

johnathan79717 changed the title from "fix: replace thread_local ThreadPool with global pool to fix x86_64-macos segfault" to "fix: use POSIX pthread keys for thread pool to fix x86_64-macos segfault" Mar 10, 2026
…linker TLS bug

Replace C++ thread_local with pthread_key_t-based per-thread storage for the
ThreadPool in parallel_for_mutex_pool. This avoids Zig's Mach-O linker bug
(ziglang/zig#19221) where __thread_vars symbol resolution is corrupted when
linking C++ and Rust objects together, causing segfaults on x86_64-macos.

Unlike the previous global-pool-with-mutex approach, pthread_key preserves
per-thread pools so concurrent callers (like aztec_process VK generation)
are not serialized. The only overhead is one pthread_getspecific call per
parallel_for invocation (~10ns), negligible compared to the work done.
@johnathan79717

Contributor Author

Testing ld64.lld as Alternative Linker (from the gist)

We tested Option 1/6 (use ld64.lld-20 for the final link step). It does not fix the bug: ld64.lld-20 produces the same misaligned TLS layout as Zig's built-in linker.

What we tested

Created a wrapper script that uses zig c++ for compilation (-c) but intercepts the link step, captures the zig ld ... invocation via -v, and replaces it with ld64.lld-20 -arch x86_64 -L /opt/zig/lib/libc/darwin .... Successfully built the full bb binary (259 targets).

TLS section layout in the resulting binary:

__thread_data: size 0x30, align 2^3 (8)    # from Rust .o files
__thread_bss:  size 0xCC8, align 2^4 (16)  # C++ thread_local objects

The __thread_bss offset from __thread_data start is 0x38, and 0x38 % 16 = 8, so it is still misaligned. The crashing ThreadPool variable has TLS descriptor offset 0x2B8, also 8 mod 16.

This is the same misalignment we see with Zig's linker (original binary had 0x28 % 16 = 8). Both are LLVM-based and share the same Mach-O TLS layout algorithm.
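
The offset arithmetic above can be verified at compile time. This is only a sketch of the modular arithmetic, with hypothetical helper names; it does not inspect a real Mach-O binary.

```cpp
#include <cstdint>

// True when offset is a multiple of alignment.
constexpr bool is_aligned(std::uint64_t offset, std::uint64_t alignment)
{
    return offset % alignment == 0;
}

// Rounds offset up to the next multiple of alignment (alignment must be a power of two).
constexpr std::uint64_t align_up(std::uint64_t offset, std::uint64_t alignment)
{
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Offsets quoted in this comment: all are 8 mod 16, so none is 16-byte aligned.
static_assert(0x38 % 16 == 8);  // __thread_bss offset with ld64.lld-20
static_assert(0x2B8 % 16 == 8); // TLS descriptor offset of the crashing ThreadPool
static_assert(0x28 % 16 == 8);  // __thread_bss offset in the original Zig-linked binary
static_assert(!is_aligned(0x38, 16));

// Padding 0x38 up to a 16-byte boundary gives 0x40.
static_assert(align_up(0x38, 16) == 0x40);
```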

Other options that don't work

  • -fuse-ld=ld64.lld-20 and --ld-path=ld64.lld-20 passed to zig c++: Zig silently ignores both when cross-compiling (verified by passing a nonexistent path — no error).
  • -femulated-tls: Zig doesn't provide __emutls_get_address on macOS targets.

Conclusion

The TLS alignment bug is in the shared LLVM Mach-O linker code, not specific to Zig's fork. Neither Zig's linker nor ld64.lld-20 (LLVM 20) handles TLS template offset alignment correctly when __thread_bss (align 16) follows __thread_data (align 8) and the cumulative size is not a multiple of 16. Only Apple's native ld64 handles this correctly.

The pthread_key workaround in this PR remains the right fix until the upstream LLVM bug is resolved.

@johnathan79717

Contributor Author

But I'll keep trying the remaining options in the gist.

johnathan79717 added a commit that referenced this pull request Mar 11, 2026
## Summary

- Fixes x86_64-macos segfault (`EXC_I386_GPFLT`) when `bb` is
cross-compiled with Zig and linked with a Rust static library (AVM
transpiler)
- Root cause: LLVM's Mach-O linker misaligns `__thread_bss` TLS template
offsets when `__thread_data` (from Rust) is also present, causing
16-byte-aligned `thread_local` objects (like `std::mutex`) to be placed
at 8-byte-aligned addresses
- Fix: a single `alignas(16) thread_local` variable forces
`__thread_data` section alignment to 16, making the linker pad it
correctly

Fixes #21225
Fixes #19769

## Details

Both Zig's built-in Mach-O linker and `ld64.lld-20` share the same LLVM
code for laying out TLS sections. When `__thread_data` (align 8, from
Rust objects) precedes `__thread_bss` (align 16, from C++
`thread_local`), the linker aligns the `__thread_bss` virtual address to
16 but the TLS template offset remains misaligned because
`__thread_data` starts at an 8-aligned VA.

At runtime, `dyld` allocates a 16-aligned TLS block and copies the
template at the recorded offsets. Variables that should be at `block +
0x40` (16-aligned) end up at `block + 0x38` (8-aligned), causing
`MOVAPS` instructions to fault.

The fix adds an `alignas(16)` initialized `thread_local` that forces the
`__thread_data` section alignment to 16, which makes the linker pad the
section end to a 16-byte boundary.
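
A padding variable of the kind described could look roughly like the sketch below. The name `tls_alignment_pad` and the helper function are hypothetical, and the actual declaration in the commit may differ. The non-zero initializer is what is assumed to place the variable in initialized TLS data rather than zero-filled TLS.

```cpp
#include <cstdint>

// Hypothetical name. The non-zero initializer keeps this in initialized TLS data
// (__thread_data on Mach-O) instead of zero-filled __thread_bss, and alignas(16)
// raises the alignment requirement of that section to 16 bytes.
alignas(16) thread_local volatile unsigned char tls_alignment_pad[16] = { 1 };

// Reading the variable's address gives the toolchain a reason to keep it,
// and lets a test confirm the runtime address is 16-byte aligned.
bool tls_pad_is_16_byte_aligned()
{
    auto addr = reinterpret_cast<std::uintptr_t>(&tls_alignment_pad[0]);
    return addr % 16 == 0;
}
```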

Upstream bug: https://codeberg.org/ziglang/zig/issues/31461

## Test plan

- [x] Cross-compiled `bb` binary with Zig for x86_64-macos
- [x] Verified TLS section alignment: `__thread_data` align 2^4 (16),
offset to `__thread_bss` is 0x40 (mod 16 = 0)
- [x] Tested on macOS VM: `bb prove --scheme ultra_honk` runs without
segfault
- [x] Previous binary (without pad) segfaults immediately with
`EXC_I386_GPFLT`

Supersedes #21253
AztecBot pushed a commit that referenced this pull request Mar 13, 2026