fix: use POSIX pthread keys for thread pool to fix x86_64-macos segfault #21253
johnathan79717 wants to merge 3 commits
…acos segfault

When the AVM transpiler (Rust static library) is linked into the zig-cross-compiled bb binary, it introduces a __thread_data Mach-O section that corrupts C++ thread_local variable offsets on x86_64-macos. This caused a #GP (trap 13) segfault during parallel_for when accessing the thread_local ThreadPool.

Fix: use a process-global ThreadPool protected by a mutex. This is also more efficient, as thread_local previously created O(N²) threads (each thread got its own pool of N-1 workers), while a global pool creates exactly N-1 workers total.

Fixes AztecProtocol/barretenberg#1305
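The thread-count claim in this commit message can be made concrete with a small arithmetic sketch. It assumes the worst case, in which every one of N threads calls parallel_for and lazily constructs its own pool of N-1 workers; whether that happens in practice depends on which threads actually call parallel_for. The function names are hypothetical and exist only for illustration.

```cpp
#include <cstddef>

// Worker threads created if each of n threads builds its own pool of n-1 workers.
// Assumes n >= 1; this is the worst case described in the commit message.
constexpr std::size_t workers_with_per_thread_pools(std::size_t n)
{
    return n * (n - 1);
}

// Worker threads created by a single process-global pool (assumes n >= 1).
constexpr std::size_t workers_with_global_pool(std::size_t n)
{
    return n - 1;
}

// Worked example for a 16-thread machine.
static_assert(workers_with_per_thread_pools(16) == 240, "16 pools of 15 workers each");
static_assert(workers_with_global_pool(16) == 15, "one pool of 15 workers");
```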
Aw, but I want this to stay thread-local for upcoming work.

There shouldn't be an efficiency loss.

Yeah, I don't think we're doing anything wrong, but the Zig linker messed it up. I wasn't able to produce a minimal repro case to file a Zig bug. Anyway, this does fix the reported crash. We can merge this first and find a way around this Zig linker bug when we need to add it back in the future. @ludamad What do you think?

If you really want to do this, you need to make aztec_process not rely on this being re-entrant.

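Updated approach: instead of a global pool with mutex (which would serialize concurrent callers), this now uses POSIX `pthread_key_t`-based per-thread storage.

The root cause is a known Zig Mach-O linker bug (ziglang/zig#19221) where `__thread_vars` symbol resolution is corrupted when linking C++ and Rust objects together. I also tried […]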
…linker TLS bug

Replace C++ thread_local with pthread_key_t-based per-thread storage for the ThreadPool in parallel_for_mutex_pool. This avoids Zig's Mach-O linker bug (ziglang/zig#19221) where __thread_vars symbol resolution is corrupted when linking C++ and Rust objects together, causing segfaults on x86_64-macos.

Unlike the previous global-pool-with-mutex approach, pthread_key preserves per-thread pools so concurrent callers (like aztec_process VK generation) are not serialized. The only overhead is one pthread_getspecific call per parallel_for invocation (~10ns), negligible compared to the work done.
076df41 to 6501306
## Testing ld64.lld as Alternative Linker (from the gist)

We tested Option 1/6 (use […]

### What we tested

Created a wrapper script that uses […]

TLS section layout in the resulting binary: […]

The […] This is the same misalignment we see with Zig's linker (original binary had […]

### Other options that don't work

[…]

## Conclusion

The TLS alignment bug is in the shared LLVM Mach-O linker code, not specific to Zig's fork. Neither Zig's linker nor […]

The pthread_key workaround in this PR remains the right fix until the upstream LLVM bug is resolved.

But I'll keep trying the remaining options in the gist.
## Summary

- Fixes x86_64-macos segfault (`EXC_I386_GPFLT`) when `bb` is cross-compiled with Zig and linked with a Rust static library (AVM transpiler)
- Root cause: LLVM's Mach-O linker misaligns `__thread_bss` TLS template offsets when `__thread_data` (from Rust) is also present, causing 16-byte-aligned `thread_local` objects (like `std::mutex`) to be placed at 8-byte-aligned addresses
- Fix: a single `alignas(16) thread_local` variable forces `__thread_data` section alignment to 16, making the linker pad it correctly

Fixes #21225
Fixes #19769

## Details

Both Zig's built-in Mach-O linker and `ld64.lld-20` share the same LLVM code for laying out TLS sections. When `__thread_data` (align 8, from Rust objects) precedes `__thread_bss` (align 16, from C++ `thread_local`), the linker aligns the `__thread_bss` virtual address to 16 but the TLS template offset remains misaligned because `__thread_data` starts at an 8-aligned VA.

At runtime, `dyld` allocates a 16-aligned TLS block and copies the template at the recorded offsets. Variables that should be at `block + 0x40` (16-aligned) end up at `block + 0x38` (8-aligned), causing `MOVAPS` instructions to fault.

The fix adds an `alignas(16)` initialized `thread_local` that forces the `__thread_data` section alignment to 16, which makes the linker pad the section end to a 16-byte boundary.

Upstream bug: https://codeberg.org/ziglang/zig/issues/31461

## Test plan

- [x] Cross-compiled `bb` binary with Zig for x86_64-macos
- [x] Verified TLS section alignment: `__thread_data` align 2^4 (16), offset to `__thread_bss` is 0x40 (mod 16 = 0)
- [x] Tested on macOS VM: `bb prove --scheme ultra_honk` runs without segfault
- [x] Previous binary (without pad) segfaults immediately with `EXC_I386_GPFLT`

Supersedes #21253
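The alignment arithmetic and the padding variable described in the superseding PR can be sketched roughly as below. This is an illustration under assumptions, not the code that was merged: the names `align_up`, `tls_alignment_pad`, and `tls_pad_address` are hypothetical, and the sketch assumes that a non-zero-initialized `thread_local` is emitted into `__thread_data` on Mach-O, as the description states.

```cpp
#include <cstdint>

// Round `offset` up to the next multiple of `alignment` (must be a power of two).
constexpr std::uint64_t align_up(std::uint64_t offset, std::uint64_t alignment)
{
    return (offset + alignment - 1) & ~(alignment - 1);
}

// The offsets quoted in the description: 0x38 is 8-aligned but not 16-aligned,
// and padding it to a 16-byte boundary yields 0x40.
static_assert(0x38 % 16 == 8, "0x38 is only 8-byte aligned");
static_assert(align_up(0x38, 16) == 0x40, "padding to 16 moves the offset to 0x40");

// Hypothetical padding variable. The non-zero initializer is meant to keep it out
// of zero-fill TLS, and alignas(16) raises the alignment of the section holding it.
alignas(16) thread_local std::uint64_t tls_alignment_pad = 1;

// Takes the variable's address so it is referenced at least once.
inline std::uintptr_t tls_pad_address()
{
    return reinterpret_cast<std::uintptr_t>(&tls_alignment_pad);
}
```

Whether an unreferenced padding variable survives dead-stripping depends on the toolchain and flags, which is why the sketch includes an accessor; the section alignment of a real binary would still need to be checked the way the test plan describes.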
## Summary

- Fixes x86_64-macos segfault in `bb prove --scheme chonk` when the AVM transpiler (Rust static library) is linked
- Root cause: a Zig Mach-O linker bug corrupts `thread_local` offsets when a Rust static library with `__thread_vars` sections is linked into the same binary
- Fix: replace `thread_local` with POSIX `pthread_key_t`-based per-thread storage for the ThreadPool. pthread keys use a runtime hashtable mechanism unaffected by the linker bug

## Why pthread_key instead of a global pool with mutex?

- A global pool with a mutex would serialize concurrent `parallel_for` callers, regressing performance for `aztec_process`, which spawns multiple threads that each call `parallel_for`
- pthread_key preserves per-thread pools (as with the original `thread_local` code) with no efficiency loss
- The only overhead is one `pthread_getspecific` call per `parallel_for` invocation (~10ns)

## Verification

- […] `__thread_vars` sections trigger the linker bug)
- […] `SpawnedThreadsCanUseParallelFor` (tests the `aztec_process` concurrent `parallel_for` pattern)
- `ultra_honk_tests` pass (260/260)

## Test plan

Closes AztecProtocol/barretenberg#19769