Skip to content

fix: Node 26 CI failures — wall profiler cleanup-hook UAF + OpenSSL LSan leak#360

Merged
szegedi merged 2 commits into
mainfrom
fix-wall-cleanup-hook-uaf
Jun 26, 2026
Merged

fix: Node 26 CI failures — wall profiler cleanup-hook UAF + OpenSSL LSan leak#360
szegedi merged 2 commits into
mainfrom
fix-wall-cleanup-hook-uaf

Conversation

@szegedi

@szegedi szegedi commented Jun 26, 2026

Copy link
Copy Markdown

Fixes the two Node.js 26 CI failures. They're bundled into one PR because each failure blocks the other's CI from going green, so neither can be merged alone.


1. build / alpine-test-26 — SIGSEGV (use-after-free)

Symptom

alpine-test-26 crashes with SIGSEGV (exit 139) during the Worker Threadsshould not crash when worker is terminated test, which spawns and terminates many workers that each run the wall profiler. Faulting thread:

Thread "WorkerThread" received signal SIGSEGV
#0  node::CleanupHookThunkRun(void*)
#1  node::CleanupQueue::Drain()
#2  node::Environment::RunCleanup()
#3  node::FreeEnvironment(node::Environment*)
#4  node::worker::Worker::Run()

Root cause

WallProfiler registers an environment cleanup hook (CleanupHook) via node::AddEnvironmentCleanupHook so it can stop profiling when a worker isolate is terminated without a beforeExit notification. On termination, CleanupHook → Cleanup → Dispose called node::RemoveEnvironmentCleanupHook for that same hook.

Node's cleanup-hook trampoline runs the hook and then dereferences its internal hook record again to remove the hook itself (disassembly — the fault is at ldr x0, [x19] right after the blr that calls our callback, and the fault address equals x19, the record pointer):

CleanupHookThunkRun(record):       ; x19 = record
  ldp x1, x0, [x0, #16]            ; x1 = fn_, x0 = arg_
  blr x1                           ; run our CleanupHook  → frees the record
=> ldr x0, [x19]                   ; SIGSEGV: record already freed
  ...
  b   RemoveEnvironmentCleanupHook ; Node removes the hook itself

Because our callback removed (and freed) the record mid-execution, Node read freed memory when it returned.

Fix

Add a removeCleanupHook flag to Dispose (default true). Cleanup — which only ever runs from inside CleanupHook — passes false and lets Node remove the hook itself. The still-running call sites (StopCore, StartImpl failure path) keep removing it, since there the hook is live and must not fire later against a destroyed profiler.

Verification

Reproduced and verified in an Alpine/musl Node 26 container under gdb: the worker-termination test crashed within 1–2 runs before the change, and ran 30/30 clean after it. Covered by the existing should not crash when worker is terminated test.


2. asan (26) — LeakSanitizer failure (OpenSSL)

Symptom

==XXXX==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 24 byte(s) in 1 object(s) allocated from:
    #1 CRYPTO_malloc
    #2 ossl_load_builtin_compressions
    ...
    #12 OPENSSL_init_crypto
    #13 node::InitializeOncePerProcessInternal(...)

Root cause & fix

Node 26's OpenSSL populates a global builtin-compressions list during OPENSSL_init_crypto that is never freed at process exit. This is the same class of benign one-time process-init leak already suppressed via leak:CRYPTO_zalloc, but this path allocates via CRYPTO_malloc and so isn't matched. Suppress the specific ossl_load_builtin_compressions frame (rather than broadening to CRYPTO_malloc, OpenSSL's general allocator, which could mask real leaks).


Supersedes #359 (LSan suppression), which is folded in here as a separate commit.

🤖 Generated with Claude Code

WallProfiler registers an environment cleanup hook (CleanupHook) via
node::AddEnvironmentCleanupHook so it can stop profiling when a worker
isolate is terminated without a beforeExit notification. On termination,
CleanupHook -> Cleanup -> Dispose called node::RemoveEnvironmentCleanupHook
for that same hook.

Node's cleanup-hook trampoline (CleanupHookThunkRun) invokes the hook and
then dereferences its internal hook record again to remove the hook itself.
Removing the hook from within the hook frees that record early, so when our
callback returns Node reads freed memory — a use-after-free that segfaults
on worker termination (observed reliably on Node 26, exercised by the
"should not crash when worker is terminated" test).

Add a removeCleanupHook flag to Dispose (default true). Cleanup, which only
runs from inside CleanupHook, passes false and lets Node remove the hook
itself. The still-running call sites (StopCore, StartImpl failure) keep
removing it, since there the hook is live and must not fire later against a
destroyed profiler.

Reproduced and verified fixed in an Alpine/musl Node 26 container under gdb:
crashed within 1-2 runs before, 30/30 clean after.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@datadog-prod-us1-4

This comment has been minimized.

@github-actions

Copy link
Copy Markdown

Overall package size

Self size: 2.19 MB
Deduped: 2.89 MB
No deduping: 2.89 MB

Dependency sizes | name | version | self size | total size | |------|---------|-----------|------------| | pprof-format | 2.2.2 | 500.53 kB | 500.53 kB | | source-map | 0.7.6 | 185.63 kB | 185.63 kB | | node-gyp-build | 4.8.4 | 13.86 kB | 13.86 kB |

🤖 This report was automatically generated by heaviest-objects-in-the-universe

@szegedi szegedi added the semver-patch Bug or security fixes, mainly label Jun 26, 2026
@szegedi szegedi enabled auto-merge (squash) June 26, 2026 13:31
Node 26's OpenSSL populates a global builtin-compressions list during
OPENSSL_init_crypto (ossl_load_builtin_compressions) that is never freed
at process exit, causing the ASAN job to fail with a 24-byte leak report.

This is the same class of benign one-time process-init leak we already
suppress via CRYPTO_zalloc, but this path allocates via CRYPTO_malloc and
so slips through. Suppress the specific ossl_load_builtin_compressions
frame rather than broadening to CRYPTO_malloc, which is OpenSSL's general
allocator and could mask genuine leaks in crypto code paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@szegedi szegedi changed the title fix(wall): don't remove the env cleanup hook from inside itself fix: Node 26 CI failures — wall profiler cleanup-hook UAF + OpenSSL LSan leak Jun 26, 2026
@szegedi szegedi merged commit 944c43f into main Jun 26, 2026
70 checks passed
@szegedi szegedi deleted the fix-wall-cleanup-hook-uaf branch June 26, 2026 13:53
IlyasShabi pushed a commit that referenced this pull request Jun 26, 2026
WallProfiler registers an environment cleanup hook (CleanupHook) via
node::AddEnvironmentCleanupHook so it can stop profiling when a worker
isolate is terminated without a beforeExit notification. On termination,
CleanupHook -> Cleanup -> Dispose called node::RemoveEnvironmentCleanupHook
for that same hook.

Node's cleanup-hook trampoline (CleanupHookThunkRun) invokes the hook and
then dereferences its internal hook record again to remove the hook itself.
Removing the hook from within the hook frees that record early, so when our
callback returns Node reads freed memory — a use-after-free that segfaults
on worker termination (observed reliably on Node 26, exercised by the
"should not crash when worker is terminated" test).

Add a removeCleanupHook flag to Dispose (default true). Cleanup, which only
runs from inside CleanupHook, passes false and lets Node remove the hook
itself. The still-running call sites (StopCore, StartImpl failure) keep
removing it, since there the hook is live and must not fire later against a
destroyed profiler.

Reproduced and verified fixed in an Alpine/musl Node 26 container under gdb:
crashed within 1-2 runs before, 30/30 clean after.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@IlyasShabi IlyasShabi mentioned this pull request Jun 26, 2026
IlyasShabi pushed a commit that referenced this pull request Jun 26, 2026
WallProfiler registers an environment cleanup hook (CleanupHook) via
node::AddEnvironmentCleanupHook so it can stop profiling when a worker
isolate is terminated without a beforeExit notification. On termination,
CleanupHook -> Cleanup -> Dispose called node::RemoveEnvironmentCleanupHook
for that same hook.

Node's cleanup-hook trampoline (CleanupHookThunkRun) invokes the hook and
then dereferences its internal hook record again to remove the hook itself.
Removing the hook from within the hook frees that record early, so when our
callback returns Node reads freed memory — a use-after-free that segfaults
on worker termination (observed reliably on Node 26, exercised by the
"should not crash when worker is terminated" test).

Add a removeCleanupHook flag to Dispose (default true). Cleanup, which only
runs from inside CleanupHook, passes false and lets Node remove the hook
itself. The still-running call sites (StopCore, StartImpl failure) keep
removing it, since there the hook is live and must not fire later against a
destroyed profiler.

Reproduced and verified fixed in an Alpine/musl Node 26 container under gdb:
crashed within 1-2 runs before, 30/30 clean after.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
szegedi added a commit that referenced this pull request Jun 26, 2026
WallProfiler registers an environment cleanup hook (CleanupHook) via
node::AddEnvironmentCleanupHook so it can stop profiling when a worker
isolate is terminated without a beforeExit notification. On termination,
CleanupHook -> Cleanup -> Dispose called node::RemoveEnvironmentCleanupHook
for that same hook.

Node's cleanup-hook trampoline (CleanupHookThunkRun) invokes the hook and
then dereferences its internal hook record again to remove the hook itself.
Removing the hook from within the hook frees that record early, so when our
callback returns Node reads freed memory — a use-after-free that segfaults
on worker termination (observed reliably on Node 26, exercised by the
"should not crash when worker is terminated" test).

Add a removeCleanupHook flag to Dispose (default true). Cleanup, which only
runs from inside CleanupHook, passes false and lets Node remove the hook
itself. The still-running call sites (StopCore, StartImpl failure) keep
removing it, since there the hook is live and must not fire later against a
destroyed profiler.

Reproduced and verified fixed in an Alpine/musl Node 26 container under gdb:
crashed within 1-2 runs before, 30/30 clean after.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

semver-patch Bug or security fixes, mainly

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants