fix(android): shutdown IPC socket on JS reload so backend cleans up subscriptions#167
Conversation
…ubscriptions On an RN JS-thread reload the main process stays alive, so the NodeJSIPC LocalSocket is only closed if ComapeoCoreModule.OnDestroy closes it. OnDestroy called disconnect(), which routes socket.close() through connectJob.cancelAndJoin() — but the receive coroutine is parked in a blocking readFully that only unblocks once the socket closes, so the join deadlocks and the fd never closes. The backend then never observes EOF, so its per-connection rpc-reflector cleanup never runs and every prior session's subscriptions stay attached to the singleton MapeoManager, emitting events into dead peers. Verified on device: the connection and manager-listener counts climb 1→2→3→4→5, one per reload. Add NodeJSIPC.close(): a synchronous terminal teardown that shutdown(2)s the socket (shutdownInput wakes the blocked read, shutdownOutput sends FIN to the peer) before close(2), then cancels the scope. OnDestroy now calls close() instead of disconnect(). shutdown-before-close is the same order iOS's disconnect() already uses, so iOS needs no change (comment only). Plain close() without shutdown does NOT fix the leak — the blocked readFully keeps the connection alive (verified: still climbs 1→5). Verified on Pixel_7a_API_34 and the iOS simulator: reloads now show a clean socket close + reopen with connection and listener counts bounded at 1. closeReleasesSocketSynchronously instrumented test added; full NodeJSIPCTest suite passes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| * Not reusable after close; construct a new instance. | ||
| */ | ||
| fun close() { | ||
| scope.cancel() |
There was a problem hiding this comment.
xhigh review — close() non-joining teardown (3 verified findings)
scope.cancel() here does not join the connect/send/receive coroutines, opening three teardown races:
- Connect-race FD leak (CONFIRMED). If a reload lands while the socket is still connecting (
Connecting),close()returns having torn down nothing (::socket.isInitializedis still false), then the surviving connect coroutine finishes and assigns the now-openLocalSocketto the field. That fd leaks until process death — the exact reload leak this change exists to prevent, in the connect window. - In-flight send vs. stream close (PLAUSIBLE). If the send coroutine is mid-
out.writeinsendMessageInternalwhendataOutputStream?.close()runs, both touch the same stream unsynchronized. Caught + logged, so harmless-looking, but it's a use+close race the olddisconnect()avoided bycancelAndJoin-ing first. - Dropped buffered sends (PLAUSIBLE). Cancelling the scope before
sendChannel.close()kills the send job immediately, so apostMessagequeued just beforeOnDestroyis discarded instead of drained.disconnect()closed the channel first precisely so the send loop could drain.
All three resolve if close() wakes the read (shutdown) and then joins the connect coroutine before closing streams — e.g. a bounded runBlocking { withTimeoutOrNull(..) { connectJob?.join() } }. That's a design change (the method is deliberately synchronous), so flagging rather than auto-applying. The connect-race leak is worth confirming on-device.
| // stays alive. Close synchronously: the backend must see EOF on the | ||
| // old socket before the next OnCreate connects, otherwise the FD | ||
| // lingers and rpc-reflector listeners leak onto MapeoManager. | ||
| ipc.close() |
There was a problem hiding this comment.
xhigh review — lifecycle implications of disconnect() → close() (2 findings)
- Use-after-close (PLAUSIBLE).
close()cancels the scope permanently, so the instance is dead afterward. IfOnActivityEntersForegroundfiresipc.connect()on this same instance afterOnDestroybut before the nextOnCreateswaps in a freshNodeJSIPC,connect()passes itsDisconnectedguard butscope.launchon the cancelled scope never runs — state staysDisconnected, and subsequentpostMessagestrySendinto a drained channel and vanish with no error surfaced to JS. - Suppressed STOPPED transition (PLAUSIBLE). The old
disconnect()path delivered aDisconnecting/Disconnected(→ STOPPING/STOPPED) transition to the observer;close()cancels the collector, so on a final teardown not followed byOnCreate, JS consumers never observe STOPPED.
For the steady-state reload (OnDestroy→OnCreate) both are benign — listeners are recreated. They only bite the foreground-before-recreate and final-teardown orderings.
| // connect-cancel race. | ||
| Thread.sleep(200) | ||
|
|
||
| ipc.close() |
There was a problem hiding this comment.
xhigh review — test asserts peer EOF but not client teardown (PLAUSIBLE)
closeReleasesSocketSynchronously checks the server read unblocked, but never asserts the client receive coroutine actually terminated. Because close() cancels the scope without joining, the receive coroutine can still be unwinding (woken into its IOException→disconnect() path) when tearDown() closes serverSocket/boundSocket and deletes the socket file. A regression where close() returns before the client coroutine unwinds still passes here; the leaked coroutine racing teardown would surface only as a flaky failure in a later test. Consider asserting client-side teardown (a post-close settle + state assertion) as the sibling tests do.
|
Finding #1 ( Follow-up to the inline review and on-device verification. Verifying on device (killing the So this is hardening, not a live bug. It only bites if a future caller tears down a healthy connection via |
…down Generalize the shutdown-before-close fix into disconnect(): shutdown the socket to wake the receive loop parked in a blocking readFully before cancelAndJoin, so the join can't deadlock on a live but idle node backend. The deadlock is unreachable on today's callers (they run after the backend has exited), so this is hardening that keeps disconnect() correct in isolation and symmetric with close(). In close(), set the terminal Disconnected state right after scope.cancel() so the woken receive loop's disconnect() short-circuits on its state guard instead of relaunching teardown on the cancelled scope. Extract shared shutdownSocket()/closeStreamsAndSocket() helpers used by both paths so the two teardowns can't drift.
Summary
On a React Native JS-thread reload (Metro reload,
DevSettings.reload(), fast-refresh full reload) the main app process stays alive while the JS runtime is torn down and rebuilt. The Android backend runs in a separate process (:ComapeoCore), so theNodeJSIPCLocalSocketlives in the still-alive main process and is only closed ifComapeoCoreModule.OnDestroyexplicitly closes it.OnDestroycalleddisconnect(), which closes the socket only afterconnectJob.cancelAndJoin(). The receive coroutine is parked in a blockingdataInputStream.readFully()that only unblocks once the socket closes — so the join deadlocks and the fd is never closed. The backend never observes EOF, its per-connection rpc-reflector cleanup (wired incomapeo-rpc.js) never runs, and every prior session's event subscriptions stay attached to the singletonMapeoManager, with the backend emitting events into dead peers. The leak grows by one connection (and its listeners) per reload.This was originally raised in #51, but that PR's Android
close()did not actually fix the leak (see below).The fix
Add
NodeJSIPC.close()— a synchronous terminal teardown thatshutdown(2)s the socket beforeclose(2):shutdownInput()wakes the blockedreadFullyso the receive loop exits.shutdownOutput()sendsFINso the backend peer observes EOF immediately.close()+scope.cancel().OnDestroynow callsclose()instead of the deadlock-pronedisconnect(). This is the sameshutdown-before-closeordering iOS'sdisconnect()already uses (ios/NodeJSIPC.swift), which is why iOS needs no behavioural change — only a comment documenting it.disconnect()is unchanged and still used by the FGS side and IOException recovery.Why plain
close()is not enough (and why #51 didn't work)close()removes the fd from the table, but a thread blocked inreadFullystill holds the open socket description, so the kernel keeps the connection (and the peer) alive until that read returns. Onlyshutdown(2)forces the blocked read to return and signals the peer. Porting #51'sclose()verbatim (noshutdown) still leaked in testing.Verification (on-device)
Instrumented the backend with a per-connection counter plus a live
MapeoManagerlistener count, driving real reloads on aPixel_7a_API_34emulator and an iPhone 17 Pro simulator. Confirmed every reload actually fired via theCLOSE/OPENevents.activeconns /mgrListenersacross 4–5 reloadsmaintoday (OnDestroy→disconnect())1 → 2 → 3 → 4 → 5, zero cleanup — leakclose()(noshutdown)1 → 2 → 3 → 4 → 5— still leaksshutdown+close)CLOSE active=0→OPEN active=1per reload; bounded at 1disconnect())CLOSE/OPENper reload; bounded at 1Killing only the main process (leaving the FGS alive) makes the kernel close the leaked fds and the backend cleans up all of them at once — confirming the leaked connections are real and that the backend cleanup path itself works; it just never got a chance to run on reload.
Test plan
closeReleasesSocketSynchronouslyinstrumented test (pins the synchronous-close contract; fails withoutshutdown)NodeJSIPCTestsuite passes on device (10 tests)MapeoManagerlistener count stays bounded (Android + iOS)Notes
SocketMessagePortclose-event + rpc-reflector cleanup wiring this depends on already landed onmainvia Map server integration #86 (the load-bearing backend piece of Tear down RPC subscriptions on RN context reload #51 is therefore already present).🤖 Generated with Claude Code