fix(android): shutdown IPC socket on JS reload so backend cleans up subscriptions #167

Merged
gmaclennan merged 3 commits into main from fix/android-ipc-sync-close-on-reload on Jun 25, 2026

@gmaclennan

Summary

On a React Native JS-thread reload (Metro reload, DevSettings.reload(), fast-refresh full reload) the main app process stays alive while the JS runtime is torn down and rebuilt. The Android backend runs in a separate process (:ComapeoCore), so the NodeJSIPC LocalSocket lives in the still-alive main process and is only closed if ComapeoCoreModule.OnDestroy explicitly closes it.

OnDestroy called disconnect(), which closes the socket only after connectJob.cancelAndJoin(). The receive coroutine is parked in a blocking dataInputStream.readFully() that only unblocks once the socket closes — so the join deadlocks and the fd is never closed. The backend never observes EOF, its per-connection rpc-reflector cleanup (wired in comapeo-rpc.js) never runs, and every prior session's event subscriptions stay attached to the singleton MapeoManager, with the backend emitting events into dead peers. The leak grows by one connection (and its listeners) per reload.

This was originally raised in #51, but that PR's Android close() did not actually fix the leak (see below).

The fix

Add NodeJSIPC.close() — a synchronous terminal teardown that shutdown(2)s the socket before close(2):

  • shutdownInput() wakes the blocked readFully so the receive loop exits.
  • shutdownOutput() sends FIN so the backend peer observes EOF immediately.
  • then close() + scope.cancel().

OnDestroy now calls close() instead of the deadlock-prone disconnect(). This is the same shutdown-before-close ordering iOS's disconnect() already uses (ios/NodeJSIPC.swift), which is why iOS needs no behavioural change — only a comment documenting it. disconnect() is unchanged and still used by the FGS side and IOException recovery.
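
The shutdown-before-close ordering described above can be sketched on a plain JVM. This is a minimal illustration, not the PR's NodeJSIPC.close(): it swaps Android's LocalSocket for a java.net.Socket over loopback TCP so it runs without an Android runtime, and the names TeardownResult and shutdownThenClose are hypothetical. It shows only the two effects the list names (the parked readFully wakes, the peer reads EOF). It does not reproduce the plain-close() leak, which the PR reports for LocalSocket on Android.

```kotlin
import java.io.DataInputStream
import java.io.IOException
import java.net.InetAddress
import java.net.ServerSocket
import java.net.Socket
import java.util.concurrent.CountDownLatch
import java.util.concurrent.TimeUnit
import kotlin.concurrent.thread

// Hypothetical result type: what each side observed during teardown.
data class TeardownResult(val readerWoke: Boolean, val peerSawEof: Boolean)

// Assumption: java.net.Socket stands in for android.net.LocalSocket.
fun shutdownThenClose(): TeardownResult {
    val loopback = InetAddress.getLoopbackAddress()
    return ServerSocket(0, 1, loopback).use { server ->
        val client = Socket(loopback, server.localPort)
        val peer = server.accept()
        peer.soTimeout = 2_000 // fail with a timeout instead of hanging

        // Stand-in for the receive loop: parks in a blocking readFully.
        val readerExited = CountDownLatch(1)
        thread(isDaemon = true) {
            try {
                DataInputStream(client.getInputStream()).readFully(ByteArray(4))
            } catch (e: IOException) {
                // Expected once the read is woken (EOFException on shutdown).
            } finally {
                readerExited.countDown()
            }
        }
        Thread.sleep(100) // crude: give the reader time to block

        client.shutdownInput()  // wake the blocked read
        client.shutdownOutput() // send FIN so the peer reads EOF
        val readerWoke = readerExited.await(2, TimeUnit.SECONDS)
        val peerSawEof = peer.getInputStream().read() == -1

        client.close() // only now release the descriptor
        peer.close()
        TeardownResult(readerWoke, peerSawEof)
    }
}

fun main() {
    println(shutdownThenClose())
}
```

Waiting on the latch with a timeout keeps the sketch from hanging on a platform where shutdownInput() does not wake the reader; in that case readerWoke is simply false.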

Why plain close() is not enough (and why #51 didn't work)

close() removes the fd from the table, but a thread blocked in readFully still holds the open file description for the socket, so the kernel keeps the connection (and the peer) alive until that read returns. Only shutdown(2) forces the blocked read to return and signals the peer. Porting #51's close() verbatim (no shutdown) still leaked in testing.

Verification (on-device)

Instrumented the backend with a per-connection counter plus a live MapeoManager listener count, driving real reloads on a Pixel_7a_API_34 emulator and an iPhone 17 Pro simulator. Confirmed every reload actually fired via the CLOSE/OPEN events.

| Scenario | Active conns / mgrListeners across 4–5 reloads |
|---|---|
| main today (OnDestroy → disconnect()) | climbs 1 → 2 → 3 → 4 → 5, zero cleanup: leak |
| #51's close() (no shutdown) | climbs 1 → 2 → 3 → 4 → 5: still leaks |
| This PR (shutdown + close) | clean CLOSE active=0 → OPEN active=1 per reload; bounded at 1 |
| iOS (unchanged disconnect()) | clean CLOSE/OPEN per reload; bounded at 1 |

Killing only the main process (leaving the FGS alive) makes the kernel close the leaked fds and the backend cleans up all of them at once — confirming the leaked connections are real and that the backend cleanup path itself works; it just never got a chance to run on reload.

Test plan

  • closeReleasesSocketSynchronously instrumented test (pins the synchronous-close contract; fails without shutdown)
  • Full NodeJSIPCTest suite passes on device (10 tests)
  • Manual: subscribe to a backend event, reload repeatedly, confirm the MapeoManager listener count stays bounded (Android + iOS)

Notes

🤖 Generated with Claude Code

…ubscriptions

On an RN JS-thread reload the main process stays alive, so the NodeJSIPC
LocalSocket is only closed if ComapeoCoreModule.OnDestroy closes it.
OnDestroy called disconnect(), which routes socket.close() through
connectJob.cancelAndJoin() — but the receive coroutine is parked in a
blocking readFully that only unblocks once the socket closes, so the join
deadlocks and the fd never closes. The backend then never observes EOF,
so its per-connection rpc-reflector cleanup never runs and every prior
session's subscriptions stay attached to the singleton MapeoManager,
emitting events into dead peers. Verified on device: the connection and
manager-listener counts climb 1→2→3→4→5, one per reload.

Add NodeJSIPC.close(): a synchronous terminal teardown that shutdown(2)s
the socket (shutdownInput wakes the blocked read, shutdownOutput sends FIN
to the peer) before close(2), then cancels the scope. OnDestroy now calls
close() instead of disconnect(). shutdown-before-close is the same order
iOS's disconnect() already uses, so iOS needs no change (comment only).
Plain close() without shutdown does NOT fix the leak — the blocked
readFully keeps the connection alive (verified: still climbs 1→5).

Verified on Pixel_7a_API_34 and the iOS simulator: reloads now show a
clean socket close + reopen with connection and listener counts bounded
at 1. closeReleasesSocketSynchronously instrumented test added; full
NodeJSIPCTest suite passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the fix Bug fix (changelog) label Jun 25, 2026
     * Not reusable after close; construct a new instance.
     */
    fun close() {
        scope.cancel()

xhigh review — close() non-joining teardown (3 verified findings)

scope.cancel() here does not join the connect/send/receive coroutines, opening three teardown races:

  • Connect-race FD leak (CONFIRMED). If a reload lands while the socket is still connecting (Connecting), close() returns having torn down nothing (::socket.isInitialized is still false), then the surviving connect coroutine finishes and assigns the now-open LocalSocket to the field. That fd leaks until process death — the exact reload leak this change exists to prevent, in the connect window.
  • In-flight send vs. stream close (PLAUSIBLE). If the send coroutine is mid-out.write in sendMessageInternal when dataOutputStream?.close() runs, both touch the same stream unsynchronized. Caught + logged, so harmless-looking, but it's a use+close race the old disconnect() avoided by cancelAndJoin-ing first.
  • Dropped buffered sends (PLAUSIBLE). Cancelling the scope before sendChannel.close() kills the send job immediately, so a postMessage queued just before OnDestroy is discarded instead of drained. disconnect() closed the channel first precisely so the send loop could drain.

All three resolve if close() wakes the read (shutdown) and then joins the connect coroutine before closing streams — e.g. a bounded runBlocking { withTimeoutOrNull(..) { connectJob?.join() } }. That's a design change (the method is deliberately synchronous), so flagging rather than auto-applying. The connect-race leak is worth confirming on-device.
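
One way to picture the connect-race finding and a bounded join is the thread-based sketch below. It is an analogue only: the review suggests a coroutine join under withTimeoutOrNull, while this uses Thread.join(timeout) plus a terminal flag so it runs without kotlinx.coroutines. FakeSocket, Client, slowConnect, and closeDuringConnect are hypothetical names, and nothing here is the code the PR merged.

```kotlin
import java.io.Closeable
import java.util.concurrent.CountDownLatch
import java.util.concurrent.atomic.AtomicBoolean
import kotlin.concurrent.thread

// Hypothetical stand-in for a socket; records whether it was closed.
class FakeSocket : Closeable {
    val closed = AtomicBoolean(false)
    override fun close() { closed.set(true) }
}

// Hypothetical client; slowConnect models a connect that is still in flight.
class Client(private val slowConnect: () -> FakeSocket) {
    private val lock = Any()
    private var socket: FakeSocket? = null
    private var terminated = false
    private var connectThread: Thread? = null

    fun connect(): Thread {
        val t = thread {
            val s = slowConnect()
            synchronized(lock) {
                // close() may have run while connecting: release the socket
                // here instead of publishing it to the field.
                if (terminated) s.close() else socket = s
            }
        }
        connectThread = t
        return t
    }

    fun close(joinTimeoutMs: Long = 1_000) {
        synchronized(lock) {
            terminated = true
            socket?.close()
            socket = null
        }
        // Bounded wait so a wedged connect cannot hang the caller.
        connectThread?.join(joinTimeoutMs)
    }
}

// Drives the race: close() lands while connect is in flight.
fun closeDuringConnect(): Boolean {
    val connectStarted = CountDownLatch(1)
    val release = CountDownLatch(1)
    val late = FakeSocket()
    val client = Client {
        connectStarted.countDown()
        release.await() // hold the connect open until close() has run
        late
    }
    val connecting = client.connect()
    connectStarted.await()
    client.close(joinTimeoutMs = 50) // join times out; connect still pending
    release.countDown()              // connect completes after close()
    connecting.join(2_000)
    return late.closed.get()         // true if the late socket was released
}

fun main() {
    println(closeDuringConnect())
}
```

The terminal flag is what stops the late socket from being assigned after teardown; the bounded join only limits how long close() waits. Whether to pay that wait in a deliberately synchronous close() is the design question the review leaves open.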

// stays alive. Close synchronously: the backend must see EOF on the
// old socket before the next OnCreate connects, otherwise the FD
// lingers and rpc-reflector listeners leak onto MapeoManager.
ipc.close()

xhigh review: lifecycle implications of disconnect() → close() (2 findings)

  • Use-after-close (PLAUSIBLE). close() cancels the scope permanently, so the instance is dead afterward. If OnActivityEntersForeground fires ipc.connect() on this same instance after OnDestroy but before the next OnCreate swaps in a fresh NodeJSIPC, connect() passes its Disconnected guard but scope.launch on the cancelled scope never runs — state stays Disconnected, and subsequent postMessages trySend into a drained channel and vanish with no error surfaced to JS.
  • Suppressed STOPPED transition (PLAUSIBLE). The old disconnect() path delivered a Disconnecting/Disconnected (→ STOPPING/STOPPED) transition to the observer; close() cancels the collector, so on a final teardown not followed by OnCreate, JS consumers never observe STOPPED.

For the steady-state reload (OnDestroy→OnCreate) both are benign — listeners are recreated. They only bite the foreground-before-recreate and final-teardown orderings.

// connect-cancel race.
Thread.sleep(200)

ipc.close()

xhigh review — test asserts peer EOF but not client teardown (PLAUSIBLE)

closeReleasesSocketSynchronously checks the server read unblocked, but never asserts the client receive coroutine actually terminated. Because close() cancels the scope without joining, the receive coroutine can still be unwinding (woken into its IOException → disconnect() path) when tearDown() closes serverSocket/boundSocket and deletes the socket file. A regression where close() returns before the client coroutine unwinds still passes here; the leaked coroutine racing teardown would surface only as a flaky failure in a later test. Consider asserting client-side teardown (a post-close settle + state assertion) as the sibling tests do.

@gmaclennan

gmaclennan commented Jun 25, 2026

Finding #1 (disconnect() deadlock) — verified low severity, hardening only

Follow-up to the inline review and on-device verification. disconnect() (unchanged by this PR) still does connectJob?.cancelAndJoin() of a receive loop parked in a blocking readFully, closing the socket only after the join — structurally the same deadlock the new close() avoids by shutting the socket down first. I'd initially flagged this as the most severe finding.

Verifying on device (killing the :ComapeoCore FGS process across several foreground/background cycles) shows the deadlock is not reachable in the current lifecycle. The deadlock needs the node backend (the server end of the NodeJSIPC client socket) to be connected but not sending, so the receive loop sits in readFully. Every disconnect() caller instead runs after node has already exited and closed the socket, so readFully returns on EOF and cancelAndJoin completes cleanly (observed "Receive job completed … cancelled", no hang; FDs released, no leak). Concretely — NodeJSService.stop() calls disconnect() after nodeJob.join() (node has already exited), fire-and-forget, under withTimeout(10s) + Process.killProcess; and the main-process ipc only reaches disconnect() via the receive/send loops' IOException path, where node is already gone.

So this is hardening, not a live bug. It only bites if a future caller tears down a healthy connection via disconnect() — node still running and idle, receive loop parked — e.g. a "pause sync" or "switch project" teardown that stops the IPC without killing the FGS. Today only close() (JS reload) tears down while node is still running, and it already shuts down first. I've staged a fix in the working tree (not yet pushed) that lifts the same shutdown-before-join into disconnect(), so both teardown paths wake the parked read identically rather than relying on "node has already exited" or killProcess as the backstop.

@gmaclennan gmaclennan added this pull request to the merge queue Jun 25, 2026
@gmaclennan gmaclennan removed this pull request from the merge queue due to a manual request Jun 25, 2026
…down

Generalize the shutdown-before-close fix into disconnect(): shutdown the
socket to wake the receive loop parked in a blocking readFully before
cancelAndJoin, so the join can't deadlock on a live but idle node backend.
The deadlock is unreachable on today's callers (they run after the backend
has exited), so this is hardening that keeps disconnect() correct in
isolation and symmetric with close().

In close(), set the terminal Disconnected state right after scope.cancel()
so the woken receive loop's disconnect() short-circuits on its state guard
instead of relaunching teardown on the cancelled scope.

Extract shared shutdownSocket()/closeStreamsAndSocket() helpers used by both
paths so the two teardowns can't drift.
@gmaclennan gmaclennan enabled auto-merge June 25, 2026 15:43
@gmaclennan gmaclennan added this pull request to the merge queue Jun 25, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 25, 2026
@gmaclennan gmaclennan added this pull request to the merge queue Jun 25, 2026
Merged via the queue into main with commit 1b63602 Jun 25, 2026
24 checks passed
@gmaclennan gmaclennan deleted the fix/android-ipc-sync-close-on-reload branch June 25, 2026 17:10