fix(android): shutdown IPC socket on JS reload so backend cleans up subscriptions by gmaclennan · Pull Request #167 · digidem/comapeo-core-react-native

gmaclennan · 2026-06-25T13:41:28Z

Summary

On a React Native JS-thread reload (Metro reload, DevSettings.reload(), fast-refresh full reload) the main app process stays alive while the JS runtime is torn down and rebuilt. The Android backend runs in a separate process (:ComapeoCore), so the NodeJSIPC LocalSocket lives in the still-alive main process and is only closed if ComapeoCoreModule.OnDestroy explicitly closes it.

OnDestroy called disconnect(), which closes the socket only after connectJob.cancelAndJoin(). The receive coroutine is parked in a blocking dataInputStream.readFully() that only unblocks once the socket closes — so the join deadlocks and the fd is never closed. The backend never observes EOF, its per-connection rpc-reflector cleanup (wired in comapeo-rpc.js) never runs, and every prior session's event subscriptions stay attached to the singleton MapeoManager, with the backend emitting events into dead peers. The leak grows by one connection (and its listeners) per reload.
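The shape of that hang can be reproduced off-device. The sketch below is an illustration only: it assumes a loopback `java.net.Socket` as a stand-in for the Android `LocalSocket`, a plain thread in place of the receive coroutine, and a bounded `join` in place of `cancelAndJoin()`. The function name `readerOutlivesJoin` is hypothetical and does not appear in the PR.

```kotlin
import java.io.DataInputStream
import java.io.IOException
import java.net.InetAddress
import java.net.ServerSocket
import java.net.Socket
import kotlin.concurrent.thread

/**
 * Returns true if a reader parked in readFully() is still alive after a
 * bounded join, i.e. a "join first, close afterwards" teardown would wait
 * on a thread that can only exit once the socket is closed or shut down.
 */
fun readerOutlivesJoin(joinMillis: Long = 300): Boolean =
    ServerSocket(0, 1, InetAddress.getLoopbackAddress()).use { server ->
        Socket(server.inetAddress, server.localPort).use { client ->
            // The accepted peer stays connected but never writes anything.
            server.accept().use { _ ->
                val reader = thread(isDaemon = true) {
                    try {
                        DataInputStream(client.getInputStream()).readFully(ByteArray(4))
                    } catch (e: IOException) {
                        // Reached only after the socket is closed or shut down.
                    }
                }
                reader.join(joinMillis)   // stand-in for connectJob.cancelAndJoin()
                reader.isAlive            // still parked: the join could not complete
            }
        }
    }
```

Calling `readerOutlivesJoin()` is expected to return `true`: nothing has closed or shut down the socket, so the reader has no reason to return. An unbounded join at that point would wait forever, which is the ordering problem described above.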

This was originally raised in #51, but that PR's Android close() did not actually fix the leak (see below).

The fix

Add NodeJSIPC.close() — a synchronous terminal teardown that shutdown(2)s the socket before close(2):

1. shutdownInput() wakes the blocked readFully so the receive loop exits.
2. shutdownOutput() sends FIN so the backend peer observes EOF immediately.
3. Then close() + scope.cancel().
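The ordering above can be sketched with JDK sockets. This is not the PR's `NodeJSIPC.close()`: it assumes a loopback `java.net.Socket` in place of `android.net.LocalSocket`, and `TeardownResult` and `shutdownThenClose` are hypothetical names. JDK sockets may react to a plain `close()` differently from `LocalSocket`, so the sketch only illustrates the shutdown-before-close sequence and what each side observes, not the failure of a plain close.

```kotlin
import java.io.DataInputStream
import java.io.IOException
import java.net.InetAddress
import java.net.ServerSocket
import java.net.Socket
import java.util.concurrent.CountDownLatch
import java.util.concurrent.TimeUnit
import kotlin.concurrent.thread

/** What the sketch observed: did the local reader exit, and what did the peer's read() return. */
data class TeardownResult(val readerExited: Boolean, val peerRead: Int)

fun shutdownThenClose(): TeardownResult =
    ServerSocket(0, 1, InetAddress.getLoopbackAddress()).use { server ->
        Socket(server.inetAddress, server.localPort).use { client ->
            server.accept().use { peer ->
                val readerDone = CountDownLatch(1)
                thread(isDaemon = true) {
                    try {
                        DataInputStream(client.getInputStream()).readFully(ByteArray(4))
                    } catch (e: IOException) {
                        // EOFException once shutdownInput() ends the stream.
                    } finally {
                        readerDone.countDown()
                    }
                }
                Thread.sleep(100)        // give the reader time to park in readFully()

                client.shutdownInput()   // 1. wake the local blocked read
                client.shutdownOutput()  // 2. send FIN so the peer sees EOF
                val readerExited = readerDone.await(2, TimeUnit.SECONDS)
                val peerRead = peer.getInputStream().read()   // -1 means EOF
                client.close()           // 3. only now release the descriptor

                TeardownResult(readerExited, peerRead)
            }
        }
    }
```

In this sketch the reader is expected to exit promptly and the peer's `read()` to return `-1`, both before `close()` is called, which mirrors the two effects the list attributes to `shutdownInput()` and `shutdownOutput()`.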

OnDestroy now calls close() instead of the deadlock-prone disconnect(). This is the same shutdown-before-close ordering iOS's disconnect() already uses (ios/NodeJSIPC.swift), which is why iOS needs no behavioural change — only a comment documenting it. disconnect() is unchanged and still used by the FGS side and IOException recovery.

Why plain `close()` is not enough (and why #51 didn't work)

close() removes the fd from the table, but a thread blocked in readFully still holds the open socket description, so the kernel keeps the connection (and the peer) alive until that read returns. Only shutdown(2) forces the blocked read to return and signals the peer. Porting #51's close() verbatim (no shutdown) still leaked in testing.

Verification (on-device)

Instrumented the backend with a per-connection counter plus a live MapeoManager listener count, driving real reloads on a Pixel_7a_API_34 emulator and an iPhone 17 Pro simulator. Confirmed every reload actually fired via the CLOSE/OPEN events.

| Scenario | `active` conns / `mgrListeners` across 4–5 reloads |
| --- | --- |
| `main` today (`OnDestroy` → `disconnect()`) | climbs `1 → 2 → 3 → 4 → 5`, zero cleanup: leak |
| #51's `close()` (no `shutdown`) | climbs `1 → 2 → 3 → 4 → 5`: still leaks |
| This PR (`shutdown` + `close`) | clean `CLOSE active=0` → `OPEN active=1` per reload; bounded at 1 |
| iOS (unchanged `disconnect()`) | clean `CLOSE`/`OPEN` per reload; bounded at 1 |

Killing only the main process (leaving the FGS alive) makes the kernel close the leaked fds and the backend cleans up all of them at once — confirming the leaked connections are real and that the backend cleanup path itself works; it just never got a chance to run on reload.

Test plan

closeReleasesSocketSynchronously instrumented test (pins the synchronous-close contract; fails without shutdown)
Full NodeJSIPCTest suite passes on device (10 tests)
Manual: subscribe to a backend event, reload repeatedly, confirm the MapeoManager listener count stays bounded (Android + iOS)

Notes

No backend changes — the SocketMessagePort close-event + rpc-reflector cleanup wiring this depends on already landed on main via Map server integration #86 (the load-bearing backend piece of Tear down RPC subscriptions on RN context reload #51 is therefore already present).
Supersedes Tear down RPC subscriptions on RN context reload #51.

🤖 Generated with Claude Code

…ubscriptions

On an RN JS-thread reload the main process stays alive, so the NodeJSIPC LocalSocket is only closed if ComapeoCoreModule.OnDestroy closes it. OnDestroy called disconnect(), which routes socket.close() through connectJob.cancelAndJoin(), but the receive coroutine is parked in a blocking readFully that only unblocks once the socket closes, so the join deadlocks and the fd never closes. The backend then never observes EOF, so its per-connection rpc-reflector cleanup never runs and every prior session's subscriptions stay attached to the singleton MapeoManager, emitting events into dead peers. Verified on device: the connection and manager-listener counts climb 1→2→3→4→5, one per reload.

Add NodeJSIPC.close(): a synchronous terminal teardown that shutdown(2)s the socket (shutdownInput wakes the blocked read, shutdownOutput sends FIN to the peer) before close(2), then cancels the scope. OnDestroy now calls close() instead of disconnect(). shutdown-before-close is the same order iOS's disconnect() already uses, so iOS needs no change (comment only).

Plain close() without shutdown does NOT fix the leak: the blocked readFully keeps the connection alive (verified: still climbs 1→5).

Verified on Pixel_7a_API_34 and the iOS simulator: reloads now show a clean socket close + reopen with connection and listener counts bounded at 1. closeReleasesSocketSynchronously instrumented test added; full NodeJSIPCTest suite passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gmaclennan · 2026-06-25T14:12:20Z

+     * Not reusable after close; construct a new instance.
+     */
+    fun close() {
+        scope.cancel()


xhigh review — close() non-joining teardown (3 verified findings)

scope.cancel() here does not join the connect/send/receive coroutines, opening three teardown races:

Connect-race FD leak (CONFIRMED). If a reload lands while the socket is still connecting (Connecting), close() returns having torn down nothing (::socket.isInitialized is still false), then the surviving connect coroutine finishes and assigns the now-open LocalSocket to the field. That fd leaks until process death — the exact reload leak this change exists to prevent, in the connect window.

In-flight send vs. stream close (PLAUSIBLE). If the send coroutine is mid-out.write in sendMessageInternal when dataOutputStream?.close() runs, both touch the same stream unsynchronized. Caught + logged, so harmless-looking, but it's a use+close race the old disconnect() avoided by cancelAndJoin-ing first.

Dropped buffered sends (PLAUSIBLE). Cancelling the scope before sendChannel.close() kills the send job immediately, so a postMessage queued just before OnDestroy is discarded instead of drained. disconnect() closed the channel first precisely so the send loop could drain.

All three resolve if close() wakes the read (shutdown) and then joins the connect coroutine before closing streams — e.g. a bounded runBlocking { withTimeoutOrNull(..) { connectJob?.join() } }. That's a design change (the method is deliberately synchronous), so flagging rather than auto-applying. The connect-race leak is worth confirming on-device.
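A bounded join is the option the review proposes. For the connect-race finding specifically, one alternative shape is a closed flag checked under the same lock that guards the socket field, so a connect that finishes after teardown releases its socket instead of publishing it. The sketch below is an assumption-laden illustration, not the PR's code: `GuardedSocketHolder`, `publish`, and the lock layout are hypothetical, and `java.net.Socket` stands in for `LocalSocket`.

```kotlin
import java.net.Socket

/** Hypothetical holder (not the real NodeJSIPC): guards the socket field with a closed flag. */
class GuardedSocketHolder {
    private val lock = Any()
    private var socket: Socket? = null
    private var closed = false

    /**
     * Called by the connect path once it holds an open socket.
     * Returns false if teardown already ran, in which case the socket is closed here.
     */
    fun publish(connected: Socket): Boolean {
        synchronized(lock) {
            if (!closed) {
                socket = connected
                return true
            }
        }
        connected.close()   // close() won the race: release the fd instead of leaking it
        return false
    }

    /** Terminal teardown: marks the holder closed and closes whatever was published. */
    fun close() {
        val toClose = synchronized(lock) {
            closed = true
            socket.also { socket = null }
        }
        toClose?.close()
    }
}
```

With this shape, calling `close()` during the connect window still tears down nothing at that moment, but the later `publish(...)` returns `false` and closes the late socket. It addresses only the first finding; the send-versus-close and dropped-send races would still need ordering of their own.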

gmaclennan · 2026-06-25T14:12:41Z

+            // stays alive. Close synchronously: the backend must see EOF on the
+            // old socket before the next OnCreate connects, otherwise the FD
+            // lingers and rpc-reflector listeners leak onto MapeoManager.
+            ipc.close()


xhigh review — lifecycle implications of disconnect() → close() (2 findings)

Use-after-close (PLAUSIBLE). close() cancels the scope permanently, so the instance is dead afterward. If OnActivityEntersForeground fires ipc.connect() on this same instance after OnDestroy but before the next OnCreate swaps in a fresh NodeJSIPC, connect() passes its Disconnected guard but scope.launch on the cancelled scope never runs — state stays Disconnected, and subsequent postMessages trySend into a drained channel and vanish with no error surfaced to JS.

Suppressed STOPPED transition (PLAUSIBLE). The old disconnect() path delivered a Disconnecting/Disconnected (→ STOPPING/STOPPED) transition to the observer; close() cancels the collector, so on a final teardown not followed by OnCreate, JS consumers never observe STOPPED.

For the steady-state reload (OnDestroy→OnCreate) both are benign — listeners are recreated. They only bite the foreground-before-recreate and final-teardown orderings.

gmaclennan · 2026-06-25T14:12:42Z

+        // connect-cancel race.
+        Thread.sleep(200)
+
+        ipc.close()


xhigh review — test asserts peer EOF but not client teardown (PLAUSIBLE)

closeReleasesSocketSynchronously checks the server read unblocked, but never asserts the client receive coroutine actually terminated. Because close() cancels the scope without joining, the receive coroutine can still be unwinding (woken into its IOException→disconnect() path) when tearDown() closes serverSocket/boundSocket and deletes the socket file. A regression where close() returns before the client coroutine unwinds still passes here; the leaked coroutine racing teardown would surface only as a flaky failure in a later test. Consider asserting client-side teardown (a post-close settle + state assertion) as the sibling tests do.

gmaclennan · 2026-06-25T14:44:46Z

Finding #1 (disconnect() deadlock) — verified low severity, hardening only

Follow-up to the inline review and on-device verification. disconnect() (unchanged by this PR) still does connectJob?.cancelAndJoin() of a receive loop parked in a blocking readFully, closing the socket only after the join — structurally the same deadlock the new close() avoids by shutting the socket down first. I'd initially flagged this as the most severe finding.

Verifying on device (killing the :ComapeoCore FGS process across several foreground/background cycles) shows the deadlock is not reachable in the current lifecycle. The deadlock needs the node backend (the server end of the NodeJSIPC client socket) to be connected but not sending, so the receive loop sits in readFully. Every disconnect() caller instead runs after node has already exited and closed the socket, so readFully returns on EOF and cancelAndJoin completes cleanly (observed "Receive job completed … cancelled", no hang; FDs released, no leak). Concretely — NodeJSService.stop() calls disconnect() after nodeJob.join() (node has already exited), fire-and-forget, under withTimeout(10s) + Process.killProcess; and the main-process ipc only reaches disconnect() via the receive/send loops' IOException path, where node is already gone.

So this is hardening, not a live bug. It only bites if a future caller tears down a healthy connection via disconnect() — node still running and idle, receive loop parked — e.g. a "pause sync" or "switch project" teardown that stops the IPC without killing the FGS. Today only close() (JS reload) tears down while node is still running, and it already shuts down first. I've staged a fix in the working tree (not yet pushed) that lifts the same shutdown-before-join into disconnect(), so both teardown paths wake the parked read identically rather than relying on "node has already exited" or killProcess as the backstop.

…down

Generalize the shutdown-before-close fix into disconnect(): shutdown the socket to wake the receive loop parked in a blocking readFully before cancelAndJoin, so the join can't deadlock on a live but idle node backend. The deadlock is unreachable on today's callers (they run after the backend has exited), so this is hardening that keeps disconnect() correct in isolation and symmetric with close().

In close(), set the terminal Disconnected state right after scope.cancel() so the woken receive loop's disconnect() short-circuits on its state guard instead of relaunching teardown on the cancelled scope.

Extract shared shutdownSocket()/closeStreamsAndSocket() helpers used by both paths so the two teardowns can't drift.

gmaclennan mentioned this pull request Jun 25, 2026

Tear down RPC subscriptions on RN context reload #51

Closed

9 tasks

github-actions Bot added the fix Bug fix (changelog) label Jun 25, 2026

gmaclennan added this pull request to the merge queue Jun 25, 2026

gmaclennan removed this pull request from the merge queue due to a manual request Jun 25, 2026

gmaclennan added 2 commits June 25, 2026 16:40

Merge branch 'main' into fix/android-ipc-sync-close-on-reload

0b48722

gmaclennan enabled auto-merge June 25, 2026 15:43

gmaclennan added this pull request to the merge queue Jun 25, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 25, 2026

gmaclennan added this pull request to the merge queue Jun 25, 2026

Merged via the queue into main with commit 1b63602 Jun 25, 2026
24 checks passed

gmaclennan deleted the fix/android-ipc-sync-close-on-reload branch June 25, 2026 17:10

optic-release-automation Bot mentioned this pull request Jun 25, 2026

[OPTIC-RELEASE-AUTOMATION] release/v1.0.0-pre.6 #168

Merged

gmaclennan merged 3 commits into main from fix/android-ipc-sync-close-on-reload
