Skip to content

Add rootkey persistence and lifecycle state management#36

Merged
gmaclennan merged 37 commits into
mainfrom
claude/root-key-management-plan-o1vKT
Apr 29, 2026
Merged

Add rootkey persistence and lifecycle state management#36
gmaclennan merged 37 commits into
mainfrom
claude/root-key-management-plan-o1vKT

Conversation

@gmaclennan

Copy link
Copy Markdown
Member

Summary

This PR implements persistent device identity (rootkey) management and adds comprehensive lifecycle state tracking to the embedded Node.js service across iOS and Android platforms. The rootkey is now generated once on first launch, securely persisted using platform-native keystores, and passed to the backend during service initialization. A new state machine tracks service lifecycle transitions and exposes them to the JavaScript layer.

Key Changes

Rootkey Persistence

  • Android (RootKeyStore.kt): Implements 16-byte rootkey generation and persistence using AndroidKeyStore-backed AES/GCM encryption with SharedPreferences storage. Includes read-back verification and explicit failure handling to prevent silent identity loss.
  • iOS (RootKeyStore.swift): Implements equivalent functionality using iOS Keychain with kSecClassGenericPassword and AfterFirstUnlockThisDeviceOnly accessibility.
  • Both implementations generate the key once on first launch and never rotate it; regeneration would constitute identity loss.

Service Lifecycle State Machine

  • Android (NodeJSService.kt): Added State enum (STOPPED, STARTING, STARTED, STOPPING, ERROR) with state transition tracking and onStateChange callback for FGS integration.
  • iOS (NodeJSService.swift): Mirrors Android's state machine with identical state names for cross-platform consistency.
  • Backend (backend/index.js): Refactored initialization to be gated on receiving the rootkey from native via a new init RPC frame. The control socket now drives a two-stage boot: started (control socket ready, awaiting rootkey), then ready (MapeoManager constructed).

Rootkey Handshake Protocol

  • Native code loads/generates the rootkey from secure storage on service start.
  • Backend broadcasts started once the control socket is listening.
  • Native responds with init frame containing base64-encoded rootkey.
  • Backend constructs MapeoManager with the rootkey and broadcasts ready.
  • State transitions are observable by the JavaScript layer via stateChange events.

JavaScript Layer Integration

  • ComapeoCoreModule.ts: Added getState() method and stateChange event emission for lifecycle observation.
  • ComapeoCore.types.ts: Defined ComapeoState union type matching platform implementations.
  • src/index.ts: Exported state observer alongside existing comapeo RPC client.

Test Infrastructure

  • MockBackend.swift: New helper for macOS swift-test runs that simulates the Node.js backend's control socket handshake, enabling tests to verify rootkey round-tripping without touching the real keychain.
  • Updated existing tests to use MockBackend and verify state transitions and rootkey capture.

Data Protection

  • Android: Added comapeo_backup_rules.xml and comapeo_data_extraction_rules.xml to exclude the rootkey-bearing SharedPreferences from cloud backup and device-to-device transfer.
  • iOS: Keychain item configured with AfterFirstUnlockThisDeviceOnly accessibility (no iCloud sync, no device-to-device restore).

Notable Implementation Details

  • Synchronous rootkey load: Both platforms load the rootkey synchronously on service start to avoid race conditions in the FGS startup path.
  • Explicit error handling: Neither platform silently regenerates the rootkey on read failure; any decrypt or keystore error is surfaced to the caller so the app can inform the user of identity loss.
  • Read-back verification: Android verifies the persisted envelope by decrypting it immediately after write to catch corruption before declaring success.
  • StrongBox opportunism: Android attempts to use StrongBox (hardware security module) if available but gracefully falls back to regular AndroidKeyStore on devices without it.
  • State replay: The backend's control socket replays started and ready messages to late-connecting clients, ensuring consistent state even if the JavaScript layer initializes after the handshake completes.

https://claude.ai/code/session_01WEzwFdqnPVbcdWgMbX6bjE

claude and others added 18 commits April 28, 2026 22:14
Drops the hardcoded rootkey from createComapeo and reshapes the boot
sequence so native (Kotlin/Swift) supplies it over the existing control
socket as the first frame, before the manager is constructed:

- Backend binds control.sock first, broadcasts {type:"started"}, waits
  for {type:"init",rootKey:<base64>}, builds MapeoManager, binds
  comapeo.sock, broadcasts {type:"ready"}.
- The 1s settle window before `ready` is removed; `ready` now means RPC
  is safe to call (manager exists, comapeo socket bound).
- Native folds the whole handshake into its `starting` state. `started`
  fires only on receipt of `ready`.

Adds RootKeyStore on both platforms (Keychain on iOS,
AndroidKeyStore-wrapped SharedPreferences on Android), greenfield only —
fresh-install generate-and-persist with read-on-error rather than
silent regeneration. SharedPreferences-backed file is excluded from
backup via data_extraction_rules.xml + backup_rules.xml.

Closes #29: Android NodeJSService gains a State enum mirroring iOS,
ComapeoCoreModule observes the control socket from the main process to
derive a JS-visible state, and the TS layer exports a `state` singleton
with getState() and "stateChange" events alongside the existing
`comapeo` MessagePort.

Tests: a new MockBackend helper drives the started → init → ready
handshake against a mock control socket so existing service-lifecycle
tests still observe `.started` without spinning a real Keychain.

https://claude.ai/code/session_01WEzwFdqnPVbcdWgMbX6bjE
…ement-plan-o1vKT

# Conflicts:
#	android/src/main/AndroidManifest.xml
#	backend/index.js
#	backend/lib/simple-rpc.js
Three independent compile errors surfaced once compileDebugKotlin ran
in CI:

- RootKeyStore.kt: SharedPreferences.edit(commit = true) { ... }
  returns Unit, not the Boolean from commit(). Switched to the raw
  editor (edit().putString(...).commit()) so the persistence failure
  can actually be observed and surfaced as a RootKeyException.
- NodeJSService.kt / ComapeoCoreModule.kt: the trailing-lambda call
  NodeJSIPC(file) { ... } bound to the optional last param
  (onConnectionStateChange: ((State) -> Unit)?) instead of
  onMessage: (String) -> Unit. Use named-argument calls so the
  (String) -> Unit lambda binds to onMessage.
- Drop the now-unused androidx.core.content.edit import from
  RootKeyStore.kt.

https://claude.ai/code/session_01WEzwFdqnPVbcdWgMbX6bjE
Two distinct CI failures from the same trailing-lambda mistake plus a
Swift 5 trailing-comma:

- android/src/androidTest/java/com/comapeo/core/NodeJSIPCTest.kt has
  ~9 callsites of the form `NodeJSIPC(file) { msg -> ... }`. The new
  optional `onConnectionStateChange` constructor parameter was last,
  so Kotlin's trailing-lambda rule bound those callsites to the state
  callback (wrong type), not `onMessage`. Reorder NodeJSIPC's
  constructor so `onConnectionStateChange` (defaulted) comes before
  `onMessage` — trailing-lambda then binds to `onMessage` again and
  every existing call compiles unchanged.
- ios/RootKeyStore.swift had a trailing comma after the last argument
  of SecRandomCopyBytes. Trailing commas in argument lists are Swift
  6.0+ (SE-0257); the macOS swift-test runner uses Xcode 15.4 / Swift
  5.10, which rejects them. Drop the comma. Also exclude
  RootKeyStore.swift from the swift package target explicitly so SPM
  doesn't surprise us by compiling it (production CocoaPods build
  picks it up via `s.source_files = "*.swift"`).
- Refresh CoreManagerSmokeTest.swift docstring to reflect the new
  bind-control → init → bind-comapeo → ready sequence (previously
  described as "both listen() promises resolve").

https://claude.ai/code/session_01WEzwFdqnPVbcdWgMbX6bjE
Swift Package Tests and Integration Tests are failing with no
visible error in the GitHub Actions web UI annotations panel — the
"Process completed with exit code 1/65" tells us nothing about which
test failed or what the assertion was. Add `tee` + a failure-only
print step so the next CI run surfaces the actual error in the step
output, plus an artifact upload so we can pull the full log.

To revert once root-caused, drop the diagnostic steps and the
`--verbose | tee` redirection.

https://claude.ai/code/session_01WEzwFdqnPVbcdWgMbX6bjE
ComapeoCoreModuleTests.testStateStringDerivesFromServiceArgumentNotIPC
was constructing NodeJSService without the new rootKeyProvider param,
which fails the test target compile under xcodebuild test (exit 65).

Two changes:
- Pass `rootKeyProvider: { Data(repeating: 0xAB, count: 16) }` to the
  init so the example-app test target compiles.
- Switch the assertion from `.started` to `.starting`. The `.started`
  transition now requires a control-socket handshake (backend sends
  `started` → service replies with init frame → backend sends `ready`)
  that the example-app test target has no scaffold for. The contract
  under test is "stateString reflects service.state.rawValue" — the
  choice of state value is incidental.

Also simplify mockTestRootKey from sequence-init to
`Data(repeating:count:)` for clarity.

https://claude.ai/code/session_01WEzwFdqnPVbcdWgMbX6bjE
CI is green now (Swift Package Tests + Integration Tests + iOS Device
Build all pass on commit 14a3590). Drop the verbose-tee + dump-on-
failure scaffolding added in f729f1c — it served its purpose.

https://claude.ai/code/session_01WEzwFdqnPVbcdWgMbX6bjE
Integration Tests passed on 14a3590 but failed on 34aba84 (revert)
in 15 minutes — typical xcodebuild timeout shape. Re-add the failure
dump until we have a reproducible signal.

https://claude.ai/code/session_01WEzwFdqnPVbcdWgMbX6bjE
639a82a (with diagnostic) passed; the 34aba84 failure was almost
certainly a flake — same Swift code, only the workflow differed.
Drop the failure dump now that the run is reproducible.

If the flake recurs we have the diagnostic snippet in git history
and can re-add it for one-shot debugging.

https://claude.ai/code/session_01WEzwFdqnPVbcdWgMbX6bjE
Re-add the `tee xcodebuild-test.log` redirection plus the
failure-only summary dump. We tested both with and without it on
identical Swift code; the runs without `tee` intermittently timed
out at the workflow's 15-min mark while runs with `tee` completed
in ~4 min. The redirection seems to affect xcodebuild's pipe
buffering or simulator boot-sequence timing in a way that
suppresses the race; cause unknown, but the workaround is stable
and the failure-dump pays for itself the next time something else
breaks.

https://claude.ai/code/session_01WEzwFdqnPVbcdWgMbX6bjE
Substring matching on the raw frame bytes (`message.contains("\"started\"")`)
is brittle: any control payload that legitimately carries those characters
in another field would falsely trigger a state transition. The "stay
synchronous" rationale in the prior comment doesn't hold either — the
handler already runs on the IPC's receive queue/dispatcher and the
init-frame send is async via `serviceScope.launch` / sendQueue, so a real
parser costs nothing in latency or ordering.

Switches both NodeJSService.kt (FGS-side) and ComapeoCoreModule.kt
(main-app-side) on Android, plus NodeJSService.swift on iOS, to
JSONObject.optString("type") / JSONSerialization.jsonObject. Non-JSON
frames are logged-and-skipped rather than mis-routed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The envelope JSON carries an `alias` field for forward-compat with future
wrapper-key rotation, but parseEnvelope previously read it without
checking it against the constant the code actually uses. A tampered or
mis-versioned envelope whose `v=1` happens to be correct but whose alias
points at a non-existent key would have hit a confusing "wrapper key alias
missing" error several call frames later. Pin the check at parse time so
the failure mode names the right field.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…hrow

Adds SimpleRpcServer.broadcastError({phase, message, stack?}) so the host
can publish a structured error frame to every connected control-socket
client before the process exits. Native-side consumers (FGS on Android,
NodeJSService on iOS) will use this to derive their ERROR state without
needing a separate error channel.

backend/index.js gains:
- handleFatal(phase, error): broadcasts the error frame, waits ~100ms
  for the AF_UNIX kernel buffer to flush, then exits 1.
- uncaughtException + unhandledRejection handlers route through
  handleFatal so any throw — sync or async, in any phase — surfaces with
  the same shape.
- Boot IIFE tags errors with the phase that produced them
  (`listen-control`, `init`, `construct`) before re-throwing into
  handleFatal, replacing the bare `process.exit(1)`.

The 100ms flush wait is bounded but generous: AF_UNIX writes on a local
socket are kernel-buffered immediately; the wait exists for the case
where the peer hasn't drained its receive buffer yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…idge

Pairs with the previous commit's backend broadcastError. Native (FGS-side
NodeJSService on Android, NodeJSService on iOS) now consumes
{type:"error",phase,message} on the control socket and transitions to
ERROR while capturing the detail. Local error paths (rootkey load,
node-runtime startup, shutdown timeout) route through the same
transitionToError helper so getLastError() always reflects whichever
failure landed first.

The main-process Android ComapeoCoreModule (which observes the same
broadcast independently) attaches errorPhase/errorMessage to its
stateChange event payload. iOS ComapeoCoreModule reads the captured
ErrorInfo off NodeJSService and includes the same fields.

JS surface:
- StateChangeEventPayload gains optional errorPhase/errorMessage.
- ComapeoErrorInfo type for the structured detail.
- state.getLastError() returns the most recent ErrorInfo or null.
- stateChange listener signature is now (state, error|null).

The IPC-level NodeJSIPC.State.Error is still mapped to ERROR; its
exception message is forwarded with phase="ipc" so callers can
distinguish a connection-layer failure from a backend-reported one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous behaviour was to nodeJob?.cancel() on Android when rootkey
load failed in sendInitFrame. That races with the application's own
ERROR observation: if the JS layer is mid-handler reading getLastError(),
the node process has already gone away and any subsequent introspection
(or graceful shutdown attempt by the FGS) sees a stale state.

ERROR is now treated as observable: the service transitions to ERROR and
captures the cause, but leaves the node thread alive. Recovery is the
application's call — restart the FGS, recreate the service, prompt the
user, etc. — which matches the pattern callers already need for non-
rootkey failures (backend boot errors, shutdown timeouts).

iOS sendInitFrame already had no abort path, so the change there is
docstring-only. State enum docs and the JS ComapeoState union are
updated to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A hang during the rootkey handshake or MapeoManager construction
previously left the service in STARTING indefinitely — there was no
upper bound on how long the JS layer (or any UI driven off
state.getState()) might wait for a transition that would never come.

Adds a configurable watchdog (default 30 s) on both platforms:

- iOS: DispatchWorkItem scheduled via DispatchQueue.global().asyncAfter.
  transitionState cancels it when leaving .starting; if it fires while
  state is still .starting, the service transitions to ERROR with phase
  "starting-timeout".
- Android: serviceScope.launch { delay(); … } that respects coroutine
  cancellation, cancelled by transitionState when leaving STARTING.

Both use a TOCTOU-safe state recheck inside the watchdog body so a
racing transition (ready broadcast landing simultaneously) doesn't
trigger a spurious ERROR.

Default 30 s covers cold simulator boot plus the addon dlopen surge
(sodium-native + better-sqlite3 dominate) with margin matching the
existing CoreManagerSmokeTest budget. Production callers can widen
for slow devices; tests can tighten via the new init parameter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both modules call NodeJSIPC.connect() in OnCreate AND on every
OnActivityEntersForeground / OnAppEntersForeground. Without context that
reads as redundant or buggy. NodeJSIPC.connect() is idempotent and also
the recovery path out of the IPC's .error state, so re-calling it on
every foreground is the cheapest way to handle transient connection
failures (FGS respawn after kill on Android, suspension-closed fd on
iOS) without us mirroring IPC state at this layer.

Comment-only — no behaviour change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Android: instrumented test under androidTest/ exercises AndroidKeyStore
+ SharedPreferences directly. Two cases: first-call generates and
second-call returns identical bytes (proves persistence), and a corrupt
JSON envelope throws RootKeyException (proves we don't silently
regenerate on read failure — the contract that prevents identity loss).
@Before/@after purge both halves of the persisted state so the suite
is order-independent and won't leak alias/blob across runs.

iOS: example-app XCTest suite exercises the simulator's iOS Keychain.
Two cases: round-trip (proves persistence under a UUID-namespaced
service) and a manually-planted wrong-length entry that must throw
.wrongLength (proves the contract). Each test uses a unique service
string so concurrent or post-crash reruns don't collide.

The iOS test sources live under example/tests/ios/ (the tracked
location); example/ios/ is gitignored and regenerated by `expo
prebuild`. To make new test files land in the Xcode test target on
incremental prebuilds (not just --clean rebuilds), add-test-target.rb
is now idempotent — it registers any *.swift file in the test source
directory that isn't already wired into the build phase.

Cross-restart deviceId stability is intentionally not covered here per
review discussion; the upcoming Maestro end-to-end suite will exercise
that path with a real CoMapeoCore boot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread android/src/main/java/com/comapeo/core/ComapeoCoreModule.kt Outdated
Comment thread android/src/main/java/com/comapeo/core/ComapeoCoreModule.kt
Comment thread android/src/main/java/com/comapeo/core/ComapeoCoreModule.kt
Comment thread android/src/main/java/com/comapeo/core/NodeJSService.kt

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements persistent device identity (“rootkey”) storage on iOS/Android and introduces a cross-platform lifecycle/state model for the embedded Node.js backend, including a native↔backend control-socket handshake that gates backend initialization on receiving the rootkey.

Changes:

  • Add secure, non-rotating 16-byte rootkey persistence on iOS (Keychain) and Android (AndroidKeyStore-wrapped AES/GCM + SharedPreferences), plus backup/transfer exclusions on Android.
  • Refactor backend startup to a two-stage boot (startedinit(rootKey)ready) and add error broadcasting over the control socket.
  • Expose lifecycle state + last error to JS via getState(), getLastError(), and stateChange events; update iOS tests and tooling to use a new MockBackend.

Reviewed changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/index.ts Re-export state observer alongside comapeo.
src/ComapeoCoreModule.ts Add JS state observer wrapper + getState()/getLastError() accessors.
src/ComapeoCore.types.ts Define ComapeoState, error payload/event types.
ios/Tests/NodeJSServiceTests.swift Update service tests to drive new handshake via MockBackend.
ios/Tests/IPCLifecycleTests.swift Update IPC lifecycle tests to use MockBackend for control socket.
ios/Tests/Helpers/MockNodeService.swift Inject rootKeyProvider + add helper to start service with MockBackend.
ios/Tests/Helpers/MockBackend.swift New mock control-socket backend implementing started → init → ready.
ios/RootKeyStore.swift New Keychain-backed persistent rootkey store.
ios/Package.swift Exclude RootKeyStore.swift from SPM target used by macOS swift tests.
ios/NodeJSService.swift Add lifecycle state machine, startup watchdog, init-frame send, error capture.
ios/ComapeoCoreModule.swift Emit stateChange with error details; expose getLastError().
ios/AppLifecycleDelegate.swift Wire RootKeyStore.loadOrInitialize() into NodeJSService via rootKeyProvider.
example/tests/ios/ServiceLifecycleTest.swift Adjust timing assumptions for new “ready-gated” startup.
example/tests/ios/RootKeyStoreTests.swift New simulator Keychain tests for RootKeyStore persistence/corruption.
example/tests/ios/CoreManagerSmokeTest.swift Update smoke test commentary to reflect init-gated manager construction.
example/tests/ios/ComapeoCoreModuleTests.swift Update module test expectations to .starting due to handshake requirement.
example/plugins/with-ios-tests/add-test-target.rb Make test target file registration idempotent across prebuild runs.
backend/lib/simple-rpc.js Pass full inbound message to handlers; add broadcastError(...).
backend/lib/create-comapeo.js Require a validated 16-byte rootKey Buffer (no default).
backend/index.js Implement init handshake, gate manager creation on rootkey, broadcast boot errors.
android/src/main/res/xml/comapeo_data_extraction_rules.xml Android 12+ backup/device-transfer exclusion for rootkey prefs.
android/src/main/res/xml/comapeo_backup_rules.xml Pre-Android-12 backup exclusion for rootkey prefs.
android/src/main/java/com/comapeo/core/RootKeyStore.kt New AndroidKeyStore-backed rootkey persistence with verification.
android/src/main/java/com/comapeo/core/NodeJSService.kt Add lifecycle state machine, startup watchdog, init-frame send, error capture.
android/src/main/java/com/comapeo/core/NodeJSIPC.kt Add optional connection-state observer callback.
android/src/main/java/com/comapeo/core/ComapeoCoreService.kt Remove obsolete internal ServiceState flow (service now relies on NodeJSService state).
android/src/main/java/com/comapeo/core/ComapeoCoreModule.kt Add stateChange events + getState()/getLastError() derived from control socket + IPC state.
android/src/main/AndroidManifest.xml Wire backup/data-extraction exclusion rules; add tools namespace/targetApi.
android/src/androidTest/java/com/comapeo/core/RootKeyStoreTest.kt New instrumented tests for Android rootkey persistence/corruption.
.github/workflows/ios-tests.yml Pipe xcodebuild output to file and surface failure summary on CI.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread android/src/main/java/com/comapeo/core/NodeJSService.kt Outdated
Comment thread android/src/main/java/com/comapeo/core/ComapeoCoreModule.kt
Comment thread android/src/main/java/com/comapeo/core/RootKeyStore.kt Outdated
Comment thread android/src/main/java/com/comapeo/core/RootKeyStore.kt Outdated
Comment thread android/src/main/java/com/comapeo/core/RootKeyStore.kt
Pulls together the design rationale that was previously spread across
PR review comments and ad-hoc explanations. Covers:

- iOS in-process vs Android two-process model and why.
- The two Unix-domain sockets (comapeo.sock, control.sock), framing,
  and why we have two channels rather than one.
- The boot handshake (started → init → ready) and why the rootkey
  crosses the socket boundary.
- Lifecycle state machine: native state, JS-visible state, error
  surfacing across the bridge.
- Known limitations: the merged state machine actually represents
  three independent components (FGS, node runtime, backend code),
  with two proposed fixes — a small "broadcast stopping" change vs a
  larger per-component state refactor. Both deferred to a follow-up.
- Alternatives considered for FGS↔main-process state notification
  (Intent broadcasts, Messenger, ContentProvider, etc.) and for
  rootkey storage (expo-secure-store), with the rejection rationale
  for each.

Companions the existing ForegroundService.md, root-key-storage-and-
migration-plan.md, runtime-alternatives.md, and bare-architecture.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gmaclennan and others added 7 commits April 29, 2026 15:24
Previously, Android's start() guarded only on `nodeJob != null` and
stop() had no state guard at all. The state machine therefore allowed
weird transitions like ERROR -> STOPPING (when stop() was called on
an errored service) and silently re-permitted ERROR -> STARTING after
a stop() had cleared nodeJob.

Tightens both guards to match iOS:

- start() requires state == STOPPED. Refuses STARTING/STARTED
  (already-running, existing nodeJob check), STOPPING (mid-shutdown),
  and ERROR (terminal — caller must destroy() and create a new
  instance to recover).
- stop() requires state in {STARTING, STARTED}. STOPPED/STOPPING is
  a no-op; ERROR is refused — callers must call destroy() to release
  resources rather than going through a STOPPING transition that
  doesn't map to a clear semantic.

This codifies the model the FGS lifecycle already follows in practice:
ComapeoCoreService.onCreate creates a fresh NodeJSService for every
service start, so a recovery flow is naturally "stop the FGS, start
it again" and a brand-new instance gets STOPPED -> STARTING. iOS
already enforces these guards in code; the State enum docstring on
both platforms now spells out the per-instance-terminal contract.

No behaviour change in existing happy-path or shutdown tests because
the FGS never reused a NodeJSService instance across start/stop
cycles.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reverses the lifecycle treatment of malformed control-socket frames
(non-JSON, missing `type`, unknown `type`). They were previously
raised to ERROR with phase="protocol"; that conflated "the service is
in trouble" with "we got one frame we couldn't parse". A single bad
frame shouldn't tear down a working session.

Adds a dedicated `messageerror` event mirroring the DOM MessagePort
counterpart. The event fires on the JS-facing `state` observer
alongside (but independent of) `stateChange`. Listeners receive a
single `Error` argument with a developer-facing description of the
offending frame:

  state.addListener("messageerror", (error) => {
    console.warn("control protocol error:", error.message)
  })

A `messageerror` event does not change the lifecycle state — the
control socket continues to drive `stateChange` independently. Real
backend errors (broadcast as `{type:"error",phase,message}`) still
transition to ERROR as before; only PARSE failures use the new
channel.

Native plumbing:
- Android ComapeoCoreModule: emits `messageerror` directly from its
  control-socket onMessage handler.
- iOS NodeJSService: exposes a new `onMessageError` callback;
  ComapeoCoreModule.swift wires it to sendEvent.
- Android FGS-side NodeJSService: logs malformed frames (no JS
  bridge to forward to); the main-app-process Module surfaces the
  event independently from its own controlIpc subscription.

Architecture doc gets a new §5.5 covering the rationale and wiring.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three near-identical blocks parsed control-socket frames and switched
on `type`:
- NodeJSService.kt::handleControlMessage (FGS-side)
- ComapeoCoreModule.kt's controlIpc onMessage (main-app-side)
- NodeJSService.swift::handleControlMessage (iOS)

Each independently parsed JSON, looked up "type", and translated to
lifecycle / event effects. The well-known type names ("started",
"ready", "error") were string literals scattered across all three
files. Adding a new frame type meant updating all three sites with no
compiler help to catch a missed branch.

Adds a typed router on each platform:
- ControlFrame.kt — sealed class with Started, Ready, Error(phase,
  message), Malformed(detail) cases, plus a static parse(raw): ControlFrame.
- ControlFrame.swift — Swift enum with the same cases and parse function.

Each consumer now switches on the typed result. Kotlin `when` and Swift
`switch` are exhaustive over sealed classes / enums, so a forgotten
case fails to build rather than silently dropping the frame.

`Malformed` collapses "non-JSON", "missing type", and "unknown type"
into one case with a developer-facing detail string. The consumers
don't currently distinguish these; the detail is what gets surfaced
to JS via `messageerror` regardless. If a future consumer needs to
distinguish, the case can be split without touching the call sites
that don't care.

Pure refactor — no behaviour change. ControlFrame.swift is added to
the SPM target sources alongside the existing Swift files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…shot>

Replaces the explicit `synchronized(stateLock)` blocks around `state` +
`lastError` reads/writes with a single `MutableStateFlow<StateSnapshot>`
holding both fields together. `getAndUpdate` is a CAS-loop, so:
- Concurrent transitions resolve in a defined order without a lock.
- The (state, lastError) pair is always observed as a matching unit
  by readers — no separate two-step lock dance to make atomic.
- The block returns the pre-update snapshot, so the watchdog-cancel +
  log + onStateChange-fire side effects can run cleanly outside the
  CAS without any "captured-from-inside-the-lock" trickery.

The `startupWatchdogJob` slot moves from `@Volatile var Job?` to
`AtomicReference<Job?>`. `cancelStartupWatchdog()` does
`getAndSet(null)?.cancel()` — a single atomic operation that takes
ownership of the current ref, so two concurrent transitions out of
STARTING can never both try to cancel the same job.

Aligns this file's pattern with `NodeJSIPC.kt` which already uses
`MutableStateFlow` for its connection state. No new dependencies (the
imports are already on the kotlinx-coroutines version we're pinned
to). Behaviour-preserving — observable lifecycle is identical to the
previous synchronized version.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure deterministic tests for the control-socket frame parser
extracted in 08bc266. Coverage:

- Each well-known type (started, ready, error) parses to the matching
  case, with the right phase/message extraction for error frames.
- Error frames default phase=unknown and message=(no message) when
  those fields are missing — the parser is permissive on the wire so
  downstream consumers can rely on never seeing null.
- Forward-compat: extra fields (e.g. `stack`) on an error frame are
  ignored, not rejected.
- Every malformed case (non-JSON, empty input, missing `type` field,
  unknown `type` value) resolves to .Malformed/.malformed with a
  developer-readable detail string that names what's wrong.
- A 200-char non-JSON input produces a truncated detail (the parser
  takes the first 100 chars), so a hostile or accidentally-huge frame
  doesn't blow up logs.

Without these, parser regressions would only surface via a lifecycle
test that happens to drive the right frame and notice the wrong
outcome — silent until something else broke. JVM-only on Android
(android/src/test, not androidTest), pure Foundation on iOS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Initial ControlFrameTests.swift used `@testable internal import` (the
Swift 5.9 explicit-access-import form), copied from the example app's
test sources. The SPM `ios/Tests/` target uses the older implicit form
across all sibling tests; mixing the two caused
"ambiguous implicit access level for import of 'ComapeoCore'" build
errors against the other test files. Switching ControlFrameTests to
the implicit form aligns it with the rest of the suite and the test
target now compiles + passes (`swift test --filter ControlFrameTests`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every ControlFrameTest case threw `RuntimeException` in CI:
> ControlFrameTest > parsesStarted FAILED
>     java.lang.RuntimeException at ControlFrameTest.kt:21
> ... (10 failures)

`org.json` is part of the Android framework, but the JVM unit-test
classpath gets a stub where every method throws "Method ... not
mocked" at runtime. The Android Gradle plugin's default
`testOptions.unitTests.returnDefaultValues = false` makes that the
expected failure mode for any framework class with no real JVM
implementation.

Pulling the standalone `org.json:json` (Maven Central) onto the test
classpath provides a real implementation under the same package and
API. Production code that uses `JSONObject` (ControlFrame,
RootKeyStore) is now unit-testable on the JVM without instrumentation
or framework mocking.

Verified locally:
  ./gradlew :comapeo-core-react-native:testDebugUnitTest \
      --tests com.comapeo.core.ControlFrameTest
  BUILD SUCCESSFUL

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 36 out of 36 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread android/src/main/java/com/comapeo/core/NodeJSIPC.kt
Comment thread backend/index.js
Comment thread android/src/main/AndroidManifest.xml
gmaclennan and others added 2 commits April 29, 2026 17:12
Buffer.from(s, "base64") in Node is permissive — invalid characters
are silently dropped, so a tampered string could still decode to 16
bytes that aren't the bytes the producer originally encoded. The
existing length check would pass, the wrong rootkey would resolve
into MapeoManager, and the device would silently use a corrupted
identity.

Adds a strict regex check before decode: a 16-byte rootkey,
base64-encoded with padding (which is what both Android's
`Base64.NO_WRAP` and iOS's `base64EncodedString()` emit), is
exactly 22 base64 chars followed by `==`. Anything else is rejected
with a clear error before `Buffer.from` is given a chance to silently
re-interpret it.

The post-decode length check stays as belt-and-braces — strict regex
should make it unreachable, but the cost of the extra check is one
boolean compare and the cost of a bad rootKey is identity loss.

Addresses Copilot review on backend/index.js:73.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The library declares android:dataExtractionRules and
android:fullBackupContent on its <application> element to exclude the
rootkey-bearing SharedPreferences from cloud backup and device-to-
device transfer. Host apps that already declare these attributes hit
a manifest-merger conflict at build time.

Trade-off: the build error is intentional. Without the library
declaration, forgetful consumers leak a (cryptographically useless,
but private) blob to backups and produce a confusing restore UX
(envelope restored, wrapper key missing on the new device,
RootKeyException at first launch). The build-time error is the
lesser evil; it surfaces the integration step at the right moment.

Cleanest merge-friendly answer is an Expo config plugin that
programmatically merges our exclusions into the host's rules XML at
prebuild time — filed separately. For now:

- Manifest comment expanded with the explicit two-step resolution
  (merge exclusion entry into host's rules XML, add `tools:replace`).
- README "Configure for Android" section explains the conflict and
  resolution with concrete XML, plus rationale for why exclusion
  matters even though the wrapper key is device-bound.

Addresses Copilot review on AndroidManifest.xml:32.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gmaclennan and others added 2 commits April 29, 2026 17:19
The wrapper key in `createOrLoadWrapperKey()` is built with
`setRandomizedEncryptionRequired(true)`, which on hardware-backed
AndroidKeyStore implementations means the keystore — not the caller
— controls IV generation. Passing a `GCMParameterSpec` with our own
IV to `Cipher.init(ENCRYPT_MODE, ...)` throws

  java.security.InvalidAlgorithmParameterException:
      Caller-provided IV not permitted

at runtime. The instrumented test suite hit this on the API 30
emulator; both `RootKeyStoreTest.firstCallGenerates...` and
`corruptedEnvelopeThrows` failed because they exercise the encrypt
path. Production code on a real device with a hardware keystore
would have hit the same.

Fix: drop the caller IV on encryption. Read `cipher.iv` after
`init()` and store the keystore-generated IV in the envelope
exactly as before — load semantics unchanged. Adds a defensive
length check so an unexpectedly-shaped IV (the spec doesn't pin
the length, though 12 bytes is the GCM default and what we still
require on read) fails at write time, not silently.

Decryption keeps using `GCMParameterSpec(tagLen, envelope.iv)` —
that's required by GCM and explicitly allowed for keystore-managed
decrypt keys.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI's swift-test run died with `error: Exited with unexpected signal
code 13` (SIGPIPE) during
`testStopTimeoutTransitionsToErrorNotStopped`. The test deliberately
tears down the MockBackend before calling `service.stop(timeout:0.1)`,
forcing the shutdown frame to write to an already-closed peer:

  1. backend.stop()  → mock server closes its accepted fd
  2. service.stop()  → controlIPC.sendMessageSync writes shutdown
     frame on its own fd whose peer (the mock-side accept fd) is now
     closed
  3. Darwin.write() returns -1 with errno=EPIPE AND the kernel
     sends SIGPIPE to the process
  4. No handler installed → process dies, taking the rest of the
     test suite with it

The bug is latent in production too: any shutdown race between the
service and the backend (the backend's `handleFatal` exit is the
clearest example) can hit the same path, and SIGPIPE there would
crash the FGS / app in the field. The existing `writeFully` already
handles `EPIPE` as `IPCError.writeError(EPIPE)`; the kernel signal
just got there first.

Fix: set `SO_NOSIGPIPE = 1` via `setsockopt` on:

- The connect()'d fd in `connectSocket` (production controlIPC and
  comapeoIPC writers).
- The accept()'d fd in `MockNodeServer.acceptClient` (test side that
  also writes via `sendFramedMessage`).

With SIGPIPE suppressed at the socket level, a write to a closed
peer returns `errno=EPIPE` cleanly and our existing error path
handles it.

Verified locally:
  cd ios && swift test
  Executed 53 tests, with 0 failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gmaclennan gmaclennan enabled auto-merge (squash) April 29, 2026 16:28
@gmaclennan gmaclennan merged commit e5e1172 into main Apr 29, 2026
7 checks passed
@gmaclennan gmaclennan deleted the claude/root-key-management-plan-o1vKT branch April 29, 2026 16:36
gmaclennan added a commit that referenced this pull request Apr 29, 2026
…der-804127

* origin/main:
  Add rootkey persistence and lifecycle state management (#36)
  android: audit 16 KB page alignment on every shipped .so (#43)
  chore: fix eslint configuration (#41)

# Conflicts:
#	android/src/main/java/com/comapeo/core/ComapeoCoreModule.kt
#	backend/index.js
#	ios/ComapeoCoreModule.swift
#	src/index.ts
gmaclennan added a commit that referenced this pull request Jun 22, 2026
## Optic Release Automation

This **draft** PR is opened by Github action
[optic-release-automation-action](https://github.com/nearform-actions/optic-release-automation-action).

A new **draft** GitHub release
[v1.0.0-pre.2](https://github.com/digidem/comapeo-core-react-native/releases/tag/untagged-c499977757c9745e56b2)
has been created.

Release author: @gmaclennan

#### If you want to go ahead with the release, please merge this PR.
When you merge:

- The GitHub release will be published

- The npm package with tag pre will be published according to the
publishing rules you have configured



- No major or minor tags will be updated as configured


#### If you close the PR

- The new draft release will be deleted and nothing will change

## What's Changed
* Android Testing Infrastructure & Bug Fixes by @gmaclennan in
#3
* chore: prebuild example/android; harden instrumented tests by
@gmaclennan in
#10
* Integrate @comapeo/core via IPC over Unix sockets by @gmaclennan in
#5
* chore: adjust repo setup by @achou11 in
#12
* chore: minor fixes based on expo-doctor by @achou11 in
#13
* Add iOS support & test infrastructure by @gmaclennan in
#6
* chore: add architecture docs & plans by @gmaclennan in
#11
* update some native deps used in backend by @achou11 in
#14
* iOS Phase 1: unified JS bundle + smoke test (simulator-only) by
@gmaclennan in
#15
* iOS Phase 2: xcframework Embed & Sign for native addons by @gmaclennan
in #16
* Phase 2 Android: jniLibs packaging + unified rollup loader plugin by
@gmaclennan in
#17
* chore: post-Phase-2 cleanup — comments, plan docs, agents.md by
@gmaclennan in
#33
* android: read abiFilters from reactNativeArchitectures (#30) by
@gmaclennan in
#35
* refactor: simplify build-backend.ts; rollup writes directly to native
asset trees by @gmaclennan in
#34
* chore: fix eslint configuration by @achou11 in
#41
* android: audit 16 KB page alignment on every shipped .so by
@gmaclennan in
#43
* Add rootkey persistence and lifecycle state management by @gmaclennan
in #36
* chore: move example app into apps directory by @achou11 in
#18
* refactor: per-component lifecycle state with derived ComapeoState by
@gmaclennan in
#47
* android: fold waitForFile into connect retry loop by @gmaclennan in
#52
* chore: add e2e testing app by @achou11 in
#49
* fix(android): drop setUnlockedDeviceRequired from rootkey wrapper key
by @gmaclennan in
#57
* fix(backend): cache stopping/error frames for late joiners by
@gmaclennan in
#58
* fix(ios-tests): wait for STOPPING before signalling node exit by
@gmaclennan in
#59
* fix(android): drain JNI stdio pumps before returning from node::Start
by @gmaclennan in
#60
* Sentry integration: Phase 1 + Phase 2a + Phase 2b by @gmaclennan in
#54
* feat(backend): polywasm-backed undici on iOS, re-enable maps plugin by
@gmaclennan in
#62
* ci: drop unreliable Android emulator snapshot caching by @gmaclennan
in #64
* feat(sentry): land Phase 3 — backend loader + RPC tracing by
@gmaclennan in
#63
* fix(ios-tests): serialise STOPPING/STOPPED observers in
testFullLifecycleStateTransitions by @gmaclennan in
#71
* use npm list instead of custom traversal to get native module versions
by @achou11 in
#70
* feat(sentry): land Phases 6 + 7a — Android exit reasons & iOS
MetricKit app-exit telemetry by @gmaclennan in
#72
* fix(sentry): make exit telemetry lossless and stop cross-process
clobbering by @gmaclennan in
#84
* chore(e2e): add e2e tests on browserstack via Maestro by @achou11 in
#56
* feat(sentry): migrate to @sentry/react-native v8; exit telemetry as
Application Metrics by @gmaclennan in
#73
* Map server integration by @gmaclennan in
#86
* chore(deps): upgrade to Expo SDK 56 (React Native 0.85) by @gmaclennan
in #87
* chore(ci): add release workflow by @gmaclennan in
#90
* chore: fix npm script and release build script by @gmaclennan in
#91
* chore(pack): don't try to package build files by @gmaclennan in
#92
* fix: start fastify listening by @gmaclennan in
#93
* perf(backend): switch bundler from rollup to rolldown by @gmaclennan
in #94
* fix(ci): ignore-scripts in ios npm installs by @gmaclennan in
#96
* fix(ci): replace --ignore-scripts with npm strict-allow-scripts
allowlist by @gmaclennan in
#106
* feat(config): let the consuming app supply the default project config
by @gmaclennan in
#95
* chore(release): merge prerelease branch. by @gmaclennan in
#110

## New Contributors
* @achou11 made their first contribution in
#12

**Full Changelog**:
https://github.com/digidem/comapeo-core-react-native/commits/v1.0.0-pre.2

<!--

<release-meta>{"id":342868678,"version":"v1.0.0-pre.2","npmTag":"pre","opticUrl":"https://optic-zf3votdk5a-ew.a.run.app/api/generate/"}</release-meta>
-->
@gmaclennan gmaclennan added the feature New feature (changelog) label Jun 22, 2026
gmaclennan added a commit that referenced this pull request Jun 22, 2026
## Optic Release Automation

This **draft** PR is opened by Github action
[optic-release-automation-action](https://github.com/nearform-actions/optic-release-automation-action).

A new **draft** GitHub release
[v1.0.0-pre.2](https://github.com/digidem/comapeo-core-react-native/releases/tag/untagged-352a6c41c12fd02dec37)
has been created.

Release author: @gmaclennan

#### If you want to go ahead with the release, please merge this PR.
When you merge:

- The GitHub release will be published

- The npm package with tag pre will be published according to the
publishing rules you have configured



- No major or minor tags will be updated as configured


#### If you close the PR

- The new draft release will be deleted and nothing will change

<!-- Release notes generated using configuration in .github/release.yml
at 7fe80b4 -->

## What's Changed
### 🚀 Features
* Integrate @comapeo/core via IPC over Unix sockets by @gmaclennan in
#5
* Add iOS support & test infrastructure by @gmaclennan in
#6
* iOS Phase 1: unified JS bundle + smoke test (simulator-only) by
@gmaclennan in
#15
* iOS Phase 2: xcframework Embed & Sign for native addons by @gmaclennan
in #16
* Phase 2 Android: jniLibs packaging + unified rollup loader plugin by
@gmaclennan in
#17
* android: read abiFilters from reactNativeArchitectures (#30) by
@gmaclennan in
#35
* Add rootkey persistence and lifecycle state management by @gmaclennan
in #36
* Sentry integration: Phase 1 + Phase 2a + Phase 2b by @gmaclennan in
#54
* feat(backend): polywasm-backed undici on iOS, re-enable maps plugin by
@gmaclennan in
#62
* feat(sentry): land Phase 3 — backend loader + RPC tracing by
@gmaclennan in
#63
* feat(sentry): land Phases 6 + 7a — Android exit reasons & iOS
MetricKit app-exit telemetry by @gmaclennan in
#72
* feat(sentry): migrate to @sentry/react-native v8; exit telemetry as
Application Metrics by @gmaclennan in
#73
* Map server integration by @gmaclennan in
#86
* feat(config): let the consuming app supply the default project config
by @gmaclennan in
#95
### 🐛 Bug Fixes
* fix(android): drop setUnlockedDeviceRequired from rootkey wrapper key
by @gmaclennan in
#57
* fix(backend): cache stopping/error frames for late joiners by
@gmaclennan in
#58
* fix(ios-tests): wait for STOPPING before signalling node exit by
@gmaclennan in
#59
* fix(android): drain JNI stdio pumps before returning from node::Start
by @gmaclennan in
#60
* fix(ios-tests): serialise STOPPING/STOPPED observers in
testFullLifecycleStateTransitions by @gmaclennan in
#71
* fix(sentry): make exit telemetry lossless and stop cross-process
clobbering by @gmaclennan in
#84
* fix: start fastify listening by @gmaclennan in
#93
* fix(ci): ignore-scripts in ios npm installs by @gmaclennan in
#96
* fix(ci): replace --ignore-scripts with npm strict-allow-scripts
allowlist by @gmaclennan in
#106
* fix(release): stop `npm pack --dry-run` leaking dry-run into backend
install by @gmaclennan in
#129
### ⚡ Performance
* perf(backend): switch bundler from rollup to rolldown by @gmaclennan
in #94
### ⬆️ Dependencies
* update some native deps used in backend by @achou11 in
#14
* chore(deps): upgrade to Expo SDK 56 (React Native 0.85) by @gmaclennan
in #87
### 🏗️ Maintenance
* Android Testing Infrastructure & Bug Fixes by @gmaclennan in
#3
* chore: prebuild example/android; harden instrumented tests by
@gmaclennan in
#10
* chore: adjust repo setup by @achou11 in
#12
* chore: minor fixes based on expo-doctor by @achou11 in
#13
* chore: add architecture docs & plans by @gmaclennan in
#11
* chore: post-Phase-2 cleanup — comments, plan docs, agents.md by
@gmaclennan in
#33
* refactor: simplify build-backend.ts; rollup writes directly to native
asset trees by @gmaclennan in
#34
* chore: fix eslint configuration by @achou11 in
#41
* android: audit 16 KB page alignment on every shipped .so by
@gmaclennan in
#43
* chore: move example app into apps directory by @achou11 in
#18
* refactor: per-component lifecycle state with derived ComapeoState by
@gmaclennan in
#47
* android: fold waitForFile into connect retry loop by @gmaclennan in
#52
* chore: add e2e testing app by @achou11 in
#49
* ci: drop unreliable Android emulator snapshot caching by @gmaclennan
in #64
* use npm list instead of custom traversal to get native module versions
by @achou11 in
#70
* chore(e2e): add e2e tests on browserstack via Maestro by @achou11 in
#56
* chore(ci): add release workflow by @gmaclennan in
#90
* chore: fix npm script and release build script by @gmaclennan in
#91
* chore(pack): don't try to package build files by @gmaclennan in
#92
* chore(release): merge prerelease branch. by @gmaclennan in
#110
* ci(e2e): retry BrowserStack builds on infra-class flakes by
@gmaclennan in
#113
### Other Changes
* ci: derive changelog labels from PR titles + add Dependabot by
@gmaclennan in
#114

## New Contributors
* @achou11 made their first contribution in
#12
* @optic-release-automation[bot] made their first contribution in
#112

**Full Changelog**:
https://github.com/digidem/comapeo-core-react-native/commits/v1.0.0-pre.2

<!--

<release-meta>{"id":342970724,"version":"v1.0.0-pre.2","npmTag":"pre","opticUrl":"https://optic-zf3votdk5a-ew.a.run.app/api/generate/"}</release-meta>
-->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature New feature (changelog)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants