Symptom
The mount daemon enters a tight (~25 second) restart loop emitting:
mount local change failed: join cloud workspace: http 401: Unauthorized
cloud session refresh failed: refresh cloud session: http 403: Invalid or expired refresh token
mount sync cycle failed: context canceled
mount sync stopping: context canceled
Synced mirror started at …
…repeat…
Every cycle gets canceled before completing meaningful work. On a workspace with a 53k-event backlog this means non-convergent catch-up even though the actual catch-up logic (#177) works correctly: in today's session on rw_fc7b534b, the very first cycle after restart cleanly advanced the cursor evt_1 → evt_17493 and added ~1050 files, then the daemon entered the auth thrash loop and all subsequent cycles got canceled — files flat at 3766, cursor pinned at evt_17493 for the rest of the session.
Root cause
Refresh-token rotation racing between the daemon and concurrent relayfile CLI processes. Both share ~/.relayfile/cloud-credentials.json but each holds its own in-memory copy. Sequence:
- Daemon starts, loads
refreshToken = R1 into memory.
- Any concurrent
relayfile <subcommand> (status, tree, read, pull, …) that refreshes triggers the cloud to rotate R1 → R2 and write R2 to disk.
- Daemon's access token expires (currently a 30-minute TTL — short enough that this happens routinely).
- Daemon refreshes with its in-memory
R1. Cloud has already invalidated R1 → 403 "Invalid or expired refresh token".
- Daemon declares auth dead, restarts its sync loop, re-reads disk, briefly picks up
R2.
- Next CLI invocation rotates
R2 → R3. Goto 4.
The on-disk refresh token is not expired (it has a 30-day TTL); the cloud is invalidating it the moment a rotation happens elsewhere. This is invisible to a casual look at cloud-credentials.json (which still shows healthy refreshTokenExpiresAt).
Empirical evidence from today's session
- Disk state at investigation time:
accessTokenExpiresAt: ~18 min in the future ✅
refreshTokenExpiresAt: ~30 days in the future ✅
updatedAt: 12 minutes ago (recent rotation)
- Daemon process etime:
05:36 (alive and steady, never killed) — but Synced mirror started appeared 14 times in the same 5:36 window. That's the sync loop restarting ~every 24s inside the same process, not the process dying.
- Cycle outcome counts in the same window:
Synced mirror started: 14
mount sync stopping: 13
mount sync cycle failed: 15
mount sync cycle completed: 14 (all from earlier, healthy window)
- The thrash started immediately after the first cycle did real work — which is consistent with the first refresh attempt rotating, and from then on someone else (a CLI invocation or the daemon's own next attempt) racing it.
Why this surfaces now
PR #176 ("Harden mount daemon recovery follow-ups") appears to have added an auto-restart-on-auth-failure path. Pre-#176, a transient auth failure may have just been logged and the cycle continued or retried within the same loop. Post-#176, a persistent auth condition turns a single contention event into a continuous restart loop, which is now actively preventing convergence even though #177's checkpoint mechanism is working.
Proposed fixes (any one helps, ideally all three)
- Reread credentials from disk on auth failure before restarting. When the in-memory refresh token gets a 403, re-load
~/.relayfile/cloud-credentials.json and retry once with the disk value before declaring auth dead. Absorbs the rotation that happened in another process.
- Backoff + clear "needs re-login" state instead of tight restart. If reload-and-retry also 403s, the daemon should switch to a
daemon: auth-failed (run \relayfile login`)state surfaced inrelayfile status`, and NOT churn the sync loop every 25s. Catastrophic-mode behavior should be visible to operators, not silent thrash.
- (Cross-cut, optional cloud change) For daemon sessions, consider issuing non-rotating refresh tokens, or a dedicated long-lived daemon credential that doesn't rotate on use. The rotation model is fine for human-driven CLI use; it's a footgun for a long-running daemon that necessarily shares creds with concurrent CLI calls.
Related
Reproducible
Today's session is the repro: workspace with large event backlog, run the daemon, run a few relayfile status / relayfile tree / relayfile pull invocations concurrently while it's grinding. Daemon will enter the loop within minutes.
🤖 Generated with Claude Code
Symptom
The mount daemon enters a tight (~25 second) restart loop emitting:
Every cycle gets canceled before completing meaningful work. On a workspace with a 53k-event backlog this means non-convergent catch-up even though the actual catch-up logic (#177) works correctly: in today's session on
rw_fc7b534b, the very first cycle after restart cleanly advanced the cursorevt_1 → evt_17493and added ~1050 files, then the daemon entered the auth thrash loop and all subsequent cycles got canceled — files flat at 3766, cursor pinned atevt_17493for the rest of the session.Root cause
Refresh-token rotation racing between the daemon and concurrent
relayfileCLI processes. Both share~/.relayfile/cloud-credentials.jsonbut each holds its own in-memory copy. Sequence:refreshToken = R1into memory.relayfile <subcommand>(status, tree, read, pull, …) that refreshes triggers the cloud to rotateR1 → R2and writeR2to disk.R1. Cloud has already invalidatedR1→ 403 "Invalid or expired refresh token".R2.R2 → R3. Goto 4.The on-disk refresh token is not expired (it has a 30-day TTL); the cloud is invalidating it the moment a rotation happens elsewhere. This is invisible to a casual look at
cloud-credentials.json(which still shows healthyrefreshTokenExpiresAt).Empirical evidence from today's session
accessTokenExpiresAt: ~18 min in the future ✅refreshTokenExpiresAt: ~30 days in the future ✅updatedAt: 12 minutes ago (recent rotation)05:36(alive and steady, never killed) — butSynced mirror startedappeared 14 times in the same 5:36 window. That's the sync loop restarting ~every 24s inside the same process, not the process dying.Synced mirror started: 14mount sync stopping: 13mount sync cycle failed: 15mount sync cycle completed: 14 (all from earlier, healthy window)Why this surfaces now
PR #176 ("Harden mount daemon recovery follow-ups") appears to have added an auto-restart-on-auth-failure path. Pre-#176, a transient auth failure may have just been logged and the cycle continued or retried within the same loop. Post-#176, a persistent auth condition turns a single contention event into a continuous restart loop, which is now actively preventing convergence even though #177's checkpoint mechanism is working.
Proposed fixes (any one helps, ideally all three)
~/.relayfile/cloud-credentials.jsonand retry once with the disk value before declaring auth dead. Absorbs the rotation that happened in another process.daemon: auth-failed (run \relayfile login`)state surfaced inrelayfile status`, and NOT churn the sync loop every 25s. Catastrophic-mode behavior should be visible to operators, not silent thrash.Related
relayfile statusreportsdaemon: not runningfor foreground mounts) — both are visibility gaps where the daemon enters a bad state without the CLI showing it clearly.Reproducible
Today's session is the repro: workspace with large event backlog, run the daemon, run a few
relayfile status/relayfile tree/relayfile pullinvocations concurrently while it's grinding. Daemon will enter the loop within minutes.🤖 Generated with Claude Code