Skip to content

mount daemon: auth thrash loop on refresh-token rotation with concurrent CLI processes #178

Description

@khaliqgant

Symptom

The mount daemon enters a tight (~25 second) restart loop emitting:

mount local change failed: join cloud workspace: http 401: Unauthorized
cloud session refresh failed: refresh cloud session: http 403: Invalid or expired refresh token
mount sync cycle failed: context canceled
mount sync stopping: context canceled
Synced mirror started at …
…repeat…

Every cycle gets canceled before completing meaningful work. On a workspace with a 53k-event backlog this means non-convergent catch-up even though the actual catch-up logic (#177) works correctly: in today's session on rw_fc7b534b, the very first cycle after restart cleanly advanced the cursor evt_1 → evt_17493 and added ~1050 files, then the daemon entered the auth thrash loop and all subsequent cycles got canceled — files flat at 3766, cursor pinned at evt_17493 for the rest of the session.

Root cause

Refresh-token rotation racing between the daemon and concurrent relayfile CLI processes. Both share ~/.relayfile/cloud-credentials.json but each holds its own in-memory copy. Sequence:

  1. Daemon starts, loads refreshToken = R1 into memory.
  2. Any concurrent relayfile <subcommand> (status, tree, read, pull, …) that refreshes triggers the cloud to rotate R1 → R2 and write R2 to disk.
  3. Daemon's access token expires (currently a 30-minute TTL — short enough that this happens routinely).
  4. Daemon refreshes with its in-memory R1. Cloud has already invalidated R1403 "Invalid or expired refresh token".
  5. Daemon declares auth dead, restarts its sync loop, re-reads disk, briefly picks up R2.
  6. Next CLI invocation rotates R2 → R3. Goto 4.

The on-disk refresh token is not expired (it has a 30-day TTL); the cloud is invalidating it the moment a rotation happens elsewhere. This is invisible to a casual look at cloud-credentials.json (which still shows healthy refreshTokenExpiresAt).

Empirical evidence from today's session

  • Disk state at investigation time:
    • accessTokenExpiresAt: ~18 min in the future ✅
    • refreshTokenExpiresAt: ~30 days in the future ✅
    • updatedAt: 12 minutes ago (recent rotation)
  • Daemon process etime: 05:36 (alive and steady, never killed) — but Synced mirror started appeared 14 times in the same 5:36 window. That's the sync loop restarting ~every 24s inside the same process, not the process dying.
  • Cycle outcome counts in the same window:
    • Synced mirror started: 14
    • mount sync stopping: 13
    • mount sync cycle failed: 15
    • mount sync cycle completed: 14 (all from earlier, healthy window)
  • The thrash started immediately after the first cycle did real work — which is consistent with the first refresh attempt rotating, and from then on someone else (a CLI invocation or the daemon's own next attempt) racing it.

Why this surfaces now

PR #176 ("Harden mount daemon recovery follow-ups") appears to have added an auto-restart-on-auth-failure path. Pre-#176, a transient auth failure may have just been logged and the cycle continued or retried within the same loop. Post-#176, a persistent auth condition turns a single contention event into a continuous restart loop, which is now actively preventing convergence even though #177's checkpoint mechanism is working.

Proposed fixes (any one helps, ideally all three)

  1. Reread credentials from disk on auth failure before restarting. When the in-memory refresh token gets a 403, re-load ~/.relayfile/cloud-credentials.json and retry once with the disk value before declaring auth dead. Absorbs the rotation that happened in another process.
  2. Backoff + clear "needs re-login" state instead of tight restart. If reload-and-retry also 403s, the daemon should switch to a daemon: auth-failed (run \relayfile login`)state surfaced inrelayfile status`, and NOT churn the sync loop every 25s. Catastrophic-mode behavior should be visible to operators, not silent thrash.
  3. (Cross-cut, optional cloud change) For daemon sessions, consider issuing non-rotating refresh tokens, or a dedicated long-lived daemon credential that doesn't rotate on use. The rotation model is fine for human-driven CLI use; it's a footgun for a long-running daemon that necessarily shares creds with concurrent CLI calls.

Related

Reproducible

Today's session is the repro: workspace with large event backlog, run the daemon, run a few relayfile status / relayfile tree / relayfile pull invocations concurrently while it's grinding. Daemon will enter the loop within minutes.

🤖 Generated with Claude Code

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions