mount daemon: auth thrash loop on refresh-token rotation with concurrent CLI processes

## Symptom

The mount daemon enters a tight restart loop (one restart roughly every 25 seconds), emitting:

```
mount local change failed: join cloud workspace: http 401: Unauthorized
cloud session refresh failed: refresh cloud session: http 403: Invalid or expired refresh token
mount sync cycle failed: context canceled
mount sync stopping: context canceled
Synced mirror started at …
…repeat…
```

Every cycle is canceled before it completes meaningful work. On a workspace with a 53k-event backlog this means **non-convergent catch-up**, even though the actual catch-up logic (#177) works correctly. In today's session on `rw_fc7b534b`, the very first cycle after restart cleanly advanced the cursor `evt_1 → evt_17493` and added ~1050 files. The daemon then entered the auth thrash loop and **all subsequent cycles were canceled**: files stayed flat at 3766 and the cursor stayed pinned at `evt_17493` for the rest of the session.

## Root cause

**Refresh-token rotation racing between the daemon and concurrent `relayfile` CLI processes.** Both share `~/.relayfile/cloud-credentials.json` but each holds its own in-memory copy. Sequence:

1. Daemon starts, loads `refreshToken = R1` into memory.
2. Any concurrent `relayfile <subcommand>` (status, tree, read, pull, …) that refreshes causes the cloud to rotate `R1 → R2`; that CLI process then writes `R2` to disk.
3. Daemon's access token expires (currently a **30-minute** TTL — short enough that this happens routinely).
4. Daemon refreshes with its in-memory `R1`. Cloud has already invalidated `R1` → **403 "Invalid or expired refresh token"**.
5. Daemon declares auth dead, **restarts its sync loop**, re-reads disk, briefly picks up `R2`.
6. Next CLI invocation rotates `R2 → R3`. Go back to step 4.

The on-disk refresh token is *not* expired (it has a 30-day TTL); what the cloud rejects is the daemon's stale in-memory copy, which is invalidated the moment a rotation happens elsewhere. This is invisible to a casual look at `cloud-credentials.json` (which still shows a healthy `refreshTokenExpiresAt`).
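The sequence above can be reproduced with a toy model. This is a minimal sketch, not the actual relayfile or cloud code: `RotatingAuthServer` and `Client` are hypothetical names, the server is assumed to accept only the most recently issued refresh token, and a plain dict stands in for `cloud-credentials.json`.

```python
class RotatingAuthServer:
    """Toy stand-in for the cloud: every successful refresh rotates the token."""

    def __init__(self):
        self.counter = 1
        self.valid = "R1"  # the only refresh token currently accepted

    def refresh(self, token):
        if token != self.valid:
            return 403, None  # "Invalid or expired refresh token"
        self.counter += 1
        self.valid = f"R{self.counter}"
        return 200, self.valid


class Client:
    """A process (daemon or CLI) holding its own in-memory copy of the token."""

    def __init__(self, disk):
        self.disk = disk
        self.token = disk["refreshToken"]  # loaded once at startup

    def refresh(self, server):
        status, new_token = server.refresh(self.token)
        if status == 200:
            self.token = new_token
            self.disk["refreshToken"] = new_token  # persist the rotated token
        return status


disk = {"refreshToken": "R1"}     # shared credentials "file"
server = RotatingAuthServer()
daemon = Client(disk)             # step 1: daemon loads R1
cli = Client(disk)                # a concurrent CLI process also loads R1

cli_status = cli.refresh(server)        # step 2: R1 -> R2, R2 written to disk
daemon_status = daemon.refresh(server)  # step 4: daemon still presents R1

print(cli_status, daemon_status, disk["refreshToken"])
```

In this model the daemon's refresh fails even though the token on disk is perfectly valid, which matches the observation that the credentials file looks healthy while the daemon thrashes.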

## Empirical evidence from today's session

- Disk state at investigation time:
  - `accessTokenExpiresAt`: ~18 min in the future ✅
  - `refreshTokenExpiresAt`: ~30 days in the future ✅
  - `updatedAt`: 12 minutes ago (recent rotation)
- Daemon process etime: `05:36` (alive and steady, never killed) — but `Synced mirror started` appeared **14 times** in the same 5:36 window. That's the sync loop restarting ~every 24s **inside the same process**, not the process dying.
- Cycle outcome counts in the same window:
  - `Synced mirror started`: 14
  - `mount sync stopping`: 13
  - `mount sync cycle failed`: 15
  - `mount sync cycle completed`: 14 (all from earlier, healthy window)
- The thrash started immediately after the first cycle did real work — which is consistent with the first refresh attempt rotating, and from then on someone else (a CLI invocation or the daemon's own next attempt) racing it.
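The restart interval quoted above follows directly from the process etime and the start count. A small check of that arithmetic (the helper name `etime_to_seconds` is made up for this example, and it only handles the `MM:SS` form seen here):

```python
def etime_to_seconds(etime: str) -> int:
    """Convert an MM:SS elapsed-time string to seconds."""
    minutes, seconds = etime.split(":")
    return int(minutes) * 60 + int(seconds)


window_seconds = etime_to_seconds("05:36")  # daemon process etime
starts = 14                                 # "Synced mirror started" count
interval = window_seconds / starts          # average seconds between restarts

print(f"{window_seconds}s / {starts} starts = {interval:.0f}s per restart")
```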

## Why this surfaces now

PR #176 ("Harden mount daemon recovery follow-ups") appears to have added an auto-restart-on-auth-failure path. Pre-#176, a transient auth failure may have just been logged and the cycle continued or retried within the same loop. Post-#176, a *persistent* auth condition turns a single contention event into a continuous restart loop, which is now actively preventing convergence even though #177's checkpoint mechanism is working.

## Proposed fixes (any one helps, ideally all three)

1. **Reread credentials from disk on auth failure before restarting.** When the in-memory refresh token gets a 403, re-load `~/.relayfile/cloud-credentials.json` and retry once with the disk value before declaring auth dead. Absorbs the rotation that happened in another process.
2. **Backoff + clear "needs re-login" state instead of tight restart.** If reload-and-retry also 403s, the daemon should switch to a `` daemon: auth-failed (run `relayfile login`) `` state surfaced in `relayfile status`, and NOT churn the sync loop every 25s. Catastrophic-mode behavior should be visible to operators, not silent thrash.
3. **(Cross-cut, optional cloud change)** For daemon sessions, consider issuing **non-rotating refresh tokens**, or a dedicated long-lived daemon credential that doesn't rotate on use. The rotation model is fine for human-driven CLI use; it's a footgun for a long-running daemon that necessarily shares creds with concurrent CLI calls.
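Fixes 1 and 2 could look roughly like the following. This is a sketch under stated assumptions rather than the daemon's real code: `refresh_with_disk_reload`, `AuthFailed`, and the `refresh_fn(token) -> (status, new_token)` callback are hypothetical, and the credentials file is assumed to store the token under a `refreshToken` key.

```python
import json
from pathlib import Path


class AuthFailed(Exception):
    """Raised when both the in-memory and on-disk refresh tokens are rejected."""


def refresh_with_disk_reload(in_memory_token, credentials_path, refresh_fn):
    """Try the in-memory token; on 403, reload from disk and retry once.

    refresh_fn(token) is a hypothetical callback returning (http_status, new_token).
    """
    status, new_token = refresh_fn(in_memory_token)
    if status == 200:
        return new_token

    if status == 403:
        # Another process may have rotated the token; pick up its value.
        on_disk = json.loads(Path(credentials_path).read_text())
        disk_token = on_disk["refreshToken"]  # assumed key name
        if disk_token != in_memory_token:
            status, new_token = refresh_fn(disk_token)
            if status == 200:
                return new_token

    # Fix 2: surface a terminal state instead of restarting the sync loop.
    raise AuthFailed("auth-failed (run `relayfile login`)")
```

A caller would catch `AuthFailed`, stop scheduling sync cycles, and report the state through `relayfile status` instead of restarting. The sketch does not persist the new token or lock the credentials file, so two processes that read the same disk token at the same moment can still race; it narrows the window but does not close it.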

## Related

- This is the actual blocker preventing #175 F's fix (PR #177) from delivering its intended benefit on workspaces with a real backlog. With #177 alone, the daemon can converge — but only if the auth path isn't thrashing concurrently.
- Symptom shape (silent restart loop, no clear operator surface) overlaps with #175 C (`relayfile status` reports `daemon: not running` for foreground mounts) — both are visibility gaps where the daemon enters a bad state without the CLI showing it clearly.

## Reproducible

Today's session is the repro: workspace with large event backlog, run the daemon, run a few `relayfile status` / `relayfile tree` / `relayfile pull` invocations concurrently while it's grinding. Daemon will enter the loop within minutes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

