Skip to content

fix(mounts): exponential backoff + cap on forced mount restarts (storm fix)#383

Merged
khaliqgant merged 1 commit into
mainfrom
fix/mount-restart-backoff
Jun 18, 2026
Merged

fix(mounts): exponential backoff + cap on forced mount restarts (storm fix)#383
khaliqgant merged 1 commit into
mainfrom
fix/mount-restart-backoff

Conversation

@khaliqgant

@khaliqgant khaliqgant commented Jun 18, 2026

Copy link
Copy Markdown
Member

Problem

The integration-mount watchdog respawned a wedged mount on every health poll (~45–60s) forever364 restarts on one github mount since 06-10 — and each fresh process re-leaks file descriptors from zero while it mirrors. Observed live: 12 mount processes, one pegged at ~41,000 fds (near the 92,160 macOS per-proc ceiling).

Root mechanism

queueForcedRestart's flat 60s throttle was defeated: the restart path (mount()stopHandle) deleted authRestartedAt every restart, so the throttle never actually spaced restarts apart. That's why it looped every poll.

Fix

  • Backoff state (authRestartedAt + new restartAttempts) persists across a forced restartstopHandle no longer clears it; cleared only on genuine teardown (stopAll) or when a path is dropped from the desired set.
  • Forced restarts back off exponentially from the 60s floor (60s → 120s → 240s → 480s → 960s), capped at 30min.
  • After 5 consecutive rapid restarts, auto-restart pauses and escalates via a new restart-cap-exceeded MountHealthAlert instead of hot-looping a fresh FD-leaking process. A mount stable for 30min resets the counter, so the far-apart hourly token-refresh restart never accrues backoff.
  • Tests: rewrote the multi-restart test to assert backoff (restart → held within window → allowed after) + a new cap/escalation test. 37 pass; tsc --noEmit clean.

Deliberately out of scope (follow-ups)

Two related issues remain, documented here from the investigation:

  1. Over-broad github mirror scopemountSpecsFor() blanket-applies syncMode: 'mirror' + read:<path>/** to every path (the justifying comment is Slack-specific). For /github/repos this reads down the entire contents/** tree (incl. .trajectories/...), which feeds the FD leak. Narrowing this needs a product decision on what github data Pear/the factory actually need locally (the factory reads github via its own cloud client, not this mirror), so it's split out.
  2. Upstream FD leak — the Go relayfile-mount binary appears not to close the REG/DIR handles it opens during the mirror reconcile walk; the leak is per-process and unbounded by scope. This backoff fix stops the storm (and the per-restart re-leak), but a single long-lived large mirror can still climb — the real fix is upstream + scope (1).

🤖 Generated with Claude Code

Review in cubic

…m fix)

The integration mount watchdog respawned a wedged mount every health poll
(~45-60s) indefinitely — 364 restarts on one github mount since 06-10 — and
each fresh process re-leaks file descriptors from zero while it mirrors
(observed: one mount pegged at ~41k fds, near the 92k macOS ceiling).

Root mechanism: queueForcedRestart's flat 60s throttle was DEFEATED because the
restart path (mount() -> stopHandle) deleted authRestartedAt every restart, so
the throttle never actually spaced restarts apart.

- Backoff state (authRestartedAt + a new restartAttempts counter) now PERSISTS
  across a forced restart; stopHandle no longer clears it. It is cleared only on
  genuine teardown (stopAll) or when a path is dropped from the desired set.
- Forced restarts back off exponentially from the 60s floor (60s, 120s, 240s,
  480s, 960s) capped at 30min, instead of a flat 60s.
- After MOUNT_RESTART_CONSECUTIVE_CAP (5) rapid restarts, auto-restart pauses and
  escalates via a new `restart-cap-exceeded` MountHealthAlert instead of looping
  a fresh FD-leaking process forever. A mount stable for MOUNT_RESTART_RESET_MS
  (30min) resets the counter, so the far-apart hourly token-refresh restart never
  accrues backoff.
- Tests: rewrote the multi-restart test to assert backoff (restart, held within
  window, allowed after) + a new cap+escalation test. 37 pass; typecheck clean.

Does NOT change the github mirror scope (the FD-leak amplifier) — see PR notes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Copy link
Copy Markdown

Warning

Review limit reached

@khaliqgant, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 4 minutes and 51 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 34f364ed-9604-4e6b-983d-96b91cd6fc15

📥 Commits

Reviewing files that changed from the base of the PR and between 6d52cdd and aa855a3.

📒 Files selected for processing (2)
  • src/main/integration-mounts.test.ts
  • src/main/integration-mounts.ts
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/mount-restart-backoff

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces exponential backoff and a consecutive restart cap for forced integration mount restarts in IntegrationMountManager to prevent rapid respawn loops and file descriptor leaks. The review feedback suggests using performance.now() instead of Date.now() to ensure a monotonic clock is used for measuring elapsed time, preventing issues caused by system clock adjustments.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@@ -511,8 +537,26 @@ export class IntegrationMountManager {
if (!this.handles.has(remotePath)) return false
const now = Date.now()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using Date.now() for measuring elapsed time for backoff/throttling can be susceptible to system clock changes, timezone adjustments, or NTP synchronization (especially common in desktop environments like Electron when a laptop is suspended/resumed).

Using performance.now() provides a monotonic clock that is guaranteed to only increase, preventing potential issues where clock drift or manual time changes bypass the backoff or cause extremely long, unexpected throttling periods.

Suggested change
const now = Date.now()
const now = performance.now()

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aa855a32d5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

console.warn(
`[integration-mounts] Mount ${remotePath} reached ${MOUNT_RESTART_CONSECUTIVE_CAP} consecutive restarts (${reason}); pausing auto-restart for ${Math.round(MOUNT_RESTART_RESET_MS / 60_000)}min pending manual recovery`
)
this.healthObserver?.({ type: 'restart-cap-exceeded', remotePath, attempts: nextAttempts, reason })

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Propagate restart-cap alerts to the renderer

When a mount hits this cap, queueForcedRestart stops scheduling retries until the reset window, so the new alert is the only structured escalation path. In production, the only observer installed in src/main/integrations.ts handles auth-required and then returns unless alert.type === 'auth-stall' around the mount-health observer, and the IPC event types do not include restart-cap-exceeded, so this alert is dropped and users get no recovery signal while the mount remains wedged.

Useful? React with 👍 / 👎.

@agent-relay-code

Copy link
Copy Markdown
Contributor

Confirmed: the only consumer of MountHealthAlert (integrations.ts:930) uses if-checks with early returns, not an exhaustive switch/never assertion. The new restart-cap-exceeded variant is safely ignored there. No downstream type breakage.

My review is complete. The PR is internally consistent, semantically conservative, and passes lint, typecheck, and the full vitest suite (451 tests). No edits are warranted — the change is in safety-critical restart/lifecycle code, which I am explicitly directed not to modify, and it requires no mechanical fixes.

Summary

PR #383 (fix/mount-restart-backoff) adds exponential backoff and a consecutive-restart cap to forced integration-mount restarts in src/main/integration-mounts.ts, plus two updated/new tests. I validated the diff against the current checkout — it applies cleanly and matches.

Verification (the way CI runs it):

  • npx eslint on both changed files — clean
  • npm run typecheck:node — clean
  • npx vitest run (full suite, the CI gate) — 31 files / 451 tests passing, including the 37 in integration-mounts.test.ts

Correctness/safety review:

  • The change only suppresses restarts (returns false from queueForcedRestart during the backoff window or after the cap). It never converts a failure/timeout into a fake success — direction is fail-safe, not fail-open.
  • The restart-cap-exceeded addition to MountHealthAlert is additive. The sole consumer (src/main/integrations.ts:930) uses if-based type narrowing with early returns (not an exhaustive switch/never), so the new variant is safely ignored — no type or runtime breakage downstream.
  • stopHandle deliberately no longer clears authRestartedAt/restartAttempts (the storm root cause); they are cleared only on genuine teardown (stopAll) and on desired-set drop (mount()), both of which the diff handles. The wall-clock throttle is independent of the pre-existing handledHealthErrorKeys bookkeeping, so a genuinely new error signature still re-enters and is correctly deferred.

This is lifecycle/restart code, which I am directed to treat as review-only. It needed no mechanical fixes, so I left the working tree unchanged.

Addressed comments

  • No bot or human review comments were present in the provided PR metadata (.workforce/context.json contains no comments/reviews, and there are no cached review threads in .workforce/). Nothing to address.

Advisory Notes

  • None. The change is correctly scoped to the PR's stated purpose (mount restart-storm fix) and introduces no out-of-scope edits.

I am not printing READY: I cannot observe live CI status or GitHub mergeability from this sandbox (those are post-harness/cloud-reported), and the no-comment state plus a green local run does not by itself establish that every required remote check has completed and passed.

@khaliqgant khaliqgant merged commit 9212c69 into main Jun 18, 2026
5 checks passed
@khaliqgant khaliqgant deleted the fix/mount-restart-backoff branch June 18, 2026 17:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant