fix(mounts): exponential backoff + cap on forced mount restarts (storm fix) #383
…m fix)

The integration mount watchdog respawned a wedged mount every health poll (~45-60s) indefinitely (364 restarts on one github mount since 06-10), and each fresh process re-leaks file descriptors from zero while it mirrors (observed: one mount pegged at ~41k fds, near the 92k macOS ceiling).

Root mechanism: queueForcedRestart's flat 60s throttle was DEFEATED because the restart path (mount() -> stopHandle) deleted authRestartedAt every restart, so the throttle never actually spaced restarts apart.

- Backoff state (authRestartedAt + a new restartAttempts counter) now PERSISTS across a forced restart; stopHandle no longer clears it. It is cleared only on genuine teardown (stopAll) or when a path is dropped from the desired set.
- Forced restarts back off exponentially from the 60s floor (60s, 120s, 240s, 480s, 960s) capped at 30min, instead of a flat 60s.
- After MOUNT_RESTART_CONSECUTIVE_CAP (5) rapid restarts, auto-restart pauses and escalates via a new `restart-cap-exceeded` MountHealthAlert instead of looping a fresh FD-leaking process forever. A mount stable for MOUNT_RESTART_RESET_MS (30min) resets the counter, so the far-apart hourly token-refresh restart never accrues backoff.
- Tests: rewrote the multi-restart test to assert backoff (restart, held within window, allowed after) + a new cap+escalation test. 37 pass; typecheck clean.

Does NOT change the github mirror scope (the FD-leak amplifier); see PR notes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
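The backoff, cap, and reset rules in the commit message can be sketched as a small pure function. This is an illustration only, not the code in the PR: `RESTART_FLOOR_MS`, `RESTART_CEILING_MS`, `RestartState`, `backoffDelayMs`, and `decideRestart` are hypothetical names, "stable for 30min" is approximated as "30min since the last restart", and the order in which the real `queueForcedRestart` checks the cap and the backoff window is not shown in the source.

```typescript
// Numeric values come from the commit message; RESTART_FLOOR_MS and
// RESTART_CEILING_MS are hypothetical names for the 60s floor and 30min cap.
const RESTART_FLOOR_MS = 60_000
const RESTART_CEILING_MS = 30 * 60_000
const MOUNT_RESTART_CONSECUTIVE_CAP = 5
const MOUNT_RESTART_RESET_MS = 30 * 60_000

// Hypothetical per-mount state; the fields stand in for authRestartedAt and
// restartAttempts, which the PR keeps alive across a forced restart.
interface RestartState {
  lastRestartAt: number
  attempts: number
}

type Decision = 'restart' | 'throttled' | 'cap-exceeded'

// Wait required after `attempts` restarts: 60s, 120s, 240s, ... capped at 30min.
function backoffDelayMs(attempts: number): number {
  if (attempts <= 0) return 0
  return Math.min(RESTART_FLOOR_MS * 2 ** (attempts - 1), RESTART_CEILING_MS)
}

function decideRestart(
  state: RestartState | undefined,
  now: number
): { decision: Decision; next: RestartState } {
  const fresh: RestartState = { lastRestartAt: now, attempts: 1 }
  if (state === undefined) return { decision: 'restart', next: fresh }

  const elapsed = now - state.lastRestartAt
  // Assumption: a quiet reset window counts as "stable" and clears the counter.
  if (elapsed >= MOUNT_RESTART_RESET_MS) return { decision: 'restart', next: fresh }
  // Too many rapid restarts: stop respawning and let the caller escalate.
  if (state.attempts >= MOUNT_RESTART_CONSECUTIVE_CAP) {
    return { decision: 'cap-exceeded', next: state }
  }
  // Still inside the backoff window: hold the restart and keep the state.
  if (elapsed < backoffDelayMs(state.attempts)) return { decision: 'throttled', next: state }
  return { decision: 'restart', next: { lastRestartAt: now, attempts: state.attempts + 1 } }
}
```

The key property is that `throttled` and `cap-exceeded` return the state unchanged. If the restart path discarded the state instead, as the old `stopHandle` did with `authRestartedAt`, every call would take the `state === undefined` branch and restart immediately.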
Code Review
This pull request introduces exponential backoff and a consecutive restart cap for forced integration mount restarts in IntegrationMountManager to prevent rapid respawn loops and file descriptor leaks. The review feedback suggests using performance.now() instead of Date.now() to ensure a monotonic clock is used for measuring elapsed time, preventing issues caused by system clock adjustments.
@@ -511,8 +537,26 @@ export class IntegrationMountManager {
    if (!this.handles.has(remotePath)) return false
    const now = Date.now()
Using Date.now() to measure elapsed time for backoff/throttling is susceptible to wall-clock changes such as manual time adjustments or NTP synchronization (especially common in desktop environments like Electron when a laptop is suspended and resumed).
Using performance.now() provides a monotonic clock that is guaranteed to only increase, preventing potential issues where clock drift or manual time changes bypass the backoff or cause extremely long, unexpected throttling periods.
- const now = Date.now()
+ const now = performance.now()
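A minimal sketch of the suggestion, assuming a throttle that only ever compares timestamps taken inside the same process. `RestartThrottle`, `tryAcquire`, and the injectable `Clock` are hypothetical names used for illustration; they are not taken from `IntegrationMountManager`.

```typescript
// A clock returning milliseconds. The default reads performance.now(), which is
// monotonic within a process; tests can inject a fake clock instead.
type Clock = () => number

class RestartThrottle {
  private readonly lastAt = new Map<string, number>()
  private readonly windowMs: number
  private readonly clock: Clock

  constructor(windowMs: number, clock: Clock = () => performance.now()) {
    this.windowMs = windowMs
    this.clock = clock
  }

  // Returns true and records the timestamp if `key` is outside its window;
  // returns false, leaving the recorded timestamp untouched, otherwise.
  tryAcquire(key: string): boolean {
    const now = this.clock()
    const last = this.lastAt.get(key)
    if (last !== undefined && now - last < this.windowMs) return false
    this.lastAt.set(key, now)
    return true
  }
}
```

Two caveats apply. `performance.now()` values are relative to the process's time origin, so they cannot be persisted across processes or mixed with `Date.now()` timestamps. Whether the counter advances while the machine is asleep can differ between platforms, so a suspended laptop may still see shorter or longer effective windows than intended.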
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aa855a32d5
console.warn(
  `[integration-mounts] Mount ${remotePath} reached ${MOUNT_RESTART_CONSECUTIVE_CAP} consecutive restarts (${reason}); pausing auto-restart for ${Math.round(MOUNT_RESTART_RESET_MS / 60_000)}min pending manual recovery`
)
this.healthObserver?.({ type: 'restart-cap-exceeded', remotePath, attempts: nextAttempts, reason })
Propagate restart-cap alerts to the renderer
When a mount hits this cap, queueForcedRestart stops scheduling retries until the reset window, so the new alert is the only structured escalation path. In production, the only observer installed in src/main/integrations.ts, the mount-health observer, handles auth-required and then returns unless alert.type === 'auth-stall'. The IPC event types do not include restart-cap-exceeded either, so this alert is dropped and users get no recovery signal while the mount remains wedged.
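One way to keep a new alert type from being dropped silently is an exhaustive switch over the alert union, so the compiler flags any variant that no branch handles. The sketch below is hypothetical throughout: only the `restart-cap-exceeded` fields (`remotePath`, `attempts`, `reason`) appear in the diff, while the fields of the other variants, `RendererNotice`, and `toRendererNotice` are assumptions and do not reflect the real observer or IPC types in src/main/integrations.ts.

```typescript
// Assumed alert union; only the restart-cap-exceeded shape is taken from the diff.
type MountHealthAlert =
  | { type: 'auth-required'; remotePath: string }
  | { type: 'auth-stall'; remotePath: string }
  | { type: 'restart-cap-exceeded'; remotePath: string; attempts: number; reason: string }

// Hypothetical payload that the main process would forward to the renderer.
interface RendererNotice {
  remotePath: string
  severity: 'info' | 'action-required'
  message: string
}

function toRendererNotice(alert: MountHealthAlert): RendererNotice {
  switch (alert.type) {
    case 'auth-required':
      return {
        remotePath: alert.remotePath,
        severity: 'action-required',
        message: 'Re-authentication is required'
      }
    case 'auth-stall':
      return { remotePath: alert.remotePath, severity: 'info', message: 'Authentication is stalled' }
    case 'restart-cap-exceeded':
      return {
        remotePath: alert.remotePath,
        severity: 'action-required',
        message: `Auto-restart paused after ${alert.attempts} restarts (${alert.reason})`
      }
    default: {
      // Compile-time guard: adding a variant without a case above fails to type-check.
      const unhandled: never = alert
      throw new Error(`Unhandled mount health alert: ${JSON.stringify(unhandled)}`)
    }
  }
}
```

With this shape, adding a fourth alert type to the union without a matching case becomes a type error rather than a runtime no-op.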
Confirmed: the only consumer of

My review is complete. The PR is internally consistent, semantically conservative, and passes lint, typecheck, and the full vitest suite (451 tests). No edits are warranted: the change is in safety-critical restart/lifecycle code, which I am explicitly directed not to modify, and it requires no mechanical fixes.

Summary

PR #383 (

Verification (the way CI runs it):

Correctness/safety review:

This is lifecycle/restart code, which I am directed to treat as review-only. It needed no mechanical fixes, so I left the working tree unchanged.

Addressed comments

Advisory Notes

I am not printing READY: I cannot observe live CI status or GitHub mergeability from this sandbox (those are post-harness/cloud-reported), and the no-comment state plus a green local run does not by itself establish that every required remote check has completed and passed.
Problem
The integration-mount watchdog respawned a wedged mount on every health poll (~45–60s) forever — 364 restarts on one github mount since 06-10 — and each fresh process re-leaks file descriptors from zero while it mirrors. Observed live: 12 mount processes, one pegged at ~41,000 fds (near the 92,160 macOS per-proc ceiling).
Root mechanism
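queueForcedRestart's flat 60s throttle was defeated: the restart path (mount() → stopHandle) deleted authRestartedAt every restart, so the throttle never actually spaced restarts apart. That's why it looped every poll.
Fix
- Backoff state (authRestartedAt + new restartAttempts) persists across a forced restart; stopHandle no longer clears it. It is cleared only on genuine teardown (stopAll) or when a path is dropped from the desired set.
- Forced restarts back off exponentially from the 60s floor (60s, 120s, 240s, 480s, 960s), capped at 30min, instead of a flat 60s.
- After MOUNT_RESTART_CONSECUTIVE_CAP (5) rapid restarts, auto-restart pauses and escalates via a new restart-cap-exceeded MountHealthAlert instead of hot-looping a fresh FD-leaking process. A mount stable for 30min resets the counter, so the far-apart hourly token-refresh restart never accrues backoff.
- Tests: rewrote the multi-restart test to assert backoff (restart, held within window, allowed after) and added a new cap + escalation test. 37 pass; tsc --noEmit clean.
Deliberately out of scope (follow-ups)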
Two related issues remain, documented here from the investigation:
1. mountSpecsFor() blanket-applies syncMode: 'mirror' + read:<path>/** to every path (the justifying comment is Slack-specific). For /github/repos this reads down the entire contents/** tree (incl. .trajectories/...), which feeds the FD leak. Narrowing this needs a product decision on what github data Pear/the factory actually need locally (the factory reads github via its own cloud client, not this mirror), so it's split out.
2. relayfile-mount binary appears not to close the REG/DIR handles it opens during the mirror reconcile walk; the leak is per-process and unbounded by scope. This backoff fix stops the storm (and the per-restart re-leak), but a single long-lived large mirror can still climb; the real fix is upstream + scope (1).

🤖 Generated with Claude Code