fix(mounts): exponential backoff + cap on forced mount restarts (storm fix) by khaliqgant · Pull Request #383 · AgentWorkforce/pear

khaliqgant · 2026-06-18T14:02:19Z

Problem

The integration-mount watchdog respawned a wedged mount on every health poll (~45–60s) forever — 364 restarts on one github mount since 06-10 — and each fresh process re-leaks file descriptors from zero while it mirrors. Observed live: 12 mount processes, one pegged at ~41,000 fds (near the 92,160 macOS per-proc ceiling).

Root mechanism

queueForcedRestart's flat 60s throttle was defeated: the restart path (mount() → stopHandle) deleted authRestartedAt every restart, so the throttle never actually spaced restarts apart. That's why it looped every poll.

Fix

Backoff state (authRestartedAt + new restartAttempts) persists across a forced restart — stopHandle no longer clears it; cleared only on genuine teardown (stopAll) or when a path is dropped from the desired set.
Forced restarts back off exponentially from the 60s floor (60s → 120s → 240s → 480s → 960s), capped at 30min.
After 5 consecutive rapid restarts, auto-restart pauses and escalates via a new restart-cap-exceeded MountHealthAlert instead of hot-looping a fresh FD-leaking process. A mount stable for 30min resets the counter, so the far-apart hourly token-refresh restart never accrues backoff.
Tests: rewrote the multi-restart test to assert backoff (restart → held within window → allowed after) + a new cap/escalation test. 37 pass; tsc --noEmit clean.

Deliberately out of scope (follow-ups)

Two related issues remain, documented here from the investigation:

Over-broad github mirror scope — mountSpecsFor() blanket-applies syncMode: 'mirror' + read:<path>/** to every path (the justifying comment is Slack-specific). For /github/repos this reads down the entire contents/** tree (incl. .trajectories/...), which feeds the FD leak. Narrowing this needs a product decision on what github data Pear/the factory actually need locally (the factory reads github via its own cloud client, not this mirror), so it's split out.
Upstream FD leak — the Go relayfile-mount binary appears not to close the REG/DIR handles it opens during the mirror reconcile walk; the leak is per-process and unbounded by scope. This backoff fix stops the storm (and the per-restart re-leak), but a single long-lived large mirror can still climb — the real fix is upstream + scope (1).

🤖 Generated with Claude Code

…m fix) The integration mount watchdog respawned a wedged mount every health poll (~45-60s) indefinitely — 364 restarts on one github mount since 06-10 — and each fresh process re-leaks file descriptors from zero while it mirrors (observed: one mount pegged at ~41k fds, near the 92k macOS ceiling). Root mechanism: queueForcedRestart's flat 60s throttle was DEFEATED because the restart path (mount() -> stopHandle) deleted authRestartedAt every restart, so the throttle never actually spaced restarts apart. - Backoff state (authRestartedAt + a new restartAttempts counter) now PERSISTS across a forced restart; stopHandle no longer clears it. It is cleared only on genuine teardown (stopAll) or when a path is dropped from the desired set. - Forced restarts back off exponentially from the 60s floor (60s, 120s, 240s, 480s, 960s) capped at 30min, instead of a flat 60s. - After MOUNT_RESTART_CONSECUTIVE_CAP (5) rapid restarts, auto-restart pauses and escalates via a new `restart-cap-exceeded` MountHealthAlert instead of looping a fresh FD-leaking process forever. A mount stable for MOUNT_RESTART_RESET_MS (30min) resets the counter, so the far-apart hourly token-refresh restart never accrues backoff. - Tests: rewrote the multi-restart test to assert backoff (restart, held within window, allowed after) + a new cap+escalation test. 37 pass; typecheck clean. Does NOT change the github mirror scope (the FD-leak amplifier) — see PR notes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

coderabbitai · 2026-06-18T14:02:28Z

Warning

Review limit reached

@khaliqgant, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 4 minutes and 51 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 34f364ed-9604-4e6b-983d-96b91cd6fc15

📥 Commits

Reviewing files that changed from the base of the PR and between 6d52cdd and aa855a3.

📒 Files selected for processing (2)

src/main/integration-mounts.test.ts
src/main/integration-mounts.ts

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/mount-restart-backoff

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request introduces exponential backoff and a consecutive restart cap for forced integration mount restarts in IntegrationMountManager to prevent rapid respawn loops and file descriptor leaks. The review feedback suggests using performance.now() instead of Date.now() to ensure a monotonic clock is used for measuring elapsed time, preventing issues caused by system clock adjustments.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-18T14:04:48Z

@@ -511,8 +537,26 @@ export class IntegrationMountManager {
    if (!this.handles.has(remotePath)) return false
    const now = Date.now()


Using Date.now() for measuring elapsed time for backoff/throttling can be susceptible to system clock changes, timezone adjustments, or NTP synchronization (especially common in desktop environments like Electron when a laptop is suspended/resumed).

Using performance.now() provides a monotonic clock that is guaranteed to only increase, preventing potential issues where clock drift or manual time changes bypass the backoff or cause extremely long, unexpected throttling periods.

Suggested change

const now = Date.now()

const now = performance.now()

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aa855a32d5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-18T14:05:02Z

+      console.warn(
+        `[integration-mounts] Mount ${remotePath} reached ${MOUNT_RESTART_CONSECUTIVE_CAP} consecutive restarts (${reason}); pausing auto-restart for ${Math.round(MOUNT_RESTART_RESET_MS / 60_000)}min pending manual recovery`
+      )
+      this.healthObserver?.({ type: 'restart-cap-exceeded', remotePath, attempts: nextAttempts, reason })


Propagate restart-cap alerts to the renderer

When a mount hits this cap, queueForcedRestart stops scheduling retries until the reset window, so the new alert is the only structured escalation path. In production, the only observer installed in src/main/integrations.ts handles auth-required and then returns unless alert.type === 'auth-stall' around the mount-health observer, and the IPC event types do not include restart-cap-exceeded, so this alert is dropped and users get no recovery signal while the mount remains wedged.

Useful? React with 👍 / 👎.

agent-relay-code · 2026-06-18T14:08:22Z

Confirmed: the only consumer of MountHealthAlert (integrations.ts:930) uses if-checks with early returns, not an exhaustive switch/never assertion. The new restart-cap-exceeded variant is safely ignored there. No downstream type breakage.

My review is complete. The PR is internally consistent, semantically conservative, and passes lint, typecheck, and the full vitest suite (451 tests). No edits are warranted — the change is in safety-critical restart/lifecycle code, which I am explicitly directed not to modify, and it requires no mechanical fixes.

Summary

PR #383 (fix/mount-restart-backoff) adds exponential backoff and a consecutive-restart cap to forced integration-mount restarts in src/main/integration-mounts.ts, plus two updated/new tests. I validated the diff against the current checkout — it applies cleanly and matches.

Verification (the way CI runs it):

npx eslint on both changed files — clean
npm run typecheck:node — clean
npx vitest run (full suite, the CI gate) — 31 files / 451 tests passing, including the 37 in integration-mounts.test.ts

Correctness/safety review:

The change only suppresses restarts (returns false from queueForcedRestart during the backoff window or after the cap). It never converts a failure/timeout into a fake success — direction is fail-safe, not fail-open.
The restart-cap-exceeded addition to MountHealthAlert is additive. The sole consumer (src/main/integrations.ts:930) uses if-based type narrowing with early returns (not an exhaustive switch/never), so the new variant is safely ignored — no type or runtime breakage downstream.
stopHandle deliberately no longer clears authRestartedAt/restartAttempts (the storm root cause); they are cleared only on genuine teardown (stopAll) and on desired-set drop (mount()), both of which the diff handles. The wall-clock throttle is independent of the pre-existing handledHealthErrorKeys bookkeeping, so a genuinely new error signature still re-enters and is correctly deferred.

This is lifecycle/restart code, which I am directed to treat as review-only. It needed no mechanical fixes, so I left the working tree unchanged.

Addressed comments

No bot or human review comments were present in the provided PR metadata (.workforce/context.json contains no comments/reviews, and there are no cached review threads in .workforce/). Nothing to address.

Advisory Notes

None. The change is correctly scoped to the PR's stated purpose (mount restart-storm fix) and introduces no out-of-scope edits.

I am not printing READY: I cannot observe live CI status or GitHub mergeability from this sandbox (those are post-harness/cloud-reported), and the no-comment state plus a green local run does not by itself establish that every required remote check has completed and passed.

gemini-code-assist Bot reviewed Jun 18, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 18, 2026

View reviewed changes

khaliqgant merged commit 9212c69 into main Jun 18, 2026
5 checks passed

khaliqgant deleted the fix/mount-restart-backoff branch June 18, 2026 17:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(mounts): exponential backoff + cap on forced mount restarts (storm fix)#383

fix(mounts): exponential backoff + cap on forced mount restarts (storm fix)#383
khaliqgant merged 1 commit into
mainfrom
fix/mount-restart-backoff

khaliqgant commented Jun 18, 2026 •

edited by cubic-dev-ai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 18, 2026

Review limit reached

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 18, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 18, 2026

Uh oh!

agent-relay-code Bot commented Jun 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		@@ -511,8 +537,26 @@ export class IntegrationMountManager {
		if (!this.handles.has(remotePath)) return false
		const now = Date.now()

Conversation

khaliqgant commented Jun 18, 2026 • edited by cubic-dev-ai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Root mechanism

Fix

Deliberately out of scope (follow-ups)

Uh oh!

coderabbitai Bot commented Jun 18, 2026

Review limit reached

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

agent-relay-code Bot commented Jun 18, 2026

Summary

Addressed comments

Advisory Notes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

khaliqgant commented Jun 18, 2026 •

edited by cubic-dev-ai Bot

Loading