feat(integration-mounts): stalled-revision mount-recovery backstop#169
Conversation
|
Your free trial PR review limit of 300 PRs has been reached. Please upgrade your plan to continue using CodeAnt AI. |
|
Warning Review limit reached
More reviews will be available in 58 minutes and 34 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (2)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a mechanism to detect and recover from stalled injected revisions (silently dropped incremental events) by scanning the integration events log and forcing a mount restart with a fresh full pull when necessary. It also disables websocket transport for Slack /threads subtrees to enforce more frequent polling. The review feedback highlights three key areas for improvement: optimizing performance by caching the parsed log lines to avoid redundant disk reads across multiple active mounts, narrowing the Slack threads detection to prevent false positives on other integration paths, and addressing potential unbounded growth of the integration events log file.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| let logText: string | ||
| try { | ||
| logText = await readFile(INTEGRATION_EVENT_LOG_PATH, 'utf8') | ||
| } catch { | ||
| return null | ||
| } | ||
|
|
||
| const syncedPaths = new Set(Object.keys(asRecord(state.files) ?? {})) | ||
| // omitempty in the daemon's publicState: a never-reconciled mount omits this, | ||
| // so lastReconcileMs stays null and every candidate is skipped below. | ||
| const lastReconcileMs = parseTimestamp( | ||
| typeof state.lastSuccessfulReconcileAt === 'string' ? state.lastSuccessfulReconcileAt : null | ||
| ) | ||
| const prefix = `${remotePath}/` | ||
| const now = Date.now() | ||
| const missing: Array<{ path: string; injectedAt: string }> = [] | ||
| const seen = new Set<string>() | ||
|
|
||
| for (const line of logText.trim().split(/\r?\n/u).slice(-STALLED_REVISION_LOG_SCAN_LINES)) { |
There was a problem hiding this comment.
Performance Bottleneck: Repeated File Reads and String Splitting
In checkMountHealth, readStalledInjectedRevisions is called sequentially for every active mount. Currently, each invocation reads the entire log file from disk and splits the entire string into lines. If there are
We can optimize this by caching the parsed lines for a short duration (e.g., 5 seconds) using static properties on the function itself. This ensures the file is read and parsed at most once per health check cycle, while automatically bypassing the cache during tests to keep them isolated.
const isTest = typeof process !== 'undefined' && (process.env.NODE_ENV === 'test' || process.env.VITEST === 'true')
const now = Date.now()
let lines: string[]
const self = readStalledInjectedRevisions as any
if (!isTest && self.cachedLines && now - self.cachedLinesTimestamp < 5000) {
lines = self.cachedLines
} else {
try {
const logText = await readFile(INTEGRATION_EVENT_LOG_PATH, 'utf8')
lines = logText.trim().split(/\\r?\\n/u).slice(-STALLED_REVISION_LOG_SCAN_LINES)
if (!isTest) {
self.cachedLines = lines
self.cachedLinesTimestamp = now
}
} catch {
return null
}
}
const syncedPaths = new Set(Object.keys(asRecord(state.files) ?? {}))
const lastReconcileMs = parseTimestamp(
typeof state.lastSuccessfulReconcileAt === 'string' ? state.lastSuccessfulReconcileAt : null
)
const prefix = `${remotePath}/`
const missing: Array<{ path: string; injectedAt: string }> = []
const seen = new Set<string>()
for (const line of lines) {| // cloud fs/events emitting the suffixed segment). The binary parses this via | ||
| // strconv.ParseBool, so 'false' definitively disables it and the var must be | ||
| // PASSED (omitting it falls back to true). | ||
| const isAliasSubtree = spec.remotePath.endsWith('/threads') |
There was a problem hiding this comment.
Overly Broad Path Match for Alias Subtree
The check spec.remotePath.endsWith('/threads') is used to identify Slack threads. However, this could accidentally match paths from other integration providers that happen to end with /threads (e.g., /linear/issues/threads or /github/threads), disabling their websocket transport unnecessarily.
To prevent false positives, we should restrict this check to Slack paths by ensuring the path starts with /slack/.
| const isAliasSubtree = spec.remotePath.endsWith('/threads') | |
| const isAliasSubtree = spec.remotePath.startsWith('/slack/') && spec.remotePath.endsWith('/threads') |
| const STALLED_REVISION_LOG_SCAN_LINES = 500 | ||
| // Mirrors integration-event-bridge.ts INTEGRATION_EVENT_LOG_PATH. Duplicated to | ||
| // avoid a mounts->bridge import cycle; hoist to a shared module if it grows. | ||
| const INTEGRATION_EVENT_LOG_PATH = join(homedir(), '.agentworkforce', 'pear', 'integration-events.log') |
There was a problem hiding this comment.
Unbounded Log File Growth
The log file integration-events.log is appended to indefinitely without any rotation or truncation mechanism. Over time, this file can grow to hundreds of megabytes or gigabytes, which will eventually cause readFile to fail or exhaust system memory.
Consider implementing a log rotation or truncation strategy (e.g., capping the file size or rotating it periodically) in the event bridge layer to ensure long-term stability.
|
Fixed one validated PR issue: the websocket stopgap now applies only to Slack channel/DM/user thread alias roots, instead of every mount ending in Added regression coverage that executes the generated launcher env for both Slack and Gmail cases: integration-mounts.test.ts. Addressed comments
Verification run:
|
|
✅ pr-reviewer applied fixes — committed and pushed Fixed one validated PR issue: the websocket stopgap now applies only to Slack channel/DM/user thread alias roots, instead of every mount ending in Added regression coverage that executes the generated launcher env for both Slack and Gmail cases: integration-mounts.test.ts. Addressed comments
Verification run:
|
There was a problem hiding this comment.
2 issues found across 2 files
You’re at about 98% of the monthly reviewed-line limit. You may want to disable incremental reviews to conserve quota. Reviews will continue until that limit is exceeded. If you need help avoiding interruptions, please contact contact@cubic.dev.
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
|
✅ pr-reviewer applied fixes — committed and pushed Implemented and verified fixes for the validated review findings. Changes made:
Addressed comments
Verification run:
I’m not printing |
|
Implemented one validated fix outside the original changed-file set: the new stalled-mount detector depended on Addressed comments
Local validation passed:
|
|
✅ pr-reviewer applied fixes — committed and pushed Implemented one validated fix outside the original changed-file set: the new stalled-mount detector depended on Addressed comments
Local validation passed:
|
|
✅ pr-reviewer applied fixes — committed and pushed Fixed a production duplicate-write bug in integration event breadcrumbs: non-debug Slack thread injections now persist through the breadcrumb path only, instead of also being written by generic debug logging. See integration-event-bridge.ts. Also tightened the breadcrumb test to wait for the async log write and assert exactly one persisted entry. See integration-event-bridge.test.ts and integration-event-bridge.test.ts. Addressed comments
Verification run:
|
|
ℹ️ pr-reviewer: review only — no file changes were applied to the PR (nothing to commit after review). The notes below are advisory and were not pushed. Reviewed PR #169 against Addressed comments
Local validation run:
I am not printing |
Adds a defense-in-depth detector to the mount health poll that catches a
mount which is cycling status:ready with no error yet has silently stopped
materializing new revisions — the failure mode behind the Slack threads
read-down outage (cloud emitted bare channel-id event paths that the mount's
literal-prefix isUnderRemoteRoot filter dropped). The durable fix for that
specific bug shipped server-side in cloud#2010 (live-verified in prod); this
is the general client-side backstop so any future silent incremental-sync
drop self-recovers instead of staying invisible until the ~100min full pull.
Detector (readStalledInjectedRevisions + a check in checkMountHealth, which
previously restarted ONLY on auth failure or deadline wedge): if the
event-bridge injected a revision under a mount (logged as "injecting" in
integration-events.log) that the mount's own state.json still doesn't list
after a 15min wall-clock grace AND the mount has successfully reconciled
since the revision arrived (lastSuccessfulReconcileAt > injectedAt), force a
queueForcedRestart(..., { clearState: true }) so the next full pull re-pulls
it. Reuses the existing 60s restart throttle + handledHealthErrorKeys dedup,
so one drop = one restart. General across mounts; only mounts with injecting
entries under their remoteRoot can match.
Correctness anchors (verified against the v0.8.19 daemon source): state.files
is keyed by the full normalized suffixed remote path, so the comparison is
apples-to-apples; lastSuccessfulReconcileAt is UTC RFC3339Nano + omitempty,
so a never-reconciled mount is skipped. Grace is wall-clock (not cycle-count)
to survive interval/transport retuning. Adds 7 unit tests for the detector.
(Originally also carried a threads-only RELAYFILE_MOUNT_WEBSOCKET=false
staleness-floor stopgap; dropped after cloud#2010 restored correct
incremental read-down with websocket on.)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
f4563b5 to
ce2d065
Compare
There was a problem hiding this comment.
1 issue found across 2 files
You’re at about 99% of the monthly reviewed-line limit. You may want to disable incremental reviews to conserve quota. Reviews will continue until that limit is exceeded. If you need help avoiding interruptions, please contact contact@cubic.dev.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/main/integration-mounts.ts">
<violation number="1" location="src/main/integration-mounts.ts:362">
P1: Stalled-event handling marks the issue as handled even when restart is throttled, so recovery may never be retried.</violation>
</file>
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
| this.handledHealthErrorKeys.set(remotePath, healthErrorKey) | ||
| if (queued) { | ||
| console.warn( | ||
| `[integration-mounts] Mount stalled (injected revisions never materialized) for ${remotePath}; restarting with fresh full pull`, | ||
| { | ||
| missing: stalled.missingCount, | ||
| oldestInjectedAt: stalled.oldestInjectedAt, | ||
| examples: stalled.examples | ||
| } | ||
| ) |
There was a problem hiding this comment.
P1: Stalled-event handling marks the issue as handled even when restart is throttled, so recovery may never be retried.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/main/integration-mounts.ts, line 362:
<comment>Stalled-event handling marks the issue as handled even when restart is throttled, so recovery may never be retried.</comment>
<file context>
@@ -328,6 +346,32 @@ export class IntegrationMountManager {
+ const healthErrorKey = ['stalled-events', stalled.oldestInjectedAt, stalled.missingCount].join('|')
+ if (this.handledHealthErrorKeys.get(remotePath) === healthErrorKey) continue
+ const queued = this.queueForcedRestart(remotePath, 'stalled events', { clearState: true })
+ this.handledHealthErrorKeys.set(remotePath, healthErrorKey)
+ if (queued) {
+ console.warn(
</file context>
| this.handledHealthErrorKeys.set(remotePath, healthErrorKey) | |
| if (queued) { | |
| console.warn( | |
| `[integration-mounts] Mount stalled (injected revisions never materialized) for ${remotePath}; restarting with fresh full pull`, | |
| { | |
| missing: stalled.missingCount, | |
| oldestInjectedAt: stalled.oldestInjectedAt, | |
| examples: stalled.examples | |
| } | |
| ) | |
| if (queued) { | |
| this.handledHealthErrorKeys.set(remotePath, healthErrorKey) | |
| console.warn( | |
| `[integration-mounts] Mount stalled (injected revisions never materialized) for ${remotePath}; restarting with fresh full pull`, | |
| { | |
| missing: stalled.missingCount, | |
| oldestInjectedAt: stalled.oldestInjectedAt, | |
| examples: stalled.examples | |
| } | |
| ) | |
| } |
What
A defense-in-depth detector in the mount health poll that catches a mount which is cycling
status:readywith no error yet has silently stopped materializing new revisions — and force-restarts it (withclearState) so the next full pull re-pulls the missed files.This is the general client-side backstop for the failure class behind the Slack
threadsread-down outage (cloud emitted bare channel-id event paths that the mount's literal-prefixisUnderRemoteRootfilter silently dropped, so the mount kept cycling healthy while never tracking new replies). The durable fix for that specific bug shipped server-side in cloud#2010 and is live-verified in prod (a fresh post-deploy reply materialized via incremental sync within seconds, no Pear restart). This PR ensures any future silent incremental-sync drop self-recovers instead of staying invisible until the ~100min periodic full pull.How
checkMountHealthpreviously force-restarted only on auth failure or a deadline/sync wedge — nothing detected "cycling successfully but not pulling new revisions." The new check (readStalledInjectedRevisions+ a branch in the per-handle loop) firesqueueForcedRestart(remotePath, 'stalled events', { clearState: true })iff:"injecting"inintegration-events.log) whose path is absent from the mount's ownstate.jsonfilesafter a 15min wall-clock grace, ANDlastSuccessfulReconcileAt > injectedAt) — proving it's alive and had a chance to pull it, so absence is a real drop, not an in-flight/slow pull or a down/booting mount.Reuses the existing 60s restart throttle +
handledHealthErrorKeysdedup (one drop = one restart, no storm). General across mounts — only mounts with"injecting"entries under theirremoteRootcan match, sodiscovery/*etc. never false-fire.Why the grace is minutes, not seconds
Under normal operation a received revision lands within one reconcile (incremental in seconds; a ws→poll fallback within ~5min). 15min clears that normal latency with wide margin and is well under the ~100min periodic full-pull self-heal, so the detector recovers a genuine drop far sooner than a scheduled full pull would, without thrashing on a merely in-flight pull. Wall-clock (not cycle-count) stays correct if interval/transport knobs are retuned later.
Correctness anchors (verified against the v0.8.19 daemon source)
state.filesismap[string]trackedFilekeyed by the full normalized suffixed remote path, sosyncedPaths.has(injectedPath)is full-suffixed vs full-suffixed — apples-to-apples (if it were remoteRoot-relative this would storm; it isn't).lastSuccessfulReconcileAtis in the daemon's publicstate.json, UTC RFC3339Nano (directly ISO-comparable),omitempty→ a never-reconciled mount omits it → detector skips it.INTEGRATION_EVENT_LOG_PATHis duplicated inintegration-mounts.tsto avoid amounts → bridgeimport cycle.Tests
npx vitest run src/main/integration-mounts.test.ts→ 31 passed (7 new for the detector: fires when aged+reconciled+absent; no-op within grace, when already tracked, when not reconciled since receipt, when never reconciled, for other mount roots, and when the events log is unreadable).tsc --noEmitclean (exit 0).Rollout
Draft / no rush — not blocking (cloud#2010 already fixed the active bug). Lands on a normal restart cycle whenever convenient; the detector runs in-process so it takes effect on the next app start. Do not merge/restart without operator greenlight.
🤖 Generated with Claude Code