fix(mount): converge bootstrap by yielding slow/empty/429 export to resumable tree pull (#1499 / cloud #1516 Gate B)#223
Conversation
…esumable tree pull The atomic full-tree export (pullRemoteFullExport -> ExportFiles) is a single HTTP call that reports NO incremental progress until its whole body returns, and it has no resume cursor. On a slow / large / 429-contended workspace the no-progress bootstrap watchdog (bootstrapIdleTimeout, 90s) cancels it with zero files applied; the next cycle restarts the export from scratch and is cancelled again -> the production "non-empty without completed bootstrap -> forcing full reconcile" loop that never converges (the mount never loads the issue records, so the proactive persona times out with no artifact). #1499 / cloud #1516. Root cause (confirmed in code + prod outputTail): the non-resumable atomic export races the no-progress watchdog and, on cancel, does NOT fall through to the resumable per-page pullRemoteFullTree (exportSnapshotUnsupported does not match context cancellation / 429). The events cursor is NOT advanced on cancel (verified) -- the bug is purely the export-vs-watchdog race + no fall-through. Fix (daemon-side, durable, all cases): - Bound the export with its OWN sub-deadline (ExportTimeout / env RELAYFILE_EXPORT_TIMEOUT, default 45s), clamped strictly under bootstrapIdleTimeout. On sub-deadline expiry while the parent bootstrap ctx is still alive, fall through to the resumable, per-page pullRemoteFullTree (per-page prog.touch() feeds the watchdog; per-page saveState() persists BootstrapCursor so it resumes across cancels and the bootstrap COMPLETES). - Classify HTTP 429 workspace_busy as export-unsupported -> fall to the tree path (individually bounded + retried + resumable) instead of retrying the contended atomic export. - Empty-200 export for a workspace with tracked files: do NOT markBootstrapComplete (that locked in a stale/empty mirror); fall through to the tree pull for an authoritative listing via a different cloud code path. Cross-boundary: adds only ONE new OPTIONAL env (RELAYFILE_EXPORT_TIMEOUT, safe default) and touches ZERO existing cloud<->daemon contract points (no flag/env/ --state-file/per-page-saveState changes), so cloud's mount-script needs no lockstep change -- only the snapshot version bump. Tests: slow export -> tree fallback + bootstrap completes; 429 workspace_busy -> tree fallback; empty-200 export with tracked files -> tree recovery + not short-circuited; failed export -> cursor not advanced + bootstrap not complete; 429 classification. Full internal/mountsync suite green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Warning Review limit reached
More reviews will be available in 5 minutes and 54 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (2)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces an independent timeout (exportTimeout) for atomic full-tree exports to prevent slow or throttled exports from triggering the bootstrap watchdog and causing non-convergence loops. When the export times out, returns a 429 (workspace busy), or returns an empty list despite local tracked files, the syncer now falls back to a resumable, per-page tree pull. Extensive unit tests were added to verify these fallback behaviors. Feedback suggests logging a warning when exportTimeout is silently clamped to prevent operational confusion.
| if maxExportTimeout := bootstrapIdleTimeout * 3 / 4; maxExportTimeout > 0 && exportTimeout > maxExportTimeout { | ||
| exportTimeout = maxExportTimeout | ||
| } |
There was a problem hiding this comment.
When exportTimeout is silently clamped to maxExportTimeout, it can lead to operational confusion and make debugging difficult, as the user's explicit configuration (via RELAYFILE_EXPORT_TIMEOUT or SyncerOptions) is overridden without any indication. It would be highly beneficial to log a warning when this clamping occurs so that operators are aware of the adjustment.
if maxExportTimeout := bootstrapIdleTimeout * 3 / 4; maxExportTimeout > 0 && exportTimeout > maxExportTimeout {
if opts.Logger != nil {
opts.Logger.Printf("clamping exportTimeout from %s to %s (must be strictly under bootstrapIdleTimeout %s)", exportTimeout, maxExportTimeout, bootstrapIdleTimeout)
}
exportTimeout = maxExportTimeout
}
Relayfile Eval ReviewRun: Passed: 4 | Needs human: 0 | Reviewable: 0 | Missing output: 0 | Failed: 0 | Skipped: 0 Human Review CasesNo reviewable human-review cases captured Relayfile output. |
There was a problem hiding this comment.
1 issue found across 2 files
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
…ini) A silently-overridden operator config (RELAYFILE_EXPORT_TIMEOUT / opts.ExportTimeout) makes debugging confusing; surface a warning when the clamp to < bootstrapIdleTimeout actually changes the value. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Reviewed and fixed PR #223 locally. Changes made:
Verification run with temporary Go 1.22 toolchain:
|
|
✅ pr-reviewer applied fixes — committed and pushed Reviewed and fixed PR #223 locally. Changes made:
Verification run with temporary Go 1.22 toolchain:
|
There was a problem hiding this comment.
✅ pr-reviewer applied fixes — committed and pushed 5219a1b to this PR. The notes below describe what changed.
Reviewed and fixed PR #223 locally.
Changes made:
- Clamped
ExportTimeoutbelow a positiveBootstrapTimeout, not just the idle watchdog, so slow exports can still fall back to tree pull in hard-cap mode. - Narrowed 429 export fallback to
workspace_busyonly; other 429s likerate_limitedandqueue_fullnow remain visible after retries. - Added regression coverage for hard bootstrap cap fallback and 429 classification.
Verification run with temporary Go 1.22 toolchain:
go test ./internal/mountsyncgo test ./...go vet ./...
… narrow 429 fallback to workspace_busy (#224) Follow-up to #223 — brings the pr-reviewer bot's two hardening fixes (5219a1b) onto main. They landed on the PR branch AFTER #223 was merged, so they are not yet on main; hand-merged here (NOT cherry-picked) to preserve the gemini clamp-warning log already on main, and excluding the bot commit's unrelated .trajectories/* debris. 1. Hard-cap clamp: also bound exportTimeout below 3/4 of a POSITIVE bootstrapTimeout (RELAYFILE_BOOTSTRAP_TIMEOUT), taking the min with the no-progress idle-watchdog clamp. Without this, an operator-set hard cap shorter than the export sub-deadline cancels the parent bootstrap ctx before the export's own deadline fires -> pullRemoteFullExport propagates instead of falling through to the resumable tree pull, defeating same-cycle convergence. No-op under the recommended unset (0/unbounded) config; purely defensive. 2. Narrow the 429 -> tree fallback to workspace_busy specifically. Other 429 classes (global rate limits, queue pressure) are not export-specific and should remain visible to the caller after retries are exhausted rather than flooding the per-file tree path. Matches the prod signal (429 workspace_busy). Tests: TestReconcileFallsBackToTreeWhenExportExceedsHardBootstrapCap (hard cap shorter than sub-deadline -> still falls to tree + bootstrap completes); +2 supported-429 classification cases (rate_limited, queue_full stay visible). Full internal/mountsync suite green; go vet + gofmt clean. No cloud-contract change. Co-authored-by: Hubspot Phase B Bot <bot@agent-relay.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Reviewed PR #223 and fixed one correctness gap: Added a regression test covering slow export fallback under a hard bootstrap timeout. Validation run locally: |
There was a problem hiding this comment.
1) — fixes were not applied to the PR. The notes below are advisory and were not pushed.
Reviewed PR #223 and fixed one correctness gap: ExportTimeout now clamps below the actual bootstrap deadline, including configured hard BootstrapTimeout, so slow atomic exports fall back to the resumable tree pull before the parent bootstrap context is cancelled.
Added a regression test covering slow export fallback under a hard bootstrap timeout.
Validation run locally:
GO=/tmp/go/bin/go make test passed.
|
Reviewed PR #223 and made fixes in Changed:
Validation:
|
|
Reviewed PR #223 and made fixes in Changed:
Validation:
|
There was a problem hiding this comment.
1) — fixes were not applied to the PR. The notes below are advisory and were not pushed.
Reviewed PR #223 and made fixes in internal/mountsync.
Changed:
- Narrowed export fallback on HTTP 429 to only
workspace_busy, so genericrate_limited/queue_full429s do not explode intoListTree/ReadFilecalls. - Added tests covering generic 429 classification and
ExportTimeoutclamp logging, addressing the unresolved bot feedback.
Validation:
scripts/check-contract-surface.shpassed.- Could not run Go tests because
go/gofmtare not installed in this environment.
|
Reviewed PR #223 and made focused fixes in Changed:
Local verification:
|
|
Reviewed PR #223 and made focused fixes in Changed:
Local verification:
|
There was a problem hiding this comment.
1) — fixes were not applied to the PR. The notes below are advisory and were not pushed.
Reviewed PR #223 and made focused fixes in internal/mountsync.
Changed:
- Clamped
ExportTimeoutbelow the effective bootstrap deadline, including hardBootstrapTimeout, so slow exports can fall back before the parent context is cancelled. - Narrowed HTTP 429 export fallback to only
workspace_busy; genericrate_limited/queue_full429s now remain visible after retries. - Added regression coverage for hard-cap fallback, generic 429 classification, and clamp logging for the unresolved bot feedback.
Local verification:
scripts/check-contract-surface.sh/tmp/go/bin/go test ./internal/mountsync/tmp/go/bin/go test ./.../tmp/go/bin/go vet ./...
Gate B —
/fs/exportnon-convergence root cause + fixFixes the relayfile-mount daemon thread of #1499 / cloud #1516: a proactive persona's GitHub-issue workspace never finishes loading → the agent can't clone → 30-min timeout → no PR artifact.
Root cause (confirmed in v0.8.5 source + prod
outputTail)The atomic full-tree export
pullRemoteFullExport→ExportFilesis a single HTTP call that:prog.touch()only fires per-file after the body arrives), andSo on a slow / large / 429-contended workspace, the daemon's no-progress bootstrap watchdog (
bootstrapIdleTimeout = 90s,syncer.go:82) cancels the export with zero files applied. On cancel,exportSnapshotUnsupporteddoes not matchcontext.Canceled/429, sopullRemoteFullExportreturns(true, err)andpullRemoteFullnever falls through to the resumablepullRemoteFullTree.markBootstrapComplete()is never reached → next cycle:len(Files)>0 && !BootstrapComplete→ "detected non-empty state without completed bootstrap; forcing full reconcile" → atomic export → cancelled again. Forever.Prod
outputTail(ProbeV085) matched this verbatim:bootstrap watchdog: no progress for 1m30s; cancelling full pull (will resume next cycle)(34×),detected non-empty state without completed bootstrap; forcing full reconcile(33×),http 429 workspace_busy(12×),fresh remote export has 0 files but N tracked locally.Pivotal cursor check: a cancelled export does NOT advance the events cursor —
EventsCursoris only set after a successfulpullRemoteFull. The bug is the export-vs-watchdog race + missing fall-through, not cursor advance.Fix (daemon-side, durable — handles slow + hung + empty + busy)
ExportTimeout/ envRELAYFILE_EXPORT_TIMEOUT, default 45s), clamped strictly underbootstrapIdleTimeout. On sub-deadline expiry while the parent bootstrap ctx is still alive, fall through to the resumable per-pagepullRemoteFullTree: per-pageprog.touch()feeds the watchdog, per-pagesaveState()persistsBootstrapCursorso it resumes across cancels and the bootstrap COMPLETES → the loop terminates. (Parent-ctx-dead = outer watchdog fired → propagate; next cycle's sub-deadline converges via the tree path.)workspace_busy→ fall to tree. Classify HTTP 429 as export-unsupported (afterdoJSONexhausts its Retry-After backoff) so it yields to the per-file-bounded, individually-retried, resumable tree path instead of re-contending the one busy DO invocation.markBootstrapComplete()(which locked in a stale/empty mirror); it falls through to the tree pull for an authoritative listing via a different cloud code path.Cross-boundary discipline (CLAUDE.md relayfile-source-of-truth)
This change adds only one new OPTIONAL env (
RELAYFILE_EXPORT_TIMEOUT, safe 45s default) and touches ZERO existing cloud↔daemon contract points — no flag/env/--state-file/per-page-saveStatechanges. So cloud'smount-script.tsneeds no lockstep change, only the snapshot version bump. Verified the companion contract from v0.8.5 source for CIGate's cloud PR #1553: env nameRELAYFILE_BOOTSTRAP_IDLE_TIMEOUT(syncer.go:1085) +resolveDurationEnv→time.ParseDuration(so300sparses);--once mount-kind=initial-synchonors--state-file/RELAYFILE_MOUNT_STATE_FILE(main.go:77/93) andsaveState()writes it per page during the tree pull (syncer.go:3104/4724) → the cloud outer-wrapper's watched progress file sees resumable-tree progress.Tests (full
internal/mountsyncsuite green)TestReconcileFallsBackToTreeWhenExportExceedsSubDeadline— slow export → tree fallback + bootstrap completes (the convergence proof).TestReconcileFallsBackToTreeWhenExportWorkspaceBusy— 429 → tree fallback.TestExportEmptyButPopulatedTreeRecoversViaTreePull— empty-200 with tracked files → tree recovery, not short-circuited.TestFailedExportDoesNotAdvanceCursorOrCompleteBootstrap— pivotal cursor-safety: propagated 5xx export failure advances neither cursor nor BootstrapComplete.TestExportSnapshotOverloadedClassification— +429 workspace_busy case.Operator release steps (this PR ships code+tests only; release is
workflow_dispatch, operator-gated)vX.Y.Z(publish.yml — operator).RELAYFILE_MOUNT_VERSIONtovX.Y.Z(rebuild-snapshot.yml) — the prod mount-version lever is the Daytona snapshot pin, not@agent-relay/sdk RELAYFILE_VERSION.smallissue → export converges →runner.started→ realcodex/small-issue-<n>PR).🤖 Generated with Claude Code