fix(mount): converge bootstrap by yielding slow/empty/429 export to resumable tree pull (#1499 / cloud #1516 Gate B) #223

Merged
khaliqgant merged 2 commits into main from fix/fs-export-empty-convergence
May 31, 2026

@khaliqgant

Member

Gate B — /fs/export non-convergence root cause + fix

Fixes the relayfile-mount daemon thread of #1499 / cloud #1516: a proactive persona's GitHub-issue workspace never finishes loading → the agent can't clone → 30-min timeout → no PR artifact.

Root cause (confirmed in v0.8.5 source + prod outputTail)

The atomic full-tree export pullRemoteFullExport → ExportFiles is a single HTTP call that:

  1. reports NO incremental progress until its whole body returns (prog.touch() only fires per-file after the body arrives), and
  2. has no resume cursor ("Export is atomic … one-shot mirror with no resume cursor").

So on a slow / large / 429-contended workspace, the daemon's no-progress bootstrap watchdog (bootstrapIdleTimeout = 90s, syncer.go:82) cancels the export with zero files applied. On cancel, exportSnapshotUnsupported does not match context.Canceled/429, so pullRemoteFullExport returns (true, err) and pullRemoteFull never falls through to the resumable pullRemoteFullTree. markBootstrapComplete() is never reached → next cycle: len(Files)>0 && !BootstrapComplete → "detected non-empty state without completed bootstrap; forcing full reconcile" → atomic export → cancelled again. The loop repeats forever.

Prod outputTail (ProbeV085) matched this verbatim: bootstrap watchdog: no progress for 1m30s; cancelling full pull (will resume next cycle) (34×), detected non-empty state without completed bootstrap; forcing full reconcile (33×), http 429 workspace_busy (12×), fresh remote export has 0 files but N tracked locally.

Pivotal cursor check: a cancelled export does NOT advance the events cursor — EventsCursor is only set after a successful pullRemoteFull. The bug is the export-vs-watchdog race + missing fall-through, not cursor advance.

Fix (daemon-side, durable — handles slow + hung + empty + busy)

  • Export sub-deadline → resumable-tree fall-through. Bound the export with its own deadline (ExportTimeout / env RELAYFILE_EXPORT_TIMEOUT, default 45s), clamped strictly under bootstrapIdleTimeout. On sub-deadline expiry while the parent bootstrap ctx is still alive, fall through to the resumable per-page pullRemoteFullTree: per-page prog.touch() feeds the watchdog, per-page saveState() persists BootstrapCursor so it resumes across cancels and the bootstrap COMPLETES → the loop terminates. (Parent-ctx-dead = outer watchdog fired → propagate; next cycle's sub-deadline converges via the tree path.)
  • 429 workspace_busy → fall to tree. Classify HTTP 429 as export-unsupported (after doJSON exhausts its Retry-After backoff) so it yields to the per-file-bounded, individually-retried, resumable tree path instead of re-contending the one busy DO invocation.
  • Empty-200 export → no false completion. A successful-but-empty export for a workspace with tracked files no longer calls markBootstrapComplete() (which locked in a stale/empty mirror); it falls through to the tree pull for an authoritative listing via a different cloud code path.

Cross-boundary discipline (CLAUDE.md relayfile-source-of-truth)

This change adds only one new OPTIONAL env (RELAYFILE_EXPORT_TIMEOUT, safe 45s default) and touches ZERO existing cloud↔daemon contract points: no flag/env/--state-file/per-page-saveState changes. So cloud's mount-script.ts needs no lockstep change, only the snapshot version bump. Verified the companion contract from v0.8.5 source for CIGate's cloud PR #1553: env name RELAYFILE_BOOTSTRAP_IDLE_TIMEOUT (syncer.go:1085) + resolveDurationEnv → time.ParseDuration (so 300s parses); --once mount-kind=initial-sync honors --state-file/RELAYFILE_MOUNT_STATE_FILE (main.go:77/93) and saveState() writes it per page during the tree pull (syncer.go:3104 / 4724) → the cloud outer-wrapper's watched progress file sees resumable-tree progress.

Tests (full internal/mountsync suite green)

  • TestReconcileFallsBackToTreeWhenExportExceedsSubDeadline — slow export → tree fallback + bootstrap completes (the convergence proof).
  • TestReconcileFallsBackToTreeWhenExportWorkspaceBusy — 429 → tree fallback.
  • TestExportEmptyButPopulatedTreeRecoversViaTreePull — empty-200 with tracked files → tree recovery, not short-circuited.
  • TestFailedExportDoesNotAdvanceCursorOrCompleteBootstrap — pivotal cursor-safety: propagated 5xx export failure advances neither cursor nor BootstrapComplete.
  • TestExportSnapshotOverloadedClassification — +429 workspace_busy case.

Operator release steps (this PR ships code+tests only; release is workflow_dispatch, operator-gated)

  1. Merge this PR.
  2. Cut a relayfile release vX.Y.Z (publish.yml — operator).
  3. Bump cloud snapshot RELAYFILE_MOUNT_VERSION to vX.Y.Z (rebuild-snapshot.yml) — the prod mount-version lever is the Daytona snapshot pin, not @agent-relay/sdk RELAYFILE_VERSION.
  4. rebuild-snapshot → ProbeV085 runs the post-fix re-probe (fresh small issue → export converges → runner.started → real codex/small-issue-<n> PR).

🤖 Generated with Claude Code

…esumable tree pull

The atomic full-tree export (pullRemoteFullExport -> ExportFiles) is a single
HTTP call that reports NO incremental progress until its whole body returns,
and it has no resume cursor. On a slow / large / 429-contended workspace the
no-progress bootstrap watchdog (bootstrapIdleTimeout, 90s) cancels it with zero
files applied; the next cycle restarts the export from scratch and is cancelled
again -> the production "non-empty without completed bootstrap -> forcing full
reconcile" loop that never converges (the mount never loads the issue records,
so the proactive persona times out with no artifact). #1499 / cloud #1516.

Root cause (confirmed in code + prod outputTail): the non-resumable atomic
export races the no-progress watchdog and, on cancel, does NOT fall through to
the resumable per-page pullRemoteFullTree (exportSnapshotUnsupported does not
match context cancellation / 429). The events cursor is NOT advanced on cancel
(verified) -- the bug is purely the export-vs-watchdog race + no fall-through.

Fix (daemon-side, durable, all cases):
- Bound the export with its OWN sub-deadline (ExportTimeout / env
  RELAYFILE_EXPORT_TIMEOUT, default 45s), clamped strictly under
  bootstrapIdleTimeout. On sub-deadline expiry while the parent bootstrap ctx
  is still alive, fall through to the resumable, per-page pullRemoteFullTree
  (per-page prog.touch() feeds the watchdog; per-page saveState() persists
  BootstrapCursor so it resumes across cancels and the bootstrap COMPLETES).
- Classify HTTP 429 workspace_busy as export-unsupported -> fall to the tree
  path (individually bounded + retried + resumable) instead of retrying the
  contended atomic export.
- Empty-200 export for a workspace with tracked files: do NOT
  markBootstrapComplete (that locked in a stale/empty mirror); fall through to
  the tree pull for an authoritative listing via a different cloud code path.

Cross-boundary: adds only ONE new OPTIONAL env (RELAYFILE_EXPORT_TIMEOUT, safe
default) and touches ZERO existing cloud<->daemon contract points (no flag/env/
--state-file/per-page-saveState changes), so cloud's mount-script needs no
lockstep change -- only the snapshot version bump.

Tests: slow export -> tree fallback + bootstrap completes; 429 workspace_busy
-> tree fallback; empty-200 export with tracked files -> tree recovery + not
short-circuited; failed export -> cursor not advanced + bootstrap not complete;
429 classification. Full internal/mountsync suite green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented May 31, 2026


Warning

Review limit reached

@khaliqgant, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 5 minutes and 54 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: f04ac4c7-c742-488a-a465-7cf97aecc5d3

📥 Commits

Reviewing files that changed from the base of the PR and between 38ad124 and ce1d1d9.

📒 Files selected for processing (2)
  • internal/mountsync/syncer.go
  • internal/mountsync/syncer_test.go

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces an independent timeout (exportTimeout) for atomic full-tree exports to prevent slow or throttled exports from triggering the bootstrap watchdog and causing non-convergence loops. When the export times out, returns a 429 (workspace busy), or returns an empty list despite local tracked files, the syncer now falls back to a resumable, per-page tree pull. Extensive unit tests were added to verify these fallback behaviors. Feedback suggests logging a warning when exportTimeout is silently clamped to prevent operational confusion.

Comment on lines +1117 to +1119
	if maxExportTimeout := bootstrapIdleTimeout * 3 / 4; maxExportTimeout > 0 && exportTimeout > maxExportTimeout {
		exportTimeout = maxExportTimeout
	}


medium

When exportTimeout is silently clamped to maxExportTimeout, it can lead to operational confusion and make debugging difficult, as the user's explicit configuration (via RELAYFILE_EXPORT_TIMEOUT or SyncerOptions) is overridden without any indication. It would be highly beneficial to log a warning when this clamping occurs so that operators are aware of the adjustment.

	if maxExportTimeout := bootstrapIdleTimeout * 3 / 4; maxExportTimeout > 0 && exportTimeout > maxExportTimeout {
		if opts.Logger != nil {
			opts.Logger.Printf("clamping exportTimeout from %s to %s (must be strictly under bootstrapIdleTimeout %s)", exportTimeout, maxExportTimeout, bootstrapIdleTimeout)
		}
		exportTimeout = maxExportTimeout
	}

@github-actions

github-actions Bot commented May 31, 2026


Relayfile Eval Review

Run: .relayfile/evals/runs/2026-05-31T06-49-56-396Z-HEAD-provider
Mode: provider
Git SHA: 5073914

Passed: 4 | Needs human: 0 | Reviewable: 0 | Missing output: 0 | Failed: 0 | Skipped: 0

Human Review Cases

No reviewable human-review cases captured Relayfile output.

@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 2 files


Comment thread internal/mountsync/syncer.go
…ini)

A silently-overridden operator config (RELAYFILE_EXPORT_TIMEOUT / opts.ExportTimeout)
makes debugging confusing; surface a warning when the clamp to < bootstrapIdleTimeout
actually changes the value.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@khaliqgant khaliqgant merged commit f133096 into main May 31, 2026
7 checks passed
@khaliqgant khaliqgant deleted the fix/fs-export-empty-convergence branch May 31, 2026 06:48
@agent-relay-code

Contributor

Reviewed and fixed PR #223 locally.

Changes made:

  • Clamped ExportTimeout below a positive BootstrapTimeout, not just the idle watchdog, so slow exports can still fall back to tree pull in hard-cap mode.
  • Narrowed 429 export fallback to workspace_busy only; other 429s like rate_limited and queue_full now remain visible after retries.
  • Added regression coverage for hard bootstrap cap fallback and 429 classification.

Verification run with temporary Go 1.22 toolchain:

  • go test ./internal/mountsync
  • go test ./...
  • go vet ./...

@agent-relay-code

Contributor

pr-reviewer applied fixes — committed and pushed 5219a1b to this PR. The notes below describe what changed.

Reviewed and fixed PR #223 locally.

Changes made:

  • Clamped ExportTimeout below a positive BootstrapTimeout, not just the idle watchdog, so slow exports can still fall back to tree pull in hard-cap mode.
  • Narrowed 429 export fallback to workspace_busy only; other 429s like rate_limited and queue_full now remain visible after retries.
  • Added regression coverage for hard bootstrap cap fallback and 429 classification.

Verification run with temporary Go 1.22 toolchain:

  • go test ./internal/mountsync
  • go test ./...
  • go vet ./...


khaliqgant added a commit that referenced this pull request May 31, 2026
… narrow 429 fallback to workspace_busy (#224)

Follow-up to #223 — brings the pr-reviewer bot's two hardening fixes (5219a1b)
onto main. They landed on the PR branch AFTER #223 was merged, so they are not
yet on main; hand-merged here (NOT cherry-picked) to preserve the gemini
clamp-warning log already on main, and excluding the bot commit's unrelated
.trajectories/* debris.

1. Hard-cap clamp: also bound exportTimeout below 3/4 of a POSITIVE
   bootstrapTimeout (RELAYFILE_BOOTSTRAP_TIMEOUT), taking the min with the
   no-progress idle-watchdog clamp. Without this, an operator-set hard cap
   shorter than the export sub-deadline cancels the parent bootstrap ctx before
   the export's own deadline fires -> pullRemoteFullExport propagates instead of
   falling through to the resumable tree pull, defeating same-cycle convergence.
   No-op under the recommended unset (0/unbounded) config; purely defensive.

2. Narrow the 429 -> tree fallback to workspace_busy specifically. Other 429
   classes (global rate limits, queue pressure) are not export-specific and
   should remain visible to the caller after retries are exhausted rather than
   flooding the per-file tree path. Matches the prod signal (429 workspace_busy).
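
Item 2 amounts to a narrow predicate over the HTTP status and the error code. The sketch below assumes a hypothetical apiError shape; only the status 429 and the code strings workspace_busy, rate_limited, and queue_full come from this thread.

```go
package main

import "net/http"

// apiError is a hypothetical, simplified view of a failed export response.
type apiError struct {
	Status int    // HTTP status code
	Code   string // machine-readable error code from the response body
}

// yieldsToTreePull reports whether an export failure should be treated as
// export-unsupported and fall through to the resumable tree pull. Only
// 429 workspace_busy qualifies; other 429 classes stay visible to the caller.
func yieldsToTreePull(e apiError) bool {
	return e.Status == http.StatusTooManyRequests && e.Code == "workspace_busy"
}
```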

Tests: TestReconcileFallsBackToTreeWhenExportExceedsHardBootstrapCap (hard cap
shorter than sub-deadline -> still falls to tree + bootstrap completes); +2
supported-429 classification cases (rate_limited, queue_full stay visible). Full
internal/mountsync suite green; go vet + gofmt clean. No cloud-contract change.

Co-authored-by: Hubspot Phase B Bot <bot@agent-relay.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@agent-relay-code

Contributor

⚠️ pr-reviewer push failed (exit 1) — fixes were not applied to the PR. The notes below are advisory and were not pushed.

Reviewed PR #223 and fixed one correctness gap: ExportTimeout now clamps below the actual bootstrap deadline, including configured hard BootstrapTimeout, so slow atomic exports fall back to the resumable tree pull before the parent bootstrap context is cancelled.

Added a regression test covering slow export fallback under a hard bootstrap timeout.

Validation run locally:
GO=/tmp/go/bin/go make test passed.


@agent-relay-code

Contributor

Reviewed PR #223 and made fixes in internal/mountsync.

Changed:

  • Narrowed export fallback on HTTP 429 to only workspace_busy, so generic rate_limited / queue_full 429s do not explode into ListTree / ReadFile calls.
  • Added tests covering generic 429 classification and ExportTimeout clamp logging, addressing the unresolved bot feedback.

Validation:

  • scripts/check-contract-surface.sh passed.
  • Could not run Go tests because go / gofmt are not installed in this environment.

@agent-relay-code

Contributor

⚠️ pr-reviewer push failed (exit 1) — fixes were not applied to the PR. The notes below are advisory and were not pushed.

Reviewed PR #223 and made fixes in internal/mountsync.

Changed:

  • Narrowed export fallback on HTTP 429 to only workspace_busy, so generic rate_limited / queue_full 429s do not explode into ListTree / ReadFile calls.
  • Added tests covering generic 429 classification and ExportTimeout clamp logging, addressing the unresolved bot feedback.

Validation:

  • scripts/check-contract-surface.sh passed.
  • Could not run Go tests because go / gofmt are not installed in this environment.


@agent-relay-code

Contributor

Reviewed PR #223 and made focused fixes in internal/mountsync.

Changed:

  • Clamped ExportTimeout below the effective bootstrap deadline, including hard BootstrapTimeout, so slow exports can fall back before the parent context is cancelled.
  • Narrowed HTTP 429 export fallback to only workspace_busy; generic rate_limited / queue_full 429s now remain visible after retries.
  • Added regression coverage for hard-cap fallback, generic 429 classification, and clamp logging for the unresolved bot feedback.

Local verification:

  • scripts/check-contract-surface.sh
  • /tmp/go/bin/go test ./internal/mountsync
  • /tmp/go/bin/go test ./...
  • /tmp/go/bin/go vet ./...

@agent-relay-code

Contributor

⚠️ pr-reviewer push failed (exit 1) — fixes were not applied to the PR. The notes below are advisory and were not pushed.

Reviewed PR #223 and made focused fixes in internal/mountsync.

Changed:

  • Clamped ExportTimeout below the effective bootstrap deadline, including hard BootstrapTimeout, so slow exports can fall back before the parent context is cancelled.
  • Narrowed HTTP 429 export fallback to only workspace_busy; generic rate_limited / queue_full 429s now remain visible after retries.
  • Added regression coverage for hard-cap fallback, generic 429 classification, and clamp logging for the unresolved bot feedback.

Local verification:

  • scripts/check-contract-surface.sh
  • /tmp/go/bin/go test ./internal/mountsync
  • /tmp/go/bin/go test ./...
  • /tmp/go/bin/go vet ./...

