feat(engine): per-workspace fleet rollout flag + migration single-delivery guarantees (Phase 6)#194
Conversation
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
|
Warning Review limit reached
More reviews will be available in 32 minutes and 13 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (7)
📝 WalkthroughWalkthroughIntroduces a per-workspace ChangesFleet Nodes Feature Flag Rollout
Sequence Diagram(s)sequenceDiagram
participant Client
rect rgba(70, 130, 180, 0.5)
note over Client,workspaceRoutes: Workspace fleet-nodes management
end
Client->>workspaceRoutes: PUT /workspace/fleet-nodes { enabled: true }
workspaceRoutes->>setFleetNodesOverride: (kv, workspaceId, true)
setFleetNodesOverride->>KVStore: write 'true'
setFleetNodesOverride->>TTLCache: invalidate workspaceId entry
workspaceRoutes->>getFleetNodesConfig: read updated state
workspaceRoutes-->>Client: { ok: true, enabled, default_enabled, override }
rect rgba(60, 179, 113, 0.5)
note over NodeClient,isFleetNodesEnabled: Node WS upgrade with flag check
end
NodeClient->>UpgradeHandler: HTTP Upgrade /v1/node/ws (Bearer token)
UpgradeHandler->>UpgradeHandler: resolve token, validate nt_live_, hash, load node
UpgradeHandler->>isFleetNodesEnabled: (kv, node.workspaceId, config.fleetNodesEnabled)
isFleetNodesEnabled->>KVStore: read override
KVStore-->>isFleetNodesEnabled: 'true'
isFleetNodesEnabled-->>UpgradeHandler: true
UpgradeHandler-->>NodeClient: 101 Switching Protocols
rect rgba(220, 80, 80, 0.5)
note over NodeClient,isFleetNodesEnabled: Flag off — upgrade rejected
end
NodeClient->>UpgradeHandler: HTTP Upgrade /v1/node/ws
UpgradeHandler->>isFleetNodesEnabled: (kv, node.workspaceId, false)
isFleetNodesEnabled-->>UpgradeHandler: false
UpgradeHandler-->>NodeClient: 404 fleet_nodes_disabled
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
willwashburn
left a comment
There was a problem hiding this comment.
RevA1 — adversarial review, round 1. Built the head branch in a clean worktree: typecheck ✅, targeted eslint ✅, full engine suite 111/111 ✅ (the 2 new files, 9 tests, execute and pass). Verified the four claimed boundaries are genuinely gated; the migration guarantee and the Bearer/query compat fix both hold under test. No blockers or majors. Findings below are all minor/nit — none block merge, but a couple touch the "every node surface inert when OFF" claim.
1. [minor] Token precedence is query-first, but the design calls for header precedence — and the both-present case is untested.
engine.ts:501 and entrypoints/node.ts:113 both do url.searchParams.get('token') ?? bearer, i.e. query wins when both are present. The Phase-6 brief specifies header precedence. In practice the broker sends only the header and SDK/Pear only the query, so no shipping transport sends both — impact is low — but the precedence is a silent, untested contract. nodeUpgradeAuth.test.ts only covers header-only and query-only in isolation. Pick a precedence deliberately, document it in the comment, and add a both-present assertion so it can't drift.
2. [minor] The invokeAction fleet gate is fail-OPEN while every other gate is fail-CLOSED.
engine/action.ts:357 defaults const fleetNodesEnabled = options.fleetNodesEnabled ?? true. The WS gate, requireFleetNodes, the trigger hook, and the workspace routes all default config.fleetNodesEnabled ?? false (fail-closed). Today there are only two callers — routes/action.ts threads the resolved flag, and engine/trigger.ts:180 relies on the upstream message.ts gate (so it passes nothing → defaults true). That works only because the message hook gates trigger evaluation. Any future caller that forgets the option silently re-opens spawn + node-handler dispatch with the flag off. Make the option required, or default it false, so the placement gate fails closed like the others.
3. [minor] The declarative-trigger management surface is not gated — only evaluation is.
routes/trigger.ts:22-76 (POST/GET/PATCH/DELETE /v1/triggers) carry only requireAuth, no requireFleetNodes. The PR body says "flag OFF → every node surface inert," and declarative triggers are explicitly part of the fleet control surface; fleetRolloutFlag.test.ts:188 even asserts POST /v1/triggers → 201 while OFF. So a flag-off workspace can fully create/edit/list triggers (they just never fire). That's a deliberate choice but it contradicts the "every surface" wording and is the second ungated node boundary. Either add requireFleetNodes to the trigger CRUD routes for consistency, or document the intentional exception and soften the claim.
4. [minor] Disabling the flag does not tear down live node connections or re-home via_node agents.
The WS gate guards the upgrade (new connections) and agent.register only. A node already connected when a workspace flips OFF keeps receiving deliver frames until it reconnects — the ON→OFF kill-switch direction is unguarded and untested (the migration test only exercises OFF→ON, which is the right primary direction). Fine for a forward dark-launch flag, but please either document the flag as forward-rollout-only or close live node sockets on disable so "inert when OFF" is true for in-flight nodes too.
5. [nit] OpenAPI under-specifies /workspace/fleet-nodes. GET/PUT advertise only a generic SuccessResponse; the real payload is { enabled, default_enabled, override }, and the PUT body's valid shapes are { enabled: boolean } XOR { mode: "inherit" }. Tighten the schema so generated clients see the real contract. (openapi.yaml:18-61)
6. [nit] Ambiguous PUT body precedence. routes/workspace.ts lets { enabled: true, mode: "inherit" } silently clear the override (mode wins) rather than 400-ing. Low risk; consider rejecting bodies that set both.
Verified correct (not rubber-stamping):
agent_location_conflict(engine/node.ts:349) genuinely fires for an active self_connected agent — theonConflictDoUpdate.setWherepredicate (node.ts:341-344) only permits the update when the row is inactive or already via-node on the same node. The migration test is non-vacuous: a broken guard would relocatelegacytovia_node, flipping bothlegacySock == 2andnodeDeliveries == 0.- Bearer header AND
?token=query each upgrade to 101 when the workspace flag is ON; 404 when OFF; 401 on missing/malformed — exercised against a real booted Node server. - Flag default OFF confirmed:
DEFAULT_ENABLED = false, every call site?? false.
Round-1 verdict: GO on substance — no blockers/majors. Recommend addressing #1–#4 before flag flip.
willwashburn
left a comment
There was a problem hiding this comment.
Adversarial review — round 3 (head b673dfb) — GO
Built head + ran the engine suite myself: build green, tsc --noEmit clean, eslint clean on all changed files, 111/111 tests pass (the two new files — nodeUpgradeAuth.test.ts, fleetRolloutFlag.test.ts — confirmed to actually execute via vitest list). All three round-3 items verify:
(1) Rollout flag — all 4 boundaries gated, default OFF, flip sane ✅
- WS upgrade: gated in both
engine.ts:204(hosted/Hono) andentrypoints/node.ts:133(self-host). Node roster:requireFleetNodeson POST/GET/GET:name(routes/node.ts:22,36,49). Triggers:routes/message.ts:158. Spawn placement + node-handler dispatch:engine/action.ts:373,397. - Default OFF:
DEFAULT_ENABLED=false; every call site passesconfig.fleetNodesEnabled ?? false. KV override + 10s cache invalidated onsetFleetNodesOverride(lib/fleetNodes.ts:794). Flip-on/flip-off behavior exercised by both new test files. Survived the rebase intact.
(2) Bearer-OR-query node-WS auth ✅
- Both upgrade paths read
query('token') ?? bearer, case-insensitive^bearer\s+, trimmed; missing→401, non-nt_live_→401, unknown hash→401, flag-off→404. Token is only hashed for lookup, never forwarded.nodeUpgradeAuth.test.tsexercises header-only and query-only against a realwsclient.
(3) Canonical agent.register reply frame ✅ (code correct)
engine/node.ts:519emits{v:1, id:requestId(message), type:'reply', ok:true, data:{agent_id, token, name}}. MatchesFleetReplyMessageSchema(.strict()) exactly, anddatacarries only{agent_id, token, name}—invocation_id/session_refcorrectly excluded, so the broker'sdeny_unknown_fieldsparse ofAgentRegisterReplyData {agent_id, token, name?}is satisfied.idechoesmessage.idwhen present → id-correlated.
Non-blocking findings (minor / nit)
-
[minor] Success-path id-correlation is untested.
engine/node.ts:520is the entire point of the fix (broker matchespending_agent_registrationsbyreply.id), yet no test asserts a successful reply'sidequals the requestid.delivery.test.ts:110sendsagent.registerwith noidand never readsreply.id; only the conflict/error path (fleetRolloutFlag.test.ts:236, idclaim-legacy) checks correlation. A regression toid: generateId()on the success branch would pass the whole suite — so the PR's "these regressions cannot reship" claim doesn't actually hold for this property. Fix: indelivery.test.tsregisterViaNode, sendid: 'reg-<name>'andexpect(reply.id).toBe('reg-<name>'). -
[minor/informational] Frozen contract guard eroded (originated in base, not this PR).
git diff ee2c001 -- packages/typesis not empty:AgentRegisterReplyDataSchemaand the goldenfixtures/fleet-wire/reply.agent_register.jsonwere deleted, andreply.json'sdatawas reshaped to{agent_id,name,token,invocation_id,session_ref}. These come from base-branch commits (8518e1a, 33ede3c, ddffa76), not #194 — #194 touches zero files underpackages/types, so the "engine-only, no Rust release" claim holds. But the cross-repo parity for the register-replydatashape now rests solely on the engine's hand-built object + the broker's Rust struct, with no types-level schema or golden fixture pinning it. Worth restoring a fixture/schema guard in a follow-up so finding #1's correlation + shape can't silently drift again. -
[nit]
requireFleetNodesruns beforerateLimiton all three node routes (routes/node.ts:22,36,49), so flag-off requests return 404 without passing the rate limiter. Each call is a cached KV read, so impact is negligible, but ordering is inverted vs. the usual…, rateLimit, handlerconvention. -
[nit] Precedence is query-over-header, not header-over-query.
query('token') ?? bearermeans a?token=wins over theAuthorization: Bearerheader if both are sent. No practical impact (broker sends header only, SDK query only), but it's the opposite of "header precedence" — pick one and note it.
Verdict: all three round-3 items verified, nothing new at blocker/major. GO.
willwashburn
left a comment
There was a problem hiding this comment.
Round 4 (adversarial) — GO ✅ clean
Independent third completed pass on head b673dfb. Built and tested from a fresh worktree of origin/feat/fleet-rollout-flag. No blockers or majors found; confirming rounds 1 & 3.
Independent verification
- Typecheck (
tsc --noEmit, after building@relaycast/types+@relaycast/a2a): clean - Engine suite (
vitest run): 111/111 passing, 13 files - ESLint on all 11 changed engine files: clean
Round-4 confirmation criteria
- Flag gates all 4 node boundaries, default OFF ✅
- WS upgrade:
engine.ts:202(Hono/in-process) +entrypoints/node.ts:131(self-host) → 404fleet_nodes_disabled - Roster routes
/v1/nodes*:requireFleetNodeson POST/GET/GET:name (routes/node.ts:22,36,49) → flat 404 - Trigger eval: skipped at the message hook (
routes/message.ts:153-163) - Spawn placement + node-handler dispatch: refuse in
invokeAction(engine/action.ts:362,397); agent-handler actions unaffected (asserted in test) - Default OFF:
DEFAULT_ENABLED=false(lib/fleetNodes.ts:22) +config.fleetNodesEnabled ?? falseat every call site.
- WS upgrade:
- /v1/node/ws Bearer-OR-query auth ✅ —
Authorization: BearerOR?token=(engine.ts:185-189,entrypoints/node.ts:109-113); malformed/missing → 401 (nodeUpgradeAuth.test.ts); flag-off → 404;nt_live_prefix enforced before hash lookup. - Canonical reply frame ✅ —
agent.registeremits{v:1, id:requestId(message), type:'reply', ok:true, data:{agent_id, token, name}}(engine/node.ts:522), id-correlated, notagent.registered. Minimal 3-fielddatamatches the broker'sdeny_unknown_fields+ #191AgentRegisterReplyData. Conformance test updated to assert thereplyframe (delivery.test.ts). Error path is symmetric:{v:1, id, type:'error', ok:false, code, message}matches the strictFleetErrorMessageSchema. - Frozen contract ✅ (with one clarification, below) — PR #194's diff vs its base (
feat/fleet-mailbox) touches zeropackages/typesfiles; the wire contract is untouched by this PR.
Minor / informational (non-blocking)
- (nit, pre-existing, not #194)
git diff ee2c001 -- packages/typesis not empty — but those changes (reply.jsongaininginvocation_id/session_ref, deletion ofreply.agent_register.json) belong to earlier PRs in the stack, not #194. The runtime engine emits only the 3-fielddata, so the brokerdeny_unknown_fieldscontract is satisfied. Note for the stack owner: thereply.jsonfixture'sdatanow carries 5 fields while the engine emits 3; harmless because the engine's replydatais freeform JSON, but a broker decoding that fixture verbatim withdeny_unknown_fieldswould reject the extra fields. Worth a glance in the broker repo's fixture-parity check. - (nit)
nodeUpgradeAuth.test.tstryUpgraderesolves only onopen/unexpected-response; a bare socketerror(no HTTP response) would hang the promise. Fine against a live localhost server, but slightly fragile. - (nit)
openapi.yamldocuments the GET/PUT/workspace/fleet-nodesresponses as genericSuccessResponserather than the actual{enabled, default_enabled, override}shape — consistent with the existing/workspace/streamdocs, so cosmetic only.
Verdict: GO. Phase 6 rollout flag + the two cross-repo compat fixes are correct and well-guarded.
489a0db to
6470e3f
Compare
b673dfb to
87f898b
Compare
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
…ivery guarantees (Phase 6) Gate the entire fleet node control surface behind a per-workspace `fleet_nodes_enabled` flag (default OFF), so fleet can ship dark and roll out workspace-by-workspace. Legacy per-agent WS delivery is unaffected either way. The flag is checked once at each genuine boundary (no scattered checks): - node control WS (`/v1/node/ws`) rejects with `fleet_nodes_disabled` (404) - node roster routes (`/v1/nodes*`) return a flat 404 via `requireFleetNodes` - declarative trigger evaluation is skipped at the message hook - spawn placement + node-handler dispatch refuse in `invokeAction` (agent-handler actions stay available) Flag source mirrors the workspace-stream pattern: a KV override with a short in-memory cache, defaulting to `EngineConfig.fleetNodesEnabled`. GET/PUT `/v1/workspace/fleet-nodes` toggles the per-workspace override. Tests: - flag OFF -> every node surface inert (roster, spawn, WS gate, triggers) - per-workspace override flips the surface on/off; WS gate follows the flag - migration single-delivery: a legacy self-connected agent is never also delivered via a node when the flag flips mid-stream (exclusive location; a node's `agent.register` for it is rejected `agent_location_conflict`) The conformance harness defaults the flag ON, so existing node integration tests pass unchanged. Full engine suite green (108 tests). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…WS upgrade behind fleet flag (Phase 6) Cross-repo compat fix surfaced by the Phase 6 two-node E2E: a real relay broker could never bring a node online against a self-hosted engine. Root cause: the node-control read-side is the Node HTTP-server `upgrade` handler in `entrypoints/node.ts` (the Hono `/v1/node/ws` route only answers the 426 — Node owns the 101). That handler read the token ONLY from the `?token=` query param, but the relay Rust broker's node_control client sends it as `Authorization: Bearer <nt_live_…>`. It also had NO fleet-flag gate for `/v1/node/ws` (only the rk_live workspace-stream path was gated), so the Phase 6 rollout flag did not actually cover the node control surface on the self-host adapter. Fix, both in the upgrade handler: - read the node token from `?token=` query OR `Authorization: Bearer` header (query stays for SDK/Pear; header unblocks the shipped broker — no Rust release needed) - gate the `/v1/node/ws` upgrade behind `isFleetNodesEnabled` (404 when off), mirroring the existing stream gate Also mirrored the dual-transport read in the Hono `/v1/node/ws` route for any adapter that routes upgrades through it. Accepted-stack PRs involved: engine read-side #192, broker send-side #1107. The hosted (Cloudflare DO) equivalent is handled in PR 5. Test: `nodeUpgradeAuth.test.ts` boots the real Node server and asserts a WS client authenticates via BOTH the Bearer header and the query param, that the upgrade is rejected while the workspace flag is off (404), and that a missing/malformed token is rejected (401). Full engine suite green (111). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…me (Phase 6 token authority)
Third cross-repo compat fix surfaced by the Phase 6 E2E (spawn scenarios):
spawn never completes end-to-end against a real broker. The relay broker's
node_control client awaits a `reply` frame keyed by the request id — it matches
`pending_agent_registrations` by `reply.id` and parses `data` as
`{agent_id, token, name}` with `deny_unknown_fields`. The engine instead
answered `agent.register` with a bare `{type:'agent.registered', ...}` carrying
the full object (incl. invocation_id/session_ref), which the broker never
matches → `register_fleet_agent_token` hangs to its 30s timeout → the spawn
action fails. This blocked every spawn-dependent path (placement completion,
mailbox delivery to via-node agents, resume).
Reply in the shape the shipped broker consumes; the broker already holds the
invocation_id/session_ref it sent, so only the minted identity is echoed. Same
root pattern as the node-token transport mismatch (#192 read-side ↔ #1107
broker send-side); no Rust release needed.
Updated the one conformance helper that asserted the old frame. Engine suite
green (111).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…L/depth-cap The `relaycast-engine` serve bin gains optional env tuning so operators (and the Phase 6 fleet E2E) can configure the bounded mailbox and the fleet rollout default without code changes: - RELAYCAST_FLEET_NODES_ENABLED=1 → EngineConfig.fleetNodesEnabled - RELAYCAST_MAILBOX_TTL_MS / RELAYCAST_MAILBOX_DEPTH_CAP → mailbox tuning Unset env leaves the existing defaults untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
87f898b to
a15580e
Compare
#192 (merged) owns 0017_spawn_reservation_and_retry_state and 0018; the Phase 2 mailbox migration was authored as 0017 on an older base, colliding on the 0017 prefix once #192/#193 landed in main. Renumber to 0019 (after 0018) so the D1 migration sequence is unique and ordered. Pure file rename — no code references the filename, and the migration has not been applied to any environment yet (engine unpublished), so there is no D1 re-apply risk.
💡 Codex ReviewWhen a client uses the per-delivery REST ack out of order, for example acking seq 2 while seq 1 is still queued/deferred, this call passes only that row to relaycast/packages/engine/src/db/migrations/0017_fleet_mailbox.sql Lines 45 to 48 in 87f898b For migrated data where an older accepted/deferred delivery remains active and a newer delivered row exists, this initializes relaycast/packages/engine/src/engine/delivery.ts Lines 532 to 534 in 87f898b On broker reconnect, any queued message whose channel belongs to a DM conversation is classified as When an agent is at the depth cap and its queued/delivered rows pass ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
…yStatus remap Changelogs here are hand-curated (no CI generation), and the fleet stack (#191-#194) was missing from them. Add the user-facing entries: - @relaycast/types: new CHANGELOG; document the breaking DeliveryStatus enum remap (accepted/deferred removed, acked/dead_lettered added, delivered re-meaning) with old->new mapping + flag-independent migration note, the new Delivery location/lifecycle fields, and the fleet-wire protocol module. - @relaycast/sdk-typescript: node roster API (nodes.list/get, triggers.list), capability objects, handler/dispatch node fields, JsonValue export, and the breaking action-output widening + delivery status value change. These confirm the next @relaycast/types + sdk-typescript publish is a MAJOR.
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
packages/engine/src/bin/serve.ts (1)
63-80: Use a Zod schema for environment variable parsing.The inline ad-hoc numeric and boolean parsing in
serve.ts(lines 63–80) introduces manual validation logic that diverges from the codebase's established pattern. Throughout the engine, Zod schemas are used consistently to validate request bodies and other inputs (e.g., in routes and engine logic). Environment variables should follow the same pattern with a schema to standardize parsing, error handling, and type safety.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/engine/src/bin/serve.ts` around lines 63 - 80, The inline ad-hoc parsing in the `num` helper function and conditional spreading for environment variables diverges from the codebase's established Zod schema pattern. Create a Zod schema to define and validate environment variables including RELAYCAST_MAILBOX_TTL_MS, RELAYCAST_MAILBOX_DEPTH_CAP, and RELAYCAST_FLEET_NODES_ENABLED, then replace the `num` function and the conditional spreading logic with calls to this schema's parse or parseAsync method to handle environment variable parsing in a consistent, standardized way that aligns with the rest of the engine codebase.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@openapi.yaml`:
- Around line 2149-2163: The PUT `/workspace/fleet-nodes` endpoint schema
currently allows both `enabled` and `mode` properties to be optional, but the
runtime rejects empty payloads with a 400 error. Add a `required` array to the
object schema containing at least one of the property names (either "enabled" or
"mode") to enforce that the payload cannot be empty. Additionally, add a missing
`400` response definition after the `200` response to document the
invalid_request error that is returned when validation fails at runtime.
In `@packages/engine/src/engine.ts`:
- Around line 182-186: The authHeader variable assignment on line 182 violates
repo guidelines by using a mixed-case fallback pattern with both 'Authorization'
and 'authorization' headers. Replace this mixed-case fallback with a single
canonical header lookup that normalizes the header name to lowercase when
retrieving it, so that c.req.header() is called only once with a consistent
case-normalized approach rather than trying multiple case variations as a
fallback.
In `@packages/engine/src/middleware/fleetNodes.ts`:
- Line 15: The condition at line 15 in fleetNodes.ts accesses
config.fleetNodesEnabled without guarding against config being undefined, since
config is optional in EngineDeps. This will throw a runtime TypeError if config
is not provided. Fix this by using optional chaining (config?.fleetNodesEnabled)
when accessing the fleetNodesEnabled property in the isFleetNodesEnabled call,
ensuring the property access is safe when config is undefined or null.
In `@packages/engine/src/routes/workspace.ts`:
- Around line 41-44: Refactor the updateFleetNodesSchema to use a discriminated
union with Zod instead of relying on manual validation in the handler. Replace
the current optional fields schema (with enabled and mode as optional
strings/booleans and passthrough) with a discriminated union pattern that
defines two distinct payloads: one with mode set to 'inherit' and another with
enabled as a boolean. Apply .strict() to the schema to enforce strict validation
and eliminate the need for manual runtime checks. Then update the handler logic
to remove the ad-hoc branching validation that currently checks which field is
provided and validates the types, since the Zod schema will now enforce this
contract at parse time.
---
Nitpick comments:
In `@packages/engine/src/bin/serve.ts`:
- Around line 63-80: The inline ad-hoc parsing in the `num` helper function and
conditional spreading for environment variables diverges from the codebase's
established Zod schema pattern. Create a Zod schema to define and validate
environment variables including RELAYCAST_MAILBOX_TTL_MS,
RELAYCAST_MAILBOX_DEPTH_CAP, and RELAYCAST_FLEET_NODES_ENABLED, then replace the
`num` function and the conditional spreading logic with calls to this schema's
parse or parseAsync method to handle environment variable parsing in a
consistent, standardized way that aligns with the rest of the engine codebase.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 8a03e497-b67a-46c2-8d31-9e5b677a0181
📒 Files selected for processing (20)
openapi.yamlpackages/engine/src/__tests__/conformance/delivery.test.tspackages/engine/src/__tests__/conformance/fleetRolloutFlag.test.tspackages/engine/src/__tests__/conformance/harness.tspackages/engine/src/__tests__/nodeUpgradeAuth.test.tspackages/engine/src/bin/serve.tspackages/engine/src/db/migrations/0019_fleet_mailbox.sqlpackages/engine/src/engine.tspackages/engine/src/engine/action.tspackages/engine/src/engine/node.tspackages/engine/src/entrypoints/node.tspackages/engine/src/lib/fleetNodes.tspackages/engine/src/middleware/fleetNodes.tspackages/engine/src/ports/index.tspackages/engine/src/routes/action.tspackages/engine/src/routes/message.tspackages/engine/src/routes/node.tspackages/engine/src/routes/workspace.tspackages/sdk-typescript/CHANGELOG.mdpackages/types/CHANGELOG.md
| const authHeader = c.req.header('Authorization') ?? c.req.header('authorization'); | ||
| const bearer = authHeader && /^bearer\s+/i.test(authHeader) | ||
| ? authHeader.replace(/^bearer\s+/i, '').trim() | ||
| : undefined; | ||
| const token = c.req.query('token') ?? bearer; |
There was a problem hiding this comment.
Remove the mixed-case header fallback in node token extraction.
Line 182 introduces a mixed-case fallback pattern that the repo guidelines forbid. Normalize through a single canonical header lookup.
Suggested fix
- const authHeader = c.req.header('Authorization') ?? c.req.header('authorization');
+ const authHeader = c.req.raw.headers.get('authorization') ?? undefined;As per coding guidelines, **/*.{js,ts,tsx,jsx}: Do not introduce mixed-case field fallbacks.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const authHeader = c.req.header('Authorization') ?? c.req.header('authorization'); | |
| const bearer = authHeader && /^bearer\s+/i.test(authHeader) | |
| ? authHeader.replace(/^bearer\s+/i, '').trim() | |
| : undefined; | |
| const token = c.req.query('token') ?? bearer; | |
| const authHeader = c.req.header('authorization'); | |
| const bearer = authHeader && /^bearer\s+/i.test(authHeader) | |
| ? authHeader.replace(/^bearer\s+/i, '').trim() | |
| : undefined; | |
| const token = c.req.query('token') ?? bearer; |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/engine/src/engine.ts` around lines 182 - 186, The authHeader
variable assignment on line 182 violates repo guidelines by using a mixed-case
fallback pattern with both 'Authorization' and 'authorization' headers. Replace
this mixed-case fallback with a single canonical header lookup that normalizes
the header name to lowercase when retrieving it, so that c.req.header() is
called only once with a consistent case-normalized approach rather than trying
multiple case variations as a fallback.
Source: Coding guidelines
| export const requireFleetNodes = createMiddleware<AppEnv>(async (c, next) => { | ||
| const workspace = c.get('workspace'); | ||
| const { kv, config } = c.get('engine'); | ||
| if (!workspace || !(await isFleetNodesEnabled(kv, workspace.id, config.fleetNodesEnabled ?? false))) { |
There was a problem hiding this comment.
Guard optional engine config before dereferencing.
At Line 15, config.fleetNodesEnabled is accessed even though config is optional in EngineDeps, which can throw a runtime TypeError and convert this gate into a 500 path.
Proposed fix
export const requireFleetNodes = createMiddleware<AppEnv>(async (c, next) => {
const workspace = c.get('workspace');
const { kv, config } = c.get('engine');
- if (!workspace || !(await isFleetNodesEnabled(kv, workspace.id, config.fleetNodesEnabled ?? false))) {
+ const defaultEnabled = config?.fleetNodesEnabled ?? false;
+ if (!workspace || !(await isFleetNodesEnabled(kv, workspace.id, defaultEnabled))) {
return c.json(
{ ok: false, error: { code: 'fleet_nodes_disabled', message: 'Fleet nodes are disabled for this workspace' } },
404,
);
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if (!workspace || !(await isFleetNodesEnabled(kv, workspace.id, config.fleetNodesEnabled ?? false))) { | |
| export const requireFleetNodes = createMiddleware<AppEnv>(async (c, next) => { | |
| const workspace = c.get('workspace'); | |
| const { kv, config } = c.get('engine'); | |
| const defaultEnabled = config?.fleetNodesEnabled ?? false; | |
| if (!workspace || !(await isFleetNodesEnabled(kv, workspace.id, defaultEnabled))) { | |
| return c.json( | |
| { ok: false, error: { code: 'fleet_nodes_disabled', message: 'Fleet nodes are disabled for this workspace' } }, | |
| 404, | |
| ); | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/engine/src/middleware/fleetNodes.ts` at line 15, The condition at
line 15 in fleetNodes.ts accesses config.fleetNodesEnabled without guarding
against config being undefined, since config is optional in EngineDeps. This
will throw a runtime TypeError if config is not provided. Fix this by using
optional chaining (config?.fleetNodesEnabled) when accessing the
fleetNodesEnabled property in the isFleetNodesEnabled call, ensuring the
property access is safe when config is undefined or null.
| const updateFleetNodesSchema = z.object({ | ||
| enabled: z.boolean().optional(), | ||
| mode: z.string().optional(), | ||
| }).passthrough(); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cd packages/engine/src/routes && wc -l workspace.tsRepository: AgentWorkforce/relaycast
Length of output: 85
🏁 Script executed:
cd packages/engine/src/routes && sed -n '30,60p' workspace.tsRepository: AgentWorkforce/relaycast
Length of output: 990
🏁 Script executed:
cd packages/engine/src/routes && sed -n '410,450p' workspace.tsRepository: AgentWorkforce/relaycast
Length of output: 1565
Encode the PUT payload contract in Zod instead of post-parse branching.
The schema accepts both enabled and mode as optional fields with .passthrough(), but the handler manually validates that exactly one is provided and has the correct type. This violates the guideline to prefer Zod schemas over ad-hoc manual checks. Define a discriminated union schema ({ mode: 'inherit' } | { enabled: boolean }) with .strict() to enforce the contract at parse time and eliminate the runtime branching logic.
Suggested refactor
-const updateFleetNodesSchema = z.object({
- enabled: z.boolean().optional(),
- mode: z.string().optional(),
-}).passthrough();
+const updateFleetNodesSchema = z.union([
+ z.object({ mode: z.literal('inherit') }).strict(),
+ z.object({ enabled: z.boolean() }).strict(),
+]);
...
- const body = parsed.data;
-
- let override: boolean | null;
- if (body?.mode === 'inherit') {
- override = null;
- } else if (typeof body?.enabled === 'boolean') {
- override = body.enabled;
- } else {
- return c.json({
- ok: false,
- error: { code: 'invalid_request', message: 'Provide { enabled: boolean } or { mode: "inherit" }' },
- }, 400);
- }
+ const body = parsed.data;
+ const override: boolean | null = 'mode' in body ? null : body.enabled;Applies to lines 41–44 and the handler at 423–433.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/engine/src/routes/workspace.ts` around lines 41 - 44, Refactor the
updateFleetNodesSchema to use a discriminated union with Zod instead of relying
on manual validation in the handler. Replace the current optional fields schema
(with enabled and mode as optional strings/booleans and passthrough) with a
discriminated union pattern that defines two distinct payloads: one with mode
set to 'inherit' and another with enabled as a boolean. Apply .strict() to the
schema to enforce strict validation and eliminate the need for manual runtime
checks. Then update the handler logic to remove the ad-hoc branching validation
that currently checks which field is provided and validates the types, since the
Zod schema will now enforce this contract at parse time.
Source: Coding guidelines
…view)
Address P2 findings from Codex review of the fleet mailbox delivery path:
1. ackDelivery (single per-delivery REST ack) advanced the cumulative cursor to
the row's own seq, so acking seq 2 while seq 1 is queued moved the cursor past
seq 1; deliverPendingToNode (seq > delivery_ack_seq) then skipped it forever on
node replay. Make the cursor advance opt-in (ackRows advanceCursorTo?) — single
acks no longer advance it; the row's acked status already excludes it from replay.
The node delivery.ack {up_to_seq} path still advances cumulatively. Regression test.
2. Migration 0019 seeded delivery_ack_seq = MAX(acked seq), skipping an older
still-queued row below a newer acked one. Seed from the contiguous acked prefix
(lowest active seq - 1; max seq when nothing is active).
3. Node-replay event classification checked dmType before threadId, so a thread
reply inside a DM/group DM would replay as dm.received instead of thread.reply
(the live routes/thread.ts routing). Check threadId first to mirror live.
4. Mailbox depth-cap count included expired-but-unswept rows, so an idle recipient
kept rejecting new sends as depth_cap after TTL instead of dead-lettering.
Exclude expired rows from the count (matches the replay query). Regression test.
Also classify the operator-only /v1/workspace/fleet-nodes flag route as non-SDK in
sdk-openapi-sync (pre-existing #194 gap that turbo test caching had masked).
…e/fleet-nodes The PUT handler rejects a payload lacking both `enabled` (boolean) and `mode: inherit` with a 400 invalid_request, but the schema marked both optional and documented only a 200. Add anyOf[required: enabled | required: mode] to reflect the runtime constraint, and document the 400 (ErrorResponse).
Rebased feature/engine-retention onto main, which now has the fleet stack (#191-#194). Three adaptations: 1. Renumber migration 0016_workspace_retention -> 0020_workspace_retention; #192 took 0016 (fleet_nodes) and #193 took 0019 (fleet_mailbox). 2. Delivery status model: #193 reworked the enum, so SETTLED_DELIVERY_STATUSES is now ['acked','failed','dead_lettered'] (was ['delivered','failed']). 'delivered' is now IN-FLIGHT (sent, awaiting cumulative ack), so retention must never prune it; 'acked' is terminal success. Updated tests to the new status names. 3. insertDelivery test helper assigns a distinct seq per agent — the mailbox migration added UNIQUE(workspace_id, agent_id, seq), so same-agent rows can no longer share the default seq 0. Note: turbo build/tsc is currently red on main itself (engine.ts:212 uses originInfo.origin_surface, which #188 removed from the telemetry contract) — a pre-existing #188/#192 collision unrelated to this PR. Engine vitest is green (132/132).
Rebased feature/engine-retention onto main, which now has the fleet stack (#191-#194). Three adaptations: 1. Renumber migration 0016_workspace_retention -> 0020_workspace_retention; #192 took 0016 (fleet_nodes) and #193 took 0019 (fleet_mailbox). 2. Delivery status model: #193 reworked the enum, so SETTLED_DELIVERY_STATUSES is now ['acked','failed','dead_lettered'] (was ['delivered','failed']). 'delivered' is now IN-FLIGHT (sent, awaiting cumulative ack), so retention must never prune it; 'acked' is terminal success. Updated tests to the new status names. 3. insertDelivery test helper assigns a distinct seq per agent — the mailbox migration added UNIQUE(workspace_id, agent_id, seq), so same-agent rows can no longer share the default seq 0. Note: turbo build/tsc is currently red on main itself (engine.ts:212 uses originInfo.origin_surface, which #188 removed from the telemetry contract) — a pre-existing #188/#192 collision unrelated to this PR. Engine vitest is green (132/132).
…llow-ups (#189) * feat(engine): retention pruning with per-workspace TTLs and outbox follow-ups Add pruneExpired: bounded-batch deletion of expired messages (leaf-first across thread parents), settled deliveries, message logs, and orphaned read receipts, with per-workspace TTLs in a new nullable workspaces.retention column. Message retention is opt-in; settled deliveries and message logs default to 90 days as operational logs. Runs on the Node adapter's outbox cleanup cadence and is exported for queue-backed scheduled handlers. cleanupOldEvents now settles exhausted pending_events rows as failed so they become prunable instead of lingering unclaimable, and sendWebhookEvent skips the outbox insert and queue send entirely (with a per-request memoized existence probe) for workspaces with no active event subscription. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * chore(engine): adapt retention to merged fleet engine (rebase onto main) Rebased feature/engine-retention onto main, which now has the fleet stack (#191-#194). Three adaptations: 1. Renumber migration 0016_workspace_retention -> 0020_workspace_retention; #192 took 0016 (fleet_nodes) and #193 took 0019 (fleet_mailbox). 2. Delivery status model: #193 reworked the enum, so SETTLED_DELIVERY_STATUSES is now ['acked','failed','dead_lettered'] (was ['delivered','failed']). 'delivered' is now IN-FLIGHT (sent, awaiting cumulative ack), so retention must never prune it; 'acked' is terminal success. Updated tests to the new status names. 3. insertDelivery test helper assigns a distinct seq per agent — the mailbox migration added UNIQUE(workspace_id, agent_id, seq), so same-agent rows can no longer share the default seq 0. Note: turbo build/tsc is currently red on main itself (engine.ts:212 uses originInfo.origin_surface, which #188 removed from the telemetry contract) — a pre-existing #188/#192 collision unrelated to this PR. Engine vitest is green (132/132). --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Phase 6 — per-workspace fleet rollout flag + E2E-surfaced compat fixes (RFC AgentWorkforce/relay#1056)
1. Per-workspace fleet rollout flag
Gates the entire fleet node control surface behind a per-workspace
fleet_nodes_enabledflag (default OFF) so fleet can ship dark and roll outworkspace-by-workspace. Legacy per-agent WS delivery is unaffected either way.
The flag is checked once at each genuine boundary (no scattered checks):
/v1/node/ws) →fleet_nodes_disabled(404) when off/v1/nodes*) → flat 404 viarequireFleetNodesinvokeAction(agent-handler actions stay available)
KV override with a short cache, defaulting to
EngineConfig.fleetNodesEnabled.GET/PUT /v1/workspace/fleet-nodestoggles the per-workspace override. Therelaycast-engineserve bin also readsRELAYCAST_FLEET_NODES_ENABLEDandRELAYCAST_MAILBOX_TTL_MS/RELAYCAST_MAILBOX_DEPTH_CAPfor self-host tuning.Migration single-delivery: a test proves a legacy self-connected agent is
never also delivered via a node when the flag flips mid-stream (one active
location per agent; a node's
agent.registerfor an already-connected agent isrejected
agent_location_conflict).2. Two cross-repo compat fixes surfaced by the Phase 6 E2E
The engine runtime deviated from the frozen wire contract (#191) in two
places despite the golden fixtures, so a real broker could never bring a node
online or complete a spawn. Both are fixed engine-side here (the broker is
contract-correct; no Rust release). Neither reopens frozen #192.
Node HTTP-server
upgradehandler inentrypoints/node.tsread the token onlyfrom
?token=, but the broker'snode_controlclient sendsAuthorization: Bearer <nt_live_…>. It also had no fleet-flag gate for/v1/node/ws. Fixed: read the token from query or Bearer header, and gatethe upgrade behind
isFleetNodesEnabled.agent.registerreply frame (token authority): the broker awaits a canonicalreplyframe{v, id, ok:true, data:{agent_id, token, name?}}keyed by requestid (
deny_unknown_fields); the engine emitted a bare{type:'agent.registered', …},so
register_fleet_agent_tokenhung to its 30s timeout and spawn nevercompleted. Fixed: emit the canonical reply frame.
Regression guards
nodeUpgradeAuth.test.ts: a real WS client authenticates via both the Bearerheader and the query param; flag-off → 404; missing/malformed token → 401.
and a spawn completes end-to-end (token mint+inject) — these regressions
cannot reship.
Version skew
Old v8.x brokers keep working (fleet flag off). Fleet nodes require the new
broker + engine + the per-workspace flag on. The hosted (Cloudflare DO)
equivalent of these fixes is handled in PR 5.
Tests
Full engine suite green (111). typecheck + targeted eslint clean.
🤖 Generated with Claude Code