feat(engine): bounded-durable mailbox, location routing, §5 cleanup (fleet Phase 2) #193

Merged
willwashburn merged 3 commits into main from feat/fleet-mailbox on Jun 15, 2026
Conversation

@willwashburn

Member

Summary

Implements Fleet Delivery Phase 2 for AgentWorkforce/relay#1056 on top of feat/fleet-nodes-engine / relaycast#192.

  • Evolves durable deliveries into the bounded mailbox state machine: queued -> delivered -> acked, per-agent monotonic seq, cumulative node delivery.ack, TTL dead-lettering, and depth-cap reject-new sender feedback.
  • Routes inbound delivery by registered location: self-connected agents keep the existing WS push path, while via_node agents receive deliver {agent,msg_id,seq,mode,payload} over the node control connection.
  • Marks cumulative delivery ack as read state so /v1/inbox does not resurface acked messages, and preserves unacked deliveries for reconnect/broker-death redelivery.
  • Adds mailbox defaults in one config point: TTL 1h, depth cap 1000, with workspace-level overrides through engine config.
  • Updates delivery wire types, SDK tests/comments, OpenAPI delivery/node/trigger coverage, and the SQLite migration/schema.
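
The mailbox lifecycle described in the first bullet can be sketched as a small in-memory model. This is only an illustration under stated assumptions, not the engine's code: `Mailbox`, `enqueue`, `markDelivered`, `ackUpTo`, and `expire` are hypothetical names, and letting a cumulative ack settle rows that are still `queued` is an assumption. Only the statuses, the per-agent monotonic `seq`, the cumulative ack, TTL dead-lettering, the depth-cap rejection, and the default values come from the summary.

```typescript
// Hypothetical in-memory model of one agent's bounded mailbox.
type Status = 'queued' | 'delivered' | 'acked' | 'dead_lettered';

interface Delivery {
  msgId: string;
  seq: number;       // per-agent, monotonically increasing
  status: Status;
  expiresAt: number; // epoch ms after which the row is dead-lettered
}

const isActive = (r: Delivery): boolean =>
  r.status === 'queued' || r.status === 'delivered';

class Mailbox {
  private rows: Delivery[] = [];
  private nextSeq = 1;
  private ttlMs: number;
  private depthCap: number;

  // Defaults mirror the PR summary: TTL 1h, depth cap 1000.
  constructor(ttlMs: number = 60 * 60 * 1000, depthCap: number = 1000) {
    this.ttlMs = ttlMs;
    this.depthCap = depthCap;
  }

  // Reject-new: a full mailbox refuses the send so the sender gets feedback.
  enqueue(msgId: string, now: number): Delivery | { rejected: 'depth_cap' } {
    const depth = this.rows.filter((r) => isActive(r) && r.expiresAt > now).length;
    if (depth >= this.depthCap) return { rejected: 'depth_cap' };
    const row: Delivery = { msgId, seq: this.nextSeq++, status: 'queued', expiresAt: now + this.ttlMs };
    this.rows.push(row);
    return row;
  }

  markDelivered(seq: number): void {
    const row = this.rows.find((r) => r.seq === seq);
    if (row && row.status === 'queued') row.status = 'delivered';
  }

  // Cumulative ack: one frame settles every active row with seq <= upToSeq.
  ackUpTo(upToSeq: number): number {
    const hit = this.rows.filter((r) => isActive(r) && r.seq <= upToSeq);
    for (const r of hit) r.status = 'acked';
    return hit.length;
  }

  // TTL sweep: only rows that are still active can be dead-lettered.
  expire(now: number): Delivery[] {
    const due = this.rows.filter((r) => isActive(r) && r.expiresAt <= now);
    for (const r of due) r.status = 'dead_lettered';
    return due;
  }
}
```

Because `acked` and `dead_lettered` rows are never touched again by `ackUpTo` or `expire`, both terminal states are sticky in this model.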

Stack / Merge Order

This PR is stacked on relaycast#192 (feat/fleet-nodes-engine) and should merge after that PR lands. The base requested here is main, so the diff includes the stack context until #192 is merged.

§11 Audit / §5 Cleanup Gate

Audit commands searched the repo for relay://, /v1/ws, resource subscriptions, and per-agent stream consumers.

Findings:

  • No engine-owned relay:// resource definitions were found to delete in this repo.
  • Active consumers still exist outside the engine package: packages/mcp/src/resources/definitions.ts, packages/mcp/src/resources/ws-bridge.ts, packages/mcp/src/resources/subscriptions.ts, and MCP resource/subscription tests still register/read/subscribe relay://... resources.
  • Per-agent /v1/ws is still consumed by the TypeScript SDK (AgentClient.connect / subscription helpers), SDK docs/examples, and other SDK surfaces.

Result: the PTY-agent resource/subscription surface is kept in this PR. Actual deletes remain gated behind that audit because consumers still exist.

Verification

  • npm run typecheck --workspace=@relaycast/engine
  • npx vitest run src/__tests__/conformance/delivery.test.ts src/__tests__/conformance/node.test.ts src/__tests__/atomicity.test.ts from packages/engine
  • npm run test --workspace=@relaycast/types
  • npm run test --workspace=@relaycast/engine
  • npm run test --workspace=@relaycast/sdk
  • npm run build
  • npm run test

Root package.json does not define a repo-level typecheck script; engine workspace typecheck is the explicit typecheck run.

@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Warning

Review limit reached

@willwashburn, we couldn't start this review because you've reached your PR review rate limit.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 82135dab-7274-4068-a59a-b1fdc189a4ad

📥 Commits

Reviewing files that changed from the base of the PR and between 86425b2 and 6470e3f.

📒 Files selected for processing (30)
  • openapi.yaml
  • packages/engine/src/__tests__/atomicity.test.ts
  • packages/engine/src/__tests__/conformance/delivery.test.ts
  • packages/engine/src/__tests__/conformance/harness.ts
  • packages/engine/src/__tests__/conformance/node.test.ts
  • packages/engine/src/db/migrations/0017_fleet_mailbox.sql
  • packages/engine/src/db/schema.ts
  • packages/engine/src/engine/agent.ts
  • packages/engine/src/engine/delivery.ts
  • packages/engine/src/engine/deliveryWire.ts
  • packages/engine/src/engine/deliveryWrites.ts
  • packages/engine/src/engine/dm.ts
  • packages/engine/src/engine/groupDm.ts
  • packages/engine/src/engine/mailboxConfig.ts
  • packages/engine/src/engine/message.ts
  • packages/engine/src/engine/node.ts
  • packages/engine/src/engine/receipt.ts
  • packages/engine/src/engine/thread.ts
  • packages/engine/src/ports/index.ts
  • packages/engine/src/routes/delivery.ts
  • packages/engine/src/routes/deliveryRouting.ts
  • packages/engine/src/routes/dm.ts
  • packages/engine/src/routes/groupDm.ts
  • packages/engine/src/routes/inbox.ts
  • packages/engine/src/routes/message.ts
  • packages/engine/src/routes/thread.ts
  • packages/sdk-typescript/src/__tests__/deliveries.test.ts
  • packages/sdk-typescript/src/agent.ts
  • packages/types/src/__tests__/types.test.ts
  • packages/types/src/delivery.ts

agent-relay-code Bot added a commit that referenced this pull request Jun 12, 2026
agent-relay-code Bot added a commit that referenced this pull request Jun 12, 2026

@willwashburn willwashburn left a comment

Member Author

Verified npm run build and npm run test pass after installing workspace deps.

  1. major packages/engine/src/routes/deliveryRouting.ts:46 — routing for via_node deliveries is based on the location_type/location_node_id snapshot stored on the delivery row, not the recipient’s current location. If an agent re-registers on a different transport before fanout runs, the message still goes to the stale node/socket and can be lost. Re-resolve the current agent location at send time, or migrate queued deliveries when location changes.

  2. major packages/engine/src/adapters/node/realtime.ts:311 — attachNodeSocket() adds sockets to a Set and never replaces an existing control connection. A reconnect can therefore leave two live node sockets for the same node, duplicating deliver/action.invoke traffic and any acks. Enforce a single active node socket by closing the prior one before attaching the new connection.
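
The single-active-socket rule requested in the second finding can be sketched as a registry keyed by node id. This is a minimal sketch under assumptions: `NodeSocketRegistry`, `SocketLike`, and the close code `4000` (an arbitrary value from the application-defined WebSocket range) are hypothetical and do not come from `realtime.ts`.

```typescript
// Minimal socket surface so the sketch does not depend on a WebSocket library.
interface SocketLike {
  send(data: string): void;
  close(code?: number, reason?: string): void;
}

class NodeSocketRegistry {
  // One entry per node instead of a Set of sockets.
  private current = new Map<string, SocketLike>();

  attach(nodeId: string, socket: SocketLike): void {
    const prior = this.current.get(nodeId);
    // Close the superseded connection before the new one takes over.
    if (prior && prior !== socket) prior.close(4000, 'superseded');
    this.current.set(nodeId, socket);
  }

  detach(nodeId: string, socket: SocketLike): void {
    // A late close event from the old socket must not evict the new one.
    if (this.current.get(nodeId) === socket) this.current.delete(nodeId);
  }

  send(nodeId: string, frame: unknown): boolean {
    const socket = this.current.get(nodeId);
    if (!socket) return false;
    socket.send(JSON.stringify(frame));
    return true;
  }
}
```

The identity check in `detach` matters on reconnect: the old socket's close handler usually fires after the new socket has attached.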

@willwashburn

Member Author

Addressed the fanout race in deliveryRouting.ts: delivery fanout now re-resolves the recipient's live location from the agents table before push, and the mailbox suite now has a regression test that flips the recipient from via_node to self_connected before dispatch to prove the delivery still lands.

I left the node socket supersede and close change to Fleet1 as requested and will rebase on that tip when it lands.
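
The fix described above amounts to routing on the recipient's live location rather than the snapshot stored on the delivery row. A rough sketch follows, assuming hypothetical names (`routeDelivery`, `RoutingDeps`, `QueuedDelivery`) and a synchronous lookup for brevity; the engine's real lookup against the agents table is presumably asynchronous.

```typescript
type AgentLocation =
  | { type: 'self_connected' }
  | { type: 'via_node'; nodeId: string };

interface QueuedDelivery {
  agentId: string;
  snapshot: AgentLocation; // location stored on the row at enqueue time
  frame: unknown;
}

interface RoutingDeps {
  resolveLocation(agentId: string): AgentLocation | undefined;
  pushToAgent(agentId: string, frame: unknown): void;
  sendToNode(nodeId: string, frame: unknown): void;
}

function routeDelivery(deps: RoutingDeps, d: QueuedDelivery): 'ws' | 'node' | 'unroutable' {
  // Deliberately ignore d.snapshot: the agent may have re-registered since enqueue.
  const live = deps.resolveLocation(d.agentId);
  if (!live) return 'unroutable';
  if (live.type === 'via_node') {
    deps.sendToNode(live.nodeId, d.frame);
    return 'node';
  }
  deps.pushToAgent(d.agentId, d.frame);
  return 'ws';
}
```

With this shape, a recipient that flips from `via_node` to `self_connected` between enqueue and dispatch still receives the frame on its own connection.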

@willwashburn willwashburn left a comment

Member Author

Round 2 review against feat/fleet-nodes-engine. Verdict: NO-GO. GitHub would not allow this authenticated reviewer to request changes on its own pull request, so I am posting the blocking review as a comment.

  1. [major] TTL expiry can overwrite a concurrent ack and emit a false delivery_failed. expireDueDeliveries selects active rows at packages/engine/src/engine/delivery.ts:398-411, then updates the captured ids at packages/engine/src/engine/delivery.ts:416-425 without re-checking that status is still queued/delivered. If a node delivery.ack lands between those awaits, ackRows can set the row to acked, but the expiry update will still change it to dead_lettered and notify the sender. This violates queued -> delivered -> acked terminal success and the round-2 concurrent ack/TTL edge. Make the expiry update status-guarded and build notices only from rows actually transitioned, ideally in one transaction/returning update.

  2. [major] Reconnect redelivery does not preserve the live deliver payload and reports the recipient as the message author. Live fanout sends payload: { type, data } from routeDeliveryOutcomes at packages/engine/src/routes/deliveryRouting.ts:76-85. Reconnect replay rebuilds a different payload in deliverPendingToNode at packages/engine/src/engine/delivery.ts:462-485 and sends it at packages/engine/src/engine/delivery.ts:523-531. That reconstructed payload also sets message.agent_id from row.delivery.agentId at packages/engine/src/engine/delivery.ts:477-480, which is the target agent, not the sender. After broker death, an unacked message with the same msg_id/seq can therefore be reinjected with a different shape and wrong author id. Persist or reconstruct one canonical deliver payload for both live and replay paths, using the original sender id/name, and add a reconnect test that asserts payload equality, not only msg_id and seq.
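
The status guard requested in the first finding can be illustrated with an in-memory stand-in for the table. This is a sketch, not the code in `delivery.ts`: `expireGuarded` and `Row` are hypothetical, and the SQL in the comment is an assumed analogue rather than the PR's actual statement.

```typescript
type RowStatus = 'queued' | 'delivered' | 'acked' | 'dead_lettered';

interface Row {
  id: string;
  status: RowStatus;
}

// Assumed SQL analogue:
//   UPDATE deliveries SET status = 'dead_lettered'
//   WHERE id IN (...) AND status IN ('queued', 'delivered')
//   RETURNING id;
function expireGuarded(table: Map<string, Row>, capturedIds: string[]): Row[] {
  const transitioned: Row[] = [];
  for (const id of capturedIds) {
    const row = table.get(id);
    // Re-check status at update time; an ack may have landed since the select.
    if (row && (row.status === 'queued' || row.status === 'delivered')) {
      row.status = 'dead_lettered';
      transitioned.push(row);
    }
  }
  // Sender notices should be built from this list, not from capturedIds.
  return transitioned;
}
```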

Tests run:

  • npm ci
  • npm run build
  • npm run typecheck --workspace=@relaycast/engine
  • npx vitest run src/__tests__/conformance/delivery.test.ts src/__tests__/conformance/node.test.ts src/__tests__/atomicity.test.ts from packages/engine
  • npm run test --workspace=@relaycast/types
  • npm run test --workspace=@relaycast/sdk
  • npm run test --workspace=@relaycast/engine
  • npm run test

@willwashburn

Member Author

Round 2 review items addressed in ea89632:

  • TTL expiry now guards active status in the UPDATE itself, and expired-notice fanout only uses rows actually updated, so an acked row cannot be dead-lettered or falsely notify the sender.
  • Reconnect redelivery now uses the same shared deliver frame builder as live fanout, and the replay path reconstructs the same payload shape from the persisted message row.
  • Added concurrent ack-vs-TTL and live-vs-replay payload-equality regressions.
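
A shared frame builder of the kind described in the second bullet might look like the sketch below. The field list `{agent, msg_id, seq, mode, payload}` and the `{ type, data }` payload come from this PR's text; everything else is an assumption, including the `type: 'deliver'` discriminator, the `PersistedDelivery` shape, and the example `mode` value.

```typescript
// Hypothetical persisted row: enough to rebuild the frame without guessing.
interface PersistedDelivery {
  agent: string;     // recipient
  msgId: string;
  seq: number;
  mode: string;      // value set is not specified in the PR text
  eventType: string; // e.g. 'message.created', 'dm.received', 'thread.reply'
  data: Record<string, unknown>; // original event data, including the sender
}

interface DeliverFrame {
  type: 'deliver';
  agent: string;
  msg_id: string;
  seq: number;
  mode: string;
  payload: { type: string; data: Record<string, unknown> };
}

// The only place a deliver frame is assembled.
function buildDeliverFrame(row: PersistedDelivery): DeliverFrame {
  return {
    type: 'deliver',
    agent: row.agent,
    msg_id: row.msgId,
    seq: row.seq,
    mode: row.mode,
    payload: { type: row.eventType, data: row.data },
  };
}

// Live fanout and reconnect replay both serialize through the same builder.
const liveFrame = (row: PersistedDelivery): string => JSON.stringify(buildDeliverFrame(row));
const replayFrame = (row: PersistedDelivery): string => JSON.stringify(buildDeliverFrame(row));
```

Funnelling both paths through one function is what makes a payload-equality regression meaningful: any divergence has to be introduced in the persisted row, not in the frame assembly.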

Fleet1-owned single-socket-per-node work remains untouched here as requested.

@willwashburn willwashburn left a comment

Member Author

Round 3 review: NO-GO. GitHub rejected --request-changes for this account with: Review Can not request changes on your own pull request, so this is posted as a review comment with the same blocking findings.

  1. [blocker] PR 193 is not rebased on the PR 192 head fix and reintroduces multi-socket node delivery. git merge-base --is-ancestor origin/feat/fleet-nodes-engine HEAD is false for ea89632 vs 8949f842, and the current code stores nodeSockets as a Set at packages/engine/src/adapters/node/realtime.ts:66, then sends every control frame to every socket at packages/engine/src/adapters/node/realtime.ts:244, while attachNodeSocket only adds the new socket at packages/engine/src/adapters/node/realtime.ts:311. A reconnect with an old socket still present can duplicate deliver and action.invoke frames, which is exactly the single-socket class of bug round 3 asked us to verify. Rebase or merge feat/fleet-nodes-engine tip, keep one current socket per node, close the superseded socket, and keep the reconnect/no-duplicate test on this stack.

  2. [major] Broker-death replay does not preserve the live delivery wire frame for non-channel messages. Live DM, group DM, and thread deliveries call routeDeliveryOutcomes with dm.received, group_dm.received, and thread.reply at packages/engine/src/routes/dm.ts:131, packages/engine/src/routes/groupDm.ts:185, and packages/engine/src/routes/thread.ts:126, but deliverPendingToNode always rebuilds the pending row as message.created at packages/engine/src/engine/delivery.ts:562. After reconnect, the same delivery can change payload.type and data shape, breaking consumers and fixture parity. Persist the original delivery event type/payload or reconstruct through the same builders used by the live fanout path, and add reconnect redelivery tests for DM, group DM, and thread reply deliveries.

  3. [major] A future cumulative ack can permanently suppress later replay. ackRows advances agents.deliveryAckSeq directly to any up_to_seq at packages/engine/src/engine/delivery.ts:313, while new delivery seqs are assigned from only MAX(deliveries.seq)+1 at packages/engine/src/engine/deliveryWrites.ts:66, and replay filters rows with deliveries.seq > agents.deliveryAckSeq at packages/engine/src/engine/delivery.ts:550. I reproduced this by sending delivery.ack {up_to_seq:100} before the first message: the live frame used seq:1, then after node reconnect/inventory sync replayDeliveries was 0. Do not advance the ack cursor past rows that actually exist and were acked, or make seq allocation account for the ack cursor; add a regression for future/stale acks before first delivery.

  4. [major] The new TTL-vs-ack race coverage is flaky under fresh execution. A focused run of npx vitest run src/__tests__/conformance/delivery.test.ts src/__tests__/conformance/node.test.ts src/__tests__/atomicity.test.ts failed at packages/engine/src/__tests__/conformance/delivery.test.ts:706 because expireDueDeliveries returned one notice in the test "does not dead-letter an acked delivery when ack and TTL expiry race". A later forced full npm run test -- --force passed, which points to nondeterministic scheduling in the test or implementation. Make this deterministic: either enforce the intended ack-wins ordering in storage, or adjust the test to assert exactly one terminal winner and no duplicate sender fanout.
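
One way to read the first remedy suggested in finding 3 is to clamp the cumulative cursor to sequence numbers that actually exist. The helper below is a hypothetical sketch (`nextAckCursor` is not a function in this PR) of that clamp; it also keeps the cursor from moving backward on a stale ack.

```typescript
// Returns the new ack cursor for an agent.
// current:      the stored cursor (agents.deliveryAckSeq in the review text)
// upToSeq:      the value carried by an incoming delivery.ack
// existingSeqs: seqs of delivery rows that exist for this agent
function nextAckCursor(current: number, upToSeq: number, existingSeqs: number[]): number {
  const maxExisting = existingSeqs.reduce((max, s) => Math.max(max, s), 0);
  // Never advance past rows that exist, so later seqs stay replayable.
  const clamped = Math.min(upToSeq, maxExisting);
  // Never move backward on a stale ack.
  return Math.max(current, clamped);
}
```

In the reproduction from the review, `delivery.ack {up_to_seq:100}` before the first message would leave the cursor at 0, so the later `seq:1` row still satisfies `seq > cursor` on replay.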

Verification run from /tmp/review-Rev2x3:

  • npm ci
  • npm run build --workspace=@relaycast/types and npm run build --workspace=@relaycast/a2a to create workspace dist/ outputs for fresh typecheck
  • npm run typecheck --workspace=@relaycast/engine passed
  • focused engine vitest command above failed once on the TTL race
  • npm run test --workspace=@relaycast/types passed
  • npm run test --workspace=@relaycast/sdk passed
  • npm run build passed
  • npm run test -- --force passed with 18/18 tasks forced, 0 cached

willwashburn pushed a commit that referenced this pull request Jun 12, 2026
willwashburn pushed a commit that referenced this pull request Jun 12, 2026

@willwashburn willwashburn left a comment

Member Author

Verdict: NO-GO. GitHub would not allow this account to request changes on its own PR, but the finding below should block merge until fixed.

  1. [major] openapi.yaml:3815 Duplicate path keys make the OpenAPI document invalid. This PR adds detailed /nodes, /nodes/{name}, /triggers, and /triggers/{id} specs at openapi.yaml:2848, but leaves the old stub definitions starting at openapi.yaml:3815. YAML maps cannot contain duplicate keys; running node -e "const fs=require('fs'); const YAML=require('yaml'); YAML.parse(fs.readFileSync('openapi.yaml','utf8'));" fails with "Map keys must be unique" at line 3815. Generators and strict parsers cannot reliably consume the API spec, and less-strict parsers will shadow the detailed node/trigger request schemas and security with the later generic stubs. Remove the old stub path blocks or merge their content into the new detailed definitions so each path appears exactly once.

  2. [nit] memory/workspace/.relay/state.json:1 Generated relay sync state changed only timestamps/counters. Drop this file from the PR or ignore generated state so the fleet mailbox change does not carry local runtime metadata.

Verification run in /tmp/review-Rev2x4:

  • git merge-base --is-ancestor 8949f842 HEAD passed at c39fc79.
  • npm install completed; initial root npm run build/npm run test passed via Turbo cache.
  • Direct runs passed: npm run build --workspace=@relaycast/engine, npm run typecheck --workspace=@relaycast/engine, npx vitest run src/__tests__/conformance/delivery.test.ts src/__tests__/conformance/node.test.ts src/__tests__/atomicity.test.ts from packages/engine, and npm run test --workspace=@relaycast/types.
  • Round-4 target checks: replay/redelivery now preserves original payload event types via the shared builder and equality regressions; TTL expiry update is status-guarded; fanout re-resolves current recipient location before routing.

@willwashburn willwashburn marked this pull request as ready for review June 15, 2026 15:28

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ef058d0af2

Comment on lines +56 to +58
const onlineAgentIds = await deps.presence.getOnline(workspaceId);
if (onlineAgentIds.length > 0) {
  fanoutTasks.push(deps.realtime.deliverToAgents({ workspaceId, agentIds: onlineAgentIds, event: payload }));

P1 Badge Don't broadcast action results to every online agent

When an action completes with private output or error data, this presence fanout sends the transformed action.completed/action.failed payload to every online agent in the workspace. The existing route only targeted the invoking caller_id plus the workspace stream, so unrelated connected agents now receive another agent's action result; remove this presence-wide delivery or gate it behind the workspace-stream authorization path.

const updateTriggerBodySchema = triggerBodySchema.partial();

// POST /v1/triggers
triggerRoutes.post('/triggers', requireAuth, rateLimit, async (c) => {

P1 Badge Restrict trigger creation to trusted principals

Because this endpoint accepts any agent token, any agent can create workspace-wide triggers; those triggers later invoke actions using the future message author's caller_id/caller_name, so an agent can set up a trigger for an action only available to another agent and have it run when that agent posts matching text. Require a workspace key for trigger management or store and enforce the trigger creator's permissions when firing.

Comment thread packages/sdk-typescript/src/types.ts Outdated
export interface NodeRosterEntry {
  id: string;
  name: string;
  capabilities: string[];

P2 Badge Model node capabilities as objects in the SDK

The engine stores and returns node capabilities as FleetCapability objects after enrollment/registration, but the SDK declares them as string[]. SDK consumers will compile code such as capabilities.includes('echo') that silently fails against the actual { name: 'echo' } objects; either expose the object shape here or map the response to strings before returning it.
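
A small sketch of the two options named above, under the assumption that a `FleetCapability` carries at least a `name` field (only `{ name: 'echo' }` is shown in the review; any further fields are unknown). `hasCapability` and `capabilityNames` are hypothetical helpers, not SDK functions.

```typescript
// Assumed minimal shape; the real FleetCapability may carry more fields.
interface FleetCapability {
  name: string;
}

interface NodeRosterEntry {
  id: string;
  name: string;
  capabilities: FleetCapability[];
}

// Option 1: expose the object shape and match on the name field.
function hasCapability(node: NodeRosterEntry, capability: string): boolean {
  return node.capabilities.some((c) => c.name === capability);
}

// Option 2: map the response to plain strings before returning it.
function capabilityNames(node: NodeRosterEntry): string[] {
  return node.capabilities.map((c) => c.name);
}
```

With the objects typed correctly, `capabilities.includes('echo')` stops compiling instead of silently returning `false` at runtime.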

agent-relay-code Bot added a commit that referenced this pull request Jun 15, 2026
agent-relay-code Bot added a commit that referenced this pull request Jun 15, 2026
agent-relay-code Bot added a commit that referenced this pull request Jun 15, 2026
@willwashburn willwashburn merged commit 31caef8 into main Jun 15, 2026
4 checks passed
@willwashburn willwashburn deleted the feat/fleet-mailbox branch June 15, 2026 16:21
willwashburn added a commit that referenced this pull request Jun 15, 2026
#192 (merged) owns 0017_spawn_reservation_and_retry_state and 0018; the Phase 2
mailbox migration was authored as 0017 on an older base, colliding on the 0017
prefix once #192/#193 landed in main. Renumber to 0019 (after 0018) so the D1
migration sequence is unique and ordered. Pure file rename — no code references
the filename, and the migration has not been applied to any environment yet
(engine unpublished), so there is no D1 re-apply risk.
willwashburn added a commit that referenced this pull request Jun 15, 2026
…ivery guarantees (Phase 6) (#194)

* feat(engine): per-workspace fleet rollout flag + migration single-delivery guarantees (Phase 6)

Gate the entire fleet node control surface behind a per-workspace
`fleet_nodes_enabled` flag (default OFF), so fleet can ship dark and roll
out workspace-by-workspace. Legacy per-agent WS delivery is unaffected
either way.

The flag is checked once at each genuine boundary (no scattered checks):
- node control WS (`/v1/node/ws`) rejects with `fleet_nodes_disabled` (404)
- node roster routes (`/v1/nodes*`) return a flat 404 via `requireFleetNodes`
- declarative trigger evaluation is skipped at the message hook
- spawn placement + node-handler dispatch refuse in `invokeAction`
  (agent-handler actions stay available)

Flag source mirrors the workspace-stream pattern: a KV override with a
short in-memory cache, defaulting to `EngineConfig.fleetNodesEnabled`.
GET/PUT `/v1/workspace/fleet-nodes` toggles the per-workspace override.
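
The flag resolution described here (override first, short cache, engine default as fallback) can be sketched as follows. `createFleetFlag`, the synchronous `readOverride` callback, and the cache TTL are assumptions for illustration; a real KV read would be asynchronous.

```typescript
// undefined means "no override stored; inherit the engine default".
type Override = boolean | undefined;

function createFleetFlag(
  defaultEnabled: boolean,
  readOverride: (workspaceId: string) => Override,
  cacheTtlMs: number,
): (workspaceId: string, now: number) => boolean {
  const cache = new Map<string, { value: boolean; expiresAt: number }>();
  return function isFleetNodesEnabled(workspaceId: string, now: number): boolean {
    const hit = cache.get(workspaceId);
    if (hit && hit.expiresAt > now) return hit.value;
    // Override wins when present; otherwise fall back to the engine default.
    const value = readOverride(workspaceId) ?? defaultEnabled;
    cache.set(workspaceId, { value, expiresAt: now + cacheTtlMs });
    return value;
  };
}
```

The trade-off of the cache is that a PUT toggle can take up to `cacheTtlMs` to become visible on an instance that already resolved the flag.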

Tests:
- flag OFF -> every node surface inert (roster, spawn, WS gate, triggers)
- per-workspace override flips the surface on/off; WS gate follows the flag
- migration single-delivery: a legacy self-connected agent is never also
  delivered via a node when the flag flips mid-stream (exclusive location;
  a node's `agent.register` for it is rejected `agent_location_conflict`)

The conformance harness defaults the flag ON, so existing node integration
tests pass unchanged. Full engine suite green (108 tests).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(engine): accept node token via Authorization: Bearer + gate node WS upgrade behind fleet flag (Phase 6)

Cross-repo compat fix surfaced by the Phase 6 two-node E2E: a real relay
broker could never bring a node online against a self-hosted engine.

Root cause: the node-control read-side is the Node HTTP-server `upgrade`
handler in `entrypoints/node.ts` (the Hono `/v1/node/ws` route only answers
the 426 — Node owns the 101). That handler read the token ONLY from the
`?token=` query param, but the relay Rust broker's node_control client sends
it as `Authorization: Bearer <nt_live_…>`. It also had NO fleet-flag gate for
`/v1/node/ws` (only the rk_live workspace-stream path was gated), so the
Phase 6 rollout flag did not actually cover the node control surface on the
self-host adapter.

Fix, both in the upgrade handler:
- read the node token from `?token=` query OR `Authorization: Bearer` header
  (query stays for SDK/Pear; header unblocks the shipped broker — no Rust
  release needed)
- gate the `/v1/node/ws` upgrade behind `isFleetNodesEnabled` (404 when off),
  mirroring the existing stream gate
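
The dual-transport token read can be sketched as a pure function over the request URL and the `Authorization` header. `extractNodeToken` is a hypothetical helper, and checking the query parameter before the header is an assumption; the commit message only says both sources are accepted.

```typescript
function extractNodeToken(requestUrl: string, authorization?: string): string | undefined {
  // 1. ?token= query parameter (kept for SDK/Pear clients).
  const queryStart = requestUrl.indexOf('?');
  if (queryStart !== -1) {
    const query = requestUrl.slice(queryStart + 1).split('#')[0] ?? '';
    for (const pair of query.split('&')) {
      const eq = pair.indexOf('=');
      if (eq === -1) continue;
      if (pair.slice(0, eq) === 'token' && eq < pair.length - 1) {
        return decodeURIComponent(pair.slice(eq + 1));
      }
    }
  }
  // 2. Authorization: Bearer <token> header (what the broker sends).
  const match = /^Bearer\s+(\S+)$/i.exec(authorization ?? '');
  return match ? match[1] : undefined;
}
```

A caller would reject the upgrade with 401 when this returns `undefined`, matching the behavior the test description attributes to a missing or malformed token.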

Also mirrored the dual-transport read in the Hono `/v1/node/ws` route for any
adapter that routes upgrades through it.

Accepted-stack PRs involved: engine read-side #192, broker send-side #1107.
The hosted (Cloudflare DO) equivalent is handled in PR 5.

Test: `nodeUpgradeAuth.test.ts` boots the real Node server and asserts a WS
client authenticates via BOTH the Bearer header and the query param, that the
upgrade is rejected while the workspace flag is off (404), and that a
missing/malformed token is rejected (401). Full engine suite green (111).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(engine): reply to agent.register with a broker-shaped `reply` frame (Phase 6 token authority)

Third cross-repo compat fix surfaced by the Phase 6 E2E (spawn scenarios):
spawn never completes end-to-end against a real broker. The relay broker's
node_control client awaits a `reply` frame keyed by the request id — it matches
`pending_agent_registrations` by `reply.id` and parses `data` as
`{agent_id, token, name}` with `deny_unknown_fields`. The engine instead
answered `agent.register` with a bare `{type:'agent.registered', ...}` carrying
the full object (incl. invocation_id/session_ref), which the broker never
matches → `register_fleet_agent_token` hangs to its 30s timeout → the spawn
action fails. This blocked every spawn-dependent path (placement completion,
mailbox delivery to via-node agents, resume).

Reply in the shape the shipped broker consumes; the broker already holds the
invocation_id/session_ref it sent, so only the minted identity is echoed. Same
root pattern as the node-token transport mismatch (#192 read-side ↔ #1107
broker send-side); no Rust release needed.
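
The reply shape described above can be sketched as a builder that echoes only the minted identity. The `data` fields `{agent_id, token, name}` and the keying by request id come from this commit message; the `type: 'reply'` discriminator and the names `buildRegisterReply` and `RegisteredAgent` are assumptions.

```typescript
// What the engine knows after registration (extra fields included).
interface RegisteredAgent {
  agent_id: string;
  token: string;
  name: string;
  invocation_id?: string;
  session_ref?: string;
}

interface ReplyFrame {
  type: 'reply';
  id: string; // request id the broker is waiting on
  data: { agent_id: string; token: string; name: string };
}

function buildRegisterReply(requestId: string, agent: RegisteredAgent): ReplyFrame {
  // Copy fields explicitly: a consumer that denies unknown fields would
  // reject a spread of the full object.
  return {
    type: 'reply',
    id: requestId,
    data: { agent_id: agent.agent_id, token: agent.token, name: agent.name },
  };
}
```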

Updated the one conformance helper that asserted the old frame. Engine suite
green (111).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(engine): self-host serve env for fleet flag default + mailbox TTL/depth-cap

The `relaycast-engine` serve bin gains optional env tuning so operators (and the
Phase 6 fleet E2E) can configure the bounded mailbox and the fleet rollout
default without code changes:
- RELAYCAST_FLEET_NODES_ENABLED=1 → EngineConfig.fleetNodesEnabled
- RELAYCAST_MAILBOX_TTL_MS / RELAYCAST_MAILBOX_DEPTH_CAP → mailbox tuning

Unset env leaves the existing defaults untouched.
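
A sketch of how such env tuning could be parsed so that unset or malformed values leave the defaults alone. The three variable names come from this commit message; `serveOverridesFromEnv`, `ServeOverrides`, and the rule that only `'1'` enables the flag and only positive integers are accepted are assumptions.

```typescript
interface ServeOverrides {
  fleetNodesEnabled?: boolean;
  mailboxTtlMs?: number;
  mailboxDepthCap?: number;
}

function parsePositiveInt(raw: string | undefined): number | undefined {
  if (raw === undefined || !/^\d+$/.test(raw)) return undefined;
  const n = Number(raw);
  return Number.isSafeInteger(n) && n > 0 ? n : undefined;
}

// Takes the environment as a parameter so it can be tested without process.env.
function serveOverridesFromEnv(env: Record<string, string | undefined>): ServeOverrides {
  const out: ServeOverrides = {};
  if (env['RELAYCAST_FLEET_NODES_ENABLED'] === '1') out.fleetNodesEnabled = true;
  const ttl = parsePositiveInt(env['RELAYCAST_MAILBOX_TTL_MS']);
  if (ttl !== undefined) out.mailboxTtlMs = ttl;
  const cap = parsePositiveInt(env['RELAYCAST_MAILBOX_DEPTH_CAP']);
  if (cap !== undefined) out.mailboxDepthCap = cap;
  return out; // keys that are absent here fall back to the existing defaults
}
```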

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* chore(engine): renumber fleet mailbox migration 0017→0019 to deduplicate

#192 (merged) owns 0017_spawn_reservation_and_retry_state and 0018; the Phase 2
mailbox migration was authored as 0017 on an older base, colliding on the 0017
prefix once #192/#193 landed in main. Renumber to 0019 (after 0018) so the D1
migration sequence is unique and ordered. Pure file rename — no code references
the filename, and the migration has not been applied to any environment yet
(engine unpublished), so there is no D1 re-apply risk.

* docs(changelog): record fleet node/mailbox changes + breaking DeliveryStatus remap

Changelogs here are hand-curated (no CI generation), and the fleet stack
(#191-#194) was missing from them. Add the user-facing entries:

- @relaycast/types: new CHANGELOG; document the breaking DeliveryStatus enum
  remap (accepted/deferred removed, acked/dead_lettered added, delivered
  re-meaning) with old->new mapping + flag-independent migration note, the new
  Delivery location/lifecycle fields, and the fleet-wire protocol module.
- @relaycast/sdk-typescript: node roster API (nodes.list/get, triggers.list),
  capability objects, handler/dispatch node fields, JsonValue export, and the
  breaking action-output widening + delivery status value change.

These confirm the next @relaycast/types + sdk-typescript publish is a MAJOR.

* chore: apply pr-reviewer fixes for #194

* fix(engine): mailbox cumulative-ack + depth-cap correctness (Codex review)

Address P2 findings from Codex review of the fleet mailbox delivery path:

1. ackDelivery (single per-delivery REST ack) advanced the cumulative cursor to
   the row's own seq, so acking seq 2 while seq 1 is queued moved the cursor past
   seq 1; deliverPendingToNode (seq > delivery_ack_seq) then skipped it forever on
   node replay. Make the cursor advance opt-in (ackRows advanceCursorTo?) — single
   acks no longer advance it; the row's acked status already excludes it from replay.
   The node delivery.ack {up_to_seq} path still advances cumulatively. Regression test.

2. Migration 0019 seeded delivery_ack_seq = MAX(acked seq), skipping an older
   still-queued row below a newer acked one. Seed from the contiguous acked prefix
   (lowest active seq - 1; max seq when nothing is active).

3. Node-replay event classification checked dmType before threadId, so a thread
   reply inside a DM/group DM would replay as dm.received instead of thread.reply
   (the live routes/thread.ts routing). Check threadId first to mirror live.

4. Mailbox depth-cap count included expired-but-unswept rows, so an idle recipient
   kept rejecting new sends as depth_cap after TTL instead of dead-lettering.
   Exclude expired rows from the count (matches the replay query). Regression test.
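
Items 2 and 3 lend themselves to a short sketch. The functions below are hypothetical (`seedAckCursor`, `classifyReplayEvent`), and the `dmType` value `'group'` is an assumption; the seeding rule and the threadId-first ordering follow the text above.

```typescript
interface SeqRow {
  seq: number;
  status: 'queued' | 'delivered' | 'acked' | 'dead_lettered';
}

// Item 2: seed the cursor from the contiguous acked prefix.
// lowest active seq - 1, or the max seq when nothing is active.
function seedAckCursor(rows: SeqRow[]): number {
  const active = rows
    .filter((r) => r.status === 'queued' || r.status === 'delivered')
    .map((r) => r.seq);
  if (active.length > 0) return Math.min(...active) - 1;
  return rows.reduce((max, r) => Math.max(max, r.seq), 0);
}

// Item 3: check threadId before dmType so a thread reply inside a DM
// replays as thread.reply, mirroring the live route.
function classifyReplayEvent(msg: { threadId?: string | null; dmType?: string | null }): string {
  if (msg.threadId) return 'thread.reply';
  if (msg.dmType === 'group') return 'group_dm.received';
  if (msg.dmType) return 'dm.received';
  return 'message.created';
}
```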

Also classify the operator-only /v1/workspace/fleet-nodes flag route as non-SDK in
sdk-openapi-sync (pre-existing #194 gap that turbo test caching had masked).

* docs(openapi): require enabled|mode and document 400 on PUT /workspace/fleet-nodes

The PUT handler rejects a payload lacking both `enabled` (boolean) and
`mode: inherit` with a 400 invalid_request, but the schema marked both optional
and documented only a 200. Add anyOf[required: enabled | required: mode] to
reflect the runtime constraint, and document the 400 (ErrorResponse).
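
The runtime constraint behind that `anyOf` can be sketched as a small validator. `parseFleetNodesBody` is a hypothetical name, and preferring `enabled` when both fields are present is an assumption; the accepted shapes and the 400 `invalid_request` result come from the text above.

```typescript
type FleetNodesUpdate = { enabled: boolean } | { mode: 'inherit' };

type ParseResult =
  | { ok: true; value: FleetNodesUpdate }
  | { ok: false; status: 400; code: 'invalid_request' };

function parseFleetNodesBody(body: unknown): ParseResult {
  if (typeof body === 'object' && body !== null) {
    const fields = body as Record<string, unknown>;
    const enabled = fields['enabled'];
    // anyOf branch 1: a boolean `enabled`.
    if (typeof enabled === 'boolean') return { ok: true, value: { enabled } };
    // anyOf branch 2: `mode: inherit` drops the per-workspace override.
    if (fields['mode'] === 'inherit') return { ok: true, value: { mode: 'inherit' } };
  }
  return { ok: false, status: 400, code: 'invalid_request' };
}
```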

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>
willwashburn added a commit that referenced this pull request Jun 15, 2026
Rebased feature/engine-retention onto main, which now has the fleet stack
(#191-#194). Three adaptations:

1. Renumber migration 0016_workspace_retention -> 0020_workspace_retention; #192
   took 0016 (fleet_nodes) and #193 took 0019 (fleet_mailbox).
2. Delivery status model: #193 reworked the enum, so SETTLED_DELIVERY_STATUSES is
   now ['acked','failed','dead_lettered'] (was ['delivered','failed']). 'delivered'
   is now IN-FLIGHT (sent, awaiting cumulative ack), so retention must never prune
   it; 'acked' is terminal success. Updated tests to the new status names.
3. insertDelivery test helper assigns a distinct seq per agent — the mailbox
   migration added UNIQUE(workspace_id, agent_id, seq), so same-agent rows can no
   longer share the default seq 0.
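
The retention rule in item 2 can be sketched as a predicate over the new status names. `SETTLED_DELIVERY_STATUSES` and its values come from this commit message; `isPrunable`, the `updatedAt` field, and the cutoff comparison are assumptions for illustration.

```typescript
type DeliveryStatus = 'queued' | 'delivered' | 'acked' | 'failed' | 'dead_lettered';

// 'delivered' is in-flight under the mailbox model, so it is not settled.
const SETTLED_DELIVERY_STATUSES: readonly DeliveryStatus[] = ['acked', 'failed', 'dead_lettered'];

function isPrunable(row: { status: DeliveryStatus; updatedAt: number }, cutoff: number): boolean {
  return SETTLED_DELIVERY_STATUSES.includes(row.status) && row.updatedAt < cutoff;
}
```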

Note: turbo build/tsc is currently red on main itself (engine.ts:212 uses
originInfo.origin_surface, which #188 removed from the telemetry contract) — a
pre-existing #188/#192 collision unrelated to this PR. Engine vitest is green
(132/132).
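
A sketch of adaptation 2, under assumptions: `SETTLED_DELIVERY_STATUSES` and its values come from the commit message, while `isSettled`, `isPrunable`, and the timestamp parameters are hypothetical helpers added for illustration. It shows why `delivered` rows are never pruned under the new status model:

```typescript
type DeliveryStatus = "queued" | "delivered" | "acked" | "failed" | "dead_lettered";

// Values as listed in the commit message; 'delivered' is deliberately absent
// because it now means in-flight, awaiting cumulative ack.
const SETTLED_DELIVERY_STATUSES: readonly DeliveryStatus[] = [
  "acked",
  "failed",
  "dead_lettered",
];

function isSettled(status: DeliveryStatus): boolean {
  return SETTLED_DELIVERY_STATUSES.includes(status);
}

// Hypothetical retention check: only settled rows older than the TTL qualify.
function isPrunable(
  status: DeliveryStatus,
  settledAtMs: number,
  nowMs: number,
  ttlMs: number,
): boolean {
  return isSettled(status) && nowMs - settledAtMs >= ttlMs;
}
```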
willwashburn added a commit that referenced this pull request Jun 15, 2026
…llow-ups (#189)

* feat(engine): retention pruning with per-workspace TTLs and outbox follow-ups

Add pruneExpired: bounded-batch deletion of expired messages (leaf-first
across thread parents), settled deliveries, message logs, and orphaned
read receipts, with per-workspace TTLs in a new nullable
workspaces.retention column. Message retention is opt-in; settled
deliveries and message logs default to 90 days as operational logs.
Runs on the Node adapter's outbox cleanup cadence and is exported for
queue-backed scheduled handlers.

cleanupOldEvents now settles exhausted pending_events rows as failed so
they become prunable instead of lingering unclaimable, and
sendWebhookEvent skips the outbox insert and queue send entirely (with
a per-request memoized existence probe) for workspaces with no active
event subscription.
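
One way a per-request memoized existence probe could look, as a rough sketch. `memoizeProbe` and the probe signature are assumptions and not the actual `sendWebhookEvent` internals. The cache is meant to be created once per request, so repeated sends for the same workspace trigger a single lookup:

```typescript
type ExistenceProbe = (workspaceId: string) => Promise<boolean>;

// Wraps a probe so each workspace id is looked up at most once per cache.
// Caching the promise (not the value) also dedupes concurrent callers.
function memoizeProbe(probe: ExistenceProbe): ExistenceProbe {
  const cache = new Map<string, Promise<boolean>>();
  return (workspaceId: string): Promise<boolean> => {
    let pending = cache.get(workspaceId);
    if (pending === undefined) {
      pending = probe(workspaceId);
      cache.set(workspaceId, pending);
    }
    return pending;
  };
}
```

A caller would await the memoized probe and return early, skipping both the outbox insert and the queue send, when it resolves to `false`.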

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* chore(engine): adapt retention to merged fleet engine (rebase onto main)

Rebased feature/engine-retention onto main, which now has the fleet stack
(#191-#194). Three adaptations:

1. Renumber migration 0016_workspace_retention -> 0020_workspace_retention; #192
   took 0016 (fleet_nodes) and #193 took 0019 (fleet_mailbox).
2. Delivery status model: #193 reworked the enum, so SETTLED_DELIVERY_STATUSES is
   now ['acked','failed','dead_lettered'] (was ['delivered','failed']). 'delivered'
   is now IN-FLIGHT (sent, awaiting cumulative ack), so retention must never prune
   it; 'acked' is terminal success. Updated tests to the new status names.
3. insertDelivery test helper assigns a distinct seq per agent — the mailbox
   migration added UNIQUE(workspace_id, agent_id, seq), so same-agent rows can no
   longer share the default seq 0.

Note: turbo build/tsc is currently red on main itself (engine.ts:212 uses
originInfo.origin_surface, which #188 removed from the telemetry contract) — a
pre-existing #188/#192 collision unrelated to this PR. Engine vitest is green
(132/132).
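
Adaptation 3 can be illustrated with a small allocator. This is a hypothetical helper, not the actual `insertDelivery` test code, and starting each agent's sequence at 1 is an assumption. It hands out a distinct `seq` per (workspace, agent) pair so rows satisfy `UNIQUE(workspace_id, agent_id, seq)`:

```typescript
// Returns a function that yields 1, 2, 3, ... independently per (workspace, agent).
function makeSeqAllocator(): (workspaceId: string, agentId: string) => number {
  const lastSeq = new Map<string, number>();
  return (workspaceId: string, agentId: string): number => {
    // JSON-encode the pair so ids containing separators cannot collide.
    const key = JSON.stringify([workspaceId, agentId]);
    const seq = (lastSeq.get(key) ?? 0) + 1;
    lastSeq.set(key, seq);
    return seq;
  };
}
```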

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
willwashburn added a commit that referenced this pull request Jun 15, 2026
Rebased #190 onto main (resolving the #188 origin-contract changes:
origin_surface is gone; only origin_actor + origin_client/origin_version
remain — confirmed no origin_surface references survive).

Correctness fixes layered on top of #190's parity additions:

- DeliveryStatus: updated the stale Literal["accepted","delivered",
  "deferred","failed"] to the canonical #193 enum
  Literal["queued","delivered","acked","failed","dead_lettered"]
  (packages/types/src/delivery.ts). "delivered" now means in-flight
  awaiting ack; "acked" is terminal success; accepted/deferred removed.

- Delivery model: aligned with the canonical DeliverySchema by adding the
  missing fields seq, location_type, location_node_id, expires_at,
  delivered_at, acked_at, dead_lettered_at to match the TS SDK surface.

- channels.set_topic: corrected the route from PATCH /v1/channels/{name}
  to PATCH /v1/channels/{name}/topic to match the TS setTopic() and the
  dedicated openapi endpoint (it was colliding with channels.update).

- channels.invite: corrected the request body field from {"agent": ...}
  to {"agent_name": ...} to match InviteRequestSchema / the TS SDK wire
  shape (Python sends keys verbatim with no camel->snake conversion).

- Updated test_channels_set_topic to assert the corrected /topic route.
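
For reference, the status change in the first bullet can be sketched on the TypeScript side as a literal union with a runtime guard. The five values are the ones listed above; `DELIVERY_STATUSES` and `isDeliveryStatus` are illustrative names and may not match what packages/types/src/delivery.ts exports:

```typescript
// The canonical lifecycle values named in this commit message.
const DELIVERY_STATUSES = [
  "queued",
  "delivered",
  "acked",
  "failed",
  "dead_lettered",
] as const;

type DeliveryStatus = (typeof DELIVERY_STATUSES)[number];

// Runtime guard: the removed values "accepted" and "deferred" are rejected.
function isDeliveryStatus(value: unknown): value is DeliveryStatus {
  return (
    typeof value === "string" &&
    (DELIVERY_STATUSES as readonly string[]).includes(value)
  );
}
```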

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
willwashburn added a commit that referenced this pull request Jun 15, 2026
)

* Add Python SDK parity endpoints

* Fix Python SDK parity: delivery status enum, channel topic/invite paths

Rebased #190 onto main (resolving the #188 origin-contract changes:
origin_surface is gone; only origin_actor + origin_client/origin_version
remain — confirmed no origin_surface references survive).

Correctness fixes layered on top of #190's parity additions:

- DeliveryStatus: updated the stale Literal["accepted","delivered",
  "deferred","failed"] to the canonical #193 enum
  Literal["queued","delivered","acked","failed","dead_lettered"]
  (packages/types/src/delivery.ts). "delivered" now means in-flight
  awaiting ack; "acked" is terminal success; accepted/deferred removed.

- Delivery model: aligned with the canonical DeliverySchema by adding the
  missing fields seq, location_type, location_node_id, expires_at,
  delivered_at, acked_at, dead_lettered_at to match the TS SDK surface.

- channels.set_topic: corrected the route from PATCH /v1/channels/{name}
  to PATCH /v1/channels/{name}/topic to match the TS setTopic() and the
  dedicated openapi endpoint (it was colliding with channels.update).

- channels.invite: corrected the request body field from {"agent": ...}
  to {"agent_name": ...} to match InviteRequestSchema / the TS SDK wire
  shape (Python sends keys verbatim with no camel->snake conversion).

- Updated test_channels_set_topic to assert the corrected /topic route.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(sdk-swift): bring Swift SDK to 100% parity with TypeScript SDK

Add the relay-level surfaces that were missing from the Swift SDK:

- nodes namespace: list (GET /v1/nodes, capability/name filters), get
  (GET /v1/nodes/{name}) with NodeRosterEntry + NodeCapability models
- triggers namespace: create/list/get/update/delete full lifecycle
  (POST/GET/PATCH/DELETE /v1/triggers[/{id}]) with Trigger,
  CreateTriggerRequest, UpdateTriggerRequest models
- activity feed: activity(limit) -> GET /v1/activity
- workspace-level DM queries: allDMConversations (GET
  /v1/dm/conversations/all) and dmMessages (GET
  /v1/dm/conversations/{id}/messages)

Fix the stale DeliveryStatus enum to the current statuses
(queued|delivered|acked|failed|dead_lettered), replacing the old
accepted/deferred values.

All routes verified present in openapi.yaml. Adds tests for nodes,
triggers, workspace DM queries, activity, and the delivery-status enum.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(sdk-rust): full Rust↔TypeScript SDK parity

Bring the Rust SDK to 100% feature parity with the TypeScript reference
SDK. Every new route is documented in openapi.yaml.

New RelayCast surfaces:
- Workspace bootstrap: lookup_workspace (GET /v1/workspaces/by-name/{name})
- A2A: register_a2a, list_a2a_agents, remove_a2a_agent, get_a2a_agent_card
- Routing: route, route_feedback, get_routing_config, update_routing_config
- Directory: search_directory, publish_to_directory, list_directory,
  get_directory_agent, update_directory_agent, delete_directory_agent,
  list_directory_ratings, rate_directory_agent
- Skills: import_skills, search_skills
- Fleet nodes: list_nodes, get_node
- Triggers: create_trigger, list_triggers, get_trigger, update_trigger,
  delete_trigger
- Certification: certify, get_certification, certification_badge_url,
  monitor_certification
- Console: console_messages, console_stats (ConsoleOverview), console_agents,
  console_costs

New AgentClient surfaces:
- channels mute_channel / unmute_channel
- invite_to_channel fixed to send documented `agent_name` body

Models: added serde structs for A2A cards/records, directory agents/skills/
ratings, routing config/weights, skill search results, node roster with
capability objects, triggers, certification runs, and console stats —
all snake_case to match the wire contract.

DeliveryStatus enum updated to the canonical lifecycle
(queued|delivered|acked|failed|dead_lettered); tests updated to match.

Adds parity tests for every new surface.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: apply pr-reviewer fixes for #190

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>