Update #182 to current main; keep only the workflow rebase-and-retry fix by willwashburn · Pull Request #202 · AgentWorkforce/relaycast

willwashburn · 2026-06-19T16:41:28Z

Merge this into fix/release-commit-back-race to make #182 mergeable again.

What this does

main has advanced from 2.x to 4.1.x since #182 was opened, leaving it conflicted (dirty). #182 carried two kinds of changes:

- Workflow rebase-and-retry (publish-npm.yml, publish-rust.yml): still relevant; main's publish workflows still use the racy git push origin HEAD:main.
- Version reconciliation (2.5.1 → 2.6.0): now obsolete since main is at 4.x.

This branch is current main + only the workflow changes. After merging it into fix/release-commit-back-race, that branch contains everything on main plus the workflow fix, so #182's net diff against main becomes just the two workflow files and it stops conflicting.
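For readers unfamiliar with the fix, the rebase-and-retry idea can be sketched as a bounded loop. This is a minimal TypeScript illustration, not the actual workflow step (which lives in the YAML files): the `GitOps` interface, `pushWithRebaseRetry`, `makeFlakyRemote`, and the attempt limit of 5 are all hypothetical.

```typescript
// Hypothetical abstraction over the git commands a release job would run.
interface GitOps {
  push(): boolean;         // e.g. `git push origin HEAD:main`; false if rejected
  fetchAndRebase(): void;  // e.g. `git fetch origin main` then `git rebase origin/main`
}

// Try to push; on rejection, rebase onto the updated remote and try again.
// Returns the attempt number that succeeded.
function pushWithRebaseRetry(git: GitOps, maxAttempts = 5): number {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (git.push()) return attempt;
    if (attempt < maxAttempts) git.fetchAndRebase();
  }
  throw new Error(`push still rejected after ${maxAttempts} attempts`);
}

// Simulated remote that rejects the first N pushes (another job won the race).
function makeFlakyRemote(rejections: number): GitOps & { rebases: number } {
  let remaining = rejections;
  const remote: GitOps & { rebases: number } = {
    rebases: 0,
    push: () => remaining-- <= 0,
    fetchAndRebase: () => {
      remote.rebases++;
    },
  };
  return remote;
}
```

A plain `git push origin HEAD:main` corresponds to calling `push()` once and failing the job on the first rejection; the loop only changes what happens after a rejection.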

Note on history

Because this branch is built fresh on top of current main (pushing to the original fix/release-commit-back-race from this environment failed with a 403), the file list below is large. The effective change relative to main is only the two workflow files; everything else is fix/release-commit-back-race catching up to main.

🤖 Generated with Claude Code


…tion capability (#174)
* feat(engine): atomic multi-statement write paths via optional transaction capability
  The database port gains an optional TransactionCapability that adapters attach when their driver supports interactive transactions, plus a runAtomic(db, fn) helper that uses it when present and falls back to plain sequential statements otherwise (unchanged D1 behavior).
  The Node better-sqlite3 adapter implements the capability with manual BEGIN IMMEDIATE / COMMIT / ROLLBACK, serialized through a promise queue so concurrent requests on the shared connection cannot interleave with an open transaction.
  Wrapped write paths (DB writes only; realtime/webhook fanout stays in routes, outside the transaction):
  - channel message send (message + attachments + deliveries + message_log)
  - DM send (message + attachments + delivery + message_log)
  - group DM send (message + attachments + deliveries)
  - thread reply (reply + deliveries)
  - markRead (read receipt + delivery transition + lastReadId)
  On self-host, a failure mid-send no longer leaves a message row with no delivery rows (silent durable-delivery loss).
  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* chore: apply pr-reviewer fixes for #174
* chore: apply pr-reviewer fixes for #174
* chore: apply pr-reviewer fixes for #174
* chore: apply pr-reviewer fixes for #174
* fix(engine): revalidate group DM sends atomically
---------
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>
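The optional-capability pattern from #174 can be sketched roughly as follows. This is a simplified, synchronous illustration under stated assumptions: the `Db` shape, `attachTransactions`, and `makeRecordingDb` are hypothetical, the real port is presumably async, and the promise-queue serialization mentioned in the commit is omitted here.

```typescript
// Hypothetical shapes; the real port in ports/database.ts may differ.
interface TransactionCapability {
  withTransaction<T>(fn: () => T): T;
}
interface Db {
  exec(sql: string): void;
  transaction?: TransactionCapability; // attached only by adapters that support it
}

// Use the capability when present; otherwise just run the statements in order.
function runAtomic<T>(db: Db, fn: (db: Db) => T): T {
  return db.transaction ? db.transaction.withTransaction(() => fn(db)) : fn(db);
}

// Manual BEGIN IMMEDIATE / COMMIT / ROLLBACK, as a Node-style adapter might do it.
function attachTransactions(db: Db): Db {
  db.transaction = {
    withTransaction<T>(fn: () => T): T {
      db.exec("BEGIN IMMEDIATE");
      try {
        const result = fn();
        db.exec("COMMIT");
        return result;
      } catch (err) {
        db.exec("ROLLBACK"); // undo partial writes, then surface the error
        throw err;
      }
    },
  };
  return db;
}

// Test double that records every statement instead of touching a database.
function makeRecordingDb(): Db & { log: string[] } {
  const log: string[] = [];
  return {
    log,
    exec: (sql: string) => {
      log.push(sql);
    },
  };
}
```

A handle without the capability runs the same callback with no BEGIN/COMMIT around it, which matches the "unchanged D1 behavior" fallback described above.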

* feat(engine): atomic multi-statement write paths via optional transaction capability The database port gains an optional TransactionCapability that adapters attach when their driver supports interactive transactions, plus a runAtomic(db, fn) helper that uses it when present and falls back to plain sequential statements otherwise (unchanged D1 behavior). The Node better-sqlite3 adapter implements the capability with manual BEGIN IMMEDIATE / COMMIT / ROLLBACK, serialized through a promise queue so concurrent requests on the shared connection cannot interleave with an open transaction. Wrapped write paths (DB writes only — realtime/webhook fanout stays in routes, outside the transaction): - channel message send (message + attachments + deliveries + message_log) - DM send (message + attachments + delivery + message_log) - group DM send (message + attachments + deliveries) - thread reply (reply + deliveries) - markRead (read receipt + delivery transition + lastReadId) On self-host, a failure mid-send no longer leaves a message row with no delivery rows (silent durable-delivery loss). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * chore: apply pr-reviewer fixes for #174 * chore: apply pr-reviewer fixes for #174 * chore: apply pr-reviewer fixes for #174 * chore: apply pr-reviewer fixes for #174 * feat(engine): atomic write batches for D1/hosted handles Write paths gained transactional atomicity on Node in the transactional-write-paths change, but the hosted Cloudflare deployment runs on D1, which has no interactive transactions — a crash between the message insert and the deliveries insert still left a message with no delivery rows. D1 does execute db.batch([...]) atomically, and drizzle's DrizzleD1Database exposes batch() natively, so the hosted handle can get all-or-nothing writes with zero cloud-side changes. 
- ports/database.ts: add AtomicWrite (a built-but-unexecuted drizzle statement) and BatchCapability (D1-style atomic batch), and replace runAtomic(fn) with runAtomicWrites(db, statements). Resolution order: withTransaction (Node) -> batch (D1, detected structurally since only atomic-batch drivers expose the method; better-sqlite3's drizzle instance has no batch member) -> sequential (bare handles, historical behavior). - The five multi-statement write paths (channel send, DM send, group DM send, thread reply, markRead) now do all reads up front and hand runAtomicWrites a pure statement list, so the same list runs under a transaction, one atomic batch, or sequentially. No write depends on a prior write's DB-returned value (IDs are app-generated snowflakes); .returning() rows are recovered from the per-statement results. - message.ts reads attachment details directly from files by id before the writes (the junction rows don't exist yet mid-batch); dm.ts builds the message+attachment inserts via buildDmMessageWrites; console.ts gains buildMessageLogWrite so the log insert can join the batch. - Tests: fake D1-style batch handle (records SQL, executes all-or-nothing) asserting each path issues exactly one batch with the expected statement kinds, batch failure leaves no orphan rows, and bare handles still run sequentially. Failure injection now fires at statement execution (mid-atomic-unit) rather than at build time. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * chore: apply pr-reviewer fixes for #179 * chore: apply pr-reviewer fixes for #179 --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com> Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>
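The resolution order described for runAtomicWrites (withTransaction, then batch, then sequential) can be sketched as below. This is a synchronous simplification, not the engine's code: an `AtomicWrite` is modeled as a plain thunk rather than a built drizzle statement, and `WriteHandle` and the `strategy` field in the return value are hypothetical additions for illustration.

```typescript
// Hypothetical: a built-but-unexecuted statement, modeled as a thunk.
type AtomicWrite = () => unknown;

interface WriteHandle {
  withTransaction?: (fn: () => unknown[]) => unknown[]; // Node-style transaction
  batch?: (statements: AtomicWrite[]) => unknown[];     // D1-style atomic batch
}

type Strategy = "transaction" | "batch" | "sequential";

// Pick the strongest atomicity the handle offers, detected structurally.
function runAtomicWrites(
  db: WriteHandle,
  statements: AtomicWrite[],
): { strategy: Strategy; results: unknown[] } {
  const runAll = () => statements.map((statement) => statement());
  if (typeof db.withTransaction === "function") {
    return { strategy: "transaction", results: db.withTransaction(runAll) };
  }
  if (typeof db.batch === "function") {
    return { strategy: "batch", results: db.batch(statements) };
  }
  // Bare handle: historical behavior, no atomicity guarantee.
  return { strategy: "sequential", results: runAll() };
}
```

Because the statement list is pure (no write depends on a prior write's DB-returned value), the same list can be handed to any of the three paths, and per-statement results come back in list order.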

* feat(engine): durable webhook outbox for the Node adapter The Node self-host adapter delivered webhooks fire-and-forget: an in-process send with 3 inline retries, and any failure or restart lost the event. The hosted Cloudflare path gets real durability from CF Queues + DLQ; self-hosters got message loss. The pending_events table existed in the schema but nothing consumed it. The Node adapter's event queue now uses pending_events as a consumed outbox: - send persists the row first (durable once send resolves), then kicks an immediate poll so delivery stays prompt. - A background poller (configurable interval) claims due rows with a single UPDATE ... WHERE id IN (subquery) RETURNING statement — atomic claim with attempts++ and a lease on process_after, per the no-interactive-transactions doctrine in ports/database.ts. A worker that crashes mid-delivery leaves the row reclaimable after the lease. - Delivery reuses deliverEvent unchanged (HMAC signing, terminal-4xx vs retryable classification): success deletes the row, terminal failures settle it as failed, retryable failures reschedule with capped exponential backoff until max_attempts is exhausted. - Startup resumes leftover due rows, so deliveries survive restarts. - cleanupOldEvents (24h) is wired into the poll cadence so settled rows are pruned. The EventQueue port contract and the Cloudflare path are unchanged; InProcessEventQueue is renamed to DurableEventQueue (engine-internal, no external importers). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * chore: apply pr-reviewer fixes for #175 * chore: apply pr-reviewer fixes for #175 * chore: apply pr-reviewer fixes for #175 --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com> Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>
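The settle rules in #175 (success deletes, terminal failures fail, retryable failures reschedule with capped exponential backoff until attempts run out) might look roughly like this. It is a sketch only: `settle`, `backoffMs`, and the 1 s base / 60 s cap are assumed values, since the commit message does not state the real tuning.

```typescript
type Outcome = "success" | "terminal" | "retryable";
type Settlement =
  | { action: "delete" }
  | { action: "fail" }
  | { action: "reschedule"; processAfter: number };

// Assumed tuning values; the commit does not give the real base delay or cap.
const BASE_DELAY_MS = 1_000;
const MAX_DELAY_MS = 60_000;

// Delay doubles per attempt (1s, 2s, 4s, ...) and is capped at MAX_DELAY_MS.
function backoffMs(attempts: number): number {
  return Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** (attempts - 1));
}

// `attempts` is the count after the claim incremented it.
function settle(outcome: Outcome, attempts: number, maxAttempts: number, now: number): Settlement {
  if (outcome === "success") return { action: "delete" };
  if (outcome === "terminal" || attempts >= maxAttempts) return { action: "fail" };
  return { action: "reschedule", processAfter: now + backoffMs(attempts) };
}
```

Rescheduling by writing a future process_after value is also what makes the lease work: a row claimed by a worker that then crashes simply becomes due again once that time passes.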

…178) * feat(engine): durable webhook outbox for the Node adapter The Node self-host adapter delivered webhooks fire-and-forget: an in-process send with 3 inline retries, and any failure or restart lost the event. The hosted Cloudflare path gets real durability from CF Queues + DLQ; self-hosters got message loss. The pending_events table existed in the schema but nothing consumed it. The Node adapter's event queue now uses pending_events as a consumed outbox: - send persists the row first (durable once send resolves), then kicks an immediate poll so delivery stays prompt. - A background poller (configurable interval) claims due rows with a single UPDATE ... WHERE id IN (subquery) RETURNING statement — atomic claim with attempts++ and a lease on process_after, per the no-interactive-transactions doctrine in ports/database.ts. A worker that crashes mid-delivery leaves the row reclaimable after the lease. - Delivery reuses deliverEvent unchanged (HMAC signing, terminal-4xx vs retryable classification): success deletes the row, terminal failures settle it as failed, retryable failures reschedule with capped exponential backoff until max_attempts is exhausted. - Startup resumes leftover due rows, so deliveries survive restarts. - cleanupOldEvents (24h) is wired into the poll cadence so settled rows are pruned. The EventQueue port contract and the Cloudflare path are unchanged; InProcessEventQueue is renamed to DurableEventQueue (engine-internal, no external importers). 
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * chore: apply pr-reviewer fixes for #175 * chore: apply pr-reviewer fixes for #175 * chore: apply pr-reviewer fixes for #175 * feat(engine): persist-first webhook outbox for queue-backed adapters Move the pending_events outbox insert from the Node adapter into the engine send path so every adapter gets the same durability guarantee: routes insert the row synchronously in the request path (single cheap INSERT via routes/webhookOutbox.ts), then hand the row id to eventQueue.send in the background. If the queue send is lost (Workers isolate dies after the response, queue outage), the row stays pending and is re-enqueued by the sweep instead of vanishing. - QueuedEvent gains an optional outboxId; adapters that receive it must not insert a second row and the consumer settles the row after delivery. Absent outboxId keeps the legacy contract. - DurableEventQueue.send skips the insert when outboxId is present (no double-insert / double-delivery on the Node path). - New sweepPendingEvents(db, opts) claims due rows via the same atomic claimDueEvents the Node poller uses and returns them with complete/fail/reschedule settle callbacks so a scheduled handler can re-enqueue to an external queue without delivering directly. - Export the outbox primitives (enqueueEvent, claimDueEvents, completeEvent, failEvent, rescheduleEvent, sweepPendingEvents, cleanupOldEvents) from the engine package root for queue consumers. Tests: route-level persist-first ordering incl. sync/async queue-send failures leaving the row sweepable; Node adapter no-double-insert; sweep exactly-once claims under concurrent sweepers, lease expiry, and settle callbacks. 
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * chore: apply pr-reviewer fixes for #178 * chore: apply pr-reviewer fixes for #178 * chore: apply pr-reviewer fixes for #178 --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com> Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>
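The persist-first ordering from #178 can be illustrated with an in-memory stand-in. This sketch is hypothetical throughout: `Outbox`, `QueueSend`, and `sweepPending` are illustrative names, and the real sweepPendingEvents claims due rows atomically with a lease rather than scanning a map.

```typescript
interface OutboxRow {
  id: number;
  payload: string;
  status: "pending" | "done";
}
type QueueSend = (event: { outboxId: number; payload: string }) => void;

// In-memory stand-in for the pending_events table.
class Outbox {
  private rows = new Map<number, OutboxRow>();
  private nextId = 1;

  insert(payload: string): number {
    const id = this.nextId++;
    this.rows.set(id, { id, payload, status: "pending" });
    return id;
  }
  complete(id: number): void {
    const row = this.rows.get(id);
    if (row) row.status = "done";
  }
  pending(): OutboxRow[] {
    return Array.from(this.rows.values()).filter((row) => row.status === "pending");
  }
}

// Persist first, then hand the row id to the queue; a lost send leaves the row pending.
function sendWebhookEvent(outbox: Outbox, queueSend: QueueSend, payload: string): number {
  const outboxId = outbox.insert(payload);
  try {
    queueSend({ outboxId, payload });
  } catch {
    // Swallowed on purpose: the sweep re-enqueues rows that are still pending.
  }
  return outboxId;
}

// Re-enqueue everything still pending; returns how many rows were handed over.
function sweepPending(outbox: Outbox, queueSend: QueueSend): number {
  let requeued = 0;
  for (const row of outbox.pending()) {
    queueSend({ outboxId: row.id, payload: row.payload });
    requeued++;
  }
  return requeued;
}
```

Passing `outboxId` along with the event is what lets the consumer settle the existing row instead of inserting a second one.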

* docs(openapi): document spawn model field
* chore: apply pr-reviewer fixes for #185
* chore: apply pr-reviewer fixes for #185
* chore: apply pr-reviewer fixes for #185
* chore: apply pr-reviewer fixes for #185
* chore: apply pr-reviewer fixes for #185
---------
Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>

* feat(engine,sdk-rust): rename telemetry harness -> origin_actor Replaces the CLI-centric telemetry `harness` with `origin_actor`, a UA-style path `{app}/{type}[/{name}]` (e.g. agent-relay-cli/agent/claude-code, pear/user/send-message-box). Per cloud/plans/origin-actor.md. - engine: `X-Relaycast-Harness`/`?harness=` -> `X-Relaycast-Origin-Actor`/ `?origin_actor=`; `extractHarness`->`extractOriginActor`; the `harness` request-context var + emitted `harness` telemetry property + the realtime `UpgradeRequest` field -> `origin_actor`/`originActor`. Max length 120->128 for the longer path form. The regex already permits `/`. - sdk-rust: `harness.rs`->`origin_actor.rs`; `with_harness`->`with_origin_actor`; header + WS query renamed; field/getter renamed. The DOMAIN harness (harness-emitted session events: `harness.${type}`, `POST /v1/agents/:name/events`, the session_events table) is untouched — only the telemetry/origin-attribution harness is renamed. No back-compat aliases (telemetry data intentionally breaks). origin_surface removal + the JS/python SDK + cloud/relay producer-consumer wiring follow in separate PRs. Engine: tsc clean, origin (15) + lib (20) tests pass. sdk-rust: 77 tests pass, clippy + fmt clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: apply pr-reviewer fixes for #184 * chore: apply pr-reviewer fixes for #184 * chore: apply pr-reviewer fixes for #184 * chore: apply pr-reviewer fixes for #184 * chore: apply pr-reviewer fixes for #184 * chore: apply pr-reviewer fixes for #184 * feat(origin-actor): allow @ in the path for name@version-model The name segment optionally carries harness version + model as {harness}@{version}-{model} (e.g. claude-code@2.3.1-opus4.8) so cloud can derive origin_actor_name/_version/_model. Add '@' to the allowed charset (engine regex + sdk-rust is_allowed) and update the test fixtures from the stale pre-path UA values (claude-code/2.3 …) to real path examples. 
See cloud/plans/origin-actor.md (name sub-format). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>
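To make the `{app}/{type}[/{name}]` path and the `{harness}@{version}-{model}` name sub-format concrete, here is a hedged parsing sketch. It is not the engine's `extractOriginActor`: `parseOriginActor`, the exact character set in the pattern, and the strict two-or-three segment check are assumptions. The commits only confirm the 128-character limit and that `/` and `@` are allowed.

```typescript
const MAX_ORIGIN_ACTOR_LENGTH = 128;
// Assumed charset; the commits only state that `/` and `@` are permitted.
const ORIGIN_ACTOR_PATTERN = /^[A-Za-z0-9._@\/-]+$/;

interface OriginActor {
  app: string;
  type: string;
  name?: string;
  version?: string;
  model?: string;
}

// Returns null for anything that is too long, uses other characters, or has the wrong shape.
function parseOriginActor(raw: string): OriginActor | null {
  if (raw.length === 0 || raw.length > MAX_ORIGIN_ACTOR_LENGTH) return null;
  if (!ORIGIN_ACTOR_PATTERN.test(raw)) return null;

  const [app, type, nameSegment, ...extra] = raw.split("/");
  if (!app || !type || extra.length > 0) return null;
  if (nameSegment === undefined) return { app, type };
  if (nameSegment === "") return null;

  // Optional name sub-format: {harness}@{version}-{model}
  const at = nameSegment.indexOf("@");
  if (at === -1) return { app, type, name: nameSegment };
  const name = nameSegment.slice(0, at);
  const rest = nameSegment.slice(at + 1);
  const dash = rest.indexOf("-");
  if (dash === -1) return { app, type, name, version: rest };
  return { app, type, name, version: rest.slice(0, dash), model: rest.slice(dash + 1) };
}
```

With the example from the commit, `agent-relay-cli/agent/claude-code@2.3.1-opus4.8` splits into app `agent-relay-cli`, type `agent`, name `claude-code`, version `2.3.1`, and model `opus4.8`.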

* feat(sdk): resync missed events on WebSocket reconnect The engine already stamps every delivered event with a monotonic agent_seq, keeps a 500-event resync ring per agent, and falls back to a DB-backed replay for larger gaps — but no shipping client used it, so every disconnect window was silent event loss. WsClient now tracks the highest agent_seq seen (read from the raw frame, since schema parsing strips unknown keys) and, after each reconnect once open handlers have re-subscribed, sends {type: "resync", last_seen_seq, since}. Replayed events flow through the normal dispatch path, deduplicated by stable event id, and the server's resync_ack surfaces as a new "resynced" lifecycle event — exposed as on.resynced(({replayed, gapDetected}) => ...) on RelayCast and AgentClient. First connections behave exactly as before (no seq, no resync frame). @relaycast/types gains the missing wire frame schemas: resync (client), resync_ack (server), and the client-only resynced event. Also adds the package README (install, RelayCast vs AgentClient quickstart, reconnect/resync behavior, self-hosting). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * chore: apply pr-reviewer fixes for #176 * chore: apply pr-reviewer fixes for #176 --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com> Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>
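The client-side bookkeeping described in #176 (track the highest agent_seq, deduplicate replays by event id, send a resync frame only after a reconnect) can be sketched as below. `ResyncTracker` and `Frame` are hypothetical, not the SDK's `WsClient`; the frame's `since` field is left out because the commit does not say how it is computed, and the unbounded id set is a simplification.

```typescript
interface Frame {
  id: string;         // stable event id
  agent_seq?: number; // monotonic per-agent sequence stamped by the engine
}

class ResyncTracker {
  private lastSeenSeq: number | undefined;
  private seenIds = new Set<string>(); // unbounded here; a real client would likely cap it

  // Record the frame; returns true if it is new and should be dispatched.
  accept(frame: Frame): boolean {
    if (
      typeof frame.agent_seq === "number" &&
      (this.lastSeenSeq === undefined || frame.agent_seq > this.lastSeenSeq)
    ) {
      this.lastSeenSeq = frame.agent_seq;
    }
    if (this.seenIds.has(frame.id)) return false; // replayed duplicate
    this.seenIds.add(frame.id);
    return true;
  }

  // Frame to send after a reconnect; null on a first connection (no seq seen yet).
  resyncFrame(): { type: "resync"; last_seen_seq: number } | null {
    if (this.lastSeenSeq === undefined) return null;
    return { type: "resync", last_seen_seq: this.lastSeenSeq };
  }
}
```

Returning null before any event has been seen mirrors the statement that first connections send no resync frame.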

* feat: add Swift SDK and OpenAPI sync guard
* chore: apply pr-reviewer fixes for #173
* chore: apply pr-reviewer fixes for #173
* chore: apply pr-reviewer fixes for #173
* chore: apply pr-reviewer fixes for #173
---------
Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>

* feat(sdk-typescript): rename telemetry harness -> origin_actor Aligns @relaycast/sdk (JS) with the engine 3.0.0 contract (relaycast#184). The JS SDK now sends the `X-Relaycast-Origin-Actor` header / `?origin_actor=` WS query from an `originActor` option (was `harness` / `X-Relaycast-Harness`). - origin.ts: HARNESS_HEADER -> ORIGIN_ACTOR_HEADER; sanitizeHarness -> sanitizeOriginActor; max 120 -> 128 and allow `@` for the {app}/{type}/{name}@{version}-{model} path. - client.ts / ws.ts / relay.ts / agent.ts: the `harness` option + internal `_originHarness` -> `originActor`; WS query `harness` -> `origin_actor`. Breaking: the public `harness` constructor option is now `originActor`; no alias (per cloud/plans/origin-actor.md). The domain session-event types are untouched. 357 tests pass; tsc clean. This is the JS-SDK foundation for the spawned-agent (23%) attribution. Once published, the relay side (telemetry options + broker per-worker env) consumes it to emit agent-relay-cli/agent/<harness>. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: apply pr-reviewer fixes for #187 * chore: apply pr-reviewer fixes for #187 --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>

…se 0) (#191)
* feat(types): add fleet wire protocol schemas
* chore: apply pr-reviewer fixes for #191
* chore: apply pr-reviewer fixes for #191
* fix(types): align fleet wire fields with v1 ruling
* fix(types): tighten fleet action result contract
* Extend fleet wire protocol v3
* chore: apply pr-reviewer fixes for #191
* Add agent.register reply data schema
---------
Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>

… (fleet Phase 1) (#192) * feat(types): add fleet wire protocol schemas * chore: apply pr-reviewer fixes for #191 * chore: apply pr-reviewer fixes for #191 * fix(types): align fleet wire fields with v1 ruling * feat(engine): add fleet node registry and node-native actions * feat(sdk): expose fleet nodes and triggers * chore: apply pr-reviewer fixes for #192 * chore: apply pr-reviewer fixes for #192 * fix(types): tighten fleet action result contract * Extend fleet wire protocol v3 * fix(engine): close round-1 node engine gaps * fix(engine): close round-2 fleet node review items * fix(engine): restore frozen reply contract and close round-3 reviews * fix(engine): close round-4 fleet node review items Round-4 review (88b792b) raised three majors; all addressed: 1. Drained offline-queue invocations stayed `pending` with no `dispatched_at`, so the timeout sweep never covered them. Factor a shared dispatched-state transition and apply it on drain via `markDrainedInvocationDispatched`, stamping `dispatched_at` + `retry_after_at` so dispatch-timeout/reschedule now picks up drained work. Adds a drain -> timeout -> reschedule conformance test. 2. Restore the frozen v3.1 `reply.agent_register.json` fixture byte-identical to ee2c001, re-add it to EXPECTED_FIXTURES, and restore the AgentRegisterReplyDataSchema assertion. `reply.json` stays the separate generic reply. `git diff ee2c001 -- packages/types` is empty; packages/types remains frozen. 3. Align SDK types with runtime: NodeRosterEntry.capabilities is now NodeCapability[] (FleetCapability objects, not string[]), and action invocation/completion output is JsonValue (FleetWireJsonValue: scalars/arrays/null legal, not object-only). Adds enforced compile-level type assertions in src plus expectTypeOf docs; widens the MCP completion call site to the JsonValue output contract. Repo build/test/lint all green. 
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(engine): close round-5 fleet node review items Round-5 review (5f4e5b9) raised one major + three minors; all addressed: 1. [MAJOR] Queued spawn invocations could stick pending forever because drainNodeQueue ran at attachNodeSocket BEFORE node.register set the node online, so reserveNodeCapacity's online gate failed and the frame requeued with no re-drain trigger. Introduce a serialized registry `drainNode` and invoke it after node.register AND node.heartbeat (node is online by then), so queued spawns reserve capacity and dispatch. On a deferred reservation the requeue now arms `retry_after_at` so the dispatch sweeper reschedules as a backstop. Concurrent drains per node are chained so they never overlap (no double reservation). Adds a spawn drain test covering offline-queue -> reconnect (backstop armed) -> register (reserve + dispatch). 2. [minor] emitInvocationCompletionEffects double-delivered action.completed/ failed to an online caller (targeted + online-set). Exclude the caller from the online fanout; add a dedupe unit test. 3. [minor] reconcileInventory could partially apply agent rebinds before a later item's conflict throw turned into an error reply. Pre-validate all conflicts up front (caching existing rows) before mutating anything. 4. [minor] The fixed handler_node spawn path marked reservationHeld without incrementing reserved capacity, so release under-counted and over-committed. Actually reserve on that path and only mark held when the reservation took. Repo build/test/lint all green; packages/types still byte-identical to ee2c001. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

…fleet Phase 2) (#193)
* feat(engine): add bounded durable fleet mailbox
* fix(engine): re-resolve mailbox targets at fanout
* fix(engine): guard ttl expiry and unify replay payload

…ivery guarantees (Phase 6) (#194) * feat(engine): per-workspace fleet rollout flag + migration single-delivery guarantees (Phase 6) Gate the entire fleet node control surface behind a per-workspace `fleet_nodes_enabled` flag (default OFF), so fleet can ship dark and roll out workspace-by-workspace. Legacy per-agent WS delivery is unaffected either way. The flag is checked once at each genuine boundary (no scattered checks): - node control WS (`/v1/node/ws`) rejects with `fleet_nodes_disabled` (404) - node roster routes (`/v1/nodes*`) return a flat 404 via `requireFleetNodes` - declarative trigger evaluation is skipped at the message hook - spawn placement + node-handler dispatch refuse in `invokeAction` (agent-handler actions stay available) Flag source mirrors the workspace-stream pattern: a KV override with a short in-memory cache, defaulting to `EngineConfig.fleetNodesEnabled`. GET/PUT `/v1/workspace/fleet-nodes` toggles the per-workspace override. Tests: - flag OFF -> every node surface inert (roster, spawn, WS gate, triggers) - per-workspace override flips the surface on/off; WS gate follows the flag - migration single-delivery: a legacy self-connected agent is never also delivered via a node when the flag flips mid-stream (exclusive location; a node's `agent.register` for it is rejected `agent_location_conflict`) The conformance harness defaults the flag ON, so existing node integration tests pass unchanged. Full engine suite green (108 tests). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(engine): accept node token via Authorization: Bearer + gate node WS upgrade behind fleet flag (Phase 6) Cross-repo compat fix surfaced by the Phase 6 two-node E2E: a real relay broker could never bring a node online against a self-hosted engine. Root cause: the node-control read-side is the Node HTTP-server `upgrade` handler in `entrypoints/node.ts` (the Hono `/v1/node/ws` route only answers the 426 — Node owns the 101). 
That handler read the token ONLY from the `?token=` query param, but the relay Rust broker's node_control client sends it as `Authorization: Bearer <nt_live_…>`. It also had NO fleet-flag gate for `/v1/node/ws` (only the rk_live workspace-stream path was gated), so the Phase 6 rollout flag did not actually cover the node control surface on the self-host adapter. Fix, both in the upgrade handler: - read the node token from `?token=` query OR `Authorization: Bearer` header (query stays for SDK/Pear; header unblocks the shipped broker — no Rust release needed) - gate the `/v1/node/ws` upgrade behind `isFleetNodesEnabled` (404 when off), mirroring the existing stream gate Also mirrored the dual-transport read in the Hono `/v1/node/ws` route for any adapter that routes upgrades through it. Accepted-stack PRs involved: engine read-side #192, broker send-side #1107. The hosted (Cloudflare DO) equivalent is handled in PR 5. Test: `nodeUpgradeAuth.test.ts` boots the real Node server and asserts a WS client authenticates via BOTH the Bearer header and the query param, that the upgrade is rejected while the workspace flag is off (404), and that a missing/malformed token is rejected (401). Full engine suite green (111). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(engine): reply to agent.register with a broker-shaped `reply` frame (Phase 6 token authority) Third cross-repo compat fix surfaced by the Phase 6 E2E (spawn scenarios): spawn never completes end-to-end against a real broker. The relay broker's node_control client awaits a `reply` frame keyed by the request id — it matches `pending_agent_registrations` by `reply.id` and parses `data` as `{agent_id, token, name}` with `deny_unknown_fields`. The engine instead answered `agent.register` with a bare `{type:'agent.registered', ...}` carrying the full object (incl. invocation_id/session_ref), which the broker never matches → `register_fleet_agent_token` hangs to its 30s timeout → the spawn action fails. 
This blocked every spawn-dependent path (placement completion, mailbox delivery to via-node agents, resume). Reply in the shape the shipped broker consumes; the broker already holds the invocation_id/session_ref it sent, so only the minted identity is echoed. Same root pattern as the node-token transport mismatch (#192 read-side ↔ #1107 broker send-side); no Rust release needed. Updated the one conformance helper that asserted the old frame. Engine suite green (111). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(engine): self-host serve env for fleet flag default + mailbox TTL/depth-cap The `relaycast-engine` serve bin gains optional env tuning so operators (and the Phase 6 fleet E2E) can configure the bounded mailbox and the fleet rollout default without code changes: - RELAYCAST_FLEET_NODES_ENABLED=1 → EngineConfig.fleetNodesEnabled - RELAYCAST_MAILBOX_TTL_MS / RELAYCAST_MAILBOX_DEPTH_CAP → mailbox tuning Unset env leaves the existing defaults untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * chore(engine): renumber fleet mailbox migration 0017→0019 to deduplicate #192 (merged) owns 0017_spawn_reservation_and_retry_state and 0018; the Phase 2 mailbox migration was authored as 0017 on an older base, colliding on the 0017 prefix once #192/#193 landed in main. Renumber to 0019 (after 0018) so the D1 migration sequence is unique and ordered. Pure file rename — no code references the filename, and the migration has not been applied to any environment yet (engine unpublished), so there is no D1 re-apply risk. * docs(changelog): record fleet node/mailbox changes + breaking DeliveryStatus remap Changelogs here are hand-curated (no CI generation), and the fleet stack (#191-#194) was missing from them. 
Add the user-facing entries: - @relaycast/types: new CHANGELOG; document the breaking DeliveryStatus enum remap (accepted/deferred removed, acked/dead_lettered added, delivered re-meaning) with old->new mapping + flag-independent migration note, the new Delivery location/lifecycle fields, and the fleet-wire protocol module. - @relaycast/sdk-typescript: node roster API (nodes.list/get, triggers.list), capability objects, handler/dispatch node fields, JsonValue export, and the breaking action-output widening + delivery status value change. These confirm the next @relaycast/types + sdk-typescript publish is a MAJOR. * chore: apply pr-reviewer fixes for #194 * fix(engine): mailbox cumulative-ack + depth-cap correctness (Codex review) Address P2 findings from Codex review of the fleet mailbox delivery path: 1. ackDelivery (single per-delivery REST ack) advanced the cumulative cursor to the row's own seq, so acking seq 2 while seq 1 is queued moved the cursor past seq 1; deliverPendingToNode (seq > delivery_ack_seq) then skipped it forever on node replay. Make the cursor advance opt-in (ackRows advanceCursorTo?) — single acks no longer advance it; the row's acked status already excludes it from replay. The node delivery.ack {up_to_seq} path still advances cumulatively. Regression test. 2. Migration 0019 seeded delivery_ack_seq = MAX(acked seq), skipping an older still-queued row below a newer acked one. Seed from the contiguous acked prefix (lowest active seq - 1; max seq when nothing is active). 3. Node-replay event classification checked dmType before threadId, so a thread reply inside a DM/group DM would replay as dm.received instead of thread.reply (the live routes/thread.ts routing). Check threadId first to mirror live. 4. Mailbox depth-cap count included expired-but-unswept rows, so an idle recipient kept rejecting new sends as depth_cap after TTL instead of dead-lettering. Exclude expired rows from the count (matches the replay query). Regression test. 
Also classify the operator-only /v1/workspace/fleet-nodes flag route as non-SDK in sdk-openapi-sync (pre-existing #194 gap that turbo test caching had masked). * docs(openapi): require enabled|mode and document 400 on PUT /workspace/fleet-nodes The PUT handler rejects a payload lacking both `enabled` (boolean) and `mode: inherit` with a 400 invalid_request, but the schema marked both optional and documented only a 200. Add anyOf[required: enabled | required: mode] to reflect the runtime constraint, and document the 400 (ErrorResponse). --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>
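The cursor-seeding fix in the Codex review item (seed from the contiguous acked prefix rather than MAX(acked seq)) is easy to get wrong, so a small sketch may help. This is not the migration's SQL: `seedAckCursor` is hypothetical, the set of "active" statuses is an assumption, and returning 0 for an agent with no rows is also assumed.

```typescript
type DeliveryStatus = "queued" | "delivered" | "acked" | "failed" | "dead_lettered";
interface DeliveryRow {
  seq: number;
  status: DeliveryStatus;
}

// Assumption: "active" means not yet settled, i.e. queued or delivered-awaiting-ack.
const ACTIVE_STATUSES: ReadonlySet<DeliveryStatus> = new Set<DeliveryStatus>(["queued", "delivered"]);

// Cursor = lowest active seq - 1; max seq when nothing is active.
// Replay then selects rows with seq > cursor, so no active row is skipped.
function seedAckCursor(rows: DeliveryRow[]): number {
  if (rows.length === 0) return 0; // assumed starting value for an empty mailbox
  const activeSeqs = rows.filter((row) => ACTIVE_STATUSES.has(row.status)).map((row) => row.seq);
  if (activeSeqs.length > 0) return Math.min(...activeSeqs) - 1;
  return Math.max(...rows.map((row) => row.seq));
}
```

With rows seq 1 (queued) and seq 2 (acked), seeding from MAX(acked seq) would give 2 and skip seq 1 on replay; the prefix rule gives 0, so seq 1 is still replayed.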

* refactor(telemetry): drop origin_surface from the origin contract The origin-actor path (`{app}/{type}/{name}`) now expresses what `origin_surface` (`cli|sdk|cloud`) tried to. Keep exactly two non-overlapping origin concepts: `origin_actor` (who/what drove the request) and `origin_client`+`origin_version` (which SDK library sent it). See cloud/plans/origin-actor.md, decision 3. Removes `origin_surface` from the wire header `X-Relaycast-Origin-Surface`, the `?origin_surface=` WS query param, and the public SDK option — across @relaycast/types, engine, mcp, and the rust/python/typescript SDKs. No back-compat alias; this is a breaking change to TelemetryOrigin (major bump). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: apply pr-reviewer fixes for #188 * chore: apply pr-reviewer fixes for #188 * chore: apply pr-reviewer fixes for #188 * chore: apply pr-reviewer fixes for #188 * chore: apply pr-reviewer fixes for #188 * chore: apply pr-reviewer fixes for #188 --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>

…eflake depth-cap test (#195) * fix(engine): unbreak main CI — drop node-WS origin_surface + deflake depth-cap test main has been failing 'Lint, Build & Test' (build step) for every PR: 1. BUILD BREAK: #188 removed origin_surface from the telemetry/origin contract, but #192's node-WS upgrade handler (engine.ts) and NodeUpgradeArgs.origin (ports/realtime.ts) still referenced it — the two PRs merged without reconciling. tsc fails: 'origin_surface does not exist'. Drop the dead surface field from both, matching the agent-WS upgrade path #188 already fixed. No consumer reads node origin.surface. 2. FLAKY TEST: the depth-cap-excludes-expired conformance test relied on a 1.2s wall-clock wait racing the second-granular unixepoch() TTL boundary; it passed in isolation but failed under full-suite load. Age the row deterministically via the DB instead of sleeping (also ~1.2s faster). * chore: apply pr-reviewer fixes for #195 * chore: apply pr-reviewer fixes for #195 --------- Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>

…llow-ups (#189)

* feat(engine): retention pruning with per-workspace TTLs and outbox follow-ups

Add pruneExpired: bounded-batch deletion of expired messages (leaf-first across thread parents), settled deliveries, message logs, and orphaned read receipts, with per-workspace TTLs in a new nullable workspaces.retention column. Message retention is opt-in; settled deliveries and message logs default to 90 days as operational logs. Runs on the Node adapter's outbox cleanup cadence and is exported for queue-backed scheduled handlers.

cleanupOldEvents now settles exhausted pending_events rows as failed so they become prunable instead of lingering unclaimable, and sendWebhookEvent skips the outbox insert and queue send entirely (with a per-request memoized existence probe) for workspaces with no active event subscription.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* chore(engine): adapt retention to merged fleet engine (rebase onto main)

Rebased feature/engine-retention onto main, which now has the fleet stack (#191-#194). Three adaptations:

1. Renumber migration 0016_workspace_retention -> 0020_workspace_retention; #192 took 0016 (fleet_nodes) and #193 took 0019 (fleet_mailbox).
2. Delivery status model: #193 reworked the enum, so SETTLED_DELIVERY_STATUSES is now ['acked','failed','dead_lettered'] (was ['delivered','failed']). 'delivered' is now IN-FLIGHT (sent, awaiting cumulative ack), so retention must never prune it; 'acked' is terminal success. Updated tests to the new status names.
3. insertDelivery test helper assigns a distinct seq per agent — the mailbox migration added UNIQUE(workspace_id, agent_id, seq), so same-agent rows can no longer share the default seq 0.

Note: turbo build/tsc is currently red on main itself (engine.ts:212 uses originInfo.origin_surface, which #188 removed from the telemetry contract) — a pre-existing #188/#192 collision unrelated to this PR. Engine vitest is green (132/132).

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
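
To make the settled-versus-in-flight rule in adaptation 2 concrete, here is a minimal sketch of a prunability check. The status names, the settled set, and the 90-day default come from the commit message; the row shape, the `settledAtSec` field, and the `isPrunable` helper are hypothetical and are not the engine's actual pruneExpired code.

```typescript
// Status names as listed in the commit message (#193 enum).
type DeliveryStatus = "queued" | "delivered" | "acked" | "failed" | "dead_lettered";

// Settled set quoted from the commit message. 'delivered' is in-flight, so it is absent.
const SETTLED_DELIVERY_STATUSES: readonly DeliveryStatus[] = ["acked", "failed", "dead_lettered"];

// Hypothetical row shape: the real table columns are not shown in this PR.
interface DeliveryRow {
  status: DeliveryStatus;
  settledAtSec: number; // assumed: unix seconds at which the row reached its final status
}

// 90 days, the stated default for settled deliveries.
const DEFAULT_DELIVERY_TTL_SEC = 90 * 24 * 60 * 60;

// A row is prunable only if it is settled AND older than the TTL.
function isPrunable(
  row: DeliveryRow,
  nowSec: number,
  ttlSec: number = DEFAULT_DELIVERY_TTL_SEC,
): boolean {
  const settled = SETTLED_DELIVERY_STATUSES.indexOf(row.status) !== -1;
  return settled && nowSec - row.settledAtSec >= ttlSec;
}
```

Under these assumptions a 'delivered' row is never pruned, however old it is, because the status check fails before the age check is considered.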

) * Add Python SDK parity endpoints

* Fix Python SDK parity: delivery status enum, channel topic/invite paths

Rebased #190 onto main (resolving the #188 origin-contract changes: origin_surface is gone; only origin_actor + origin_client/origin_version remain — confirmed no origin_surface references survive).

Correctness fixes layered on top of #190's parity additions:

- DeliveryStatus: updated the stale Literal["accepted","delivered", "deferred","failed"] to the canonical #193 enum Literal["queued","delivered","acked","failed","dead_lettered"] (packages/types/src/delivery.ts). "delivered" now means in-flight awaiting ack; "acked" is terminal success; accepted/deferred removed.
- Delivery model: aligned with the canonical DeliverySchema by adding the missing fields seq, location_type, location_node_id, expires_at, delivered_at, acked_at, dead_lettered_at to match the TS SDK surface.
- channels.set_topic: corrected the route from PATCH /v1/channels/{name} to PATCH /v1/channels/{name}/topic to match the TS setTopic() and the dedicated openapi endpoint (it was colliding with channels.update).
- channels.invite: corrected the request body field from {"agent": ...} to {"agent_name": ...} to match InviteRequestSchema / the TS SDK wire shape (Python sends keys verbatim with no camel->snake conversion).
- Updated test_channels_set_topic to assert the corrected /topic route.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(sdk-swift): bring Swift SDK to 100% parity with TypeScript SDK

Add the relay-level surfaces that were missing from the Swift SDK:

- nodes namespace: list (GET /v1/nodes, capability/name filters), get (GET /v1/nodes/{name}) with NodeRosterEntry + NodeCapability models
- triggers namespace: create/list/get/update/delete full lifecycle (POST/GET/PATCH/DELETE /v1/triggers[/{id}]) with Trigger, CreateTriggerRequest, UpdateTriggerRequest models
- activity feed: activity(limit) -> GET /v1/activity
- workspace-level DM queries: allDMConversations (GET /v1/dm/conversations/all) and dmMessages (GET /v1/dm/conversations/{id}/messages)

Fix the stale DeliveryStatus enum to the current statuses (queued|delivered|acked|failed|dead_lettered), replacing the old accepted/deferred values. All routes verified present in openapi.yaml. Adds tests for nodes, triggers, workspace DM queries, activity, and the delivery-status enum.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(sdk-rust): full Rust↔TypeScript SDK parity

Bring the Rust SDK to 100% feature parity with the TypeScript reference SDK. Every new route is documented in openapi.yaml.

New RelayCast surfaces:

- Workspace bootstrap: lookup_workspace (GET /v1/workspaces/by-name/{name})
- A2A: register_a2a, list_a2a_agents, remove_a2a_agent, get_a2a_agent_card
- Routing: route, route_feedback, get_routing_config, update_routing_config
- Directory: search_directory, publish_to_directory, list_directory, get_directory_agent, update_directory_agent, delete_directory_agent, list_directory_ratings, rate_directory_agent
- Skills: import_skills, search_skills
- Fleet nodes: list_nodes, get_node
- Triggers: create_trigger, list_triggers, get_trigger, update_trigger, delete_trigger
- Certification: certify, get_certification, certification_badge_url, monitor_certification
- Console: console_messages, console_stats (ConsoleOverview), console_agents, console_costs

New AgentClient surfaces:

- channels mute_channel / unmute_channel
- invite_to_channel fixed to send documented `agent_name` body

Models: added serde structs for A2A cards/records, directory agents/skills/ratings, routing config/weights, skill search results, node roster with capability objects, triggers, certification runs, and console stats — all snake_case to match the wire contract. DeliveryStatus enum updated to the canonical lifecycle (queued|delivered|acked|failed|dead_lettered); tests updated to match. Adds parity tests for every new surface.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: apply pr-reviewer fixes for #190

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>

The node.heartbeat schema was .strict() and accepted only load/active_agents/handlers_live. The relay broker wants to carry the node roster (name/node_id/capabilities/max_agents/version) on the steady-state heartbeat so the engine can keep a node's descriptor fresh between — or in the absence of — a fresh node.register (e.g. after an engine restart where the broker keeps heartbeating an already-registered node). Extend FleetNodeHeartbeatMessageSchema with optional roster fields and have heartbeatNode() adopt them: refresh name/capabilities/max_agents/ version on the node row and register newly-advertised capability actions via ensureCapabilityActions. A minimal heartbeat (no roster) remains valid and preserves the existing roster. last_heartbeat_at is NOT part of the wire: receipt time is stamped server-side here as the single source of truth for liveness; the broker does not send it. node_id, when present, is validated against the authenticated node token (node_id_mismatch) like node.register. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
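
The heartbeat behaviour described above can be sketched as a pure merge function. This is an illustration under stated assumptions, not the engine's heartbeatNode(): the `NodeRow` and `HeartbeatMessage` shapes, the `applyHeartbeat` name, and the use of plain strings for capabilities are all hypothetical, and capability-action registration (ensureCapabilityActions) is left out.

```typescript
// Hypothetical stored node row. Field names are assumed for illustration.
interface NodeRow {
  nodeId: string;
  name: string;
  capabilities: string[]; // assumed to be plain strings here
  maxAgents: number;
  version: string;
  lastHeartbeatAt: number; // stamped server-side, never taken from the wire
}

// Optional roster fields named in the commit message. All may be absent.
interface HeartbeatMessage {
  node_id?: string;
  name?: string;
  capabilities?: string[];
  max_agents?: number;
  version?: string;
}

function applyHeartbeat(
  row: NodeRow,
  msg: HeartbeatMessage,
  authenticatedNodeId: string,
  nowMs: number,
): NodeRow {
  // node_id, when present, must match the authenticated node token.
  if (msg.node_id !== undefined && msg.node_id !== authenticatedNodeId) {
    throw new Error("node_id_mismatch");
  }
  return {
    nodeId: row.nodeId,
    // Adopt each roster field if sent; otherwise keep the stored value.
    name: msg.name ?? row.name,
    capabilities: msg.capabilities ?? row.capabilities,
    maxAgents: msg.max_agents ?? row.maxAgents,
    version: msg.version ?? row.version,
    // Receipt time is the single source of truth for liveness.
    lastHeartbeatAt: nowMs,
  };
}
```

A minimal heartbeat (`{}`) changes only `lastHeartbeatAt`, which matches the statement that a heartbeat without roster fields preserves the existing roster.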

…nt helpers (#200)

A client that joins a workspace by key holds no workspace id locally, and the agent-registration response did not include one — so a consumer (e.g. the broker's Rust SDK) had nothing to record and fell back to an "unknown workspace" placeholder, enrolling the agent (and any fleet node it spawned) under a phantom workspace invisible to the real one. Include `workspace_id` in the agent-registration response, and in the agent-detail response used by the strict-name reclaim path. Surface it on the Rust SDK's `CreateAgentResponse` and `Agent` as an optional, serde-default field so older engines that omit it still deserialize. Co-authored-by: Barry Cape <barryonthecape@icloud.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The "Commit and tag" step in publish-npm.yml and publish-rust.yml pushed the version-bump commit with `git push origin HEAD:main` off a stale checkout. When main advanced during the multi-minute build (e.g. a concurrent cross-kind publish), the push was rejected as non-fast-forward, failing the workflow after packages had already published and skipping the git tag + GitHub Release.

Before pushing, fetch + rebase onto origin/main and retry with backoff (5 attempts). The release commit only touches version metadata, so it rebases cleanly onto another release's commit. Tag + Release still run after a successful push.

Carries over only the workflow changes from #182; the 2.6.0 version reconciliation is obsolete now that main is at 4.x.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_013c2kNLweXdjA5dMpmMRPBM
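
The control flow of the fix can be sketched as follows. The actual change lives in the shell steps of the two workflow files, which are not reproduced on this page, so this TypeScript version is only an illustration: the `PushOps` hooks, the `pushWithRebaseRetry` name, the 2-second base delay, and the doubling schedule are assumptions. Only the fetch-rebase-push loop and the 5-attempt limit come from the commit message.

```typescript
// Hypothetical hooks. In a real workflow these would shell out to git.
interface PushOps {
  fetchAndRebase(): void; // e.g. `git fetch origin main` then `git rebase origin/main`
  push(): boolean; // e.g. `git push origin HEAD:main`; returns true on success
  wait(ms: number): void; // blocking sleep between attempts
}

// Returns the attempt number that succeeded; throws once all attempts are used.
function pushWithRebaseRetry(
  ops: PushOps,
  maxAttempts: number = 5, // from the commit message
  baseDelayMs: number = 2000, // assumed; the real backoff values are not shown
): number {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Rebase onto the latest main first so the push is a fast-forward.
    ops.fetchAndRebase();
    if (ops.push()) {
      return attempt;
    }
    // main moved again between rebase and push: back off, then retry.
    if (attempt < maxAttempts) {
      ops.wait(baseDelayMs * Math.pow(2, attempt - 1));
    }
  }
  throw new Error("push rejected after " + maxAttempts + " attempts");
}
```

Re-running the fetch and rebase inside the loop is the important part. Retrying the push alone would keep failing, because the local commit would still sit on a stale base.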

gemini-code-assist · 2026-06-19T16:41:32Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

coderabbitai · 2026-06-19T16:41:38Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d6d1710edb


chatgpt-codex-connector · 2026-06-19T16:48:55Z

+        .where(and(
+          eq(deliveries.workspaceId, workspaceId),
+          inArray(deliveries.id, ids),
+          notInArray(deliveries.status, ['acked', 'dead_lettered']),


Keep failed deliveries terminal when acking

When an agent has already reported a delivery via POST /deliveries/:id/fail (status = 'failed'), this guard still lets a later ack (including the node cumulative ack path through ackRows) update the same row to acked because failed is not excluded. That overwrites the terminal failure state and can emit delivered/read side effects for a delivery that was explicitly failed; include failed in the terminal guard.


chatgpt-codex-connector · 2026-06-19T16:48:55Z

    .where(and(
      eq(deliveries.id, deliveryId),
-      ne(deliveries.status, 'delivered'),
+      notInArray(deliveries.status, ['acked', 'dead_lettered']),


Keep failed deliveries terminal when deferring

If a delivery is already failed, POST /deliveries/:id/defer passes this predicate and rewrites it back to queued, putting a terminal failure back into the active replay queue with a new available_at. The fail path treats failed as settled, so the defer early return/update guard should also exclude failed.
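
Both review comments above come down to the same predicate. The sketch below models the `notInArray` guard as a plain function to show why `failed` slips through; it does not use the project's query builder. `GUARD_IN_DIFF`, `TERMINAL_STATUSES`, and `mayUpdate` are hypothetical names, and the status list is the enum quoted elsewhere in this PR.

```typescript
type DeliveryStatus = "queued" | "delivered" | "acked" | "failed" | "dead_lettered";

// Exclusion list as it appears in the reviewed diff: 'failed' is missing.
const GUARD_IN_DIFF: readonly DeliveryStatus[] = ["acked", "dead_lettered"];

// Exclusion list the review suggests: every settled status is excluded.
const TERMINAL_STATUSES: readonly DeliveryStatus[] = ["acked", "failed", "dead_lettered"];

// Mirrors `notInArray(deliveries.status, excluded)`: the UPDATE only touches
// rows whose current status is NOT in the excluded list.
function mayUpdate(current: DeliveryStatus, excluded: readonly DeliveryStatus[]): boolean {
  return excluded.indexOf(current) === -1;
}

// With the guard from the diff, a failed row can still be acked or deferred.
const failedSlipsThrough = mayUpdate("failed", GUARD_IN_DIFF);

// With 'failed' added, the row stays terminal while in-flight rows still update.
const failedStaysTerminal = !mayUpdate("failed", TERMINAL_STATUSES);
const deliveredStillAckable = mayUpdate("delivered", TERMINAL_STATUSES);
```

Using one shared terminal list for both the ack and defer paths would keep the two guards from drifting apart. That is a design suggestion, not something the review itself proposes.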


willwashburn and others added 30 commits June 10, 2026 06:56

chore(sdk-rust): v2.5.0

9e2af21

chore(release): v3.0.0

c635dc6

chore(sdk-rust): v3.0.0

311149a

chore(release): v3.1.0

51b391e

chore(release): v3.1.1

ee64c94

feat(engine): bounded-durable mailbox, location routing, §5 cleanup (…

31caef8

…fleet Phase 2) (#193) * feat(engine): add bounded durable fleet mailbox * fix(engine): re-resolve mailbox targets at fanout * fix(engine): guard ttl expiry and unify replay payload

chore(release): v4.0.0

6d0f0fb

chore(sdk-rust): v4.0.0

b248bfa

chore(release): v4.1.0

afe5516

Expose fleet node workspace flag in SDK (#199)

373beb2

chore(release): v4.1.1

bb01110

refactor(engine): consolidate duplicated error/serialization/attachme…

713a8e7

…nt helpers (#200)

chore(release): v4.1.2

8d87042

barryollama and others added 3 commits June 19, 2026 11:26

chore(sdk-rust): v4.1.0

75097b7

chatgpt-codex-connector Bot reviewed Jun 19, 2026


willwashburn closed this Jun 19, 2026

willwashburn deleted the claude/adoring-hopper-jumdlq branch June 19, 2026 23:47

willwashburn wants to merge 33 commits into fix/release-commit-back-race from claude/adoring-hopper-jumdlq
