
Update #182 to current main; keep only the workflow rebase-and-retry fix #202

Closed
willwashburn wants to merge 33 commits into fix/release-commit-back-race from claude/adoring-hopper-jumdlq



willwashburn commented Jun 19, 2026


Merge this into fix/release-commit-back-race to make #182 mergeable again.

What this does

main has advanced from 2.x to 4.1.x since #182 was opened, leaving it conflicted (dirty). #182 carried two kinds of changes:

  1. Workflow rebase-and-retry (publish-npm.yml, publish-rust.yml) — still relevant; main's publish workflows still use the racy git push origin HEAD:main.
  2. Version reconciliation (2.5.1 → 2.6.0) — now obsolete since main is at 4.x.

This branch is current main + only the workflow changes. After merging it into fix/release-commit-back-race, that branch contains everything on main plus the workflow fix, so #182's net diff against main becomes just the two workflow files and it stops conflicting.

Note on history

Because this branch is built fresh on top of current main (the original fix/release-commit-back-race couldn't be pushed to from this environment — 403), the file list below is large. The effective change relative to main is only the two workflow files; everything else is main catching up on fix/release-commit-back-race.

🤖 Generated with Claude Code



willwashburn and others added 30 commits June 10, 2026 06:56
…tion capability (#174)

* feat(engine): atomic multi-statement write paths via optional transaction capability

The database port gains an optional TransactionCapability that adapters
attach when their driver supports interactive transactions, plus a
runAtomic(db, fn) helper that uses it when present and falls back to
plain sequential statements otherwise (unchanged D1 behavior).
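
A minimal sketch of the capability check described above, assuming hypothetical shapes for the database handle. Only the names `TransactionCapability` and `runAtomic(db, fn)` come from the commit message; the `Database` interface, its `execute` method, and the `transaction` field are illustrative assumptions.

```typescript
// Assumed shape: the commit names TransactionCapability but not its members.
interface TransactionCapability {
  withTransaction<T>(fn: () => Promise<T>): Promise<T>;
}

// Hypothetical database handle used only for this sketch.
interface Database {
  execute(sql: string): Promise<void>;
  transaction?: TransactionCapability; // attached only by capable adapters
}

async function runAtomic<T>(
  db: Database,
  fn: (db: Database) => Promise<T>,
): Promise<T> {
  if (db.transaction) {
    // The driver supports interactive transactions: run all-or-nothing.
    return db.transaction.withTransaction(() => fn(db));
  }
  // No capability attached: plain sequential statements, as before.
  return fn(db);
}
```

Callers do not branch on the adapter; the same write function runs under a transaction when one is available and sequentially otherwise.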

The Node better-sqlite3 adapter implements the capability with manual
BEGIN IMMEDIATE / COMMIT / ROLLBACK, serialized through a promise queue
so concurrent requests on the shared connection cannot interleave with
an open transaction.
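
The serialization idea can be sketched as a promise chain in front of the connection. This is an illustration under assumptions, not the adapter's code: `Connection.exec` and the class name `SerializedTransactions` are hypothetical stand-ins.

```typescript
// Hypothetical stand-in for a synchronous SQLite connection.
interface Connection {
  exec(sql: string): void;
}

class SerializedTransactions {
  private readonly conn: Connection;
  // Tail of the queue; it never rejects, so later work always gets to run.
  private tail: Promise<unknown> = Promise.resolve();

  constructor(conn: Connection) {
    this.conn = conn;
  }

  withTransaction<T>(fn: () => Promise<T>): Promise<T> {
    const run = async (): Promise<T> => {
      this.conn.exec("BEGIN IMMEDIATE");
      try {
        const result = await fn();
        this.conn.exec("COMMIT");
        return result;
      } catch (err) {
        this.conn.exec("ROLLBACK");
        throw err;
      }
    };
    // Start only after every previously queued transaction has settled.
    const next = this.tail.then(run);
    // Swallow the rejection on the queue itself; the caller still sees it.
    this.tail = next.catch(() => undefined);
    return next;
  }
}
```

Because each transaction starts only after the previous one has committed or rolled back, two concurrent requests on the shared connection never issue statements inside each other's `BEGIN IMMEDIATE` window.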

Wrapped write paths (DB writes only — realtime/webhook fanout stays in
routes, outside the transaction):
- channel message send (message + attachments + deliveries + message_log)
- DM send (message + attachments + delivery + message_log)
- group DM send (message + attachments + deliveries)
- thread reply (reply + deliveries)
- markRead (read receipt + delivery transition + lastReadId)

On self-host, a failure mid-send no longer leaves a message row with no
delivery rows (silent durable-delivery loss).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* chore: apply pr-reviewer fixes for #174

* chore: apply pr-reviewer fixes for #174

* chore: apply pr-reviewer fixes for #174

* chore: apply pr-reviewer fixes for #174

* fix(engine): revalidate group DM sends atomically

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>
* feat(engine): atomic multi-statement write paths via optional transaction capability

The database port gains an optional TransactionCapability that adapters
attach when their driver supports interactive transactions, plus a
runAtomic(db, fn) helper that uses it when present and falls back to
plain sequential statements otherwise (unchanged D1 behavior).

The Node better-sqlite3 adapter implements the capability with manual
BEGIN IMMEDIATE / COMMIT / ROLLBACK, serialized through a promise queue
so concurrent requests on the shared connection cannot interleave with
an open transaction.

Wrapped write paths (DB writes only — realtime/webhook fanout stays in
routes, outside the transaction):
- channel message send (message + attachments + deliveries + message_log)
- DM send (message + attachments + delivery + message_log)
- group DM send (message + attachments + deliveries)
- thread reply (reply + deliveries)
- markRead (read receipt + delivery transition + lastReadId)

On self-host, a failure mid-send no longer leaves a message row with no
delivery rows (silent durable-delivery loss).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* chore: apply pr-reviewer fixes for #174

* chore: apply pr-reviewer fixes for #174

* chore: apply pr-reviewer fixes for #174

* chore: apply pr-reviewer fixes for #174

* feat(engine): atomic write batches for D1/hosted handles

Write paths gained transactional atomicity on Node in the
transactional-write-paths change, but the hosted Cloudflare deployment
runs on D1, which has no interactive transactions — a crash between the
message insert and the deliveries insert still left a message with no
delivery rows. D1 does execute db.batch([...]) atomically, and drizzle's
DrizzleD1Database exposes batch() natively, so the hosted handle can get
all-or-nothing writes with zero cloud-side changes.

- ports/database.ts: add AtomicWrite (a built-but-unexecuted drizzle
  statement) and BatchCapability (D1-style atomic batch), and replace
  runAtomic(fn) with runAtomicWrites(db, statements). Resolution order:
  withTransaction (Node) -> batch (D1, detected structurally since only
  atomic-batch drivers expose the method; better-sqlite3's drizzle
  instance has no batch member) -> sequential (bare handles, historical
  behavior).
- The five multi-statement write paths (channel send, DM send, group DM
  send, thread reply, markRead) now do all reads up front and hand
  runAtomicWrites a pure statement list, so the same list runs under a
  transaction, one atomic batch, or sequentially. No write depends on a
  prior write's DB-returned value (IDs are app-generated snowflakes);
  .returning() rows are recovered from the per-statement results.
- message.ts reads attachment details directly from files by id before
  the writes (the junction rows don't exist yet mid-batch); dm.ts builds
  the message+attachment inserts via buildDmMessageWrites; console.ts
  gains buildMessageLogWrite so the log insert can join the batch.
- Tests: fake D1-style batch handle (records SQL, executes
  all-or-nothing) asserting each path issues exactly one batch with the
  expected statement kinds, batch failure leaves no orphan rows, and
  bare handles still run sequentially. Failure injection now fires at
  statement execution (mid-atomic-unit) rather than at build time.
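
The resolution order in the first bullet can be sketched as follows. The names `AtomicWrite`, `runAtomicWrites`, `withTransaction`, and `batch` appear in the commit message; the `run()` member and the `DbHandle` interface are assumptions made so the example is self-contained.

```typescript
// Assumed shape for a built-but-unexecuted statement.
type AtomicWrite = { run(): Promise<unknown> };

// Hypothetical handle: either capability may be absent.
interface DbHandle {
  withTransaction?<T>(fn: () => Promise<T>): Promise<T>;
  batch?(statements: AtomicWrite[]): Promise<unknown[]>;
}

async function runAtomicWrites(
  db: DbHandle,
  statements: AtomicWrite[],
): Promise<unknown[]> {
  const sequential = async (): Promise<unknown[]> => {
    const results: unknown[] = [];
    for (const statement of statements) results.push(await statement.run());
    return results;
  };
  // 1. Interactive transaction (Node).
  if (db.withTransaction) return db.withTransaction(sequential);
  // 2. Atomic batch, detected structurally (D1-style).
  if (typeof db.batch === "function") return db.batch(statements);
  // 3. Bare handle: historical sequential behavior.
  return sequential();
}
```

Since the statement list is pure (no write reads a prior write's DB-returned value), the same list is valid under all three execution modes.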

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* chore: apply pr-reviewer fixes for #179

* chore: apply pr-reviewer fixes for #179

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>
* feat(engine): durable webhook outbox for the Node adapter

The Node self-host adapter delivered webhooks fire-and-forget: an
in-process send with 3 inline retries, and any failure or restart lost
the event. The hosted Cloudflare path gets real durability from CF
Queues + DLQ; self-hosters got message loss. The pending_events table
existed in the schema but nothing consumed it.

The Node adapter's event queue now uses pending_events as a consumed
outbox:

- send persists the row first (durable once send resolves), then kicks
  an immediate poll so delivery stays prompt.
- A background poller (configurable interval) claims due rows with a
  single UPDATE ... WHERE id IN (subquery) RETURNING statement —
  atomic claim with attempts++ and a lease on process_after, per the
  no-interactive-transactions doctrine in ports/database.ts. A worker
  that crashes mid-delivery leaves the row reclaimable after the lease.
- Delivery reuses deliverEvent unchanged (HMAC signing, terminal-4xx
  vs retryable classification): success deletes the row, terminal
  failures settle it as failed, retryable failures reschedule with
  capped exponential backoff until max_attempts is exhausted.
- Startup resumes leftover due rows, so deliveries survive restarts.
- cleanupOldEvents (24h) is wired into the poll cadence so settled
  rows are pruned.
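
The settle rules in the delivery bullet can be sketched as a pure decision function. The base delay and cap below are assumed values (the commit only says the backoff is capped and exponential), and `settle` and `backoffMs` are hypothetical helper names.

```typescript
type DeliveryOutcome = "success" | "terminal" | "retryable";

type Settlement =
  | { action: "delete" }
  | { action: "fail" }
  | { action: "reschedule"; delayMs: number };

// Assumed tuning values, chosen only for illustration.
const BASE_DELAY_MS = 1_000;
const MAX_DELAY_MS = 60_000;

// attempts is 1-based: the first retry waits BASE_DELAY_MS.
function backoffMs(attempts: number): number {
  return Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** (attempts - 1));
}

function settle(
  outcome: DeliveryOutcome,
  attempts: number,
  maxAttempts: number,
): Settlement {
  if (outcome === "success") return { action: "delete" };
  // Terminal 4xx, or a retryable failure with no attempts left.
  if (outcome === "terminal" || attempts >= maxAttempts) return { action: "fail" };
  return { action: "reschedule", delayMs: backoffMs(attempts) };
}
```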

The EventQueue port contract and the Cloudflare path are unchanged;
InProcessEventQueue is renamed to DurableEventQueue (engine-internal,
no external importers).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* chore: apply pr-reviewer fixes for #175

* chore: apply pr-reviewer fixes for #175

* chore: apply pr-reviewer fixes for #175

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>
…178)

* feat(engine): durable webhook outbox for the Node adapter

The Node self-host adapter delivered webhooks fire-and-forget: an
in-process send with 3 inline retries, and any failure or restart lost
the event. The hosted Cloudflare path gets real durability from CF
Queues + DLQ; self-hosters got message loss. The pending_events table
existed in the schema but nothing consumed it.

The Node adapter's event queue now uses pending_events as a consumed
outbox:

- send persists the row first (durable once send resolves), then kicks
  an immediate poll so delivery stays prompt.
- A background poller (configurable interval) claims due rows with a
  single UPDATE ... WHERE id IN (subquery) RETURNING statement —
  atomic claim with attempts++ and a lease on process_after, per the
  no-interactive-transactions doctrine in ports/database.ts. A worker
  that crashes mid-delivery leaves the row reclaimable after the lease.
- Delivery reuses deliverEvent unchanged (HMAC signing, terminal-4xx
  vs retryable classification): success deletes the row, terminal
  failures settle it as failed, retryable failures reschedule with
  capped exponential backoff until max_attempts is exhausted.
- Startup resumes leftover due rows, so deliveries survive restarts.
- cleanupOldEvents (24h) is wired into the poll cadence so settled
  rows are pruned.

The EventQueue port contract and the Cloudflare path are unchanged;
InProcessEventQueue is renamed to DurableEventQueue (engine-internal,
no external importers).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* chore: apply pr-reviewer fixes for #175

* chore: apply pr-reviewer fixes for #175

* chore: apply pr-reviewer fixes for #175

* feat(engine): persist-first webhook outbox for queue-backed adapters

Move the pending_events outbox insert from the Node adapter into the
engine send path so every adapter gets the same durability guarantee:
routes insert the row synchronously in the request path (single cheap
INSERT via routes/webhookOutbox.ts), then hand the row id to
eventQueue.send in the background. If the queue send is lost (Workers
isolate dies after the response, queue outage), the row stays pending
and is re-enqueued by the sweep instead of vanishing.

- QueuedEvent gains an optional outboxId; adapters that receive it must
  not insert a second row and the consumer settles the row after
  delivery. Absent outboxId keeps the legacy contract.
- DurableEventQueue.send skips the insert when outboxId is present
  (no double-insert / double-delivery on the Node path).
- New sweepPendingEvents(db, opts) claims due rows via the same atomic
  claimDueEvents the Node poller uses and returns them with
  complete/fail/reschedule settle callbacks so a scheduled handler can
  re-enqueue to an external queue without delivering directly.
- Export the outbox primitives (enqueueEvent, claimDueEvents,
  completeEvent, failEvent, rescheduleEvent, sweepPendingEvents,
  cleanupOldEvents) from the engine package root for queue consumers.
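
A rough sketch of the persist-first ordering, assuming simplified interfaces. `QueuedEvent` and `outboxId` are named in the commit message; `Outbox`, `EventQueue.send`'s exact signature, and `sendWebhookPersistFirst` are hypothetical.

```typescript
interface QueuedEvent {
  type: string;
  payload: unknown;
  outboxId?: string;
}

// Hypothetical ports used only for this sketch.
interface Outbox {
  insert(event: QueuedEvent): Promise<string>; // resolves to the row id
}
interface EventQueue {
  send(event: QueuedEvent): Promise<void>;
}

async function sendWebhookPersistFirst(
  outbox: Outbox,
  queue: EventQueue,
  event: QueuedEvent,
): Promise<string> {
  // 1. Persist synchronously in the request path.
  const outboxId = await outbox.insert(event);
  // 2. Hand off in the background. Sync throws and async rejections are
  //    both swallowed: the row stays pending and a sweep can re-enqueue it.
  void Promise.resolve()
    .then(() => queue.send({ ...event, outboxId }))
    .catch(() => undefined);
  return outboxId;
}
```

The point of the ordering is that a lost queue send degrades to delayed delivery through the sweep rather than a lost event.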

Tests: route-level persist-first ordering incl. sync/async queue-send
failures leaving the row sweepable; Node adapter no-double-insert;
sweep exactly-once claims under concurrent sweepers, lease expiry, and
settle callbacks.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* chore: apply pr-reviewer fixes for #178

* chore: apply pr-reviewer fixes for #178

* chore: apply pr-reviewer fixes for #178

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>
* docs(openapi): document spawn model field

* chore: apply pr-reviewer fixes for #185

* chore: apply pr-reviewer fixes for #185

* chore: apply pr-reviewer fixes for #185

* chore: apply pr-reviewer fixes for #185

* chore: apply pr-reviewer fixes for #185

---------

Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>
* feat(engine,sdk-rust): rename telemetry harness -> origin_actor

Replaces the CLI-centric telemetry `harness` with `origin_actor`, a UA-style
path `{app}/{type}[/{name}]` (e.g. agent-relay-cli/agent/claude-code,
pear/user/send-message-box). Per cloud/plans/origin-actor.md.

- engine: `X-Relaycast-Harness`/`?harness=` -> `X-Relaycast-Origin-Actor`/
  `?origin_actor=`; `extractHarness`->`extractOriginActor`; the `harness`
  request-context var + emitted `harness` telemetry property + the realtime
  `UpgradeRequest` field -> `origin_actor`/`originActor`. Max length 120->128
  for the longer path form. The regex already permits `/`.
- sdk-rust: `harness.rs`->`origin_actor.rs`; `with_harness`->`with_origin_actor`;
  header + WS query renamed; field/getter renamed.

The DOMAIN harness (harness-emitted session events: `harness.${type}`,
`POST /v1/agents/:name/events`, the session_events table) is untouched — only
the telemetry/origin-attribution harness is renamed.

No back-compat aliases (telemetry data intentionally breaks). origin_surface
removal + the JS/python SDK + cloud/relay producer-consumer wiring follow in
separate PRs.

Engine: tsc clean, origin (15) + lib (20) tests pass.
sdk-rust: 77 tests pass, clippy + fmt clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: apply pr-reviewer fixes for #184

* chore: apply pr-reviewer fixes for #184

* chore: apply pr-reviewer fixes for #184

* chore: apply pr-reviewer fixes for #184

* chore: apply pr-reviewer fixes for #184

* chore: apply pr-reviewer fixes for #184

* feat(origin-actor): allow @ in the path for name@version-model

The name segment optionally carries harness version + model as
{harness}@{version}-{model} (e.g. claude-code@2.3.1-opus4.8) so cloud can
derive origin_actor_name/_version/_model. Add '@' to the allowed charset
(engine regex + sdk-rust is_allowed) and update the test fixtures from the
stale pre-path UA values (claude-code/2.3 …) to real path examples.

See cloud/plans/origin-actor.md (name sub-format).
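
To make the path format concrete, here is an illustrative parser for `{app}/{type}[/{name}]` with the optional `{harness}@{version}-{model}` name segment. It is a sketch under assumptions: the exact allowed charset is not given in the commits (only that `/` and `@` are permitted and the max length is 128), and splitting version from model at the first `-` after `@` is a guess at how the sub-format is derived.

```typescript
interface OriginActor {
  app: string;
  type: string;
  name?: string;
  version?: string;
  model?: string;
}

const MAX_LENGTH = 128;
// Assumed charset; only `/` and `@` are confirmed by the commit messages.
const ALLOWED = /^[A-Za-z0-9._\/@-]+$/;

function parseOriginActor(raw: string): OriginActor | null {
  if (raw.length === 0 || raw.length > MAX_LENGTH || !ALLOWED.test(raw)) return null;
  const parts = raw.split("/");
  if (parts.length < 2 || parts.length > 3) return null;
  const app = parts[0];
  const type = parts[1];
  const segment = parts[2];
  if (!app || !type || segment === "") return null;

  const actor: OriginActor = { app, type };
  if (segment === undefined) return actor;

  const at = segment.indexOf("@");
  if (at === -1) {
    actor.name = segment;
    return actor;
  }
  actor.name = segment.slice(0, at);
  const rest = segment.slice(at + 1);
  // Assumption: the first "-" after "@" separates version from model.
  const dash = rest.indexOf("-");
  actor.version = dash === -1 ? rest : rest.slice(0, dash);
  if (dash !== -1) actor.model = rest.slice(dash + 1);
  return actor;
}
```

For example, `agent-relay-cli/agent/claude-code@2.3.1-opus4.8` would yield name `claude-code`, version `2.3.1`, and model `opus4.8` under these assumptions.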

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>
* feat(sdk): resync missed events on WebSocket reconnect

The engine already stamps every delivered event with a monotonic
agent_seq, keeps a 500-event resync ring per agent, and falls back to a
DB-backed replay for larger gaps — but no shipping client used it, so
every disconnect window was silent event loss.

WsClient now tracks the highest agent_seq seen (read from the raw frame,
since schema parsing strips unknown keys) and, after each reconnect once
open handlers have re-subscribed, sends
{type: "resync", last_seen_seq, since}. Replayed events flow through the
normal dispatch path, deduplicated by stable event id, and the server's
resync_ack surfaces as a new "resynced" lifecycle event — exposed as
on.resynced(({replayed, gapDetected}) => ...) on RelayCast and
AgentClient. First connections behave exactly as before (no seq, no
resync frame).
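
The client-side bookkeeping can be sketched roughly as below. `agent_seq`, `last_seen_seq`, and the `resync` frame type come from the commit message; the `ResyncTracker` class is hypothetical, the `since` field is omitted because the source does not define it, and the unbounded id set is a simplification a real client would need to cap.

```typescript
interface ServerEvent {
  id: string;
  type: string;
  agent_seq?: number;
}

class ResyncTracker {
  private lastSeenSeq: number | null = null;
  private readonly seenIds = new Set<string>();

  // Returns true when the event should be dispatched to handlers.
  accept(event: ServerEvent): boolean {
    const seq = event.agent_seq;
    if (typeof seq === "number" && (this.lastSeenSeq === null || seq > this.lastSeenSeq)) {
      this.lastSeenSeq = seq;
    }
    // Replayed events reuse their stable id, so drop repeats.
    if (this.seenIds.has(event.id)) return false;
    this.seenIds.add(event.id);
    return true;
  }

  // Frame to send after a reconnect; null when no seq has been seen yet,
  // which matches the "first connections behave as before" rule.
  resyncFrame(): { type: "resync"; last_seen_seq: number } | null {
    if (this.lastSeenSeq === null) return null;
    return { type: "resync", last_seen_seq: this.lastSeenSeq };
  }
}
```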

@relaycast/types gains the missing wire frame schemas: resync (client),
resync_ack (server), and the client-only resynced event.

Also adds the package README (install, RelayCast vs AgentClient
quickstart, reconnect/resync behavior, self-hosting).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* chore: apply pr-reviewer fixes for #176

* chore: apply pr-reviewer fixes for #176

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>
* feat: add Swift SDK and OpenAPI sync guard

* chore: apply pr-reviewer fixes for #173

* chore: apply pr-reviewer fixes for #173

* chore: apply pr-reviewer fixes for #173

* chore: apply pr-reviewer fixes for #173

---------

Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>
* feat(sdk-typescript): rename telemetry harness -> origin_actor

Aligns @relaycast/sdk (JS) with the engine 3.0.0 contract (relaycast#184).
The JS SDK now sends the `X-Relaycast-Origin-Actor` header / `?origin_actor=`
WS query from an `originActor` option (was `harness` / `X-Relaycast-Harness`).

- origin.ts: HARNESS_HEADER -> ORIGIN_ACTOR_HEADER; sanitizeHarness ->
  sanitizeOriginActor; max 120 -> 128 and allow `@` for the
  {app}/{type}/{name}@{version}-{model} path.
- client.ts / ws.ts / relay.ts / agent.ts: the `harness` option + internal
  `_originHarness` -> `originActor`; WS query `harness` -> `origin_actor`.

Breaking: the public `harness` constructor option is now `originActor`; no
alias (per cloud/plans/origin-actor.md). The domain session-event types are
untouched.

357 tests pass; tsc clean.

This is the JS-SDK foundation for the spawned-agent (23%) attribution. Once
published, the relay side (telemetry options + broker per-worker env) consumes
it to emit agent-relay-cli/agent/<harness>.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: apply pr-reviewer fixes for #187

* chore: apply pr-reviewer fixes for #187

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>
…se 0) (#191)

* feat(types): add fleet wire protocol schemas

* chore: apply pr-reviewer fixes for #191

* chore: apply pr-reviewer fixes for #191

* fix(types): align fleet wire fields with v1 ruling

* fix(types): tighten fleet action result contract

* Extend fleet wire protocol v3

* chore: apply pr-reviewer fixes for #191

* Add agent.register reply data schema

---------

Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>
… (fleet Phase 1) (#192)

* feat(types): add fleet wire protocol schemas

* chore: apply pr-reviewer fixes for #191

* chore: apply pr-reviewer fixes for #191

* fix(types): align fleet wire fields with v1 ruling

* feat(engine): add fleet node registry and node-native actions

* feat(sdk): expose fleet nodes and triggers

* chore: apply pr-reviewer fixes for #192

* chore: apply pr-reviewer fixes for #192

* fix(types): tighten fleet action result contract

* Extend fleet wire protocol v3

* fix(engine): close round-1 node engine gaps

* fix(engine): close round-2 fleet node review items

* fix(engine): restore frozen reply contract and close round-3 reviews

* fix(engine): close round-4 fleet node review items

Round-4 review (88b792b) raised three majors; all addressed:

1. Drained offline-queue invocations stayed `pending` with no
   `dispatched_at`, so the timeout sweep never covered them. Factor a
   shared dispatched-state transition and apply it on drain via
   `markDrainedInvocationDispatched`, stamping `dispatched_at` +
   `retry_after_at` so dispatch-timeout/reschedule now picks up drained
   work. Adds a drain -> timeout -> reschedule conformance test.

2. Restore the frozen v3.1 `reply.agent_register.json` fixture
   byte-identical to ee2c001, re-add it to EXPECTED_FIXTURES, and restore
   the AgentRegisterReplyDataSchema assertion. `reply.json` stays the
   separate generic reply. `git diff ee2c001 -- packages/types` is empty;
   packages/types remains frozen.

3. Align SDK types with runtime: NodeRosterEntry.capabilities is now
   NodeCapability[] (FleetCapability objects, not string[]), and action
   invocation/completion output is JsonValue (FleetWireJsonValue:
   scalars/arrays/null legal, not object-only). Adds enforced
   compile-level type assertions in src plus expectTypeOf docs; widens the
   MCP completion call site to the JsonValue output contract.

Repo build/test/lint all green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(engine): close round-5 fleet node review items

Round-5 review (5f4e5b9) raised one major + three minors; all addressed:

1. [MAJOR] Queued spawn invocations could stick pending forever because
   drainNodeQueue ran at attachNodeSocket BEFORE node.register set the node
   online, so reserveNodeCapacity's online gate failed and the frame requeued
   with no re-drain trigger. Introduce a serialized registry `drainNode` and
   invoke it after node.register AND node.heartbeat (node is online by then),
   so queued spawns reserve capacity and dispatch. On a deferred reservation
   the requeue now arms `retry_after_at` so the dispatch sweeper reschedules as
   a backstop. Concurrent drains per node are chained so they never overlap
   (no double reservation). Adds a spawn drain test covering offline-queue ->
   reconnect (backstop armed) -> register (reserve + dispatch).

2. [minor] emitInvocationCompletionEffects double-delivered action.completed/
   failed to an online caller (targeted + online-set). Exclude the caller from
   the online fanout; add a dedupe unit test.

3. [minor] reconcileInventory could partially apply agent rebinds before a
   later item's conflict throw turned into an error reply. Pre-validate all
   conflicts up front (caching existing rows) before mutating anything.

4. [minor] The fixed handler_node spawn path marked reservationHeld without
   incrementing reserved capacity, so release under-counted and over-committed.
   Actually reserve on that path and only mark held when the reservation took.

Repo build/test/lint all green; packages/types still byte-identical to ee2c001.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…fleet Phase 2) (#193)

* feat(engine): add bounded durable fleet mailbox

* fix(engine): re-resolve mailbox targets at fanout

* fix(engine): guard ttl expiry and unify replay payload
…ivery guarantees (Phase 6) (#194)

* feat(engine): per-workspace fleet rollout flag + migration single-delivery guarantees (Phase 6)

Gate the entire fleet node control surface behind a per-workspace
`fleet_nodes_enabled` flag (default OFF), so fleet can ship dark and roll
out workspace-by-workspace. Legacy per-agent WS delivery is unaffected
either way.

The flag is checked once at each genuine boundary (no scattered checks):
- node control WS (`/v1/node/ws`) rejects with `fleet_nodes_disabled` (404)
- node roster routes (`/v1/nodes*`) return a flat 404 via `requireFleetNodes`
- declarative trigger evaluation is skipped at the message hook
- spawn placement + node-handler dispatch refuse in `invokeAction`
  (agent-handler actions stay available)

Flag source mirrors the workspace-stream pattern: a KV override with a
short in-memory cache, defaulting to `EngineConfig.fleetNodesEnabled`.
GET/PUT `/v1/workspace/fleet-nodes` toggles the per-workspace override.
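
A sketch of the flag lookup described above: a KV override, a short in-memory cache, and a config default. The `KvStore` interface, the key format, the `"true"` value encoding, and the 5-second TTL are all assumptions for illustration.

```typescript
// Hypothetical KV port; null means "no override stored".
interface KvStore {
  get(key: string): Promise<string | null>;
}

class FleetNodesFlag {
  private readonly cache = new Map<string, { value: boolean; expiresAt: number }>();
  private readonly kv: KvStore;
  private readonly defaultEnabled: boolean;
  private readonly ttlMs: number;

  constructor(kv: KvStore, defaultEnabled: boolean, ttlMs: number = 5_000) {
    this.kv = kv;
    this.defaultEnabled = defaultEnabled; // e.g. EngineConfig.fleetNodesEnabled
    this.ttlMs = ttlMs;
  }

  async isEnabled(workspaceId: string, now: number = Date.now()): Promise<boolean> {
    const hit = this.cache.get(workspaceId);
    if (hit && hit.expiresAt > now) return hit.value;
    // Assumed key format and encoding.
    const raw = await this.kv.get(`fleet_nodes_enabled:${workspaceId}`);
    const value = raw === null ? this.defaultEnabled : raw === "true";
    this.cache.set(workspaceId, { value, expiresAt: now + this.ttlMs });
    return value;
  }
}
```

With a cache like this, a toggle takes effect within one TTL window rather than immediately, which is the usual trade-off for keeping the check cheap at each boundary.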

Tests:
- flag OFF -> every node surface inert (roster, spawn, WS gate, triggers)
- per-workspace override flips the surface on/off; WS gate follows the flag
- migration single-delivery: a legacy self-connected agent is never also
  delivered via a node when the flag flips mid-stream (exclusive location;
  a node's `agent.register` for it is rejected `agent_location_conflict`)

The conformance harness defaults the flag ON, so existing node integration
tests pass unchanged. Full engine suite green (108 tests).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(engine): accept node token via Authorization: Bearer + gate node WS upgrade behind fleet flag (Phase 6)

Cross-repo compat fix surfaced by the Phase 6 two-node E2E: a real relay
broker could never bring a node online against a self-hosted engine.

Root cause: the node-control read-side is the Node HTTP-server `upgrade`
handler in `entrypoints/node.ts` (the Hono `/v1/node/ws` route only answers
the 426 — Node owns the 101). That handler read the token ONLY from the
`?token=` query param, but the relay Rust broker's node_control client sends
it as `Authorization: Bearer <nt_live_…>`. It also had NO fleet-flag gate for
`/v1/node/ws` (only the rk_live workspace-stream path was gated), so the
Phase 6 rollout flag did not actually cover the node control surface on the
self-host adapter.

Fix, both in the upgrade handler:
- read the node token from `?token=` query OR `Authorization: Bearer` header
  (query stays for SDK/Pear; header unblocks the shipped broker — no Rust
  release needed)
- gate the `/v1/node/ws` upgrade behind `isFleetNodesEnabled` (404 when off),
  mirroring the existing stream gate
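
The dual-transport read can be sketched as a small pure function. `extractNodeToken` is a hypothetical name, and checking the query before the header is an assumption; the commit only says either source is accepted.

```typescript
// Returns the node token from `?token=` or `Authorization: Bearer`, else null.
function extractNodeToken(
  requestUrl: string,
  authorization: string | undefined,
): string | null {
  // The base URL is only needed to parse a path-relative request URL.
  const fromQuery = new URL(requestUrl, "http://localhost").searchParams.get("token");
  if (fromQuery) return fromQuery;

  const match = /^Bearer\s+(\S+)$/i.exec(authorization ?? "");
  return match?.[1] ?? null;
}
```

A handler built on this would answer 401 when the result is `null` and apply the fleet-flag gate (404 when off) separately.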

Also mirrored the dual-transport read in the Hono `/v1/node/ws` route for any
adapter that routes upgrades through it.

Accepted-stack PRs involved: engine read-side #192, broker send-side #1107.
The hosted (Cloudflare DO) equivalent is handled in PR 5.

Test: `nodeUpgradeAuth.test.ts` boots the real Node server and asserts a WS
client authenticates via BOTH the Bearer header and the query param, that the
upgrade is rejected while the workspace flag is off (404), and that a
missing/malformed token is rejected (401). Full engine suite green (111).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(engine): reply to agent.register with a broker-shaped `reply` frame (Phase 6 token authority)

Third cross-repo compat fix surfaced by the Phase 6 E2E (spawn scenarios):
spawn never completes end-to-end against a real broker. The relay broker's
node_control client awaits a `reply` frame keyed by the request id — it matches
`pending_agent_registrations` by `reply.id` and parses `data` as
`{agent_id, token, name}` with `deny_unknown_fields`. The engine instead
answered `agent.register` with a bare `{type:'agent.registered', ...}` carrying
the full object (incl. invocation_id/session_ref), which the broker never
matches → `register_fleet_agent_token` hangs to its 30s timeout → the spawn
action fails. This blocked every spawn-dependent path (placement completion,
mailbox delivery to via-node agents, resume).

Reply in the shape the shipped broker consumes; the broker already holds the
invocation_id/session_ref it sent, so only the minted identity is echoed. Same
root pattern as the node-token transport mismatch (#192 read-side ↔ #1107
broker send-side); no Rust release needed.

Updated the one conformance helper that asserted the old frame. Engine suite
green (111).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(engine): self-host serve env for fleet flag default + mailbox TTL/depth-cap

The `relaycast-engine` serve bin gains optional env tuning so operators (and the
Phase 6 fleet E2E) can configure the bounded mailbox and the fleet rollout
default without code changes:
- RELAYCAST_FLEET_NODES_ENABLED=1 → EngineConfig.fleetNodesEnabled
- RELAYCAST_MAILBOX_TTL_MS / RELAYCAST_MAILBOX_DEPTH_CAP → mailbox tuning

Unset env leaves the existing defaults untouched.
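
A minimal sketch of how such env tuning might be read, assuming that invalid numeric values are ignored (the commit does not say how they are handled). The `ServeTuning` shape and `readServeEnv` are hypothetical; the three variable names are from the commit message.

```typescript
interface ServeTuning {
  fleetNodesEnabled?: boolean;
  mailboxTtlMs?: number;
  mailboxDepthCap?: number;
}

function parsePositiveInt(raw: string | undefined): number | undefined {
  if (raw === undefined) return undefined;
  const n = Number(raw);
  return Number.isInteger(n) && n > 0 ? n : undefined;
}

// Takes the env as a plain record so the function stays testable.
function readServeEnv(env: Record<string, string | undefined>): ServeTuning {
  const tuning: ServeTuning = {};
  if (env["RELAYCAST_FLEET_NODES_ENABLED"] === "1") tuning.fleetNodesEnabled = true;

  const ttl = parsePositiveInt(env["RELAYCAST_MAILBOX_TTL_MS"]);
  if (ttl !== undefined) tuning.mailboxTtlMs = ttl;

  const cap = parsePositiveInt(env["RELAYCAST_MAILBOX_DEPTH_CAP"]);
  if (cap !== undefined) tuning.mailboxDepthCap = cap;

  // Keys that were unset stay absent, so existing defaults are untouched.
  return tuning;
}
```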

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* chore(engine): renumber fleet mailbox migration 0017→0019 to deduplicate

#192 (merged) owns 0017_spawn_reservation_and_retry_state and 0018; the Phase 2
mailbox migration was authored as 0017 on an older base, colliding on the 0017
prefix once #192/#193 landed in main. Renumber to 0019 (after 0018) so the D1
migration sequence is unique and ordered. Pure file rename — no code references
the filename, and the migration has not been applied to any environment yet
(engine unpublished), so there is no D1 re-apply risk.

* docs(changelog): record fleet node/mailbox changes + breaking DeliveryStatus remap

Changelogs here are hand-curated (no CI generation), and the fleet stack
(#191-#194) was missing from them. Add the user-facing entries:

- @relaycast/types: new CHANGELOG; document the breaking DeliveryStatus enum
  remap (accepted/deferred removed, acked/dead_lettered added, delivered
  re-meaning) with old->new mapping + flag-independent migration note, the new
  Delivery location/lifecycle fields, and the fleet-wire protocol module.
- @relaycast/sdk-typescript: node roster API (nodes.list/get, triggers.list),
  capability objects, handler/dispatch node fields, JsonValue export, and the
  breaking action-output widening + delivery status value change.

These confirm the next @relaycast/types + sdk-typescript publish is a MAJOR.

* chore: apply pr-reviewer fixes for #194

* fix(engine): mailbox cumulative-ack + depth-cap correctness (Codex review)

Address P2 findings from Codex review of the fleet mailbox delivery path:

1. ackDelivery (single per-delivery REST ack) advanced the cumulative cursor to
   the row's own seq, so acking seq 2 while seq 1 is queued moved the cursor past
   seq 1; deliverPendingToNode (seq > delivery_ack_seq) then skipped it forever on
   node replay. Make the cursor advance opt-in (ackRows advanceCursorTo?) — single
   acks no longer advance it; the row's acked status already excludes it from replay.
   The node delivery.ack {up_to_seq} path still advances cumulatively. Regression test.

2. Migration 0019 seeded delivery_ack_seq = MAX(acked seq), skipping an older
   still-queued row below a newer acked one. Seed from the contiguous acked prefix
   (lowest active seq - 1; max seq when nothing is active).

3. Node-replay event classification checked dmType before threadId, so a thread
   reply inside a DM/group DM would replay as dm.received instead of thread.reply
   (the live routes/thread.ts routing). Check threadId first to mirror live.

4. Mailbox depth-cap count included expired-but-unswept rows, so an idle recipient
   kept rejecting new sends as depth_cap after TTL instead of dead-lettering.
   Exclude expired rows from the count (matches the replay query). Regression test.
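
Item 2's seeding rule (contiguous acked prefix rather than MAX of acked seq) can be illustrated with a small in-memory model. The row shape, the two-value status, and returning 0 for an empty mailbox are simplifying assumptions; the actual migration is SQL.

```typescript
type MailboxRow = { seq: number; status: "queued" | "acked" };

// Cursor = lowest active seq - 1; max seq when nothing is active.
function seedAckCursor(rows: MailboxRow[]): number {
  const active = rows.filter((r) => r.status !== "acked").map((r) => r.seq);
  if (active.length > 0) return Math.min(...active) - 1;
  // Assumption: an empty mailbox seeds the cursor at 0.
  return rows.length > 0 ? Math.max(...rows.map((r) => r.seq)) : 0;
}

// Mirrors the replay filter: seq > cursor, and already-acked rows excluded.
function replayable(rows: MailboxRow[], cursor: number): number[] {
  return rows
    .filter((r) => r.seq > cursor && r.status !== "acked")
    .map((r) => r.seq);
}
```

With seq 1 still queued and seq 2 acked, the cursor is seeded at 0, so seq 1 is still replayed; seeding from the highest acked seq would have set it to 2 and skipped seq 1.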

Also classify the operator-only /v1/workspace/fleet-nodes flag route as non-SDK in
sdk-openapi-sync (pre-existing #194 gap that turbo test caching had masked).

* docs(openapi): require enabled|mode and document 400 on PUT /workspace/fleet-nodes

The PUT handler rejects a payload lacking both `enabled` (boolean) and
`mode: inherit` with a 400 invalid_request, but the schema marked both optional
and documented only a 200. Add anyOf[required: enabled | required: mode] to
reflect the runtime constraint, and document the 400 (ErrorResponse).

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>
* refactor(telemetry): drop origin_surface from the origin contract

The origin-actor path (`{app}/{type}/{name}`) now expresses what
`origin_surface` (`cli|sdk|cloud`) tried to. Keep exactly two
non-overlapping origin concepts: `origin_actor` (who/what drove the
request) and `origin_client`+`origin_version` (which SDK library sent
it). See cloud/plans/origin-actor.md, decision 3.

Removes `origin_surface` from the wire header `X-Relaycast-Origin-Surface`,
the `?origin_surface=` WS query param, and the public SDK option — across
@relaycast/types, engine, mcp, and the rust/python/typescript SDKs. No
back-compat alias; this is a breaking change to TelemetryOrigin (major bump).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: apply pr-reviewer fixes for #188

* chore: apply pr-reviewer fixes for #188

* chore: apply pr-reviewer fixes for #188

* chore: apply pr-reviewer fixes for #188

* chore: apply pr-reviewer fixes for #188

* chore: apply pr-reviewer fixes for #188

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>
…eflake depth-cap test (#195)

* fix(engine): unbreak main CI — drop node-WS origin_surface + deflake depth-cap test

main has been failing 'Lint, Build & Test' (build step) for every PR:

1. BUILD BREAK: #188 removed origin_surface from the telemetry/origin contract,
   but #192's node-WS upgrade handler (engine.ts) and NodeUpgradeArgs.origin
   (ports/realtime.ts) still referenced it — the two PRs merged without
   reconciling. tsc fails: 'origin_surface does not exist'. Drop the dead
   surface field from both, matching the agent-WS upgrade path #188 already
   fixed. No consumer reads node origin.surface.

2. FLAKY TEST: the depth-cap-excludes-expired conformance test relied on a
   1.2s wall-clock wait racing the second-granular unixepoch() TTL boundary;
   it passed in isolation but failed under full-suite load. Age the row
   deterministically via the DB instead of sleeping (also ~1.2s faster).

* chore: apply pr-reviewer fixes for #195

* chore: apply pr-reviewer fixes for #195

---------

Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>
…llow-ups (#189)

* feat(engine): retention pruning with per-workspace TTLs and outbox follow-ups

Add pruneExpired: bounded-batch deletion of expired messages (leaf-first
across thread parents), settled deliveries, message logs, and orphaned
read receipts, with per-workspace TTLs in a new nullable
workspaces.retention column. Message retention is opt-in; settled
deliveries and message logs default to 90 days as operational logs.
Runs on the Node adapter's outbox cleanup cadence and is exported for
queue-backed scheduled handlers.
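
A minimal sketch of bounded-batch, leaf-first deletion, using an in-memory map in place of the database. The `Msg` shape and the function name are assumptions for illustration; the real pruneExpired works against SQL tables and also covers deliveries, message logs, and read receipts.

```typescript
// Hypothetical message record; parentId links a reply to its thread parent.
interface Msg {
  id: string;
  parentId: string | null;
  createdAt: number; // unix seconds
}

// Delete up to `batchSize` messages older than `cutoff`, children before parents.
// An expired parent with a surviving child is left in place for a later run.
function pruneExpiredMessages(store: Map<string, Msg>, cutoff: number, batchSize: number): string[] {
  const deleted: string[] = [];
  let progressed = true;
  while (deleted.length < batchSize && progressed) {
    progressed = false;
    // Recompute which messages still have children at the start of each pass.
    const parentsWithChildren = new Set<string>();
    for (const m of store.values()) {
      if (m.parentId !== null) parentsWithChildren.add(m.parentId);
    }
    for (const m of [...store.values()]) {
      if (deleted.length >= batchSize) break;
      if (m.createdAt < cutoff && !parentsWithChildren.has(m.id)) {
        store.delete(m.id);
        deleted.push(m.id);
        progressed = true;
      }
    }
  }
  return deleted;
}
```

The batch bound keeps each run short; a scheduled caller simply invokes the function again on the next tick until nothing is returned.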

cleanupOldEvents now settles exhausted pending_events rows as failed so
they become prunable instead of lingering unclaimable, and
sendWebhookEvent skips the outbox insert and queue send entirely (with
a per-request memoized existence probe) for workspaces with no active
event subscription.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* chore(engine): adapt retention to merged fleet engine (rebase onto main)

Rebased feature/engine-retention onto main, which now has the fleet stack
(#191-#194). Three adaptations:

1. Renumber migration 0016_workspace_retention -> 0020_workspace_retention; #192
   took 0016 (fleet_nodes) and #193 took 0019 (fleet_mailbox).
2. Delivery status model: #193 reworked the enum, so SETTLED_DELIVERY_STATUSES is
   now ['acked','failed','dead_lettered'] (was ['delivered','failed']). 'delivered'
   is now IN-FLIGHT (sent, awaiting cumulative ack), so retention must never prune
   it; 'acked' is terminal success. Updated tests to the new status names.
3. insertDelivery test helper assigns a distinct seq per agent — the mailbox
   migration added UNIQUE(workspace_id, agent_id, seq), so same-agent rows can no
   longer share the default seq 0.
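
The status adaptation in item 2 can be illustrated with a small predicate. `SETTLED_DELIVERY_STATUSES` and the 90-day default come from the commit messages above; `isPrunableDelivery` and the `settledAtSec` timestamp are hypothetical names for this sketch.

```typescript
type DeliveryStatus = "queued" | "delivered" | "acked" | "failed" | "dead_lettered";

// Statuses the commit message names as settled after the #193 rework.
const SETTLED_DELIVERY_STATUSES: readonly DeliveryStatus[] = ["acked", "failed", "dead_lettered"];

const DAY_SEC = 24 * 60 * 60;
const DEFAULT_DELIVERY_TTL_SEC = 90 * DAY_SEC; // 90-day default for operational logs

// `settledAtSec` is an assumed timestamp for when the row reached its final status.
function isPrunableDelivery(
  status: DeliveryStatus,
  settledAtSec: number,
  nowSec: number,
  ttlSec: number = DEFAULT_DELIVERY_TTL_SEC,
): boolean {
  // 'delivered' is in flight (awaiting ack), so it never qualifies regardless of age.
  return SETTLED_DELIVERY_STATUSES.includes(status) && nowSec - settledAtSec >= ttlSec;
}
```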

Note: turbo build/tsc is currently red on main itself (engine.ts:212 uses
originInfo.origin_surface, which #188 removed from the telemetry contract) — a
pre-existing #188/#192 collision unrelated to this PR. Engine vitest is green
(132/132).

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
)

* Add Python SDK parity endpoints

* Fix Python SDK parity: delivery status enum, channel topic/invite paths

Rebased #190 onto main (resolving the #188 origin-contract changes:
origin_surface is gone; only origin_actor + origin_client/origin_version
remain — confirmed no origin_surface references survive).

Correctness fixes layered on top of #190's parity additions:

- DeliveryStatus: updated the stale Literal["accepted","delivered",
  "deferred","failed"] to the canonical #193 enum
  Literal["queued","delivered","acked","failed","dead_lettered"]
  (packages/types/src/delivery.ts). "delivered" now means in-flight
  awaiting ack; "acked" is terminal success; accepted/deferred removed.

- Delivery model: aligned with the canonical DeliverySchema by adding the
  missing fields seq, location_type, location_node_id, expires_at,
  delivered_at, acked_at, dead_lettered_at to match the TS SDK surface.

- channels.set_topic: corrected the route from PATCH /v1/channels/{name}
  to PATCH /v1/channels/{name}/topic to match the TS setTopic() and the
  dedicated openapi endpoint (it was colliding with channels.update).

- channels.invite: corrected the request body field from {"agent": ...}
  to {"agent_name": ...} to match InviteRequestSchema / the TS SDK wire
  shape (Python sends keys verbatim with no camel->snake conversion).

- Updated test_channels_set_topic to assert the corrected /topic route.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(sdk-swift): bring Swift SDK to 100% parity with TypeScript SDK

Add the relay-level surfaces that were missing from the Swift SDK:

- nodes namespace: list (GET /v1/nodes, capability/name filters), get
  (GET /v1/nodes/{name}) with NodeRosterEntry + NodeCapability models
- triggers namespace: create/list/get/update/delete full lifecycle
  (POST/GET/PATCH/DELETE /v1/triggers[/{id}]) with Trigger,
  CreateTriggerRequest, UpdateTriggerRequest models
- activity feed: activity(limit) -> GET /v1/activity
- workspace-level DM queries: allDMConversations (GET
  /v1/dm/conversations/all) and dmMessages (GET
  /v1/dm/conversations/{id}/messages)

Fix the stale DeliveryStatus enum to the current statuses
(queued|delivered|acked|failed|dead_lettered), replacing the old
accepted/deferred values.

All routes verified present in openapi.yaml. Adds tests for nodes,
triggers, workspace DM queries, activity, and the delivery-status enum.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(sdk-rust): full Rust↔TypeScript SDK parity

Bring the Rust SDK to 100% feature parity with the TypeScript reference
SDK. Every new route is documented in openapi.yaml.

New RelayCast surfaces:
- Workspace bootstrap: lookup_workspace (GET /v1/workspaces/by-name/{name})
- A2A: register_a2a, list_a2a_agents, remove_a2a_agent, get_a2a_agent_card
- Routing: route, route_feedback, get_routing_config, update_routing_config
- Directory: search_directory, publish_to_directory, list_directory,
  get_directory_agent, update_directory_agent, delete_directory_agent,
  list_directory_ratings, rate_directory_agent
- Skills: import_skills, search_skills
- Fleet nodes: list_nodes, get_node
- Triggers: create_trigger, list_triggers, get_trigger, update_trigger,
  delete_trigger
- Certification: certify, get_certification, certification_badge_url,
  monitor_certification
- Console: console_messages, console_stats (ConsoleOverview), console_agents,
  console_costs

New AgentClient surfaces:
- channels mute_channel / unmute_channel
- invite_to_channel fixed to send documented `agent_name` body

Models: added serde structs for A2A cards/records, directory agents/skills/
ratings, routing config/weights, skill search results, node roster with
capability objects, triggers, certification runs, and console stats —
all snake_case to match the wire contract.

DeliveryStatus enum updated to the canonical lifecycle
(queued|delivered|acked|failed|dead_lettered); tests updated to match.

Adds parity tests for every new surface.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: apply pr-reviewer fixes for #190

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>

The node.heartbeat schema was .strict() and accepted only
load/active_agents/handlers_live. The relay broker wants to carry the
node roster (name/node_id/capabilities/max_agents/version) on the
steady-state heartbeat so the engine can keep a node's descriptor fresh
between — or in the absence of — a fresh node.register (e.g. after an
engine restart where the broker keeps heartbeating an already-registered
node).

Extend FleetNodeHeartbeatMessageSchema with optional roster fields and
have heartbeatNode() adopt them: refresh name/capabilities/max_agents/
version on the node row and register newly-advertised capability actions
via ensureCapabilityActions. A minimal heartbeat (no roster) remains
valid and preserves the existing roster.

last_heartbeat_at is NOT part of the wire: receipt time is stamped
server-side here as the single source of truth for liveness; the broker
does not send it. node_id, when present, is validated against the
authenticated node token (node_id_mismatch) like node.register.
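
The adoption rule described above can be sketched as a pure function. The `NodeRow` and `HeartbeatRoster` shapes are simplified assumptions (capabilities are reduced to strings, the load fields are omitted, and the capability-action registration step is not modeled); only the field names on the wire and the `node_id_mismatch` error come from the commit message.

```typescript
// Hypothetical stored node row; the real table has more columns.
interface NodeRow {
  nodeId: string;
  name: string;
  capabilities: string[]; // simplified to plain names for this sketch
  maxAgents: number;
  version: string;
  lastHeartbeatAt: number;
}

// Optional roster fields carried on a heartbeat.
interface HeartbeatRoster {
  node_id?: string;
  name?: string;
  capabilities?: string[];
  max_agents?: number;
  version?: string;
}

type HeartbeatResult = { ok: true; row: NodeRow } | { ok: false; error: "node_id_mismatch" };

function applyHeartbeat(
  row: NodeRow,
  hb: HeartbeatRoster,
  authenticatedNodeId: string,
  nowSec: number,
): HeartbeatResult {
  // A node_id on the wire must match the id bound to the node token.
  if (hb.node_id !== undefined && hb.node_id !== authenticatedNodeId) {
    return { ok: false, error: "node_id_mismatch" };
  }
  return {
    ok: true,
    row: {
      ...row,
      // Absent roster fields keep the stored values, so a minimal heartbeat is a no-op here.
      name: hb.name ?? row.name,
      capabilities: hb.capabilities ?? row.capabilities,
      maxAgents: hb.max_agents ?? row.maxAgents,
      version: hb.version ?? row.version,
      lastHeartbeatAt: nowSec, // stamped server-side, never read from the wire
    },
  };
}
```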

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

barryollama and others added 3 commits June 19, 2026 11:26

A client that joins a workspace by key holds no workspace id locally, and
the agent-registration response did not include one — so a consumer (e.g.
the broker's Rust SDK) had nothing to record and fell back to an "unknown
workspace" placeholder, enrolling the agent (and any fleet node it
spawned) under a phantom workspace invisible to the real one.

Include `workspace_id` in the agent-registration response, and in the
agent-detail response used by the strict-name reclaim path. Surface it on
the Rust SDK's `CreateAgentResponse` and `Agent` as an optional,
serde-default field so older engines that omit it still deserialize.

Co-authored-by: Barry Cape <barryonthecape@icloud.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The "Commit and tag" step in publish-npm.yml and publish-rust.yml pushed
the version-bump commit with `git push origin HEAD:main` off a stale
checkout. When main advanced during the multi-minute build (e.g. a
concurrent cross-kind publish), the push was rejected as non-fast-forward,
failing the workflow after packages had already published and skipping the
git tag + GitHub Release.

Before pushing, fetch + rebase onto origin/main and retry with backoff
(5 attempts). The release commit only touches version metadata, so it
rebases cleanly onto another release's commit. Tag + Release still run
after a successful push.
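
The retry loop can be sketched as follows. The actual fix lives in the workflow YAML, which this page does not show; this is a TypeScript rendering of the same control flow with the git steps injected as callbacks. The `PushSteps` interface, the exponential backoff schedule, and the 2-second base delay are assumptions, and only the "fetch, rebase, retry up to 5 attempts" behaviour comes from the commit message.

```typescript
// Injected steps so the loop can be exercised without a real repository.
interface PushSteps {
  fetchAndRebase: () => void;  // e.g. fetch origin, then rebase onto origin/main
  push: () => boolean;         // true when the remote accepted the push
  sleep: (ms: number) => void; // blocking wait between attempts
}

// Returns the attempt number that succeeded; throws if every attempt is rejected.
function pushWithRebaseRetry(steps: PushSteps, maxAttempts = 5, baseDelayMs = 2000): number {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Rebase before every push so the release commit sits on the latest main.
    steps.fetchAndRebase();
    if (steps.push()) return attempt;
    // Back off before retrying; skip the wait after the final attempt.
    if (attempt < maxAttempts) steps.sleep(baseDelayMs * 2 ** (attempt - 1));
  }
  throw new Error(`push still rejected after ${maxAttempts} attempts`);
}
```

Rebasing inside the loop matters: a push that loses the race once can lose it again if main moves during the backoff, so each attempt starts from a fresh fetch.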

Carries over only the workflow changes from #182; the 2.6.0 version
reconciliation is obsolete now that main is at 4.x.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_013c2kNLweXdjA5dMpmMRPBM

@gemini-code-assist

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: cd36fd76-0834-4ad0-9f01-9c18442d203c

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d6d1710edb

.where(and(
  eq(deliveries.workspaceId, workspaceId),
  inArray(deliveries.id, ids),
  notInArray(deliveries.status, ['acked', 'dead_lettered']),

[P2] Keep failed deliveries terminal when acking

When an agent has already reported a delivery via POST /deliveries/:id/fail (status = 'failed'), this guard still lets a later ack (including the node cumulative ack path through ackRows) update the same row to acked because failed is not excluded. That overwrites the terminal failure state and can emit delivered/read side effects for a delivery that was explicitly failed; include failed in the terminal guard.

.where(and(
  eq(deliveries.id, deliveryId),
  ne(deliveries.status, 'delivered'),
  notInArray(deliveries.status, ['acked', 'dead_lettered']),

[P2] Keep failed deliveries terminal when deferring

If a delivery is already failed, POST /deliveries/:id/defer passes this predicate and rewrites it back to queued, putting a terminal failure back into the active replay queue with a new available_at. The fail path treats failed as settled, so the defer early return/update guard should also exclude failed.
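
Both review comments above come down to treating `failed` as terminal alongside `acked` and `dead_lettered`. A minimal sketch of that guard, with hypothetical helper names, is below. It models only the terminal-status check; any other conditions in the real predicates are not reproduced, since the diff hunks quoted above are partial.

```typescript
type DeliveryStatus = "queued" | "delivered" | "acked" | "failed" | "dead_lettered";

// Terminal statuses with 'failed' included, as the review suggests.
const TERMINAL_STATUSES: ReadonlySet<DeliveryStatus> = new Set<DeliveryStatus>([
  "acked",
  "failed",
  "dead_lettered",
]);

// An ack may only move a non-terminal row to 'acked'.
function canAck(current: DeliveryStatus): boolean {
  return !TERMINAL_STATUSES.has(current);
}

// A defer may only requeue a non-terminal row.
function canDefer(current: DeliveryStatus): boolean {
  return !TERMINAL_STATUSES.has(current);
}
```

In the query form quoted in the review, the equivalent change would be adding `'failed'` to the array passed to `notInArray` in both predicates.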

@willwashburn willwashburn deleted the claude/adoring-hopper-jumdlq branch June 19, 2026 23:47

5 participants