Source: reliability spec cloud/docs/specs/agent-relay-reliability-spec.md §P0 (live-incident, 2026-05-18). Filed from the spec-triage on PR #861; not fixed there (broker-level scope).
Problem
During a ~5h live-incident orchestration, a lead agent finished its analysis and tried to message.dm.send the spec to a worker. The send kept failing; the lead self-reported "Blocked on relay recovery … I'll retry sends each turn". Concurrently the broker logged watchdog: no PTY output for 120s — marking idle. Both agents showed ONLINE; no message crossed for tens of minutes. The only way it was discovered was scraping the lead's raw TTY.
Impact
Highest. A coordinated team silently stops progressing with nothing surfaced to the orchestrator/user. In an incident this directly extends outage duration.
Required
- Defined delivery semantics:
dm_send either returns a hard, actionable error promptly or is durably queued and guaranteed-delivered on recipient reachability. "Retry silently forever" is unacceptable.
- First-class
message_delivery_failed { from, to, attempts, lastError } (and a delivery-confirmed signal) surfaced to the spawning orchestrator, not only in the sender TTY.
- Distinguish "transient auth/broker recovery" from "recipient gone".
agent-relay doctor reporting broker auth state + stuck outbound queues.
Acceptance
With two ONLINE agents and an induced broker-auth blip, a dm_send either delivers within N s of recovery and emits delivery-confirmed, or fails with a surfaced error within N s — never an unbounded silent stall. Regression test induces the blip and asserts the orchestrator receives a delivery event either way.
Scope
Broker (Rust src/*.rs) + protocol + SDK + CLI. Large; own effort. Relates to PR #861 (reading/observability quick wins) but intentionally separate.
Source: reliability spec
cloud/docs/specs/agent-relay-reliability-spec.md§P0 (live-incident, 2026-05-18). Filed from the spec-triage on PR #861; not fixed there (broker-level scope).Problem
During a ~5h live-incident orchestration, a lead agent finished its analysis and tried to
message.dm.sendthe spec to a worker. The send kept failing; the lead self-reported "Blocked on relay recovery … I'll retry sends each turn". Concurrently the broker loggedwatchdog: no PTY output for 120s — marking idle. Both agents showedONLINE; no message crossed for tens of minutes. The only way it was discovered was scraping the lead's raw TTY.Impact
Highest. A coordinated team silently stops progressing with nothing surfaced to the orchestrator/user. In an incident this directly extends outage duration.
Required
dm_sendeither returns a hard, actionable error promptly or is durably queued and guaranteed-delivered on recipient reachability. "Retry silently forever" is unacceptable.message_delivery_failed { from, to, attempts, lastError }(and a delivery-confirmed signal) surfaced to the spawning orchestrator, not only in the sender TTY.agent-relay doctorreporting broker auth state + stuck outbound queues.Acceptance
With two ONLINE agents and an induced broker-auth blip, a
dm_sendeither delivers within N s of recovery and emits delivery-confirmed, or fails with a surfaced error within N s — never an unbounded silent stall. Regression test induces the blip and asserts the orchestrator receives a delivery event either way.Scope
Broker (Rust
src/*.rs) + protocol + SDK + CLI. Large; own effort. Relates to PR #861 (reading/observability quick wins) but intentionally separate.