P0: lead→worker dm_send can silently deadlock while both agents are ONLINE #892

Description

@khaliqgant

Source: reliability spec cloud/docs/specs/agent-relay-reliability-spec.md §P0 (live-incident, 2026-05-18). Filed from the spec-triage on PR #861; not fixed there (broker-level scope).

Problem

During a ~5h live-incident orchestration, a lead agent finished its analysis and tried to send the spec to a worker via message.dm.send. The send kept failing; the lead self-reported "Blocked on relay recovery … I'll retry sends each turn". Concurrently the broker logged "watchdog: no PTY output for 120s — marking idle". Both agents showed ONLINE, yet no message crossed for tens of minutes. The stall was discovered only by scraping the lead's raw TTY.

Impact

Highest. A coordinated team silently stops progressing with nothing surfaced to the orchestrator/user. In an incident this directly extends outage duration.

Required

  • Defined delivery semantics: dm_send either returns a hard, actionable error promptly, or is durably queued and guaranteed to be delivered once the recipient is reachable. "Retry silently forever" is unacceptable.
  • First-class message_delivery_failed { from, to, attempts, lastError } (and a delivery-confirmed signal) surfaced to the spawning orchestrator, not only in the sender TTY.
  • Distinguish "transient auth/broker recovery" from "recipient gone".
  • agent-relay doctor reporting broker auth state + stuck outbound queues.
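
To make the requirements above concrete, here is a minimal Rust sketch of a bounded send that always ends in one of two orchestrator-visible events. It is an illustration under stated assumptions, not the broker's actual API: `SendError`, `DeliveryEvent`, and `send_bounded` are hypothetical names, the event fields mirror the `message_delivery_failed { from, to, attempts, lastError }` shape requested in this issue, and the `attempt` closure stands in for whatever transport call the broker really makes.

```rust
use std::time::{Duration, Instant};

/// Why one send attempt failed (hypothetical classification).
#[derive(Debug, Clone, PartialEq)]
pub enum SendError {
    /// Broker auth or connectivity is recovering; a retry may succeed.
    Transient(String),
    /// The recipient no longer exists; a retry cannot succeed.
    RecipientGone,
}

/// Events surfaced to the spawning orchestrator (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub enum DeliveryEvent {
    DeliveryConfirmed { from: String, to: String, attempts: u32 },
    MessageDeliveryFailed { from: String, to: String, attempts: u32, last_error: String },
}

/// Retry transient failures until `deadline` would be exceeded, then report.
/// Every path returns an event, so the caller never stalls silently.
pub fn send_bounded<F>(
    from: &str,
    to: &str,
    deadline: Duration,
    backoff: Duration,
    mut attempt: F,
) -> DeliveryEvent
where
    F: FnMut() -> Result<(), SendError>,
{
    let start = Instant::now();
    let mut attempts: u32 = 0;
    loop {
        attempts += 1;
        let last_error = match attempt() {
            Ok(()) => {
                return DeliveryEvent::DeliveryConfirmed {
                    from: from.to_string(),
                    to: to.to_string(),
                    attempts,
                };
            }
            // Permanent failure: fail immediately instead of retrying.
            Err(SendError::RecipientGone) => "recipient gone".to_string(),
            Err(SendError::Transient(reason)) => {
                // Retry only while another backoff still fits in the budget.
                if start.elapsed() + backoff < deadline {
                    std::thread::sleep(backoff);
                    continue;
                }
                reason
            }
        };
        return DeliveryEvent::MessageDeliveryFailed {
            from: from.to_string(),
            to: to.to_string(),
            attempts,
            last_error,
        };
    }
}
```

The sketch covers only the "prompt hard error" branch of the delivery semantics; a durable queue with guaranteed delivery would need persistence that is out of scope here. Separating `Transient` from `RecipientGone` is what lets the caller fail fast on a dead recipient while still riding out a short auth blip.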

Acceptance

With two ONLINE agents and an induced broker-auth blip, a dm_send either delivers within N s of recovery and emits delivery-confirmed, or fails with a surfaced error within N s. It must never stall silently without bound. A regression test induces the blip and asserts that the orchestrator receives a delivery event either way.
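
The shape of such a regression test might look like the following Rust sketch. Everything here is a stand-in: `FakeBroker`, `dm_send_with_budget`, `run_scenario`, and the `Event` variants are hypothetical, logical ticks replace the wall-clock "N s" so the test is deterministic, and a `std::sync::mpsc` channel plays the role of the orchestrator's event stream.

```rust
use std::sync::mpsc::{channel, Sender};

/// What the orchestrator can observe (hypothetical event shape).
#[derive(Debug, PartialEq)]
pub enum Event {
    DeliveryConfirmed { tick: u32 },
    MessageDeliveryFailed { attempts: u32, last_error: String },
}

/// Fake broker whose auth is down for the first `blip_ticks` logical ticks.
pub struct FakeBroker {
    pub blip_ticks: u32,
}

impl FakeBroker {
    fn try_deliver(&self, tick: u32) -> Result<(), String> {
        if tick < self.blip_ticks {
            Err("broker auth unavailable".to_string())
        } else {
            Ok(())
        }
    }
}

/// Drive one send for at most `budget_ticks` attempts and always emit
/// exactly one event to the orchestrator, whether it succeeds or not.
pub fn dm_send_with_budget(broker: &FakeBroker, budget_ticks: u32, orchestrator: &Sender<Event>) {
    let mut last_error = String::from("no attempt made");
    for tick in 0..budget_ticks {
        match broker.try_deliver(tick) {
            Ok(()) => {
                orchestrator.send(Event::DeliveryConfirmed { tick }).unwrap();
                return;
            }
            Err(e) => last_error = e,
        }
    }
    orchestrator
        .send(Event::MessageDeliveryFailed { attempts: budget_ticks, last_error })
        .unwrap();
}

/// Run the blip scenario and collect everything the orchestrator saw.
pub fn run_scenario(blip_ticks: u32, budget_ticks: u32) -> Vec<Event> {
    let (tx, rx) = channel();
    dm_send_with_budget(&FakeBroker { blip_ticks }, budget_ticks, &tx);
    drop(tx); // close the channel so the iterator below terminates
    rx.iter().collect()
}
```

A test built on this would call `run_scenario` once with a blip shorter than the budget and once with a blip longer than it, asserting in both cases that exactly one event reached the orchestrator. The real test would need to induce the blip against the actual broker rather than a fake.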

Scope

Broker (Rust src/*.rs) + protocol + SDK + CLI. Large; own effort. Relates to PR #861 (reading/observability quick wins) but intentionally separate.
