P0: lead→worker dm_send can silently deadlock while both agents are ONLINE #892

**Source:** reliability spec `cloud/docs/specs/agent-relay-reliability-spec.md` §P0 (live-incident, 2026-05-18). Filed from the spec-triage on PR #861; **not** fixed there (broker-level scope).

## Problem
During a ~5h live-incident orchestration, a lead agent finished its analysis and tried to `message.dm.send` the spec to a worker. The send kept failing; the lead self-reported "Blocked on relay recovery … I'll retry sends each turn". Concurrently the broker logged `watchdog: no PTY output for 120s — marking idle`. Both agents showed `ONLINE`, yet **no message crossed for tens of minutes**. The stall was discovered only by scraping the lead's raw TTY.

## Impact
Highest. A coordinated team silently stops progressing with nothing surfaced to the orchestrator/user. In an incident this directly extends outage duration.

## Required
- Defined delivery semantics: `dm_send` either returns a hard, actionable error promptly **or** is durably queued and guaranteed to be delivered once the recipient is reachable. "Retry silently forever" is unacceptable.
- First-class `message_delivery_failed { from, to, attempts, lastError }` (and a delivery-confirmed signal) surfaced to the spawning orchestrator, not only in the sender TTY.
- Distinguish "transient auth/broker recovery" from "recipient gone".
- `agent-relay doctor` reporting broker auth state + stuck outbound queues.
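
One way to read the first three requirements together is as a send loop with a hard deadline that always ends in a terminal event. The sketch below is illustrative only and is not the broker's actual code: `SendError`, `DeliveryEvent`, and `send_with_deadline` are hypothetical names, the `attempt` closure stands in for whatever the real transport call is, and the event fields simply mirror the `{ from, to, attempts, lastError }` shape listed above.

```rust
use std::time::{Duration, Instant};

/// Why a single send attempt failed (hypothetical classification).
#[derive(Debug, Clone, PartialEq)]
pub enum SendError {
    /// Broker auth or connection is recovering; a retry may succeed.
    Transient(String),
    /// The recipient no longer exists; a retry cannot succeed.
    RecipientGone,
}

/// Terminal event surfaced to the spawning orchestrator (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub enum DeliveryEvent {
    Confirmed { from: String, to: String, attempts: u32 },
    Failed { from: String, to: String, attempts: u32, last_error: String },
}

/// Retries `attempt` on transient errors until `deadline` would be exceeded.
/// Always returns a terminal event, so the caller never stalls silently.
pub fn send_with_deadline<F>(
    from: &str,
    to: &str,
    deadline: Duration,
    backoff: Duration,
    mut attempt: F,
) -> DeliveryEvent
where
    F: FnMut() -> Result<(), SendError>,
{
    let start = Instant::now();
    let mut attempts = 0u32;
    loop {
        attempts += 1;
        let last_error = match attempt() {
            Ok(()) => {
                return DeliveryEvent::Confirmed {
                    from: from.to_string(),
                    to: to.to_string(),
                    attempts,
                }
            }
            // "Recipient gone" is not retryable: fail immediately.
            Err(SendError::RecipientGone) => {
                return DeliveryEvent::Failed {
                    from: from.to_string(),
                    to: to.to_string(),
                    attempts,
                    last_error: "recipient gone".to_string(),
                }
            }
            Err(SendError::Transient(msg)) => msg,
        };
        // Give up if another backoff would cross the deadline.
        if start.elapsed() + backoff >= deadline {
            return DeliveryEvent::Failed {
                from: from.to_string(),
                to: to.to_string(),
                attempts,
                last_error,
            };
        }
        std::thread::sleep(backoff);
    }
}
```

The sketch covers only the "fail promptly" branch of the delivery semantics; a durable-queue variant would persist the message and emit the same two events from the queue drainer instead of the caller.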

## Acceptance
With two ONLINE agents and an induced broker-auth blip, a `dm_send` either delivers within N s of recovery **and** emits delivery-confirmed, or fails with a surfaced error within N s — never an unbounded silent stall. Regression test induces the blip and asserts the orchestrator receives a delivery event either way.
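
The regression test could take roughly the following shape. This is a toy model under stated assumptions, not the real harness: `FlakyBroker`, `Terminal`, `send_and_report`, and `terminal_events` are hypothetical names, the auth blip is simulated by failing the first few sends, and a bounded attempt count stands in for the wall-clock bound of N s.

```rust
use std::sync::mpsc;

/// Terminal delivery event as seen by the orchestrator (hypothetical).
#[derive(Debug, PartialEq)]
pub enum Terminal {
    Confirmed,
    Failed(String),
}

/// Simulated broker whose auth is down for the first `blip_attempts` sends.
pub struct FlakyBroker {
    blip_attempts: u32,
    seen: u32,
}

impl FlakyBroker {
    pub fn new(blip_attempts: u32) -> Self {
        Self { blip_attempts, seen: 0 }
    }

    /// Fails while the simulated blip lasts, then succeeds.
    pub fn dm_send(&mut self) -> Result<(), String> {
        self.seen += 1;
        if self.seen <= self.blip_attempts {
            Err("broker auth unavailable".to_string())
        } else {
            Ok(())
        }
    }
}

/// Sends with a bounded number of attempts and always reports exactly one
/// terminal event to the orchestrator channel.
pub fn send_and_report(
    broker: &mut FlakyBroker,
    max_attempts: u32,
    orchestrator: &mpsc::Sender<Terminal>,
) {
    let mut last_error = String::from("no attempt made");
    for _ in 0..max_attempts {
        match broker.dm_send() {
            Ok(()) => {
                let _ = orchestrator.send(Terminal::Confirmed);
                return;
            }
            Err(e) => last_error = e,
        }
    }
    let _ = orchestrator.send(Terminal::Failed(last_error));
}

/// Runs one scenario and collects everything the orchestrator received.
pub fn terminal_events(blip_attempts: u32, max_attempts: u32) -> Vec<Terminal> {
    let (tx, rx) = mpsc::channel();
    let mut broker = FlakyBroker::new(blip_attempts);
    send_and_report(&mut broker, max_attempts, &tx);
    drop(tx); // close the channel so the iterator below terminates
    rx.iter().collect()
}
```

A blip shorter than the bound should yield a single `Confirmed`, and a blip longer than the bound a single `Failed`; an empty event list is the silent stall this issue is about.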

## Scope
Broker (Rust `src/*.rs`) + protocol + SDK + CLI. Large; own effort. Relates to PR #861 (reading/observability quick wins) but intentionally separate.
