P1: structured agent lifecycle/health/context-budget events the orchestrator can subscribe to #893

Description

@khaliqgant

Source: reliability spec cloud/docs/specs/agent-relay-reliability-spec.md §P1 (lifecycle). Filed from spec-triage on PR #861.

Problem

codex workers repeatedly exhausted their context window and exited mid-task. The only signals were a "Context 6% left" string on the TTY and, later, the worker's disappearance from agent-relay who. The orchestrator had to infer that an agent was dying from scraped TTY output and take over manually. Work was nearly lost.

Required

  • Emit structured, subscribable lifecycle events: agent_context_low { pct }, agent_exited { reason }, agent_permanently_dead, agent_idle { since }. The orchestrator must be able to observe and act on them (the skill docs reference these events, but in practice they were not actionable).
  • agent-relay who / who --json should additionally include last-activity, context-budget (if known), and current state (working / idle / blocked-on-send).
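To make the requested event surface concrete, here is a minimal sketch of how an orchestrator might subscribe to these events. Only the event names and the pct, reason, and since fields come from this issue. The agent field, the LifecycleBus class, the planAction policy, the 10% threshold, and the "context_exhausted" reason value are all hypothetical, and the in-memory bus only stands in for whatever the broker and SDK end up exposing.

```typescript
// Hypothetical event shapes. Only the event names and the pct / reason / since
// fields come from this issue; the `agent` field is assumed for illustration.
type LifecycleEvent =
  | { type: "agent_context_low"; agent: string; pct: number }
  | { type: "agent_exited"; agent: string; reason: string }
  | { type: "agent_permanently_dead"; agent: string }
  | { type: "agent_idle"; agent: string; since: number };

type Listener = (event: LifecycleEvent) => void;

// Minimal in-memory stand-in for the broker's subscription surface (assumed API).
class LifecycleBus {
  private listeners = new Set<Listener>();

  // Registers a listener and returns a function that unsubscribes it.
  subscribe(listener: Listener): () => void {
    this.listeners.add(listener);
    return () => {
      this.listeners.delete(listener);
    };
  }

  // Delivers the event to every current listener.
  emit(event: LifecycleEvent): void {
    this.listeners.forEach((listener) => listener(event));
  }
}

// Example orchestrator policy (hypothetical): map each event to an action label.
function planAction(event: LifecycleEvent, lowPctThreshold = 10): string {
  switch (event.type) {
    case "agent_context_low":
      return event.pct <= lowPctThreshold
        ? `checkpoint-and-handoff:${event.agent}`
        : `watch:${event.agent}`;
    case "agent_exited":
      return `reassign:${event.agent}`;
    case "agent_permanently_dead":
      return `abandon:${event.agent}`;
    case "agent_idle":
      return `dispatch:${event.agent}`;
  }
}

// Usage: record the planned action for each event the orchestrator receives.
const bus = new LifecycleBus();
const actions: string[] = [];
const unsubscribe = bus.subscribe((e) => actions.push(planAction(e)));
bus.emit({ type: "agent_context_low", agent: "worker-1", pct: 6 });
bus.emit({ type: "agent_exited", agent: "worker-1", reason: "context_exhausted" });
unsubscribe();
```

A tagged union keeps each payload tied to its event name, so a handler cannot read pct from an agent_exited event by mistake.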

Already partially shipped (PR #861, quick win)

who --json now emits real { name, cli, status, pid, uptimeSecs, memoryBytes } (replacing fabricated ONLINE/lastSeen:now). Still missing: last-activity, context-budget, working/idle/blocked state, and the lifecycle events themselves (broker-level).
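The gap between the shipped and the missing who --json fields can be sketched as a pair of types. The six shipped fields are taken from this issue. The names lastActivitySecs, contextPctLeft, and state, the assumption that the output is a top-level JSON array, the sample records, and the needsAttention helper with its thresholds are all hypothetical.

```typescript
// Fields this issue reports as emitted since PR #861.
interface WhoEntryShipped {
  name: string;
  cli: string;
  status: string;
  pid: number;
  uptimeSecs: number;
  memoryBytes: number;
}

// The three states named in this issue.
type AgentState = "working" | "idle" | "blocked-on-send";

// Hypothetical names for the still-missing fields. All are optional so that
// output from a broker that does not emit them yet still type-checks.
interface WhoEntry extends WhoEntryShipped {
  lastActivitySecs?: number; // assumed: seconds since the last observed activity
  contextPctLeft?: number; // assumed: percent of context window left, if known
  state?: AgentState;
}

// Returns the names of workers an orchestrator may want to look at:
// low context, blocked on send, or quiet for too long.
function needsAttention(
  entries: WhoEntry[],
  minPct = 10,
  maxQuietSecs = 300
): string[] {
  return entries
    .filter(
      (e) =>
        (e.contextPctLeft !== undefined && e.contextPctLeft <= minPct) ||
        e.state === "blocked-on-send" ||
        (e.lastActivitySecs !== undefined && e.lastActivitySecs > maxQuietSecs)
    )
    .map((e) => e.name);
}

// Made-up sample output, assuming `who --json` prints a top-level array.
const sample: WhoEntry[] = JSON.parse(`[
  {"name":"worker-1","cli":"codex","status":"running","pid":101,"uptimeSecs":900,
   "memoryBytes":1048576,"contextPctLeft":6,"state":"working","lastActivitySecs":2},
  {"name":"worker-2","cli":"codex","status":"running","pid":102,"uptimeSecs":900,
   "memoryBytes":1048576},
  {"name":"worker-3","cli":"codex","status":"running","pid":103,"uptimeSecs":900,
   "memoryBytes":1048576,"contextPctLeft":80,"state":"blocked-on-send","lastActivitySecs":5}
]`);

const flagged = needsAttention(sample);
```

With the sample above, worker-2 carries only the shipped fields and is never flagged. That is the current blind spot: without the new fields, the orchestrator has nothing to evaluate.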

Acceptance

Killing a worker emits agent_exited, and exhausting a worker's context emits agent_context_low. The orchestrator receives both events without scraping logs.
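One way to turn this criterion into an automated check is to record the events the orchestrator receives during a test run and assert on that list. The sketch below assumes events carry an agent field, and that an exhausted worker should report agent_context_low before agent_exited. Neither assumption is stated in the spec excerpt here, and the helper names are hypothetical.

```typescript
// Minimal view of a received event (the `agent` field is an assumption).
interface ReceivedEvent {
  type: string;
  agent: string;
}

// Index of the first event of `type` for `agent`, or -1 if none was received.
function indexOfEvent(events: ReceivedEvent[], agent: string, type: string): number {
  return events.findIndex((e) => e.agent === agent && e.type === type);
}

// Killed worker: the orchestrator must have received agent_exited.
function killedWorkerAccepted(events: ReceivedEvent[], agent: string): boolean {
  return indexOfEvent(events, agent, "agent_exited") !== -1;
}

// Exhausted worker: agent_context_low must arrive, followed by agent_exited
// (the ordering requirement is an assumption made for this sketch).
function exhaustedWorkerAccepted(events: ReceivedEvent[], agent: string): boolean {
  const low = indexOfEvent(events, agent, "agent_context_low");
  const exited = indexOfEvent(events, agent, "agent_exited");
  return low !== -1 && exited !== -1 && low < exited;
}
```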

Scope

Broker event emission (Rust) + protocol + SDK subscription surface + CLI. Own effort; depends on broker work.
