Summary
When an interactive agent OWNS a critical-path step, it does the work and is "released", but the engine never registers the step as completed. The step stays `running` indefinitely, no timeout or retry fires, and the workflow hangs. Workers (`interactive: false`) do NOT have this problem. Goal: an interactive agent in a relayflow should complete as reliably as a worker does.
Evidence (factory build-out, 2026-06-16)
- codex owner of an `implement` step (pipeline shape): target files were written and `self-reflection.md` was present (verification target satisfied), and the agent "released worker via relaycast", but the step stayed stuck in `running` for 15–30 min with no live agent process and zero further activity. Reproduced across 4-wide, 2-wide, and solo runs.
- opencode owner: hard-failed with `owner completion decision missing`.
- Same family surfaced in the conversation shape as a lead↔worker idle stall (see sibling issue).
Root cause (suspected)
The "owner completion decision" is delivered over the broker PTY / relaycast and is silently dropped (delivered≠flushed). No fallback path to completion exists, so the step never advances. Broker auto-restart (#10) does NOT fire because this is not a transport error (self-heal events = 0).
Proposed fix (any/all)
- Treat a PASSED step `verification` (file_exists / exit_code) as completion evidence even when the channel decision didn't arrive.
- Honor the underlying process exit as a completion signal for owner steps (workers already do this).
- Per-step inactivity watchdog (sibling issue) so a missed decision retries instead of hanging.
- Ack + redeliver the owner-completion message over relaycast.
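A minimal sketch of how the first three proposals could combine into one decision function. Every name here (`StepSignals`, `StepOutcome`, `resolveOwnerStep`, `inactivityLimitMs`) is hypothetical and is not taken from the relayflow engine. The precedence order is also an assumption: channel decision first, then passed verification, then process exit, then the inactivity watchdog.

```typescript
// Hypothetical shapes for illustration only; not the engine's real types.
type StepSignals = {
  channelDecision?: "complete" | "fail"; // owner completion decision, if it arrived
  verificationPassed: boolean;           // e.g. file_exists / exit_code check passed
  processExitCode?: number;              // undefined while the agent process is alive
  msSinceLastActivity: number;           // time since the last observed agent activity
};

type StepOutcome =
  | { status: "completed"; evidence: "channel" | "verification" | "process_exit" }
  | { status: "failed"; reason: string }
  | { status: "retry"; reason: string }
  | { status: "running" };

function resolveOwnerStep(s: StepSignals, inactivityLimitMs: number): StepOutcome {
  // 1. An explicit decision over the channel always wins when it arrives.
  if (s.channelDecision === "complete") {
    return { status: "completed", evidence: "channel" };
  }
  if (s.channelDecision === "fail") {
    return { status: "failed", reason: "owner reported failure" };
  }

  // 2. Fallback: a passed verification counts as completion evidence
  //    even though the channel decision never arrived.
  if (s.verificationPassed) {
    return { status: "completed", evidence: "verification" };
  }

  // 3. Fallback: honor the process exit, as worker steps already do.
  if (s.processExitCode !== undefined) {
    return s.processExitCode === 0
      ? { status: "completed", evidence: "process_exit" }
      : { status: "failed", reason: `agent exited with code ${s.processExitCode}` };
  }

  // 4. Watchdog: no decision, no evidence, and no activity for too long.
  if (s.msSinceLastActivity >= inactivityLimitMs) {
    return { status: "retry", reason: "inactivity limit exceeded without a decision" };
  }

  return { status: "running" };
}
```

With the signals from the codex evidence above (verification satisfied, no decision received), this function would return `completed` with `verification` as the evidence instead of leaving the step in `running`. A stricter variant might accept verification only once the agent process has exited or gone idle, to avoid completing a step whose files exist but whose owner is still working.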
Workaround in use
`wf.agent(name, { interactive: false })` → spawnAgent, which completes on process exit. Reliable, but it loses channel coordination, which defeats the point of interactive agents.