fix(pilot): restore agent registration + SSE directory scope #128
Merged
When PILOT_EVENT_LOG is set to a writable path, EventBus.runStream dumps one JSON line per event observed on the SSE stream BEFORE sessionID extraction or filtering. Each line includes the raw event, the extracted sessionID, the live subscriber session IDs at that moment, and the number of matched subscribers. Unset in normal operation → zero overhead.

Motivation: the 0.16.1 Forensics path writes session.jsonl with events the bus has already filtered to the task's sessionID. When a task stalls with "0 events" in session.jsonl, we currently can't tell whether the SSE stream is silent, or whether the bus is dropping events due to a sessionID-extraction mismatch. This diagnostic answers both.

Also ships scripts/diag-pilot.sh, a wrapper that sets PILOT_EVENT_LOG=/tmp/pilot-events.jsonl, truncates it, then exec's the locally-built pilot CLI via `bun dist/cli.js pilot <args>`. Used for one-shot diagnostic runs without `bun link` / global install.

Next step: run the diagnostic against a stalling pilot build, inspect the JSONL to identify the event-filter root cause, then land the real fix (sessionID extraction + default stallMs raise + live activity streaming).
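For readers who want a concrete picture of the diagnostic, here is a minimal sketch of a pre-filter event logger. It is not the code from this PR: the `RawEvent` shape, the `extractSessionID` rule, the record field names, and the injected `append` callback are all assumptions made for illustration.

```typescript
// Hypothetical shapes; the real event and subscriber types are not shown in this PR.
type RawEvent = { type: string; properties?: Record<string, unknown> };
type AppendLine = (path: string, line: string) => void;
type EventLogger = (event: RawEvent, subscribers: string[]) => void;

// Assumed extraction rule, for illustration only: use properties.sessionID if it is a string.
function extractSessionID(event: RawEvent): string | undefined {
  const id = event.properties?.sessionID;
  return typeof id === "string" ? id : undefined;
}

// Returns a logger that records every event before any filtering happens.
// When logPath is undefined or empty, the returned function does nothing.
function makeEventLogger(logPath: string | undefined, append: AppendLine): EventLogger {
  if (!logPath) {
    return () => {};
  }
  return (event, subscribers) => {
    const sessionID = extractSessionID(event);
    const record = {
      event,
      sessionID: sessionID ?? null,
      subscribers,
      // Number of live subscribers whose session ID equals the extracted one.
      matched: subscribers.filter((id) => id === sessionID).length,
    };
    append(logPath, JSON.stringify(record) + "\n");
  };
}
```

In a real process the two arguments would plausibly be `process.env.PILOT_EVENT_LOG` and a thin wrapper around `appendFileSync` from `node:fs`. Injecting the writer keeps the sketch free of I/O and easy to test.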
Every pilot build since v0.16.x has stalled at exactly 5min with "0
events, last none". Diagnostic run (PILOT_EVENT_LOG=/tmp/pilot.jsonl)
revealed the real cause: not a timeout, but two compounding bugs that
both dropped every session-level SSE event on the floor.
Bug 1 — pilot agents weren't registered in the spawned opencode server.
The pilot calls `createOpencodeServer` from the opencode SDK, which
spawns `opencode serve`. Empirically verified: `opencode serve
--print-logs --log-level DEBUG` shows zero `service=plugin` lines —
`serve` does NOT load external plugins, only the interactive TUI does.
So when the worker sent `promptAsync({ agent: "pilot-builder" })`,
the server accepted the request but no such agent was registered; the
prompt silently no-opped, and the session sat idle until the stall
timer fired.
Fix: build a minimal `Config` containing only `pilot-builder` and
`pilot-planner` (with full permissions, prompts, model) and pass it
via `createOpencodeServer({ config })`. The SDK forwards this via
`OPENCODE_CONFIG_CONTENT` env var, which opencode's config loader
merges with user config files. Non-pilot agents (prime, qa-reviewer,
etc.) stay out of the injection so user overrides remain in control.
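As a rough illustration of the injected config, the sketch below builds an object containing only the two pilot agents. The `AgentEntry` and `PilotServerConfig` types are local stand-ins rather than the SDK's `Config` type, the field names are assumptions, and permissions are omitted for brevity; only the agent names, the subagent mode, and the non-empty-prompt rule come from this PR.

```typescript
// Local stand-ins for the SDK's Config type; field names are assumptions.
type AgentEntry = { mode: "subagent"; model: string; prompt: string };
type PilotServerConfig = { agent: Record<string, AgentEntry> };

// Only these two agents are injected; everything else stays under user control.
const PILOT_AGENTS = ["pilot-builder", "pilot-planner"] as const;
type PilotAgentName = (typeof PILOT_AGENTS)[number];

function buildPilotServerConfig(
  prompts: Record<PilotAgentName, string>,
  model: string,
): PilotServerConfig {
  const agent: Record<string, AgentEntry> = {};
  for (const name of PILOT_AGENTS) {
    const prompt = prompts[name];
    // An empty prompt would register an agent that cannot do useful work.
    if (prompt.trim() === "") {
      throw new Error(`empty prompt for ${name}`);
    }
    agent[name] = { mode: "subagent", model, prompt };
  }
  return { agent };
}
```

The result would then be handed to `createOpencodeServer({ config })` as described above. The model string is whatever the caller supplies; the sketch does not validate it.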
Bug 2 — EventBus subscription had no directory scope, so the SSE
endpoint filtered out every session-level event.
opencode's `/event` endpoint scopes `message.updated`,
`message.part.updated`, `session.idle`, `session.status`, etc. by the
subscriber's declared directory — **exact match, not prefix**. With
no directory, the subscription received only server-wide events
(server.heartbeat, server.connected, file.watcher.updated). Every
session event the server published was silently discarded before the
bus ever saw it. Verified empirically: 15s window over a live
pilot-builder session with no directory → 2 events; with the session's
exact directory → 27 events including session.idle.
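The scoping behavior can be captured in a toy model. This is not opencode's code: the `ScopedEvent` shape and the paths are hypothetical, and the sketch assumes that server-wide events carry no directory and reach every subscriber.

```typescript
// Toy model of the scoping described above; not opencode's actual implementation.
type ScopedEvent = { type: string; directory?: string };

function isDelivered(event: ScopedEvent, subscriberDirectory?: string): boolean {
  // Assumption: server-wide events have no directory and are always delivered.
  if (event.directory === undefined) return true;
  // Session-level events require an exact directory match, not a prefix match.
  return event.directory === subscriberDirectory;
}

// Hypothetical worktree path, used only for the example.
const idle: ScopedEvent = { type: "session.idle", directory: "/repo/worktrees/T3" };

const unscoped = isDelivered(idle);                        // no directory declared
const parentScoped = isDelivered(idle, "/repo/worktrees"); // parent directory only
const exactScoped = isDelivered(idle, "/repo/worktrees/T3");
```

Under this model only `exactScoped` is true, which is why neither an unscoped subscription nor one scoped to a parent directory sees the session's events.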
Fix: `EventBus` constructor takes a `directory` parameter and passes
`query: { directory }` to `client.event.subscribe`. Because opencode's
scope is exact-match, a single bus scoped to the run's worktrees-parent
would still miss per-task events — so `WorkerDeps` now takes a
`busFactory: (directory: string) => EventBus` instead of a single
`bus` instance. `runOneTask` creates a fresh bus per task scoped to
the task's own worktree path, and closes it at teardown.
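The per-task lifecycle might look roughly like the following. `EventBusLike`, `Task`, and the `work` callback are simplified stand-ins introduced for this sketch; only the `busFactory: (directory: string) => EventBus` shape and the create-then-close behavior come from the description above.

```typescript
// Simplified stand-ins; the real EventBus and task types are not shown in this PR.
interface EventBusLike {
  close(): void;
}
type WorkerDeps = { busFactory: (directory: string) => EventBusLike };
type Task = { id: string; worktreePath: string };

async function runOneTask(
  deps: WorkerDeps,
  task: Task,
  work: (bus: EventBusLike) => Promise<void>,
): Promise<void> {
  // One bus per task, scoped to the task's own worktree so exact-match scoping hits.
  const bus = deps.busFactory(task.worktreePath);
  try {
    await work(bus);
  } finally {
    // Teardown runs even if the task throws, so subscriptions do not leak.
    bus.close();
  }
}
```

Passing a factory instead of an instance is also what lets tests substitute `() => bus` and keep using a single fake.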
Secondary changes:
- Default `stallMs` raised 5min → 60min. The old default was
calibrated against a broken stream where no events ever arrived;
with events actually flowing, legitimate inter-event gaps during
deep subagent work (plan delegation, long tool chains, verify
sub-runs) can exceed 5min. User-supplied `stallMs` still wins.
- `scripts/diag-pilot.sh` now uses `bun` instead of `node` to run the
built CLI (the dist imports `bun:sqlite` which node can't resolve).
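The default-versus-override rule for `stallMs` can be sketched in a few lines. The helper names `resolveStallMs` and `isStalled` are hypothetical; only the 60min default and the fact that a user-supplied value wins come from this PR.

```typescript
// New default from this PR: 60 minutes, expressed in milliseconds.
const DEFAULT_STALL_MS = 60 * 60 * 1000;

// Hypothetical helper: a user-supplied value always wins over the default.
function resolveStallMs(userStallMs?: number): number {
  return userStallMs ?? DEFAULT_STALL_MS;
}

// Hypothetical helper: a task counts as stalled once the gap since the last
// observed event reaches the threshold.
function isStalled(lastEventAtMs: number, nowMs: number, stallMs: number): boolean {
  return nowMs - lastEventAtMs >= stallMs;
}
```

With these definitions a six-minute quiet period is a stall under the old 5min threshold but not under the new default.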
Tests:
- `EventBus — directory scoping` (3 new): lock the subscribe-call
contract — when bus has a directory, query.directory MUST appear;
when it doesn't, no query. Protects against a regression that would
re-introduce the 5min stall.
- `buildPilotServerConfig` (4 new): locks the injected-agents
contract — must include pilot-builder + pilot-planner with
subagent mode and non-empty prompts; must NOT include other agents
(user-override surface).
- Updated 11 existing `pilot-worker.test.ts` cases to pass
`busFactory: (() => bus) as never` instead of `bus: bus as never`
to match the new deps shape.
Full suite: 943 tests pass (was 936).
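To make the subscribe-call contract concrete, here is a self-contained sketch that uses a recording fake in place of the SDK client. `MiniBus`, `FakeClient`, and `recordingClient` are invented for this example and are not the project's test code. The contract itself (a `query.directory` when the bus has a directory, no query otherwise) is taken from the test description above.

```typescript
// Invented stand-ins for the SDK client and the bus.
type SubscribeOptions = { query?: { directory: string } };
type FakeClient = { event: { subscribe(options?: SubscribeOptions): void } };

class MiniBus {
  private readonly client: FakeClient;
  private readonly directory: string | undefined;

  constructor(client: FakeClient, directory?: string) {
    this.client = client;
    this.directory = directory;
  }

  start(): void {
    if (this.directory !== undefined) {
      // Scoped bus: the directory must be forwarded as a query parameter.
      this.client.event.subscribe({ query: { directory: this.directory } });
    } else {
      // Unscoped bus: no query at all.
      this.client.event.subscribe();
    }
  }
}

// Fake client that records the arguments of each subscribe call.
function recordingClient(): { client: FakeClient; calls: (SubscribeOptions | undefined)[] } {
  const calls: (SubscribeOptions | undefined)[] = [];
  const client: FakeClient = {
    event: {
      subscribe: (options?: SubscribeOptions): void => {
        calls.push(options);
      },
    },
  };
  return { client, calls };
}
```

A test then starts one bus with a directory and one without, and inspects `calls` to confirm which arguments reached `subscribe`.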
End-to-end verified: rule-engine-refocus plan on kn-eng now runs
actual task work — first task ran 1m44s of real pilot-builder
activity (tool calls, verify cycles, 209-line session.jsonl),
correctly emitted STOP protocol on verify-fail. Zero stalls.
Summary
Every pilot build since v0.16.x has stalled at exactly 5min with `0 events, last none`. A diagnostic run revealed that the real cause is not a timeout: two compounding bugs both dropped every session-level SSE event before the bus saw it.

Bug 1: Pilot agents weren't registered in the spawned opencode server

`opencode serve` (spawned by the SDK's `createOpencodeServer`) does not load external plugins; only the interactive `opencode` TUI does. Verified: `opencode serve --print-logs --log-level DEBUG` shows zero `service=plugin` lines. So when the worker called `promptAsync({ agent: "pilot-builder" })`, the server accepted the request but no such agent was registered. The prompt silently no-opped; the session sat idle until the stall timer fired.

Fix: build a minimal `Config` containing only `pilot-builder` and `pilot-planner` (full permissions, prompts, model) and pass it via `createOpencodeServer({ config })`. The SDK forwards this via `OPENCODE_CONFIG_CONTENT`; opencode's config loader merges it with user config. Non-pilot agents (prime, qa-reviewer, etc.) stay out of the injection so user overrides remain in control.

Bug 2: EventBus subscription had no directory scope

opencode's `/event` SSE endpoint scopes `message.updated`, `message.part.updated`, `session.idle`, `session.status`, etc. by the subscriber's declared directory (exact match, not prefix). With no directory, the subscription received only server-wide events (`server.heartbeat`, `server.connected`, `file.watcher.updated`). Every session event the server published was silently discarded before the bus ever saw it.

Verified: 15s window over a live pilot-builder session with no directory → 2 events. With the session's exact directory → 27 events including `session.idle`.

Fix: `EventBus` constructor now takes a `directory` parameter and passes `query: { directory }` to `client.event.subscribe`. Because opencode's scope is exact-match (not prefix), a single bus scoped to a parent directory would still miss per-task events, so `WorkerDeps.bus` became `WorkerDeps.busFactory: (directory: string) => EventBus`. `runOneTask` creates a fresh bus per task scoped to the task's own worktree path, and closes it at teardown.

Secondary changes

- Default `stallMs` raised 5min → 60min. The old default was calibrated against the broken stream; with events actually flowing, legitimate inter-event gaps during deep subagent work (plan delegation, long tool chains, verify sub-runs) can exceed 5min. User-supplied `stallMs` still wins.
- `scripts/diag-pilot.sh` now uses `bun` instead of `node` to run the built CLI (the dist imports `bun:sqlite`).
- `PILOT_EVENT_LOG` env var: when set to a writable path, EventBus dumps every raw SSE event (with extracted sessionID, live subscriber IDs, matched-subscriber count) as JSONL to that path. Zero overhead when unset; instrumental for the next debug of this kind.

Regression tests

- `EventBus — directory scoping` (3 new): lock the subscribe-call contract. When the bus has a directory, `query.directory` MUST appear; when it doesn't, no query. Protects against a regression that would re-introduce the 5min stall.
- `buildPilotServerConfig` (4 new): lock the injected-agents contract. Must include `pilot-builder` + `pilot-planner` with `subagent` mode and non-empty prompts; must NOT include other agents.
- Updated 11 existing `pilot-worker.test.ts` cases to pass `busFactory` instead of `bus`.

Full suite: 943 tests pass (was 936).

End-to-end verification

Re-ran the same `rule-engine-refocus` plan on `kn-eng` that had been stalling: 439 events captured (vs 38 heartbeats pre-fix). T3's `session.jsonl` = 209 lines of real agent activity. STOP protocol fires correctly on verify-fail. Zero stalls.

Changeset

`.changeset/pilot-fix-sse-directory-and-agent-registration.md`: patch bump.