fix(pilot): restore agent registration + SSE directory scope #128

Merged: iceglober merged 2 commits into main from fix/pilot-stall-timeout-and-streaming on Apr 26, 2026

@iceglober

Owner

Summary

Every pilot build since v0.16.x has stalled at exactly 5min with "0 events, last none". A diagnostic run revealed that the real cause is not a timeout: two compounding bugs each dropped every session-level SSE event before the bus saw it.

Bug 1: Pilot agents weren't registered in the spawned opencode server

opencode serve (spawned by the SDK's createOpencodeServer) does not load external plugins — only the interactive opencode TUI does. Verified: opencode serve --print-logs --log-level DEBUG shows zero service=plugin lines. So when the worker called promptAsync({ agent: "pilot-builder" }), the server accepted the request but no such agent was registered. The prompt silently no-opped; the session sat idle until the stall timer fired.

Fix: build a minimal Config containing only pilot-builder and pilot-planner (full permissions, prompts, model) and pass it via createOpencodeServer({ config }). The SDK forwards this via OPENCODE_CONFIG_CONTENT; opencode's config loader merges it with user config. Non-pilot agents (prime, qa-reviewer, etc.) stay out of the injection so user overrides remain in control.
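The selection step can be sketched roughly as follows. This is an illustration under assumptions, not the code that landed: the `AgentConfig` shape, the `permission` field, and the placeholder prompts and model names are hypothetical, and the real opencode `Config` type has more fields.

```typescript
// Hypothetical, simplified agent shape (not the SDK's actual type).
type AgentConfig = {
  mode: "subagent" | "primary";
  prompt: string;
  model: string;
  permission: Record<string, "allow" | "ask" | "deny">;
};

type PilotServerConfig = { agent: Record<string, AgentConfig> };

// Only these two agents are injected; everything else stays user-controlled.
const PILOT_AGENTS = ["pilot-builder", "pilot-planner"] as const;

// Placeholder registry standing in for the plugin's full agent list.
const ALL_AGENTS: Record<string, AgentConfig> = {
  "pilot-builder": { mode: "subagent", prompt: "builder prompt", model: "example/model", permission: { edit: "allow" } },
  "pilot-planner": { mode: "subagent", prompt: "planner prompt", model: "example/model", permission: { edit: "allow" } },
  prime: { mode: "primary", prompt: "prime prompt", model: "example/model", permission: { edit: "ask" } },
};

function buildPilotServerConfig(all: Record<string, AgentConfig>): PilotServerConfig {
  const agent: Record<string, AgentConfig> = {};
  for (const name of PILOT_AGENTS) {
    const def = all[name];
    // Fail loudly rather than letting a prompt silently no-op later.
    if (!def || def.prompt.trim() === "") {
      throw new Error(`missing or empty definition for ${name}`);
    }
    agent[name] = def;
  }
  return { agent };
}

const config = buildPilotServerConfig(ALL_AGENTS);
```

The resulting object is what would be handed to `createOpencodeServer({ config })`. Because `prime` never enters the injected config, any user override of it is left untouched.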

Bug 2: EventBus subscription had no directory scope

opencode's /event SSE endpoint scopes message.updated, message.part.updated, session.idle, session.status, etc. by the subscriber's declared directory — exact match, not prefix. With no directory, the subscription received only server-wide events (server.heartbeat, server.connected, file.watcher.updated). Every session event the server published was silently discarded before the bus ever saw it.
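A small model of that delivery rule makes the failure easier to see. The `isDelivered` function and the example paths are hypothetical; this models the behavior described above and is not opencode's actual filtering code.

```typescript
// Server-wide events carry no directory; session-level events carry one.
type ServerEvent = { type: string; directory?: string };

// Modeled rule: unscoped events always pass, scoped events need an exact match.
function isDelivered(event: ServerEvent, subscriberDirectory?: string): boolean {
  if (event.directory === undefined) return true;
  return event.directory === subscriberDirectory;
}

const published: ServerEvent[] = [
  { type: "server.heartbeat" },
  { type: "message.updated", directory: "/repo/worktrees/T1" },
  { type: "session.idle", directory: "/repo/worktrees/T1" },
];

const countDelivered = (dir?: string): number =>
  published.filter((e) => isDelivered(e, dir)).length;

const unscoped = countDelivered();                      // heartbeat only
const parentScoped = countDelivered("/repo/worktrees"); // a parent path does not match
const exactScoped = countDelivered("/repo/worktrees/T1"); // all three events
```

In this model the unscoped and parent-scoped subscribers both see only the heartbeat, which matches the "heartbeats but no session events" symptom.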

Verified: 15s window over a live pilot-builder session with no directory → 2 events. With the session's exact directory → 27 events including session.idle.

Fix: EventBus constructor now takes a directory parameter and passes query: { directory } to client.event.subscribe. Because opencode's scope is exact-match (not prefix), a single bus scoped to a parent directory would still miss per-task events — so WorkerDeps.bus became WorkerDeps.busFactory: (directory: string) => EventBus. runOneTask creates a fresh bus per task scoped to the task's own worktree path, and closes it at teardown.
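The new deps shape might look roughly like this. It is a sketch under assumptions: `EventClient` is a hypothetical stand-in for the SDK client, and the `start`, `close`, and `isClosed` members are illustrative names, not the real `EventBus` API.

```typescript
// Hypothetical minimal client surface (the real SDK client is richer).
type SubscribeOptions = { query?: { directory: string } };
type EventClient = {
  event: { subscribe(options?: SubscribeOptions): Promise<void> };
};

class EventBus {
  private readonly client: EventClient;
  private readonly directory: string | undefined;
  private closed = false;

  constructor(client: EventClient, directory?: string) {
    this.client = client;
    this.directory = directory;
  }

  async start(): Promise<void> {
    // Send the scope only when a directory was supplied; otherwise send no query.
    const options = this.directory !== undefined
      ? { query: { directory: this.directory } }
      : undefined;
    await this.client.event.subscribe(options);
  }

  close(): void {
    this.closed = true;
  }

  get isClosed(): boolean {
    return this.closed;
  }
}

type WorkerDeps = { busFactory: (directory: string) => EventBus };

// One bus per task, scoped to the task's own worktree, closed at teardown.
async function runOneTask(
  deps: WorkerDeps,
  worktreePath: string,
  work: () => Promise<void>,
): Promise<EventBus> {
  const bus = deps.busFactory(worktreePath);
  try {
    await bus.start();
    await work();
  } finally {
    bus.close();
  }
  return bus;
}
```

Passing a factory rather than an instance keeps the worker unaware of how the client is built, while still letting each task pick its own exact directory.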

Secondary changes

  • Default stallMs raised 5min → 60min. The old default was calibrated against the broken stream; with events actually flowing, legitimate inter-event gaps during deep subagent work (plan delegation, long tool chains, verify sub-runs) can exceed 5min. User-supplied stallMs still wins.
  • scripts/diag-pilot.sh now uses bun instead of node to run the built CLI (the dist imports bun:sqlite).
  • New PILOT_EVENT_LOG env var: when set to a writable path, EventBus dumps every raw SSE event (with extracted sessionID, live subscriber IDs, matched-subscriber count) as JSONL to that path. Zero overhead when unset; useful the next time a bug of this kind needs debugging.
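The stallMs precedence from the first bullet can be illustrated with a minimal sketch. `resolveStallMs` and `isStalled` are hypothetical helper names, and the real stall timer may be structured differently.

```typescript
// New default: 60 minutes, expressed in milliseconds.
const DEFAULT_STALL_MS = 60 * 60 * 1000;

// A user-supplied value always wins over the default.
function resolveStallMs(userStallMs?: number): number {
  return userStallMs ?? DEFAULT_STALL_MS;
}

// Assumed stall rule: no event has arrived for at least stallMs.
function isStalled(lastEventAt: number, now: number, stallMs: number): boolean {
  return now - lastEventAt >= stallMs;
}

// A 6-minute quiet gap during deep subagent work:
const sixMinutes = 6 * 60 * 1000;
const stalledUnderNewDefault = isStalled(0, sixMinutes, resolveStallMs());
const stalledUnderOldDefault = isStalled(0, sixMinutes, resolveStallMs(5 * 60 * 1000));
```

Under the old 5min value the 6-minute gap counts as a stall; under the new default it does not.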

Regression tests

  • EventBus — directory scoping (3 new): lock the subscribe-call contract — when bus has a directory, query.directory MUST appear; when it doesn't, no query. Protects against a regression that would re-introduce the 5min stall.
  • buildPilotServerConfig (4 new): locks the injected-agents contract — must include pilot-builder + pilot-planner with subagent mode and non-empty prompts; must NOT include other agents.
  • Updated 11 existing pilot-worker.test.ts cases to pass busFactory instead of bus.

Full suite: 943 tests pass (was 936).

End-to-end verification

Re-ran the same rule-engine-refocus plan on kn-eng that had been stalling:

pilot build: run 01KQ5YF1FFVBGNP4KSHP12M6HD started (19 tasks)
[15:27:22] task.started T1-AUDIT-DOC
[15:28:54] task.verify.failed T1-AUDIT-DOC
[15:29:04] task.verify.failed T1-AUDIT-DOC
[15:29:09] task.stopped T1-AUDIT-DOC (builder STOP)
[15:29:09] task.started T3-RULE-ENTITIES

439 events captured (vs 38 heartbeats pre-fix). T3's session.jsonl contains 209 lines of real agent activity. The STOP protocol fires correctly on verify-fail. Zero stalls.

Changeset

.changeset/pilot-fix-sse-directory-and-agent-registration.md — patch bump.

When PILOT_EVENT_LOG is set to a writable path, EventBus.runStream dumps
one JSON line per event observed on the SSE stream BEFORE sessionID
extraction or filtering. Each line includes the raw event, the extracted
sessionID, the live subscriber session IDs at that moment, and the
number of matched subscribers. Unset in normal operation → zero overhead.
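A rough sketch of such a logger follows. The record field names, the `extractSessionID` lookup, and the injected `append` callback are all assumptions made for illustration; the actual `EventBus.runStream` code may differ.

```typescript
type RawEvent = { type: string; properties?: Record<string, unknown> };

// Hypothetical record layout: one of these per JSONL line.
type EventLogRecord = {
  event: RawEvent;
  sessionID: string | null;
  subscribers: string[];
  matched: number;
};

// Assumed extraction: reads properties.sessionID; real events may nest it elsewhere.
function extractSessionID(event: RawEvent): string | null {
  const id = event.properties?.["sessionID"];
  return typeof id === "string" ? id : null;
}

type EventLogger = (event: RawEvent, subscribers: string[]) => void;

function makeEventLogger(
  path: string | undefined,
  append: (path: string, line: string) => void,
): EventLogger {
  // Unset path: return a no-op so normal runs do no extra work.
  if (!path) return (_event: RawEvent, _subscribers: string[]): void => {};
  return (event: RawEvent, subscribers: string[]): void => {
    const sessionID = extractSessionID(event);
    const record: EventLogRecord = {
      event,
      sessionID,
      subscribers,
      matched: subscribers.filter((s) => s === sessionID).length,
    };
    append(path, JSON.stringify(record) + "\n");
  };
}
```

In a real run the logger could be created with `makeEventLogger(process.env.PILOT_EVENT_LOG, appendFileSync)` using `appendFileSync` from `node:fs`. A line with a non-null sessionID but `matched: 0` would point at an extraction mismatch rather than a silent stream.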

Motivation: the 0.16.1 Forensics path writes session.jsonl with events
the bus has already filtered to the task's sessionID. When a task stalls
with "0 events" in session.jsonl, we currently can't tell whether the
SSE stream is silent, or whether the bus is dropping events due to a
sessionID-extraction mismatch. This diagnostic distinguishes the two.

Also ships scripts/diag-pilot.sh — a wrapper that sets
PILOT_EVENT_LOG=/tmp/pilot-events.jsonl, truncates it, then exec's the
locally-built pilot CLI via `bun dist/cli.js pilot <args>`. Used for
one-shot diagnostic runs without `bun link` / global install.

Next step: run the diagnostic against a stalling pilot build, inspect
the JSONL to identify the event-filter root cause, then land the real
fix (sessionID extraction + default stallMs raise + live activity
streaming).

Every pilot build since v0.16.x has stalled at exactly 5min with "0
events, last none". Diagnostic run (PILOT_EVENT_LOG=/tmp/pilot.jsonl)
revealed the real cause: not a timeout, but two compounding bugs that
both dropped every session-level SSE event on the floor.

Bug 1 — pilot agents weren't registered in the spawned opencode server.
The pilot calls `createOpencodeServer` from the opencode SDK, which
spawns `opencode serve`. Empirically verified: `opencode serve
--print-logs --log-level DEBUG` shows zero `service=plugin` lines —
`serve` does NOT load external plugins, only the interactive TUI does.
So when the worker sent `promptAsync({ agent: "pilot-builder" })`,
the server accepted the request but no such agent was registered; the
prompt silently no-opped, and the session sat idle until the stall
timer fired.

Fix: build a minimal `Config` containing only `pilot-builder` and
`pilot-planner` (with full permissions, prompts, model) and pass it
via `createOpencodeServer({ config })`. The SDK forwards this via
`OPENCODE_CONFIG_CONTENT` env var, which opencode's config loader
merges with user config files. Non-pilot agents (prime, qa-reviewer,
etc.) stay out of the injection so user overrides remain in control.

Bug 2 — EventBus subscription had no directory scope, so the SSE
endpoint filtered out every session-level event.
opencode's `/event` endpoint scopes `message.updated`,
`message.part.updated`, `session.idle`, `session.status`, etc. by the
subscriber's declared directory — **exact match, not prefix**. With
no directory, the subscription received only server-wide events
(server.heartbeat, server.connected, file.watcher.updated). Every
session event the server published was silently discarded before the
bus ever saw it. Verified empirically: 15s window over a live
pilot-builder session with no directory → 2 events; with the session's
exact directory → 27 events including session.idle.

Fix: `EventBus` constructor takes a `directory` parameter and passes
`query: { directory }` to `client.event.subscribe`. Because opencode's
scope is exact-match, a single bus scoped to the run's worktrees-parent
would still miss per-task events — so `WorkerDeps` now takes a
`busFactory: (directory: string) => EventBus` instead of a single
`bus` instance. `runOneTask` creates a fresh bus per task scoped to
the task's own worktree path, and closes it at teardown.

Secondary changes:

- Default `stallMs` raised 5min → 60min. The old default was
  calibrated against a broken stream where no events ever arrived;
  with events actually flowing, legitimate inter-event gaps during
  deep subagent work (plan delegation, long tool chains, verify
  sub-runs) can exceed 5min. User-supplied `stallMs` still wins.

- `scripts/diag-pilot.sh` now uses `bun` instead of `node` to run the
  built CLI (the dist imports `bun:sqlite` which node can't resolve).

Tests:

- `EventBus — directory scoping` (3 new): lock the subscribe-call
  contract — when bus has a directory, query.directory MUST appear;
  when it doesn't, no query. Protects against a regression that would
  re-introduce the 5min stall.

- `buildPilotServerConfig` (4 new): locks the injected-agents
  contract — must include pilot-builder + pilot-planner with
  subagent mode and non-empty prompts; must NOT include other agents
  (user-override surface).

- Updated 11 existing `pilot-worker.test.ts` cases to pass
  `busFactory: (() => bus) as never` instead of `bus: bus as never`
  to match the new deps shape.

Full suite: 943 tests pass (was 936).

End-to-end verified: rule-engine-refocus plan on kn-eng now runs
actual task work — first task ran 1m44s of real pilot-builder
activity (tool calls, verify cycles, 209-line session.jsonl),
correctly emitted STOP protocol on verify-fail. Zero stalls.
@iceglober iceglober merged commit 3f34e76 into main Apr 26, 2026
2 checks passed