fix(pilot): surface failure reasons in terminal + break single-slot cascade + cross-worktree discover by iceglober · Pull Request #125 · iceglober/harness-opencode

iceglober · 2026-04-26T19:10:18Z

Pilot terminal failure visibility

Goal

Make pilot build failures diagnosable from the terminal output alone.
Today when a task fails, stderr says task.failed T1 in 306s and prints
10 task.blocked lines — no failure phase, no reason string, no session
ID, no worktree path. The operator has to sqlite3 into a per-repo
state.db that pilot status --run <id> itself can't find from another
worktree. Three compounding bugs make unattended runs effectively
undebuggable: (1) task.failed event payloads carry {phase, reason}
but the streaming logger and summary discard both; (2) the single-slot
WorktreePool poisons the next task on any failure, cascading every
downstream task to blocked; (3) discoverRun resolves state.db
relative to cwd, breaking cross-worktree lookups. After this plan
lands: failure phase + reason are visible inline, the summary carries a
per-failed-task detail block with session/worktree, blocked-cascade is
de-noised to one summary line, a pool preserve doesn't poison the rest
of the run, per-task SSE logs are captured to runs/<id>/tasks/<task>/session.jsonl
for post-mortem, and pilot status --run <id> works from any worktree.

The underlying opencode stall that triggered the 300s silence on
T1-AUDIT-DOC is explicitly out of scope — that's opencode-layer, not
pilot-layer. Per-task forensics make the NEXT such stall diagnosable.

Constraints

Per-repo state layout locked to ~/.glorious/opencode/<repo>/pilot/
(root AGENTS.md rule 10). No writes outside.
No schema migration on runs table — cross-repo discover is done via
filesystem scan, not by persisting origin repo on the run row.
Pilot builder/planner invariants (rule 10) are irrelevant here — this
work edits the pilot infrastructure itself, not the agents it invokes.
Tests required: every code change has a test; grep-guards on new
symbol names pin the contract surface.
Changesets mandatory for user-visible changes — this ships as a
patch bump with a user-facing CHANGELOG line.
Stay color-free in terminal output (matches existing printSummary
style). Scannability comes from indentation + blank-line separation.

Acceptance criteria

Forensics + pool (do first — data source for the rest)

src/pilot/paths.ts exports getTaskJsonlPath(cwd, runId, taskId)
returning <runDir>/tasks/<taskId>/session.jsonl, creating the parent
tasks/<taskId>/ directory on first access. Rejects unsafe taskIds
via the same character-class guard as isSafeRunId.
Worker captures every SSE event for a task's session to that
JSONL file (one JSON object per line). Subscription is task-scoped
(not per-attempt), teed from the existing deps.bus.on(sessionId, ...)
hook.
Every appendEvent({kind: "task.failed", payload}) in
src/pilot/worker/worker.ts carries a reason string matching the
one stored in tasks.last_error. No bare {phase}-only payloads
remain in any failure branch.
The stall branch's task.failed payload additionally carries
eventCount (events observed during the wait) and lastEventTs
(epoch ms of the last event, or null if zero events).
WorktreePool.acquire() after preserveOnFailure(slot) returns
a NEW slot (preserved: false, prepared: false, fresh path after
prepare). The preserved slot is tracked in a separate retiredSlots
list — still cleaned up by shutdown({ keepPreserved: false }),
still listed by inspect().
WorktreePool.prepare() on a retried slot appends -<counter>
to the path from worktreeDirOf(n) when retryCounter.get(n) > 0.
Worker count invariant (v0.1 = 1 concurrent) is preserved.

Terminal visibility (consumes the enriched data)

startStreamingLogger in src/pilot/cli/build.ts prints, after
the existing task.failed <id> in <s>s line, a continuation line of
the form → <phase>: <reason> (two-space indent, arrow prefix).
Tolerates payloads missing phase/reason by omitting the → line.
reason truncated to 200 chars with … if longer.
startStreamingLogger no longer prints one line per
task.blocked event. Instead it counts blocked events and emits a
single summary line on run.finished:
blocked: N task(s) waiting on failed dependency (<firstReason>).
printSummary accepts the open DB handle and, when the run has
failed/aborted tasks, prints a per-task detail block between the
counts line and the follow-up commands:
```
Failed tasks (N):

  <taskId>
    phase:    <phase>
    reason:   <reason>
    session:  <sessionId | "(none — failed before session.create)">
    worktree: <worktreePath | "(none)">
    elapsed:  <Ns>   attempts: <N>
```
Phase pulled from the latest task.failed event for that
(runId, taskId); reason falls back to last_error if the event
payload is missing it.
Successful runs render identically to today — no empty
Failed tasks (0): heading.

Cross-worktree discover

resolveBaseDir in src/pilot/paths.ts is exported (minimal
change — add export).
discoverRun({cwd, runId}) tries getStateDbPath(cwd, runId)
first (fast path), then falls back to scanning direct children of
<base> for <child>/pilot/runs/<runId>/state.db. Returns the
first hit. On miss, throws listing every path tried.
No-runId path (mtime-newest scan) unchanged.
pilot status / pilot logs / pilot cost inherit the fix
because they all route through discoverRun.

Release hygiene

New .changeset/pilot-terminal-failure-visibility.md with
"@glrs-dev/harness-opencode": patch frontmatter and a 1-2 sentence
user-facing body.

Verification

bun run typecheck green.
bun test green (all suites, including existing pilot tests).
bun run build green (dist emits without error).

File-level changes

src/pilot/paths.ts

Change: Export resolveBaseDir (add export keyword, no body
change). Add getTaskJsonlPath(cwd, runId, taskId) and matching
isSafeTaskId guard (same character class as isSafeRunId).
Why: Worker needs a canonical path for per-task SSE logs; discover
needs base-dir access for cross-repo scans.
Risk: none (additive exports; resolveBaseDir visibility change is
trivial).

src/pilot/worktree/pool.ts

Change: Add retiredSlots: WorktreeSlot[] and retryCounter: Map<number, number>
to WorktreePool. Rework acquire() to retire a preserved slot
into retiredSlots, bump the counter, and mint a fresh stub.
Rework prepare() so retried slots resolve to
<worktreeDirOf(n)>-<counter>. Update shutdown() and inspect()
to cover both live and retired slots.
Why: Eliminate the cascade-failure behaviour where one preserved
slot blocks all downstream tasks.
Risk: medium — pool semantics change. Mitigated by keeping the
existing workerCount=1 invariant, the existing
WorktreePool.prepare "preserved" throw (as defence in depth), and
the existing "preserved slots are not reusable" test (still
passes).

src/pilot/worker/worker.ts

Change: After session.create in runOneTask, open an append-mode
JSONL writer at getTaskJsonlPath(cwd, runId, taskId). Register a
task-scoped bus subscription that (a) appends each event as a JSON
line and (b) tracks lastEventTs + eventCount counters. Ensure
the subscription unsubscribes on every return path. Add reason: <string> to every appendEvent({kind: "task.failed", payload})
call — the reason local variable is already computed for
markFailedSafe; just carry it into the event. Stall path
additionally carries eventCount + lastEventTs.
Why: Enriches the event payload consumed by the streaming logger
and summary; persists forensic state for post-mortem of silent
stalls.
Risk: medium — worker changes are load-bearing. Additive only; no
existing event fields removed or renamed. New JSONL writes are
best-effort (try/catch; failure doesn't fail the task).

src/pilot/cli/build.ts

Change: Extend startStreamingLogger — on task.failed, emit a
continuation → <phase>: <reason> line when payload carries both
fields. On task.blocked, count instead of print. On
run.finished, emit the blocked-cascade summary line if any
occurred. Extend printSummary — accept db: Database in args;
when counts.failed + counts.aborted > 0, render the per-task
failure block between the counts line and the follow-up commands.
Update executeRun's call to printSummary to pass the db handle.
Why: Surfaces failure reasons at the terminal — the primary axis
of this plan.
Risk: low — output-format changes only; all gated on "did anything
fail" so success-path output is byte-identical to today.

src/pilot/cli/discover.ts

Change: Restructure the runId !== undefined branch. Fast-path
getStateDbPath(cwd, runId) unchanged. On miss, compute
resolveBaseDir(), fs.readdir its direct children, and check
<base>/<child>/pilot/runs/<runId>/state.db for each. Return the
first hit. On total miss, throw with all paths tried.
Why: pilot status --run <id> invoked from a different worktree
(or different repo) of the run that created it currently fails with
"no state.db" — this is the bug that prevented me from investigating
the failing run in the first place.
Risk: low — pure widening; fast path preserved.

test/pilot-paths.test.ts

Change: Add test cases for getTaskJsonlPath — creates directory,
returns expected path, rejects unsafe taskIds (../, leading dot).
Why: Lock the new contract surface.
Risk: none.

test/pilot-worker.test.ts

Change: Add test that when the fake bus emits two events then
stalls, the resulting task.failed event payload has
eventCount: 2, a non-null lastEventTs, and the JSONL file
contains exactly two parseable JSON lines.
Why: Pins the forensic capture + enriched payload contract.
Risk: none.

test/pilot-worktree-pool.test.ts

Change: Add describe("WorktreePool — retires preserved slot on re-acquire")
with test "retires preserved slot and mints fresh path":
acquire → prepare T1 → preserve → acquire again → expect fresh
stub → prepare T2 → expect success with -1-suffixed path →
inspect() shows two slots. Existing
"preserved slots are not reusable" test must still pass
(prepare on the preserved slot itself still throws).
Why: Regression guard for the cascade-failure fix.
Risk: none.

test/pilot-cli-build.test.ts

Change: Add three test cases to the existing
describe("startStreamingLogger", ...) block:
"prints phase and reason on task.failed",
"de-noises blocked cascade",
"tolerates task.failed without phase/reason".
Add new describe("printSummary — failure block", ...) that uses
openStateDb(":memory:"), seeds a run with one failed + one
aborted task, and asserts the rendered block matches the expected
format.
Why: Lock the terminal-visibility contract.
Risk: none.

test/pilot-cli-status.test.ts

Change: Add test
"resolves run from a different worktree of the same repo" —
uses GLORIOUS_PILOT_DIR env override to build a <base> with two
repo-folder subdirs, drops a state.db in one, chdirs into a dir
whose getRepoFolder resolves to the other, expects discoverRun
to find it.
Why: Regression guard for cross-worktree lookup.
Risk: none.

.changeset/pilot-terminal-failure-visibility.md

Change: New file with "@glrs-dev/harness-opencode": patch
frontmatter and a 1-2 sentence body describing the user-visible
fix (pilot build now prints failure phase/reason, summary carries
per-task detail, blocked cascade de-noised, pilot status --run <id> works from any worktree).
Why: Mandatory for release (root AGENTS.md changesets section).
Risk: none.

Test plan

bun test test/pilot-paths.test.ts — covers getTaskJsonlPath.
bun test test/pilot-worktree-pool.test.ts — covers the retire-on-preserve behaviour; existing tests must still pass.
bun test test/pilot-worker.test.ts — covers enriched task.failed payload + JSONL capture.
bun test test/pilot-cli-build.test.ts — covers streaming-logger failure detail + blocked de-noise + printSummary failure block.
bun test test/pilot-cli-status.test.ts — covers cross-worktree discover.
bun test (full suite) — final gate.
bun run typecheck — after every surface touched.
bun run build — before shipping; ensures dist/ emits cleanly.

Out of scope

The opencode-layer root cause of T1-AUDIT-DOC's 300s silence.
Worktree preserved at /Users/austinhess/.glorious/opencode/kn-eng/pilot/worktrees/01KQ5GD4EZ0EWZT5W7YJCDA4A5/00/
for separate investigation; this plan makes NEXT such stall
diagnosable, doesn't chase this one.
A --repo <path> override flag on pilot status / pilot logs.
User picked cross-repo scan (option A); this path stays open for
future work if scan performance becomes an issue (it won't — ULIDs
are cheap).
Schema migration to persist origin repo on the runs table. Cross-repo
scan is cheaper and ULIDs prevent collisions.
ANSI color in terminal output. Matches existing style.
Multi-worker pool support (v0.3 work).
Resolving the pre-existing file-overlap warnings flagged by
pilot validate (they're static globs, not race warnings).

Open questions

None. All design decisions resolved in the plan above.

…ascade + cross-worktree discover When a task fails in 'pilot build', make the reason visible on stderr and in the run summary — no more 'task.failed T1 in 306s' followed by 10 bare 'task.blocked' lines with no signal about what actually broke. - startStreamingLogger prints ' → <phase>: <reason>' beneath each task.failed event. Blocked cascade de-noised to one summary line on run.finished ('blocked: N task(s) waiting on failed dependency (...)'). - printSummary accepts the DB handle and renders a per-failed-task block (phase, reason, session, worktree, elapsed, attempts) between the counts line and the follow-up commands. Successful runs render identically to before. - WorktreePool retires preserved-on-failure slots into a separate retiredSlots list on next acquire() and mints a fresh stub with a '-<counter>'-suffixed path. One failure no longer cascade-blocks the whole run. - runOneTask's every task.failed event payload now carries a 'reason' string matching tasks.last_error. Stall payloads additionally carry eventCount + lastEventTs. Per-task session.jsonl is written to <runDir>/tasks/<taskId>/session.jsonl (the path AGENTS.md already documents) for post-mortem. - discoverRun tries the cwd-scoped path first, then falls back to scanning '<base>/<repoFolder>/pilot/runs/<runId>/state.db' across every repo folder. 'pilot status --run <id>' / 'pilot logs --run <id>' now work from any worktree.

…tests Bun's bundled SQLite on macOS CI rejects `1_700_000_000_000` as an unrecognized token when it appears inside the VALUES list of an INSERT statement. My local Bun tolerated it but GitHub Actions macos-latest does not, so the printSummary failure-block tests failed with "SQLiteError: unrecognized token" while Ubuntu (and my machine) passed. Numeric separators in JS parameter-binding arrays are fine (JS parser owns those); they only break when baked into the SQL string. Replaced all three in-SQL occurrences with the plain numeric literals.

iceglober added 2 commits April 26, 2026 12:05

iceglober merged commit f4c6905 into main Apr 26, 2026
2 checks passed

harness-opencode-ci Bot mentioned this pull request Apr 26, 2026

chore: version packages #126

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(pilot): surface failure reasons in terminal + break single-slot cascade + cross-worktree discover#125

fix(pilot): surface failure reasons in terminal + break single-slot cascade + cross-worktree discover#125
iceglober merged 2 commits into
mainfrom
fix/pilot-terminal-failure-visibility

iceglober commented Apr 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

iceglober commented Apr 26, 2026

Pilot terminal failure visibility

Goal

Constraints

Acceptance criteria

Forensics + pool (do first — data source for the rest)

Terminal visibility (consumes the enriched data)

Cross-worktree discover

Release hygiene

Verification

File-level changes

src/pilot/paths.ts

src/pilot/worktree/pool.ts

src/pilot/worker/worker.ts

src/pilot/cli/build.ts

src/pilot/cli/discover.ts

test/pilot-paths.test.ts

test/pilot-worker.test.ts

test/pilot-worktree-pool.test.ts

test/pilot-cli-build.test.ts

test/pilot-cli-status.test.ts

.changeset/pilot-terminal-failure-visibility.md

Test plan

Out of scope

Open questions

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant