Skip to content

fix(pilot): surface failure reasons in terminal + break single-slot cascade + cross-worktree discover#125

Merged
iceglober merged 2 commits into
mainfrom
fix/pilot-terminal-failure-visibility
Apr 26, 2026
Merged

fix(pilot): surface failure reasons in terminal + break single-slot cascade + cross-worktree discover#125
iceglober merged 2 commits into
mainfrom
fix/pilot-terminal-failure-visibility

Conversation

@iceglober

Copy link
Copy Markdown
Owner

Pilot terminal failure visibility

Goal

Make pilot build failures diagnosable from the terminal output alone.
Today when a task fails, stderr says task.failed T1 in 306s and prints
10 task.blocked lines — no failure phase, no reason string, no session
ID, no worktree path. The operator has to sqlite3 into a per-repo
state.db that pilot status --run <id> itself can't find from another
worktree. Three compounding bugs make unattended runs effectively
undebuggable: (1) task.failed event payloads carry {phase, reason}
but the streaming logger and summary discard both; (2) the single-slot
WorktreePool poisons the next task on any failure, cascading every
downstream task to blocked; (3) discoverRun resolves state.db
relative to cwd, breaking cross-worktree lookups. After this plan
lands: failure phase + reason are visible inline, the summary carries a
per-failed-task detail block with session/worktree, blocked-cascade is
de-noised to one summary line, a pool preserve doesn't poison the rest
of the run, per-task SSE logs are captured to runs/<id>/tasks/<task>/session.jsonl
for post-mortem, and pilot status --run <id> works from any worktree.

The underlying opencode stall that triggered the 300s silence on
T1-AUDIT-DOC is explicitly out of scope — that's opencode-layer, not
pilot-layer. Per-task forensics make the NEXT such stall diagnosable.

Constraints

  • Per-repo state layout locked to ~/.glorious/opencode/<repo>/pilot/
    (root AGENTS.md rule 10). No writes outside.
  • No schema migration on runs table — cross-repo discover is done via
    filesystem scan, not by persisting origin repo on the run row.
  • Pilot builder/planner invariants (rule 10) are irrelevant here — this
    work edits the pilot infrastructure itself, not the agents it invokes.
  • Tests required: every code change has a test; grep-guards on new
    symbol names pin the contract surface.
  • Changesets mandatory for user-visible changes — this ships as a
    patch bump with a user-facing CHANGELOG line.
  • Stay color-free in terminal output (matches existing printSummary
    style). Scannability comes from indentation + blank-line separation.

Acceptance criteria

Forensics + pool (do first — data source for the rest)

  • src/pilot/paths.ts exports getTaskJsonlPath(cwd, runId, taskId)
    returning <runDir>/tasks/<taskId>/session.jsonl, creating the parent
    tasks/<taskId>/ directory on first access. Rejects unsafe taskIds
    via the same character-class guard as isSafeRunId.
  • Worker captures every SSE event for a task's session to that
    JSONL file (one JSON object per line). Subscription is task-scoped
    (not per-attempt), teed from the existing deps.bus.on(sessionId, ...)
    hook.
  • Every appendEvent({kind: "task.failed", payload}) in
    src/pilot/worker/worker.ts carries a reason string matching the
    one stored in tasks.last_error. No bare {phase}-only payloads
    remain in any failure branch.
  • The stall branch's task.failed payload additionally carries
    eventCount (events observed during the wait) and lastEventTs
    (epoch ms of the last event, or null if zero events).
  • WorktreePool.acquire() after preserveOnFailure(slot) returns
    a NEW slot (preserved: false, prepared: false, fresh path after
    prepare). The preserved slot is tracked in a separate retiredSlots
    list — still cleaned up by shutdown({ keepPreserved: false }),
    still listed by inspect().
  • WorktreePool.prepare() on a retried slot appends -<counter>
    to the path from worktreeDirOf(n) when retryCounter.get(n) > 0.
    Worker count invariant (v0.1 = 1 concurrent) is preserved.

Terminal visibility (consumes the enriched data)

  • startStreamingLogger in src/pilot/cli/build.ts prints, after
    the existing task.failed <id> in <s>s line, a continuation line of
    the form → <phase>: <reason> (two-space indent, arrow prefix).
    Tolerates payloads missing phase/reason by omitting the line.
    reason truncated to 200 chars with if longer.
  • startStreamingLogger no longer prints one line per
    task.blocked event. Instead it counts blocked events and emits a
    single summary line on run.finished:
    blocked: N task(s) waiting on failed dependency (<firstReason>).
  • printSummary accepts the open DB handle and, when the run has
    failed/aborted tasks, prints a per-task detail block between the
    counts line and the follow-up commands:
    Failed tasks (N):
    
      <taskId>
        phase:    <phase>
        reason:   <reason>
        session:  <sessionId | "(none — failed before session.create)">
        worktree: <worktreePath | "(none)">
        elapsed:  <Ns>   attempts: <N>
    
    
    Phase pulled from the latest task.failed event for that
    (runId, taskId); reason falls back to last_error if the event
    payload is missing it.
  • Successful runs render identically to today — no empty
    Failed tasks (0): heading.

Cross-worktree discover

  • resolveBaseDir in src/pilot/paths.ts is exported (minimal
    change — add export).
  • discoverRun({cwd, runId}) tries getStateDbPath(cwd, runId)
    first (fast path), then falls back to scanning direct children of
    <base> for <child>/pilot/runs/<runId>/state.db. Returns the
    first hit. On miss, throws listing every path tried.
  • No-runId path (mtime-newest scan) unchanged.
  • pilot status / pilot logs / pilot cost inherit the fix
    because they all route through discoverRun.

Release hygiene

  • New .changeset/pilot-terminal-failure-visibility.md with
    "@glrs-dev/harness-opencode": patch frontmatter and a 1-2 sentence
    user-facing body.

Verification

  • bun run typecheck green.
  • bun test green (all suites, including existing pilot tests).
  • bun run build green (dist emits without error).

File-level changes

src/pilot/paths.ts

  • Change: Export resolveBaseDir (add export keyword, no body
    change). Add getTaskJsonlPath(cwd, runId, taskId) and matching
    isSafeTaskId guard (same character class as isSafeRunId).
  • Why: Worker needs a canonical path for per-task SSE logs; discover
    needs base-dir access for cross-repo scans.
  • Risk: none (additive exports; resolveBaseDir visibility change is
    trivial).

src/pilot/worktree/pool.ts

  • Change: Add retiredSlots: WorktreeSlot[] and retryCounter: Map<number, number>
    to WorktreePool. Rework acquire() to retire a preserved slot
    into retiredSlots, bump the counter, and mint a fresh stub.
    Rework prepare() so retried slots resolve to
    <worktreeDirOf(n)>-<counter>. Update shutdown() and inspect()
    to cover both live and retired slots.
  • Why: Eliminate the cascade-failure behaviour where one preserved
    slot blocks all downstream tasks.
  • Risk: medium — pool semantics change. Mitigated by keeping the
    existing workerCount=1 invariant, the existing
    WorktreePool.prepare "preserved" throw (as defence in depth), and
    the existing "preserved slots are not reusable" test (still
    passes).

src/pilot/worker/worker.ts

  • Change: After session.create in runOneTask, open an append-mode
    JSONL writer at getTaskJsonlPath(cwd, runId, taskId). Register a
    task-scoped bus subscription that (a) appends each event as a JSON
    line and (b) tracks lastEventTs + eventCount counters. Ensure
    the subscription unsubscribes on every return path. Add reason: <string> to every appendEvent({kind: "task.failed", payload})
    call — the reason local variable is already computed for
    markFailedSafe; just carry it into the event. Stall path
    additionally carries eventCount + lastEventTs.
  • Why: Enriches the event payload consumed by the streaming logger
    and summary; persists forensic state for post-mortem of silent
    stalls.
  • Risk: medium — worker changes are load-bearing. Additive only; no
    existing event fields removed or renamed. New JSONL writes are
    best-effort (try/catch; failure doesn't fail the task).

src/pilot/cli/build.ts

  • Change: Extend startStreamingLogger — on task.failed, emit a
    continuation → <phase>: <reason> line when payload carries both
    fields. On task.blocked, count instead of print. On
    run.finished, emit the blocked-cascade summary line if any
    occurred. Extend printSummary — accept db: Database in args;
    when counts.failed + counts.aborted > 0, render the per-task
    failure block between the counts line and the follow-up commands.
    Update executeRun's call to printSummary to pass the db handle.
  • Why: Surfaces failure reasons at the terminal — the primary axis
    of this plan.
  • Risk: low — output-format changes only; all gated on "did anything
    fail" so success-path output is byte-identical to today.

src/pilot/cli/discover.ts

  • Change: Restructure the runId !== undefined branch. Fast-path
    getStateDbPath(cwd, runId) unchanged. On miss, compute
    resolveBaseDir(), fs.readdir its direct children, and check
    <base>/<child>/pilot/runs/<runId>/state.db for each. Return the
    first hit. On total miss, throw with all paths tried.
  • Why: pilot status --run <id> invoked from a different worktree
    (or different repo) of the run that created it currently fails with
    "no state.db" — this is the bug that prevented me from investigating
    the failing run in the first place.
  • Risk: low — pure widening; fast path preserved.

test/pilot-paths.test.ts

  • Change: Add test cases for getTaskJsonlPath — creates directory,
    returns expected path, rejects unsafe taskIds (../, leading dot).
  • Why: Lock the new contract surface.
  • Risk: none.

test/pilot-worker.test.ts

  • Change: Add test that when the fake bus emits two events then
    stalls, the resulting task.failed event payload has
    eventCount: 2, a non-null lastEventTs, and the JSONL file
    contains exactly two parseable JSON lines.
  • Why: Pins the forensic capture + enriched payload contract.
  • Risk: none.

test/pilot-worktree-pool.test.ts

  • Change: Add describe("WorktreePool — retires preserved slot on re-acquire")
    with test "retires preserved slot and mints fresh path":
    acquire → prepare T1 → preserve → acquire again → expect fresh
    stub → prepare T2 → expect success with -1-suffixed path →
    inspect() shows two slots. Existing
    "preserved slots are not reusable" test must still pass
    (prepare on the preserved slot itself still throws).
  • Why: Regression guard for the cascade-failure fix.
  • Risk: none.

test/pilot-cli-build.test.ts

  • Change: Add three test cases to the existing
    describe("startStreamingLogger", ...) block:
    "prints phase and reason on task.failed",
    "de-noises blocked cascade",
    "tolerates task.failed without phase/reason".
    Add new describe("printSummary — failure block", ...) that uses
    openStateDb(":memory:"), seeds a run with one failed + one
    aborted task, and asserts the rendered block matches the expected
    format.
  • Why: Lock the terminal-visibility contract.
  • Risk: none.

test/pilot-cli-status.test.ts

  • Change: Add test
    "resolves run from a different worktree of the same repo"
    uses GLORIOUS_PILOT_DIR env override to build a <base> with two
    repo-folder subdirs, drops a state.db in one, chdirs into a dir
    whose getRepoFolder resolves to the other, expects discoverRun
    to find it.
  • Why: Regression guard for cross-worktree lookup.
  • Risk: none.

.changeset/pilot-terminal-failure-visibility.md

  • Change: New file with "@glrs-dev/harness-opencode": patch
    frontmatter and a 1-2 sentence body describing the user-visible
    fix (pilot build now prints failure phase/reason, summary carries
    per-task detail, blocked cascade de-noised, pilot status --run <id> works from any worktree).
  • Why: Mandatory for release (root AGENTS.md changesets section).
  • Risk: none.

Test plan

  • bun test test/pilot-paths.test.ts — covers getTaskJsonlPath.
  • bun test test/pilot-worktree-pool.test.ts — covers the retire-on-preserve behaviour; existing tests must still pass.
  • bun test test/pilot-worker.test.ts — covers enriched task.failed payload + JSONL capture.
  • bun test test/pilot-cli-build.test.ts — covers streaming-logger failure detail + blocked de-noise + printSummary failure block.
  • bun test test/pilot-cli-status.test.ts — covers cross-worktree discover.
  • bun test (full suite) — final gate.
  • bun run typecheck — after every surface touched.
  • bun run build — before shipping; ensures dist/ emits cleanly.

Out of scope

  • The opencode-layer root cause of T1-AUDIT-DOC's 300s silence.
    Worktree preserved at /Users/austinhess/.glorious/opencode/kn-eng/pilot/worktrees/01KQ5GD4EZ0EWZT5W7YJCDA4A5/00/
    for separate investigation; this plan makes NEXT such stall
    diagnosable, doesn't chase this one.
  • A --repo <path> override flag on pilot status / pilot logs.
    User picked cross-repo scan (option A); this path stays open for
    future work if scan performance becomes an issue (it won't — ULIDs
    are cheap).
  • Schema migration to persist origin repo on the runs table. Cross-repo
    scan is cheaper and ULIDs prevent collisions.
  • ANSI color in terminal output. Matches existing style.
  • Multi-worker pool support (v0.3 work).
  • Resolving the pre-existing file-overlap warnings flagged by
    pilot validate (they're static globs, not race warnings).

Open questions

  • None. All design decisions resolved in the plan above.

…ascade + cross-worktree discover

When a task fails in 'pilot build', make the reason visible on stderr and
in the run summary — no more 'task.failed T1 in 306s' followed by 10 bare
'task.blocked' lines with no signal about what actually broke.

- startStreamingLogger prints '  → <phase>: <reason>' beneath each
  task.failed event. Blocked cascade de-noised to one summary line on
  run.finished ('blocked: N task(s) waiting on failed dependency (...)').
- printSummary accepts the DB handle and renders a per-failed-task
  block (phase, reason, session, worktree, elapsed, attempts) between
  the counts line and the follow-up commands. Successful runs render
  identically to before.
- WorktreePool retires preserved-on-failure slots into a separate
  retiredSlots list on next acquire() and mints a fresh stub with a
  '-<counter>'-suffixed path. One failure no longer cascade-blocks the
  whole run.
- runOneTask's every task.failed event payload now carries a 'reason'
  string matching tasks.last_error. Stall payloads additionally carry
  eventCount + lastEventTs. Per-task session.jsonl is written to
  <runDir>/tasks/<taskId>/session.jsonl (the path AGENTS.md already
  documents) for post-mortem.
- discoverRun tries the cwd-scoped path first, then falls back to
  scanning '<base>/<repoFolder>/pilot/runs/<runId>/state.db' across
  every repo folder. 'pilot status --run <id>' / 'pilot logs --run <id>'
  now work from any worktree.
…tests

Bun's bundled SQLite on macOS CI rejects `1_700_000_000_000` as an
unrecognized token when it appears inside the VALUES list of an INSERT
statement. My local Bun tolerated it but GitHub Actions macos-latest
does not, so the printSummary failure-block tests failed with
"SQLiteError: unrecognized token" while Ubuntu (and my machine) passed.

Numeric separators in JS parameter-binding arrays are fine (JS parser
owns those); they only break when baked into the SQL string. Replaced
all three in-SQL occurrences with the plain numeric literals.
@iceglober iceglober merged commit f4c6905 into main Apr 26, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant