[codex] Add agent-tty dogfood demo bundle by ThomasK33 · Pull Request #54 · coder/agent-tty

ThomasK33 · 2026-04-26T19:34:17Z

What changed

Added an evergreen dogfood/agent-uses-agent-tty/ bundle that records Codex and Claude using public agent-tty ... commands to drive clean Neovim, write a proof file, export inner artifacts, and exit.
Added README thumbnail links to review-friendly outer WebM cuts while preserving untrimmed outer recordings as *-outer-full.webm.
Listed the new scenario in the dogfood README and catalog.

Why

The first outer recordings spent too much time on startup/trust/model latency and then moved too quickly through the useful agent-tty interaction. The generated bundle now keeps canonical full recordings while surfacing slowed review cuts for readers.

Validation

bash -n dogfood/agent-uses-agent-tty/reproduce.sh
npm run build
npm run format:check
bash dogfood/agent-uses-agent-tty/reproduce.sh --agent both
ffprobe confirmed linked review WebMs are ~29.84s and full outer WebMs remain available.
Final file proofs confirmed both agents wrote agent-tty nested Neovim proof from an AI coding agent.

ThomasK33 · 2026-04-27T09:29:20Z

/coder-agents-review

coder-agents-review

Thorough dogfood bundle. Genuine artifacts (cross-verified session IDs, coherent timestamps, real agent TUI banners, correct SHA256 in proof files). The inner helper design is smart: a deterministic, self-contained script that the non-deterministic agent simply executes. Isolation is solid (per-agent outer/inner homes, XDG overrides, git init, trap cleanup). Two issues need addressing before this ships.

Severity count: 2 P2, 5 P3, 2 P4, 4 Nit.

Notes (not posted inline):

The checked-in JSON envelopes (*-outer-create.json, *-record-*.json) contain the author's full local PATH and macOS temp-dir paths. Not a credential leak, but will produce noisy diffs when a different contributor regenerates the bundle.
"Evergreen" label vs. external service dependencies: the reproducer requires live Codex/Claude API access and a specific model (gpt-5.4-mini). The committed artifacts stand on their own, but the reproducer will break when any external dependency changes. The inner helper is itself a clean, standalone agent-tty demo that could run independently without AI services.
Narrow cleanup window: if a signal arrives between CURRENT_OUTER_HOME assignment (line 551) and CURRENT_OUTER_SESSION_ID extraction (line 552), the EXIT trap skips session destroy. Microsecond window under normal operation; noting for completeness.

Change-Id: Id4b4fb802f76b8951cd5f1638b7491764f3858d3 Signed-off-by: Thomas Kosiewski <tk@coder.com>

Change-Id: Ia0ccc7dc20d57e6052aa74a5792ac2ecac1f2ef4 Signed-off-by: Thomas Kosiewski <tk@coder.com>

ThomasK33 · 2026-04-27T15:00:00Z

/coder-agents-review

ThomasK33 · 2026-04-27T18:09:25Z

/coder-agents-review

coder-agents-review

All 15 prior findings addressed cleanly in one commit. The DEREM-5 fix (review cuts) is architecturally sound: switching to accelerated replay compresses idle time so the tail contains useful content. Pariston noted the accelerated and recorded timings produce identical durations for these specific recordings (continuous TUI output leaves no idle to compress), so the fix is a no-op here but correct in principle for future runs with more idle time. The DEREM-15 cleanup fix adds two defensive layers (inner helper trap + outer sweep via destroy_sessions_in_home) that interlock correctly.

Two fix-chain regressions from the prior round: dead parameters left behind by DEREM-13 and DEREM-2 fixes. Four new P3 findings, two P4s, two Nits. No P2s.

Severity count: 4 P3, 2 P4, 2 Nit.

Notes (not posted inline):

exitCode // "unknown" fallback (line 641) is effectively unreachable because stop_outer_agent_tui guarantees timedOut == false before reaching the exit-code extraction. If it ever fires, the error message would misdiagnose ("exited with code unknown" when the process did exit, just without reporting a code).
Cleanup trap double-destroys the current outer session (explicit destroy at line 188, then sweep at line 191). Intentional defense-in-depth; the second attempt is harmless.

Change-Id: I509caf65570c04a1d0753420113e7c58c9df064e Signed-off-by: Thomas Kosiewski <tk@coder.com>

ThomasK33 · 2026-04-28T09:42:26Z

/coder-agents-review

coder-agents-review

All 9 R2 findings addressed cleanly in 384016a. No fix-chain regressions this round. Netero reports all 8 sections clean. The assertion suite after three rounds is genuinely strong: exact file content, inner cast content greps, ffprobe duration on all three WebMs, envelope metadata, template placeholder integrity, input validation, liveness detection, and diagnostic error messages.

Two subtle bash set -e context issues surfaced, plus a stale-envelope concern. No P2s.

Severity count: 3 P3, 2 P4.

Inventory correction: DEREM-22 ("outer-snapshot.txt and agent-transcript.txt byte-identical") was marked "Author fixed" but three reviewers independently verified the code is unchanged and artifacts remain identical. Updated to "Author accepted" since the fix was documentation-only (adding both files to README with distinct descriptions). Reasonable for a dogfood bundle.

Notes (not posted inline):

capture_json_var in the polling loop (DEREM-33 sibling) aborts the script on any non-ok response. The assumption that agent-tty never returns ok:false during polling is reasonable for a dogfood script, but the polling loop already handles two response shapes (timedOut true/false) and silently drops a third (error).

Change-Id: I3398fa0fa1a2a83ecf80df7c1ea89999181638b8 Signed-off-by: Thomas Kosiewski <tk@coder.com>

ThomasK33 · 2026-04-28T12:26:42Z

/coder-agents-review

coder-agents-review

All 5 R3 findings addressed in 8a96ef7. No fix-chain regressions. Netero clean for two consecutive rounds. Two of four panel reviewers found no findings; Hisoka verified all R3 fixes and reported silence.

The DEREM-32 fix (set -e context in thumbnail capture) is well done: explicit if/return branching replaces the implicit set -e dependency. The DEREM-33 fix (liveness check) replaced capture_json_var with direct execution and four distinct error paths, each with diagnostic output. The DEREM-34 fix removes the stale review-webm envelopes from the committed bundle entirely.

Severity count: 1 P3, 1 P4.

The P3 is about diagnostic quality in the remaining helpers (run_json_file and capture_json_var), which silently crash on .ok == false without extracting the error message. The script already established the right pattern in wait_for_agent_proof (DEREM-33 fix); the same treatment would benefit the six call sites in run_agent_demo.

Change-Id: I096d3091d3bac403ee2830d445e34d75292d3220 Signed-off-by: Thomas Kosiewski <tk@coder.com>

ThomasK33 · 2026-04-28T20:08:54Z

/coder-agents-review

coder-agents-review

All R4 findings addressed in 79964e0. Netero clean for three consecutive rounds. Two of four panel reviewers found no findings. Pariston: "I tried to build a case against this change and couldn't. Five rounds and 38 findings have hardened the script's error handling, temp-file hygiene, verification completeness, and diagnostic quality."

Two minor findings remain (1 P3, 1 P4), both naming/documentation concerns with no functional impact. Approving with these noted.

Severity count: 1 P3, 1 P4.

The P3 is a naming mismatch: final-message.txt contains a script-generated summary, not captured agent output as the README claims. The P4 is dead parameter weight in render_prompt from call-site symmetry with write_inner_helper. Neither affects correctness.

Across 5 rounds, 40 findings were raised: 2 P2, 14 P3, 10 P4, 7 Nit, 2 Note, 5 dropped. 33 were fixed by the author, 1 accepted (DEREM-22), 5 dropped by orchestrator, 2 remain open. The finding rate has converged: R1 had 15, R2 had 9, R3 had 5, R4 had 2, R5 has 2. The script is well-hardened.

Change-Id: Ifdfc3827636621845347429eda2c6b31856c2496 Signed-off-by: Thomas Kosiewski <tk@coder.com>

ThomasK33 · 2026-04-28T20:48:48Z

/coder-agents-review

coder-agents-review

All 40 findings across 5 prior rounds addressed. Netero clean for four consecutive rounds. All four panel reviewers report no findings. Unanimous silence.

The R5 fixes are verified: final-message.txt renamed to recording-summary.txt with honest README description. render_prompt trimmed from 11 to 5 parameters with exact 1:1 placeholder mapping.

No open findings remain. 33 fixed by author, 1 accepted (DEREM-22), 6 dropped by orchestrator. The finding rate converged to zero: R1(15), R2(9), R3(5), R4(2), R5(2), R6(0).

coder-agents-review Bot suggested changes Apr 27, 2026

View reviewed changes

docs: add agent-tty dogfood demo bundle

7d4e0ca

Change-Id: Id4b4fb802f76b8951cd5f1638b7491764f3858d3 Signed-off-by: Thomas Kosiewski <tk@coder.com>

ThomasK33 force-pushed the codex/agent-uses-agent-tty-dogfood branch from cc464e3 to 0cc109a Compare April 27, 2026 14:07

fix: address dogfood demo review feedback

8b6ac7f

Change-Id: Ia0ccc7dc20d57e6052aa74a5792ac2ecac1f2ef4 Signed-off-by: Thomas Kosiewski <tk@coder.com>

ThomasK33 force-pushed the codex/agent-uses-agent-tty-dogfood branch from 0cc109a to 8b6ac7f Compare April 27, 2026 14:16

coder-agents-review Bot reviewed Apr 28, 2026

View reviewed changes

fix: address dogfood review followups

384016a

Change-Id: I509caf65570c04a1d0753420113e7c58c9df064e Signed-off-by: Thomas Kosiewski <tk@coder.com>

coder-agents-review Bot reviewed Apr 28, 2026

View reviewed changes

fix: address dogfood review edge cases

8a96ef7

Change-Id: I3398fa0fa1a2a83ecf80df7c1ea89999181638b8 Signed-off-by: Thomas Kosiewski <tk@coder.com>

coder-agents-review Bot reviewed Apr 28, 2026

View reviewed changes

Comment thread dogfood/agent-uses-agent-tty/reproduce.sh

Comment thread dogfood/agent-uses-agent-tty/reproduce.sh Outdated

fix: improve dogfood JSON diagnostics

79964e0

Change-Id: I096d3091d3bac403ee2830d445e34d75292d3220 Signed-off-by: Thomas Kosiewski <tk@coder.com>

coder-agents-review Bot approved these changes Apr 28, 2026

View reviewed changes

Comment thread dogfood/agent-uses-agent-tty/reproduce.sh Outdated

Comment thread dogfood/agent-uses-agent-tty/reproduce.sh

fix: address dogfood review followups

b1f61d5

Change-Id: Ifdfc3827636621845347429eda2c6b31856c2496 Signed-off-by: Thomas Kosiewski <tk@coder.com>

coder-agents-review Bot approved these changes Apr 28, 2026

View reviewed changes

ThomasK33 marked this pull request as ready for review April 28, 2026 21:13

ThomasK33 merged commit 3f7627b into main Apr 28, 2026
19 of 22 checks passed

ThomasK33 deleted the codex/agent-uses-agent-tty-dogfood branch April 28, 2026 21:13

Conversation

ThomasK33 commented Apr 26, 2026

What changed

Why

Validation

Uh oh!

ThomasK33 commented Apr 27, 2026

Uh oh!

coder-agents-review Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ThomasK33 commented Apr 27, 2026

Uh oh!

ThomasK33 commented Apr 27, 2026

Uh oh!

coder-agents-review Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ThomasK33 commented Apr 28, 2026

Uh oh!

coder-agents-review Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ThomasK33 commented Apr 28, 2026

Uh oh!

coder-agents-review Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

ThomasK33 commented Apr 28, 2026

Uh oh!

coder-agents-review Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

ThomasK33 commented Apr 28, 2026

Uh oh!

coder-agents-review Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant