Skip to content

[codex] Add agent-tty dogfood demo bundle#54

Merged
ThomasK33 merged 6 commits into
mainfrom
codex/agent-uses-agent-tty-dogfood
Apr 28, 2026
Merged

[codex] Add agent-tty dogfood demo bundle#54
ThomasK33 merged 6 commits into
mainfrom
codex/agent-uses-agent-tty-dogfood

Conversation

@ThomasK33
Copy link
Copy Markdown
Member

What changed

  • Added an evergreen dogfood/agent-uses-agent-tty/ bundle that records Codex and Claude using public agent-tty ... commands to drive clean Neovim, write a proof file, export inner artifacts, and exit.
  • Added README thumbnail links to review-friendly outer WebM cuts while preserving untrimmed outer recordings as *-outer-full.webm.
  • Listed the new scenario in the dogfood README and catalog.

Why

The first outer recordings spent too much time on startup/trust/model latency and then moved too quickly through the useful agent-tty interaction. The generated bundle now keeps canonical full recordings while surfacing slowed review cuts for readers.

Validation

  • bash -n dogfood/agent-uses-agent-tty/reproduce.sh
  • npm run build
  • npm run format:check
  • bash dogfood/agent-uses-agent-tty/reproduce.sh --agent both
  • ffprobe confirmed linked review WebMs are ~29.84s and full outer WebMs remain available.
  • Final file proofs confirmed both agents wrote agent-tty nested Neovim proof from an AI coding agent.

@ThomasK33
Copy link
Copy Markdown
Member Author

/coder-agents-review

Copy link
Copy Markdown

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thorough dogfood bundle. Genuine artifacts (cross-verified session IDs, coherent timestamps, real agent TUI banners, correct SHA256 in proof files). The inner helper design is smart: a deterministic, self-contained script that the non-deterministic agent simply executes. Isolation is solid (per-agent outer/inner homes, XDG overrides, git init, trap cleanup). Two issues need addressing before this ships.

Severity count: 2 P2, 5 P3, 2 P4, 4 Nit.

Notes (not posted inline):

  • The checked-in JSON envelopes (*-outer-create.json, *-record-*.json) contain the author's full local PATH and macOS temp-dir paths. Not a credential leak, but will produce noisy diffs when a different contributor regenerates the bundle.
  • "Evergreen" label vs. external service dependencies: the reproducer requires live Codex/Claude API access and a specific model (gpt-5.4-mini). The committed artifacts stand on their own, but the reproducer will break when any external dependency changes. The inner helper is itself a clean, standalone agent-tty demo that could run independently without AI services.
  • Narrow cleanup window: if a signal arrives between CURRENT_OUTER_HOME assignment (line 551) and CURRENT_OUTER_SESSION_ID extraction (line 552), the EXIT trap skips session destroy. Microsecond window under normal operation; noting for completeness.

"What happens when a contributor on Linux runs the headline dogfood reproducer and it fails after a 2-minute npm build cycle? Shall I show you?" (Hisoka)

🤖 This review was automatically generated with Coder Agents.

Comment thread dogfood/agent-uses-agent-tty/reproduce.sh Outdated
Comment thread dogfood/agent-uses-agent-tty/reproduce.sh Outdated
Comment thread dogfood/agent-uses-agent-tty/reproduce.sh Outdated
Comment thread dogfood/agent-uses-agent-tty/reproduce.sh
Comment thread dogfood/agent-uses-agent-tty/reproduce.sh Outdated
Comment thread dogfood/agent-uses-agent-tty/prompts/template.md
Comment thread dogfood/agent-uses-agent-tty/reproduce.sh Outdated
Comment thread dogfood/agent-uses-agent-tty/reproduce.sh Outdated
Comment thread dogfood/agent-uses-agent-tty/reproduce.sh
Comment thread dogfood/agent-uses-agent-tty/reproduce.sh Outdated
Change-Id: Id4b4fb802f76b8951cd5f1638b7491764f3858d3
Signed-off-by: Thomas Kosiewski <tk@coder.com>
@ThomasK33 ThomasK33 force-pushed the codex/agent-uses-agent-tty-dogfood branch from cc464e3 to 0cc109a Compare April 27, 2026 14:07
Change-Id: Ia0ccc7dc20d57e6052aa74a5792ac2ecac1f2ef4
Signed-off-by: Thomas Kosiewski <tk@coder.com>
@ThomasK33 ThomasK33 force-pushed the codex/agent-uses-agent-tty-dogfood branch from 0cc109a to 8b6ac7f Compare April 27, 2026 14:16
@ThomasK33
Copy link
Copy Markdown
Member Author

/coder-agents-review

1 similar comment
@ThomasK33
Copy link
Copy Markdown
Member Author

/coder-agents-review

Copy link
Copy Markdown

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All 15 prior findings addressed cleanly in one commit. The DEREM-5 fix (review cuts) is architecturally sound: switching to accelerated replay compresses idle time so the tail contains useful content. Pariston noted the accelerated and recorded timings produce identical durations for these specific recordings (continuous TUI output leaves no idle to compress), so the fix is a no-op here but correct in principle for future runs with more idle time. The DEREM-15 cleanup fix adds two defensive layers (inner helper trap + outer sweep via destroy_sessions_in_home) that interlock correctly.

Two fix-chain regressions from the prior round: dead parameters left behind by DEREM-13 and DEREM-2 fixes. Four new P3 findings, two P4s, two Nits. No P2s.

Severity count: 4 P3, 2 P4, 2 Nit.

Notes (not posted inline):

  • exitCode // "unknown" fallback (line 641) is effectively unreachable because stop_outer_agent_tui guarantees timedOut == false before reaching the exit-code extraction. If it ever fires, the error message would misdiagnose ("exited with code unknown" when the process did exit, just without reporting a code).
  • Cleanup trap double-destroys the current outer session (explicit destroy at line 188, then sweep at line 191). Intentional defense-in-depth; the second attempt is harmless.

"Two surfaces presenting the same face, one validated, one not. What happens when someone copies the pattern from OUTER_TIMEOUT_MS and sets PROOF_TIMEOUT_MS=5000 for a quick test? 5 seconds. What about 500? Zero." (Hisoka)

🤖 This review was automatically generated with Coder Agents.

Comment thread dogfood/agent-uses-agent-tty/reproduce.sh Outdated
Comment thread dogfood/agent-uses-agent-tty/reproduce.sh Outdated
Comment thread dogfood/agent-uses-agent-tty/reproduce.sh
Comment thread dogfood/agent-uses-agent-tty/reproduce.sh
Comment thread dogfood/agent-uses-agent-tty/reproduce.sh
Comment thread dogfood/agent-uses-agent-tty/reproduce.sh
Comment thread dogfood/agent-uses-agent-tty/reproduce.sh Outdated
Comment thread dogfood/agent-uses-agent-tty/reproduce.sh
Comment thread dogfood/agent-uses-agent-tty/README.md
Change-Id: I509caf65570c04a1d0753420113e7c58c9df064e
Signed-off-by: Thomas Kosiewski <tk@coder.com>
@ThomasK33
Copy link
Copy Markdown
Member Author

/coder-agents-review

Copy link
Copy Markdown

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All 9 R2 findings addressed cleanly in 384016a. No fix-chain regressions this round. Netero reports all 8 sections clean. The assertion suite after three rounds is genuinely strong: exact file content, inner cast content greps, ffprobe duration on all three WebMs, envelope metadata, template placeholder integrity, input validation, liveness detection, and diagnostic error messages.

Two subtle bash set -e context issues surfaced, plus a stale-envelope concern. No P2s.

Severity count: 3 P3, 2 P4.

Inventory correction: DEREM-22 ("outer-snapshot.txt and agent-transcript.txt byte-identical") was marked "Author fixed" but three reviewers independently verified the code is unchanged and artifacts remain identical. Updated to "Author accepted" since the fix was documentation-only (adding both files to README with distinct descriptions). Reasonable for a dogfood bundle.

Notes (not posted inline):

  • capture_json_var in the polling loop (DEREM-33 sibling) aborts the script on any non-ok response. The assumption that agent-tty never returns ok:false during polling is reasonable for a dogfood script, but the polling loop already handles two response shapes (timedOut true/false) and silently drops a third (error).

"The function behaves differently depending on who calls it. The ok-check and the copy are secretly coupled through set -e semantics that the caller controls." (Hisoka)

🤖 This review was automatically generated with Coder Agents.

Comment thread dogfood/agent-uses-agent-tty/reproduce.sh Outdated
Comment thread dogfood/agent-uses-agent-tty/reproduce.sh Outdated
Comment thread dogfood/agent-uses-agent-tty/codex-outer-record-review-webm.json Outdated
Comment thread dogfood/agent-uses-agent-tty/reproduce.sh Outdated
Comment thread dogfood/agent-uses-agent-tty/reproduce.sh Outdated
Change-Id: I3398fa0fa1a2a83ecf80df7c1ea89999181638b8
Signed-off-by: Thomas Kosiewski <tk@coder.com>
@ThomasK33
Copy link
Copy Markdown
Member Author

/coder-agents-review

Copy link
Copy Markdown

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All 5 R3 findings addressed in 8a96ef7. No fix-chain regressions. Netero clean for two consecutive rounds. Two of four panel reviewers found no findings; Hisoka verified all R3 fixes and reported silence.

The DEREM-32 fix (set -e context in thumbnail capture) is well done: explicit if/return branching replaces the implicit set -e dependency. The DEREM-33 fix (liveness check) replaced capture_json_var with direct execution and four distinct error paths, each with diagnostic output. The DEREM-34 fix removes the stale review-webm envelopes from the committed bundle entirely.

Severity count: 1 P3, 1 P4.

The P3 is about diagnostic quality in the remaining helpers (run_json_file and capture_json_var), which silently crash on .ok == false without extracting the error message. The script already established the right pattern in wait_for_agent_proof (DEREM-33 fix); the same treatment would benefit the six call sites in run_agent_demo.

"The code earned silence." (Hisoka)

🤖 This review was automatically generated with Coder Agents.

Comment thread dogfood/agent-uses-agent-tty/reproduce.sh
Comment thread dogfood/agent-uses-agent-tty/reproduce.sh Outdated
Change-Id: I096d3091d3bac403ee2830d445e34d75292d3220
Signed-off-by: Thomas Kosiewski <tk@coder.com>
@ThomasK33
Copy link
Copy Markdown
Member Author

/coder-agents-review

Copy link
Copy Markdown

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All R4 findings addressed in 79964e0. Netero clean for three consecutive rounds. Two of four panel reviewers found no findings. Pariston: "I tried to build a case against this change and couldn't. Five rounds and 38 findings have hardened the script's error handling, temp-file hygiene, verification completeness, and diagnostic quality."

Two minor findings remain (1 P3, 1 P4), both naming/documentation concerns with no functional impact. Approving with these noted.

Severity count: 1 P3, 1 P4.

The P3 is a naming mismatch: final-message.txt contains a script-generated summary, not captured agent output as the README claims. The P4 is dead parameter weight in render_prompt from call-site symmetry with write_inner_helper. Neither affects correctness.

Across 5 rounds, 40 findings were raised: 2 P2, 14 P3, 10 P4, 7 Nit, 2 Note, 5 dropped. 33 were fixed by the author, 1 accepted (DEREM-22), 5 dropped by orchestrator, 2 remain open. The finding rate has converged: R1 had 15, R2 had 9, R3 had 5, R4 had 2, R5 has 2. The script is well-hardened.

🤖 This review was automatically generated with Coder Agents.

Comment thread dogfood/agent-uses-agent-tty/reproduce.sh Outdated
Comment thread dogfood/agent-uses-agent-tty/reproduce.sh
Change-Id: Ifdfc3827636621845347429eda2c6b31856c2496
Signed-off-by: Thomas Kosiewski <tk@coder.com>
@ThomasK33
Copy link
Copy Markdown
Member Author

/coder-agents-review

Copy link
Copy Markdown

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All 40 findings across 5 prior rounds addressed. Netero clean for four consecutive rounds. All four panel reviewers report no findings. Unanimous silence.

The R5 fixes are verified: final-message.txt renamed to recording-summary.txt with honest README description. render_prompt trimmed from 11 to 5 parameters with exact 1:1 placeholder mapping.

No open findings remain. 33 fixed by author, 1 accepted (DEREM-22), 6 dropped by orchestrator. The finding rate converged to zero: R1(15), R2(9), R3(5), R4(2), R5(2), R6(0).

"Silence is its own rating." (Hisoka)

🤖 This review was automatically generated with Coder Agents.

@ThomasK33 ThomasK33 marked this pull request as ready for review April 28, 2026 21:13
@ThomasK33 ThomasK33 merged commit 3f7627b into main Apr 28, 2026
19 of 22 checks passed
@ThomasK33 ThomasK33 deleted the codex/agent-uses-agent-tty-dogfood branch April 28, 2026 21:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant