[codex] Add agent-tty dogfood demo bundle#54
Conversation
|
/coder-agents-review |
There was a problem hiding this comment.
Thorough dogfood bundle. Genuine artifacts (cross-verified session IDs, coherent timestamps, real agent TUI banners, correct SHA256 in proof files). The inner helper design is smart: a deterministic, self-contained script that the non-deterministic agent simply executes. Isolation is solid (per-agent outer/inner homes, XDG overrides, git init, trap cleanup). Two issues need addressing before this ships.
Severity count: 2 P2, 5 P3, 2 P4, 4 Nit.
Notes (not posted inline):
- The checked-in JSON envelopes (
*-outer-create.json,*-record-*.json) contain the author's full local PATH and macOS temp-dir paths. Not a credential leak, but will produce noisy diffs when a different contributor regenerates the bundle. - "Evergreen" label vs. external service dependencies: the reproducer requires live Codex/Claude API access and a specific model (
gpt-5.4-mini). The committed artifacts stand on their own, but the reproducer will break when any external dependency changes. The inner helper is itself a clean, standalone agent-tty demo that could run independently without AI services. - Narrow cleanup window: if a signal arrives between
CURRENT_OUTER_HOMEassignment (line 551) andCURRENT_OUTER_SESSION_IDextraction (line 552), the EXIT trap skips session destroy. Microsecond window under normal operation; noting for completeness.
"What happens when a contributor on Linux runs the headline dogfood reproducer and it fails after a 2-minute npm build cycle? Shall I show you?" (Hisoka)
🤖 This review was automatically generated with Coder Agents.
Change-Id: Id4b4fb802f76b8951cd5f1638b7491764f3858d3 Signed-off-by: Thomas Kosiewski <tk@coder.com>
cc464e3 to
0cc109a
Compare
Change-Id: Ia0ccc7dc20d57e6052aa74a5792ac2ecac1f2ef4 Signed-off-by: Thomas Kosiewski <tk@coder.com>
0cc109a to
8b6ac7f
Compare
|
/coder-agents-review |
1 similar comment
|
/coder-agents-review |
There was a problem hiding this comment.
All 15 prior findings addressed cleanly in one commit. The DEREM-5 fix (review cuts) is architecturally sound: switching to accelerated replay compresses idle time so the tail contains useful content. Pariston noted the accelerated and recorded timings produce identical durations for these specific recordings (continuous TUI output leaves no idle to compress), so the fix is a no-op here but correct in principle for future runs with more idle time. The DEREM-15 cleanup fix adds two defensive layers (inner helper trap + outer sweep via destroy_sessions_in_home) that interlock correctly.
Two fix-chain regressions from the prior round: dead parameters left behind by DEREM-13 and DEREM-2 fixes. Four new P3 findings, two P4s, two Nits. No P2s.
Severity count: 4 P3, 2 P4, 2 Nit.
Notes (not posted inline):
exitCode // "unknown"fallback (line 641) is effectively unreachable becausestop_outer_agent_tuiguaranteestimedOut == falsebefore reaching the exit-code extraction. If it ever fires, the error message would misdiagnose ("exited with code unknown" when the process did exit, just without reporting a code).- Cleanup trap double-destroys the current outer session (explicit destroy at line 188, then sweep at line 191). Intentional defense-in-depth; the second attempt is harmless.
"Two surfaces presenting the same face, one validated, one not. What happens when someone copies the pattern from OUTER_TIMEOUT_MS and sets PROOF_TIMEOUT_MS=5000 for a quick test? 5 seconds. What about 500? Zero." (Hisoka)
🤖 This review was automatically generated with Coder Agents.
Change-Id: I509caf65570c04a1d0753420113e7c58c9df064e Signed-off-by: Thomas Kosiewski <tk@coder.com>
|
/coder-agents-review |
There was a problem hiding this comment.
All 9 R2 findings addressed cleanly in 384016a. No fix-chain regressions this round. Netero reports all 8 sections clean. The assertion suite after three rounds is genuinely strong: exact file content, inner cast content greps, ffprobe duration on all three WebMs, envelope metadata, template placeholder integrity, input validation, liveness detection, and diagnostic error messages.
Two subtle bash set -e context issues surfaced, plus a stale-envelope concern. No P2s.
Severity count: 3 P3, 2 P4.
Inventory correction: DEREM-22 ("outer-snapshot.txt and agent-transcript.txt byte-identical") was marked "Author fixed" but three reviewers independently verified the code is unchanged and artifacts remain identical. Updated to "Author accepted" since the fix was documentation-only (adding both files to README with distinct descriptions). Reasonable for a dogfood bundle.
Notes (not posted inline):
capture_json_varin the polling loop (DEREM-33 sibling) aborts the script on any non-ok response. The assumption that agent-tty never returns ok:false during polling is reasonable for a dogfood script, but the polling loop already handles two response shapes (timedOut true/false) and silently drops a third (error).
"The function behaves differently depending on who calls it. The ok-check and the copy are secretly coupled through set -e semantics that the caller controls." (Hisoka)
🤖 This review was automatically generated with Coder Agents.
Change-Id: I3398fa0fa1a2a83ecf80df7c1ea89999181638b8 Signed-off-by: Thomas Kosiewski <tk@coder.com>
|
/coder-agents-review |
There was a problem hiding this comment.
All 5 R3 findings addressed in 8a96ef7. No fix-chain regressions. Netero clean for two consecutive rounds. Two of four panel reviewers found no findings; Hisoka verified all R3 fixes and reported silence.
The DEREM-32 fix (set -e context in thumbnail capture) is well done: explicit if/return branching replaces the implicit set -e dependency. The DEREM-33 fix (liveness check) replaced capture_json_var with direct execution and four distinct error paths, each with diagnostic output. The DEREM-34 fix removes the stale review-webm envelopes from the committed bundle entirely.
Severity count: 1 P3, 1 P4.
The P3 is about diagnostic quality in the remaining helpers (run_json_file and capture_json_var), which silently crash on .ok == false without extracting the error message. The script already established the right pattern in wait_for_agent_proof (DEREM-33 fix); the same treatment would benefit the six call sites in run_agent_demo.
"The code earned silence." (Hisoka)
🤖 This review was automatically generated with Coder Agents.
Change-Id: I096d3091d3bac403ee2830d445e34d75292d3220 Signed-off-by: Thomas Kosiewski <tk@coder.com>
|
/coder-agents-review |
There was a problem hiding this comment.
All R4 findings addressed in 79964e0. Netero clean for three consecutive rounds. Two of four panel reviewers found no findings. Pariston: "I tried to build a case against this change and couldn't. Five rounds and 38 findings have hardened the script's error handling, temp-file hygiene, verification completeness, and diagnostic quality."
Two minor findings remain (1 P3, 1 P4), both naming/documentation concerns with no functional impact. Approving with these noted.
Severity count: 1 P3, 1 P4.
The P3 is a naming mismatch: final-message.txt contains a script-generated summary, not captured agent output as the README claims. The P4 is dead parameter weight in render_prompt from call-site symmetry with write_inner_helper. Neither affects correctness.
Across 5 rounds, 40 findings were raised: 2 P2, 14 P3, 10 P4, 7 Nit, 2 Note, 5 dropped. 33 were fixed by the author, 1 accepted (DEREM-22), 5 dropped by orchestrator, 2 remain open. The finding rate has converged: R1 had 15, R2 had 9, R3 had 5, R4 had 2, R5 has 2. The script is well-hardened.
🤖 This review was automatically generated with Coder Agents.
Change-Id: Ifdfc3827636621845347429eda2c6b31856c2496 Signed-off-by: Thomas Kosiewski <tk@coder.com>
|
/coder-agents-review |
There was a problem hiding this comment.
All 40 findings across 5 prior rounds addressed. Netero clean for four consecutive rounds. All four panel reviewers report no findings. Unanimous silence.
The R5 fixes are verified: final-message.txt renamed to recording-summary.txt with honest README description. render_prompt trimmed from 11 to 5 parameters with exact 1:1 placeholder mapping.
No open findings remain. 33 fixed by author, 1 accepted (DEREM-22), 6 dropped by orchestrator. The finding rate converged to zero: R1(15), R2(9), R3(5), R4(2), R5(2), R6(0).
"Silence is its own rating." (Hisoka)
🤖 This review was automatically generated with Coder Agents.
What changed
dogfood/agent-uses-agent-tty/bundle that records Codex and Claude using publicagent-tty ...commands to drive clean Neovim, write a proof file, export inner artifacts, and exit.*-outer-full.webm.Why
The first outer recordings spent too much time on startup/trust/model latency and then moved too quickly through the useful
agent-ttyinteraction. The generated bundle now keeps canonical full recordings while surfacing slowed review cuts for readers.Validation
bash -n dogfood/agent-uses-agent-tty/reproduce.shnpm run buildnpm run format:checkbash dogfood/agent-uses-agent-tty/reproduce.sh --agent bothffprobeconfirmed linked review WebMs are ~29.84s and full outer WebMs remain available.agent-tty nested Neovim proof from an AI coding agent.