CLI: add eval-runner result diagnostics #3376
Capture final SDK result metadata in eval-runner output: resultStopReason, resultText, resultErrorMessage, resultUsage, and interrupted. All derived from findLastAssistant(event.messages) on agent_end so they reflect the same source of truth as success.
Add producedNoUsefulOutput, a heuristic that flags runs where the agent reported success but produced neither assistant text nor a successful mutating-tool call. The motivating failure mode is GPT-5.5 ending a turn after read-only tool calls (site_list, site_info) without writing anything or explaining anything; these runs classify as success today but produce no import report, no assistant text, and no Write/wp_cli activity.
Add opt-in compact transcript diagnostics behind STUDIO_EVAL_INCLUDE_TRANSCRIPT=1. The transcript records every AgentSessionEvent with text/tool-result truncation to keep default artifacts small while making deeper debugging available when investigating model/runtime/harness regressions.
Replaces #3349 (closed), which was written against the pre-#3360 SDKMessage event model and could not be rebased onto trunk without a full rewrite.
Validation:
- npx eslint apps/cli/ai/eval-runner.ts: clean
- npm -w wp-studio run typecheck: clean
- npm run cli:build --silent: clean
- Direct smoke run (no tools): all new fields populate correctly.
- studio-agent-site-info.bench.mjs (single read-only tool call): tool calls and truncated tool results captured in transcript; producedNoUsefulOutput is false because the assistant produced text.
- Multi-turn mutating run (real site_create + wp_cli, generated and cleaned up a disposable Studio site): producedNoUsefulOutput is false; resultUsage reports input/output/cache tokens and cost.
- Read-only tool-only run (site_list with prompt forbidding any text): producedNoUsefulOutput is true, exactly the failure mode being targeted.
AI assistance:
- AI assistance: Yes
- Tool(s): Claude Code (Sonnet 4.5)
- Used for: rewriting the diagnostics layer against the new AgentSessionEvent / pi-coding-agent SessionManager model that landed in #3360, running the validation matrix above, and drafting this commit and the PR description. Chris reviewed the failure-mode evidence and remains responsible for the change.
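The producedNoUsefulOutput heuristic described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the code in eval-runner.ts: the RunSummary and ToolCallRecord shapes are hypothetical, and only 'Write', 'Edit', and 'Bash' are visible in the diff; 'wp_cli' and 'site_create' are assumed here because the description treats them as mutating activity.

```typescript
// Hypothetical, simplified shapes. The real runner works on AgentSessionEvent data.
interface ToolCallRecord {
	name: string;
	isError: boolean;
}

interface RunSummary {
	success: boolean;
	resultText: string;
	toolCalls: ToolCallRecord[];
}

// 'Write', 'Edit', and 'Bash' come from the diff; the other two names are assumptions.
const MUTATING_TOOL_NAMES = new Set( [ 'Write', 'Edit', 'Bash', 'wp_cli', 'site_create' ] );

function producedNoUsefulOutput( run: RunSummary ): boolean {
	// The heuristic only targets runs that were reported as successful.
	if ( ! run.success ) {
		return false;
	}
	const hasText = run.resultText.trim().length > 0;
	const hasSuccessfulMutation = run.toolCalls.some(
		( call ) => MUTATING_TOOL_NAMES.has( call.name ) && ! call.isError
	);
	// Flag the run only when both signals of useful work are missing.
	return ! hasText && ! hasSuccessfulMutation;
}
```

Under these assumptions, a successful run that only called site_list and returned no text is flagged, while a run with any assistant text or one successful mutating call is not.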
📊 Performance Test Results (041a096 vs trunk): app-size, site-editor, site-startup. Results are median values from multiple test runs. Legend: 🟢 Improvement (faster) | 🔴 Regression (slower) | ⚪ No change (<50ms diff)
const MUTATING_TOOL_NAMES = new Set( [
	'Write',
	'Edit',
	'Bash',
Makes me think we should normalize how we name tools. These camel-case names only look like that because we inherited them from Claude initially.
Seems like a good idea. The Pi harness is a great direction.
One question I had about this PR: do we want to keep the full transcript opt-in, or enable it by default?
It can be argued that if you are doing evals, the transcript is always helpful.
To be honest, I didn't think about evals too much yet (as much as you did haha), so feel free to go in any direction you think is good.
Sounds good. I will try it both ways and see what feels better.
These evals will help a lot as we add more features, more models, and evolve the system prompt.
Summary
- Capture final SDK result metadata in eval-runner output: resultStopReason, resultText, resultErrorMessage, resultUsage, and interrupted. All derived from findLastAssistant(event.messages) on agent_end so they reflect the same source of truth as success.
- Add producedNoUsefulOutput, a heuristic that flags runs where the agent reported success but produced neither assistant text nor a successful mutating-tool call. The motivating failure mode is GPT-5.5 ending a turn after read-only tool calls (site_list, site_info) without writing anything or explaining anything; these runs classify as success today but produce no import report, no assistant text, and no Write / wp_cli activity.
- Add opt-in compact transcript diagnostics behind STUDIO_EVAL_INCLUDE_TRANSCRIPT=1. The transcript records every AgentSessionEvent (including compaction_* and auto_retry_*) with text / tool-result truncation, so default artifacts stay small while deeper debugging remains available when investigating model / runtime / harness regressions.
Why
This continues the eval-runner observability work from #3273 and #3330. Those PRs made phase / tool timings, first tool errors, loop exceptions, and timeouts visible. This adds the final SDK result shape and an opt-in turn transcript so eval consumers can distinguish model behavior, pi-coding-agent harness continuation behavior, runner classification, and downstream benchmark quality gates.
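As a rough sketch of the opt-in transcript, the gating and truncation could look like the following. The TranscriptEntry shape, the 500-character limit, and the helper names are all hypothetical; only the STUDIO_EVAL_INCLUDE_TRANSCRIPT=1 switch comes from this PR. The environment is passed in as a plain object so the sketch stays self-contained.

```typescript
// Hypothetical entry shape and limit; neither is taken from the PR.
interface TranscriptEntry {
	type: string;
	text?: string;
}

const MAX_TRANSCRIPT_TEXT_LENGTH = 500;

// Cut long text and record how much was dropped so readers know it is partial.
function truncateText( text: string, limit: number = MAX_TRANSCRIPT_TEXT_LENGTH ): string {
	if ( text.length <= limit ) {
		return text;
	}
	const omitted = text.length - limit;
	return `${ text.slice( 0, limit ) }... [truncated ${ omitted } chars]`;
}

// The transcript is opt-in: only the exact value '1' enables it.
function shouldIncludeTranscript( env: Record< string, string | undefined > ): boolean {
	return env.STUDIO_EVAL_INCLUDE_TRANSCRIPT === '1';
}

function buildTranscript(
	events: TranscriptEntry[],
	env: Record< string, string | undefined >
): TranscriptEntry[] | undefined {
	if ( ! shouldIncludeTranscript( env ) ) {
		// Default artifacts stay small: no transcript field at all.
		return undefined;
	}
	return events.map( ( event ) =>
		event.text === undefined ? { ...event } : { ...event, text: truncateText( event.text ) }
	);
}
```

In a Node-based runner the second argument would typically be process.env. Returning undefined when the flag is unset lets the field drop out of the JSON artifact entirely rather than appearing as an empty array.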
The need surfaced while testing the Static Site Importer draft path in #3309 with the Studio site-build benchmark. GPT-5.5 repeatedly produced tool-only runs for built-in prompt variants (restaurant, wordpress-is-dead): site_list / site_info returned successfully, then the agent ended the turn with no assistant text, no Write, no wp_cli, and no import report. The eval runner classified the run as successful because the run terminated cleanly. With these diagnostics, those runs are now mechanically detectable as producedNoUsefulOutput: true.
Replaces #3349
This PR replaces #3349, which was written against the pre-#3360 SDKMessage event model. After #3360 landed (AI sessions: adopt pi-coding-agent SessionManager end-to-end), the original PR could not be rebased because every event-handling code path it touched changed shape:
- startAiAgent() async iterator → runStudioAgentTurn() callback
- SDKMessage (assistant / user / result) → AgentSessionEvent (message_end / tool_execution_end / turn_end / agent_end / compaction_* / auto_retry_*)
- result SDKMessage with subtype / stop_reason / result / is_error / errors → no terminal result message at all; the canonical end-state lives on findLastAssistant(event.messages).stopReason plus the helper getAgentEndTurnResult() exposed by the new @studio/common/ai/session-events module
This rewrite is small (+161 / -2 lines, single file) and aligned with the new event surface. Closing #3349 in favor of this PR.
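To illustrate what reading the end-state from the last assistant message means, here is a minimal sketch of deriving the new result fields on agent_end. The SessionMessage shape is a simplified stand-in, findLastAssistantSketch is a hypothetical substitute for the real findLastAssistant helper (whose signature is not shown in this PR), and treating an 'aborted' stop reason as interrupted is an assumption.

```typescript
// Hypothetical, simplified message shape; the real types come from the agent SDK.
interface SessionMessage {
	role: 'user' | 'assistant' | 'toolResult';
	text?: string;
	stopReason?: string;
	errorMessage?: string;
}

interface ResultDiagnostics {
	resultStopReason: string | null;
	resultText: string;
	resultErrorMessage: string | null;
	interrupted: boolean;
}

// Stand-in for the findLastAssistant helper named in the PR.
function findLastAssistantSketch( messages: SessionMessage[] ): SessionMessage | undefined {
	for ( let i = messages.length - 1; i >= 0; i-- ) {
		const message = messages[ i ];
		if ( message?.role === 'assistant' ) {
			return message;
		}
	}
	return undefined;
}

function deriveResultDiagnostics( messages: SessionMessage[] ): ResultDiagnostics {
	const last = findLastAssistantSketch( messages );
	return {
		resultStopReason: last?.stopReason ?? null,
		resultText: last?.text ?? '',
		resultErrorMessage: last?.errorMessage ?? null,
		// Assumption: an 'aborted' stop reason marks an interrupted run.
		interrupted: last?.stopReason === 'aborted',
	};
}
```

Because every field is read from the same message, the diagnostics cannot disagree with one another about how the run ended, which is the point of sharing a source of truth with success.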
Validation
- npx eslint apps/cli/ai/eval-runner.ts: clean
- npm -w wp-studio run typecheck: clean
- npm run cli:build --silent: clean, apps/cli/dist/cli/eval-runner.mjs produced
- Direct smoke run (no tools): producedNoUsefulOutput: false because the assistant produced text.
- Single read-only tool call (studio-agent-site-info.bench.mjs, read-only site_info): tool calls and truncated tool results captured in transcript; producedNoUsefulOutput: false.
- Multi-turn mutating run, with the prompt "Create a new Studio site named X. Then use wp_cli to set the blog title to Eval Validation. Confirm by saying done.": the agent called site_create and wp_cli across 3 turns, generated a real local Studio site (cleaned up afterward), producedNoUsefulOutput: false, resultUsage reported input / output / cache tokens and cost.
- Read-only tool-only run, with the prompt "Use only site_list and respond with nothing else. Do not say anything.": the agent called site_list, produced no text, success: true, producedNoUsefulOutput: true. This is the exact failure mode the heuristic is designed to flag.
- studio-agent-runtime.bench.mjs and studio-agent-site-info.bench.mjs ran end-to-end against the new runner with no breakage. The benches don't yet consume the new fields; that's a separate change in the rig.
AI assistance
- AI assistance: Yes
- Tool(s): Claude Code (Sonnet 4.5)
- Used for: rewriting the diagnostics layer against the new AgentSessionEvent / pi-coding-agent SessionManager model that landed in #3360 (AI sessions: adopt pi-coding-agent SessionManager end-to-end), running the validation matrix above, and drafting this PR description. Chris reviewed the failure-mode evidence and remains responsible for the change.