CLI: add eval-runner result diagnostics by chubes4 · Pull Request #3376 · Automattic/studio

chubes4 · 2026-05-07T15:38:15Z

Summary

Capture final SDK result metadata in eval-runner output: resultStopReason, resultText, resultErrorMessage, resultUsage, and interrupted. All derived from findLastAssistant(event.messages) on agent_end so they reflect the same source of truth as success.
Add producedNoUsefulOutput, a heuristic that flags runs where the agent reported success but produced neither assistant text nor a successful mutating-tool call. The motivating failure mode is GPT-5.5 ending a turn after read-only tool calls (site_list, site_info) without writing anything or explaining anything — runs that classify as success today but produce no import report, no assistant text, and no Write / wp_cli activity.
Add opt-in compact transcript diagnostics behind STUDIO_EVAL_INCLUDE_TRANSCRIPT=1. The transcript records every AgentSessionEvent (including compaction_* and auto_retry_*) with text / tool-result truncation, so default artifacts stay small while deeper debugging remains available when investigating model / runtime / harness regressions.
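For reviewers who want a concrete picture of the opt-in transcript, here is a minimal sketch of the gating and truncation described above. It is not the code in this PR: the `SketchEvent` and `TranscriptEntry` shapes, the character limits, and every function name are assumptions made for illustration. Only the `STUDIO_EVAL_INCLUDE_TRANSCRIPT=1` flag comes from the description.

```typescript
// Hypothetical, simplified event shape. The real AgentSessionEvent type
// comes from the agent SDK and carries more fields than shown here.
interface SketchEvent {
  type: string;
  text?: string;
  toolName?: string;
  result?: string;
}

interface TranscriptEntry {
  type: string;
  text?: string;
  toolName?: string;
  result?: string;
}

// Assumed limits; the actual truncation thresholds are not stated in the PR.
const MAX_TEXT_CHARS = 2000;
const MAX_TOOL_RESULT_CHARS = 500;

// Cut a string down to `limit` characters and note how much was dropped.
function truncate(value: string, limit: number): string {
  if (value.length <= limit) return value;
  return `${value.slice(0, limit)}... [truncated ${value.length - limit} chars]`;
}

// The environment is passed in explicitly so the sketch stays self-contained.
function transcriptEnabled(env: Record<string, string | undefined>): boolean {
  return env["STUDIO_EVAL_INCLUDE_TRANSCRIPT"] === "1";
}

// Keep the event type for every event, and truncate any text or tool result.
function toTranscriptEntry(event: SketchEvent): TranscriptEntry {
  const entry: TranscriptEntry = { type: event.type };
  if (event.text !== undefined) entry.text = truncate(event.text, MAX_TEXT_CHARS);
  if (event.toolName !== undefined) entry.toolName = event.toolName;
  if (event.result !== undefined) entry.result = truncate(event.result, MAX_TOOL_RESULT_CHARS);
  return entry;
}

// Returns undefined when the flag is off, so default artifacts carry no transcript.
function buildTranscript(
  events: SketchEvent[],
  env: Record<string, string | undefined>
): TranscriptEntry[] | undefined {
  if (!transcriptEnabled(env)) return undefined;
  return events.map(toTranscriptEntry);
}
```

In a real runner the second argument would presumably be `process.env`; passing it in also makes the gating easy to exercise in a test.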

Why

This continues the eval-runner observability work from #3273 and #3330. Those PRs made phase / tool timings, first tool errors, loop exceptions, and timeouts visible. This adds the final SDK result shape and an opt-in turn transcript so eval consumers can distinguish model behavior, pi-coding-agent harness continuation behavior, runner classification, and downstream benchmark quality gates.

The need surfaced while testing the Static Site Importer draft path in #3309 with the Studio site-build benchmark. GPT-5.5 repeatedly produced tool-only runs for built-in prompt variants (restaurant, wordpress-is-dead): site_list / site_info returned successfully, then the agent ended the turn with no assistant text, no Write, no wp_cli, and no import report. The eval runner classified the run as successful because the run terminated cleanly. With these diagnostics, those runs are now mechanically detectable as producedNoUsefulOutput: true.
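To make the detection rule concrete, the sketch below restates the heuristic as described: a run is flagged only if it reported success, produced no assistant text, and made no successful mutating-tool call. The `RunSummary` and `ToolCallRecord` shapes and the contents of `SKETCH_MUTATING_TOOLS` are assumptions for illustration. The real `MUTATING_TOOL_NAMES` set in `eval-runner.ts` may list different tools.

```typescript
// Hypothetical record of a single tool call observed during the run.
interface ToolCallRecord {
  name: string;
  isError: boolean;
}

// Hypothetical summary of a finished run.
interface RunSummary {
  success: boolean;
  resultText: string;
  toolCalls: ToolCallRecord[];
}

// Assumed membership, for illustration only. The description names Write and
// wp_cli as mutating activity; the actual set in the PR may differ.
const SKETCH_MUTATING_TOOLS = new Set<string>(["Write", "Edit", "Bash", "wp_cli"]);

function producedNoUsefulOutput(run: RunSummary): boolean {
  // Failed runs are already visible as failures, so only successes are flagged.
  if (!run.success) return false;

  const hasAssistantText = run.resultText.trim().length > 0;

  // A mutating call counts only if it completed without an error.
  const hasSuccessfulMutation = run.toolCalls.some(
    (call) => !call.isError && SKETCH_MUTATING_TOOLS.has(call.name)
  );

  return !hasAssistantText && !hasSuccessfulMutation;
}

// The motivating case: read-only tool calls, clean exit, nothing written or said.
const toolOnlyRun: RunSummary = {
  success: true,
  resultText: "",
  toolCalls: [
    { name: "site_list", isError: false },
    { name: "site_info", isError: false },
  ],
};

console.log(producedNoUsefulOutput(toolOnlyRun)); // true
```

Because the flag is reported separately from `success`, a downstream benchmark can decide for itself whether a flagged run should fail a quality gate.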

Replaces #3349

This PR replaces #3349, which was written against the pre-#3360 SDKMessage event model. After #3360 landed (AI sessions: adopt pi-coding-agent SessionManager end-to-end), the original PR could not be rebased — every event-handling code path it touched changed shape:

startAiAgent() async iterator → runStudioAgentTurn() callback
SDKMessage (assistant / user / result) → AgentSessionEvent (message_end / tool_execution_end / turn_end / agent_end / compaction_* / auto_retry_*)
Final result SDKMessage with subtype / stop_reason / result / is_error / errors → no terminal result message at all; the canonical end-state lives on findLastAssistant(event.messages).stopReason plus the helper getAgentEndTurnResult() exposed by the new @studio/common/ai/session-events module
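As a rough illustration of the last mapping, the sketch below derives the new result fields from the last assistant message in an `agent_end` payload. The message shape, `findLastAssistantSketch`, and the `"aborted"` stop-reason value are hypothetical stand-ins; the real `findLastAssistant()` and `getAgentEndTurnResult()` helpers may behave differently.

```typescript
// Hypothetical usage and message shapes; the SDK's real types are richer.
interface SketchUsage {
  input: number;
  output: number;
}

interface SketchMessage {
  role: "user" | "assistant" | "toolResult";
  text?: string;
  stopReason?: string;
  errorMessage?: string;
  usage?: SketchUsage;
}

interface ResultDiagnostics {
  resultStopReason: string | undefined;
  resultText: string | undefined;
  resultErrorMessage: string | undefined;
  resultUsage: SketchUsage | undefined;
  interrupted: boolean;
}

// Stand-in for findLastAssistant(): walk backwards to the newest assistant message.
function findLastAssistantSketch(messages: SketchMessage[]): SketchMessage | undefined {
  for (let i = messages.length - 1; i >= 0; i--) {
    const message = messages[i];
    if (message !== undefined && message.role === "assistant") return message;
  }
  return undefined;
}

// Every field is read from the same message, so the diagnostics cannot
// disagree with each other about which turn ended the run.
function deriveResultDiagnostics(messages: SketchMessage[]): ResultDiagnostics {
  const last = findLastAssistantSketch(messages);
  return {
    resultStopReason: last?.stopReason,
    resultText: last?.text,
    resultErrorMessage: last?.errorMessage,
    resultUsage: last?.usage,
    // Assumption: an interrupted run surfaces as an "aborted" stop reason.
    interrupted: last?.stopReason === "aborted",
  };
}
```

Since there is no terminal result message anymore, a run with no assistant message at all simply yields undefined fields rather than an error.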

This rewrite is small (+161 / -2 lines, single file) and aligned with the new event surface. Closing #3349 in favor of this PR.

Validation

npx eslint apps/cli/ai/eval-runner.ts — clean
npm -w wp-studio run typecheck — clean
npm run cli:build --silent — clean, apps/cli/dist/cli/eval-runner.mjs produced
Direct smoke run (no tools, identity prompt): all new fields populate correctly; producedNoUsefulOutput: false because the assistant produced text.
Single-tool eval (studio-agent-site-info.bench.mjs, read-only site_info): tool calls and truncated tool results captured in transcript; producedNoUsefulOutput: false.
Multi-turn mutating run with the prompt "Create a new Studio site named X. Then use wp_cli to set the blog title to Eval Validation. Confirm by saying done.": the agent called site_create and wp_cli across 3 turns and generated a real local Studio site (cleaned up afterward); producedNoUsefulOutput: false, and resultUsage reported input / output / cache tokens and cost.
Read-only tool-only run with the prompt "Use only site_list and respond with nothing else. Do not say anything.": the agent called site_list and produced no text, giving success: true and producedNoUsefulOutput: true. This is the exact failure mode the heuristic is designed to flag.
Existing studio-agent-runtime.bench.mjs and studio-agent-site-info.bench.mjs ran end-to-end against the new runner with no breakage. The benches don't yet consume the new fields; that's a separate change in the rig.

AI assistance

AI assistance: Yes
Tool(s): Claude Code (Sonnet 4.5)
Used for: Rewriting the diagnostics layer against the new AgentSessionEvent / pi-coding-agent SessionManager model that landed in #3360 (AI sessions: adopt pi-coding-agent SessionManager end-to-end), running the validation matrix above, and drafting this PR description. Chris reviewed the failure-mode evidence and remains responsible for the change.

Capture final SDK result metadata in eval-runner output: resultStopReason, resultText, resultErrorMessage, resultUsage, and interrupted. All derived from findLastAssistant(event.messages) on agent_end so they reflect the same source of truth as success.

Add producedNoUsefulOutput, a heuristic that flags runs where the agent reported success but produced neither assistant text nor a successful mutating-tool call. The motivating failure mode is GPT-5.5 ending a turn after read-only tool calls (site_list, site_info) without writing anything or explaining anything; such runs classify as success today but produce no import report, no assistant text, and no Write/wp_cli activity.

Add opt-in compact transcript diagnostics behind STUDIO_EVAL_INCLUDE_TRANSCRIPT=1. The transcript records every AgentSessionEvent with text/tool-result truncation to keep default artifacts small while making deeper debugging available when investigating model/runtime/harness regressions.

Replaces #3349 (closed), which was written against the pre-#3360 SDKMessage event model and could not be rebased onto trunk without a full rewrite.

Validation:
- npx eslint apps/cli/ai/eval-runner.ts: clean
- npm -w wp-studio run typecheck: clean
- npm run cli:build --silent: clean
- Direct smoke run (no tools): all new fields populate correctly.
- studio-agent-site-info.bench.mjs (single read-only tool call): tool calls and truncated tool results captured in transcript; producedNoUsefulOutput is false because the assistant produced text.
- Multi-turn mutating run (real site_create + wp_cli, generated and cleaned up a disposable Studio site): producedNoUsefulOutput is false; resultUsage reports input/output/cache tokens and cost.
- Read-only tool-only run (site_list with prompt forbidding any text): producedNoUsefulOutput is true, exactly the failure mode being targeted.

AI assistance:
- AI assistance: Yes
- Tool(s): Claude Code (Sonnet 4.5)
- Used for: rewriting the diagnostics layer against the new AgentSessionEvent / pi-coding-agent SessionManager model that landed in #3360, running the validation matrix above, and drafting this commit and the PR description. Chris reviewed the failure-mode evidence and remains responsible for the change.

wpmobilebot · 2026-05-07T16:03:00Z

📊 Performance Test Results

Comparing 041a096 vs trunk

app-size

Metric	trunk	`041a096`	Diff	Change
App Size (Mac)	1410.38 MB	1410.38 MB	+0.00 MB	⚪ 0.0%

site-editor

Metric	trunk	`041a096`	Diff	Change
load	1528 ms	1494 ms	-34 ms	⚪ 0.0%

site-startup

Metric	trunk	`041a096`	Diff	Change
siteCreation	8075 ms	8111 ms	+36 ms	⚪ 0.0%
siteStartup	4921 ms	4920 ms	-1 ms	⚪ 0.0%

Results are median values from multiple test runs.

Legend: 🟢 Improvement (faster) | 🔴 Regression (slower) | ⚪ No change (<50ms diff)

youknowriad · 2026-05-07T16:51:28Z

+const MUTATING_TOOL_NAMES = new Set( [
+	'Write',
+	'Edit',
+	'Bash',


This makes me think we should normalize how we name tools; these camel-case names are only named that way because we initially inherited them from Claude.

Seems like a good idea. The Pi harness is a great direction.

One question I had about this PR: do we want to keep the full transcript opt-in, or enable it by default?

It can be argued that if you are doing evals, the transcript is always helpful.

To be honest, I haven't thought about evals much yet (not as much as you have, haha), so feel free to go in whichever direction you think is best.

Sounds good. I will try it both ways and see what feels better.

These evals will help a lot as we add more features, more models, and evolve the system prompt.

chubes4 mentioned this pull request May 7, 2026

CLI: add eval-runner result diagnostics #3349

Closed

github-actions Bot assigned chubes4 May 7, 2026

chubes4 requested a review from youknowriad May 7, 2026 15:39

youknowriad reviewed May 7, 2026

chubes4 wants to merge 1 commit into trunk from eval-runner-result-diagnostics-v2
