
CLI: add eval-runner result diagnostics#3376

Open: chubes4 wants to merge 1 commit into trunk from eval-runner-result-diagnostics-v2

@chubes4 (Contributor) commented May 7, 2026

Summary

  • Capture final SDK result metadata in eval-runner output: resultStopReason, resultText, resultErrorMessage, resultUsage, and interrupted. All derived from findLastAssistant(event.messages) on agent_end so they reflect the same source of truth as success.
  • Add producedNoUsefulOutput, a heuristic that flags runs where the agent reported success but produced neither assistant text nor a successful mutating-tool call. The motivating failure mode is GPT-5.5 ending a turn after read-only tool calls (site_list, site_info) without writing anything or explaining anything — runs that classify as success today but produce no import report, no assistant text, and no Write / wp_cli activity.
  • Add opt-in compact transcript diagnostics behind STUDIO_EVAL_INCLUDE_TRANSCRIPT=1. The transcript records every AgentSessionEvent (including compaction_* and auto_retry_*) with text / tool-result truncation, so default artifacts stay small while deeper debugging remains available when investigating model / runtime / harness regressions.
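To illustrate the opt-in transcript idea, here is a minimal sketch of an env-gated recorder with truncation. It is not the code in this PR: `SketchEvent`, `TranscriptEntry`, `createTranscriptRecorder`, and the 200-character limit are hypothetical stand-ins, and the real `AgentSessionEvent` type carries more structure than shown here.

```typescript
// Hypothetical, simplified event shape; the real AgentSessionEvent is richer.
interface SketchEvent {
  type: string; // e.g. "message_end", "tool_execution_end", "turn_end"
  toolName?: string;
  text?: string; // assistant text, when the event carries any
  result?: string; // tool result payload, when the event carries any
  isError?: boolean;
}

interface TranscriptEntry {
  type: string;
  toolName?: string;
  text?: string;
  result?: string;
  isError?: boolean;
}

// Assumed limit, chosen only for this sketch.
const MAX_CHARS = 200;

function truncate(value: string, max: number = MAX_CHARS): string {
  if (value.length <= max) return value;
  return `${value.slice(0, max)}... [truncated ${value.length - max} chars]`;
}

function toTranscriptEntry(event: SketchEvent): TranscriptEntry {
  const entry: TranscriptEntry = { type: event.type };
  if (event.toolName !== undefined) entry.toolName = event.toolName;
  if (event.text !== undefined) entry.text = truncate(event.text);
  if (event.result !== undefined) entry.result = truncate(event.result);
  if (event.isError !== undefined) entry.isError = event.isError;
  return entry;
}

// The environment is passed in so the sketch stays testable.
function createTranscriptRecorder(env: Record<string, string | undefined>) {
  const enabled = env.STUDIO_EVAL_INCLUDE_TRANSCRIPT === "1";
  const entries: TranscriptEntry[] = [];
  return {
    record(event: SketchEvent): void {
      // Every event type is recorded, but only when the flag is set.
      if (enabled) entries.push(toTranscriptEntry(event));
    },
    // Returns undefined when disabled so the field can be left out of the artifact.
    result(): TranscriptEntry[] | undefined {
      return enabled ? entries : undefined;
    },
  };
}
```

In a runner this would presumably be constructed once with `process.env` and fed from the session event callback; with the flag unset, `result()` is `undefined` and the default artifact stays small.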

Why

This continues the eval-runner observability work from #3273 and #3330. Those PRs made phase / tool timings, first tool errors, loop exceptions, and timeouts visible. This adds the final SDK result shape and an opt-in turn transcript so eval consumers can distinguish model behavior, pi-coding-agent harness continuation behavior, runner classification, and downstream benchmark quality gates.

The need surfaced while testing the Static Site Importer draft path in #3309 with the Studio site-build benchmark. GPT-5.5 repeatedly produced tool-only runs for built-in prompt variants (restaurant, wordpress-is-dead): site_list / site_info returned successfully, then the agent ended the turn with no assistant text, no Write, no wp_cli, and no import report. The eval runner classified the run as successful because the run terminated cleanly. With these diagnostics, those runs are now mechanically detectable as producedNoUsefulOutput: true.
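As a rough illustration of the heuristic, the sketch below flags a run that reported success but has neither assistant text nor a successful mutating-tool call. The `RunSummary` shape and `SKETCH_MUTATING_TOOLS` are assumptions made for this example; the actual `MUTATING_TOOL_NAMES` set in the diff is longer than what is visible in the review thread, and the real implementation may define "assistant text" differently.

```typescript
// Assumed subset for illustration only; the real MUTATING_TOOL_NAMES may differ.
const SKETCH_MUTATING_TOOLS = new Set<string>(["Write", "Edit", "Bash", "wp_cli"]);

// Hypothetical summary of a finished run.
interface ToolCallRecord {
  toolName: string;
  isError: boolean;
}

interface RunSummary {
  success: boolean;
  assistantText: string; // assumed: all assistant text produced during the run
  toolCalls: ToolCallRecord[];
}

function producedNoUsefulOutput(run: RunSummary): boolean {
  // Failed runs are already visible through other fields, so only successes are flagged.
  if (!run.success) return false;

  const hasText = run.assistantText.trim().length > 0;
  const hasSuccessfulMutation = run.toolCalls.some(
    (call) => !call.isError && SKETCH_MUTATING_TOOLS.has(call.toolName)
  );

  return !hasText && !hasSuccessfulMutation;
}

// The tool-only failure mode: read-only calls succeed, then the turn ends silently.
const toolOnlyRun: RunSummary = {
  success: true,
  assistantText: "",
  toolCalls: [
    { toolName: "site_list", isError: false },
    { toolName: "site_info", isError: false },
  ],
};
console.log(producedNoUsefulOutput(toolOnlyRun)); // true
```

Treating a failed mutating call as "no mutation" is a deliberate choice in this sketch: a run whose only `wp_cli` call errored and which then said nothing is still flagged.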

Replaces #3349

This PR replaces #3349, which was written against the pre-#3360 SDKMessage event model. After #3360 landed (AI sessions: adopt pi-coding-agent SessionManager end-to-end), the original PR could not be rebased — every event-handling code path it touched changed shape:

  • startAiAgent() async iterator → runStudioAgentTurn() callback
  • SDKMessage (assistant / user / result) → AgentSessionEvent (message_end / tool_execution_end / turn_end / agent_end / compaction_* / auto_retry_*)
  • Final result SDKMessage with subtype / stop_reason / result / is_error / errors → no terminal result message at all; the canonical end-state lives on findLastAssistant(event.messages).stopReason plus the helper getAgentEndTurnResult() exposed by the new @studio/common/ai/session-events module
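The shift from a terminal result message to "read the last assistant message on agent_end" can be sketched as follows. This is a simplified stand-in, not the real module: `SketchMessage`, `SketchUsage`, `findLastAssistantSketch`, and the `"aborted"` stop-reason value are assumptions for illustration, and the actual `findLastAssistant` / `getAgentEndTurnResult()` helpers may have different signatures.

```typescript
// Hypothetical, flattened message shape; real messages carry structured content.
interface SketchUsage {
  input: number;
  output: number;
  cacheRead: number;
  cost: number;
}

interface SketchMessage {
  role: "user" | "assistant" | "toolResult";
  text?: string;
  stopReason?: string;
  errorMessage?: string;
  usage?: SketchUsage;
}

interface ResultDiagnostics {
  resultStopReason: string | undefined;
  resultText: string;
  resultErrorMessage: string | undefined;
  resultUsage: SketchUsage | undefined;
  interrupted: boolean;
}

// Stand-in for findLastAssistant: scan backwards for the final assistant message.
function findLastAssistantSketch(messages: SketchMessage[]): SketchMessage | undefined {
  for (let i = messages.length - 1; i >= 0; i--) {
    const message = messages[i];
    if (message !== undefined && message.role === "assistant") return message;
  }
  return undefined;
}

// Called with event.messages when the agent_end event arrives.
function diagnosticsFromAgentEnd(messages: SketchMessage[]): ResultDiagnostics {
  const last = findLastAssistantSketch(messages);
  return {
    resultStopReason: last?.stopReason,
    resultText: last?.text ?? "",
    resultErrorMessage: last?.errorMessage,
    resultUsage: last?.usage,
    // Assumption: an interrupted run surfaces as an "aborted" stop reason.
    interrupted: last?.stopReason === "aborted",
  };
}
```

Because every field is read from the same message, the diagnostics cannot disagree with each other about which turn ended the run.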

This rewrite is small (+161 / -2 lines, single file) and aligned with the new event surface. Closing #3349 in favor of this PR.

Validation

  • npx eslint apps/cli/ai/eval-runner.ts — clean
  • npm -w wp-studio run typecheck — clean
  • npm run cli:build --silent — clean, apps/cli/dist/cli/eval-runner.mjs produced
  • Direct smoke run (no tools, identity prompt): all new fields populate correctly; producedNoUsefulOutput: false because the assistant produced text.
  • Single-tool eval (studio-agent-site-info.bench.mjs, read-only site_info): tool calls and truncated tool results captured in transcript; producedNoUsefulOutput: false.
  • Multi-turn mutating run: Create a new Studio site named X. Then use wp_cli to set the blog title to Eval Validation. Confirm by saying done. — agent called site_create and wp_cli across 3 turns, generated a real local Studio site (cleaned up afterward), producedNoUsefulOutput: false, resultUsage reported input / output / cache tokens and cost.
  • Read-only tool-only run: Use only site_list and respond with nothing else. Do not say anything. — the agent called site_list, produced no text, success: true, producedNoUsefulOutput: true. This is the exact failure mode the heuristic is designed to flag.
  • Existing studio-agent-runtime.bench.mjs and studio-agent-site-info.bench.mjs ran end-to-end against the new runner with no breakage. The benches don't yet consume the new fields; that's a separate change in the rig.
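For the follow-up rig change, one possible way a bench could consume the new field is sketched below. The `EvalRunnerOutput` subset, the `Verdict` labels, and `classifyRun` are hypothetical; nothing here reflects how the existing benches are written.

```typescript
// Hypothetical subset of the eval-runner output that a bench might read.
interface EvalRunnerOutput {
  success: boolean;
  producedNoUsefulOutput?: boolean; // absent on output from older runners
}

type Verdict = "pass" | "fail" | "empty-success";

function classifyRun(output: EvalRunnerOutput): Verdict {
  if (!output.success) return "fail";
  // A clean termination with nothing written or said is split out from a real pass.
  return output.producedNoUsefulOutput === true ? "empty-success" : "pass";
}
```

Treating a missing field as a pass keeps the gate backward compatible with artifacts produced before this change.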

AI assistance

  • AI assistance: Yes
  • Tool(s): Claude Code (Sonnet 4.5)
  • Used for: Rewriting the diagnostics layer against the new AgentSessionEvent / pi-coding-agent SessionManager model that landed in #3360 (AI sessions: adopt pi-coding-agent SessionManager end-to-end), running the validation matrix above, and drafting this PR description. Chris reviewed the failure-mode evidence and remains responsible for the change.

Capture final SDK result metadata in eval-runner output: resultStopReason,
resultText, resultErrorMessage, resultUsage, and interrupted. All derived from
findLastAssistant(event.messages) on agent_end so they reflect the same source
of truth as success.

Add producedNoUsefulOutput, a heuristic that flags runs where the agent
reported success but produced neither assistant text nor a successful
mutating-tool call. The motivating failure mode is GPT-5.5 ending a turn after
read-only tool calls (site_list, site_info) without writing anything or
explaining anything — runs that classify as success today but produce no
import report, no assistant text, and no Write/wp_cli activity.

Add opt-in compact transcript diagnostics behind STUDIO_EVAL_INCLUDE_TRANSCRIPT=1.
The transcript records every AgentSessionEvent with text/tool-result truncation
to keep default artifacts small while making deeper debugging available when
investigating model/runtime/harness regressions.

Replaces #3349 (closed), which was written against the pre-#3360 SDKMessage
event model and could not be rebased onto trunk without a full rewrite.

Validation:

- npx eslint apps/cli/ai/eval-runner.ts: clean
- npm -w wp-studio run typecheck: clean
- npm run cli:build --silent: clean
- Direct smoke run (no tools): all new fields populate correctly.
- studio-agent-site-info.bench.mjs (single read-only tool call): tool calls and
  truncated tool results captured in transcript; producedNoUsefulOutput is
  false because the assistant produced text.
- Multi-turn mutating run (real site_create + wp_cli, generated and cleaned up
  a disposable Studio site): producedNoUsefulOutput is false; resultUsage
  reports input/output/cache tokens and cost.
- Read-only tool-only run (site_list with prompt forbidding any text):
  producedNoUsefulOutput is true, exactly the failure mode being targeted.

AI assistance:

- AI assistance: Yes
- Tool(s): Claude Code (Sonnet 4.5)
- Used for: rewriting the diagnostics layer against the new
  AgentSessionEvent / pi-coding-agent SessionManager model that landed in
  #3360, running the validation matrix above, and drafting this commit and
  the PR description. Chris reviewed the failure-mode evidence and remains
  responsible for the change.
@wpmobilebot (Collaborator)

📊 Performance Test Results

Comparing 041a096 vs trunk

app-size

| Metric | trunk | 041a096 | Diff | Change |
| --- | --- | --- | --- | --- |
| App Size (Mac) | 1410.38 MB | 1410.38 MB | +0.00 MB | ⚪ 0.0% |

site-editor

| Metric | trunk | 041a096 | Diff | Change |
| --- | --- | --- | --- | --- |
| load | 1528 ms | 1494 ms | -34 ms | ⚪ 0.0% |

site-startup

| Metric | trunk | 041a096 | Diff | Change |
| --- | --- | --- | --- | --- |
| siteCreation | 8075 ms | 8111 ms | +36 ms | ⚪ 0.0% |
| siteStartup | 4921 ms | 4920 ms | -1 ms | ⚪ 0.0% |

Results are median values from multiple test runs.

Legend: 🟢 Improvement (faster) | 🔴 Regression (slower) | ⚪ No change (<50ms diff)

const MUTATING_TOOL_NAMES = new Set( [
'Write',
'Edit',
'Bash',
Contributor

This makes me think we should normalize how we name tools. These camel-case names are only named that way because we initially inherited them from Claude.

Contributor Author

Seems like a good idea. The Pi harness is a great direction.

One question I had about this PR: do we want to keep the full transcript opt-in, or enable it by default?

It can be argued that if you are doing evals, the transcript is always helpful.

Contributor

To be honest, I didn't think about evals too much yet (as much as you did haha), so feel free to go in any direction you think is good.

Contributor Author

Sounds good. I will try it both ways and see what feels better.

These evals will help a lot as we add more features, more models, and evolve the system prompt.

