feat: add execution status classification to eval results (#431) by christso · Pull Request #436 · EntityProcess/agentv

christso · 2026-03-06T01:41:07Z

Summary

- Adds `executionStatus: 'ok' | 'quality_failure' | 'execution_error'` as a required field on `EvaluationResult`, with optional `failureStage`, `failureReasonCode`, and `executionError` for structured error detail
- Classifies errors at each catch site in the orchestrator with appropriate stage/reason codes (`setup`, `repo_setup`, `agent`, `evaluator`)
- Updates summary statistics to exclude execution errors from quality metrics (mean, median, histogram, top/bottom)
- Propagates execution status through trial aggregation
- Preserves the backward-compatible `error` field
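
For readers skimming the summary, here is a minimal sketch of the result shape and of how execution errors might be kept out of the quality metrics. The field names and status values come from the summary above; `EvaluationResultSketch`, `QualitySummary`, `summarizeQuality`, the `score` field, the shape of `executionError`, and the reading of `setup`/`repo_setup`/`agent`/`evaluator` as stage names are assumptions for illustration, not the actual agentv code.

```typescript
// Status values as listed in the PR summary.
type ExecutionStatus = "ok" | "quality_failure" | "execution_error";
// Assumption: the four codes in the summary are failure stages.
type FailureStage = "setup" | "repo_setup" | "agent" | "evaluator";

// Hypothetical, trimmed-down stand-in for EvaluationResult.
interface EvaluationResultSketch {
  score: number; // assumed numeric score field
  executionStatus: ExecutionStatus; // required
  failureStage?: FailureStage; // optional detail
  failureReasonCode?: string; // optional detail
  executionError?: { message: string }; // assumed shape
  error?: string; // legacy field kept for backward compatibility
}

interface QualitySummary {
  total: number;
  executionErrors: number;
  mean: number | undefined;
  median: number | undefined;
}

// Compute mean and median over results that actually produced a score,
// i.e. everything except execution errors.
function summarizeQuality(
  results: readonly EvaluationResultSketch[],
): QualitySummary {
  const scored = results
    .filter((r) => r.executionStatus !== "execution_error")
    .map((r) => r.score)
    .sort((a, b) => a - b);
  const n = scored.length;
  if (n === 0) {
    return {
      total: results.length,
      executionErrors: results.length,
      mean: undefined,
      median: undefined,
    };
  }
  const mean = scored.reduce((sum, x) => sum + x, 0) / n;
  const mid = Math.floor(n / 2);
  const median =
    n % 2 === 1 ? scored[mid] : (scored[mid - 1] + scored[mid]) / 2;
  return {
    total: results.length,
    executionErrors: results.length - n,
    mean,
    median,
  };
}
```

The point of the filter is that a run which never produced an answer (for example, a provider outage) should not drag down the mean as if the model had scored 0.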

Test Plan

- 6 new dedicated execution status tests (provider errors, high/low scores, boundary conditions, backward compat)
- 6 existing orchestrator tests updated with `executionStatus` assertions
- Baseline test fixture updated with required field
- 855/856 core tests pass (1 pre-existing flaky test unrelated to this PR)
- TypeScript typecheck clean across core and CLI packages
- Biome lint clean

Follow-up Issues

- #433: Add `--retry-errors` CLI flag to re-run only execution errors
- #434: Add `fail_on_error` tolerance config for eval runs
- #435: Track retried transient errors in eval results for diagnostics

🤖 Generated with Claude Code

Closes #431

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…431)

Add executionStatus, failureStage, failureReasonCode, and executionError assertions to representative existing tests:
- Success path: assert executionStatus === 'ok'
- Provider throw: assert execution_error with agent stage
- Provider raw.error: assert execution_error with provider_error code
- Setup script failure: assert execution_error with setup stage
- Successful workspace scripts: assert ok status

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
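
The assertions listed in this commit imply a catch-site pattern roughly like the sketch below. It is not the orchestrator code: `ExecutionFailure`, `classifyExecutionError`, and `runStage` are hypothetical names, the real orchestrator is presumably asynchronous, and `executionError` is assumed to carry only a message. The legacy `error` string is filled in alongside `executionError`, matching the backward-compatibility note in the PR summary.

```typescript
// Assumption: these four codes from the PR summary are the failure stages.
type FailureStage = "setup" | "repo_setup" | "agent" | "evaluator";

// Hypothetical shape of the fields attached to a failed result.
interface ExecutionFailure {
  executionStatus: "execution_error";
  failureStage: FailureStage;
  failureReasonCode: string;
  executionError: { message: string }; // assumed shape
  error: string; // legacy field, still populated
}

// Turn anything thrown at a catch site into structured failure detail.
function classifyExecutionError(
  stage: FailureStage,
  reasonCode: string,
  err: unknown,
): ExecutionFailure {
  const message = err instanceof Error ? err.message : String(err);
  return {
    executionStatus: "execution_error",
    failureStage: stage,
    failureReasonCode: reasonCode,
    executionError: { message },
    error: message,
  };
}

// Synchronous for brevity: run one stage and classify whatever it throws.
function runStage<T>(
  stage: FailureStage,
  reasonCode: string,
  fn: () => T,
): T | ExecutionFailure {
  try {
    return fn();
  } catch (err) {
    return classifyExecutionError(stage, reasonCode, err);
  }
}
```

A caller would wrap each phase separately, for example `runStage("setup", "script_failed", runSetupScript)`, where both the reason code and `runSetupScript` are placeholders, so that the stage recorded on the result reflects where the failure actually happened.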

New test file covering all executionStatus classification paths:
- Provider throw -> execution_error with agent stage
- High score (>=0.8) -> ok
- Low score (<0.8) -> quality_failure
- Backward compatibility: error field still set alongside executionError
- Threshold boundary tests at exactly 0.8 and 0.79

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
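
The score boundary these tests exercise can be sketched as follows. The constant and helper names (`QUALITY_PASS_THRESHOLD`, `classifyQualityStatus`) and the 0.8 value appear in the commit messages of this PR; the signature and the `QualityStatus` type are assumptions.

```typescript
// Threshold value taken from the commit messages; inclusive at exactly 0.8.
const QUALITY_PASS_THRESHOLD = 0.8;

// Only the two statuses that a scored (non-errored) result can have.
type QualityStatus = "ok" | "quality_failure";

// Assumed signature: map a numeric score to a quality status.
function classifyQualityStatus(score: number): QualityStatus {
  return score >= QUALITY_PASS_THRESHOLD ? "ok" : "quality_failure";
}
```

Under this rule a score of exactly 0.8 is `ok` and 0.79 is `quality_failure`, which are the two boundary cases the test file covers.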

Add required executionStatus: 'ok' to makeFullResult() in baseline.test.ts to match the updated EvaluationResult type which now requires this field.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…egation

- Extract 0.8 threshold into QUALITY_PASS_THRESHOLD constant with classifyQualityStatus helper to prevent threshold drift
- Scope byFailureStage/byFailureReason aggregation to execution_error results only (was iterating all results)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
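
The second fix in this commit, counting failure stages and reasons only for `execution_error` results, might look roughly like this. `byFailureStage` and `byFailureReason` are named in the commit message; `ClassifiedResult`, `FailureBreakdown`, `countFailures`, and the use of plain string-keyed records are assumptions.

```typescript
// Hypothetical minimal view of a result for aggregation purposes.
interface ClassifiedResult {
  executionStatus: "ok" | "quality_failure" | "execution_error";
  failureStage?: string;
  failureReasonCode?: string;
}

interface FailureBreakdown {
  byFailureStage: Record<string, number>;
  byFailureReason: Record<string, number>;
}

// Count stages and reason codes, skipping anything that is not an
// execution error so stray fields on other results are never counted.
function countFailures(results: readonly ClassifiedResult[]): FailureBreakdown {
  const byFailureStage: Record<string, number> = {};
  const byFailureReason: Record<string, number> = {};
  for (const r of results) {
    if (r.executionStatus !== "execution_error") continue;
    if (r.failureStage !== undefined) {
      byFailureStage[r.failureStage] =
        (byFailureStage[r.failureStage] ?? 0) + 1;
    }
    if (r.failureReasonCode !== undefined) {
      byFailureReason[r.failureReasonCode] =
        (byFailureReason[r.failureReasonCode] ?? 0) + 1;
    }
  }
  return { byFailureStage, byFailureReason };
}
```

Filtering on `executionStatus` first, rather than on the presence of `failureStage`, keeps the breakdown consistent with the execution error count even if a non-error result happens to carry failure fields.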


cloudflare-workers-and-pages · 2026-03-06T01:41:54Z

Deploying agentv with Cloudflare Pages

Latest commit:	`015b77d`
Status:	✅ Deploy successful!
Preview URL:	https://34345a56.agentv.pages.dev
Branch Preview URL:	https://feat-execution-status-431.agentv.pages.dev


christso and others added 9 commits March 6, 2026 00:23

feat: add ExecutionStatus, FailureStage types to EvaluationResult (#431)

2324584

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat: classify execution errors at each orchestrator catch site (#431)

1e23df1

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat: propagate executionStatus through trial aggregation (#431)

091a67d

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat: separate execution errors from quality metrics in summary (#431)

1e23b27

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: add executionStatus to baseline test fixture (#431)

b56dc9c

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

style: fix biome formatting in statistics and execution-status test

015b77d

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

This was referenced Mar 6, 2026

Fix flaky defineCodeJudge test failing in full suite #437

Closed

Add explicit tooling/execution-failure status separate from model score #431

Closed

christso merged commit 7ae9034 into main Mar 6, 2026
1 check passed

christso deleted the feat/execution-status-431 branch March 6, 2026 02:46
