feat: add execution status classification to eval results (#431) #436

Merged
christso merged 9 commits into main from feat/execution-status-431
Mar 6, 2026

@christso commented Mar 6, 2026

Summary

  • Adds executionStatus: 'ok' | 'quality_failure' | 'execution_error' as a required field on EvaluationResult, with optional failureStage, failureReasonCode, and executionError for structured error detail
  • Classifies errors at each catch site in the orchestrator with appropriate stage/reason codes (setup, repo_setup, agent, evaluator)
  • Updates summary statistics to exclude execution errors from quality metrics (mean, median, histogram, top/bottom)
  • Propagates execution status through trial aggregation
  • Preserves backward-compatible error field
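
A minimal sketch of how the fields listed above might fit together. Only the field names, the three status values, and the stage names come from this summary; the `score` field, the shape of `executionError`, and the helpers `toExecutionError` and `meanQualityScore` are hypothetical and are not taken from the actual diff.

```typescript
type ExecutionStatus = 'ok' | 'quality_failure' | 'execution_error';

// Assumed result shape: field names follow the PR summary, field types are guesses.
interface EvaluationResult {
  score: number; // assumed name for the quality score
  executionStatus: ExecutionStatus; // required
  failureStage?: string; // e.g. 'setup', 'repo_setup', 'agent', 'evaluator'
  failureReasonCode?: string; // e.g. 'provider_error'
  executionError?: { message: string }; // assumed shape
  error?: string; // legacy field kept for backward compatibility
}

// Hypothetical helper a catch site could call to build a classified result.
function toExecutionError(
  err: unknown,
  stage: string,
  reasonCode: string,
): EvaluationResult {
  const message = err instanceof Error ? err.message : String(err);
  return {
    score: 0,
    executionStatus: 'execution_error',
    failureStage: stage,
    failureReasonCode: reasonCode,
    executionError: { message },
    error: message, // still populated alongside executionError
  };
}

// Mean over results that actually produced a quality score;
// execution errors are left out so tooling failures do not drag it down.
function meanQualityScore(results: EvaluationResult[]): number | undefined {
  const scored = results.filter((r) => r.executionStatus !== 'execution_error');
  if (scored.length === 0) return undefined;
  return scored.reduce((sum, r) => sum + r.score, 0) / scored.length;
}
```

With one passing result, one quality failure, and one execution error, the mean in this sketch is taken over the first two only.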

Test Plan

  • 6 new dedicated execution status tests (provider errors, high/low scores, boundary conditions, backward compat)
  • 6 existing orchestrator tests updated with executionStatus assertions
  • Baseline test fixture updated with required field
  • 855/856 core tests pass (1 pre-existing flaky test unrelated to this PR)
  • TypeScript typecheck clean across core and CLI packages
  • Biome lint clean

Follow-up Issues

🤖 Generated with Claude Code

Closes #431

christso and others added 9 commits March 6, 2026 00:23
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…431)

Add executionStatus, failureStage, failureReasonCode, and executionError
assertions to representative existing tests:
- Success path: assert executionStatus === 'ok'
- Provider throw: assert execution_error with agent stage
- Provider raw.error: assert execution_error with provider_error code
- Setup script failure: assert execution_error with setup stage
- Successful workspace scripts: assert ok status

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New test file covering all executionStatus classification paths:
- Provider throw -> execution_error with agent stage
- High score (>=0.8) -> ok
- Low score (<0.8) -> quality_failure
- Backward compatibility: error field still set alongside executionError
- Threshold boundary tests at exactly 0.8 and 0.79
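
The threshold rule these tests exercise can be sketched as below. The names `QUALITY_PASS_THRESHOLD` and `classifyQualityStatus` appear in a later commit message of this PR, but the signature and body here are assumptions rather than the merged code.

```typescript
// Pass threshold stated in the commit message: scores >= 0.8 count as ok.
const QUALITY_PASS_THRESHOLD = 0.8;

// Assumed signature: maps a score to one of the two non-error statuses.
// 'execution_error' is assigned at the catch sites, not derived from the score.
function classifyQualityStatus(score: number): 'ok' | 'quality_failure' {
  return score >= QUALITY_PASS_THRESHOLD ? 'ok' : 'quality_failure';
}
```

In this sketch a score of exactly 0.8 classifies as `ok` and 0.79 as `quality_failure`, matching the boundary cases listed above.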

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add required executionStatus: 'ok' to makeFullResult() in baseline.test.ts
to match the updated EvaluationResult type which now requires this field.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…egation

- Extract 0.8 threshold into QUALITY_PASS_THRESHOLD constant with
  classifyQualityStatus helper to prevent threshold drift
- Scope byFailureStage/byFailureReason aggregation to execution_error
  results only (was iterating all results)
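
A rough sketch of the scoped aggregation described in the second bullet. The `byFailureStage` and `byFailureReason` names come from the commit message; the `ResultLike` and `FailureBreakdown` types and the `aggregateFailures` function are hypothetical.

```typescript
type ExecutionStatus = 'ok' | 'quality_failure' | 'execution_error';

// Assumed minimal view of a result; only the fields needed for counting.
interface ResultLike {
  executionStatus: ExecutionStatus;
  failureStage?: string;
  failureReasonCode?: string;
}

interface FailureBreakdown {
  byFailureStage: Record<string, number>;
  byFailureReason: Record<string, number>;
}

// Counts stages and reason codes, looking only at execution_error results.
function aggregateFailures(results: ResultLike[]): FailureBreakdown {
  const breakdown: FailureBreakdown = { byFailureStage: {}, byFailureReason: {} };
  for (const r of results) {
    // Skip ok and quality_failure results entirely.
    if (r.executionStatus !== 'execution_error') continue;
    if (r.failureStage !== undefined) {
      breakdown.byFailureStage[r.failureStage] =
        (breakdown.byFailureStage[r.failureStage] ?? 0) + 1;
    }
    if (r.failureReasonCode !== undefined) {
      breakdown.byFailureReason[r.failureReasonCode] =
        (breakdown.byFailureReason[r.failureReasonCode] ?? 0) + 1;
    }
  }
  return breakdown;
}
```

Filtering first means a stray `failureStage` on a non-error result cannot inflate the counts.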

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cloudflare-workers-and-pages

Deploying agentv with Cloudflare Pages

Latest commit: 015b77d
Status: ✅  Deploy successful!
Preview URL: https://34345a56.agentv.pages.dev
Branch Preview URL: https://feat-execution-status-431.agentv.pages.dev

@christso christso merged commit 7ae9034 into main Mar 6, 2026
1 check passed
@christso christso deleted the feat/execution-status-431 branch March 6, 2026 02:46

Development

Successfully merging this pull request may close these issues.

Add explicit tooling/execution-failure status separate from model score
