Add explicit tooling/execution-failure status separate from model score #431
Closed
Problem
In eval outputs, low scores can come from non-LLM causes (missing toolchain/runtime, repo/environment constraints), but currently such scores are easy to misread as a model-quality or judge-quality problem.
Example from CargoWise customs evals:
- Criteria required "Builds successfully with all tests passing"
- Agent made correct code/test edits
- Environment lacked `dotnet`, so build verification could not run
- Score dropped (e.g., 0.67) even though core coding behavior was largely correct
Industry Research
Surveyed how Inspect AI (UK AISI), Braintrust, W&B Weave, DeepEval, OpenAI Evals, and LangSmith handle this separation. Key findings:
- All mature frameworks keep scores and errors structurally separate.
- Inspect AI has the most architecturally complete model: `EvalError`, `EvalSampleLimit` with typed enums, and `ToolCallError` with a typed `type` for recoverable tool errors. Limits and errors are tracked as separate fields on `EvalSample`.
- W&B Weave uses a discriminated union for evaluation status plus `TraceStatus` at the call level.
- DeepEval tracks errors at the metric level, separate from `score` and `success`.
- Braintrust keeps `error` and `scores` as separate top-level fields.
Implementation (PR #436)
Result-level fields
```typescript
// Required — present on every result record
executionStatus: 'ok' | 'quality_failure' | 'execution_error';

// Optional — only when executionStatus is 'execution_error'
failureStage?: 'setup' | 'repo_setup' | 'agent' | 'evaluator' | 'teardown';
failureReasonCode?: string; // e.g. 'provider_error', 'template_error', 'script_error', 'clone_error', 'budget_exceeded'

// Structured error detail
executionError?: { message: string; stage: FailureStage; };
```

Classification logic
- Errors are classified at the catch site in the orchestrator with the appropriate stage/reason:
  - Workspace creation errors → `setup`/`template_error`
  - Repo materialization errors → `repo_setup`/`clone_error`
  - Script errors → `setup`/`script_error`
  - Provider errors → `agent`/`provider_error`
  - Evaluator errors → `evaluator`/`evaluator_error`
  - Budget exceeded → `setup`/`budget_exceeded`
- Successful results use `QUALITY_PASS_THRESHOLD` (0.8): score >= 0.8 → `ok`, below → `quality_failure`
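The classification above can be sketched in TypeScript. This is illustrative, not the actual PR #436 code; the type and function names (`classifyScore`, `classifyError`) are assumptions, while the field names and values mirror the spec above.

```typescript
// Assumed names mirroring the result-level fields; not taken from PR #436.
type FailureStage = 'setup' | 'repo_setup' | 'agent' | 'evaluator' | 'teardown';
type ExecutionStatus = 'ok' | 'quality_failure' | 'execution_error';

const QUALITY_PASS_THRESHOLD = 0.8;

// Successful runs are classified by score alone.
function classifyScore(score: number): ExecutionStatus {
  return score >= QUALITY_PASS_THRESHOLD ? 'ok' : 'quality_failure';
}

// Catch-site classification: the orchestrator knows which stage it was in
// and attaches a reason code plus a structured error detail.
function classifyError(stage: FailureStage, reasonCode: string, err: Error) {
  return {
    executionStatus: 'execution_error' as const,
    failureStage: stage,
    failureReasonCode: reasonCode,
    executionError: { message: err.message, stage },
  };
}
```

Note that the score-based split only ever applies to runs that completed; an error caught at any stage short-circuits straight to `execution_error`.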
Score handling
Scores remain 0 for execution errors (backward-compatible), but execution errors are excluded from quality metrics (mean, median, histogram, top/bottom results) in summary aggregation.
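A minimal sketch of the exclusion rule, assuming a pared-down result shape (the field names follow the spec above; the helper name `meanQualityScore` is hypothetical):

```typescript
interface ResultRecord {
  score: number; // stays 0 for execution errors (backward-compatible)
  executionStatus: 'ok' | 'quality_failure' | 'execution_error';
}

// Quality metrics are computed only over non-error results, so an
// environment failure cannot drag down the mean.
function meanQualityScore(results: ResultRecord[]): number {
  const quality = results.filter(r => r.executionStatus !== 'execution_error');
  if (quality.length === 0) return 0; // no quality results to average
  return quality.reduce((sum, r) => sum + r.score, 0) / quality.length;
}
```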
Trial aggregation
- Any trial `ok` → aggregate `ok`
- All trials `execution_error` → aggregate `execution_error`
- Otherwise → `quality_failure`
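The three rules above amount to a small precedence check, sketched here (function name assumed):

```typescript
type ExecutionStatus = 'ok' | 'quality_failure' | 'execution_error';

// Precedence: any ok wins, then all-errors, then quality_failure.
function aggregateTrials(statuses: ExecutionStatus[]): ExecutionStatus {
  if (statuses.some(s => s === 'ok')) return 'ok';
  if (statuses.length > 0 && statuses.every(s => s === 'execution_error')) {
    return 'execution_error';
  }
  return 'quality_failure';
}
```

The order matters: a test that passed on even one trial is genuinely passable, and a test that only ever hit infrastructure errors should not be reported as a quality failure.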
Summary output
```
Total tests: 10
Passed: 5
Quality failures: 3
Execution errors: 2
Mean score: 0.750 (8 quality tests, 2 execution errors excluded)
Execution errors by stage:
  agent: 1
  setup: 1
Execution errors by reason:
  provider_error: 1
  template_error: 1
```
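The per-stage and per-reason tallies in the summary can be produced with a simple grouping pass. A sketch under assumed names (`countBy`, `ExecutionErrorRecord` are illustrative):

```typescript
interface ExecutionErrorRecord {
  failureStage: string;
  failureReasonCode: string;
}

// Tally execution errors by a chosen field for the summary sections.
function countBy(
  errors: ExecutionErrorRecord[],
  key: keyof ExecutionErrorRecord,
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const e of errors) {
    counts[e[key]] = (counts[e[key]] ?? 0) + 1;
  }
  return counts;
}
```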
Follow-up Issues
- Add `--retry-errors` CLI flag to re-run only execution errors (#433)
- Add `fail_on_error` tolerance config for eval runs (#434)
- Track retried transient errors in eval results for diagnostics (#435)
Acceptance Signals
- Result records include an `executionStatus` field on every result
- Summary output separates quality failures from execution errors
- Execution errors excluded from quality metrics (mean, median, histogram)
- Trial aggregation propagates `executionStatus` correctly
- E2E verified: provider errors produce `execution_error` with correct stage/reason
Non-Goals
- Classifying every possible error — `failureReasonCode` is extensible, not exhaustive
- Automatic root-cause analysis of errors
- Changing how quality failures (assertion/scoring issues) are evaluated
- Setting score to `null` for execution errors (kept as 0 for backward compatibility)