
Add explicit tooling/execution-failure status separate from model score #431

@christso

Description

Problem

In eval outputs, low scores can come from non-LLM causes (missing toolchain/runtime, repo/environment constraints), but this is currently easy to misread as a model-quality or judge-quality problem.

Example from CargoWise customs evals:

  • Criteria required "Builds successfully with all tests passing"
  • Agent made correct code/test edits
  • Environment lacked dotnet, so build verification could not run
  • Score dropped (e.g., 0.67) even though core coding behavior was largely correct

Industry Research

Surveyed how Inspect AI (UK AISI), Braintrust, W&B Weave, DeepEval, OpenAI Evals, and LangSmith handle this separation. Key findings:

  • All mature frameworks keep scores and errors structurally separate.
  • Inspect AI has the most architecturally complete model: EvalError, EvalSampleLimit with typed enums, and ToolCallError with a typed error-type field for recoverable tool errors. Limits and errors are tracked as separate fields on EvalSample.
  • W&B Weave uses a discriminated union for evaluation status plus TraceStatus at the call level.
  • DeepEval tracks errors at the metric level separate from score and success.
  • Braintrust keeps error and scores as separate top-level fields.

Implementation (PR #436)

Result-level fields

```ts
type FailureStage = 'setup' | 'repo_setup' | 'agent' | 'evaluator' | 'teardown';

// Required — present on every result record
executionStatus: 'ok' | 'quality_failure' | 'execution_error';

// Optional — only set when executionStatus is 'execution_error'
failureStage?: FailureStage;
failureReasonCode?: string; // e.g. 'provider_error', 'template_error', 'script_error', 'clone_error', 'budget_exceeded'

// Structured error detail
executionError?: { message: string; stage: FailureStage };
```
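For illustration, a failed result might carry these fields as follows (the record shape, identifier, and error message here are hypothetical; only the new field names come from the PR):

```typescript
// Hypothetical result record showing the new fields in use.
// testId, score, and the error message are invented for illustration.
const failedResult = {
  testId: 'cargowise-customs-01',          // hypothetical identifier
  score: 0,                                // kept at 0 for backward compatibility
  executionStatus: 'execution_error' as const,
  failureStage: 'agent' as const,
  failureReasonCode: 'provider_error',
  executionError: {
    message: 'provider returned 503 after 3 retries',
    stage: 'agent' as const,
  },
};
```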

Classification logic

  • Errors are classified at the catch site in the orchestrator with appropriate stage/reason:
    • Workspace creation errors → setup / template_error
    • Repo materialization errors → repo_setup / clone_error
    • Script errors → setup / script_error
    • Provider errors → agent / provider_error
    • Evaluator errors → evaluator / evaluator_error
    • Budget exceeded → setup / budget_exceeded
  • Successful results use QUALITY_PASS_THRESHOLD (0.8): score >= 0.8 → ok, below → quality_failure
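The rules above can be sketched as two small helpers (classifyScore, classifyError, and the catch-site labels are hypothetical names, not the PR's actual code):

```typescript
type ExecutionStatus = 'ok' | 'quality_failure' | 'execution_error';
type FailureStage = 'setup' | 'repo_setup' | 'agent' | 'evaluator' | 'teardown';

const QUALITY_PASS_THRESHOLD = 0.8;

// Classify a run that completed without execution errors by its score.
function classifyScore(score: number): ExecutionStatus {
  return score >= QUALITY_PASS_THRESHOLD ? 'ok' : 'quality_failure';
}

// Map a catch site to its stage/reason pair, mirroring the bullets above.
function classifyError(
  site: 'workspace' | 'repo' | 'script' | 'provider' | 'evaluator' | 'budget'
): { stage: FailureStage; reason: string } {
  const table = {
    workspace: { stage: 'setup' as const, reason: 'template_error' },
    repo: { stage: 'repo_setup' as const, reason: 'clone_error' },
    script: { stage: 'setup' as const, reason: 'script_error' },
    provider: { stage: 'agent' as const, reason: 'provider_error' },
    evaluator: { stage: 'evaluator' as const, reason: 'evaluator_error' },
    budget: { stage: 'setup' as const, reason: 'budget_exceeded' },
  };
  return table[site];
}
```

Keeping the mapping in one table makes it easy to verify each catch site lands on exactly one stage/reason pair.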

Score handling

Scores remain 0 for execution errors (backward-compatible), but execution errors are excluded from quality metrics (mean, median, histogram, top/bottom results) in summary aggregation.
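The exclusion rule for mean score can be sketched like this (meanQualityScore and the ResultLike shape are assumptions, not the PR's actual code):

```typescript
interface ResultLike {
  score: number;
  executionStatus: 'ok' | 'quality_failure' | 'execution_error';
}

// Execution errors keep score 0 in the record, but are filtered out
// before computing quality metrics so they don't drag the mean down.
function meanQualityScore(results: ResultLike[]): number | null {
  const quality = results.filter(r => r.executionStatus !== 'execution_error');
  if (quality.length === 0) return null;
  return quality.reduce((sum, r) => sum + r.score, 0) / quality.length;
}
```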

Trial aggregation

  • Any trial ok → aggregate ok
  • All trials execution_error → aggregate execution_error
  • Otherwise → quality_failure
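These three rules map directly onto a small function (aggregateTrials is a hypothetical name for the PR's aggregation step):

```typescript
type ExecutionStatus = 'ok' | 'quality_failure' | 'execution_error';

// Collapse per-trial statuses into one test-level status:
// any ok wins; all execution_error stays execution_error;
// anything else is a quality failure.
function aggregateTrials(statuses: ExecutionStatus[]): ExecutionStatus {
  if (statuses.some(s => s === 'ok')) return 'ok';
  if (statuses.every(s => s === 'execution_error')) return 'execution_error';
  return 'quality_failure';
}
```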

Summary output

Total tests: 10
Passed: 5
Quality failures: 3
Execution errors: 2
Mean score: 0.750 (8 quality tests, 2 execution errors excluded)

Execution errors by stage:
  agent: 1
  setup: 1

Execution errors by reason:
  provider_error: 1
  template_error: 1
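The by-stage and by-reason tallies can be produced with a generic counter like this (tallyBy is a hypothetical helper, not code from the PR):

```typescript
// Count items by an arbitrary string key, e.g. failureStage or
// failureReasonCode of execution-error results.
function tallyBy<T>(items: T[], key: (item: T) => string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const item of items) {
    const k = key(item);
    counts[k] = (counts[k] ?? 0) + 1;
  }
  return counts;
}
```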

Follow-up Issues

Acceptance Signals

  • Result records include executionStatus field on every result
  • Summary output separates quality failures from execution errors
  • Execution errors excluded from quality metrics (mean, median, histogram)
  • Trial aggregation propagates executionStatus correctly
  • E2E verified: provider errors produce execution_error with correct stage/reason

Non-Goals

  • Classifying every possible error — failureReasonCode is extensible, not exhaustive
  • Automatic root-cause analysis of errors
  • Changing how quality failures (assertion/scoring issues) are evaluated
  • Setting score to null for execution errors (kept as 0 for backward compatibility)
