feat: skill-eval companion artifacts (grading, timing, benchmark) #565

@christso

Description

Objective

Add an output format or post-processing command that produces structured companion artifacts — grading/<test-id>.json, timing.json, and benchmark.json — from existing JSONL eval results, following Anthropic's skill-creator JSON schemas. These artifacts make eval iterations inspectable, resumable, and compatible with skill-creator's tooling ecosystem.

Architecture Boundary

external-first (output format / post-processing)

Why not core-runtime: The JSONL output already contains all the data needed to derive these artifacts — per-evaluator scores with reasoning/evidence text (scores[].reasoning, scores[].hits, scores[].misses), timing (duration_ms, start_time, end_time, per-evaluator startedAt/endedAt/durationMs), and token usage (token_usage.input/output/cached, per-evaluator token usage). No runtime-only data is required. Per CLAUDE.md's Lightweight Core principle ("Can this be achieved with existing primitives + a plugin or wrapper? If yes, it should not be a built-in"), this should be a post-processing wrapper or output format option, not a core engine change.

Context

Anthropic's skill-creator defines explicit JSON schemas for evaluation artifacts (grading.json, timing.json, benchmark.json). AgentV currently outputs JSONL results that contain all the raw data, but not in the structured artifact format that skill-creator's tooling ecosystem expects.

The agentv-eval-orchestrator skill already generates benchmark.json in Agent Skills format. This issue extends that pattern to cover the full artifact bundle as a reusable output format.

Implementation Approaches (choose one)

  1. New OutputWriter format: agentv eval run --format artifacts -o ./results/ — writes companion artifacts alongside JSONL
  2. Post-processing command: agentv results export ./results.jsonl --format grading|timing|benchmark — derives artifacts from existing JSONL
  3. Both: Write during eval run as a format option, and provide a standalone export for existing results

Target Schemas

grading/<test-id>.json

Maps from JSONL fields: scores[].reasoning → expectations[].evidence, scores[].hits/misses → expectations[].text/passed

{
  "expectations": [
    { "text": "Response correctly identifies the root cause", "passed": true, "evidence": "Output states 'The root cause is...'" },
    { "text": "Response proposes a concrete fix", "passed": false, "evidence": "No fix is proposed in the output" }
  ],
  "summary": { "passed": 1, "failed": 1, "total": 2, "pass_rate": 0.50 },
  "execution_metrics": { "tool_calls": {}, "total_tool_calls": 6, "errors_encountered": 0 }
}
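The per-test mapping above can be sketched as follows. The interfaces are hypothetical shapes inferred from the JSONL field names in this issue (`scores[].reasoning`, `scores[].hits`, `scores[].misses`); the actual result schema may differ.

```typescript
// Hypothetical JSONL evaluator-score shape (assumption, not the shipped schema).
interface EvaluatorScore {
  reasoning: string;
  hits: string[];   // expectation texts the response satisfied
  misses: string[]; // expectation texts the response failed
}

interface GradingExpectation {
  text: string;
  passed: boolean;
  evidence: string;
}

// Derive the grading/<test-id>.json body from one test's evaluator scores.
function deriveGrading(scores: EvaluatorScore[]) {
  const expectations: GradingExpectation[] = scores.flatMap((s) => [
    ...s.hits.map((text) => ({ text, passed: true, evidence: s.reasoning })),
    ...s.misses.map((text) => ({ text, passed: false, evidence: s.reasoning })),
  ]);
  const passed = expectations.filter((e) => e.passed).length;
  const total = expectations.length;
  return {
    expectations,
    summary: {
      passed,
      failed: total - passed,
      total,
      pass_rate: total > 0 ? passed / total : 0,
    },
  };
}
```

Using each evaluator's `reasoning` as the `evidence` string is one option; per-hit evidence would be finer-grained if the JSONL carries it.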

timing.json

Maps from JSONL fields: aggregate token_usage and duration_ms across all results

{
  "total_tokens": 84852,
  "duration_ms": 45200,
  "total_duration_seconds": 45.2,
  "token_usage": { "input": 52000, "output": 18000, "cached": 14852 }
}

benchmark.json

Maps from JSONL fields: aggregate score, duration_ms, token_usage per target with mean/stddev across test cases within the run

{
  "metadata": { "eval_file": "...", "timestamp": "...", "targets": ["..."], "tests_run": ["..."] },
  "run_summary": {
    "<target>": {
      "pass_rate": { "mean": 0.85, "stddev": 0.05 },
      "time_seconds": { "mean": 45.0, "stddev": 12.0 },
      "tokens": { "mean": 3800, "stddev": 400 }
    }
  },
  "notes": []
}

Note: stddev is computed across test cases within a single run (e.g., the score spread across 10 tests for a target), not across multiple runs.
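The per-target statistics reduce to a mean/stddev over the per-test values within one run. A sketch (population stddev shown; whether skill-creator expects population or sample stddev is not specified here and would need checking):

```typescript
// Mean and population standard deviation over per-test values
// (e.g., pass rates, durations, or token counts for one target).
function meanStddev(values: number[]): { mean: number; stddev: number } {
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance =
    values.reduce((a, v) => a + (v - mean) ** 2, 0) / values.length;
  return { mean, stddev: Math.sqrt(variance) };
}
```

Each `run_summary.<target>` entry would then be built by applying this to the target's per-test series.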

Acceptance Signals

  • Companion artifacts can be produced from any existing JSONL results file (no re-run required)
  • grading/<test-id>.json maps JSONL per-evaluator data to skill-creator's expectations/evidence format
  • timing.json aggregates per-result timing and token data from JSONL
  • benchmark.json computes cross-test statistics per target from JSONL
  • Schemas are compatible with Anthropic's skill-creator conventions (same field names and structure where they overlap)
  • Existing JSONL output and eval configs are unchanged
  • Follows existing OutputWriter interface pattern (see apps/cli/src/commands/eval/jsonl-writer.ts)

Interoperability Constraint

Artifacts must be readable by skill-creator's tooling (eval-viewer/generate_review.py, analyzer scripts). This means:

Non-Goals

Related

Research Basis

AgentV Extensions Beyond Skill-Creator Schemas

The companion artifacts should be supersets of skill-creator's schemas — compatible with skill-creator tooling but extended for AgentV's broader evaluation scope.

grading/<test-id>.json extensions

Skill-creator's grading only covers LLM-judged expectations. AgentV must also include:

{
  "expectations": [...],
  "summary": {...},
  "execution_metrics": {...},
  "evaluators": [
    {
      "name": "correctness",
      "type": "llm-judge",
      "score": 0.85,
      "reasoning": "..."
    },
    {
      "name": "tool-usage",
      "type": "tool-trajectory",
      "score": 1.0,
      "expected_tools": ["Read", "Write"],
      "actual_tools": ["Read", "Write"],
      "reasoning": "All expected tools used in correct order"
    },
    {
      "name": "output-check",
      "type": "code-judge",
      "score": 0.9,
      "details": {"tp": 8, "fp": 1, "fn": 0},
      "reasoning": "Python judge: 8/9 fields correct"
    }
  ],
  "workspace_changes": {
    "files_modified": 3,
    "files_created": 1,
    "diff_summary": "Modified src/index.ts, added tests/new.test.ts"
  },
  "conversation": {
    "turns": 4,
    "conversation_id": "conv-123"
  }
}

benchmark.json extensions

Skill-creator uses with_skill/without_skill. AgentV uses target names and supports N-way comparison:

{
  "run_summary": {
    "claude-3-5-sonnet": {
      "pass_rate": {"mean": 0.85, "stddev": 0.05},
      "time_seconds": {"mean": 45.0, "stddev": 12.0},
      "tokens": {"mean": 3800, "stddev": 400},
      "tool_calls": {"mean": 6, "stddev": 2},
      "cost_usd": {"mean": 0.12, "stddev": 0.03}
    },
    "gpt-4o": {
      "pass_rate": {"mean": 0.75, "stddev": 0.08},
      "...": "..."
    }
  },
  "per_evaluator_summary": {
    "correctness": {"mean": 0.82, "stddev": 0.1},
    "tool-trajectory": {"mean": 0.95, "stddev": 0.05},
    "code-judge-output": {"mean": 0.88, "stddev": 0.07}
  }
}

Compatibility constraint

Fields shared with skill-creator (expectations[].text/passed/evidence, summary, run_summary.*.mean/stddev) must use the same names and types. AgentV-specific fields are additive — skill-creator's tooling can ignore them gracefully.
