feat: skill-eval companion artifacts (grading, timing, benchmark) #565
Description
Objective
Add an output format or post-processing command that produces structured companion artifacts — grading/<test-id>.json, timing.json, and benchmark.json — from existing JSONL eval results, following Anthropic's skill-creator JSON schemas. These artifacts make eval iterations inspectable, resumable, and compatible with skill-creator's tooling ecosystem.
Architecture Boundary
external-first (output format / post-processing)
Why not core-runtime: The JSONL output already contains all the data needed to derive these artifacts — per-evaluator scores with reasoning/evidence text (scores[].reasoning, scores[].hits, scores[].misses), timing (duration_ms, start_time, end_time, per-evaluator startedAt/endedAt/durationMs), and token usage (token_usage.input/output/cached, per-evaluator token usage). No runtime-only data is required. Per CLAUDE.md's Lightweight Core principle ("Can this be achieved with existing primitives + a plugin or wrapper? If yes, it should not be a built-in"), this should be a post-processing wrapper or output format option, not a core engine change.
Context
Anthropic's skill-creator defines explicit JSON schemas for evaluation artifacts (grading.json, timing.json, benchmark.json). AgentV currently outputs JSONL results that contain all the raw data, but not in the structured artifact format that skill-creator's tooling ecosystem expects.
The agentv-eval-orchestrator skill already generates benchmark.json in Agent Skills format. This issue extends that pattern to cover the full artifact bundle as a reusable output format.
Implementation Approaches (choose one)
- New OutputWriter format: `agentv eval run --format artifacts -o ./results/` — writes companion artifacts alongside JSONL
- Post-processing command: `agentv results export ./results.jsonl --format grading|timing|benchmark` — derives artifacts from existing JSONL
- Both: write during eval run as a format option, and provide a standalone export for existing results
Target Schemas
grading/<test-id>.json
Maps from JSONL fields: scores[].reasoning → expectations[].evidence, scores[].hits/misses → expectations[].text/passed
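The mapping above can be sketched in TypeScript. This is a minimal illustration, assuming `hits`/`misses` are arrays of expectation strings and `reasoning` is free text; the exact JSONL types may differ:

```typescript
// Sketch: derive grading expectations from one JSONL result's scores[].
// Field names follow the mapping described in this issue; types are assumptions.
interface EvaluatorScore {
  reasoning?: string;
  hits?: string[];   // expectation texts that passed
  misses?: string[]; // expectation texts that failed
}

interface Expectation {
  text: string;
  passed: boolean;
  evidence: string;
}

function toExpectations(scores: EvaluatorScore[]): Expectation[] {
  const out: Expectation[] = [];
  for (const s of scores) {
    const evidence = s.reasoning ?? "";
    for (const hit of s.hits ?? []) out.push({ text: hit, passed: true, evidence });
    for (const miss of s.misses ?? []) out.push({ text: miss, passed: false, evidence });
  }
  return out;
}
```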
```json
{
  "expectations": [
    { "text": "Response correctly identifies the root cause", "passed": true, "evidence": "Output states 'The root cause is...'" }
  ],
  "summary": { "passed": 1, "failed": 1, "total": 2, "pass_rate": 0.50 },
  "execution_metrics": { "tool_calls": {}, "total_tool_calls": 6, "errors_encountered": 0 }
}
```
timing.json
Maps from JSONL fields: aggregate token_usage and duration_ms across all results
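The aggregation can be sketched as follows. This assumes each JSONL row carries `duration_ms` and `token_usage.{input,output}`; whether `cached` tokens count toward `total_tokens` is left open here, since the source does not pin it down:

```typescript
// Sketch: build timing.json from JSONL result rows.
// Assumes per-row duration_ms and token_usage.{input,output,cached?};
// treatment of cached tokens is an open design decision.
interface ResultRow {
  duration_ms: number;
  token_usage?: { input: number; output: number; cached?: number };
}

function buildTiming(rows: ResultRow[]) {
  const sum = (f: (r: ResultRow) => number) => rows.reduce((a, r) => a + f(r), 0);
  const duration_ms = sum((r) => r.duration_ms);
  const input = sum((r) => r.token_usage?.input ?? 0);
  const output = sum((r) => r.token_usage?.output ?? 0);
  return {
    total_tokens: input + output,
    duration_ms,
    total_duration_seconds: duration_ms / 1000,
    token_usage: { input, output },
  };
}
```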
```json
{
  "total_tokens": 84852,
  "duration_ms": 45200,
  "total_duration_seconds": 45.2,
  "token_usage": { "input": 52000, "output": 18000 }
}
```
benchmark.json
Maps from JSONL fields: aggregate score, duration_ms, token_usage per target with mean/stddev across test cases within the run
```json
{
  "metadata": { "eval_file": "...", "timestamp": "...", "targets": ["..."], "tests_run": ["..."] },
  "run_summary": {
    "<target>": {
      "pass_rate": { "mean": 0.85, "stddev": 0.05 },
      "time_seconds": { "mean": 45.0, "stddev": 12.0 },
      "tokens": { "mean": 3800, "stddev": 400 }
    }
  },
  "notes": []
}
```
Note: stddev measures spread across test cases within a single run (e.g., score spread across 10 tests for a target), not spread across multiple runs.
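The per-target numbers reduce to a small mean/stddev helper. A population standard deviation is assumed in this sketch; skill-creator's exact convention (population vs. sample) should be verified before matching it:

```typescript
// Sketch: mean and population standard deviation across test cases in one run.
function stats(values: number[]): { mean: number; stddev: number } {
  if (values.length === 0) return { mean: 0, stddev: 0 };
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance =
    values.reduce((a, v) => a + (v - mean) ** 2, 0) / values.length;
  return { mean, stddev: Math.sqrt(variance) };
}
```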
Acceptance Signals
- Companion artifacts can be produced from any existing JSONL results file (no re-run required)
- `grading/<test-id>.json` maps JSONL per-evaluator data to skill-creator's expectations/evidence format
- `timing.json` aggregates per-result timing and token data from JSONL
- `benchmark.json` computes cross-test statistics per target from JSONL
- Schemas are compatible with Anthropic's skill-creator conventions (same field names and structure where they overlap)
- Existing JSONL output and eval configs are unchanged
- Follows the existing `OutputWriter` interface pattern (see `apps/cli/src/commands/eval/jsonl-writer.ts`)
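As a rough shape, an artifacts writer would buffer results during the run and emit the bundle on close. The interface below is an assumption for illustration; the real `OutputWriter` contract in `apps/cli/src/commands/eval/jsonl-writer.ts` may differ:

```typescript
// Hypothetical sketch — the actual OutputWriter contract lives in
// apps/cli/src/commands/eval/jsonl-writer.ts and may not match this shape.
interface OutputWriter {
  write(result: Record<string, unknown>): void;
  close(): void;
}

class ArtifactsWriter implements OutputWriter {
  readonly results: Record<string, unknown>[] = [];
  constructor(readonly outDir: string) {}

  write(result: Record<string, unknown>): void {
    // Buffer rows: timing.json and benchmark.json need whole-run aggregates.
    this.results.push(result);
  }

  close(): void {
    // Here: derive grading/<test-id>.json per result, plus timing.json and
    // benchmark.json from this.results, and write them under outDir.
  }
}
```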
Interoperability Constraint
Artifacts must be readable by skill-creator's tooling (eval-viewer/generate_review.py, analyzer scripts). This means:
- Fields shared with skill-creator (`expectations[].text/passed/evidence`, `summary`, `run_summary.*.mean/stddev`) use identical names and types
- AgentV-specific fields (evaluator breakdown, workspace changes, multi-provider data) are additive — skill-creator tooling ignores unknown fields gracefully
- A user who generates artifacts with `agentv eval run` can open them in skill-creator's eval-viewer without conversion
- Conversely, if skill-creator's grader produces `grading.json`, AgentV's dashboard (feat: self-contained HTML dashboard with meta-refresh (runs, benchmarks, traces) #562 / feat: AgentV Studio — eval management platform with historical trends, quality gates, and orchestration #563) can read it (the shared fields are sufficient)
Non-Goals
- Modifying the core evaluation engine or JSONL writer
- History repo management (that's feat: AgentV Studio — eval management platform with historical trends, quality gates, and orchestration #563)
- Dashboard or visualization (that's feat: self-contained HTML dashboard with meta-refresh (runs, benchmarks, traces) #562/feat: AgentV Studio — eval management platform with historical trends, quality gates, and orchestration #563)
- Campaign/iteration tracking across runs (that's feat: AgentV Studio — eval management platform with historical trends, quality gates, and orchestration #563's history.json / feat(eval): iteration tracking, termination taxonomy, and cross-run regression detection #335)
- Feedback artifact collection (separate concern — feat: human review checkpoint and feedback artifact for skill iteration #568)
- Trigger evaluation artifacts
Related
- feat: self-contained HTML dashboard with meta-refresh (runs, benchmarks, traces) #562 — Static HTML dashboard (consumer of these artifacts)
- feat: AgentV Studio — eval management platform with historical trends, quality gates, and orchestration #563 — Self-hosted dashboard with history repo (consumer + extends with history.json, meta.json)
- feat(eval): iteration tracking, termination taxonomy, and cross-run regression detection #335 — Iteration tracking (complementary, cross-run concern)
- Anthropic skill-creator schemas
Research Basis
- Skill Creator Findings — Structured artifacts make iteration inspectable
- Skill Lifecycle Alignment — Adopt Now #2: Standardize a skill-eval artifact bundle
AgentV Extensions Beyond Skill-Creator Schemas
The companion artifacts should be supersets of skill-creator's schemas — compatible with skill-creator tooling but extended for AgentV's broader evaluation scope.
grading/<test-id>.json extensions
Skill-creator's grading only covers LLM-judged expectations. AgentV must also include:
```json
{
  "expectations": [...],
  "summary": {...},
  "execution_metrics": {...},
  "evaluators": [
    {
      "name": "correctness",
      "type": "llm-judge",
      "score": 0.85,
      "reasoning": "..."
    },
    {
      "name": "tool-usage",
      "type": "tool-trajectory",
      "score": 1.0,
      "expected_tools": ["Read", "Write"],
      "actual_tools": ["Read", "Write"],
      "reasoning": "All expected tools used in correct order"
    },
    {
      "name": "output-check",
      "type": "code-judge",
      "score": 0.9,
      "details": {"tp": 8, "fp": 1, "fn": 0},
      "reasoning": "Python judge: 8/9 fields correct"
    }
  ],
  "workspace_changes": {
    "files_modified": 3,
    "files_created": 1,
    "diff_summary": "Modified src/index.ts, added tests/new.test.ts"
  },
  "conversation": {
    "turns": 4,
    "conversation_id": "conv-123"
  }
}
```
benchmark.json extensions
Skill-creator uses with_skill/without_skill. AgentV uses target names and supports N-way comparison:
```json
{
  "run_summary": {
    "claude-3-5-sonnet": {
      "pass_rate": {"mean": 0.85, "stddev": 0.05},
      "time_seconds": {"mean": 45.0, "stddev": 12.0},
      "tokens": {"mean": 3800, "stddev": 400},
      "tool_calls": {"mean": 6, "stddev": 2},
      "cost_usd": {"mean": 0.12, "stddev": 0.03}
    },
    "gpt-4o": {
      "pass_rate": {"mean": 0.75, "stddev": 0.08},
      "...": "..."
    }
  },
  "per_evaluator_summary": {
    "correctness": {"mean": 0.82, "stddev": 0.1},
    "tool-trajectory": {"mean": 0.95, "stddev": 0.05},
    "code-judge-output": {"mean": 0.88, "stddev": 0.07}
  }
}
```
Compatibility constraint
Fields shared with skill-creator (`expectations[].text/passed/evidence`, `summary`, `run_summary.*.mean/stddev`) must use the same names and types. AgentV-specific fields are additive — skill-creator's tooling can ignore them gracefully.
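The additive-fields contract can be illustrated with a reader that consumes only the shared fields, so AgentV-only keys are simply never read. This is a sketch of the consumption pattern, not skill-creator's actual code:

```typescript
// Sketch: a skill-creator-style consumer that only reads shared grading fields,
// so AgentV-specific additions (evaluators, workspace_changes, ...) are ignored.
interface SharedGrading {
  expectations: { text: string; passed: boolean; evidence: string }[];
  summary: { passed: number; failed: number; total: number; pass_rate: number };
}

function summarize(grading: SharedGrading): string {
  // Extra keys on the object are invisible to this function.
  return `${grading.summary.passed}/${grading.summary.total} passed`;
}
```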