feat: skill-eval companion artifacts (grading, timing, benchmark) #565

@christso

Description

Objective

Add an output format or post-processing command that produces structured companion artifacts — grading/<test-id>.json, timing.json, and benchmark.json — from existing JSONL eval results, following Anthropic's skill-creator JSON schemas. These artifacts make eval iterations inspectable, resumable, and compatible with skill-creator's tooling ecosystem.

Architecture Boundary

external-first (output format / post-processing)

Why not core-runtime: The JSONL output already contains all the data needed to derive these artifacts — per-evaluator scores with reasoning/evidence text (scores[].reasoning, scores[].hits, scores[].misses), timing (duration_ms, start_time, end_time, per-evaluator startedAt/endedAt/durationMs), and token usage (token_usage.input/output/cached, per-evaluator token usage). No runtime-only data is required. Per CLAUDE.md's Lightweight Core principle ("Can this be achieved with existing primitives + a plugin or wrapper? If yes, it should not be a built-in"), this should be a post-processing wrapper or output format option, not a core engine change.

Context

Anthropic's skill-creator defines explicit JSON schemas for evaluation artifacts (grading.json, timing.json, benchmark.json). AgentV currently outputs JSONL results that contain all the raw data, but not in the structured artifact format that skill-creator's tooling ecosystem expects.

The agentv-eval-orchestrator skill already generates benchmark.json in Agent Skills format. This issue extends that pattern to cover the full artifact bundle as a reusable output format.

Implementation Approaches (choose one)

  1. New OutputWriter format: agentv eval run --format artifacts -o ./results/ — writes companion artifacts alongside JSONL
  2. Post-processing command: agentv results export ./results.jsonl --format grading|timing|benchmark — derives artifacts from existing JSONL
  3. Both: Write during eval run as a format option, and provide a standalone export for existing results

Target Schemas

grading/<test-id>.json

Maps from JSONL fields: scores[].reasoning → expectations[].evidence, scores[].hits/misses → expectations[].text/passed

{
  "expectations": [
    { "text": "Response correctly identifies the root cause", "passed": true, "evidence": "Output states 'The root cause is...'" },
    { "text": "Response proposes a concrete fix", "passed": false, "evidence": "No fix is proposed in the output" }
  ],
  "summary": { "passed": 1, "failed": 1, "total": 2, "pass_rate": 0.50 },
  "execution_metrics": { "tool_calls": {}, "total_tool_calls": 6, "errors_encountered": 0 }
}
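The per-test mapping above can be sketched as follows. The interfaces are hypothetical shapes inferred from the JSONL field names in this issue (`scores[].reasoning`, `scores[].hits`, `scores[].misses`); the actual result schema may differ.

```typescript
// Hypothetical JSONL evaluator-score shape (assumption, not the shipped schema).
interface EvaluatorScore {
  reasoning: string;
  hits: string[];   // expectation texts the response satisfied
  misses: string[]; // expectation texts the response failed
}

interface GradingExpectation {
  text: string;
  passed: boolean;
  evidence: string;
}

// Derive the grading/<test-id>.json body from one test's evaluator scores.
function deriveGrading(scores: EvaluatorScore[]) {
  const expectations: GradingExpectation[] = scores.flatMap((s) => [
    ...s.hits.map((text) => ({ text, passed: true, evidence: s.reasoning })),
    ...s.misses.map((text) => ({ text, passed: false, evidence: s.reasoning })),
  ]);
  const passed = expectations.filter((e) => e.passed).length;
  const total = expectations.length;
  return {
    expectations,
    summary: {
      passed,
      failed: total - passed,
      total,
      pass_rate: total > 0 ? passed / total : 0,
    },
  };
}
```

Using each evaluator's `reasoning` as the `evidence` string is one option; per-hit evidence would be finer-grained if the JSONL carries it.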

timing.json

Maps from JSONL fields: aggregate token_usage and duration_ms across all results

{
  "total_tokens": 84852,
  "duration_ms": 45200,
  "total_duration_seconds": 45.2,
  "token_usage": { "input": 52000, "output": 18000, "cached": 14852 }
}

benchmark.json

Maps from JSONL fields: aggregate score, duration_ms, token_usage per target with mean/stddev across test cases within the run

{
  "metadata": { "eval_file": "...", "timestamp": "...", "targets": ["..."], "tests_run": ["..."] },
  "run_summary": {
    "<target>": {
      "pass_rate": { "mean": 0.85, "stddev": 0.05 },
      "time_seconds": { "mean": 45.0, "stddev": 12.0 },
      "tokens": { "mean": 3800, "stddev": 400 }
    }
  },
  "notes": []
}

Note: stddev is computed across test cases within a single run (e.g., the score spread across 10 tests for a target), not across multiple runs.
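The per-target statistics reduce to a mean/stddev over the per-test values within one run. A sketch (population stddev shown; whether skill-creator expects population or sample stddev is not specified here and would need checking):

```typescript
// Mean and population standard deviation over per-test values
// (e.g., pass rates, durations, or token counts for one target).
function meanStddev(values: number[]): { mean: number; stddev: number } {
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance =
    values.reduce((a, v) => a + (v - mean) ** 2, 0) / values.length;
  return { mean, stddev: Math.sqrt(variance) };
}
```

Each `run_summary.<target>` entry would then be built by applying this to the target's per-test series.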

Acceptance Signals

  • Companion artifacts can be produced from any existing JSONL results file (no re-run required)
  • grading/<test-id>.json maps JSONL per-evaluator data to skill-creator's expectations/evidence format
  • timing.json aggregates per-result timing and token data from JSONL
  • benchmark.json computes cross-test statistics per target from JSONL
  • Schemas are compatible with Anthropic's skill-creator conventions (same field names and structure where they overlap)
  • Existing JSONL output and eval configs are unchanged
  • Follows existing OutputWriter interface pattern (see apps/cli/src/commands/eval/jsonl-writer.ts)

Interoperability Constraint

Artifacts must be readable by skill-creator's tooling (eval-viewer/generate_review.py, analyzer scripts). This means:

Non-Goals

Related

Research Basis

AgentV Extensions Beyond Skill-Creator Schemas

The companion artifacts should be supersets of skill-creator's schemas — compatible with skill-creator tooling but extended for AgentV's broader evaluation scope.

grading/<test-id>.json extensions

Skill-creator's grading only covers LLM-judged expectations. AgentV must also include:

{
  "expectations": [...],
  "summary": {...},
  "execution_metrics": {...},
  "evaluators": [
    {
      "name": "correctness",
      "type": "llm-judge",
      "score": 0.85,
      "reasoning": "..."
    },
    {
      "name": "tool-usage",
      "type": "tool-trajectory",
      "score": 1.0,
      "expected_tools": ["Read", "Write"],
      "actual_tools": ["Read", "Write"],
      "reasoning": "All expected tools used in correct order"
    },
    {
      "name": "output-check",
      "type": "code-judge",
      "score": 0.9,
      "details": {"tp": 8, "fp": 1, "fn": 0},
      "reasoning": "Python judge: 8/9 fields correct"
    }
  ],
  "workspace_changes": {
    "files_modified": 3,
    "files_created": 1,
    "diff_summary": "Modified src/index.ts, added tests/new.test.ts"
  },
  "conversation": {
    "turns": 4,
    "conversation_id": "conv-123"
  }
}

benchmark.json extensions

Skill-creator uses with_skill/without_skill. AgentV uses target names and supports N-way comparison:

{
  "run_summary": {
    "claude-3-5-sonnet": {
      "pass_rate": {"mean": 0.85, "stddev": 0.05},
      "time_seconds": {"mean": 45.0, "stddev": 12.0},
      "tokens": {"mean": 3800, "stddev": 400},
      "tool_calls": {"mean": 6, "stddev": 2},
      "cost_usd": {"mean": 0.12, "stddev": 0.03}
    },
    "gpt-4o": {
      "pass_rate": {"mean": 0.75, "stddev": 0.08},
      "...": "..."
    }
  },
  "per_evaluator_summary": {
    "correctness": {"mean": 0.82, "stddev": 0.1},
    "tool-trajectory": {"mean": 0.95, "stddev": 0.05},
    "code-judge-output": {"mean": 0.88, "stddev": 0.07}
  }
}

Compatibility constraint

Fields shared with skill-creator (expectations[].text/passed/evidence, summary, run_summary.*.mean/stddev) must use the same names and types. AgentV-specific fields are additive — skill-creator's tooling can ignore them gracefully.
