
feat: self-contained HTML dashboard with meta-refresh (runs, benchmarks, traces) #562

@christso


Summary

Add an html-writer.ts that produces a self-contained .html report file — no server, no framework, opens in any browser. Feature parity with DeepEval (Confident AI) and Convex Evals dashboards for the features that make sense in a static, self-contained context.

Problem

AgentV outputs structured data (JSONL, YAML, JSON, JUnit XML) but has no human-readable visual dashboard. Users must manually parse result files or build their own visualization. DeepEval requires a cloud account (Confident AI) and Convex Evals requires a Convex backend — both add vendor dependencies for basic result visualization.

Industry Reference

| Feature | DeepEval (Confident AI) | Convex Evals | AgentV Dashboard (proposed) |
| --- | --- | --- | --- |
| Run overview | ✅ Dataset, avg/median scores, pass %, AI summary | | ✅ Pass rate bars, stat cards, category breakdown |
| Test case table | ✅ Input/output, per-metric scores, reasoning, threshold, pass/fail | | ✅ Status, name, category, duration, failure reason |
| Category breakdown | ❌ (flat metric list) | | ✅ Category → pass rate, passed/failed/total |
| Score distribution | ✅ Histogram per metric | | ✅ |
| A/B regression comparison | ✅ Side-by-side, color-coded improvement/regression | | ✅ (when 2+ runs in file) |
| Trace/step visualization | ✅ Spans, tool calls, latency | ✅ Steps pipeline (filesystem → install → deploy → tsc → eslint → tests) | ✅ Collapsible tree |
| Code/output browser | | ✅ Monaco Editor with file tree for model output + task source | ⏳ Future (out of scope for v1) |
| Model comparison | ✅ Cross-version prompt performance | | ✅ (multi-target matrix) |
| Cost & latency tracking | ✅ Per-span token cost, latency | | ✅ Duration, token usage, run cost |
| Filtering & search | ✅ By status, metric, category | | ✅ By target, evaluator, status |
| Shareable report | ✅ URL-shareable | ✅ (hosted on Netlify) | ✅ (self-contained .html file) |
| Live updates during run | ❌ (cloud push) | ❌ (cloud push) | ✅ Meta-refresh |
| No vendor dependency | ❌ (requires Confident AI account) | ❌ (requires Convex backend) | ✅ |

Proposed behavior

Output

```sh
agentv eval run --format html -o report.html
# or alongside other formats:
agentv eval run --format jsonl --html-report report.html
```

Produces a single `.html` file with all data embedded as `<script>const data = [...]</script>`. CSS and JS are inlined. Zero external dependencies. Works offline.
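
Embedding arbitrary result data inside a `<script>` tag has one subtlety worth sketching: a literal `</script>` in the JSON would terminate the tag early. A minimal sketch, assuming an illustrative `renderHtml` helper and result shape (not the actual AgentV types):

```typescript
// Hypothetical result shape for illustration only.
type EvalResult = { testId: string; status: "pass" | "fail" | "error"; score?: number };

// A literal "</script>" inside the embedded JSON would close the script tag
// early, so escape the forward slash before embedding.
function escapeForScript(json: string): string {
  return json.replace(/<\//g, "<\\/");
}

function renderHtml(results: EvalResult[]): string {
  const payload = escapeForScript(JSON.stringify(results));
  return [
    "<!doctype html>",
    "<html><head><meta charset='utf-8'></head><body>",
    `<script>const data = ${payload};</script>`,
    "</body></html>",
  ].join("\n");
}
```

JSON is valid JavaScript here, so the inline script can consume `data` directly without a parse step.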

Live updates during a run

`<meta http-equiv="refresh" content="2">` — the browser auto-reloads every 2 seconds, and the writer rewrites the file on each new result.
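
Since the browser may refresh at any moment, the rewrite should never leave a half-written file on disk. One way to do this (a sketch, not prescribed by this issue: write to a temp file, then rename, which is atomic on the same filesystem):

```typescript
import { renameSync, writeFileSync } from "node:fs";

// writeReportAtomically is an illustrative helper name. Writing the full
// document to a temp path and renaming it into place means a concurrent
// meta-refresh reload always sees either the old or the new report, never
// a truncated one.
function writeReportAtomically(path: string, html: string): void {
  const tmp = `${path}.tmp`;
  writeFileSync(tmp, html, "utf8"); // write the complete document first
  renameSync(tmp, path);            // then swap it in as a single operation
}
```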


Dashboard Views

View 1: Runs Overview

The landing view when opening the report.

Stat cards (top row):

  • Total tests / Passed / Failed / Errors
  • Overall pass rate (%)
  • Total duration
  • Total token usage (input + output) and estimated cost
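
The stat-card aggregation above can be sketched as a single pass over the results. The `Result` shape and `summarize` name are assumptions for illustration; the real AgentV result type may differ:

```typescript
// Hypothetical minimal result shape for illustration.
type Result = {
  status: "pass" | "fail" | "error";
  durationMs: number;
  inputTokens: number;
  outputTokens: number;
};

function summarize(results: Result[]) {
  const passed = results.filter((r) => r.status === "pass").length;
  const failed = results.filter((r) => r.status === "fail").length;
  const errors = results.filter((r) => r.status === "error").length;
  const total = results.length;
  return {
    total,
    passed,
    failed,
    errors,
    passRate: total === 0 ? 0 : passed / total, // guard against empty runs
    durationMs: results.reduce((s, r) => s + r.durationMs, 0),
    tokens: results.reduce((s, r) => s + r.inputTokens + r.outputTokens, 0),
  };
}
```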

Runs table (if multiple runs or targets):

| Column | Source |
| --- | --- |
| Target name | `targetLabel` |
| Pass rate | Visual bar + percentage |
| Passed / Failed / Error counts | Result status |
| Avg score | Mean of all evaluator scores |
| Duration | Sum of test durations |
| Token usage | Input + output tokens |

Click a row → drill into that target's results.

Score distribution (per evaluator):

  • Histogram showing score distribution across all tests
  • Mean, median, min, max displayed

View 2: Test Cases Table

Per-target (or all-targets) test case detail.

| Column | Source | Notes |
| --- | --- | --- |
| Status icon | pass/fail/error | Color-coded |
| Test ID | `testId` | |
| Target | `targetLabel` | If multi-target |
| Per-evaluator scores | Each evaluator as a column | Color-coded: green ≥90%, yellow ≥50%, red <50% |
| Overall score | Composite | |
| Duration | ms | |
| Failure reason | error or evaluator reasoning | Expandable |

Interactions:

  • Filter by: status (pass/fail/error), target, evaluator, score range
  • Sort by: any column
  • Search by test ID
  • Click row → expand to show full detail (input, output, evaluator reasoning, trace)
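
The filter interactions above reduce to a pure predicate over the embedded data, which the inline script can run on every control change. A sketch, with illustrative names (`Row`, `Filters`, `applyFilters`) and the score range simplified to a minimum bound:

```typescript
// Hypothetical row and filter shapes for illustration.
type Row = { testId: string; status: string; target: string; score: number };
type Filters = { status?: string; target?: string; query?: string; minScore?: number };

// Each defined filter narrows the result set; undefined filters are no-ops.
function applyFilters(rows: Row[], f: Filters): Row[] {
  return rows.filter(
    (r) =>
      (f.status === undefined || r.status === f.status) &&
      (f.target === undefined || r.target === f.target) &&
      (f.minScore === undefined || r.score >= f.minScore) &&
      (f.query === undefined || r.testId.includes(f.query)),
  );
}
```

Sorting is then a plain `Array.prototype.sort` on the filtered rows before re-rendering the table body.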

View 3: Test Case Detail (Expanded)

When a test case row is expanded or clicked:

Input/Output panel:

  • input (the prompt / test message sequence)
  • actual_output (agent's response)
  • expected_output (if defined)

Evaluator results:

| Evaluator | Score | Pass/Fail | Threshold | Reasoning |
| --- | --- | --- | --- | --- |
| correctness | 0.85 | ✅ | 0.7 | "Response accurately..." |
| completeness | 0.60 | ❌ | 0.8 | "Missing coverage of..." |
  • Show LLM judge reasoning when available
  • Show rubric criteria breakdown when available

Metadata:

  • Duration, token usage, provider, model
  • Timestamp

View 4: A/B Regression Comparison

Available when the report contains results from 2+ runs or targets.

Side-by-side layout:

  • Left column: baseline run/target
  • Right column: comparison run/target
  • Dropdown selectors for which runs to compare

Per-test comparison rows:

  • Green highlight: test that was failing now passes (improvement)
  • Red highlight: test that was passing now fails (regression)
  • Score delta shown with ↑/↓ arrows
  • Unchanged tests shown in neutral color

Summary stats:

  • Improvements count / Regressions count / Unchanged count
  • Net score delta
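
The comparison classification and summary stats can be sketched as a join on test ID. `RunResult` and `compareRuns` are illustrative shapes, not the actual AgentV schema:

```typescript
// Hypothetical per-run result shape for illustration.
type RunResult = { testId: string; passed: boolean; score: number };

function compareRuns(baseline: RunResult[], candidate: RunResult[]) {
  const base = new Map(baseline.map((r) => [r.testId, r]));
  let improvements = 0;
  let regressions = 0;
  let unchanged = 0;
  let netDelta = 0;
  for (const c of candidate) {
    const b = base.get(c.testId);
    if (!b) continue; // only compare tests present in both runs
    netDelta += c.score - b.score;
    if (!b.passed && c.passed) improvements++;      // green highlight
    else if (b.passed && !c.passed) regressions++;  // red highlight
    else unchanged++;                               // neutral
  }
  return { improvements, regressions, unchanged, netDelta };
}
```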

View 5: Trace Visualization

Collapsible tree view of test execution (same data as agentv trace show --tree).

```
▶ test-feature-alpha                           ✅ 0.92  12.4s
  ▶ claude-3-5-sonnet                          ✅ 0.92  12.4s
    ├─ LLM Call: initial prompt                       3.2s  1,240 tokens
    ├─ Tool Call: read_file("src/index.ts")            0.1s
    ├─ LLM Call: analysis                             4.1s  2,100 tokens
    ├─ Tool Call: write_file("src/fix.ts")             0.2s
    └─ LLM Call: final response                       4.8s  1,800 tokens
```

Per-node details (on expand):

  • LLM calls: prompt, response, token count, latency, model
  • Tool calls: tool name, arguments, result, latency
  • Scores: per-evaluator scores at each level

Color coding:

  • Green: score ≥ 90%
  • Yellow: score ≥ 50%
  • Red: score < 50%
  • Gray: no score (tool calls, metadata)
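
The same thresholds apply to trace nodes and to the per-evaluator score cells in View 2, so they are worth centralizing in one function. A sketch using the thresholds above (`scoreColor` is an illustrative name):

```typescript
// Shared color-coding rule: thresholds taken from the spec above.
function scoreColor(score: number | undefined): "green" | "yellow" | "red" | "gray" {
  if (score === undefined) return "gray"; // tool calls, metadata: no score
  if (score >= 0.9) return "green";
  if (score >= 0.5) return "yellow";
  return "red";
}
```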

View 6: Category Breakdown

If tests have categories/tags:

| Category | Tests | Pass Rate | Avg Score | Passed | Failed |
| --- | --- | --- | --- | --- | --- |
| retrieval | 8 | 87.5% | 0.82 | 7 | 1 |
| reasoning | 5 | 60.0% | 0.71 | 3 | 2 |
| tool_use | 12 | 91.7% | 0.89 | 11 | 1 |

Click category → filtered test cases table.
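
The per-category rollup is a grouping pass over the results. A sketch with illustrative shapes (`CatResult`, `byCategory`), assuming each test carries a single category tag:

```typescript
// Hypothetical shapes for illustration.
type CatResult = { category: string; passed: boolean; score: number };
type CatSummary = { tests: number; passed: number; avgScore: number; passRate: number };

function byCategory(results: CatResult[]): Map<string, CatSummary> {
  const out = new Map<string, CatSummary>();
  for (const r of results) {
    const g = out.get(r.category) ?? { tests: 0, passed: 0, avgScore: 0, passRate: 0 };
    g.tests += 1;
    if (r.passed) g.passed += 1;
    g.avgScore += r.score; // running sum; divided below
    out.set(r.category, g);
  }
  for (const g of out.values()) {
    g.avgScore /= g.tests;
    g.passRate = g.passed / g.tests;
  }
  return out;
}
```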


Architecture

Writer implementation

Follows the existing OutputWriter interface pattern:

  • apps/cli/src/commands/eval/html-writer.ts
  • On each result: append to in-memory array, rewrite the HTML file with updated embedded data
  • Template is a string literal in the writer (no external template files)
  • Vanilla JS in the generated HTML — no React, no build step, no framework
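
The writer shape might look like the following sketch. It assumes an `OutputWriter` interface with per-result and close hooks; the actual interface in `apps/cli` may differ, and the render/write functions are injected here purely to keep the sketch testable:

```typescript
// Assumed interface shape; the real OutputWriter in the CLI may differ.
interface OutputWriter {
  onResult(result: unknown): void;
  close(): void;
}

class HtmlWriter implements OutputWriter {
  private readonly results: unknown[] = [];

  constructor(
    private readonly path: string,
    private readonly render: (results: unknown[]) => string,
    private readonly writeFile: (path: string, html: string) => void,
  ) {}

  onResult(result: unknown): void {
    this.results.push(result);
    // Rewrite the whole report on each result so the meta-refresh picks it up.
    this.writeFile(this.path, this.render(this.results));
  }

  close(): void {
    this.writeFile(this.path, this.render(this.results)); // final flush
  }
}
```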

Navigation

Tab-based navigation between views (Runs Overview, Test Cases, Comparison, Traces, Categories). All views are in the same single HTML file — tab switching is pure JS DOM manipulation, no routing needed.

Chart library

If histograms or bar charts are needed, vendor a minimal chart solution inlined into the template. No CDN dependencies. Alternatively, render charts as pure HTML/CSS (CSS bar widths for pass rate bars, HTML table cells for histograms).
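The pure-CSS option can be sketched in a few lines: a pass-rate bar is just a nested `div` whose inner width is the percentage. `passRateBar` and the colors are illustrative choices, not part of the spec:

```typescript
// Render a pass-rate bar as plain HTML/CSS: no chart library needed.
function passRateBar(passRate: number): string {
  // Clamp to [0, 1] so malformed data cannot overflow the bar.
  const pct = Math.round(Math.max(0, Math.min(1, passRate)) * 100);
  return (
    `<div style="background:#eee;border-radius:3px">` +
    `<div style="background:#4caf50;width:${pct}%;border-radius:3px">&nbsp;</div>` +
    `</div><span>${pct}%</span>`
  );
}
```

Histograms can follow the same pattern: one fixed-height cell per bucket, with the bar height set from the bucket count.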


Design latitude

  • Visual design, layout, CSS are implementer's choice
  • May use lightweight vendored libs (bundled into the template string, not CDN)
  • Collapsible sections and tab navigation encouraged
  • Dark mode support is nice-to-have, not required
  • Responsive layout is nice-to-have

Acceptance signals

  • agentv eval run --format html -o report.html produces a working self-contained HTML file
  • File opens correctly in Chrome/Firefox/Safari with no network requests
  • View 1: Runs overview with stat cards, runs table, score distribution
  • View 2: Test cases table with filtering, sorting, search
  • View 3: Test case detail with input/output, evaluator reasoning, metadata
  • View 4: A/B comparison when 2+ targets/runs present
  • View 5: Collapsible trace tree with LLM/tool call details
  • View 6: Category breakdown with drill-down
  • Meta-refresh updates the page during an active eval run
  • Implements OutputWriter interface — same pattern as existing writers
  • No new runtime dependencies added to the CLI
  • No external network requests from the generated HTML

Non-goals

  • Server-side rendering or hosted dashboard
  • React/framework-based SPA
  • Real-time SSE streaming (future: agentv serve)
  • PDF export
  • Monaco Editor / code browsing (future enhancement)
  • Human annotation / feedback collection (Confident AI feature, not relevant for local reports)
  • Dataset management UI
  • Arena / no-code comparison UI
  • Authentication or team features

Related


MVP Scope (added 2026-03-14)

Design Reference: skill-creator's generate_review.py

Anthropic's skill-creator includes a Python eval-viewer/generate_review.py that produces a static HTML review page. This is the design reference for the MVP — implement the same functionality in TypeScript as html-writer.ts.

What generate_review.py produces:

  • Outputs tab: One test case at a time — shows prompt, output files rendered inline, formal grades (assertion pass/fail), and user feedback form
  • Benchmark tab: Quantitative summary — pass rates, timing, token usage per configuration, per-evaluation breakdowns, analyst observations
  • Feedback download: "Submit All Reviews" button downloads feedback.json
  • Static HTML mode: --static <output_path> generates a self-contained HTML file (no server needed)
  • Iteration comparison: For iteration 2+, shows previous output and feedback side-by-side

MVP Acceptance Signals (subset of full #562)

Ship these first, then iterate toward the full spec:

What's deferred to post-MVP

  • View 4: A/B regression comparison (full side-by-side)
  • View 5: Trace visualization (collapsible tree)
  • View 6: Category breakdown
  • Meta-refresh live updates during runs
  • Score distribution histograms

Relationship to #569 (skill-creator alignment)

The companion artifacts (#565) provide the data layer. This issue provides the visualization layer. Together they give AgentV feature parity with skill-creator's eval-viewer while keeping everything in TypeScript (no Python runtime dependency).

Architecture Decision: Two Report Surfaces (added 2026-03-14)

Per CLAUDE.md's Lightweight Core principle ("Can this be achieved with existing primitives + a plugin or wrapper?"), HTML visualization is a presentation concern, not a universal primitive. Two surfaces:

1. Built-in --format html (CI/CD pipelines)

  • Lightweight TypeScript html-writer.ts in the CLI
  • Minimal: stat cards, test table, pass/fail, evaluator scores
  • No interactivity beyond filtering/sorting
  • Justification: CI/CD pipelines need report generation without an agent session, same as --format junit-xml

2. Skill-invoked Python script (interactive agent sessions)

  • plugins/agentv-dev/skills/agentv-optimizer/references/generate-report.py
  • Richer: iteration comparison, feedback forms, analyst observations
  • Modeled after skill-creator's eval-viewer/generate_review.py
  • The lifecycle skill (Phase 6: Review) invokes this for interactive review
  • Users can fork/customize without touching core

Why both?
