feat: self-contained HTML dashboard with meta-refresh (runs, benchmarks, traces) #562
Description
Summary
Add an html-writer.ts that produces a self-contained .html report file — no server, no framework, opens in any browser. Feature parity with DeepEval (Confident AI) and Convex Evals dashboards for the features that make sense in a static, self-contained context.
Problem
AgentV outputs structured data (JSONL, YAML, JSON, JUnit XML) but has no human-readable visual dashboard. Users must manually parse result files or build their own visualization. DeepEval requires a cloud account (Confident AI) and Convex Evals requires a Convex backend — both add vendor dependencies for basic result visualization.
Industry Reference
| Feature | DeepEval (Confident AI) | Convex Evals | AgentV Dashboard (proposed) |
|---|---|---|---|
| Run overview | ✅ Dataset, avg/median scores, pass %, AI summary | ✅ Pass rate bars, stat cards, category breakdown | ✅ |
| Test case table | ✅ Input/output, per-metric scores, reasoning, threshold, pass/fail | ✅ Status, name, category, duration, failure reason | ✅ |
| Category breakdown | ❌ (flat metric list) | ✅ Category → pass rate, passed/failed/total | ✅ |
| Score distribution | ✅ Histogram per metric | ❌ | ✅ |
| A/B regression comparison | ✅ Side-by-side, color-coded improvement/regression | ❌ | ✅ (when 2+ runs in file) |
| Trace/step visualization | ✅ Spans, tool calls, latency | ✅ Steps pipeline (filesystem → install → deploy → tsc → eslint → tests) | ✅ Collapsible tree |
| Code/output browser | ❌ | ✅ Monaco Editor with file tree for model output + task source | ⏳ Future (out of scope for v1) |
| Model comparison | ✅ Cross-version prompt performance | ❌ | ✅ (multi-target matrix) |
| Cost & latency tracking | ✅ Per-span token cost, latency | ✅ Duration, token usage, run cost | ✅ |
| Filtering & search | ✅ By status, metric, category | ❌ | ✅ By target, evaluator, status |
| Shareable report | ✅ URL-shareable | ✅ (hosted on Netlify) | ✅ (self-contained .html file) |
| Live updates during run | ❌ (cloud push) | ❌ (cloud push) | ✅ Meta-refresh |
| No vendor dependency | ❌ (requires Confident AI account) | ❌ (requires Convex backend) | ✅ |
Proposed behavior
Output
```shell
agentv eval run --format html -o report.html
# or alongside other formats:
agentv eval run --format jsonl --html-report report.html
```

Produces a single `.html` file with all data embedded as `<script>const data = [...]</script>`. CSS/JS inlined. Zero external dependencies. Works offline.
Live updates during a run
<meta http-equiv="refresh" content="2"> — browser auto-reloads every 2s. Writer rewrites the file on each new result.
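The rewrite-on-each-result loop can be sketched as below. The `EvalResult` shape and `HtmlReportWriter` class are illustrative assumptions, not the actual `OutputWriter` contract; dropping the meta-refresh tag on finalize (so the finished report stops reloading) is a suggested refinement, not something the spec mandates.

```typescript
import { writeFileSync } from "node:fs";

// Illustrative result shape; the real OutputWriter types may differ.
interface EvalResult {
  testId: string;
  status: "pass" | "fail" | "error";
  score: number;
}

class HtmlReportWriter {
  private results: EvalResult[] = [];

  constructor(private path: string, private live = true) {}

  // Called for each new result: append, then rewrite the whole file.
  onResult(result: EvalResult): void {
    this.results.push(result);
    writeFileSync(this.path, this.render());
  }

  // Suggested: drop the meta-refresh once the run ends so the final
  // report stops reloading in the browser.
  finalize(): void {
    this.live = false;
    writeFileSync(this.path, this.render());
  }

  render(): string {
    const meta = this.live
      ? '<meta http-equiv="refresh" content="2">'
      : "";
    // All data embedded as a script tag; CSS/JS would be inlined the same way.
    return `<!doctype html><html><head>${meta}</head><body>
<script>const data = ${JSON.stringify(this.results)};</script>
</body></html>`;
  }
}
```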
Dashboard Views
View 1: Runs Overview
The landing view when opening the report.
Stat cards (top row):
- Total tests / Passed / Failed / Errors
- Overall pass rate (%)
- Total duration
- Total token usage (input + output) and estimated cost
Runs table (if multiple runs or targets):
| Column | Source |
|---|---|
| Target name | targetLabel |
| Pass rate | Visual bar + percentage |
| Passed / Failed / Error counts | Result status |
| Avg score | Mean of all evaluator scores |
| Duration | Sum of test durations |
| Token usage | Input + output tokens |
Click a row → drill into that target's results.
Score distribution (per evaluator):
- Histogram showing score distribution across all tests
- Mean, median, min, max displayed
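The stats behind this view reduce to two small pure functions; this is a sketch assuming scores normalized to [0, 1] and a non-empty score list (function names are hypothetical):

```typescript
// Summary stats for one evaluator's scores (assumes a non-empty array).
function scoreStats(scores: number[]) {
  const sorted = [...scores].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return {
    mean: scores.reduce((s, x) => s + x, 0) / scores.length,
    median:
      sorted.length % 2 === 1
        ? sorted[mid]
        : (sorted[mid - 1] + sorted[mid]) / 2,
    min: sorted[0],
    max: sorted[sorted.length - 1],
  };
}

// Bucket scores into `bins` equal-width bins for the histogram;
// a score of exactly 1.0 lands in the last bin.
function histogram(scores: number[], bins = 10): number[] {
  const counts = new Array(bins).fill(0);
  for (const s of scores) {
    counts[Math.min(bins - 1, Math.floor(s * bins))]++;
  }
  return counts;
}
```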
View 2: Test Cases Table
Per-target (or all-targets) test case detail.
| Column | Source | Notes |
|---|---|---|
| Status icon | pass/fail/error | Color-coded |
| Test ID | testId | |
| Target | targetLabel | If multi-target |
| Per-evaluator scores | Each evaluator as a column | Color-coded: green ≥90%, yellow ≥50%, red <50% |
| Overall score | Composite | |
| Duration | ms | |
| Failure reason | error or evaluator reasoning | Expandable |
Interactions:
- Filter by: status (pass/fail/error), target, evaluator, score range
- Sort by: any column
- Search by test ID
- Click row → expand to show full detail (input, output, evaluator reasoning, trace)
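Since all data is embedded in the page, these interactions can run entirely client-side against the `data` array. A minimal sketch of the filter step, with an assumed row shape (the real per-row fields come from the JSONL results):

```typescript
interface TestRow {
  testId: string;
  status: "pass" | "fail" | "error";
  target: string;
  scores: Record<string, number>;
}

interface Filters {
  status?: TestRow["status"];
  target?: string;
  search?: string; // substring match on test ID
}

// Pure filter function; in the generated page this would run over the
// embedded `data` array and re-render the table body via DOM manipulation.
function applyFilters(rows: TestRow[], f: Filters): TestRow[] {
  return rows.filter(
    (r) =>
      (!f.status || r.status === f.status) &&
      (!f.target || r.target === f.target) &&
      (!f.search || r.testId.includes(f.search)),
  );
}
```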
View 3: Test Case Detail (Expanded)
When a test case row is expanded or clicked:
Input/Output panel:
- `input` — the prompt / test message sequence
- `actual_output` — agent's response
- `expected_output` — if defined
Evaluator results:
| Evaluator | Score | Pass/Fail | Threshold | Reasoning |
|---|---|---|---|---|
| correctness | 0.85 | ✅ | 0.7 | "Response accurately..." |
| completeness | 0.60 | ❌ | 0.8 | "Missing coverage of..." |
- Show LLM judge reasoning when available
- Show rubric criteria breakdown when available
Metadata:
- Duration, token usage, provider, model
- Timestamp
View 4: A/B Regression Comparison
Available when the report contains results from 2+ runs or targets.
Side-by-side layout:
- Left column: baseline run/target
- Right column: comparison run/target
- Dropdown selectors for which runs to compare
Per-test comparison rows:
- Green highlight: test that was failing now passes (improvement)
- Red highlight: test that was passing now fails (regression)
- Score delta shown with ↑/↓ arrows
- Unchanged tests shown in neutral color
Summary stats:
- Improvements count / Regressions count / Unchanged count
- Net score delta
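The comparison math is a keyed diff over test IDs. A sketch under assumed shapes; tests present in only one run are skipped here, though the real implementation may want to surface them separately:

```typescript
interface RunResult {
  testId: string;
  passed: boolean;
  score: number;
}

// Diff two runs keyed by test ID; tests missing from either side are skipped.
function compareRuns(baseline: RunResult[], candidate: RunResult[]) {
  const base = new Map(baseline.map((r) => [r.testId, r]));
  let improvements = 0, regressions = 0, unchanged = 0, netDelta = 0;
  for (const c of candidate) {
    const b = base.get(c.testId);
    if (!b) continue;
    if (!b.passed && c.passed) improvements++;      // green highlight
    else if (b.passed && !c.passed) regressions++;  // red highlight
    else unchanged++;                               // neutral
    netDelta += c.score - b.score;                  // drives the ↑/↓ arrows
  }
  return { improvements, regressions, unchanged, netDelta };
}
```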
View 5: Trace Visualization
Collapsible tree view of test execution (same data as agentv trace show --tree).
```
▶ test-feature-alpha                              ✅ 0.92   12.4s
  ▶ claude-3-5-sonnet                             ✅ 0.92   12.4s
    ├─ LLM Call: initial prompt                   3.2s   1,240 tokens
    ├─ Tool Call: read_file("src/index.ts")       0.1s
    ├─ LLM Call: analysis                         4.1s   2,100 tokens
    ├─ Tool Call: write_file("src/fix.ts")        0.2s
    └─ LLM Call: final response                   4.8s   1,800 tokens
```
Per-node details (on expand):
- LLM calls: prompt, response, token count, latency, model
- Tool calls: tool name, arguments, result, latency
- Scores: per-evaluator scores at each level
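One zero-JS way to get collapsible nodes is native `<details>`/`<summary>` elements emitted by the template. A sketch with a hypothetical `TraceNode` shape (the actual trace schema comes from `agentv trace`):

```typescript
interface TraceNode {
  label: string;    // e.g. 'LLM Call: initial prompt'
  detail?: string;  // latency, token count, etc.
  children?: TraceNode[];
}

// Nested <details> elements give expand/collapse for free, with no JS.
function renderTrace(node: TraceNode): string {
  const summary = `<summary>${node.label} ${node.detail ?? ""}</summary>`;
  const kids = (node.children ?? []).map(renderTrace).join("");
  return `<details>${summary}${kids}</details>`;
}
```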
Color coding:
- Green: score ≥ 90%
- Yellow: score ≥ 50%
- Red: score < 50%
- Gray: no score (tool calls, metadata)
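These thresholds map directly to a small helper (sketch; the function name is illustrative):

```typescript
// Threshold mapping from the spec: green ≥ 90%, yellow ≥ 50%, red < 50%,
// gray when there is no score (tool calls, metadata nodes).
function scoreColor(score: number | undefined): string {
  if (score === undefined) return "gray";
  if (score >= 0.9) return "green";
  if (score >= 0.5) return "yellow";
  return "red";
}
```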
View 6: Category Breakdown
If tests have categories/tags:
| Category | Tests | Pass Rate | Avg Score | Passed | Failed |
|---|---|---|---|---|---|
| retrieval | 8 | 87.5% | 0.82 | 7 | 1 |
| reasoning | 5 | 60.0% | 0.71 | 3 | 2 |
| tool_use | 12 | 91.7% | 0.89 | 11 | 1 |
Click category → filtered test cases table.
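The aggregation behind this table is a group-by over categories; a sketch with an assumed result shape:

```typescript
interface CatResult {
  category: string;
  passed: boolean;
  score: number;
}

// Aggregate per-category stats matching the breakdown table columns.
function categoryBreakdown(results: CatResult[]) {
  const byCat = new Map<string, CatResult[]>();
  for (const r of results) {
    byCat.set(r.category, [...(byCat.get(r.category) ?? []), r]);
  }
  return [...byCat.entries()].map(([category, rs]) => {
    const passed = rs.filter((r) => r.passed).length;
    return {
      category,
      tests: rs.length,
      passRate: passed / rs.length,
      avgScore: rs.reduce((s, r) => s + r.score, 0) / rs.length,
      passed,
      failed: rs.length - passed,
    };
  });
}
```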
Architecture
Writer implementation
Follows the existing OutputWriter interface pattern:
- `apps/cli/src/commands/eval/html-writer.ts`
- On each result: append to in-memory array, rewrite the HTML file with updated embedded data
- Template is a string literal in the writer (no external template files)
- Vanilla JS in the generated HTML — no React, no build step, no framework
Navigation
Tab-based navigation between views (Runs Overview, Test Cases, Comparison, Traces, Categories). All views are in the same single HTML file — tab switching is pure JS DOM manipulation, no routing needed.
Chart library
If histograms or bar charts are needed, vendor a minimal chart solution inlined into the template. No CDN dependencies. Alternatively, render charts as pure HTML/CSS (CSS bar widths for pass rate bars, HTML table cells for histograms).
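The pure HTML/CSS option can be as simple as two nested divs whose inner width encodes the value; a sketch of a pass-rate bar generator (names and styling are illustrative):

```typescript
// Render a pass-rate bar as pure HTML/CSS: an outer track div with an
// inner fill div whose width encodes the percentage. No chart library.
function passRateBar(passRate: number): string {
  const pct = Math.round(passRate * 100);
  return (
    `<div style="background:#eee;width:120px;height:10px">` +
    `<div style="background:#2e7d32;width:${pct}%;height:100%"></div>` +
    `</div> ${pct}%`
  );
}
```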
Design latitude
- Visual design, layout, CSS are implementer's choice
- May use lightweight vendored libs (bundled into the template string, not CDN)
- Collapsible sections and tab navigation encouraged
- Dark mode support is nice-to-have, not required
- Responsive layout is nice-to-have
Acceptance signals
- `agentv eval run --format html -o report.html` produces a working self-contained HTML file
- File opens correctly in Chrome/Firefox/Safari with no network requests
- View 1: Runs overview with stat cards, runs table, score distribution
- View 2: Test cases table with filtering, sorting, search
- View 3: Test case detail with input/output, evaluator reasoning, metadata
- View 4: A/B comparison when 2+ targets/runs present
- View 5: Collapsible trace tree with LLM/tool call details
- View 6: Category breakdown with drill-down
- Meta-refresh updates the page during an active eval run
- Implements the `OutputWriter` interface — same pattern as existing writers
- No new runtime dependencies added to the CLI
- No external network requests from the generated HTML
Non-goals
- Server-side rendering or hosted dashboard
- React/framework-based SPA
- Real-time SSE streaming (future: `agentv serve`)
- PDF export
- Monaco Editor / code browsing (future enhancement)
- Human annotation / feedback collection (Confident AI feature, not relevant for local reports)
- Dataset management UI
- Arena / no-code comparison UI
- Authentication or team features
Related
- Research: pi-autoresearch patterns
- Industry: DeepEval Confident AI dashboard — cloud-hosted eval visualization
- Industry: Convex Evals visualizer — React + Convex web dashboard
- Existing writers: `jsonl-writer.ts`, `yaml-writer.ts`, `json-writer.ts`, `junit-writer.ts`
MVP Scope (added 2026-03-14)
Design Reference: skill-creator's generate_review.py
Anthropic's skill-creator includes a Python eval-viewer/generate_review.py that produces a static HTML review page. This is the design reference for the MVP — implement the same functionality in TypeScript as html-writer.ts.
What generate_review.py produces:
- Outputs tab: One test case at a time — shows prompt, output files rendered inline, formal grades (assertion pass/fail), and user feedback form
- Benchmark tab: Quantitative summary — pass rates, timing, token usage per configuration, per-evaluation breakdowns, analyst observations
- Feedback download: "Submit All Reviews" button downloads `feedback.json`
- Static HTML mode: `--static <output_path>` generates a self-contained HTML file (no server needed)
- Iteration comparison: for iteration 2+, shows previous output and feedback side-by-side
MVP Acceptance Signals (subset of full #562)
Ship these first, then iterate toward the full spec:
- View 1: Runs overview with stat cards (pass rate, duration, tokens, cost)
- View 2: Test cases table with per-evaluator scores, pass/fail status, expandable reasoning
- View 3: Benchmark tab with per-target statistics from `benchmark.json`
- Reads AgentV companion artifacts (`grading/*.json`, `timing.json`, `benchmark.json`) produced by feat: skill-eval companion artifacts (grading, timing, benchmark) #565
- Reads JSONL results directly as fallback
- Feedback integration: links to the `feedback.json` workflow (feat: human review checkpoint and feedback artifact for skill iteration #568)
- Self-contained HTML file — no network requests, opens offline
What's deferred to post-MVP
- View 4: A/B regression comparison (full side-by-side)
- View 5: Trace visualization (collapsible tree)
- View 6: Category breakdown
- Meta-refresh live updates during runs
- Score distribution histograms
Relationship to #569 (skill-creator alignment)
The companion artifacts (#565) provide the data layer. This issue provides the visualization layer. Together they give AgentV feature parity with skill-creator's eval-viewer while keeping everything in TypeScript (no Python runtime dependency).
Architecture Decision: Two Report Surfaces (added 2026-03-14)
Per CLAUDE.md's Lightweight Core principle ("Can this be achieved with existing primitives + a plugin or wrapper?"), HTML visualization is a presentation concern, not a universal primitive. Two surfaces:
1. Built-in --format html (CI/CD pipelines)
- Lightweight TypeScript `html-writer.ts` in the CLI
- Minimal: stat cards, test table, pass/fail, evaluator scores
- No interactivity beyond filtering/sorting
- Justification: CI/CD pipelines need report generation without an agent session, same as `--format junit-xml`
2. Skill-invoked Python script (interactive agent sessions)
- `plugins/agentv-dev/skills/agentv-optimizer/references/generate-report.py`
- Richer: iteration comparison, feedback forms, analyst observations
- Modeled after skill-creator's `eval-viewer/generate_review.py`
- The lifecycle skill (Phase 6: Review) invokes this for interactive review
- Users can fork/customize without touching core
Why both?
- Built-in serves the `agentv eval run --format html -o report.html` CI use case (deterministic, no agent needed)
- Skill-invoked serves the interactive improvement loop (richer, customizable, agent-native)
- Both read the same data: JSONL results + companion artifacts (feat: skill-eval companion artifacts (grading, timing, benchmark) #565)