
feat: self-contained HTML dashboard with meta-refresh (runs, benchmarks, traces) #562

@christso


Summary

Add an html-writer.ts that produces a self-contained .html report file — no server, no framework, opens in any browser. Feature parity with DeepEval (Confident AI) and Convex Evals dashboards for the features that make sense in a static, self-contained context.

Problem

AgentV outputs structured data (JSONL, YAML, JSON, JUnit XML) but has no human-readable visual dashboard. Users must manually parse result files or build their own visualization. DeepEval requires a cloud account (Confident AI) and Convex Evals requires a Convex backend — both add vendor dependencies for basic result visualization.

Industry Reference

| Feature | DeepEval (Confident AI) | Convex Evals | AgentV Dashboard (proposed) |
| --- | --- | --- | --- |
| Run overview | ✅ Dataset, avg/median scores, pass %, AI summary | | ✅ Pass rate bars, stat cards, category breakdown |
| Test case table | ✅ Input/output, per-metric scores, reasoning, threshold, pass/fail | | ✅ Status, name, category, duration, failure reason |
| Category breakdown | ❌ (flat metric list) | | ✅ Category → pass rate, passed/failed/total |
| Score distribution | ✅ Histogram per metric | | ✅ |
| A/B regression comparison | ✅ Side-by-side, color-coded improvement/regression | | ✅ (when 2+ runs in file) |
| Trace/step visualization | ✅ Spans, tool calls, latency | ✅ Steps pipeline (filesystem → install → deploy → tsc → eslint → tests) | ✅ Collapsible tree |
| Code/output browser | | ✅ Monaco Editor with file tree for model output + task source | ⏳ Future (out of scope for v1) |
| Model comparison | ✅ Cross-version prompt performance | | ✅ (multi-target matrix) |
| Cost & latency tracking | ✅ Per-span token cost, latency | | ✅ Duration, token usage, run cost |
| Filtering & search | ✅ By status, metric, category | | ✅ By target, evaluator, status |
| Shareable report | ✅ URL-shareable | ✅ (hosted on Netlify) | ✅ (self-contained .html file) |
| Live updates during run | ❌ (cloud push) | ❌ (cloud push) | ✅ Meta-refresh |
| No vendor dependency | ❌ (requires Confident AI account) | ❌ (requires Convex backend) | ✅ |

Proposed behavior

Output

```sh
agentv eval run --format html -o report.html
# or alongside other formats:
agentv eval run --format jsonl --html-report report.html
```

Produces a single `.html` file with all data embedded as `<script>const data = [...]</script>`. CSS and JS are inlined. Zero external dependencies. Works offline.
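
Embedding arbitrary result data inside a `<script>` tag has one subtlety worth sketching: a literal `</script>` in the JSON would terminate the tag early. A minimal sketch, assuming an illustrative `renderHtml` helper and result shape (not the actual AgentV types):

```typescript
// Hypothetical result shape for illustration only.
type EvalResult = { testId: string; status: "pass" | "fail" | "error"; score?: number };

// A literal "</script>" inside the embedded JSON would close the script tag
// early, so escape the forward slash before embedding.
function escapeForScript(json: string): string {
  return json.replace(/<\//g, "<\\/");
}

function renderHtml(results: EvalResult[]): string {
  const payload = escapeForScript(JSON.stringify(results));
  return [
    "<!doctype html>",
    "<html><head><meta charset='utf-8'></head><body>",
    `<script>const data = ${payload};</script>`,
    "</body></html>",
  ].join("\n");
}
```

JSON is valid JavaScript here, so the inline script can consume `data` directly without a parse step.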

Live updates during a run

`<meta http-equiv="refresh" content="2">` — the browser auto-reloads every 2 seconds, and the writer rewrites the file on each new result.
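
Since the browser may refresh at any moment, the rewrite should never leave a half-written file on disk. One way to do this (a sketch, not prescribed by this issue: write to a temp file, then rename, which is atomic on the same filesystem):

```typescript
import { renameSync, writeFileSync } from "node:fs";

// writeReportAtomically is an illustrative helper name. Writing the full
// document to a temp path and renaming it into place means a concurrent
// meta-refresh reload always sees either the old or the new report, never
// a truncated one.
function writeReportAtomically(path: string, html: string): void {
  const tmp = `${path}.tmp`;
  writeFileSync(tmp, html, "utf8"); // write the complete document first
  renameSync(tmp, path);            // then swap it in as a single operation
}
```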


Dashboard Views

View 1: Runs Overview

The landing view when opening the report.

Stat cards (top row):

  • Total tests / Passed / Failed / Errors
  • Overall pass rate (%)
  • Total duration
  • Total token usage (input + output) and estimated cost
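
The stat-card aggregation above can be sketched as a single pass over the results. The `Result` shape and `summarize` name are assumptions for illustration; the real AgentV result type may differ:

```typescript
// Hypothetical minimal result shape for illustration.
type Result = {
  status: "pass" | "fail" | "error";
  durationMs: number;
  inputTokens: number;
  outputTokens: number;
};

function summarize(results: Result[]) {
  const passed = results.filter((r) => r.status === "pass").length;
  const failed = results.filter((r) => r.status === "fail").length;
  const errors = results.filter((r) => r.status === "error").length;
  const total = results.length;
  return {
    total,
    passed,
    failed,
    errors,
    passRate: total === 0 ? 0 : passed / total, // guard against empty runs
    durationMs: results.reduce((s, r) => s + r.durationMs, 0),
    tokens: results.reduce((s, r) => s + r.inputTokens + r.outputTokens, 0),
  };
}
```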

Runs table (if multiple runs or targets):

| Column | Source |
| --- | --- |
| Target name | `targetLabel` |
| Pass rate | Visual bar + percentage |
| Passed / Failed / Error counts | Result status |
| Avg score | Mean of all evaluator scores |
| Duration | Sum of test durations |
| Token usage | Input + output tokens |

Click a row → drill into that target's results.

Score distribution (per evaluator):

  • Histogram showing score distribution across all tests
  • Mean, median, min, max displayed

View 2: Test Cases Table

Per-target (or all-targets) test case detail.

| Column | Source | Notes |
| --- | --- | --- |
| Status icon | pass/fail/error | Color-coded |
| Test ID | `testId` | |
| Target | `targetLabel` | If multi-target |
| Per-evaluator scores | Each evaluator as a column | Color-coded: green ≥90%, yellow ≥50%, red <50% |
| Overall score | Composite | |
| Duration | ms | |
| Failure reason | error or evaluator reasoning | Expandable |

Interactions:

  • Filter by: status (pass/fail/error), target, evaluator, score range
  • Sort by: any column
  • Search by test ID
  • Click row → expand to show full detail (input, output, evaluator reasoning, trace)
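
The filter interactions above reduce to a pure predicate over the embedded data, which the inline script can run on every control change. A sketch, with illustrative names (`Row`, `Filters`, `applyFilters`) and the score range simplified to a minimum bound:

```typescript
// Hypothetical row and filter shapes for illustration.
type Row = { testId: string; status: string; target: string; score: number };
type Filters = { status?: string; target?: string; query?: string; minScore?: number };

// Each defined filter narrows the result set; undefined filters are no-ops.
function applyFilters(rows: Row[], f: Filters): Row[] {
  return rows.filter(
    (r) =>
      (f.status === undefined || r.status === f.status) &&
      (f.target === undefined || r.target === f.target) &&
      (f.minScore === undefined || r.score >= f.minScore) &&
      (f.query === undefined || r.testId.includes(f.query)),
  );
}
```

Sorting is then a plain `Array.prototype.sort` on the filtered rows before re-rendering the table body.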

View 3: Test Case Detail (Expanded)

When a test case row is expanded or clicked:

Input/Output panel:

  • input (the prompt / test message sequence)
  • actual_output (agent's response)
  • expected_output (if defined)

Evaluator results:

| Evaluator | Score | Pass/Fail | Threshold | Reasoning |
| --- | --- | --- | --- | --- |
| correctness | 0.85 | ✅ | 0.7 | "Response accurately..." |
| completeness | 0.60 | ❌ | 0.8 | "Missing coverage of..." |
  • Show LLM judge reasoning when available
  • Show rubric criteria breakdown when available

Metadata:

  • Duration, token usage, provider, model
  • Timestamp

View 4: A/B Regression Comparison

Available when the report contains results from 2+ runs or targets.

Side-by-side layout:

  • Left column: baseline run/target
  • Right column: comparison run/target
  • Dropdown selectors for which runs to compare

Per-test comparison rows:

  • Green highlight: test that was failing now passes (improvement)
  • Red highlight: test that was passing now fails (regression)
  • Score delta shown with ↑/↓ arrows
  • Unchanged tests shown in neutral color

Summary stats:

  • Improvements count / Regressions count / Unchanged count
  • Net score delta
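
The comparison classification and summary stats can be sketched as a join on test ID. `RunResult` and `compareRuns` are illustrative shapes, not the actual AgentV schema:

```typescript
// Hypothetical per-run result shape for illustration.
type RunResult = { testId: string; passed: boolean; score: number };

function compareRuns(baseline: RunResult[], candidate: RunResult[]) {
  const base = new Map(baseline.map((r) => [r.testId, r]));
  let improvements = 0;
  let regressions = 0;
  let unchanged = 0;
  let netDelta = 0;
  for (const c of candidate) {
    const b = base.get(c.testId);
    if (!b) continue; // only compare tests present in both runs
    netDelta += c.score - b.score;
    if (!b.passed && c.passed) improvements++;      // green highlight
    else if (b.passed && !c.passed) regressions++;  // red highlight
    else unchanged++;                               // neutral
  }
  return { improvements, regressions, unchanged, netDelta };
}
```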

View 5: Trace Visualization

Collapsible tree view of test execution (same data as agentv trace show --tree).

```
▶ test-feature-alpha                           ✅ 0.92  12.4s
  ▶ claude-3-5-sonnet                          ✅ 0.92  12.4s
    ├─ LLM Call: initial prompt                       3.2s  1,240 tokens
    ├─ Tool Call: read_file("src/index.ts")            0.1s
    ├─ LLM Call: analysis                             4.1s  2,100 tokens
    ├─ Tool Call: write_file("src/fix.ts")             0.2s
    └─ LLM Call: final response                       4.8s  1,800 tokens
```

Per-node details (on expand):

  • LLM calls: prompt, response, token count, latency, model
  • Tool calls: tool name, arguments, result, latency
  • Scores: per-evaluator scores at each level

Color coding:

  • Green: score ≥ 90%
  • Yellow: score ≥ 50%
  • Red: score < 50%
  • Gray: no score (tool calls, metadata)
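
The same thresholds apply to trace nodes and to the per-evaluator score cells in View 2, so they are worth centralizing in one function. A sketch using the thresholds above (`scoreColor` is an illustrative name):

```typescript
// Shared color-coding rule: thresholds taken from the spec above.
function scoreColor(score: number | undefined): "green" | "yellow" | "red" | "gray" {
  if (score === undefined) return "gray"; // tool calls, metadata: no score
  if (score >= 0.9) return "green";
  if (score >= 0.5) return "yellow";
  return "red";
}
```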

View 6: Category Breakdown

If tests have categories/tags:

| Category | Tests | Pass Rate | Avg Score | Passed | Failed |
| --- | --- | --- | --- | --- | --- |
| retrieval | 8 | 87.5% | 0.82 | 7 | 1 |
| reasoning | 5 | 60.0% | 0.71 | 3 | 2 |
| tool_use | 12 | 91.7% | 0.89 | 11 | 1 |

Click category → filtered test cases table.
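
The per-category rollup is a grouping pass over the results. A sketch with illustrative shapes (`CatResult`, `byCategory`), assuming each test carries a single category tag:

```typescript
// Hypothetical shapes for illustration.
type CatResult = { category: string; passed: boolean; score: number };
type CatSummary = { tests: number; passed: number; avgScore: number; passRate: number };

function byCategory(results: CatResult[]): Map<string, CatSummary> {
  const out = new Map<string, CatSummary>();
  for (const r of results) {
    const g = out.get(r.category) ?? { tests: 0, passed: 0, avgScore: 0, passRate: 0 };
    g.tests += 1;
    if (r.passed) g.passed += 1;
    g.avgScore += r.score; // running sum; divided below
    out.set(r.category, g);
  }
  for (const g of out.values()) {
    g.avgScore /= g.tests;
    g.passRate = g.passed / g.tests;
  }
  return out;
}
```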


Architecture

Writer implementation

Follows the existing OutputWriter interface pattern:

  • apps/cli/src/commands/eval/html-writer.ts
  • On each result: append to in-memory array, rewrite the HTML file with updated embedded data
  • Template is a string literal in the writer (no external template files)
  • Vanilla JS in the generated HTML — no React, no build step, no framework
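
The writer shape might look like the following sketch. It assumes an `OutputWriter` interface with per-result and close hooks; the actual interface in `apps/cli` may differ, and the render/write functions are injected here purely to keep the sketch testable:

```typescript
// Assumed interface shape; the real OutputWriter in the CLI may differ.
interface OutputWriter {
  onResult(result: unknown): void;
  close(): void;
}

class HtmlWriter implements OutputWriter {
  private readonly results: unknown[] = [];

  constructor(
    private readonly path: string,
    private readonly render: (results: unknown[]) => string,
    private readonly writeFile: (path: string, html: string) => void,
  ) {}

  onResult(result: unknown): void {
    this.results.push(result);
    // Rewrite the whole report on each result so the meta-refresh picks it up.
    this.writeFile(this.path, this.render(this.results));
  }

  close(): void {
    this.writeFile(this.path, this.render(this.results)); // final flush
  }
}
```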

Navigation

Tab-based navigation between views (Runs Overview, Test Cases, Comparison, Traces, Categories). All views are in the same single HTML file — tab switching is pure JS DOM manipulation, no routing needed.

Chart library

If histograms or bar charts are needed, vendor a minimal chart solution inlined into the template. No CDN dependencies. Alternatively, render charts as pure HTML/CSS (CSS bar widths for pass rate bars, HTML table cells for histograms).
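The pure-CSS option can be sketched in a few lines: a pass-rate bar is just a nested `div` whose inner width is the percentage. `passRateBar` and the colors are illustrative choices, not part of the spec:

```typescript
// Render a pass-rate bar as plain HTML/CSS: no chart library needed.
function passRateBar(passRate: number): string {
  // Clamp to [0, 1] so malformed data cannot overflow the bar.
  const pct = Math.round(Math.max(0, Math.min(1, passRate)) * 100);
  return (
    `<div style="background:#eee;border-radius:3px">` +
    `<div style="background:#4caf50;width:${pct}%;border-radius:3px">&nbsp;</div>` +
    `</div><span>${pct}%</span>`
  );
}
```

Histograms can follow the same pattern: one fixed-height cell per bucket, with the bar height set from the bucket count.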


Design latitude

  • Visual design, layout, CSS are implementer's choice
  • May use lightweight vendored libs (bundled into the template string, not CDN)
  • Collapsible sections and tab navigation encouraged
  • Dark mode support is nice-to-have, not required
  • Responsive layout is nice-to-have

Acceptance signals

  • agentv eval run --format html -o report.html produces a working self-contained HTML file
  • File opens correctly in Chrome/Firefox/Safari with no network requests
  • View 1: Runs overview with stat cards, runs table, score distribution
  • View 2: Test cases table with filtering, sorting, search
  • View 3: Test case detail with input/output, evaluator reasoning, metadata
  • View 4: A/B comparison when 2+ targets/runs present
  • View 5: Collapsible trace tree with LLM/tool call details
  • View 6: Category breakdown with drill-down
  • Meta-refresh updates the page during an active eval run
  • Implements OutputWriter interface — same pattern as existing writers
  • No new runtime dependencies added to the CLI
  • No external network requests from the generated HTML

Non-goals

  • Server-side rendering or hosted dashboard
  • React/framework-based SPA
  • Real-time SSE streaming (future: agentv serve)
  • PDF export
  • Monaco Editor / code browsing (future enhancement)
  • Human annotation / feedback collection (Confident AI feature, not relevant for local reports)
  • Dataset management UI
  • Arena / no-code comparison UI
  • Authentication or team features

Related


MVP Scope (added 2026-03-14)

Design Reference: skill-creator's generate_review.py

Anthropic's skill-creator includes a Python eval-viewer/generate_review.py that produces a static HTML review page. This is the design reference for the MVP — implement the same functionality in TypeScript as html-writer.ts.

What generate_review.py produces:

  • Outputs tab: One test case at a time — shows prompt, output files rendered inline, formal grades (assertion pass/fail), and user feedback form
  • Benchmark tab: Quantitative summary — pass rates, timing, token usage per configuration, per-evaluation breakdowns, analyst observations
  • Feedback download: "Submit All Reviews" button downloads feedback.json
  • Static HTML mode: --static <output_path> generates a self-contained HTML file (no server needed)
  • Iteration comparison: For iteration 2+, shows previous output and feedback side-by-side

MVP Acceptance Signals (subset of full #562)

Ship these first, then iterate toward the full spec:

What's deferred to post-MVP

  • View 4: A/B regression comparison (full side-by-side)
  • View 5: Trace visualization (collapsible tree)
  • View 6: Category breakdown
  • Meta-refresh live updates during runs
  • Score distribution histograms

Relationship to #569 (skill-creator alignment)

The companion artifacts (#565) provide the data layer. This issue provides the visualization layer. Together they give AgentV feature parity with skill-creator's eval-viewer while keeping everything in TypeScript (no Python runtime dependency).

Architecture Decision: Two Report Surfaces (added 2026-03-14)

Per CLAUDE.md's Lightweight Core principle ("Can this be achieved with existing primitives + a plugin or wrapper?"), HTML visualization is a presentation concern, not a universal primitive. Two surfaces:

1. Built-in --format html (CI/CD pipelines)

  • Lightweight TypeScript html-writer.ts in the CLI
  • Minimal: stat cards, test table, pass/fail, evaluator scores
  • No interactivity beyond filtering/sorting
  • Justification: CI/CD pipelines need report generation without an agent session, same as --format junit-xml

2. Skill-invoked Python script (interactive agent sessions)

  • plugins/agentv-dev/skills/agentv-optimizer/references/generate-report.py
  • Richer: iteration comparison, feedback forms, analyst observations
  • Modeled after skill-creator's eval-viewer/generate_review.py
  • The lifecycle skill (Phase 6: Review) invokes this for interactive review
  • Users can fork/customize without touching core

Why both?
