New file: `apps/web/src/content/docs/guides/human-review.mdx` (193 additions)
---
title: Human Review Checkpoint
description: A structured review step for annotating eval results with qualitative feedback that persists across iterations.
sidebar:
order: 6
---

Human review sits between automated scoring and the next iteration. Automated evaluators catch regressions and enforce thresholds, but a human reviewer spots score-behavior mismatches, qualitative regressions, and cases where a judge is too strict or too lenient.

## When to review

Review after every eval run where you plan to iterate on the skill or agent. The workflow:

1. **Run evals** — `agentv eval EVAL.yaml` or `agentv eval evals.json`
2. **Inspect results** — open the HTML report or scan the results JSONL
3. **Write feedback** — create `feedback.json` alongside the results
4. **Iterate** — use the feedback to guide prompt changes, evaluator tuning, or test case additions
5. **Re-run** — verify improvements in the next eval run

Skip the review step for routine CI gate runs where you only need pass/fail.

## What to look for

| Signal | Example |
|--------|---------|
| **Score-behavior mismatch** | A test scores 0.9 but the output is clearly wrong — the judge missed an error |
| **False positive** | A `contains` check passes on a coincidental substring match |
| **False negative** | An LLM judge penalizes a correct answer that uses different phrasing |
| **Qualitative regression** | Scores stay the same but tone, formatting, or helpfulness degrades |
| **Evaluator miscalibration** | A code judge is too strict on whitespace; a rubric is too lenient on accuracy |
| **Flaky results** | The same test produces wildly different scores across runs |
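
The last signal can be checked mechanically. A minimal sketch, assuming two runs' result files use the `testId` and `score` fields shown in the inspection examples later in this guide (the sample data and `/tmp` paths are illustrative):

```shell
# Build two tiny sample result files (illustrative data only)
cat > /tmp/run1.jsonl <<'EOF'
{"testId":"test-a","score":0.9}
{"testId":"test-b","score":0.4}
EOF
cat > /tmp/run2.jsonl <<'EOF'
{"testId":"test-a","score":0.9}
{"testId":"test-b","score":0.95}
EOF

# Group by test id across both runs and flag any test
# whose score spread exceeds 0.3
jq -s 'group_by(.testId)
       | map({id: .[0].testId, scores: map(.score)})
       | map(select((.scores | max - min) > 0.3))' \
  /tmp/run1.jsonl /tmp/run2.jsonl
```

Here only `test-b` is flagged; `test-a` scores identically across runs.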

## How to review

### Inspect results

For workspace evaluations (EVAL.yaml), use the trace viewer:

```bash
# View traces from a specific run
agentv trace show results/2026-03-14T10-32-00_claude/traces.jsonl

# View the HTML report (if generated via #562)
open results/2026-03-14T10-32-00_claude/report.html
```

For simple skill evaluations (evals.json), scan the results JSONL:

```bash
# Show failing tests
cat results/output.jsonl | jq 'select(.score < 0.8)'

# Show all scores
cat results/output.jsonl | jq '{id: .testId, score: .score, verdict: .verdict}'
```
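
For a quick pass rate over the same file, the filters above can be combined into one summary. A sketch, using the same `score` field; the sample file below is illustrative (real files come from `agentv`):

```shell
# Sample results file (illustrative data only)
cat > /tmp/output.jsonl <<'EOF'
{"testId":"test-a","score":0.95,"verdict":"pass"}
{"testId":"test-b","score":0.60,"verdict":"fail"}
{"testId":"test-c","score":0.85,"verdict":"pass"}
EOF

# Count tests at or above the 0.8 threshold, and the total
jq -s -c '{passing: [.[] | select(.score >= 0.8)] | length, total: length}' \
  /tmp/output.jsonl
# → {"passing":2,"total":3}
```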

### Write feedback

Create a `feedback.json` file in the results directory, alongside `results.jsonl` or `output.jsonl`:

```
results/
2026-03-14T10-32-00_claude/
results.jsonl # automated eval results
traces.jsonl # execution traces
feedback.json # ← your review annotations
```

## Feedback artifact schema

The `feedback.json` file is a structured annotation of a single eval run. It records the reviewer's qualitative assessment alongside the automated scores.

```json
{
"run_id": "2026-03-14T10-32-00_claude",
"reviewer": "engineer-name",
"timestamp": "2026-03-14T12:00:00Z",
"overall_notes": "Retrieval tests need more diverse queries. Code judge for format-check is too strict on trailing newlines.",
"per_case": [
{
"test_id": "test-feature-alpha",
"verdict": "acceptable",
"notes": "Score is borderline (0.72) but behavior is correct — the judge penalized for different phrasing."
},
{
"test_id": "test-retrieval-basic",
"verdict": "needs_improvement",
"notes": "Missing coverage of multi-document queries.",
"evaluator_overrides": {
"code-judge:format-check": "Too strict — penalized valid output with trailing newline",
"llm-judge:quality": "Score 0.6 seems fair, answer was incomplete"
},
"workspace_notes": "Workspace had stale cached files from previous run — may have affected retrieval results."
},
{
"test_id": "test-edge-case-empty",
"verdict": "flaky",
"notes": "Passed on 2 of 3 runs. Likely non-determinism in the agent's tool selection."
}
]
}
```

### Field reference

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `run_id` | `string` | yes | Identifies the eval run (matches the results directory name or run identifier) |
| `reviewer` | `string` | yes | Who performed the review |
| `timestamp` | `string` (ISO 8601) | yes | When the review was completed |
| `overall_notes` | `string` | no | High-level observations about the run |
| `per_case` | `array` | no | Per-test-case annotations |

### Per-case fields

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `test_id` | `string` | yes | Matches the test `id` from the eval file |
| `verdict` | `enum` | yes | One of: `acceptable`, `needs_improvement`, `incorrect`, `flaky` |
| `notes` | `string` | no | Free-form reviewer notes |
| `evaluator_overrides` | `object` | no | Keyed by evaluator name — reviewer annotations on specific evaluator results |
| `workspace_notes` | `string` | no | Notes about workspace state (relevant for workspace evaluations) |
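
The required fields can be sanity-checked before committing a review. A sketch with `jq` (not part of the agentv CLI); the sample file and path are illustrative:

```shell
# Minimal valid feedback file (illustrative)
cat > /tmp/fb-valid.json <<'EOF'
{
  "run_id": "2026-03-14T10-32-00_claude",
  "reviewer": "engineer-name",
  "timestamp": "2026-03-14T12:00:00Z",
  "per_case": [
    {"test_id": "test-a", "verdict": "acceptable"}
  ]
}
EOF

# -e makes jq exit non-zero when the check fails
jq -e '.run_id and .reviewer and .timestamp
       and ((.per_case // []) | all(.test_id and .verdict))' /tmp/fb-valid.json \
  && echo "feedback file looks valid"
```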

### Verdict values

| Verdict | Meaning |
|---------|---------|
| `acceptable` | Automated score and actual behavior are both satisfactory |
| `needs_improvement` | The output or coverage needs work — not a bug, but not good enough |
| `incorrect` | The output is wrong, regardless of what the automated score says |
| `flaky` | Results are inconsistent across runs — investigate non-determinism |
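
Once a review is written, verdicts can be tallied for a quick overview. A sketch with `jq`, assuming a sample file:

```shell
# Sample per_case annotations (illustrative)
cat > /tmp/fb-tally.json <<'EOF'
{"per_case": [
  {"test_id": "test-a", "verdict": "acceptable"},
  {"test_id": "test-b", "verdict": "needs_improvement"},
  {"test_id": "test-c", "verdict": "acceptable"},
  {"test_id": "test-d", "verdict": "flaky"}
]}
EOF

# Count cases per verdict
jq -c '.per_case | group_by(.verdict)
       | map({(.[0].verdict): length}) | add' /tmp/fb-tally.json
# → {"acceptable":2,"flaky":1,"needs_improvement":1}
```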

### Evaluator overrides (workspace evaluations)

For workspace evaluations with multiple evaluators (code judges, LLM judges, tool trajectory checks), the `evaluator_overrides` field lets the reviewer annotate specific evaluator results:

```json
{
"test_id": "test-refactor-api",
"verdict": "needs_improvement",
"evaluator_overrides": {
"code-judge:test-pass": "Tests pass but the refactored code has a subtle race condition the tests don't cover",
"llm-judge:quality": "Score 0.9 is too high — the agent left dead code behind",
"tool-trajectory:efficiency": "Used 12 tool calls where 5 would suffice, but the result is correct"
},
"workspace_notes": "Agent cloned the repo correctly but didn't clean up temp files."
}
```

Keys use the format `evaluator-type:evaluator-name` to match the evaluators defined in `assert` blocks.
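
Overrides nested across cases can be flattened into one line per annotation for scanning. A sketch with `jq`; the sample file is illustrative:

```shell
# Sample feedback with overrides (illustrative)
cat > /tmp/fb-overrides.json <<'EOF'
{"per_case": [
  {"test_id": "test-refactor-api",
   "verdict": "needs_improvement",
   "evaluator_overrides": {
     "code-judge:test-pass": "Tests miss a race condition",
     "llm-judge:quality": "Score 0.9 is too high"
   }},
  {"test_id": "test-a", "verdict": "acceptable"}
]}
EOF

# One line per override: test id, evaluator key, reviewer note
jq -r '.per_case[]
       | .test_id as $t
       | (.evaluator_overrides // {}) | to_entries[]
       | "\($t)\t\(.key)\t\(.value)"' /tmp/fb-overrides.json
```

Cases without overrides (like `test-a` above) are skipped via the `// {}` fallback.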

## Storing feedback across iterations

Keep feedback files alongside results to build a history of review decisions:

```
results/
2026-03-12T09-00-00_claude/
results.jsonl
feedback.json # first iteration review
2026-03-14T10-32-00_claude/
results.jsonl
feedback.json # second iteration review
2026-03-15T16-00-00_claude/
results.jsonl
feedback.json # third iteration review
```

This creates a traceable record of what changed between iterations and why. When debugging a regression, check previous `feedback.json` files to see if the issue was noted before.
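
That history is greppable. A sketch, recreating a small results tree under `/tmp` for illustration and listing every run whose review flagged a flaky case:

```shell
# Recreate a small results history (illustrative paths and data)
mkdir -p /tmp/results/run1 /tmp/results/run2
echo '{"per_case":[{"test_id":"test-a","verdict":"flaky"}]}'      > /tmp/results/run1/feedback.json
echo '{"per_case":[{"test_id":"test-a","verdict":"acceptable"}]}' > /tmp/results/run2/feedback.json

# List runs where any case was marked flaky
for f in /tmp/results/*/feedback.json; do
  if jq -e '[.per_case[]? | select(.verdict == "flaky")] | length > 0' "$f" >/dev/null; then
    echo "$f"
  fi
done
```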

## Integration with eval workflow

The review checkpoint fits into the broader eval iteration loop:

```
Define tests (EVAL.yaml / evals.json)
Run automated evals
Review results ← you are here
Write feedback.json
Tune prompts / evaluators / test cases
Re-run evals
Compare with previous run (agentv compare)
Review again (if iterating)
```

Use `agentv compare` to quantify changes between runs, then review the diff to confirm that score improvements reflect genuine behavioral improvements.
Changes to `plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md` (28 additions):

```bash
agentv create assertion <name>   # → .agentv/assertions/<name>.ts
agentv create eval <name>        # → evals/<name>.eval.yaml + .cases.jsonl
```

## Human Review Checkpoint

After running evals, perform a human review before iterating. Create `feedback.json` in the results directory alongside `results.jsonl`:

```json
{
"run_id": "2026-03-14T10-32-00_claude",
"reviewer": "engineer-name",
"timestamp": "2026-03-14T12:00:00Z",
"overall_notes": "Summary of observations",
"per_case": [
{
"test_id": "test-id",
"verdict": "acceptable | needs_improvement | incorrect | flaky",
"notes": "Why this verdict",
"evaluator_overrides": { "code-judge:name": "Override note" },
"workspace_notes": "Workspace state observations"
}
]
}
```

Use `evaluator_overrides` for workspace evaluations to annotate specific evaluator results (e.g., "code-judge was too strict"). Use `workspace_notes` for observations about workspace state.

Review workflow: run evals → inspect results (`agentv trace show`) → write feedback → tune prompts/evaluators → re-run.

Full guide: https://agentv.dev/guides/human-review/

## Schemas

- Eval file: `references/eval-schema.json`