New file: `apps/web/src/content/docs/guides/human-review.mdx` (193 additions)
---
title: Human Review Checkpoint
description: A structured review step for annotating eval results with qualitative feedback that persists across iterations.
sidebar:
order: 6
---

Human review sits between automated scoring and the next iteration. Automated evaluators catch regressions and enforce thresholds, but a human reviewer spots score-behavior mismatches, qualitative regressions, and cases where a judge is too strict or too lenient.

## When to review

Review after every eval run where you plan to iterate on the skill or agent. The workflow:

1. **Run evals** — `agentv eval EVAL.yaml` or `agentv eval evals.json`
2. **Inspect results** — open the HTML report or scan the results JSONL
3. **Write feedback** — create `feedback.json` alongside the results
4. **Iterate** — use the feedback to guide prompt changes, evaluator tuning, or test case additions
5. **Re-run** — verify improvements in the next eval run

Skip the review step for routine CI gate runs where you only need pass/fail.

## What to look for

| Signal | Example |
|--------|---------|
| **Score-behavior mismatch** | A test scores 0.9 but the output is clearly wrong — the judge missed an error |
| **False positive** | A `contains` check passes on a coincidental substring match |
| **False negative** | An LLM judge penalizes a correct answer that uses different phrasing |
| **Qualitative regression** | Scores stay the same but tone, formatting, or helpfulness degrades |
| **Evaluator miscalibration** | A code judge is too strict on whitespace; a rubric is too lenient on accuracy |
| **Flaky results** | The same test produces wildly different scores across runs |
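
The last signal can be checked mechanically. A minimal sketch, assuming two runs' result files use the `testId` and `score` fields shown in the inspection examples later in this guide (the sample data and `/tmp` paths are illustrative):

```shell
# Build two tiny sample result files (illustrative data only)
cat > /tmp/run1.jsonl <<'EOF'
{"testId":"test-a","score":0.9}
{"testId":"test-b","score":0.4}
EOF
cat > /tmp/run2.jsonl <<'EOF'
{"testId":"test-a","score":0.9}
{"testId":"test-b","score":0.95}
EOF

# Group by test id across both runs and flag any test
# whose score spread exceeds 0.3
jq -s 'group_by(.testId)
       | map({id: .[0].testId, scores: map(.score)})
       | map(select((.scores | max - min) > 0.3))' \
  /tmp/run1.jsonl /tmp/run2.jsonl
```

Here only `test-b` is flagged; `test-a` scores identically across runs.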

## How to review

### Inspect results

For workspace evaluations (EVAL.yaml), use the trace viewer:

```bash
# View traces from a specific run
agentv trace show results/2026-03-14T10-32-00_claude/traces.jsonl

# View the HTML report (if generated via #562)
open results/2026-03-14T10-32-00_claude/report.html
```

For simple skill evaluations (evals.json), scan the results JSONL:

```bash
# Show failing tests
cat results/output.jsonl | jq 'select(.score < 0.8)'

# Show all scores
cat results/output.jsonl | jq '{id: .testId, score: .score, verdict: .verdict}'
```
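
For a quick pass rate over the same file, the filters above can be combined into one summary. A sketch, using the same `score` field; the sample file below is illustrative (real files come from `agentv`):

```shell
# Sample results file (illustrative data only)
cat > /tmp/output.jsonl <<'EOF'
{"testId":"test-a","score":0.95,"verdict":"pass"}
{"testId":"test-b","score":0.60,"verdict":"fail"}
{"testId":"test-c","score":0.85,"verdict":"pass"}
EOF

# Count tests at or above the 0.8 threshold, and the total
jq -s -c '{passing: [.[] | select(.score >= 0.8)] | length, total: length}' \
  /tmp/output.jsonl
# → {"passing":2,"total":3}
```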

### Write feedback

Create a `feedback.json` file in the results directory, alongside `results.jsonl` or `output.jsonl`:

```
results/
2026-03-14T10-32-00_claude/
results.jsonl # automated eval results
traces.jsonl # execution traces
feedback.json # ← your review annotations
```

## Feedback artifact schema

The `feedback.json` file is a structured annotation of a single eval run. It records the reviewer's qualitative assessment alongside the automated scores.

```json
{
"run_id": "2026-03-14T10-32-00_claude",
"reviewer": "engineer-name",
"timestamp": "2026-03-14T12:00:00Z",
"overall_notes": "Retrieval tests need more diverse queries. Code judge for format-check is too strict on trailing newlines.",
"per_case": [
{
"test_id": "test-feature-alpha",
"verdict": "acceptable",
"notes": "Score is borderline (0.72) but behavior is correct — the judge penalized for different phrasing."
},
{
"test_id": "test-retrieval-basic",
"verdict": "needs_improvement",
"notes": "Missing coverage of multi-document queries.",
"evaluator_overrides": {
"code-judge:format-check": "Too strict — penalized valid output with trailing newline",
"llm-judge:quality": "Score 0.6 seems fair, answer was incomplete"
},
"workspace_notes": "Workspace had stale cached files from previous run — may have affected retrieval results."
},
{
"test_id": "test-edge-case-empty",
"verdict": "flaky",
"notes": "Passed on 2 of 3 runs. Likely non-determinism in the agent's tool selection."
}
]
}
```

### Field reference

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `run_id` | `string` | yes | Identifies the eval run (matches the results directory name or run identifier) |
| `reviewer` | `string` | yes | Who performed the review |
| `timestamp` | `string` (ISO 8601) | yes | When the review was completed |
| `overall_notes` | `string` | no | High-level observations about the run |
| `per_case` | `array` | no | Per-test-case annotations |

### Per-case fields

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `test_id` | `string` | yes | Matches the test `id` from the eval file |
| `verdict` | `enum` | yes | One of: `acceptable`, `needs_improvement`, `incorrect`, `flaky` |
| `notes` | `string` | no | Free-form reviewer notes |
| `evaluator_overrides` | `object` | no | Keyed by evaluator name — reviewer annotations on specific evaluator results |
| `workspace_notes` | `string` | no | Notes about workspace state (relevant for workspace evaluations) |
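
The required fields can be sanity-checked before committing a review. A sketch with `jq` (not part of the agentv CLI); the sample file and path are illustrative:

```shell
# Minimal valid feedback file (illustrative)
cat > /tmp/fb-valid.json <<'EOF'
{
  "run_id": "2026-03-14T10-32-00_claude",
  "reviewer": "engineer-name",
  "timestamp": "2026-03-14T12:00:00Z",
  "per_case": [
    {"test_id": "test-a", "verdict": "acceptable"}
  ]
}
EOF

# -e makes jq exit non-zero when the check fails
jq -e '.run_id and .reviewer and .timestamp
       and ((.per_case // []) | all(.test_id and .verdict))' /tmp/fb-valid.json \
  && echo "feedback file looks valid"
```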

### Verdict values

| Verdict | Meaning |
|---------|---------|
| `acceptable` | Automated score and actual behavior are both satisfactory |
| `needs_improvement` | The output or coverage needs work — not a bug, but not good enough |
| `incorrect` | The output is wrong, regardless of what the automated score says |
| `flaky` | Results are inconsistent across runs — investigate non-determinism |
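
Once a review is written, verdicts can be tallied for a quick overview. A sketch with `jq`, assuming a sample file:

```shell
# Sample per_case annotations (illustrative)
cat > /tmp/fb-tally.json <<'EOF'
{"per_case": [
  {"test_id": "test-a", "verdict": "acceptable"},
  {"test_id": "test-b", "verdict": "needs_improvement"},
  {"test_id": "test-c", "verdict": "acceptable"},
  {"test_id": "test-d", "verdict": "flaky"}
]}
EOF

# Count cases per verdict
jq -c '.per_case | group_by(.verdict)
       | map({(.[0].verdict): length}) | add' /tmp/fb-tally.json
# → {"acceptable":2,"flaky":1,"needs_improvement":1}
```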

### Evaluator overrides (workspace evaluations)

For workspace evaluations with multiple evaluators (code judges, LLM judges, tool trajectory checks), the `evaluator_overrides` field lets the reviewer annotate specific evaluator results:

```json
{
"test_id": "test-refactor-api",
"verdict": "needs_improvement",
"evaluator_overrides": {
"code-judge:test-pass": "Tests pass but the refactored code has a subtle race condition the tests don't cover",
"llm-judge:quality": "Score 0.9 is too high — the agent left dead code behind",
"tool-trajectory:efficiency": "Used 12 tool calls where 5 would suffice, but the result is correct"
},
"workspace_notes": "Agent cloned the repo correctly but didn't clean up temp files."
}
```

Keys use the format `evaluator-type:evaluator-name` to match the evaluators defined in `assert` blocks.
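
Overrides nested across cases can be flattened into one line per annotation for scanning. A sketch with `jq`; the sample file is illustrative:

```shell
# Sample feedback with overrides (illustrative)
cat > /tmp/fb-overrides.json <<'EOF'
{"per_case": [
  {"test_id": "test-refactor-api",
   "verdict": "needs_improvement",
   "evaluator_overrides": {
     "code-judge:test-pass": "Tests miss a race condition",
     "llm-judge:quality": "Score 0.9 is too high"
   }},
  {"test_id": "test-a", "verdict": "acceptable"}
]}
EOF

# One line per override: test id, evaluator key, reviewer note
jq -r '.per_case[]
       | .test_id as $t
       | (.evaluator_overrides // {}) | to_entries[]
       | "\($t)\t\(.key)\t\(.value)"' /tmp/fb-overrides.json
```

Cases without overrides (like `test-a` above) are skipped via the `// {}` fallback.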

## Storing feedback across iterations

Keep feedback files alongside results to build a history of review decisions:

```
results/
2026-03-12T09-00-00_claude/
results.jsonl
feedback.json # first iteration review
2026-03-14T10-32-00_claude/
results.jsonl
feedback.json # second iteration review
2026-03-15T16-00-00_claude/
results.jsonl
feedback.json # third iteration review
```

This creates a traceable record of what changed between iterations and why. When debugging a regression, check previous `feedback.json` files to see if the issue was noted before.
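
That history is greppable. A sketch, recreating a small results tree under `/tmp` for illustration and listing every run whose review flagged a flaky case:

```shell
# Recreate a small results history (illustrative paths and data)
mkdir -p /tmp/results/run1 /tmp/results/run2
echo '{"per_case":[{"test_id":"test-a","verdict":"flaky"}]}'      > /tmp/results/run1/feedback.json
echo '{"per_case":[{"test_id":"test-a","verdict":"acceptable"}]}' > /tmp/results/run2/feedback.json

# List runs where any case was marked flaky
for f in /tmp/results/*/feedback.json; do
  if jq -e '[.per_case[]? | select(.verdict == "flaky")] | length > 0' "$f" >/dev/null; then
    echo "$f"
  fi
done
```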

## Integration with eval workflow

The review checkpoint fits into the broader eval iteration loop:

```
Define tests (EVAL.yaml / evals.json)
Run automated evals
Review results ← you are here
Write feedback.json
Tune prompts / evaluators / test cases
Re-run evals
Compare with previous run (agentv compare)
Review again (if iterating)
```

Use `agentv compare` to quantify changes between runs, then review the diff to confirm that score improvements reflect genuine behavioral improvements.
Changes to `plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md` (28 additions):

```bash
agentv create assertion <name>   # → .agentv/assertions/<name>.ts
agentv create eval <name>        # → evals/<name>.eval.yaml + .cases.jsonl
```

## Human Review Checkpoint

After running evals, perform a human review before iterating. Create `feedback.json` in the results directory alongside `results.jsonl`:

```json
{
"run_id": "2026-03-14T10-32-00_claude",
"reviewer": "engineer-name",
"timestamp": "2026-03-14T12:00:00Z",
"overall_notes": "Summary of observations",
"per_case": [
{
"test_id": "test-id",
"verdict": "acceptable | needs_improvement | incorrect | flaky",
"notes": "Why this verdict",
"evaluator_overrides": { "code-judge:name": "Override note" },
"workspace_notes": "Workspace state observations"
}
]
}
```

Use `evaluator_overrides` for workspace evaluations to annotate specific evaluator results (e.g., "code-judge was too strict"). Use `workspace_notes` for observations about workspace state.

Review workflow: run evals → inspect results (`agentv trace show`) → write feedback → tune prompts/evaluators → re-run.

Full guide: https://agentv.dev/guides/human-review/

## Schemas

- Eval file: `references/eval-schema.json`