diff --git a/apps/web/src/content/docs/guides/human-review.mdx b/apps/web/src/content/docs/guides/human-review.mdx new file mode 100644 index 000000000..88bd1ef0c --- /dev/null +++ b/apps/web/src/content/docs/guides/human-review.mdx @@ -0,0 +1,193 @@ +--- +title: Human Review Checkpoint +description: A structured review step for annotating eval results with qualitative feedback that persists across iterations. +sidebar: + order: 6 +--- + +Human review sits between automated scoring and the next iteration. Automated evaluators catch regressions and enforce thresholds, but a human reviewer spots score-behavior mismatches, qualitative regressions, and cases where a judge is too strict or too lenient. + +## When to review + +Review after every eval run where you plan to iterate on the skill or agent. The workflow: + +1. **Run evals** — `agentv eval EVAL.yaml` or `agentv eval evals.json` +2. **Inspect results** — open the HTML report or scan the results JSONL +3. **Write feedback** — create `feedback.json` alongside the results +4. **Iterate** — use the feedback to guide prompt changes, evaluator tuning, or test case additions +5. **Re-run** — verify improvements in the next eval run + +Skip the review step for routine CI gate runs where you only need pass/fail. 
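The inspect step of this loop can be partly mechanized. As a minimal sketch (not part of agentv), assuming each line of the results JSONL is a JSON object with a `testId` and a numeric `score` — field names here are illustrative — a short script can surface the cases worth a closer look before you write feedback:

```python
import json

def cases_to_review(jsonl_path, threshold=0.8):
    """Return (testId, score) pairs for results scoring below the threshold."""
    flagged = []
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines between JSONL records
            result = json.loads(line)
            if result.get("score", 0.0) < threshold:
                flagged.append((result.get("testId"), result["score"]))
    return flagged
```

This is the same filter as the `jq 'select(.score < 0.8)'` one-liner shown below, in script form so you can feed the flagged IDs into a `feedback.json` skeleton.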
+ +## What to look for + +| Signal | Example | +|--------|---------| +| **Score-behavior mismatch** | A test scores 0.9 but the output is clearly wrong — the judge missed an error | +| **False positive** | A `contains` check passes on a coincidental substring match | +| **False negative** | An LLM judge penalizes a correct answer that uses different phrasing | +| **Qualitative regression** | Scores stay the same but tone, formatting, or helpfulness degrades | +| **Evaluator miscalibration** | A code judge is too strict on whitespace; a rubric is too lenient on accuracy | +| **Flaky results** | The same test produces wildly different scores across runs | + +## How to review + +### Inspect results + +For workspace evaluations (EVAL.yaml), use the trace viewer: + +```bash +# View traces from a specific run +agentv trace show results/2026-03-14T10-32-00_claude/traces.jsonl + +# View the HTML report (if generated via #562) +open results/2026-03-14T10-32-00_claude/report.html +``` + +For simple skill evaluations (evals.json), scan the results JSONL: + +```bash +# Show failing tests +cat results/output.jsonl | jq 'select(.score < 0.8)' + +# Show all scores +cat results/output.jsonl | jq '{id: .testId, score: .score, verdict: .verdict}' +``` + +### Write feedback + +Create a `feedback.json` file in the results directory, alongside `results.jsonl` or `output.jsonl`: + +``` +results/ + 2026-03-14T10-32-00_claude/ + results.jsonl # automated eval results + traces.jsonl # execution traces + feedback.json # ← your review annotations +``` + +## Feedback artifact schema + +The `feedback.json` file is a structured annotation of a single eval run. It records the reviewer's qualitative assessment alongside the automated scores. + +```json +{ + "run_id": "2026-03-14T10-32-00_claude", + "reviewer": "engineer-name", + "timestamp": "2026-03-14T12:00:00Z", + "overall_notes": "Retrieval tests need more diverse queries. 
Code judge for format-check is too strict on trailing newlines.", + "per_case": [ + { + "test_id": "test-feature-alpha", + "verdict": "acceptable", + "notes": "Score is borderline (0.72) but behavior is correct — the judge penalized for different phrasing." + }, + { + "test_id": "test-retrieval-basic", + "verdict": "needs_improvement", + "notes": "Missing coverage of multi-document queries.", + "evaluator_overrides": { + "code-judge:format-check": "Too strict — penalized valid output with trailing newline", + "llm-judge:quality": "Score 0.6 seems fair, answer was incomplete" + }, + "workspace_notes": "Workspace had stale cached files from previous run — may have affected retrieval results." + }, + { + "test_id": "test-edge-case-empty", + "verdict": "flaky", + "notes": "Passed on 2 of 3 runs. Likely non-determinism in the agent's tool selection." + } + ] +} +``` + +### Field reference + +| Field | Type | Required | Description | +|-------|------|----------|-------------| +| `run_id` | `string` | yes | Identifies the eval run (matches the results directory name or run identifier) | +| `reviewer` | `string` | yes | Who performed the review | +| `timestamp` | `string` (ISO 8601) | yes | When the review was completed | +| `overall_notes` | `string` | no | High-level observations about the run | +| `per_case` | `array` | no | Per-test-case annotations | + +### Per-case fields + +| Field | Type | Required | Description | +|-------|------|----------|-------------| +| `test_id` | `string` | yes | Matches the test `id` from the eval file | +| `verdict` | `enum` | yes | One of: `acceptable`, `needs_improvement`, `incorrect`, `flaky` | +| `notes` | `string` | no | Free-form reviewer notes | +| `evaluator_overrides` | `object` | no | Keyed by evaluator name — reviewer annotations on specific evaluator results | +| `workspace_notes` | `string` | no | Notes about workspace state (relevant for workspace evaluations) | + +### Verdict values + +| Verdict | Meaning | 
+|---------|---------| +| `acceptable` | Automated score and actual behavior are both satisfactory | +| `needs_improvement` | The output or coverage needs work — not a bug, but not good enough | +| `incorrect` | The output is wrong, regardless of what the automated score says | +| `flaky` | Results are inconsistent across runs — investigate non-determinism | + +### Evaluator overrides (workspace evaluations) + +For workspace evaluations with multiple evaluators (code judges, LLM judges, tool trajectory checks), the `evaluator_overrides` field lets the reviewer annotate specific evaluator results: + +```json +{ + "test_id": "test-refactor-api", + "verdict": "needs_improvement", + "evaluator_overrides": { + "code-judge:test-pass": "Tests pass but the refactored code has a subtle race condition the tests don't cover", + "llm-judge:quality": "Score 0.9 is too high — the agent left dead code behind", + "tool-trajectory:efficiency": "Used 12 tool calls where 5 would suffice, but the result is correct" + }, + "workspace_notes": "Agent cloned the repo correctly but didn't clean up temp files." +} +``` + +Keys use the format `evaluator-type:evaluator-name` to match the evaluators defined in `assert` blocks. + +## Storing feedback across iterations + +Keep feedback files alongside results to build a history of review decisions: + +``` +results/ + 2026-03-12T09-00-00_claude/ + results.jsonl + feedback.json # first iteration review + 2026-03-14T10-32-00_claude/ + results.jsonl + feedback.json # second iteration review + 2026-03-15T16-00-00_claude/ + results.jsonl + feedback.json # third iteration review +``` + +This creates a traceable record of what changed between iterations and why. When debugging a regression, check previous `feedback.json` files to see if the issue was noted before. 
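The review history above can also be summarized mechanically. As a hypothetical helper (not part of agentv), assuming the `feedback.json` schema described in this guide and the `results/<run>/feedback.json` layout shown above, this sketch tallies per-case verdicts for each run so shifts in review outcomes between iterations stand out:

```python
import json
from collections import Counter
from pathlib import Path

def verdict_history(results_dir):
    """Map each run directory name to a count of its per-case verdicts."""
    history = {}
    # Sorted glob keeps timestamped run directories in chronological order.
    for feedback_path in sorted(Path(results_dir).glob("*/feedback.json")):
        feedback = json.loads(feedback_path.read_text())
        counts = Counter(case["verdict"] for case in feedback.get("per_case", []))
        history[feedback_path.parent.name] = dict(counts)
    return history
```

A run whose `flaky` or `incorrect` count climbs between iterations is a prompt to reread the earlier feedback files before tuning further.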
+ +## Integration with eval workflow + +The review checkpoint fits into the broader eval iteration loop: + +``` +Define tests (EVAL.yaml / evals.json) + ↓ + Run automated evals + ↓ + Review results ← you are here + ↓ + Write feedback.json + ↓ + Tune prompts / evaluators / test cases + ↓ + Re-run evals + ↓ + Compare with previous run (agentv compare) + ↓ + Review again (if iterating) +``` + +Use `agentv compare` to quantify changes between runs, then review the diff to confirm that score improvements reflect genuine behavioral improvements. diff --git a/plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md b/plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md index 17f32366f..1f0b50a7c 100644 --- a/plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md +++ b/plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md @@ -576,6 +576,34 @@ agentv create assertion # → .agentv/assertions/.ts agentv create eval # → evals/.eval.yaml + .cases.jsonl ``` +## Human Review Checkpoint + +After running evals, perform a human review before iterating. Create `feedback.json` in the results directory alongside `results.jsonl`: + +```json +{ + "run_id": "2026-03-14T10-32-00_claude", + "reviewer": "engineer-name", + "timestamp": "2026-03-14T12:00:00Z", + "overall_notes": "Summary of observations", + "per_case": [ + { + "test_id": "test-id", + "verdict": "acceptable | needs_improvement | incorrect | flaky", + "notes": "Why this verdict", + "evaluator_overrides": { "code-judge:name": "Override note" }, + "workspace_notes": "Workspace state observations" + } + ] +} +``` + +Use `evaluator_overrides` for workspace evaluations to annotate specific evaluator results (e.g., "code-judge was too strict"). Use `workspace_notes` for observations about workspace state. + +Review workflow: run evals → inspect results (`agentv trace show`) → write feedback → tune prompts/evaluators → re-run. + +Full guide: https://agentv.dev/guides/human-review/ + ## Schemas - Eval file: `references/eval-schema.json`