feat: human review checkpoint and feedback artifact for skill iteration #568

@christso

Description

Objective

Add a documented human review step to the skill evaluation workflow with a structured feedback artifact, enabling reviewers to annotate eval results with qualitative notes that persist across iterations and inform the next improvement cycle.

Architecture Boundary

external-first (docs + schema definition)

MVP Deliverable

Docs + schema definition only. The minimum viable deliverable is:

  1. A documented review workflow ("after automated scoring, before next iteration")
  2. A defined feedback artifact schema
  3. Guidance on where to store feedback (alongside results)

Interactive tooling (CLI helper, feedback collection in HTML report) is a follow-up, not part of this issue.

Context

Both Anthropic's skill-creator and the HBOon/Tessl workflow treat human review as a first-class step. The skill-creator uses eval-viewer/generate_review.py to produce a review surface and collect structured feedback as a feedback.json artifact.

AgentV's agentv-optimizer skill already includes a review-like checkpoint in Phase 4 (Polish) where the human reviews changes before handoff. The static HTML dashboard (#562) provides visualization. What's missing is a documented, lightweight review step with a portable feedback artifact that works independently of the optimizer and dashboard.

Design Latitude

  • Choose the feedback artifact format (JSON recommended for consistency with other artifacts)
  • The review workflow can be as simple as: "open the HTML report (#562), review failures, write feedback.json in the results directory"
  • Choose specific field names and verdict enums for the schema
  • May define the schema in the docs, in a JSON Schema file, or both

Suggested Schema (adapt as needed)

{
  "run_id": "2026-03-14T10-32-00_claude",
  "reviewer": "engineer-name",
  "timestamp": "2026-03-14T12:00:00Z",
  "overall_notes": "Retrieval tests need more diverse queries",
  "per_case": [
    { "test_id": "test-feature-alpha", "verdict": "acceptable", "notes": "Score is borderline but behavior is correct" },
    { "test_id": "test-retrieval-basic", "verdict": "needs_improvement", "notes": "Missing coverage of multi-document queries" }
  ]
}
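To make the suggested schema concrete, here is a minimal validation sketch in Python. It assumes the field names above and a hypothetical three-value verdict enum (`pass` / `acceptable` / `needs_improvement`); the issue explicitly leaves both the field names and the verdict values as design latitude, so treat these as placeholders, not a decided contract.

```python
import json

# Hypothetical verdict enum -- the issue leaves the exact values open.
VERDICTS = {"pass", "acceptable", "needs_improvement"}

# Top-level and per-case required fields, mirroring the suggested schema.
REQUIRED_TOP = {"run_id", "reviewer", "timestamp", "overall_notes", "per_case"}
REQUIRED_CASE = {"test_id", "verdict", "notes"}

def validate_feedback(raw: str) -> list[str]:
    """Return a list of problems found in a feedback.json payload (empty means valid)."""
    problems: list[str] = []
    data = json.loads(raw)
    problems += [f"missing field: {k}" for k in sorted(REQUIRED_TOP - data.keys())]
    for i, case in enumerate(data.get("per_case", [])):
        problems += [f"per_case[{i}] missing: {k}" for k in sorted(REQUIRED_CASE - case.keys())]
        if case.get("verdict") not in VERDICTS:
            problems.append(f"per_case[{i}] unknown verdict: {case.get('verdict')!r}")
    return problems

sample = json.dumps({
    "run_id": "2026-03-14T10-32-00_claude",
    "reviewer": "engineer-name",
    "timestamp": "2026-03-14T12:00:00Z",
    "overall_notes": "Retrieval tests need more diverse queries",
    "per_case": [
        {"test_id": "test-feature-alpha", "verdict": "acceptable",
         "notes": "Score is borderline but behavior is correct"},
    ],
})
print(validate_feedback(sample))  # → []
```

A check like this could live in docs as a copy-paste snippet or graduate into the follow-up CLI helper; either way the schema itself stays the source of truth.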

Acceptance Signals

  • A documented review step exists in the skill evaluation workflow docs
  • A feedback artifact schema is defined with at minimum: run reference, reviewer, timestamp, overall notes, and per-case annotations
  • The workflow documentation explains when to review (after automated scoring, before next iteration) and what to look for (score-behavior mismatches, false positives, qualitative regressions)
  • The feedback artifact location is specified (e.g., feedback.json alongside results.jsonl)
  • The agentv-eval-builder skill reference card is updated to mention the review step
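The artifact-location signal above can be illustrated with a short sketch. The per-run directory layout here is an assumption for illustration; the issue only fixes the artifact's placement (feedback.json alongside results.jsonl), not the surrounding structure.

```python
import json
import tempfile
from pathlib import Path

def write_feedback(results_dir: Path, feedback: dict) -> Path:
    """Persist reviewer feedback next to the run's results.jsonl."""
    path = results_dir / "feedback.json"
    path.write_text(json.dumps(feedback, indent=2) + "\n")
    return path

# Hypothetical per-run results directory (run id used as the directory name).
run_dir = Path(tempfile.mkdtemp()) / "2026-03-14T10-32-00_claude"
run_dir.mkdir(parents=True)
(run_dir / "results.jsonl").touch()  # stand-in for the automated scoring output

out = write_feedback(run_dir, {
    "run_id": run_dir.name,
    "reviewer": "engineer-name",
    "timestamp": "2026-03-14T12:00:00Z",
    "overall_notes": "",
    "per_case": [],
})
print(out.name)  # → feedback.json
```

Keeping the feedback file in the same directory as results.jsonl means the annotation travels with the run it describes, so the next improvement cycle can pick both up together without extra lookup machinery.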

Non-Goals

  • Building a feedback collection UI or interactive review tool (follow-up)
  • Adding feedback ingestion to the optimizer workflow (follow-up — optimizer already has its own review checkpoint)
  • Real-time collaborative annotation
  • Auto-generating improvement suggestions from feedback (that's the analyzer's domain; see #567, the eval analyzer pass for weak assertions and flaky scenarios)
  • Replacing automated scoring with human review

Related

Research Basis
