# feat: human review checkpoint and feedback artifact for skill iteration #568
## Objective
Add a documented human review step to the skill evaluation workflow with a structured feedback artifact, enabling reviewers to annotate eval results with qualitative notes that persist across iterations and inform the next improvement cycle.
## Architecture Boundary
external-first (docs + schema definition)
## MVP Deliverable
Docs + schema definition only. The minimum viable deliverable is:
- A documented review workflow ("after automated scoring, before next iteration")
- A defined feedback artifact schema
- Guidance on where to store feedback (alongside results)
Interactive tooling (CLI helper, feedback collection in HTML report) is a follow-up, not part of this issue.
## Context
Both Anthropic's skill-creator and the HBOon/Tessl workflow treat human review as a first-class step. The skill-creator uses `eval-viewer/generate_review.py` to produce a review surface and collects structured feedback in a `feedback.json`.
AgentV's `agentv-optimizer` skill already includes a review-like checkpoint in Phase 4 (Polish), where the human reviews changes before handoff. The static HTML dashboard (#562) provides visualization. What's missing is a documented, lightweight review step with a portable feedback artifact that works independently of the optimizer and the dashboard.
## Design Latitude
- Choose the feedback artifact format (JSON recommended for consistency with other artifacts)
- The review workflow can be as simple as: "open the HTML report (#562), review failures, write `feedback.json` in the results directory"
- Choose specific field names and verdict enums for the schema
- May define the schema in the docs, in a JSON Schema file, or both
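As an illustration of that minimal workflow, a reviewer could produce the artifact by hand or with a few lines of Python. This is a sketch only: the directory layout and field values are hypothetical, and the field names follow the suggested schema in this issue, which is explicitly open to adaptation.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical results directory for one eval run; adjust to wherever
# results.jsonl actually lives in your setup.
results_dir = Path("results/2026-03-14T10-32-00_claude")
results_dir.mkdir(parents=True, exist_ok=True)

# Feedback artifact using the suggested (non-final) field names.
feedback = {
    "run_id": results_dir.name,
    "reviewer": "engineer-name",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "overall_notes": "Retrieval tests need more diverse queries",
    "per_case": [
        {
            "test_id": "test-retrieval-basic",
            "verdict": "needs_improvement",
            "notes": "Missing coverage of multi-document queries",
        }
    ],
}

# Store the feedback alongside the run's results, per the MVP guidance.
(results_dir / "feedback.json").write_text(json.dumps(feedback, indent=2))
```

Writing the file next to the run's results keeps the annotation portable: any later tool (dashboard, optimizer, analyzer) can discover it without a database.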
## Suggested Schema (adapt as needed)

```json
{
  "run_id": "2026-03-14T10-32-00_claude",
  "reviewer": "engineer-name",
  "timestamp": "2026-03-14T12:00:00Z",
  "overall_notes": "Retrieval tests need more diverse queries",
  "per_case": [
    { "test_id": "test-feature-alpha", "verdict": "acceptable", "notes": "Score is borderline but behavior is correct" },
    { "test_id": "test-retrieval-basic", "verdict": "needs_improvement", "notes": "Missing coverage of multi-document queries" }
  ]
}
```

## Acceptance Signals
- A documented review step exists in the skill evaluation workflow docs
- A feedback artifact schema is defined with at minimum: run reference, reviewer, timestamp, overall notes, and per-case annotations
- The workflow documentation explains when to review (after automated scoring, before next iteration) and what to look for (score-behavior mismatches, false positives, qualitative regressions)
- The feedback artifact location is specified (e.g., `feedback.json` alongside `results.jsonl`)
- The `agentv-eval-builder` skill reference card is updated to mention the review step
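The "at minimum" fields above could be checked mechanically. The sketch below validates a feedback artifact against those minimum fields using only the standard library; the required-key set and the verdict enum are assumptions drawn from the suggested schema, not a settled contract.

```python
import json

# Minimum top-level fields named in the acceptance signals.
REQUIRED = {"run_id", "reviewer", "timestamp", "overall_notes", "per_case"}
# Hypothetical verdict enum; the issue leaves the exact values open.
VERDICTS = {"acceptable", "needs_improvement"}


def validate_feedback(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the artifact passes."""
    doc = json.loads(raw)
    problems = [f"missing field: {f}" for f in sorted(REQUIRED - doc.keys())]
    for i, case in enumerate(doc.get("per_case", [])):
        if "test_id" not in case:
            problems.append(f"per_case[{i}]: missing test_id")
        if case.get("verdict") not in VERDICTS:
            problems.append(f"per_case[{i}]: unknown verdict {case.get('verdict')!r}")
    return problems
```

If the schema is also published as a JSON Schema file (one of the options under Design Latitude), a check like this could be replaced by an off-the-shelf validator.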
## Non-Goals
- Building a feedback collection UI or interactive review tool (follow-up)
- Adding feedback ingestion to the optimizer workflow (follow-up — optimizer already has its own review checkpoint)
- Real-time collaborative annotation
- Auto-generating improvement suggestions from feedback (that's the analyzer's domain — feat: eval analyzer pass for weak assertions and flaky scenarios #567)
- Replacing automated scoring with human review
## Related
- feat: self-contained HTML dashboard with meta-refresh (runs, benchmarks, traces) #562 — Static HTML dashboard (potential review surface for reading results)
- feat: AgentV Studio — eval management platform with historical trends, quality gates, and orchestration #563 — Self-hosted dashboard (potential future review surface with persistence)
- `agentv-optimizer` Phase 4 (Polish) — existing review checkpoint within the optimizer workflow