feat: human review checkpoint and feedback artifact for skill iteration #568

@christso

Description

Objective

Add a documented human review step to the skill evaluation workflow with a structured feedback artifact, enabling reviewers to annotate eval results with qualitative notes that persist across iterations and inform the next improvement cycle.

Architecture Boundary

external-first (docs + schema definition)

MVP Deliverable

Docs + schema definition only. The minimum viable deliverable is:

  1. A documented review workflow ("after automated scoring, before next iteration")
  2. A defined feedback artifact schema
  3. Guidance on where to store feedback (alongside results)

Interactive tooling (CLI helper, feedback collection in HTML report) is a follow-up, not part of this issue.

Context

Both Anthropic's skill-creator and the HBOon/Tessl workflow treat human review as a first-class step. The skill-creator uses eval-viewer/generate_review.py to produce a review surface and collect structured feedback as a feedback.json artifact.

AgentV's agentv-optimizer skill already includes a review-like checkpoint in Phase 4 (Polish) where the human reviews changes before handoff. The static HTML dashboard (#562) provides visualization. What's missing is a documented, lightweight review step with a portable feedback artifact that works independently of the optimizer and dashboard.

Design Latitude

  • Choose the feedback artifact format (JSON recommended for consistency with other artifacts)
  • The review workflow can be as simple as: "open the HTML report (#562), review failures, write feedback.json in the results directory"
  • Choose specific field names and verdict enums for the schema
  • May define the schema in the docs, in a JSON Schema file, or both

Suggested Schema (adapt as needed)

{
  "run_id": "2026-03-14T10-32-00_claude",
  "reviewer": "engineer-name",
  "timestamp": "2026-03-14T12:00:00Z",
  "overall_notes": "Retrieval tests need more diverse queries",
  "per_case": [
    { "test_id": "test-feature-alpha", "verdict": "acceptable", "notes": "Score is borderline but behavior is correct" },
    { "test_id": "test-retrieval-basic", "verdict": "needs_improvement", "notes": "Missing coverage of multi-document queries" }
  ]
}
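To make the suggested schema concrete, here is a minimal validation sketch in Python. It assumes the field names above and a hypothetical three-value verdict enum (`pass` / `acceptable` / `needs_improvement`); the issue explicitly leaves both the field names and the verdict values as design latitude, so treat these as placeholders, not a decided contract.

```python
import json

# Hypothetical verdict enum -- the issue leaves the exact values open.
VERDICTS = {"pass", "acceptable", "needs_improvement"}

# Top-level and per-case required fields, mirroring the suggested schema.
REQUIRED_TOP = {"run_id", "reviewer", "timestamp", "overall_notes", "per_case"}
REQUIRED_CASE = {"test_id", "verdict", "notes"}

def validate_feedback(raw: str) -> list[str]:
    """Return a list of problems found in a feedback.json payload (empty means valid)."""
    problems: list[str] = []
    data = json.loads(raw)
    problems += [f"missing field: {k}" for k in sorted(REQUIRED_TOP - data.keys())]
    for i, case in enumerate(data.get("per_case", [])):
        problems += [f"per_case[{i}] missing: {k}" for k in sorted(REQUIRED_CASE - case.keys())]
        if case.get("verdict") not in VERDICTS:
            problems.append(f"per_case[{i}] unknown verdict: {case.get('verdict')!r}")
    return problems

sample = json.dumps({
    "run_id": "2026-03-14T10-32-00_claude",
    "reviewer": "engineer-name",
    "timestamp": "2026-03-14T12:00:00Z",
    "overall_notes": "Retrieval tests need more diverse queries",
    "per_case": [
        {"test_id": "test-feature-alpha", "verdict": "acceptable",
         "notes": "Score is borderline but behavior is correct"},
    ],
})
print(validate_feedback(sample))  # → []
```

A check like this could live in docs as a copy-paste snippet or graduate into the follow-up CLI helper; either way the schema itself stays the source of truth.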

Acceptance Signals

  • A documented review step exists in the skill evaluation workflow docs
  • A feedback artifact schema is defined with at minimum: run reference, reviewer, timestamp, overall notes, and per-case annotations
  • The workflow documentation explains when to review (after automated scoring, before next iteration) and what to look for (score-behavior mismatches, false positives, qualitative regressions)
  • The feedback artifact location is specified (e.g., feedback.json alongside results.jsonl)
  • The agentv-eval-builder skill reference card is updated to mention the review step
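The artifact-location signal above can be illustrated with a short sketch. The per-run directory layout here is an assumption for illustration; the issue only fixes the artifact's placement (feedback.json alongside results.jsonl), not the surrounding structure.

```python
import json
import tempfile
from pathlib import Path

def write_feedback(results_dir: Path, feedback: dict) -> Path:
    """Persist reviewer feedback next to the run's results.jsonl."""
    path = results_dir / "feedback.json"
    path.write_text(json.dumps(feedback, indent=2) + "\n")
    return path

# Hypothetical per-run results directory (run id used as the directory name).
run_dir = Path(tempfile.mkdtemp()) / "2026-03-14T10-32-00_claude"
run_dir.mkdir(parents=True)
(run_dir / "results.jsonl").touch()  # stand-in for the automated scoring output

out = write_feedback(run_dir, {
    "run_id": run_dir.name,
    "reviewer": "engineer-name",
    "timestamp": "2026-03-14T12:00:00Z",
    "overall_notes": "",
    "per_case": [],
})
print(out.name)  # → feedback.json
```

Keeping the feedback file in the same directory as results.jsonl means the annotation travels with the run it describes, so the next improvement cycle can pick both up together without extra lookup machinery.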

Non-Goals

  • Building a feedback collection UI or interactive review tool (follow-up)
  • Adding feedback ingestion to the optimizer workflow (follow-up — optimizer already has its own review checkpoint)
  • Real-time collaborative annotation
  • Auto-generating improvement suggestions from feedback (that's the analyzer's domain; see #567, the eval analyzer pass for weak assertions and flaky scenarios)
  • Replacing automated scoring with human review

Related

Research Basis
