
docs: canonical skill-improvement workflow guide #564

@christso

Description

Objective

Publish a user-facing skill-improvement workflow guide that teaches AgentV users how to iteratively evaluate and improve skills using evals.json, agentv run, agentv compare, and EVAL.yaml — aligning with the industry lifecycle pattern established by Anthropic's skill-creator and Tessl.

Architecture Boundary

docs-examples

Context

Research (agentevals-research PR #33) found that Anthropic's skill-creator and Tessl both treat skill quality as a full lifecycle: Define → Draft → Run paired evals → Grade → Review → Improve → Re-run → Package. AgentV already has all the underlying primitives and a sophisticated optimizer skill (agentv-optimizer) with 5-phase workflow, SIMBA/GEPA analysis patterns, and baseline comparison — but no user-facing documentation tying the primitives together into a simple, approachable guide.

The agentv-optimizer skill is powerful but complex (467 lines, 4 sub-agents). This guide should be a simpler entry point that teaches the core loop and points users to the optimizer for advanced iteration.

What already exists in AgentV

  • evals.json support with conversion to EVAL YAML (agentv-eval-builder skill)
  • EVAL.yaml for deterministic, multi-turn, and workspace-aware evaluation
  • agentv compare for before/after baseline comparison
  • agentv-optimizer skill with Discovery → Planning → Optimization → Polish → Handoff phases
  • agentv-eval-orchestrator skill that generates benchmark.json in Agent Skills format
  • agentv-trace-analyst skill for failure investigation and A/B comparison
  • optimizer-reflector agent with SIMBA (self-introspective failure analysis) and GEPA (trace reflection)

The gap is a simple user-facing guide, not the underlying capabilities.
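Strung together, the manual core loop the guide should teach looks roughly like the sketch below. The subcommand spellings follow the migration table in this issue (`agentv eval run`, `agentv compare`); the result-file paths are placeholders, and anything not named elsewhere in this issue is an assumption, not confirmed CLI surface.

```sh
# Baseline: run the scenarios with no skill in any discovery path.
agentv eval run evals.json --target claude

# Candidate: same scenarios, with the draft skill injected for this run only.
agentv eval run evals.json --target claude

# Compare the two runs before iterating on the skill.
agentv compare <baseline-results> <candidate-results>
```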

Design Latitude

  • Choose the best doc structure and placement within the existing Astro docs site
  • May restructure or extend existing guides (agent-skills-evals.mdx, running-evals.mdx, tools/compare.mdx) or create a new guide that links to them
  • Choose how to present the progression from evals.json to EVAL.yaml
  • Progressive disclosure guidance for skill authoring can be inline or a separate section
  • May include one or more worked examples
  • Should complement (not duplicate) the agentv-optimizer skill — the guide teaches the manual loop, the skill automates it

Migration Guide: Skill-Creator → AgentV

The workflow guide should include a migration section for users coming from skill-creator's eval approach. The migration is near-zero friction:

| Skill-Creator | AgentV | Notes |
| --- | --- | --- |
| `evals.json` | `agentv eval run evals.json` | Direct — no conversion needed |
| `claude -p "prompt"` | `agentv eval run evals.json --target claude` | Same eval, richer engine |
| `grading.json` (read) | `grading.json` (write) | Same schema, AgentV produces it |
| `benchmark.json` (read) | `benchmark.json` (write) | Same schema, AgentV produces it |
| with-skill vs without-skill | `--target claude --target gpt4o` | Multi-provider instead of binary |
| `eval-viewer/generate_review.py` | `agentv eval run --format html` (#562) | Or keep using eval-viewer — artifacts are compatible |
| Graduate to richer evals | `agentv eval convert evals.json` → `EVAL.yaml` | Adds: workspace, code judges, tool trajectory, multi-turn |

The guide should make clear: you do not need to rewrite your evals.json. AgentV reads it directly. The only change is the command you run.
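For orientation, a minimal `evals.json` scenario might look like the sketch below. The field names (`scenarios`, `id`, `prompt`, `expected`) are illustrative assumptions, not the documented skill-creator schema — the guide should show the real shape. The point is only that the same file is passed to `agentv eval run` unchanged.

```json
{
  "scenarios": [
    {
      "id": "release-notes",
      "prompt": "Summarize CHANGELOG.md into release notes.",
      "expected": "Groups entries by type and calls out breaking changes."
    }
  ]
}
```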

Acceptance Signals

  • A user can follow the guide from scratch to run a paired baseline-vs-candidate skill evaluation using existing AgentV commands
  • The guide covers: writing scenarios in evals.json, running baseline and candidate with agentv run, comparing with agentv compare, graduating to EVAL.yaml for deterministic/process-level checks, and reviewing failures before iterating
  • Baseline comparison is presented as the default evaluation pattern (with-skill vs without-skill / previous-skill)
  • Discovery-path contamination risks are documented: skills auto-loaded from .claude/skills/ and plugin discovery directories affect baseline validity. The practical workaround is to develop skills outside discovery paths and inject only for the candidate run
  • Packaging guidance: distributable skill assets should exclude local eval authoring artifacts (evals/, iteration directories)
  • Progressive disclosure is recommended for skill authoring (short triggerable instructions + on-demand references/scripts)
  • The agentv-eval-builder skill reference card is updated to cross-reference this guide (per CLAUDE.md Documentation Updates guidelines)
  • The guide references the agentv-optimizer skill for users who want automated iteration
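The discovery-path contamination workaround named in the signals above could be demonstrated in the guide with a sketch like this (the copy-in/copy-out injection step is an assumption — the guide should document whatever injection mechanism AgentV actually supports):

```sh
# Develop the skill outside auto-discovery paths (.claude/skills/, plugin dirs).
mkdir -p ~/skill-dev/my-skill

# Baseline run: no copy of the skill is discoverable.
agentv eval run evals.json --target claude

# Candidate run: inject the skill, evaluate, then remove it again.
cp -r ~/skill-dev/my-skill .claude/skills/my-skill
agentv eval run evals.json --target claude
rm -rf .claude/skills/my-skill
```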

Non-Goals

  • Adding new CLI commands or runtime features
  • Building trigger-evaluation tooling
  • Creating a review UI (covered separately)
  • Prescribing a specific skill directory structure (that's the Agent Skills standard's domain)
  • Replacing or duplicating the agentv-optimizer skill's content

Research Basis

agentevals-research PR #33 — lifecycle analysis of Anthropic's skill-creator and Tessl (see Context above).
