docs: canonical skill-improvement workflow guide #564
Description
Objective
Publish a user-facing skill-improvement workflow guide that teaches AgentV users how to iteratively evaluate and improve skills using evals.json, agentv run, agentv compare, and EVAL.yaml — aligning with the industry lifecycle pattern established by Anthropic's skill-creator and Tessl.
Architecture Boundary
docs-examples
Context
Research (agentevals-research PR #33) found that Anthropic's skill-creator and Tessl both treat skill quality as a full lifecycle: Define → Draft → Run paired evals → Grade → Review → Improve → Re-run → Package. AgentV already has all the underlying primitives and a sophisticated optimizer skill (agentv-optimizer) with 5-phase workflow, SIMBA/GEPA analysis patterns, and baseline comparison — but no user-facing documentation tying the primitives together into a simple, approachable guide.
The agentv-optimizer skill is powerful but complex (467 lines, 4 sub-agents). This guide should be a simpler entry point that teaches the core loop and points users to the optimizer for advanced iteration.
What already exists in AgentV
- `evals.json` support with conversion to EVAL YAML (`agentv-eval-builder` skill)
- `EVAL.yaml` for deterministic, multi-turn, and workspace-aware evaluation
- `agentv compare` for before/after baseline comparison
- `agentv-optimizer` skill with Discovery → Planning → Optimization → Polish → Handoff phases
- `agentv-eval-orchestrator` skill that generates `benchmark.json` in Agent Skills format
- `agentv-trace-analyst` skill for failure investigation and A/B comparison
- `optimizer-reflector` agent with SIMBA (self-introspective failure analysis) and GEPA (trace reflection)
The gap is a simple user-facing guide, not the underlying capabilities.
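For orientation, a minimal `evals.json` scenario file in the skill-creator style might look like the sketch below. The field names here are illustrative only — the guide should use the actual schema that AgentV reads, which this issue does not spell out:

```json
{
  "evals": [
    {
      "id": "summarize-changelog",
      "prompt": "Summarize CHANGELOG.md into three bullet points",
      "expected_behavior": "Produces exactly three bullets citing real entries"
    }
  ]
}
```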
Design Latitude
- Choose the best doc structure and placement within the existing Astro docs site
- May restructure or extend existing guides (`agent-skills-evals.mdx`, `running-evals.mdx`, `tools/compare.mdx`) or create a new guide that links to them
- Choose how to present the progression from `evals.json` to `EVAL.yaml`
- Progressive disclosure guidance for skill authoring can be inline or a separate section
- May include one or more worked examples
- Should complement (not duplicate) the `agentv-optimizer` skill — the guide teaches the manual loop, the skill automates it
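The manual loop the guide teaches could be sketched roughly as follows. This is an illustrative sequence only — the commands shown (`agentv eval run`, `agentv compare`) appear elsewhere in this issue, but the result-directory arguments and the skill-injection step are placeholders for whatever mechanism the guide documents:

```sh
# Illustrative sketch — argument shapes are placeholders, not real flags.

# 1. Baseline: run the scenarios without the candidate skill
agentv eval run evals.json --target claude

# 2. Candidate: inject the skill (mechanism to be defined by the guide),
#    then run the same scenarios again
agentv eval run evals.json --target claude

# 3. Compare the two runs before iterating on the skill
agentv compare <baseline-results> <candidate-results>
```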
Migration Guide: Skill-Creator → AgentV
The workflow guide should include a migration section for users coming from skill-creator's eval approach. The migration is near-zero friction:
| Skill-Creator | AgentV | Notes |
|---|---|---|
| `evals.json` | `agentv eval run evals.json` | Direct — no conversion needed |
| `claude -p "prompt"` | `agentv eval run evals.json --target claude` | Same eval, richer engine |
| `grading.json` (read) | `grading.json` (write) | Same schema, AgentV produces it |
| `benchmark.json` (read) | `benchmark.json` (write) | Same schema, AgentV produces it |
| with-skill vs without-skill | `--target claude --target gpt4o` | Multi-provider instead of binary |
| `eval-viewer/generate_review.py` | `agentv eval run --format html` (#562) | Or keep using eval-viewer — artifacts are compatible |
| Graduate to richer evals | `agentv eval convert evals.json` → `EVAL.yaml` | Adds: workspace, code judges, tool trajectory, multi-turn |
The guide should make clear: you do not need to rewrite your evals.json. AgentV reads it directly. The only change is the command you run.
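The near-zero-friction claim can be made concrete with a before/after sketch like this (the skill-creator side is paraphrased; the AgentV command is the one named in the table above):

```sh
# Before (skill-creator): a custom runner iterates over evals.json
# and shells out to `claude -p "..."` per scenario.

# After (AgentV): the same evals.json file, unmodified, run directly:
agentv eval run evals.json
```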
Acceptance Signals
- A user can follow the guide from scratch to run a paired baseline-vs-candidate skill evaluation using existing AgentV commands
- The guide covers: writing scenarios in `evals.json`, running baseline and candidate with `agentv run`, comparing with `agentv compare`, graduating to `EVAL.yaml` for deterministic/process-level checks, and reviewing failures before iterating
- Baseline comparison is presented as the default evaluation pattern (with-skill vs without-skill / previous-skill)
- Discovery-path contamination risks are documented: skills auto-loaded from `.claude/skills/` and plugin discovery directories affect baseline validity. The practical workaround is to develop skills outside discovery paths and inject only for the candidate run
- Packaging guidance: distributable skill assets should exclude local eval authoring artifacts (`evals/`, iteration directories)
- Progressive disclosure is recommended for skill authoring (short triggerable instructions + on-demand references/scripts)
- The `agentv-eval-builder` skill reference card is updated to cross-reference this guide (per CLAUDE.md Documentation Updates guidelines)
- The guide references the `agentv-optimizer` skill for users who want automated iteration
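The "graduating to `EVAL.yaml`" signal could be illustrated in the guide with a fragment along these lines. Every key shown here is a placeholder — the real schema is defined by AgentV's `EVAL.yaml` support, not by this issue — but it conveys what the graduation buys (workspace, deterministic checks, tool trajectory):

```yaml
# Illustrative EVAL.yaml fragment — key names are assumptions, not the real schema.
scenario: summarize-changelog
workspace: ./fixtures/changelog-repo   # workspace-aware evaluation
checks:
  - type: deterministic                # process-level, exact assertion
    expect: "three bullet points"
  - type: tool-trajectory              # verifies which tools were invoked
```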
Non-Goals
- Adding new CLI commands or runtime features
- Building trigger-evaluation tooling
- Creating a review UI (covered separately)
- Prescribing a specific skill directory structure (that's the Agent Skills standard's domain)
- Replacing or duplicating the `agentv-optimizer` skill's content