
docs: canonical skill-improvement workflow guide #564

@christso

Description

Objective

Publish a user-facing skill-improvement workflow guide that teaches AgentV users how to iteratively evaluate and improve skills using evals.json, agentv run, agentv compare, and EVAL.yaml — aligning with the industry lifecycle pattern established by Anthropic's skill-creator and Tessl.

Architecture Boundary

docs-examples

Context

Research (agentevals-research PR #33) found that Anthropic's skill-creator and Tessl both treat skill quality as a full lifecycle: Define → Draft → Run paired evals → Grade → Review → Improve → Re-run → Package. AgentV already has all the underlying primitives and a sophisticated optimizer skill (agentv-optimizer) with 5-phase workflow, SIMBA/GEPA analysis patterns, and baseline comparison — but no user-facing documentation tying the primitives together into a simple, approachable guide.

The agentv-optimizer skill is powerful but complex (467 lines, 4 sub-agents). This guide should be a simpler entry point that teaches the core loop and points users to the optimizer for advanced iteration.

What already exists in AgentV

  • evals.json support with conversion to EVAL YAML (agentv-eval-builder skill)
  • EVAL.yaml for deterministic, multi-turn, and workspace-aware evaluation
  • agentv compare for before/after baseline comparison
  • agentv-optimizer skill with Discovery → Planning → Optimization → Polish → Handoff phases
  • agentv-eval-orchestrator skill that generates benchmark.json in Agent Skills format
  • agentv-trace-analyst skill for failure investigation and A/B comparison
  • optimizer-reflector agent with SIMBA (self-introspective failure analysis) and GEPA (trace reflection)

The gap is a simple user-facing guide, not the underlying capabilities.
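Strung together, the manual core loop the guide should teach looks roughly like the sketch below. The subcommand spellings follow the migration table in this issue (`agentv eval run`, `agentv compare`); the result-file paths are placeholders, and anything not named elsewhere in this issue is an assumption, not confirmed CLI surface.

```sh
# Baseline: run the scenarios with no skill in any discovery path.
agentv eval run evals.json --target claude

# Candidate: same scenarios, with the draft skill injected for this run only.
agentv eval run evals.json --target claude

# Compare the two runs before iterating on the skill.
agentv compare <baseline-results> <candidate-results>
```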

Design Latitude

  • Choose the best doc structure and placement within the existing Astro docs site
  • May restructure or extend existing guides (agent-skills-evals.mdx, running-evals.mdx, tools/compare.mdx) or create a new guide that links to them
  • Choose how to present the progression from evals.json to EVAL.yaml
  • Progressive disclosure guidance for skill authoring can be inline or a separate section
  • May include one or more worked examples
  • Should complement (not duplicate) the agentv-optimizer skill — the guide teaches the manual loop, the skill automates it

Migration Guide: Skill-Creator → AgentV

The workflow guide should include a migration section for users coming from skill-creator's eval approach. The migration is near-zero friction:

| Skill-Creator | AgentV | Notes |
| --- | --- | --- |
| `evals.json` | `agentv eval run evals.json` | Direct — no conversion needed |
| `claude -p "prompt"` | `agentv eval run evals.json --target claude` | Same eval, richer engine |
| `grading.json` (read) | `grading.json` (write) | Same schema, AgentV produces it |
| `benchmark.json` (read) | `benchmark.json` (write) | Same schema, AgentV produces it |
| with-skill vs without-skill | `--target claude --target gpt4o` | Multi-provider instead of binary |
| `eval-viewer/generate_review.py` | `agentv eval run --format html` (#562) | Or keep using eval-viewer — artifacts are compatible |
| Graduate to richer evals | `agentv eval convert evals.json` → `EVAL.yaml` | Adds: workspace, code judges, tool trajectory, multi-turn |

The guide should make clear: you do not need to rewrite your evals.json. AgentV reads it directly. The only change is the command you run.
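For orientation, a minimal `evals.json` scenario might look like the sketch below. The field names (`scenarios`, `id`, `prompt`, `expected`) are illustrative assumptions, not the documented skill-creator schema — the guide should show the real shape. The point is only that the same file is passed to `agentv eval run` unchanged.

```json
{
  "scenarios": [
    {
      "id": "release-notes",
      "prompt": "Summarize CHANGELOG.md into release notes.",
      "expected": "Groups entries by type and calls out breaking changes."
    }
  ]
}
```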

Acceptance Signals

  • A user can follow the guide from scratch to run a paired baseline-vs-candidate skill evaluation using existing AgentV commands
  • The guide covers: writing scenarios in evals.json, running baseline and candidate with agentv run, comparing with agentv compare, graduating to EVAL.yaml for deterministic/process-level checks, and reviewing failures before iterating
  • Baseline comparison is presented as the default evaluation pattern (with-skill vs without-skill / previous-skill)
  • Discovery-path contamination risks are documented: skills auto-loaded from .claude/skills/ and plugin discovery directories affect baseline validity. The practical workaround is to develop skills outside discovery paths and inject only for the candidate run
  • Packaging guidance: distributable skill assets should exclude local eval authoring artifacts (evals/, iteration directories)
  • Progressive disclosure is recommended for skill authoring (short triggerable instructions + on-demand references/scripts)
  • The agentv-eval-builder skill reference card is updated to cross-reference this guide (per CLAUDE.md Documentation Updates guidelines)
  • The guide references the agentv-optimizer skill for users who want automated iteration
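The discovery-path contamination workaround named in the signals above could be demonstrated in the guide with a sketch like this (the copy-in/copy-out injection step is an assumption — the guide should document whatever injection mechanism AgentV actually supports):

```sh
# Develop the skill outside auto-discovery paths (.claude/skills/, plugin dirs).
mkdir -p ~/skill-dev/my-skill

# Baseline run: no copy of the skill is discoverable.
agentv eval run evals.json --target claude

# Candidate run: inject the skill, evaluate, then remove it again.
cp -r ~/skill-dev/my-skill .claude/skills/my-skill
agentv eval run evals.json --target claude
rm -rf .claude/skills/my-skill
```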

Non-Goals

  • Adding new CLI commands or runtime features
  • Building trigger-evaluation tooling
  • Creating a review UI (covered separately)
  • Prescribing a specific skill directory structure (that's the Agent Skills standard's domain)
  • Replacing or duplicating the agentv-optimizer skill's content

Research Basis

agentevals-research PR #33 — lifecycle analysis of Anthropic's skill-creator and Tessl (see Context above).
