From 3fac206a4bb86e4d6464e6c2b99711956a74e6b6 Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Sat, 14 Mar 2026 04:12:55 +0000 Subject: [PATCH] docs: canonical skill-improvement workflow guide (#564) Publish user-facing guide for iterative skill evaluation covering the full loop: write scenarios, run baseline, compare, review, improve, re-run. Includes migration section for skill-creator users and cross-reference from agentv-eval-builder. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../guides/skill-improvement-workflow.mdx | 309 ++++++++++++++++++ .../skills/agentv-eval-builder/SKILL.md | 4 + 2 files changed, 313 insertions(+) create mode 100644 apps/web/src/content/docs/guides/skill-improvement-workflow.mdx diff --git a/apps/web/src/content/docs/guides/skill-improvement-workflow.mdx b/apps/web/src/content/docs/guides/skill-improvement-workflow.mdx new file mode 100644 index 000000000..d86047967 --- /dev/null +++ b/apps/web/src/content/docs/guides/skill-improvement-workflow.mdx @@ -0,0 +1,309 @@ +--- +title: Skill Improvement Workflow +description: Iteratively evaluate and improve agent skills using AgentV +sidebar: + order: 6 +--- + +## Introduction + +AgentV supports a full evaluation-driven improvement loop for skills and agents. Instead of guessing whether a change makes things better, you run structured evaluations before and after, then compare. + +This guide teaches the **core manual loop**. For automated iteration that runs the full cycle hands-free, see [agentv-optimizer](#automated-iteration). 
+ +## The Core Loop + +Every skill improvement follows the same cycle: + +``` +┌─────────────────┐ +│ Write Scenarios │ +└────────┬────────┘ + ▼ +┌─────────────────┐ +│ Run Baseline │◄──────────────────┐ +└────────┬────────┘ │ + ▼ │ +┌─────────────────┐ │ +│ Run Candidate │ │ +└────────┬────────┘ │ + ▼ │ +┌─────────────────┐ │ +│ Compare │ │ +└────────┬────────┘ │ + ▼ │ +┌─────────────────┐ │ +│ Review Failures │ │ +└────────┬────────┘ │ + ▼ │ +┌─────────────────┐ │ +│ Improve Skill │────── Re-run ─────┘ +└─────────────────┘ +``` + +1. **Write test scenarios** that capture what the skill should do +2. **Run a baseline** evaluation without the skill (or with the previous version) +3. **Run a candidate** evaluation with the new or updated skill +4. **Compare** the two runs to see what improved and what regressed +5. **Review failures** to understand why specific cases failed +6. **Improve** the skill based on failure analysis +7. **Re-run** and iterate until the candidate consistently beats the baseline + +## Step 1: Write Test Scenarios + +Start with `evals.json` for quick iteration. It's the simplest format and works directly with AgentV — no conversion needed. + +```json +{ + "skill_name": "code-reviewer", + "evals": [ + { + "id": 1, + "prompt": "Review this Python function for bugs:\n\ndef divide(a, b):\n return a / b", + "expected_output": "The function should handle division by zero.", + "assertions": [ + "Identifies the division by zero risk", + "Suggests adding error handling" + ] + }, + { + "id": 2, + "prompt": "Review this function:\n\ndef greet(name):\n return f'Hello, {name}!'", + "expected_output": "The function is simple and correct.", + "assertions": [ + "Does not flag false issues", + "Acknowledges the function is straightforward" + ] + } + ] +} +``` + +For assisted authoring, use the `agentv-eval-builder` skill — it knows the full eval file schema and can generate test cases from descriptions. + +:::tip +Start with 5–10 focused test cases. 
You can always add more as you discover edge cases during the review step.
:::

## Step 2: Run Baseline Evaluation

Run the evaluation **without** the skill loaded to establish a baseline:

```bash
agentv eval evals.json --target baseline
```

This produces a results file (e.g., `results-baseline.jsonl`) showing how the agent performs on its own.

### Baseline isolation

Skills in `.claude/skills/` are auto-loaded via progressive disclosure. This means your baseline may accidentally include the skill you're testing.

**Workaround:** Develop skills outside discovery paths during the evaluation cycle. Keep your skill-in-progress in a working directory (e.g., `drafts/`) and only move it to `.claude/skills/` when you're satisfied with the evaluation results.

```bash
# Skill lives outside the discovery path during development
drafts/
  my-skill/
    SKILL.md

# Baseline run won't pick it up
agentv eval evals.json --target baseline
```

## Step 3: Run Candidate Evaluation

Run the same evaluation **with** the skill loaded:

```bash
agentv eval evals.json --target candidate
```

Or, if you're using agent mode (no API keys required):

```bash
# Get the input prompt for a test case
agentv prompt eval input evals.json --test-id 1

# After running the agent, judge the response
agentv prompt eval judge evals.json --test-id 1 --answer-file answer.txt
```

Agent mode is useful when you want to evaluate skills with agents that don't have a direct API integration — you orchestrate the agent yourself and let AgentV handle the judging.
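Whichever mode you use, each run yields per-test results that you'll compare against the baseline. As a rough mental model of what that comparison computes, here is a minimal Python sketch over a simplified results shape with only `id` and `score` fields — an illustration under assumed field names, not AgentV's actual output format:

```python
# Illustrative only: a simplified view of per-test comparison.
# Real AgentV results files contain more fields than id/score.

def compare(baseline, candidate, pass_threshold=0.7):
    """Return per-test score deltas plus any regressions (pass -> fail)."""
    base = {r["id"]: r["score"] for r in baseline}
    deltas, regressions = {}, []
    for r in candidate:
        tid, score = r["id"], r["score"]
        deltas[tid] = round(score - base[tid], 4)
        if base[tid] >= pass_threshold and score < pass_threshold:
            regressions.append(tid)
    return deltas, regressions

baseline = [{"id": "1", "score": 0.4}, {"id": "2", "score": 0.9}]
candidate = [{"id": "1", "score": 0.8}, {"id": "2", "score": 0.5}]
deltas, regressions = compare(baseline, candidate)
print(deltas)       # {'1': 0.4, '2': -0.4}
print(regressions)  # ['2']
```

The real comparison tooling adds aggregate statistics and richer reporting, but the core question — which tests moved, and did any go from passing to failing — is exactly this.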
+ +## Step 4: Compare Results + +Compare the baseline and candidate runs: + +```bash +agentv compare results-baseline.jsonl results-candidate.jsonl +``` + +The comparison output shows: + +- **Per-test score deltas** — which cases improved, regressed, or stayed the same +- **Aggregate statistics** — overall pass rate change, mean score shift +- **Regressions** — cases that were passing before but now fail (these need immediate attention) + +Look for: +- ✅ **Net positive delta** — more cases improved than regressed +- ⚠️ **Any regressions** — even one regression deserves investigation +- 📊 **Score distribution** — are improvements concentrated or spread across cases? + +## Step 5: Review Failures + +Use trace inspection to understand why specific cases failed: + +```bash +agentv trace show +``` + +When reviewing failures, categorize them: + +| Category | Description | Action | +|----------|-------------|--------| +| **True failure** | The skill genuinely handled the case wrong | Improve the skill | +| **False positive** | Got a passing score but the answer was wrong | Tighten assertions | +| **False negative** | Correct answer but scored as failing | Fix the evaluation criteria | +| **Systematic pattern** | Multiple failures share the same root cause | Address the pattern, not individual cases | + +Systematic patterns are the highest-value findings. A single skill improvement that fixes a pattern can resolve multiple test failures at once. + +## Step 6: Improve the Skill + +Apply targeted improvements based on your failure analysis: + +- **Keep changes small and testable.** One improvement per iteration makes it easy to attribute score changes. +- **Document what changed and why.** A brief note in your commit message helps when reviewing the improvement history. +- **Address systematic patterns first.** These give the best return on effort. 
+ +```markdown + +fix(code-reviewer): handle edge case for single-line functions + +The skill was flagging all single-line functions as "too terse" even when +they were appropriate (e.g., simple getters). Added context-aware length +assessment. + +Failure pattern: tests 2, 5, 8 all failed with false-positive complexity warnings. +``` + +## Step 7: Re-run and Iterate + +Loop back to Step 3 with the improved skill: + +```bash +# Run the improved candidate +agentv eval evals.json --target candidate + +# Compare against the previous baseline +agentv compare results-baseline.jsonl results-candidate.jsonl +``` + +Each iteration should show: +- Previous regressions resolved +- No new regressions introduced +- Steady improvement in overall pass rate + +:::note +Keep your baseline stable across iterations. Only re-run the baseline when the test scenarios themselves change (Step 1), not when the skill changes. +::: + +## Graduating to EVAL.yaml + +When `evals.json` becomes limiting — you need workspace isolation, code judges, tool trajectory checks, or multi-turn conversations — graduate to EVAL.yaml: + +```bash +agentv convert evals.json -o eval.yaml +``` + +The generated YAML preserves all your existing test cases and adds comments showing AgentV features you can use: + +```yaml +# Converted from Agent Skills evals.json +tests: + - id: "1" + criteria: |- + The function should handle division by zero. + input: + - role: user + content: "Review this Python function for bugs:..." 
+ assert: + - name: assertion-1 + type: llm-judge + prompt: "Identifies the division by zero risk" + # Replace with type: contains for deterministic checks: + # - type: contains + # value: "ZeroDivisionError" +``` + +After converting, you can: +- Replace `llm-judge` assertions with faster deterministic evaluators (`contains`, `regex`, `equals`) +- Add `workspace` configuration for file-system isolation +- Use `code-judge` for custom scoring logic +- Define `tool-trajectory` assertions to check tool usage patterns + +See [Skill Evals (evals.json)](/guides/agent-skills-evals/) for the full field mapping and side-by-side comparison. + +## Migration from Skill-Creator + +If you've been using the Agent Skills skill-creator workflow, AgentV reads your existing files directly — no rewrite needed. + +| Skill-Creator | AgentV | Notes | +|--------------|--------|-------| +| `evals.json` | `agentv eval evals.json` | Direct — no conversion needed | +| `claude -p "prompt"` | `agentv eval evals.json --target claude` | Same eval, richer engine | +| `grading.json` (read) | `grading.json` (write) | Same schema, AgentV produces it | +| `benchmark.json` (read) | `benchmark.json` (write) | Same schema, AgentV produces it | +| with-skill vs without-skill | `--target baseline --target candidate` | Structured comparison | +| Graduate to richer evals | `agentv convert evals.json` → EVAL.yaml | Adds workspace, code judges, etc. | + +**Key takeaway:** You do not need to rewrite your `evals.json`. AgentV reads it directly and adds a richer evaluation engine on top. + +## Baseline Comparison Best Practices + +### Discovery-path contamination + +Skills placed in `.claude/skills/` are auto-discovered and loaded into every agent session. This means your baseline run may unknowingly include the skill you're trying to evaluate. + +**Mitigation strategies:** +1. **Develop outside discovery paths** — keep skills in `drafts/` or `wip/` during evaluation +2. 
**Use explicit target configurations** — configure baseline and candidate targets with different skill sets +3. **Verify baseline purity** — run a smoke test to confirm the baseline agent doesn't reference your skill + +### Packaging guidance + +When distributing skills, exclude evaluation files from the distributable package: + +``` +my-skill/ + SKILL.md # ✅ distribute + evals/ # ❌ exclude from distribution + evals.json + eval.yaml + results/ +``` + +Evals are development-time artifacts. End users don't need them, and including them adds unnecessary weight to the package. + +### Progressive disclosure for skill authoring + +Start simple and add complexity only when the evaluation results demand it: + +1. **Start with `evals.json`** — 5-10 test cases, natural-language assertions +2. **Add deterministic checks** — when you find assertions that can be exact (`contains`, `regex`) +3. **Graduate to EVAL.yaml** — when you need workspace isolation or code judges +4. **Add tool trajectory checks** — when tool usage patterns matter +5. **Use rubrics** — when you need weighted, structured scoring criteria + +## Automated Iteration + +For users who want the full automated improvement cycle, the `agentv-optimizer` skill runs a 5-phase optimization loop: + +1. **Analyze** — examines the current skill and evaluation results +2. **Hypothesize** — generates improvement hypotheses from failure patterns +3. **Implement** — applies targeted skill modifications +4. **Evaluate** — re-runs the evaluation suite +5. **Decide** — keeps improvements that help, reverts those that don't + +The optimizer uses the same core loop described in this guide but automates the human steps. Start with the manual loop to build intuition, then graduate to the optimizer when you're comfortable with the evaluation workflow. 
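The keep-or-revert decision in phase 5 applies the same criteria this guide uses when comparing runs: no regressions, and a net positive change. A minimal sketch of such a decision rule in Python, assuming each run reduces to per-test pass/fail results — an illustration of the idea, not the optimizer's actual implementation:

```python
# Illustrative decision rule: keep a candidate only if it fixes
# something and breaks nothing. Not agentv-optimizer's real logic;
# a sketch of the keep/revert idea from phase 5.

def keep_candidate(baseline_passed, candidate_passed):
    """Each argument maps test_id -> bool (did the test pass?)."""
    regressions = [t for t, ok in baseline_passed.items()
                   if ok and not candidate_passed.get(t, False)]
    improvements = [t for t, ok in baseline_passed.items()
                    if not ok and candidate_passed.get(t, False)]
    # Any regression disqualifies; otherwise require a net improvement.
    return not regressions and len(improvements) > 0

print(keep_candidate({"1": False, "2": True}, {"1": True, "2": True}))   # True
print(keep_candidate({"1": False, "2": True}, {"1": True, "2": False}))  # False
```

A stricter or looser rule (e.g., weighing mean score shift) is equally valid — the important part is deciding the acceptance criteria before running the iteration, so the loop can't rationalize a regression after the fact.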
diff --git a/plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md b/plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md index 17f32366f..4dce26028 100644 --- a/plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md +++ b/plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md @@ -576,6 +576,10 @@ agentv create assertion # → .agentv/assertions/.ts agentv create eval # → evals/.eval.yaml + .cases.jsonl ``` +## Skill Improvement Workflow + +For a complete guide to iterating on skills using evaluations — writing scenarios, running baselines, comparing results, and improving — see the [Skill Improvement Workflow](https://agentv.dev/guides/skill-improvement-workflow/) guide. + ## Schemas - Eval file: `references/eval-schema.json`