New file: apps/web/src/content/docs/guides/skill-improvement-workflow.mdx (+309 lines)

---
title: Skill Improvement Workflow
description: Iteratively evaluate and improve agent skills using AgentV
sidebar:
  order: 6
---

## Introduction

AgentV supports a full evaluation-driven improvement loop for skills and agents. Instead of guessing whether a change makes things better, you run structured evaluations before and after, then compare.

This guide teaches the **core manual loop**. For automated iteration that runs the full cycle hands-free, see [agentv-optimizer](#automated-iteration).

## The Core Loop

Every skill improvement follows the same cycle:

```
┌─────────────────┐
│ Write Scenarios │
└────────┬────────┘
         ▼
┌─────────────────┐
│  Run Baseline   │◄──────────────────┐
└────────┬────────┘                   │
         ▼                            │
┌─────────────────┐                   │
│  Run Candidate  │                   │
└────────┬────────┘                   │
         ▼                            │
┌─────────────────┐                   │
│     Compare     │                   │
└────────┬────────┘                   │
         ▼                            │
┌─────────────────┐                   │
│ Review Failures │                   │
└────────┬────────┘                   │
         ▼                            │
┌─────────────────┐                   │
│  Improve Skill  │────── Re-run ─────┘
└─────────────────┘
```

1. **Write test scenarios** that capture what the skill should do
2. **Run a baseline** evaluation without the skill (or with the previous version)
3. **Run a candidate** evaluation with the new or updated skill
4. **Compare** the two runs to see what improved and what regressed
5. **Review failures** to understand why specific cases failed
6. **Improve** the skill based on failure analysis
7. **Re-run** and iterate until the candidate consistently beats the baseline

## Step 1: Write Test Scenarios

Start with `evals.json` for quick iteration. It's the simplest format and works directly with AgentV — no conversion needed.

```json
{
  "skill_name": "code-reviewer",
  "evals": [
    {
      "id": 1,
      "prompt": "Review this Python function for bugs:\n\ndef divide(a, b):\n    return a / b",
      "expected_output": "The function should handle division by zero.",
      "assertions": [
        "Identifies the division by zero risk",
        "Suggests adding error handling"
      ]
    },
    {
      "id": 2,
      "prompt": "Review this function:\n\ndef greet(name):\n    return f'Hello, {name}!'",
      "expected_output": "The function is simple and correct.",
      "assertions": [
        "Does not flag false issues",
        "Acknowledges the function is straightforward"
      ]
    }
  ]
}
```

For assisted authoring, use the `agentv-eval-builder` skill — it knows the full eval file schema and can generate test cases from descriptions.

:::tip
Start with 5–10 focused test cases. You can always add more as you discover edge cases during the review step.
:::

## Step 2: Run Baseline Evaluation

Run the evaluation **without** the skill loaded to establish a baseline:

```bash
agentv eval evals.json --target baseline
```

This produces a results file (e.g., `results-baseline.jsonl`) showing how the agent performs on its own.
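If you want to sanity-check a results file from a script, here is a minimal sketch. It assumes each JSONL line is an object with an `id` and a numeric `score` between 0 and 1; this schema is an assumption, so adapt the field names to whatever your AgentV version actually writes:

```python
import json

def pass_rate(path, threshold=0.5):
    """Summarize a results JSONL file.

    Assumes each line has an "id" and a numeric "score" field
    (hypothetical schema; adjust to your actual results format).
    """
    scores = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            scores.append(record["score"])
    # A test "passes" if its score clears the threshold
    passed = sum(1 for s in scores if s >= threshold)
    return passed / len(scores) if scores else 0.0
```

For example, `pass_rate("results-baseline.jsonl")` gives you a single number to track across iterations.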

### Baseline isolation

Skills in `.claude/skills/` are auto-loaded by progressive disclosure. This means your baseline may accidentally include the skill you're testing.

**Workaround:** Develop skills outside discovery paths during the evaluation cycle. Keep your skill-in-progress in a working directory (e.g., `drafts/`) and only move it to `.claude/skills/` when you're satisfied with the evaluation results.

```bash
# Skill lives outside the discovery path during development
drafts/
  my-skill/
    SKILL.md

# Baseline run won't pick it up
agentv eval evals.json --target baseline
```

## Step 3: Run Candidate Evaluation

Run the same evaluation **with** the skill loaded:

```bash
agentv eval evals.json --target candidate
```

Or if using agent mode (no API keys required):

```bash
# Get the input prompt for a test case
agentv prompt eval input evals.json --test-id 1

# After running the agent, judge the response
agentv prompt eval judge evals.json --test-id 1 --answer-file answer.txt
```

Agent mode is useful when you want to evaluate skills with agents that don't have a direct API integration — you orchestrate the agent yourself and let AgentV handle the judging.

## Step 4: Compare Results

Compare the baseline and candidate runs:

```bash
agentv compare results-baseline.jsonl results-candidate.jsonl
```

The comparison output shows:

- **Per-test score deltas** — which cases improved, regressed, or stayed the same
- **Aggregate statistics** — overall pass rate change, mean score shift
- **Regressions** — cases that were passing before but now fail (these need immediate attention)

Look for:
- ✅ **Net positive delta** — more cases improved than regressed
- ⚠️ **Any regressions** — even one regression deserves investigation
- 📊 **Score distribution** — are improvements concentrated or spread across cases?
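`agentv compare` reports all of this for you, but if you want to script the same check (for a CI gate, say), a sketch follows. It assumes each results line is a JSON object with an `id` and a numeric `score`; that schema is an assumption, not the documented format:

```python
import json

def load_scores(path):
    """Map test id -> score from a results JSONL file
    (assumes a hypothetical {"id": ..., "score": ...} schema)."""
    with open(path, encoding="utf-8") as f:
        return {r["id"]: r["score"]
                for r in (json.loads(line) for line in f if line.strip())}

def deltas(baseline_path, candidate_path):
    """Classify each shared test as improved, regressed, or unchanged."""
    base = load_scores(baseline_path)
    cand = load_scores(candidate_path)
    result = {"improved": [], "regressed": [], "unchanged": []}
    for test_id in base.keys() & cand.keys():
        diff = cand[test_id] - base[test_id]
        if diff > 0:
            result["improved"].append(test_id)
        elif diff < 0:
            result["regressed"].append(test_id)
        else:
            result["unchanged"].append(test_id)
    return result
```

A CI gate could then fail the build whenever `result["regressed"]` is non-empty.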

## Step 5: Review Failures

Use trace inspection to understand why specific cases failed:

```bash
agentv trace show <trace-id>
```

When reviewing failures, categorize them:

| Category | Description | Action |
|----------|-------------|--------|
| **True failure** | The skill genuinely handled the case wrong | Improve the skill |
| **False positive** | Got a passing score but the answer was wrong | Tighten assertions |
| **False negative** | Correct answer but scored as failing | Fix the evaluation criteria |
| **Systematic pattern** | Multiple failures share the same root cause | Address the pattern, not individual cases |

Systematic patterns are the highest-value findings. A single skill improvement that fixes a pattern can resolve multiple test failures at once.
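You can surface these patterns mechanically by grouping failing cases on a shared label. The sketch below assumes failing records carry a `failure_reason` string; that field is hypothetical, so substitute whatever your results schema actually records:

```python
import json
from collections import Counter

def failure_patterns(path, threshold=0.5):
    """Count recurring failure reasons in a results JSONL file.

    Assumes failing records carry a "failure_reason" string
    (hypothetical field; adapt to your actual results schema).
    """
    reasons = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)
            if record.get("score", 0) < threshold:
                reasons[record.get("failure_reason", "unknown")] += 1
    # Most common reasons first: the likely systematic patterns
    return reasons.most_common()
```

The top entries of the returned list are your best candidates for a single fix that resolves several failures.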

## Step 6: Improve the Skill

Apply targeted improvements based on your failure analysis:

- **Keep changes small and testable.** One improvement per iteration makes it easy to attribute score changes.
- **Document what changed and why.** A brief note in your commit message helps when reviewing the improvement history.
- **Address systematic patterns first.** These give the best return on effort.

```markdown
<!-- Example: skill change log in a commit message -->
fix(code-reviewer): handle edge case for single-line functions

The skill was flagging all single-line functions as "too terse" even when
they were appropriate (e.g., simple getters). Added context-aware length
assessment.

Failure pattern: tests 2, 5, 8 all failed with false-positive complexity warnings.
```

## Step 7: Re-run and Iterate

Loop back to Step 3 with the improved skill:

```bash
# Run the improved candidate
agentv eval evals.json --target candidate

# Compare against the previous baseline
agentv compare results-baseline.jsonl results-candidate.jsonl
```

Each iteration should show:
- Previous regressions resolved
- No new regressions introduced
- Steady improvement in overall pass rate

:::note
Keep your baseline stable across iterations. Only re-run the baseline when the test scenarios themselves change (Step 1), not when the skill changes.
:::

## Graduating to EVAL.yaml

When `evals.json` becomes limiting — you need workspace isolation, code judges, tool trajectory checks, or multi-turn conversations — graduate to EVAL.yaml:

```bash
agentv convert evals.json -o eval.yaml
```

The generated YAML preserves all your existing test cases and adds comments showing AgentV features you can use:

```yaml
# Converted from Agent Skills evals.json
tests:
  - id: "1"
    criteria: |-
      The function should handle division by zero.
    input:
      - role: user
        content: "Review this Python function for bugs:..."
    assert:
      - name: assertion-1
        type: llm-judge
        prompt: "Identifies the division by zero risk"
        # Replace with type: contains for deterministic checks:
        # - type: contains
        #   value: "ZeroDivisionError"
```

After converting, you can:
- Replace `llm-judge` assertions with faster deterministic evaluators (`contains`, `regex`, `equals`)
- Add `workspace` configuration for file-system isolation
- Use `code-judge` for custom scoring logic
- Define `tool-trajectory` assertions to check tool usage patterns

See [Skill Evals (evals.json)](/guides/agent-skills-evals/) for the full field mapping and side-by-side comparison.

## Migration from Skill-Creator

If you've been using the Agent Skills skill-creator workflow, AgentV reads your existing files directly — no rewrite needed.

| Skill-Creator | AgentV | Notes |
|--------------|--------|-------|
| `evals.json` | `agentv eval evals.json` | Direct — no conversion needed |
| `claude -p "prompt"` | `agentv eval evals.json --target claude` | Same eval, richer engine |
| `grading.json` (read) | `grading.json` (write) | Same schema, AgentV produces it |
| `benchmark.json` (read) | `benchmark.json` (write) | Same schema, AgentV produces it |
| with-skill vs without-skill | `--target baseline --target candidate` | Structured comparison |
| Graduate to richer evals | `agentv convert evals.json` → EVAL.yaml | Adds workspace, code judges, etc. |

**Key takeaway:** You do not need to rewrite your `evals.json`. AgentV reads it directly and adds a richer evaluation engine on top.

## Baseline Comparison Best Practices

### Discovery-path contamination

Skills placed in `.claude/skills/` are auto-discovered and loaded into every agent session. This means your baseline run may unknowingly include the skill you're trying to evaluate.

**Mitigation strategies:**
1. **Develop outside discovery paths** — keep skills in `drafts/` or `wip/` during evaluation
2. **Use explicit target configurations** — configure baseline and candidate targets with different skill sets
3. **Verify baseline purity** — run a smoke test to confirm the baseline agent doesn't reference your skill
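For the third strategy, one cheap smoke test is to scan the baseline transcript for the skill's name. A minimal sketch, assuming you have the trace saved as a plain-text file:

```python
def baseline_mentions_skill(trace_path, skill_name):
    """Return True if the baseline trace references the skill,
    which suggests discovery-path contamination."""
    with open(trace_path, encoding="utf-8") as f:
        # Case-insensitive substring check over the whole trace
        return skill_name.lower() in f.read().lower()
```

If this returns `True` for your baseline run, the skill leaked into the session and the baseline is not a clean control.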

### Packaging guidance

When distributing skills, exclude evaluation files from the distributable package:

```
my-skill/
  SKILL.md       # ✅ distribute
  evals/         # ❌ exclude from distribution
    evals.json
    eval.yaml
    results/
```

Evals are development-time artifacts. End users don't need them, and including them adds unnecessary weight to the package.
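One way to enforce the exclusion is to build the distributable programmatically. The sketch below zips a skill directory while skipping eval artifacts; the `EXCLUDE` directory names mirror the layout above but are otherwise illustrative:

```python
import zipfile
from pathlib import Path

# Development-time directories that should not ship
EXCLUDE = {"evals", "results"}

def package_skill(skill_dir, out_zip):
    """Zip a skill directory, skipping any path that passes
    through an excluded directory."""
    skill_dir = Path(skill_dir)
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in skill_dir.rglob("*"):
            rel = path.relative_to(skill_dir)
            if path.is_file() and not (set(rel.parts) & EXCLUDE):
                zf.write(path, rel.as_posix())
```

For example, `package_skill("my-skill", "my-skill.zip")` produces an archive containing `SKILL.md` but none of the eval files.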

### Progressive disclosure for skill authoring

Start simple and add complexity only when the evaluation results demand it:

1. **Start with `evals.json`** — 5–10 test cases, natural-language assertions
2. **Add deterministic checks** — when you find assertions that can be exact (`contains`, `regex`)
3. **Graduate to EVAL.yaml** — when you need workspace isolation or code judges
4. **Add tool trajectory checks** — when tool usage patterns matter
5. **Use rubrics** — when you need weighted, structured scoring criteria

## Automated Iteration

For users who want the full automated improvement cycle, the `agentv-optimizer` skill runs a 5-phase optimization loop:

1. **Analyze** — examines the current skill and evaluation results
2. **Hypothesize** — generates improvement hypotheses from failure patterns
3. **Implement** — applies targeted skill modifications
4. **Evaluate** — re-runs the evaluation suite
5. **Decide** — keeps improvements that help, reverts those that don't

The optimizer uses the same core loop described in this guide but automates the human steps. Start with the manual loop to build intuition, then graduate to the optimizer when you're comfortable with the evaluation workflow.

Modified: plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md (+3 lines)
```
agentv create assertion <name>   # → .agentv/assertions/<name>.ts
agentv create eval <name>        # → evals/<name>.eval.yaml + .cases.jsonl
```

## Skill Improvement Workflow

For a complete guide to iterating on skills using evaluations — writing scenarios, running baselines, comparing results, and improving — see the [Skill Improvement Workflow](https://agentv.dev/guides/skill-improvement-workflow/) guide.

## Human Review Checkpoint

After running evals, perform a human review before iterating. Create `feedback.json` in the results directory alongside `results.jsonl`: