From 0df6fb066ead9f5c9e3c7d21200f8a1d6476169c Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Sat, 14 Mar 2026 04:10:46 +0000 Subject: [PATCH] docs: separate execution quality from trigger quality in eval guidance (#566) Add evaluation-types guide explaining the distinction between execution quality (what AgentV measures) and trigger quality (skill-creator's domain). Update agentv-eval-builder reference card to cross-reference. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../content/docs/guides/evaluation-types.mdx | 100 ++++++++++++++++++ .../skills/agentv-eval-builder/SKILL.md | 14 +++ 2 files changed, 114 insertions(+) create mode 100644 apps/web/src/content/docs/guides/evaluation-types.mdx diff --git a/apps/web/src/content/docs/guides/evaluation-types.mdx b/apps/web/src/content/docs/guides/evaluation-types.mdx new file mode 100644 index 000000000..388463223 --- /dev/null +++ b/apps/web/src/content/docs/guides/evaluation-types.mdx @@ -0,0 +1,100 @@ +--- +title: Execution Quality vs Trigger Quality +description: Two distinct evaluation concerns for AI agents and skills — what AgentV measures, and what belongs to skill-creator tooling. +sidebar: + order: 6 +--- + +Agent evaluation has two fundamentally different concerns: **execution quality** and **trigger quality**. They require different tooling, different methodologies, and different optimization surfaces. Conflating them leads to eval configs that are noisy, hard to maintain, and unreliable. + +## What is execution quality? + +> **"Does the skill help when loaded?"** + +Execution quality evaluates output quality, correctness, and completeness once an agent or skill is invoked. Given a specific input, does the agent produce the right output? + +This is what AgentV's eval tooling measures. When you write an `EVAL.yaml`, define assertions in `evals.json`, or run `agentv eval run`, you are evaluating execution quality. 
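
For orientation, an execution-quality test case pairs a concrete input with assertions on the resulting output. A minimal sketch (field names are illustrative only, not the confirmed `EVAL.yaml` schema):

```yaml
# Hypothetical execution-quality test case.
# Field names are illustrative, not the confirmed EVAL.yaml schema.
cases:
  - input: "Review this diff for correctness issues"
    assertions:
      - type: contains        # assumes a contains-style evaluator
        value: "suggested fix"
```

Note what the case takes for granted: the code-review skill is already loaded. It tests the output, not whether the system routed to the right skill.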
+ +**Examples:** +- Does the code-review skill produce accurate, actionable feedback? +- Does the refactoring agent preserve behavior while improving structure? +- Does the documentation skill generate correct, complete docs? + +**Characteristics:** +- **Deterministic-ish** — the same input produces similar output across runs +- **Testable with fixed assertions** — you can write specific pass/fail criteria +- **Bounded scope** — one skill, one input, one expected behavior + +## What is trigger quality? + +> **"Does the system load the skill when it should?"** + +Trigger quality evaluates whether the right skill is activated for the right prompts. When a user says "review this PR," does the system route to the code-review skill? When they say "explain this function," does it route to the documentation skill instead? + +**Examples:** +- Does the code-review skill trigger on "review this diff" but not on "write a test"? +- Does the skill description accurately capture when the skill should activate? +- Are there prompt phrasings that should trigger the skill but don't? + +**Characteristics:** +- **Noisy** — model routing varies across runs, even with identical prompts +- **Requires statistical sampling** — repeated trials, not single-shot assertions +- **Different optimization surface** — you're tuning descriptions and metadata, not agent logic + +## Why they are different problems + +| Dimension | Execution quality | Trigger quality | +|-----------|------------------|-----------------| +| **Question** | "Does it help?" | "Does it activate?" 
| +| **Signal type** | Deterministic-ish | Noisy / statistical | +| **Test method** | Fixed assertions, rubrics, judges | Repeated trials, train/test splits | +| **What you tune** | Agent logic, prompts, tool use | Skill descriptions, trigger metadata | +| **Failure mode** | Wrong output | Wrong routing | +| **Optimization** | Pass/fail per test case | Accuracy rate over a sample | + +Mixing these concerns in a single eval config creates problems: +- Execution evals become flaky because trigger noise pollutes results +- Trigger evals are too coarse because they inherit execution assertions +- Debugging failures becomes ambiguous — is the skill wrong, or was the wrong skill loaded? + +## What AgentV evaluates + +AgentV's eval tooling is designed for **execution quality**: + +- **`EVAL.yaml`** — define test cases with inputs, expected outputs, and assertions +- **`evals.json`** — lightweight skill evaluation format (prompt/expected-output pairs) +- **`agentv eval run`** — execute evaluations and collect results +- **Evaluators** — `llm-judge`, `code-judge`, `tool-trajectory`, `rubrics`, `contains`, `regex`, and others all measure execution behavior + +These tools assume the skill is already loaded and invoked. They measure what happens *after* routing, not the routing decision itself. + +## What about trigger quality? + +Trigger quality evaluation is a distinct discipline with its own tooling requirements: + +- **Repeated trials** — run the same prompt many times to measure trigger rates +- **Train/test splits** — separate prompts used for tuning from prompts used for validation +- **Description optimization** — iteratively improve skill descriptions based on trigger accuracy +- **Held-out model selection** — evaluate across different routing models + +Anthropic's skill-creator tooling demonstrates this approach with repeated trigger trials, train/test splits, and dedicated description-improvement workflows. 
This is a statistical optimization problem, not a pass/fail testing problem. + +For now, trigger quality optimization belongs in **skill-creator's domain** — it requires specialized tooling that is architecturally separate from execution evaluation. + +## Practical guidance + +**Do not use execution eval configs for trigger evaluation.** Specifically: + +- Do not add "does this skill trigger?" test cases to your `EVAL.yaml` +- Do not use `agentv eval run` to measure trigger rates +- Do not conflate routing failures with execution failures in eval results + +**If you need to test trigger quality:** +- Use skill-creator's trigger evaluation tooling +- Design trigger tests as statistical experiments (sample sizes, confidence intervals) +- Keep trigger evaluation in a separate workflow from execution evaluation + +**Keep your eval configs focused:** +- `EVAL.yaml` and `evals.json` → execution quality only +- Assertions should test output correctness, not routing behavior +- If an eval is flaky, check whether you've accidentally mixed trigger concerns into execution tests diff --git a/plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md b/plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md index 17f32366f..75840b45d 100644 --- a/plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md +++ b/plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md @@ -7,6 +7,12 @@ description: Create and maintain AgentV evaluation files for testing AI agent pe Comprehensive docs: https://agentv.dev +## Evaluation Types + +AgentV evaluations measure **execution quality** — whether your agent or skill produces correct output when invoked. + +For **trigger quality** (whether the right skill is triggered for the right prompts), see the [Evaluation Types guide](https://agentv.dev/guides/evaluation-types/). Do not use execution eval configs (`EVAL.yaml`, `evals.json`) for trigger evaluation — these are distinct concerns requiring different tooling and methodologies. 
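
"Different methodology" here is statistical: a trigger is a noisy binary outcome, so a trigger rate only means something with a sample size and confidence interval attached. A hedged sketch of that idea in plain Python (not part of AgentV or skill-creator tooling):

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a noisy binary outcome such as skill triggering."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return (center - margin, center + margin)

# Example: the skill triggered on 42 of 50 repeated trials of one prompt.
low, high = wilson_interval(42, 50)
print(f"trigger rate 84%, 95% CI [{low:.0%}, {high:.0%}]")  # → [71%, 92%]
```

The point estimate is 84%, but the interval spans roughly 71% to 92%, which is why a single-shot pass/fail assertion in an execution eval config is the wrong tool for routing behavior.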
+
## Starting from evals.json?

If the project already has an Agent Skills `evals.json` file, use it as a starting point instead of writing YAML from scratch:
@@ -576,6 +582,10 @@
agentv create assertion # → .agentv/assertions/.ts
agentv create eval # → evals/.eval.yaml + .cases.jsonl
```
+## Skill Improvement Workflow
+
+For a complete guide to iterating on skills using evaluations — writing scenarios, running baselines, comparing results, and improving — see the [Skill Improvement Workflow](https://agentv.dev/guides/skill-improvement-workflow/) guide.
+
## Schemas

- Eval file: `references/eval-schema.json`