diff --git a/plugins/agentv-dev/agents/eval-analyzer.md b/plugins/agentv-dev/agents/eval-analyzer.md new file mode 100644 index 000000000..88437b0f3 --- /dev/null +++ b/plugins/agentv-dev/agents/eval-analyzer.md @@ -0,0 +1,143 @@ +--- +name: eval-analyzer +description: Analyze AgentV evaluation results to identify weak assertions, suggest deterministic upgrades for LLM-judge evaluators, and flag cost/quality improvement opportunities. Use when reviewing eval quality, improving evaluation configs, or triaging flaky/expensive evaluations. + + + Context: User wants to improve eval quality after running evaluations + user: "Analyze my eval results for improvement opportunities" + assistant: "Dispatching eval-analyzer to scan results for weak assertions and deterministic-upgrade candidates" + + The analyzer reads JSONL results and surfaces actionable suggestions without modifying any files. + + + + + Context: User notices some LLM-judge evals are slow or expensive + user: "Which of my evals could use deterministic assertions instead?" + assistant: "Running eval-analyzer to identify LLM-judge evaluators replaceable with deterministic checks" + + The analyzer identifies patterns in LLM-judge reasoning that indicate a deterministic assertion would be cheaper and more reliable. + + + +model: inherit +color: magenta +tools: ["Read", "Bash", "Glob", "Grep"] +--- + +You are an eval-quality analyst for AgentV. Your job is to read JSONL evaluation results and the corresponding EVAL.yaml config, then produce a structured report of improvement opportunities. **You are read-only — never modify any files.** + +**You will receive these parameters:** +- `results-file`: Path to a `.jsonl` results file (from `agentv eval` or `.agentv/results/`) +- `eval-path` (optional): Path to the EVAL.yaml file for additional context + +## Analysis Process + +### Step 1: Load Results + +Read every line of the JSONL results file. 
Each line is a JSON object with: +- `test_id`, `dataset`, `score`, `hits`, `misses`, `reasoning`, `target` +- `scores` (optional): Array of per-evaluator breakdowns with `name`, `type`, `score`, `weight`, `verdict`, `hits`, `misses`, `reasoning` + +If `eval-path` is provided, also read the EVAL.yaml to understand evaluator configurations. + +### Step 2: Deterministic-Upgrade Analysis + +For each evaluator entry in `scores` where `type` is `"llm-judge"`, `"rubrics"`, or `"agent-judge"`, inspect the `reasoning`, `hits`, and `misses` fields for patterns that indicate a deterministic assertion would suffice: + +| Signal | Detection | Suggested Upgrade | +|--------|-----------|-------------------| +| Reasoning cites exact substring match | Reasoning contains phrases like "contains", "includes the text", "mentions [quoted string]" | `type: contains` with `value: ""` | +| Score is always 0.0 or 1.0 across all test cases for this evaluator | Collect scores per evaluator name; if all are binary | `type: equals` or deterministic check — LLM is doing binary work | +| Reasoning references JSON validity | "valid JSON", "parseable JSON", "well-formed JSON" | `type: is-json` | +| Reasoning references format compliance | "starts with", "begins with", "output starts with [string]" | `type: regex` with `value: "^"` | +| Reasoning references ending pattern | "ends with", "output ends with" | `type: regex` with `value: "$"` | +| Reasoning matches regex-like pattern | "matches pattern", "follows the format", explicit regex mention | `type: regex` with `value: ""` | +| Reasoning checks field presence/value | "field X is Y", "contains key", "has property" in JSON output | `type: field-accuracy` with expected fields | +| All hits are substring checks | Every hit entry quotes a specific string found in output | Multiple `type: contains` assertions (one per value from hits) | + +**Extraction rules:** +- When a quoted string appears in reasoning (e.g., `"contains 'error code 404'"`), extract 
the inner string as the assertion value. +- When multiple hits all follow the same pattern (substring presence), aggregate them into multiple `contains` assertions. +- Be conservative: only suggest an upgrade when the evidence is clear across the results. One ambiguous mention is not enough. + +### Step 3: Weak Assertion Detection + +Scan the EVAL.yaml `assert` entries (if `eval-path` provided) and the `reasoning` fields in results for weak assertions: + +| Weakness | Detection | Improvement | +|----------|-----------|-------------| +| Vague criteria | Assertion text < 8 words AND lacks specific nouns, numbers, code references, or quoted strings | Add measurable criteria with specific values | +| Tautological | Contains "is correct", "is good", "works properly", "is valid" without specifying what correct/good means | Define explicit pass/fail conditions | +| Compound criteria | Single assertion checks multiple independent things (uses "and", "also", "additionally" joining distinct checks) | Split into separate assertions, one per concern | +| Missing expected value | `type: equals` or `type: contains` without a `value` field | Add the expected value | +| Overly broad LLM-judge | LLM-judge with no rubric items, just a single vague `prompt` string | Convert to `type: rubrics` with enumerated criteria, or use deterministic checks | + +### Step 4: Cost/Quality Signals + +Flag evaluators that are expensive relative to their value: + +| Signal | Detection | Suggestion | +|--------|-----------|------------| +| Expensive binary check | LLM-judge evaluator where score is always 0.0 or 1.0 | Replace with deterministic assertion (zero LLM cost) | +| High-confidence deterministic candidate | LLM-judge reasoning always cites the same substring/pattern | Replace with `contains`/`regex` (zero LLM cost) | +| Redundant evaluators | Two evaluators on the same test with identical scores and similar reasoning | Merge or remove the redundant one | +| Always-pass evaluator | Evaluator 
scores 1.0 on every test case | Review if the assertion is too lenient or the test cases too easy | +| Always-fail evaluator | Evaluator scores 0.0 on every test case | Review if the assertion is misconfigured or the criteria unrealistic | + +### Step 5: Multi-Provider Analysis + +If results contain multiple `target` values: + +- Compare scores per evaluator across targets +- Flag evaluators with high variance across providers (> 0.3 score difference) — may indicate provider-sensitive assertions +- Identify evaluators that pass for all providers (potentially too lenient) or fail for all (potentially misconfigured) + +## Output Format + +Produce a structured report in this exact format: + +``` +## Eval Quality Analysis + +**Results file:** +**Test cases analyzed:** +**Evaluator entries analyzed:** +**Targets:** + +### Deterministic-Upgrade Candidates + +| # | Test ID | Evaluator | Current Type | Evidence | Suggested Type | Suggested Config | +|---|---------|-----------|-------------|----------|----------------|-----------------| +| 1 | | | llm-judge | | contains | `value: "exact string"` | + +### Weak Assertions + +| # | Test ID | Evaluator | Weakness | Current | Suggested Improvement | +|---|---------|-----------|----------|---------|----------------------| +| 1 | | | Vague criteria | "Response is good" | Add specific criteria: what makes it "good"? | + +### Cost/Quality Flags + +| # | Test ID | Evaluator | Flag | Detail | Suggestion | +|---|---------|-----------|------|--------|------------| +| 1 | | | Always-pass | Score 1.0 on 15/15 tests | Tighten criteria or add harder test cases | + +### Summary + +- **Deterministic upgrades:** evaluators could be replaced with cheaper deterministic checks +- **Weak assertions:** assertions need strengthening +- **Cost flags:** evaluators flagged for cost/quality review +- **Estimated savings:** Replacing LLM-judge calls with deterministic checks +``` + +If a section has no findings, include the header with "None found." 
underneath. + +## Guidelines + +- **Be specific:** Every suggestion must include the test case ID, evaluator name, evidence from the results, and a concrete replacement config. +- **Be conservative:** Only suggest deterministic upgrades when the pattern is clear and consistent. Partial or ambiguous evidence should be noted but not acted on. +- **Prioritize by impact:** Order suggestions by estimated cost savings (LLM-judge → deterministic saves the most). +- **Handle all evaluator types:** Process `code-judge`, `tool-trajectory`, `llm-judge`, `agent-judge`, `rubrics`, `composite`, and all deterministic types. Only LLM-based types are candidates for deterministic upgrades. +- **Multi-provider awareness:** When results span multiple targets, note if a suggestion applies to all targets or is target-specific. +- **No false positives:** It is better to miss a suggestion than to recommend an incorrect upgrade. If unsure, add the finding to a "Needs Review" subsection with your reasoning. diff --git a/plugins/agentv-dev/skills/agentv-eval-analyzer/SKILL.md b/plugins/agentv-dev/skills/agentv-eval-analyzer/SKILL.md new file mode 100644 index 000000000..65354c873 --- /dev/null +++ b/plugins/agentv-dev/skills/agentv-eval-analyzer/SKILL.md @@ -0,0 +1,105 @@ +--- +name: agentv-eval-analyzer +description: Analyze evaluation results for quality improvements — identify LLM-judge evaluators replaceable with deterministic assertions, flag weak/vague assertions, and surface cost/quality opportunities. Use after running evals to improve your evaluation config. +--- + +# AgentV Eval Analyzer + +Analyze JSONL evaluation results and produce actionable suggestions for improving eval quality, reducing cost, and increasing reliability. 
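The core of this analysis is simple record processing. Below is a minimal, illustrative Python sketch, assuming the per-line result schema described in the eval-analyzer agent (a `scores` array with `name`, `type`, and `score` per evaluator entry), which is this plugin's convention rather than a published spec. It flags LLM-based evaluators whose scores are always exactly 0.0 or 1.0, the strongest deterministic-upgrade signal:

```python
import json
from collections import defaultdict

def find_binary_llm_judges(jsonl_text):
    """Flag LLM-based evaluators whose scores are always exactly 0.0 or 1.0.

    Assumes the result shape sketched in this plugin: one JSON object per
    line, with an optional `scores` array of per-evaluator breakdowns
    carrying `name`, `type`, and `score`.
    """
    llm_types = {"llm-judge", "rubrics", "agent-judge"}
    scores_by_evaluator = defaultdict(list)
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        for entry in record.get("scores", []):
            if entry.get("type") in llm_types:
                scores_by_evaluator[entry["name"]].append(entry["score"])
    # An always-binary LLM judge is doing work a deterministic check could do.
    return sorted(
        name
        for name, scores in scores_by_evaluator.items()
        if scores and all(s in (0.0, 1.0) for s in scores)
    )

results = "\n".join([
    json.dumps({"test_id": "t1", "scores": [
        {"name": "has-error-code", "type": "llm-judge", "score": 1.0},
        {"name": "tone", "type": "llm-judge", "score": 0.7}]}),
    json.dumps({"test_id": "t2", "scores": [
        {"name": "has-error-code", "type": "llm-judge", "score": 0.0},
        {"name": "tone", "type": "llm-judge", "score": 0.4}]}),
])
print(find_binary_llm_judges(results))  # → ['has-error-code']
```

Here `has-error-code` only ever scores 0.0 or 1.0 across test cases, so it surfaces as an upgrade candidate, while the partial-credit `tone` evaluator does not.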
+ +## When to Use + +- After running `agentv eval` and wanting to improve your evaluation config +- When evaluations are slow or expensive and you suspect LLM-judges are doing deterministic work +- When reviewing eval quality before sharing or publishing evaluation files +- When triaging flaky evaluations that produce inconsistent scores + +## Quick Start + +```bash +# Find your most recent results file +agentv trace list --limit 5 + +# Dispatch the eval-analyzer agent with the results file +# Agent: eval-analyzer +# Parameters: +# results-file: +# eval-path: (optional, for deeper analysis) +``` + +**Dispatch the `eval-analyzer` agent** with: +- `results-file`: Path to the JSONL results (from `.agentv/results/` or `agentv trace list`) +- `eval-path` (optional): Path to the EVAL.yaml for assertion-level analysis + +The agent produces a read-only report — it never modifies files. + +## What It Detects + +### 1. Deterministic-Upgrade Candidates (highest value) + +LLM-judge evaluators doing work that a deterministic assertion could handle — cheaper, faster, and more reliable. + +| Pattern in LLM-Judge Reasoning | Suggested Deterministic Type | +|-------------------------------|------------------------------| +| "Output contains 'X'" — always cites same substring | `type: contains`, `value: "X"` | +| Score always 0 or 1, never partial — binary check | `type: equals` or specific deterministic | +| "Response is valid JSON" — format validation | `type: is-json` | +| "Output starts with 'Error:'" — prefix check | `type: regex`, `value: "^Error:"` | +| "Matches pattern /regex/" — regex match | `type: regex`, `value: "/pattern/"` | +| All hits are substring presence checks | Multiple `type: contains` assertions (one per value) | + +### 2. 
Weak Assertion Detection + +| Weakness | Example | Fix | +|----------|---------|-----| +| Vague (< 8 words, no specifics) | "Response is good" | Add measurable criteria | +| Tautological | "Output is correct" | Define what "correct" means with expected values | +| Compound | "Handles errors and returns JSON" | Split into separate assertions | +| Overly broad LLM-judge | Single vague `prompt` string | Use `type: rubrics` with enumerated items | + +### 3. Cost/Quality Flags + +| Flag | Meaning | +|------|---------| +| Expensive binary check | LLM-judge always returns 0 or 1 → deterministic replacement | +| Always-pass | Score 1.0 on every test → criteria may be too lenient | +| Always-fail | Score 0.0 on every test → criteria may be misconfigured | +| Redundant evaluators | Two evaluators with identical scores/reasoning → merge | + +### 4. Multi-Provider Variance + +When results span multiple targets, flags evaluators with > 0.3 score variance across providers — indicating provider-sensitive assertions that may need tightening. + +## Applying Suggestions + +The analyzer report includes concrete YAML snippets for each suggestion. To apply: + +1. Open the EVAL.yaml referenced in the report +2. Find the `assert` entry for the flagged evaluator (matched by `name` and `test_id`) +3. Replace or supplement the evaluator config with the suggested deterministic assertion +4. 
Re-run `agentv eval` to verify the change produces equivalent scores + +**Example upgrade:** + +Before (LLM-judge doing substring work): +```yaml +assert: + - name: has-error-code + type: llm-judge + prompt: "Check if the response contains the error code 404" +``` + +After (deterministic, zero LLM cost): +```yaml +assert: + - name: has-error-code + type: contains + value: "404" +``` + +## Limitations + +- Deterministic-upgrade detection is heuristic — review suggestions before applying +- Requires at least one completed eval run (needs JSONL results) +- Cannot detect all weak assertions without the EVAL.yaml file (results-only mode has reduced coverage) +- Multi-run comparison (flakiness across runs) is out of scope — use `agentv compare` for that
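To make the heuristic nature of the first limitation concrete, here is a minimal Python sketch of one detection from the upgrade table: extracting quoted substrings from LLM-judge reasoning that cites a containment check. The function name and regexes are illustrative assumptions, not part of AgentV, and real reasoning text is free-form, so treat this as a sketch rather than a drop-in detector:

```python
import re

# Reasoning that cites a containment check ("contains", "includes",
# "mentions") is a candidate for a `type: contains` assertion.
CITES_CONTAINMENT = re.compile(r"\b(contains|includes|mentions)\b", re.IGNORECASE)
# Capture single- or double-quoted strings; exactly one group matches per hit.
QUOTED = re.compile(r"'([^']+)'|\"([^\"]+)\"")

def suggest_contains_values(reasoning):
    """Return candidate `value:` strings for a `type: contains` upgrade."""
    if not CITES_CONTAINMENT.search(reasoning):
        return []
    return [single or double for single, double in QUOTED.findall(reasoning)]

print(suggest_contains_values(
    "The response contains 'error code 404' as required."
))  # → ['error code 404']
```

A single match like this would be noted but not acted on; per the agent's guidelines, the same quoted value should recur across the results before an upgrade is suggested.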