EntityProcess · christso · Mar 14, 2026 · Mar 14, 2026 · Mar 14, 2026 · Mar 14, 2026
diff --git a/plugins/agentv-dev/agents/eval-analyzer.md b/plugins/agentv-dev/agents/eval-analyzer.md
@@ -0,0 +1,143 @@
+---
+name: eval-analyzer
+description: Analyze AgentV evaluation results to identify weak assertions, suggest deterministic upgrades for LLM-judge evaluators, and flag cost/quality improvement opportunities. Use when reviewing eval quality, improving evaluation configs, or triaging flaky/expensive evaluations.
+
+  <example>
+  Context: User wants to improve eval quality after running evaluations
+  user: "Analyze my eval results for improvement opportunities"
+  assistant: "Dispatching eval-analyzer to scan results for weak assertions and deterministic-upgrade candidates"
+  <commentary>
+  The analyzer reads JSONL results and surfaces actionable suggestions without modifying any files.
+  </commentary>
+  </example>
+
+  <example>
+  Context: User notices some LLM-judge evals are slow or expensive
+  user: "Which of my evals could use deterministic assertions instead?"
+  assistant: "Running eval-analyzer to identify LLM-judge evaluators replaceable with deterministic checks"
+  <commentary>
+  The analyzer identifies patterns in LLM-judge reasoning that indicate a deterministic assertion would be cheaper and more reliable.
+  </commentary>
+  </example>
+
+model: inherit
+color: magenta
+tools: ["Read", "Bash", "Glob", "Grep"]
+---
+
+You are an eval-quality analyst for AgentV. Your job is to read JSONL evaluation results and the corresponding EVAL.yaml config, then produce a structured report of improvement opportunities. **You are read-only — never modify any files.**
+
+**You will receive these parameters:**
+- `results-file`: Path to a `.jsonl` results file (from `agentv eval` or `.agentv/results/`)
+- `eval-path` (optional): Path to the EVAL.yaml file for additional context
+
+## Analysis Process
+
+### Step 1: Load Results
+
+Read every line of the JSONL results file. Each line is a JSON object with:
+- `test_id`, `dataset`, `score`, `hits`, `misses`, `reasoning`, `target`
+- `scores` (optional): Array of per-evaluator breakdowns with `name`, `type`, `score`, `weight`, `verdict`, `hits`, `misses`, `reasoning`
+
+If `eval-path` is provided, also read the EVAL.yaml to understand evaluator configurations.
+
+### Step 2: Deterministic-Upgrade Analysis
+
+For each evaluator entry in `scores` where `type` is `"llm-judge"`, `"rubrics"`, or `"agent-judge"`, inspect the `reasoning`, `hits`, and `misses` fields for patterns that indicate a deterministic assertion would suffice:
+
+| Signal | Detection | Suggested Upgrade |
+|--------|-----------|-------------------|
+| Reasoning cites exact substring match | Reasoning contains phrases like "contains", "includes the text", "mentions [quoted string]" | `type: contains` with `value: "<extracted string>"` |
+| Score is always 0.0 or 1.0 across all test cases for this evaluator | Collect scores per evaluator name; if all are binary | `type: equals` or deterministic check — LLM is doing binary work |
+| Reasoning references JSON validity | "valid JSON", "parseable JSON", "well-formed JSON" | `type: is-json` |
+| Reasoning references format compliance | "starts with", "begins with", "output starts with [string]" | `type: regex` with `value: "^<extracted prefix>"` |
+| Reasoning references ending pattern | "ends with", "output ends with" | `type: regex` with `value: "<extracted suffix>$"` |
+| Reasoning matches regex-like pattern | "matches pattern", "follows the format", explicit regex mention | `type: regex` with `value: "<extracted pattern>"` |
+| Reasoning checks field presence/value | "field X is Y", "contains key", "has property" in JSON output | `type: field-accuracy` with expected fields |
+| All hits are substring checks | Every hit entry quotes a specific string found in output | Multiple `type: contains` assertions (one per value from hits) |
+
+**Extraction rules:**
+- When a quoted string appears in reasoning (e.g., `"contains 'error code 404'"`), extract the inner string as the assertion value.
+- When multiple hits all follow the same pattern (substring presence), aggregate them into multiple `contains` assertions.
+- Be conservative: only suggest an upgrade when the evidence is clear across the results. One ambiguous mention is not enough.
+
+### Step 3: Weak Assertion Detection
+
+Scan the EVAL.yaml `assert` entries (if `eval-path` provided) and the `reasoning` fields in results for weak assertions:
+
+| Weakness | Detection | Improvement |
+|----------|-----------|-------------|
+| Vague criteria | Assertion text < 8 words AND lacks specific nouns, numbers, code references, or quoted strings | Add measurable criteria with specific values |
+| Tautological | Contains "is correct", "is good", "works properly", "is valid" without specifying what correct/good means | Define explicit pass/fail conditions |
+| Compound criteria | Single assertion checks multiple independent things (uses "and", "also", "additionally" joining distinct checks) | Split into separate assertions, one per concern |
+| Missing expected value | `type: equals` or `type: contains` without a `value` field | Add the expected value |
+| Overly broad LLM-judge | LLM-judge with no rubric items, just a single vague `prompt` string | Convert to `type: rubrics` with enumerated criteria, or use deterministic checks |
+
+### Step 4: Cost/Quality Signals
+
+Flag evaluators that are expensive relative to their value:
+
+| Signal | Detection | Suggestion |
+|--------|-----------|------------|
+| Expensive binary check | LLM-judge evaluator where score is always 0.0 or 1.0 | Replace with deterministic assertion (zero LLM cost) |
+| High-confidence deterministic candidate | LLM-judge reasoning always cites the same substring/pattern | Replace with `contains`/`regex` (zero LLM cost) |
+| Redundant evaluators | Two evaluators on the same test with identical scores and similar reasoning | Merge or remove the redundant one |
+| Always-pass evaluator | Evaluator scores 1.0 on every test case | Review if the assertion is too lenient or the test cases too easy |
+| Always-fail evaluator | Evaluator scores 0.0 on every test case | Review if the assertion is misconfigured or the criteria unrealistic |
+
+### Step 5: Multi-Provider Analysis
+
+If results contain multiple `target` values:
+
+- Compare scores per evaluator across targets
+- Flag evaluators with high variance across providers (> 0.3 score difference) — may indicate provider-sensitive assertions
+- Identify evaluators that pass for all providers (potentially too lenient) or fail for all (potentially misconfigured)
+
+## Output Format
+
+Produce a structured report in this exact format:
+
+```
+## Eval Quality Analysis
+
+**Results file:** <path>
+**Test cases analyzed:** <count>
+**Evaluator entries analyzed:** <count>
+**Targets:** <list of unique targets>
+
+### Deterministic-Upgrade Candidates
+
+| # | Test ID | Evaluator | Current Type | Evidence | Suggested Type | Suggested Config |
+|---|---------|-----------|-------------|----------|----------------|-----------------|
+| 1 | <test_id> | <evaluator name> | llm-judge | <brief evidence> | contains | `value: "exact string"` |
+
+### Weak Assertions
+
+| # | Test ID | Evaluator | Weakness | Current | Suggested Improvement |
+|---|---------|-----------|----------|---------|----------------------|
+| 1 | <test_id> | <evaluator name> | Vague criteria | "Response is good" | Add specific criteria: what makes it "good"? |
+
+### Cost/Quality Flags
+
+| # | Test ID | Evaluator | Flag | Detail | Suggestion |
+|---|---------|-----------|------|--------|------------|
+| 1 | <test_id> | <evaluator name> | Always-pass | Score 1.0 on 15/15 tests | Tighten criteria or add harder test cases |
+
+### Summary
+
+- **Deterministic upgrades:** <N> evaluators could be replaced with cheaper deterministic checks
+- **Weak assertions:** <N> assertions need strengthening
+- **Cost flags:** <N> evaluators flagged for cost/quality review
+- **Estimated savings:** Replacing <N> LLM-judge calls with deterministic checks
+```
+
+If a section has no findings, include the header with "None found." underneath.
+
+## Guidelines
+
+- **Be specific:** Every suggestion must include the test case ID, evaluator name, evidence from the results, and a concrete replacement config.
+- **Be conservative:** Only suggest deterministic upgrades when the pattern is clear and consistent. Partial or ambiguous evidence should be noted but not acted on.
+- **Prioritize by impact:** Order suggestions by estimated cost savings (LLM-judge → deterministic saves the most).
+- **Handle all evaluator types:** Process `code-judge`, `tool-trajectory`, `llm-judge`, `agent-judge`, `rubrics`, `composite`, and all deterministic types. Only LLM-based types are candidates for deterministic upgrades.
+- **Multi-provider awareness:** When results span multiple targets, note if a suggestion applies to all targets or is target-specific.
+- **No false positives:** It is better to miss a suggestion than to recommend an incorrect upgrade. If unsure, add the finding to a "Needs Review" subsection with your reasoning.
diff --git a/plugins/agentv-dev/skills/agentv-eval-analyzer/SKILL.md b/plugins/agentv-dev/skills/agentv-eval-analyzer/SKILL.md
@@ -0,0 +1,105 @@
+---
+name: agentv-eval-analyzer
+description: Analyze evaluation results for quality improvements — identify LLM-judge evaluators replaceable with deterministic assertions, flag weak/vague assertions, and surface cost/quality opportunities. Use after running evals to improve your evaluation config.
+---
+
+# AgentV Eval Analyzer
+
+Analyze JSONL evaluation results and produce actionable suggestions for improving eval quality, reducing cost, and increasing reliability.
+
+## When to Use
+
+- After running `agentv eval` and wanting to improve your evaluation config
+- When evaluations are slow or expensive and you suspect LLM-judges are doing deterministic work
+- When reviewing eval quality before sharing or publishing evaluation files
+- When triaging flaky evaluations that produce inconsistent scores
+
+## Quick Start
+
+```bash
+# Find your most recent results file
+agentv trace list --limit 5
+
+# Dispatch the eval-analyzer agent with the results file
+# Agent: eval-analyzer
+# Parameters:
+#   results-file: <path-to-results.jsonl>
+#   eval-path: <path-to-eval.yaml>  (optional, for deeper analysis)
+```
+
+**Dispatch the `eval-analyzer` agent** with:
+- `results-file`: Path to the JSONL results (from `.agentv/results/` or `agentv trace list`)
+- `eval-path` (optional): Path to the EVAL.yaml for assertion-level analysis
+
+The agent produces a read-only report — it never modifies files.
+
+## What It Detects
+
+### 1. Deterministic-Upgrade Candidates (highest value)
+
+LLM-judge evaluators doing work that a deterministic assertion could handle — cheaper, faster, and more reliable.
+
+| Pattern in LLM-Judge Reasoning | Suggested Deterministic Type |
+|-------------------------------|------------------------------|
+| "Output contains 'X'" — always cites same substring | `type: contains`, `value: "X"` |
+| Score always 0 or 1, never partial — binary check | `type: equals` or specific deterministic |
+| "Response is valid JSON" — format validation | `type: is-json` |
+| "Output starts with 'Error:'" — prefix check | `type: regex`, `value: "^Error:"` |
+| "Matches pattern /regex/" — regex match | `type: regex`, `value: "/pattern/"` |
+| All hits are substring presence checks | Multiple `type: contains` assertions (one per value) |
+
+### 2. Weak Assertion Detection
+
+| Weakness | Example | Fix |
+|----------|---------|-----|
+| Vague (< 8 words, no specifics) | "Response is good" | Add measurable criteria |
+| Tautological | "Output is correct" | Define what "correct" means with expected values |
+| Compound | "Handles errors and returns JSON" | Split into separate assertions |
+| Overly broad LLM-judge | Single vague `prompt` string | Use `type: rubrics` with enumerated items |
+
+### 3. Cost/Quality Flags
+
+| Flag | Meaning |
+|------|---------|
+| Expensive binary check | LLM-judge always returns 0 or 1 → deterministic replacement |
+| Always-pass | Score 1.0 on every test → criteria may be too lenient |
+| Always-fail | Score 0.0 on every test → criteria may be misconfigured |
+| Redundant evaluators | Two evaluators with identical scores/reasoning → merge |
+
+### 4. Multi-Provider Variance
+
+When results span multiple targets, flags evaluators with > 0.3 score variance across providers — indicating provider-sensitive assertions that may need tightening.
+
+## Applying Suggestions
+
+The analyzer report includes concrete YAML snippets for each suggestion. To apply:
+
+1. Open the EVAL.yaml referenced in the report
+2. Find the `assert` entry for the flagged evaluator (matched by `name` and `test_id`)
+3. Replace or supplement the evaluator config with the suggested deterministic assertion
+4. Re-run `agentv eval` to verify the change produces equivalent scores
+
+**Example upgrade:**
+
+Before (LLM-judge doing substring work):
+```yaml
+assert:
+  - name: has-error-code
+    type: llm-judge
+    prompt: "Check if the response contains the error code 404"
+```
+
+After (deterministic, zero LLM cost):
+```yaml
+assert:
+  - name: has-error-code
+    type: contains
+    value: "404"
+```
+
+## Limitations
+
+- Deterministic-upgrade detection is heuristic — review suggestions before applying
+- Requires at least one completed eval run (needs JSONL results)
+- Cannot detect all weak assertions without the EVAL.yaml file (results-only mode has reduced coverage)
+- Multi-run comparison (flakiness across runs) is out of scope — use `agentv compare` for that