feat: eval analyzer pass for weak assertions and flaky scenarios #567
Description
Objective
Add a standalone eval-quality analysis pass that helps users improve their evaluation configs by identifying weak assertions, suggesting deterministic upgrades, and flagging flaky scenarios — available as a CLI command or agentv-dev skill independent of the full optimizer workflow.
Architecture Boundary
external-first (agentv-dev skill or CLI subcommand)
Context
AgentV's optimizer-reflector agent already performs sophisticated analysis via SIMBA (self-introspective failure analysis) and GEPA (trace reflection), and optimizer-discovery already "challenges eval quality assumptions." However, these are embedded in the 5-phase agentv-optimizer workflow and require running the full optimization loop.
Anthropic's skill-creator uses a dedicated agents/analyzer.md role that runs independently to surface eval quality issues. AgentV should expose similar standalone analysis — not by copying skill-creator's prompts (AgentV's existing SIMBA/GEPA patterns are more sophisticated), but by making the analysis accessible outside the optimizer workflow and adding one pattern AgentV currently lacks: deterministic-upgrade suggestions.
MVP Scope: Deterministic-Upgrade Suggestions
The single highest-value capability to ship first. When an LLM judge is doing work that a deterministic assertion could handle, users pay unnecessary cost and get unnecessary variance.
Concrete Heuristics
1. Pattern-match detection — Flag when LLM-judge reasoning consistently matches a simple text check:
| Signal | Example | Suggested Upgrade |
|---|---|---|
| Reasoning always cites exact substring | `"evidence": "Output contains 'error code 404'"` across 5+ runs | `type: contains, value: "error code 404"` |
| Score is always 0.0 or 1.0, never partial | LLM judge returns binary for "is JSON valid" checks | `type: is_json` |
| Reasoning references format compliance | `"evidence": "Response is valid JSON with required fields"` | `type: is_json` + `type: field_accuracy` |
| Reasoning matches regex pattern | `"evidence": "Output starts with 'Error:' as expected"` | `type: starts_with, value: "Error:"` |
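A minimal sketch of the first signal: group judge results by evaluator and flag any evaluator whose evidence quotes the same exact substring on every run. The JSONL field names (`evaluator`, `evidence`) are assumptions for illustration, not AgentV's actual results schema:

```python
import json
import re
from collections import defaultdict

# Assumed (hypothetical) JSONL record shape:
#   {"test_id": ..., "evaluator": ..., "score": ..., "evidence": "..."}
QUOTED = re.compile(r"'([^']+)'")

def suggest_contains_upgrades(path, min_runs=5):
    """Flag LLM-judge evaluators whose evidence cites the same quoted
    substring across min_runs or more runs, and suggest a deterministic
    `contains` assertion instead."""
    by_evaluator = defaultdict(list)
    with open(path) as f:
        for line in f:
            by_evaluator[json.loads(line)["evaluator"]].append(json.loads(line))

    suggestions = []
    for name, records in by_evaluator.items():
        if len(records) < min_runs:
            continue
        # Quoted substrings cited in each run's evidence.
        cited = [set(QUOTED.findall(r.get("evidence", ""))) for r in records]
        for substring in set.intersection(*cited):
            suggestions.append({
                "evaluator": name,
                "signal": "evidence always cites exact substring",
                "suggested": {"type": "contains", "value": substring},
            })
    return suggestions
```

The set intersection keeps only substrings cited in every run, which matches the "consistently matches a simple text check" criterion rather than a one-off quotation.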
2. Weak assertion detection — Flag vague natural-language criteria:
| Weak (flag it) | Why it's weak | Stronger alternative |
|---|---|---|
| "Response is good" | No measurable criteria | "Response identifies the root cause and proposes a fix with code" |
| "Output is correct" | Tautological | "Output matches expected value" → then use `type: equals` |
| "Handles edge cases" | Unspecified which cases | Enumerate specific edge cases as separate test cases |
| "Is helpful and accurate" | Compound + subjective | Split into two criteria with specific evidence requirements |
Detection heuristic: flag assertions under 8 words that lack specific nouns, numbers, or code references.
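The detection heuristic above can be sketched directly. "Specific nouns, numbers, or code references" is approximated here by digits, backticked spans, snake_case identifiers, all-caps tokens, or call syntax; this proxy is an assumption, not the final rule:

```python
import re

# Rough proxy for "specific": digits, `code spans`, ALL-CAPS tokens,
# snake_case identifiers, or function-call syntax.
SPECIFIC = re.compile(r"\d|`[^`]+`|[A-Z]{2,}|[a-z]+_[a-z]+|\(\)")

def is_weak_assertion(text, max_words=8):
    """Flag assertions under max_words words that lack any specific
    noun, number, or code reference (per the heuristic above)."""
    return len(text.split()) < max_words and not SPECIFIC.search(text)
```

On the table's examples, "Response is good" and "Output is correct" are flagged, while the stronger alternative "Response identifies the root cause and proposes a fix with code" passes on length alone.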
3. Cost/quality signal — Flag evaluators where LLM-judge cost is disproportionate:
| Signal | Suggestion |
|---|---|
| LLM-judge token usage > 500 tokens for a check that always returns 1.0 | Replace with deterministic assertion |
| LLM-judge takes > 3s for binary yes/no check | Consider deterministic alternative |
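Both cost/quality signals reduce to simple checks over per-run judge records. The `tokens` and `latency_s` field names are illustrative assumptions about what the results file records:

```python
def flag_disproportionate_judges(records, token_budget=500, max_latency_s=3.0):
    """Apply the two cost/quality signals from the table above to one
    evaluator's records. Field names (score, tokens, latency_s) are
    assumed, not AgentV's actual schema."""
    flags = []
    scores = [r["score"] for r in records]
    # Signal 1: always returns 1.0 but burns a large token budget.
    if all(s == 1.0 for s in scores) and \
            any(r.get("tokens", 0) > token_budget for r in records):
        flags.append("Replace with deterministic assertion")
    # Signal 2: binary yes/no check that is slow to run.
    if set(scores) <= {0.0, 1.0} and \
            any(r.get("latency_s", 0) > max_latency_s for r in records):
        flags.append("Consider deterministic alternative")
    return flags
```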
What this does NOT include (follow-up work)
- Multi-run variance detection (requires history repo / multiple runs — separate concern)
- Blind A/B comparison
- Auto-fixing eval configs
- Full SIMBA/GEPA analysis (already available via agentv-optimizer)
Design Latitude
- Recommended approach: agentv-dev skill (aligns with CLAUDE.md's AI-First Design principle) that can also work as a CLI subcommand
- May reuse patterns from `optimizer-reflector` and `optimizer-discovery` agents
- Choose output format: console summary with actionable suggestions, JSON report, or both
- May start with just deterministic-upgrade detection and add weak-assertion detection as a fast follow
- Should work with a single JSONL results file (no multi-run requirement)
Acceptance Signals
- Running the analyzer on a JSONL results file identifies LLM-judge evaluations that could be replaced with deterministic assertions (`contains`, `regex`, `equals`, `is_json`, `starts_with`, `ends_with`, `field_accuracy`)
- Each suggestion includes: the test case ID, the evaluator name, the evidence pattern detected, and the suggested deterministic assertion type + value
- Running the analyzer on results with vague assertions flags them with the specific weakness and a concrete improvement suggestion
- Works with a single JSONL file — does not require multiple runs or a history repo
Non-Goals
- Replacing LLM judges automatically (suggestions only)
- Auto-fixing eval configs
- Multi-run variance analysis (requires history repo, separate concern)
- Blind A/B comparison
- Trigger-quality analysis
- Duplicating SIMBA/GEPA analysis already available via agentv-optimizer
Research Basis
- Skill Creator Findings — agents/analyzer.md role
- Skill Lifecycle Alignment — Explore Later #1: Analyzer-style output
- Existing AgentV patterns: `optimizer-reflector` (SIMBA/GEPA), `optimizer-discovery` (eval quality assessment)