feat: eval analyzer pass for weak assertions and flaky scenarios #567

@christso

Description

Objective

Add a standalone eval-quality analysis pass that helps users improve their evaluation configs by identifying weak assertions, suggesting deterministic upgrades, and flagging flaky scenarios — available as a CLI command or agentv-dev skill independent of the full optimizer workflow.

Architecture Boundary

external-first (agentv-dev skill or CLI subcommand)

Context

AgentV's optimizer-reflector agent already performs sophisticated analysis via SIMBA (self-introspective failure analysis) and GEPA (trace reflection), and optimizer-discovery already "challenges eval quality assumptions." However, these are embedded in the 5-phase agentv-optimizer workflow and require running the full optimization loop.

Anthropic's skill-creator uses a dedicated agents/analyzer.md role that runs independently to surface eval quality issues. AgentV should expose similar standalone analysis — not by copying skill-creator's prompts (AgentV's existing SIMBA/GEPA patterns are more sophisticated), but by making the analysis accessible outside the optimizer workflow and adding one pattern AgentV currently lacks: deterministic-upgrade suggestions.

MVP Scope: Deterministic-Upgrade Suggestions

This is the single highest-value capability to ship first: when an LLM judge is doing work that a deterministic assertion could handle, users pay unnecessary cost and get unnecessary variance.

Concrete Heuristics

1. Pattern-match detection — Flag when LLM-judge reasoning consistently matches a simple text check:

| Signal | Example | Suggested Upgrade |
| --- | --- | --- |
| Reasoning always cites an exact substring | `"evidence": "Output contains 'error code 404'"` across 5+ runs | `type: contains, value: "error code 404"` |
| Score is always 0.0 or 1.0, never partial | LLM judge returns a binary score for "is JSON valid" checks | `type: is_json` |
| Reasoning references format compliance | `"evidence": "Response is valid JSON with required fields"` | `type: is_json` + `type: field_accuracy` |
| Reasoning matches a regex pattern | `"evidence": "Output starts with 'Error:' as expected"` | `type: starts_with, value: "Error:"` |
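A minimal sketch of the substring heuristic above, assuming the analyzer has collected each run's judge evidence string for one evaluator. The function name, the 5-run threshold, and the single-quoted-substring convention are assumptions for illustration, not AgentV's actual API:

```python
import re

def suggest_contains_upgrade(evidence_list, min_runs=5):
    """If every run's judge evidence quotes the same substring,
    suggest replacing the LLM judge with a `contains` assertion."""
    if len(evidence_list) < min_runs:
        return None  # not enough runs to trust the pattern
    # Extract single-quoted substrings cited in each run's evidence
    quoted = [re.findall(r"'([^']+)'", ev) for ev in evidence_list]
    # Keep only substrings cited in *every* run
    common = set(quoted[0])
    for q in quoted[1:]:
        common &= set(q)
    if common:
        value = max(common, key=len)  # prefer the most specific match
        return {"type": "contains", "value": value}
    return None
```

The intersection across runs is the key design choice: a substring cited in only some runs is weak evidence that the judge is doing a pure text check.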

2. Weak assertion detection — Flag vague natural-language criteria:

| Weak (flag it) | Why it's weak | Stronger alternative |
| --- | --- | --- |
| "Response is good" | No measurable criteria | "Response identifies the root cause and proposes a fix with code" |
| "Output is correct" | Tautological | "Output matches expected value" → then use `type: equals` |
| "Handles edge cases" | Doesn't specify which cases | Enumerate specific edge cases as separate test cases |
| "Is helpful and accurate" | Compound + subjective | Split into two criteria with specific evidence requirements |

Detection heuristic: flag assertions under 8 words that lack specific nouns, numbers, or code references.
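The detection heuristic above could be sketched as follows. What counts as a "specific noun" is underspecified in this issue, so the code approximates it as a mid-sentence capitalized word; that approximation, like the function name, is an assumption:

```python
import re

def is_weak_assertion(text: str) -> bool:
    """Heuristic from this issue: flag assertions under 8 words that
    lack specific nouns, numbers, or code references."""
    words = text.split()
    if len(words) >= 8:
        return False  # long enough to plausibly carry real criteria
    has_number = bool(re.search(r"\d", text))
    has_code_ref = bool(re.search(r"[`'\"]", text))  # quoted/backticked token
    # Crude stand-in for "specific noun": a capitalized word mid-sentence
    has_specific_noun = any(w[0].isupper() for w in words[1:])
    return not (has_number or has_code_ref or has_specific_noun)
```

This is intentionally cheap: it runs per assertion string with no model calls, so it can be applied to every criterion in a config.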

3. Cost/quality signal — Flag evaluators where LLM-judge cost is disproportionate:

| Signal | Suggestion |
| --- | --- |
| LLM-judge token usage > 500 tokens for a check that always returns 1.0 | Replace with deterministic assertion |
| LLM judge takes > 3 s for a binary yes/no check | Consider deterministic alternative |
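Both signals can be checked in one pass over an evaluator's runs. The record shape (`score`, `tokens`, `seconds` keys) is a hypothetical results schema, not AgentV's actual format; the thresholds are the ones stated above:

```python
def cost_signal(evaluator_runs):
    """Flag an LLM-judge evaluator whose cost looks disproportionate.
    `evaluator_runs` is a list of per-run dicts with 'score',
    'tokens', and 'seconds' keys (assumed schema)."""
    scores = [r["score"] for r in evaluator_runs]
    always_pass = all(s == 1.0 for s in scores)
    binary = all(s in (0.0, 1.0) for s in scores)
    suggestions = []
    # > 500 tokens spent on a check that never fails
    if always_pass and any(r["tokens"] > 500 for r in evaluator_runs):
        suggestions.append("Replace with deterministic assertion")
    # > 3 s latency for a purely binary verdict
    if binary and any(r["seconds"] > 3 for r in evaluator_runs):
        suggestions.append("Consider deterministic alternative")
    return suggestions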

What this does NOT include (follow-up work)

  • Multi-run variance detection (requires history repo / multiple runs — separate concern)
  • Blind A/B comparison
  • Auto-fixing eval configs
  • Full SIMBA/GEPA analysis (already available via agentv-optimizer)

Design Latitude

  • Recommended approach: agentv-dev skill (aligns with CLAUDE.md's AI-First Design principle) that can also work as a CLI subcommand
  • May reuse patterns from optimizer-reflector and optimizer-discovery agents
  • Choose output format: console summary with actionable suggestions, JSON report, or both
  • May start with just deterministic-upgrade detection and add weak-assertion detection as a fast follow
  • Should work with a single JSONL results file (no multi-run requirement)

Acceptance Signals

  • Running the analyzer on a JSONL results file identifies LLM-judge evaluations that could be replaced with deterministic assertions (contains, regex, equals, is_json, starts_with, ends_with, field_accuracy)
  • Each suggestion includes: the test case ID, the evaluator name, the evidence pattern detected, and the suggested deterministic assertion type + value
  • Running the analyzer on results with vague assertions flags them with the specific weakness and a concrete improvement suggestion
  • Works with a single JSONL file — does not require multiple runs or a history repo
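The acceptance signals above imply a suggestion record with four fields. A minimal single-pass sketch over a JSONL results file, using a crude stand-in for the pattern matching; the field names (`id`, `evaluator`, `evidence`) are assumptions about the results schema, not AgentV's actual format:

```python
import json

def analyze_results(path):
    """Read a JSONL results file and emit one suggestion record per
    flagged evaluator run (single file, no history repo needed)."""
    suggestions = []
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            evidence = row.get("evidence", "")
            # Crude stand-in for pattern-match detection
            if "contains '" in evidence:
                value = evidence.split("'")[1]
                suggestions.append({
                    "test_case_id": row["id"],
                    "evaluator": row["evaluator"],
                    "evidence_pattern": evidence,
                    "suggested_assertion": {"type": "contains", "value": value},
                })
    return suggestions
```

Each record carries the four elements the acceptance signals require: test case ID, evaluator name, detected evidence pattern, and the suggested assertion type plus value.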

Non-Goals

  • Replacing LLM judges automatically (suggestions only)
  • Auto-fixing eval configs
  • Multi-run variance analysis (requires history repo, separate concern)
  • Blind A/B comparison
  • Trigger-quality analysis
  • Duplicating SIMBA/GEPA analysis already available via agentv-optimizer
