feat: eval analyzer pass for weak assertions and flaky scenarios #567

@christso

Description

Objective

Add a standalone eval-quality analysis pass that helps users improve their evaluation configs by identifying weak assertions, suggesting deterministic upgrades, and flagging flaky scenarios — available as a CLI command or agentv-dev skill independent of the full optimizer workflow.

Architecture Boundary

external-first (agentv-dev skill or CLI subcommand)

Context

AgentV's optimizer-reflector agent already performs sophisticated analysis via SIMBA (self-introspective failure analysis) and GEPA (trace reflection), and optimizer-discovery already "challenges eval quality assumptions." However, these are embedded in the 5-phase agentv-optimizer workflow and require running the full optimization loop.

Anthropic's skill-creator uses a dedicated agents/analyzer.md role that runs independently to surface eval quality issues. AgentV should expose similar standalone analysis — not by copying skill-creator's prompts (AgentV's existing SIMBA/GEPA patterns are more sophisticated), but by making the analysis accessible outside the optimizer workflow and adding one pattern AgentV currently lacks: deterministic-upgrade suggestions.

MVP Scope: Deterministic-Upgrade Suggestions

This is the single highest-value capability to ship first: when an LLM judge is doing work that a deterministic assertion could handle, users pay unnecessary cost and get unnecessary variance.

Concrete Heuristics

1. Pattern-match detection — Flag when LLM-judge reasoning consistently matches a simple text check:

| Signal | Example | Suggested Upgrade |
| --- | --- | --- |
| Reasoning always cites an exact substring | `"evidence": "Output contains 'error code 404'"` across 5+ runs | `type: contains, value: "error code 404"` |
| Score is always 0.0 or 1.0, never partial | LLM judge returns a binary score for "is JSON valid" checks | `type: is_json` |
| Reasoning references format compliance | `"evidence": "Response is valid JSON with required fields"` | `type: is_json` + `type: field_accuracy` |
| Reasoning matches a regex pattern | `"evidence": "Output starts with 'Error:' as expected"` | `type: starts_with, value: "Error:"` |
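A minimal sketch of the substring heuristic above, assuming the analyzer has collected each run's judge evidence string for one evaluator. The function name, the 5-run threshold, and the single-quoted-substring convention are assumptions for illustration, not AgentV's actual API:

```python
import re

def suggest_contains_upgrade(evidence_list, min_runs=5):
    """If every run's judge evidence quotes the same substring,
    suggest replacing the LLM judge with a `contains` assertion."""
    if len(evidence_list) < min_runs:
        return None  # not enough runs to trust the pattern
    # Extract single-quoted substrings cited in each run's evidence
    quoted = [re.findall(r"'([^']+)'", ev) for ev in evidence_list]
    # Keep only substrings cited in *every* run
    common = set(quoted[0])
    for q in quoted[1:]:
        common &= set(q)
    if common:
        value = max(common, key=len)  # prefer the most specific match
        return {"type": "contains", "value": value}
    return None
```

The intersection across runs is the key design choice: a substring cited in only some runs is weak evidence that the judge is doing a pure text check.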

2. Weak assertion detection — Flag vague natural-language criteria:

| Weak (flag it) | Why it's weak | Stronger alternative |
| --- | --- | --- |
| "Response is good" | No measurable criteria | "Response identifies the root cause and proposes a fix with code" |
| "Output is correct" | Tautological | "Output matches expected value" → then use `type: equals` |
| "Handles edge cases" | Doesn't specify which cases | Enumerate specific edge cases as separate test cases |
| "Is helpful and accurate" | Compound + subjective | Split into two criteria with specific evidence requirements |

Detection heuristic: flag assertions under 8 words that lack specific nouns, numbers, or code references.
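The detection heuristic above could be sketched as follows. What counts as a "specific noun" is underspecified in this issue, so the code approximates it as a mid-sentence capitalized word; that approximation, like the function name, is an assumption:

```python
import re

def is_weak_assertion(text: str) -> bool:
    """Heuristic from this issue: flag assertions under 8 words that
    lack specific nouns, numbers, or code references."""
    words = text.split()
    if len(words) >= 8:
        return False  # long enough to plausibly carry real criteria
    has_number = bool(re.search(r"\d", text))
    has_code_ref = bool(re.search(r"[`'\"]", text))  # quoted/backticked token
    # Crude stand-in for "specific noun": a capitalized word mid-sentence
    has_specific_noun = any(w[0].isupper() for w in words[1:])
    return not (has_number or has_code_ref or has_specific_noun)
```

This is intentionally cheap: it runs per assertion string with no model calls, so it can be applied to every criterion in a config.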

3. Cost/quality signal — Flag evaluators where LLM-judge cost is disproportionate:

| Signal | Suggestion |
| --- | --- |
| LLM-judge token usage > 500 tokens for a check that always returns 1.0 | Replace with deterministic assertion |
| LLM judge takes > 3 s for a binary yes/no check | Consider deterministic alternative |
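Both signals can be checked in one pass over an evaluator's runs. The record shape (`score`, `tokens`, `seconds` keys) is a hypothetical results schema, not AgentV's actual format; the thresholds are the ones stated above:

```python
def cost_signal(evaluator_runs):
    """Flag an LLM-judge evaluator whose cost looks disproportionate.
    `evaluator_runs` is a list of per-run dicts with 'score',
    'tokens', and 'seconds' keys (assumed schema)."""
    scores = [r["score"] for r in evaluator_runs]
    always_pass = all(s == 1.0 for s in scores)
    binary = all(s in (0.0, 1.0) for s in scores)
    suggestions = []
    # > 500 tokens spent on a check that never fails
    if always_pass and any(r["tokens"] > 500 for r in evaluator_runs):
        suggestions.append("Replace with deterministic assertion")
    # > 3 s latency for a purely binary verdict
    if binary and any(r["seconds"] > 3 for r in evaluator_runs):
        suggestions.append("Consider deterministic alternative")
    return suggestions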

What this does NOT include (follow-up work)

  • Multi-run variance detection (requires history repo / multiple runs — separate concern)
  • Blind A/B comparison
  • Auto-fixing eval configs
  • Full SIMBA/GEPA analysis (already available via agentv-optimizer)

Design Latitude

  • Recommended approach: agentv-dev skill (aligns with CLAUDE.md's AI-First Design principle) that can also work as a CLI subcommand
  • May reuse patterns from optimizer-reflector and optimizer-discovery agents
  • Choose output format: console summary with actionable suggestions, JSON report, or both
  • May start with just deterministic-upgrade detection and add weak-assertion detection as a fast follow
  • Should work with a single JSONL results file (no multi-run requirement)

Acceptance Signals

  • Running the analyzer on a JSONL results file identifies LLM-judge evaluations that could be replaced with deterministic assertions (contains, regex, equals, is_json, starts_with, ends_with, field_accuracy)
  • Each suggestion includes: the test case ID, the evaluator name, the evidence pattern detected, and the suggested deterministic assertion type + value
  • Running the analyzer on results with vague assertions flags them with the specific weakness and a concrete improvement suggestion
  • Works with a single JSONL file — does not require multiple runs or a history repo
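The acceptance signals above imply a suggestion record with four fields. A minimal single-pass sketch over a JSONL results file, using a crude stand-in for the pattern matching; the field names (`id`, `evaluator`, `evidence`) are assumptions about the results schema, not AgentV's actual format:

```python
import json

def analyze_results(path):
    """Read a JSONL results file and emit one suggestion record per
    flagged evaluator run (single file, no history repo needed)."""
    suggestions = []
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            evidence = row.get("evidence", "")
            # Crude stand-in for pattern-match detection
            if "contains '" in evidence:
                value = evidence.split("'")[1]
                suggestions.append({
                    "test_case_id": row["id"],
                    "evaluator": row["evaluator"],
                    "evidence_pattern": evidence,
                    "suggested_assertion": {"type": "contains", "value": value},
                })
    return suggestions
```

Each record carries the four elements the acceptance signals require: test case ID, evaluator name, detected evidence pattern, and the suggested assertion type plus value.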

Non-Goals

  • Replacing LLM judges automatically (suggestions only)
  • Auto-fixing eval configs
  • Multi-run variance analysis (requires history repo, separate concern)
  • Blind A/B comparison
  • Trigger-quality analysis
  • Duplicating SIMBA/GEPA analysis already available via agentv-optimizer
