
feat: blind A/B comparison with dynamic rubrics and post-comparison analysis #571

@christso

Description

Objective

Add blind A/B comparison capabilities to AgentV's evaluation workflow, adopting two techniques from Anthropic's skill-creator comparator.md and analyzer.md: bias-free blind comparison with dynamic rubric generation, and post-comparison analysis that explains WHY the winner won.

Architecture Boundary

external-first (new agents in plugins/agentv-dev/agents/ + skill enhancement)

AgentV's broader comparison scope

AgentV's agentv compare already supports N-way comparison across multiple providers (--group-by target), not just binary A/B. The blind comparison must preserve this: when comparing 3+ providers, randomize all labels (A, B, C...) and evaluate each pair or use a round-robin tournament. Do not regress to binary-only comparison.
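The round-robin requirement above can be sketched as follows. This is a minimal illustration, not AgentV's actual API: the function name `make_blind_pairs` and the dict shapes are invented for this example.

```python
import itertools
import random
import string


def make_blind_pairs(outputs_by_target, rng=None):
    """Assign a random anonymous label (A, B, C, ...) to each target,
    then emit every pairwise matchup for round-robin blind judging."""
    rng = rng or random.Random()
    targets = list(outputs_by_target)
    rng.shuffle(targets)  # randomize which target hides behind which label
    labels = string.ascii_uppercase[: len(targets)]
    mapping = dict(zip(labels, targets))  # kept CLI-side, hidden from judge
    pairs = [
        {
            "left": a,
            "right": b,
            "left_output": outputs_by_target[mapping[a]],
            "right_output": outputs_by_target[mapping[b]],
        }
        for a, b in itertools.combinations(labels, 2)
    ]
    return mapping, pairs


mapping, pairs = make_blind_pairs(
    {"baseline": "out-1", "candidate": "out-2", "experimental": "out-3"},
    rng=random.Random(42),
)
# 3 targets yield 3 round-robin pairs: (A, B), (A, C), (B, C)
```

Because labels are shuffled per run, the judge never learns which label is the baseline, yet the caller retains `mapping` for unblinding.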

What skill-creator has that AgentV doesn't

1. Blind comparison (comparator.md)

The comparator judges two outputs labeled A and B without knowing which is baseline vs candidate:

"You receive two outputs labeled A and B, but you do NOT know which skill
 produced which. This prevents bias toward a particular skill or approach.
 Your judgment is based purely on output quality and task completion."

AgentV's agentv compare shows results with target labels visible. This introduces confirmation bias — reviewers (human or LLM) tend to favor the "new" version.
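For the binary case, the blinding step can be as simple as the sketch below: the judge prompt mentions only labels A and B, and the label-to-target key stays on the caller's side for later unblinding. `blind_pair` and its signature are hypothetical, not part of AgentV or skill-creator.

```python
import random


def blind_pair(baseline_output, candidate_output, rng=None):
    """Build a judge prompt that only mentions A and B; return the
    label->target key separately so results can be unblinded later."""
    rng = rng or random.Random()
    flipped = rng.random() < 0.5
    a, b = ((candidate_output, baseline_output) if flipped
            else (baseline_output, candidate_output))
    key = {"A": "candidate" if flipped else "baseline",
           "B": "baseline" if flipped else "candidate"}
    prompt = (
        "You receive two outputs labeled A and B. You do NOT know which "
        "system produced which; judge purely on output quality and task "
        "completion.\n\n"
        f"Output A:\n{a}\n\nOutput B:\n{b}\n"
    )
    return prompt, key


prompt, key = blind_pair("old answer", "new answer")
# After the judge returns a winning label: winner_target = key[winning_label]
```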

2. Dynamic rubric generation

Instead of fixed metrics, the comparator generates task-specific rubrics:

Content Rubric (what the output contains):
| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
|---|---|---|---|
| Correctness | Major errors | Minor errors | Fully correct |
| Completeness | Missing key elements | Mostly complete | All elements present |

Structure Rubric (how the output is organized):
| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
|---|---|---|---|
| Organization | Disorganized | Reasonably organized | Clear, logical structure |

Criteria adapt to the task: PDF form → "Field alignment", "Text readability"; Data output → "Schema correctness", "Data types". AgentV's compare uses the same fixed evaluator scores for all task types.
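One way to model the adaptation is to merge fixed criteria with a task-specific set, as in this sketch. The general anchors come from the rubric above; the task-specific anchor wordings ("Fields misplaced", "Wrong schema", etc.) are invented examples, and the names are not a confirmed API.

```python
# General criteria that apply to most tasks, anchored at 1/3/5
# (anchors taken from the content rubric above).
FIXED_CONTENT_CRITERIA = {
    "Correctness": ("Major errors", "Minor errors", "Fully correct"),
    "Completeness": ("Missing key elements", "Mostly complete",
                     "All elements present"),
}

# Task-specific criteria; the anchor wordings here are illustrative.
TASK_SPECIFIC_CRITERIA = {
    "pdf_form": {
        "Field alignment": ("Fields misplaced", "Mostly aligned",
                            "All fields aligned"),
        "Text readability": ("Illegible", "Readable with effort",
                             "Cleanly readable"),
    },
    "data_output": {
        "Schema correctness": ("Wrong schema", "Minor deviations",
                               "Matches schema exactly"),
        "Data types": ("Wrong types", "Mostly correct types",
                       "All types correct"),
    },
}


def build_rubric(task_type):
    """Merge general and task-specific criteria into a 1/3/5 rubric."""
    criteria = dict(FIXED_CONTENT_CRITERIA)
    criteria.update(TASK_SPECIFIC_CRITERIA.get(task_type, {}))
    return {
        name: {"1 (Poor)": poor, "3 (Acceptable)": ok, "5 (Excellent)": good}
        for name, (poor, ok, good) in criteria.items()
    }


rubric = build_rubric("pdf_form")
# rubric now covers Correctness, Completeness, Field alignment, Text readability
```

In practice the judge LLM would generate the task-specific criteria itself; a table like this is simply the structure it fills in.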

3. Post-comparison analysis (analyzer.md — unblinding section)

After blind comparison, the analyzer "unblinds" results and explains what made the winner better:

"Read Both Skills → Read Both Transcripts → Analyze Instruction Following →
 Identify Winner Strengths → Identify Loser Weaknesses →
 Generate Improvement Suggestions (prioritized by impact)"

It scores instruction-following (1-10) and produces categorized suggestions (instructions, tools, examples, error_handling, structure, references) with priority levels (high/medium/low).
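A minimal sketch of that analysis payload, using the category and priority vocabularies listed above. The function name and field names are assumptions loosely modeled on skill-creator's analyzer JSON, not a confirmed schema.

```python
PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}
CATEGORIES = {"instructions", "tools", "examples",
              "error_handling", "structure", "references"}


def make_analysis(instruction_following_score, suggestions):
    """Validate and package a post-comparison analysis result,
    sorting suggestions so high-priority items come first."""
    if not 1 <= instruction_following_score <= 10:
        raise ValueError("instruction-following score must be 1-10")
    for s in suggestions:
        if s["category"] not in CATEGORIES:
            raise ValueError(f"unknown category: {s['category']}")
    return {
        "instruction_following_score": instruction_following_score,
        "suggestions": sorted(
            suggestions, key=lambda s: PRIORITY_ORDER[s["priority"]]
        ),
    }


analysis = make_analysis(7, [
    {"category": "examples", "priority": "low",
     "text": "Add a worked example for the edge case"},
    {"category": "instructions", "priority": "high",
     "text": "State the required output schema explicitly"},
])
# Suggestions are ordered by impact: high before medium before low.
```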

AgentV's optimizer-reflector does failure analysis (SIMBA/GEPA) but doesn't do post-comparison analysis explaining why one output beat another.

What to keep from AgentV

  • agentv compare CLI with pairwise and N-way comparison (the data pipeline is good)
  • agentv-trace-analyst for detailed trace analysis
  • SIMBA/GEPA patterns in optimizer-reflector for root cause diagnosis (complementary to post-comparison analysis)
  • --group-by target|dataset|test-id for multi-dimensional comparison

Design Latitude

  • May be a new agent pair (blind-comparator.md + comparison-analyzer.md) or enhancements to existing agents
  • May be invoked via a new skill, an enhancement to agentv-trace-analyst, or a flag on agentv compare
  • Choose how to randomize A/B labels (the CLI can shuffle, or the agent prompt can instruct shuffling)
  • Dynamic rubric generation can be a general-purpose feature or specific to blind comparison mode
  • Post-comparison analysis may be optional (only run when the user wants to understand close calls)
  • Choose output format — may follow skill-creator's JSON structure or adapt to AgentV's existing result format

Acceptance Signals

  • A blind comparison mode exists where the judge evaluates two outputs without knowing which is baseline/candidate
  • The comparison generates task-specific rubrics (not just fixed metrics) with content and structure dimensions
  • Each output receives a multi-dimensional score (content score, structure score, overall)
  • A post-comparison analysis can explain why the winner won and suggest prioritized improvements for the loser
  • The blind comparison result includes a winner field, structured rubric scores, and per-output strengths/weaknesses
  • The analysis output is compatible with the skill-creator's comparator/analyzer JSON formats
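Taken together, the signals above imply a result shape roughly like the following. The key names and values here are illustrative placeholders, not the skill-creator schema.

```python
import json

# Hypothetical blind comparison result: winner, per-output multi-dimensional
# scores, per-criterion rubric scores, and strengths/weaknesses.
result = {
    "winner": "A",
    "scores": {
        "A": {"content": 4.5, "structure": 4.0, "overall": 4.3},
        "B": {"content": 3.5, "structure": 4.0, "overall": 3.7},
    },
    "rubric_scores": {
        "Correctness": {"A": 5, "B": 3},
        "Organization": {"A": 4, "B": 4},
    },
    "outputs": {
        "A": {"strengths": ["fully correct"], "weaknesses": []},
        "B": {"strengths": ["clear layout"],
              "weaknesses": ["minor factual errors"]},
    },
}

print(json.dumps(result, indent=2))
```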

Non-Goals

  • Replacing agentv compare CLI (this complements it with richer qualitative comparison)
  • Modifying the core evaluation engine
  • Automated skill improvement (this surfaces insights, humans/optimizer act on them)
  • Trigger evaluation comparison

Source Material
