feat: blind A/B comparison with dynamic rubrics and post-comparison analysis #571
Description
Objective
Add blind A/B comparison capabilities to AgentV's evaluation workflow, adopting two techniques from Anthropic's skill-creator comparator.md and analyzer.md: bias-free blind comparison with dynamic rubric generation, and post-comparison analysis that explains WHY the winner won.
Architecture Boundary
external-first (new agents in plugins/agentv-dev/agents/ + skill enhancement)
AgentV's broader comparison scope
AgentV's agentv compare already supports N-way comparison across multiple providers (--group-by target), not just binary A/B. The blind comparison must preserve this: when comparing 3+ providers, randomize all labels (A, B, C...) and evaluate each pair or use a round-robin tournament. Do not regress to binary-only comparison.
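One way to preserve N-way support is to assign shuffled anonymous labels and compare every label pair round-robin. A minimal sketch, assuming the CLI does the shuffling; the function and variable names here are illustrative, not AgentV's actual API:

```python
import itertools
import random
import string

def blind_labels(provider_ids):
    """Assign shuffled anonymous labels (A, B, C, ...) to providers.

    Returns a (label -> provider) mapping that is kept secret from the
    judge (the unblinding key), plus the round-robin list of label
    pairs to compare.
    """
    labels = list(string.ascii_uppercase[: len(provider_ids)])
    shuffled = random.sample(provider_ids, k=len(provider_ids))
    mapping = dict(zip(labels, shuffled))            # unblinding key
    pairs = list(itertools.combinations(labels, 2))  # round-robin pairings
    return mapping, pairs

mapping, pairs = blind_labels(["baseline", "candidate-1", "candidate-2"])
# pairs == [("A", "B"), ("A", "C"), ("B", "C")]
```

With 2 providers this degenerates to a single A/B pair, so binary comparison falls out of the same code path rather than being a special case.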
What skill-creator has that AgentV doesn't
1. Blind comparison (comparator.md)
The comparator judges two outputs labeled A and B without knowing which is baseline vs candidate:
"You receive two outputs labeled A and B, but you do NOT know which skill
produced which. This prevents bias toward a particular skill or approach.
Your judgment is based purely on output quality and task completion."
AgentV's agentv compare shows results with target labels visible. This introduces confirmation bias — reviewers (human or LLM) tend to favor the "new" version.
2. Dynamic rubric generation
Instead of fixed metrics, the comparator generates task-specific rubrics:
Content Rubric (what the output contains):
| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
| --- | --- | --- | --- |
| Correctness | Major errors | Minor errors | Fully correct |
| Completeness | Missing key elements | Mostly complete | All elements present |
Structure Rubric (how the output is organized):
| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
| --- | --- | --- | --- |
| Organization | Disorganized | Reasonably organized | Clear, logical structure |
Criteria adapt to the task: PDF form → "Field alignment", "Text readability"; Data output → "Schema correctness", "Data types". AgentV's compare uses the same fixed evaluator scores for all task types.
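Rubric generation could be sketched as base criteria plus task-specific additions. This is a minimal illustration of the idea, not AgentV's or skill-creator's actual schema; the task kinds and criterion names are assumptions drawn from the examples above:

```python
def make_rubric(task_kind):
    """Build a two-dimension rubric (content + structure) for a task.

    Base criteria apply to every task; task-specific criteria are
    appended when the task kind is recognized.
    """
    base = {
        "content": ["Correctness", "Completeness"],
        "structure": ["Organization"],
    }
    # Hypothetical task-kind mappings, following the examples in the text.
    task_specific = {
        "pdf_form": {"content": ["Field alignment", "Text readability"]},
        "data_output": {"content": ["Schema correctness", "Data types"]},
    }
    rubric = {dim: list(crits) for dim, crits in base.items()}
    for dim, extra in task_specific.get(task_kind, {}).items():
        rubric[dim].extend(extra)
    return rubric

make_rubric("pdf_form")
# {'content': ['Correctness', 'Completeness', 'Field alignment',
#   'Text readability'], 'structure': ['Organization']}
```

In practice the comparator LLM would generate the task-specific criteria itself rather than look them up in a table; the lookup here just shows the resulting shape.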
3. Post-comparison analysis (analyzer.md — unblinding section)
After blind comparison, the analyzer "unblinds" results and explains what made the winner better:
"Read Both Skills → Read Both Transcripts → Analyze Instruction Following →
Identify Winner Strengths → Identify Loser Weaknesses →
Generate Improvement Suggestions (prioritized by impact)"
It scores instruction-following (1-10) and produces categorized suggestions (instructions, tools, examples, error_handling, structure, references) with priority levels (high/medium/low).
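The analysis output described above might look like the following record. The field names are illustrative assumptions loosely following skill-creator's analyzer, not a confirmed schema:

```python
# Illustrative post-comparison analysis result (unblinded).
analysis = {
    "winner": "A",
    "instruction_following": {"A": 9, "B": 6},  # 1-10 scale per output
    "winner_strengths": ["Followed the output schema exactly"],
    "loser_weaknesses": ["Omitted required error handling"],
    "suggestions": [
        {
            # category is one of: instructions, tools, examples,
            # error_handling, structure, references
            "category": "error_handling",
            "priority": "high",  # high | medium | low
            "text": "Add an explicit retry/fallback instruction.",
        },
    ],
}
```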
AgentV's optimizer-reflector does failure analysis (SIMBA/GEPA) but doesn't do post-comparison analysis explaining why one output beat another.
What to keep from AgentV
- `agentv compare` CLI with pairwise and N-way comparison (the data pipeline is good)
- `agentv-trace-analyst` for detailed trace analysis
- SIMBA/GEPA patterns in `optimizer-reflector` for root cause diagnosis (complementary to post-comparison analysis)
- `--group-by target|dataset|test-id` for multi-dimensional comparison
Design Latitude
- May be a new agent pair (`blind-comparator.md` + `comparison-analyzer.md`) or enhancements to existing agents
- May be invoked via a new skill, an enhancement to `agentv-trace-analyst`, or a flag on `agentv compare`
- Choose how to randomize A/B labels (the CLI can shuffle, or the agent prompt can instruct shuffling)
- Dynamic rubric generation can be a general-purpose feature or specific to blind comparison mode
- Post-comparison analysis may be optional (only run when the user wants to understand close calls)
- Choose output format — may follow skill-creator's JSON structure or adapt to AgentV's existing result format
Acceptance Signals
- A blind comparison mode exists where the judge evaluates two outputs without knowing which is baseline/candidate
- The comparison generates task-specific rubrics (not just fixed metrics) with content and structure dimensions
- Each output receives a multi-dimensional score (content score, structure score, overall)
- A post-comparison analysis can explain why the winner won and suggest prioritized improvements for the loser
- The blind comparison result includes a `winner` field, structured rubric scores, and per-output strengths/weaknesses
- The analysis output is compatible with skill-creator's comparator/analyzer JSON formats
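Taken together, the acceptance signals imply a result record roughly like the one below. This is a sketch of one possible shape; the field names are assumptions, not a confirmed AgentV or skill-creator schema:

```python
# Illustrative blind-comparison result: winner field, per-output
# multi-dimensional rubric scores, and strengths/weaknesses.
result = {
    "winner": "B",
    "scores": {
        "A": {"content": 3.5, "structure": 4.0, "overall": 3.7},
        "B": {"content": 4.5, "structure": 4.0, "overall": 4.3},
    },
    "per_output": {
        "A": {"strengths": ["Clear prose"], "weaknesses": ["Missing one field"]},
        "B": {"strengths": ["Complete schema"], "weaknesses": []},
    },
}
```

The post-comparison analysis would consume a record like this, unblind the labels, and then explain the score gaps in prose.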
Non-Goals
- Replacing the `agentv compare` CLI (this complements it with richer qualitative comparison)
- Modifying the core evaluation engine
- Automated skill improvement (this surfaces insights, humans/optimizer act on them)
- Trigger evaluation comparison
Source Material
- Anthropic skill-creator comparator.md — blind comparison with dynamic rubrics
- Anthropic skill-creator analyzer.md — post-comparison analysis and benchmark pattern analysis
- Current AgentV: `plugins/agentv-dev/agents/optimizer-reflector.md`, `plugins/agentv-dev/skills/agentv-trace-analyst/SKILL.md`