
feat: blind A/B comparison with dynamic rubrics and post-comparison analysis #571

@christso

Description

Objective

Add blind A/B comparison capabilities to AgentV's evaluation workflow, adopting two techniques from Anthropic's skill-creator comparator.md and analyzer.md: bias-free blind comparison with dynamic rubric generation, and post-comparison analysis that explains WHY the winner won.

Architecture Boundary

external-first (new agents in plugins/agentv-dev/agents/ + skill enhancement)

AgentV's broader comparison scope

AgentV's agentv compare already supports N-way comparison across multiple providers (--group-by target), not just binary A/B. The blind comparison must preserve this: when comparing 3+ providers, randomize all labels (A, B, C...) and evaluate each pair or use a round-robin tournament. Do not regress to binary-only comparison.
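The round-robin requirement above can be sketched as follows. This is a minimal illustration, not AgentV's actual API: the function name `make_blind_pairs` and the dict shapes are invented for this example.

```python
import itertools
import random
import string


def make_blind_pairs(outputs_by_target, rng=None):
    """Assign a random anonymous label (A, B, C, ...) to each target,
    then emit every pairwise matchup for round-robin blind judging."""
    rng = rng or random.Random()
    targets = list(outputs_by_target)
    rng.shuffle(targets)  # randomize which target hides behind which label
    labels = string.ascii_uppercase[: len(targets)]
    mapping = dict(zip(labels, targets))  # kept CLI-side, hidden from judge
    pairs = [
        {
            "left": a,
            "right": b,
            "left_output": outputs_by_target[mapping[a]],
            "right_output": outputs_by_target[mapping[b]],
        }
        for a, b in itertools.combinations(labels, 2)
    ]
    return mapping, pairs


mapping, pairs = make_blind_pairs(
    {"baseline": "out-1", "candidate": "out-2", "experimental": "out-3"},
    rng=random.Random(42),
)
# 3 targets yield 3 round-robin pairs: (A, B), (A, C), (B, C)
```

Because labels are shuffled per run, the judge never learns which label is the baseline, yet the caller retains `mapping` for unblinding.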

What skill-creator has that AgentV doesn't

1. Blind comparison (comparator.md)

The comparator judges two outputs labeled A and B without knowing which is baseline vs candidate:

"You receive two outputs labeled A and B, but you do NOT know which skill
 produced which. This prevents bias toward a particular skill or approach.
 Your judgment is based purely on output quality and task completion."

AgentV's agentv compare shows results with target labels visible. This introduces confirmation bias — reviewers (human or LLM) tend to favor the "new" version.
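For the binary case, the blinding step can be as simple as the sketch below: the judge prompt mentions only labels A and B, and the label-to-target key stays on the caller's side for later unblinding. `blind_pair` and its signature are hypothetical, not part of AgentV or skill-creator.

```python
import random


def blind_pair(baseline_output, candidate_output, rng=None):
    """Build a judge prompt that only mentions A and B; return the
    label->target key separately so results can be unblinded later."""
    rng = rng or random.Random()
    flipped = rng.random() < 0.5
    a, b = ((candidate_output, baseline_output) if flipped
            else (baseline_output, candidate_output))
    key = {"A": "candidate" if flipped else "baseline",
           "B": "baseline" if flipped else "candidate"}
    prompt = (
        "You receive two outputs labeled A and B. You do NOT know which "
        "system produced which; judge purely on output quality and task "
        "completion.\n\n"
        f"Output A:\n{a}\n\nOutput B:\n{b}\n"
    )
    return prompt, key


prompt, key = blind_pair("old answer", "new answer")
# After the judge returns a winning label: winner_target = key[winning_label]
```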

2. Dynamic rubric generation

Instead of fixed metrics, the comparator generates task-specific rubrics:

Content Rubric (what the output contains):
| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
|---|---|---|---|
| Correctness | Major errors | Minor errors | Fully correct |
| Completeness | Missing key elements | Mostly complete | All elements present |

Structure Rubric (how the output is organized):
| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
|---|---|---|---|
| Organization | Disorganized | Reasonably organized | Clear, logical structure |

Criteria adapt to the task: PDF form → "Field alignment", "Text readability"; Data output → "Schema correctness", "Data types". AgentV's compare uses the same fixed evaluator scores for all task types.
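One way to model the adaptation is to merge fixed criteria with a task-specific set, as in this sketch. The general anchors come from the rubric above; the task-specific anchor wordings ("Fields misplaced", "Wrong schema", etc.) are invented examples, and the names are not a confirmed API.

```python
# General criteria that apply to most tasks, anchored at 1/3/5
# (anchors taken from the content rubric above).
FIXED_CONTENT_CRITERIA = {
    "Correctness": ("Major errors", "Minor errors", "Fully correct"),
    "Completeness": ("Missing key elements", "Mostly complete",
                     "All elements present"),
}

# Task-specific criteria; the anchor wordings here are illustrative.
TASK_SPECIFIC_CRITERIA = {
    "pdf_form": {
        "Field alignment": ("Fields misplaced", "Mostly aligned",
                            "All fields aligned"),
        "Text readability": ("Illegible", "Readable with effort",
                             "Cleanly readable"),
    },
    "data_output": {
        "Schema correctness": ("Wrong schema", "Minor deviations",
                               "Matches schema exactly"),
        "Data types": ("Wrong types", "Mostly correct types",
                       "All types correct"),
    },
}


def build_rubric(task_type):
    """Merge general and task-specific criteria into a 1/3/5 rubric."""
    criteria = dict(FIXED_CONTENT_CRITERIA)
    criteria.update(TASK_SPECIFIC_CRITERIA.get(task_type, {}))
    return {
        name: {"1 (Poor)": poor, "3 (Acceptable)": ok, "5 (Excellent)": good}
        for name, (poor, ok, good) in criteria.items()
    }


rubric = build_rubric("pdf_form")
# rubric now covers Correctness, Completeness, Field alignment, Text readability
```

In practice the judge LLM would generate the task-specific criteria itself; a table like this is simply the structure it fills in.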

3. Post-comparison analysis (analyzer.md — unblinding section)

After blind comparison, the analyzer "unblinds" results and explains what made the winner better:

"Read Both Skills → Read Both Transcripts → Analyze Instruction Following →
 Identify Winner Strengths → Identify Loser Weaknesses →
 Generate Improvement Suggestions (prioritized by impact)"

It scores instruction-following (1-10) and produces categorized suggestions (instructions, tools, examples, error_handling, structure, references) with priority levels (high/medium/low).
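A minimal sketch of that analysis payload, using the category and priority vocabularies listed above. The function name and field names are assumptions loosely modeled on skill-creator's analyzer JSON, not a confirmed schema.

```python
PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}
CATEGORIES = {"instructions", "tools", "examples",
              "error_handling", "structure", "references"}


def make_analysis(instruction_following_score, suggestions):
    """Validate and package a post-comparison analysis result,
    sorting suggestions so high-priority items come first."""
    if not 1 <= instruction_following_score <= 10:
        raise ValueError("instruction-following score must be 1-10")
    for s in suggestions:
        if s["category"] not in CATEGORIES:
            raise ValueError(f"unknown category: {s['category']}")
    return {
        "instruction_following_score": instruction_following_score,
        "suggestions": sorted(
            suggestions, key=lambda s: PRIORITY_ORDER[s["priority"]]
        ),
    }


analysis = make_analysis(7, [
    {"category": "examples", "priority": "low",
     "text": "Add a worked example for the edge case"},
    {"category": "instructions", "priority": "high",
     "text": "State the required output schema explicitly"},
])
# Suggestions are ordered by impact: high before medium before low.
```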

AgentV's optimizer-reflector does failure analysis (SIMBA/GEPA) but doesn't do post-comparison analysis explaining why one output beat another.

What to keep from AgentV

  • agentv compare CLI with pairwise and N-way comparison (the data pipeline is good)
  • agentv-trace-analyst for detailed trace analysis
  • SIMBA/GEPA patterns in optimizer-reflector for root cause diagnosis (complementary to post-comparison analysis)
  • --group-by target|dataset|test-id for multi-dimensional comparison

Design Latitude

  • May be a new agent pair (blind-comparator.md + comparison-analyzer.md) or enhancements to existing agents
  • May be invoked via a new skill, an enhancement to agentv-trace-analyst, or a flag on agentv compare
  • Choose how to randomize A/B labels (the CLI can shuffle, or the agent prompt can instruct shuffling)
  • Dynamic rubric generation can be a general-purpose feature or specific to blind comparison mode
  • Post-comparison analysis may be optional (only run when the user wants to understand close calls)
  • Choose output format — may follow skill-creator's JSON structure or adapt to AgentV's existing result format

Acceptance Signals

  • A blind comparison mode exists where the judge evaluates two outputs without knowing which is baseline/candidate
  • The comparison generates task-specific rubrics (not just fixed metrics) with content and structure dimensions
  • Each output receives a multi-dimensional score (content score, structure score, overall)
  • A post-comparison analysis can explain why the winner won and suggest prioritized improvements for the loser
  • The blind comparison result includes a winner field, structured rubric scores, and per-output strengths/weaknesses
  • The analysis output is compatible with the skill-creator's comparator/analyzer JSON formats
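Taken together, the signals above imply a result shape roughly like the following. The key names and values here are illustrative placeholders, not the skill-creator schema.

```python
import json

# Hypothetical blind comparison result: winner, per-output multi-dimensional
# scores, per-criterion rubric scores, and strengths/weaknesses.
result = {
    "winner": "A",
    "scores": {
        "A": {"content": 4.5, "structure": 4.0, "overall": 4.3},
        "B": {"content": 3.5, "structure": 4.0, "overall": 3.7},
    },
    "rubric_scores": {
        "Correctness": {"A": 5, "B": 3},
        "Organization": {"A": 4, "B": 4},
    },
    "outputs": {
        "A": {"strengths": ["fully correct"], "weaknesses": []},
        "B": {"strengths": ["clear layout"],
              "weaknesses": ["minor factual errors"]},
    },
}

print(json.dumps(result, indent=2))
```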

Non-Goals

  • Replacing agentv compare CLI (this complements it with richer qualitative comparison)
  • Modifying the core evaluation engine
  • Automated skill improvement (this surfaces insights, humans/optimizer act on them)
  • Trigger evaluation comparison

Source Material
