feat: unified agent-evaluation lifecycle skill (combine eval-orchestrator + optimizer) #573
Description
Objective
Expand agentv-optimizer into a unified agent-evaluation lifecycle skill that covers the full evaluation improvement loop: run → grade → compare → analyze → review → optimize → re-run. Unlike skill-creator (which evaluates single skills in isolation), the lifecycle skill preserves AgentV's full environment capabilities: workspace isolation, multi-turn conversations, multi-provider comparison, tool trajectory scoring, code judges, and integration testing in real repos. This combines the currently fragmented eval-orchestrator and optimizer workflows into a single skill that matches Anthropic's skill-creator lifecycle pattern, while keeping agentv-eval-builder separate for test case authoring.
Evaluation Modes
The lifecycle skill handles two evaluation modes with the same phases:
| | Agent/Workspace Evaluation (PRIMARY) | Skill Evaluation (SECONDARY) |
|---|---|---|
| Input | EVAL.yaml with workspace config | evals.json (drop-in from skill-creator) |
| Tests | Agent working in cloned repos, multi-turn conversations | Single-prompt skill output |
| Evaluators | Code judges, tool trajectory, LLM judge, deterministic | LLM judge, deterministic assertions |
| Targets | Multiple providers (Claude, GPT, Copilot, custom CLI) | With-skill vs without-skill |
| Optimizing | Agent task prompts, system prompts, tool configs | SKILL.md content |
The lifecycle phases (run → grade → compare → analyze → review → optimize) are the same for both modes. The skill auto-detects the mode from the input format (EVAL.yaml vs evals.json).
Existing EVAL.yaml users should see zero disruption — the combined skill expands the optimizer with richer grading, comparison, and analysis phases. It does NOT narrow scope to skill-only evaluation.
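The auto-detection described above could be as simple as dispatching on the input file's format. A minimal sketch, assuming detection is file-name based (the function name and mode labels are illustrative, not AgentV's actual API):

```python
from pathlib import Path

def detect_eval_mode(eval_path: str) -> str:
    """Illustrative mode detection: YAML inputs (EVAL.yaml) imply the
    agent/workspace mode; JSON/JSONL inputs (skill-creator's evals.json,
    JSONL datasets) imply the skill mode. Names are assumptions."""
    name = Path(eval_path).name.lower()
    if name.endswith((".yaml", ".yml")):
        return "agent-workspace"   # EVAL.yaml with workspace config
    if name.endswith((".json", ".jsonl")):
        return "skill"             # evals.json drop-in from skill-creator
    raise ValueError(f"Unrecognized eval input format: {eval_path}")
```

Both modes then flow through the same run → grade → compare → analyze → review → optimize phases; only the inputs, evaluators, and targets differ per the table above.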
Architecture Boundary
external-first (skill + agent prompt changes in plugins/agentv-dev/)
Why combine
AgentV currently splits the agent evaluation lifecycle across four skills:
| Step in the loop | Current AgentV skill | Problem |
|---|---|---|
| Create test cases | agentv-eval-builder | Fine — distinct task |
| Run evals | agentv-eval-orchestrator | User must switch skills |
| Analyze results | agentv-trace-analyst | User must switch skills |
| Optimize prompts | agentv-optimizer | Owns a 5-phase workflow but doesn't own run/grade/compare |
Anthropic's skill-creator handles the entire lifecycle in one SKILL.md, dispatching to sub-agents (grader, comparator, analyzer) as needed. The user invokes one skill and the skill orchestrates the phases.
Problems with the current split:
- Users must know which skill to invoke at each step
- Context is lost between skill invocations (eval results, iteration history, optimization strategy)
- Trigger keywords overlap between the four skills (fix: disambiguate agentv eval skill triggers from skill-creator #572)
- The optimizer already has a 5-phase workflow with 4 sub-agents — it's naturally the orchestration layer but stops short of owning run, grade, compare, and review
Proposed Phases
Expand the optimizer's 5-phase workflow to 8 phases covering the full lifecycle:
Phase 1: Discovery (existing)
Agent: optimizer-discovery
- Analyze eval file, understand agent purpose
- Challenge eval quality assumptions
- Triage failures (must-fix / nice-to-have / eval-issue)
- Scope the optimization
Phase 2: Run Baseline (absorb from eval-orchestrator)
- Run `agentv eval run` or `agentv prompt eval` against baseline and candidate
- Support both EVAL.yaml and evals.json formats
- Multi-target evaluation for paired comparison
- Document baseline isolation (discovery-path contamination)
Phase 3: Grade (enhanced eval-judge — #570)
Agent: eval-judge (enhanced)
- Per-assertion structured evidence ({text, passed, evidence})
- Claims extraction and verification
- Eval self-critique ("weak assertion = false confidence")
- Surface vs substance guards
- User notes integration
- Output grading.json artifact (feat: skill-eval companion artifacts (grading, timing, benchmark) #565)
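Only the per-assertion `{text, passed, evidence}` shape comes from the proposal above; a sketch of what a grading.json artifact might look like, where the surrounding fields (`eval_id`, `verdict`) are assumptions:

```python
import json

# Illustrative grading.json payload. The {text, passed, evidence}
# assertion shape is from the proposal; eval_id and verdict are
# hypothetical fields for the sketch.
grading = {
    "eval_id": "workspace-refactor-001",
    "assertions": [
        {"text": "Output mentions the renamed function",
         "passed": True,
         "evidence": "Line 12: 'renamed parse_config to load_config'"},
        {"text": "No unrelated files were modified",
         "passed": False,
         "evidence": "Diff touches README.md, which the task did not require"},
    ],
    "verdict": "fail",  # in this sketch, any failed assertion fails the run
}

with open("grading.json", "w") as f:
    json.dump(grading, f, indent=2)
```

The evidence string is what makes the grade auditable: the judge must quote where in the output each assertion passed or failed, rather than emit a bare boolean.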
Phase 4: Compare (blind A/B — #571)
Agents: blind-comparator (new), comparison-analyzer (new)
- Blind comparison — judge doesn't know which is baseline
- Dynamic task-specific rubric generation
- Multi-dimensional scoring (content + structure)
- Post-comparison analysis (unblinding, instruction-following scoring)
- Skip this phase when only one run exists
Phase 5: Analyze (standalone analyzer — #567 + existing SIMBA/GEPA)
Agent: optimizer-reflector (enhanced)
- Existing SIMBA pattern (self-introspective failure analysis)
- Existing GEPA pattern (trace reflection with diagnosis categories)
- NEW: Deterministic-upgrade suggestions (LLM-judge → contains/regex/is_json)
- NEW: Benchmark pattern analysis (always-pass, always-fail, flaky detection)
- NEW: Weak assertion identification
- Trend analysis across iterations (existing stagnation detection)
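The benchmark pattern analysis above (always-pass, always-fail, flaky) reduces to classifying each scenario's pass/fail history across runs. A minimal sketch with assumed thresholds (the flip-count cutoff and label names are illustrative):

```python
def classify_scenarios(history: dict[str, list[bool]]) -> dict[str, str]:
    """Classify each scenario from its pass/fail history across runs.
    Thresholds and labels are assumptions for illustration."""
    labels = {}
    for scenario, results in history.items():
        if all(results):
            labels[scenario] = "always-pass"   # candidate for deterministic upgrade
        elif not any(results):
            labels[scenario] = "always-fail"   # must-fix, or possibly an eval-issue
        else:
            # Count pass/fail flips between consecutive runs
            flips = sum(a != b for a, b in zip(results, results[1:]))
            labels[scenario] = "flaky" if flips > 1 else "trending"
    return labels
```

Always-pass scenarios are the natural targets for the deterministic-upgrade suggestion (LLM-judge → contains/regex/is_json), since their expected output is stable.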
Phase 6: Review (human review checkpoint — #568)
- Present results to human reviewer (reference the self-contained HTML dashboard from #562 if available)
- Collect structured feedback (feedback.json artifact)
- Gate: human approves iteration or requests changes
- This phase may be skipped in automated/CI mode
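The proposal names a feedback.json artifact but not its schema; one possible shape, where every field is an assumption for illustration:

```python
import json

# Hypothetical feedback.json collected at the human review gate.
# The proposal only names the artifact; these fields are assumptions.
feedback = {
    "iteration": 3,
    "decision": "request-changes",  # or "approve"
    "notes": [
        "Variant B over-fits scenario 7; keep the negative constraint instead",
        "Assertion 'mentions tests' is too weak, upgrade to a regex check",
    ],
}

with open("feedback.json", "w") as f:
    json.dump(feedback, f, indent=2)
```

Structured feedback (rather than free-form chat) lets Phase 7's optimizer-curator consume the reviewer's notes directly in the next iteration.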
Phase 7: Optimize (existing optimizer-curator + optimizer-polish)
Agents: optimizer-curator, optimizer-polish
- Existing surgical prompt editing (ADD/UPDATE/DELETE/NEGATIVE CONSTRAINT)
- Existing generalization and polish
- Existing variant tracking (2-3 promising variants)
Phase 8: Re-run + Iterate
- Loop back to Phase 2 with the modified skill/prompt
- Compare against previous iteration's baseline
- Exit when: target pass rate reached, human approves, or stagnation detected
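Phases 2 through 8 can be sketched as a single loop with the three exit conditions above. The phase functions, thresholds, and stagnation window here are all assumptions; the skill would dispatch each phase to its sub-agent rather than call Python functions:

```python
# Illustrative orchestration loop for Phases 2-8 (a sketch, not the skill).
def lifecycle(run, grade, compare, analyze, review, optimize,
              target_pass_rate=0.9, max_iters=10, stall_window=3):
    history = []
    prev = None
    for _ in range(max_iters):
        results = run()                      # Phase 2: run baseline/candidate
        graded = grade(results)              # Phase 3: grade with evidence
        if prev is not None:
            compare(prev, graded)            # Phase 4: skip when only one run exists
        analyze(graded, history)             # Phase 5: reflection + pattern analysis
        if not review(graded):               # Phase 6: human gate (skippable in CI)
            break
        history.append(graded["pass_rate"])
        if graded["pass_rate"] >= target_pass_rate:
            return "target-reached"
        if len(history) >= stall_window and len(set(history[-stall_window:])) == 1:
            return "stagnated"               # no movement across the window
        prev = graded
        optimize(graded)                     # Phase 7: surgical prompt edits
    return "stopped"                         # Phase 8 loops back to run()
```

The loop makes the exit conditions explicit: target pass rate, reviewer rejection, or a flat pass-rate window (the optimizer's existing stagnation detection).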
Agent Inventory
Existing agents (keep):
- `optimizer-discovery` — Phase 1
- `eval-candidate` — Phase 2 (candidate LLM for agent-mode evals)
- `eval-judge` — Phase 3 (enhanced with the #570 patterns: claims extraction, eval critique, evidence format)
- `optimizer-reflector` — Phase 5 (enhanced with the #567 patterns: weak assertions, flaky scenarios)
- `optimizer-curator` — Phase 7
- `optimizer-polish` — Phase 7
New agents:
- `blind-comparator` — Phase 4 (from #571, blind A/B comparison with dynamic rubrics)
- `comparison-analyzer` — Phase 4 (from #571, post-comparison unblinding)
Total: 8 agents, dispatched on-demand by the skill. The SKILL.md is the orchestration layer.
What stays separate
- agentv-eval-builder — Creating eval files is a distinct task. Users often create evals without running the lifecycle.
- agentv-trace-analyst — Ad-hoc analysis outside the lifecycle loop. Useful for one-off investigations.
- agentv-chat-to-eval — Conversation conversion, unrelated to the lifecycle.
- agentv-onboarding — Setup, unrelated.
What gets deprecated
- agentv-eval-orchestrator — Its "run evals" capability is absorbed into Phase 2 of the combined skill. The skill file can be kept temporarily with a description pointing to the combined skill, then removed.
Design Latitude
- Choose the skill name: expand `agentv-optimizer` (rename to `agentv-skill-eval` or similar) vs. create a new skill and deprecate the optimizer
- Choose how to handle progressive disclosure: the SKILL.md should describe all 8 phases but load agent references on-demand
- Choose which phases are skippable (e.g., Phase 4 blind comparison, Phase 6 human review)
- Choose whether to implement all phases at once or incrementally (add phases to the existing optimizer skill over multiple PRs)
- Phase 2 can delegate to the `agentv eval run` CLI or use the agent-mode approach — implementer's choice
Relationship to Existing Issues
This issue is the orchestration layer that ties the individual capability issues together:
| Issue | Becomes | Phase |
|---|---|---|
| #570 (eval-judge enhancement) | Enhanced eval-judge agent | Phase 3 |
| #571 (blind comparison) | New comparator + analyzer agents | Phase 4 |
| #567 (eval analyzer) | Enhanced optimizer-reflector agent | Phase 5 |
| #568 (review checkpoint) | Review phase in the skill | Phase 6 |
| #565 (companion artifacts) | Artifact output across phases | Phases 3, 5 |
| #572 (trigger disambiguation) | Simpler — one skill to disambiguate, not four | — |
| #564 (workflow guide) | Documents this skill's workflow for end users | — |
Individual issues can still be implemented as separate PRs (one agent enhancement per PR). This issue adds the SKILL.md that orchestrates them into the lifecycle.
Migration Reference in Skill
The combined skill's references/ directory should include a migration reference (e.g., references/migrating-from-skill-creator.md) that the skill can load when a user asks about migration or is working with skill-creator's evals.json. This reference should cover:
- Drop-in replacement: `agentv eval run evals.json` runs skill-creator evals directly — no conversion
- What you gain: workspace isolation, code judges, tool trajectory, N-way provider comparison, multi-turn evaluation
- Artifact compatibility: grading.json/benchmark.json output is readable by skill-creator's eval-viewer
- Graduating to EVAL.yaml: when and how to convert evals.json to EVAL.yaml for richer evaluation features
- What stays in skill-creator: trigger optimization, .skill packaging, skill authoring — AgentV does not replace these
Acceptance Signals
- A single skill invocation runs the complete agent evaluation lifecycle (run → grade → compare → analyze → review → optimize → re-run)
- The skill dispatches to specialized agents at each phase (not a monolithic prompt)
- Users can enter the lifecycle at any phase (e.g., "I already have results, just analyze them")
- The skill produces companion artifacts (feat: skill-eval companion artifacts (grading, timing, benchmark) #565) at relevant phases (grading.json after Phase 3, benchmark.json after Phase 5)
- Phase 6 (human review) can be skipped with a flag or prompt
- The skill handles both evals.json and EVAL.yaml input formats
- agentv-eval-orchestrator is deprecated or merged
- The trigger description clearly disambiguates from skill-creator (fix: disambiguate agentv eval skill triggers from skill-creator #572)
Non-Goals
- Absorbing agentv-eval-builder (test case creation stays separate)
- Absorbing agentv-trace-analyst (ad-hoc analysis stays separate)
- Adding trigger optimization (future work, needs skill discovery story)
- Adding packaging/distribution (.skill bundles)
- Core runtime changes (this is all skill/agent prompt work)
Source Material
- Anthropic skill-creator SKILL.md — unified lifecycle skill pattern
- Current AgentV optimizer: `plugins/agentv-dev/skills/agentv-optimizer/SKILL.md` (467 lines, 5-phase workflow)
- Current AgentV eval-orchestrator: `plugins/agentv-dev/skills/agentv-eval-orchestrator/SKILL.md` (77 lines)
Critical: eval-orchestrator capabilities that must be preserved
When absorbing eval-orchestrator, the combined skill must NOT lose these capabilities:
- Workspace evaluation — Clone repos, run setup/teardown scripts, evaluate agent behavior in real project contexts. This is AgentV's key differentiator vs. skill-creator.
- Multi-provider targets — Run the same eval against Claude, GPT, Copilot, Gemini, custom CLI agents simultaneously. Not just with-skill vs without-skill.
- Multi-turn conversation evaluation — Test multi-turn agent interactions with `conversation_id` tracking, not just single-prompt evaluation.
- Code judges — First-class Python/TypeScript evaluator scripts via `defineCodeJudge()`, not just LLM-based grading.
- Tool trajectory scoring — Evaluate tool call sequences (correct tools used, correct order, no unnecessary calls).
- Workspace file change tracking — Evaluate agent output by diffing workspace files, not just text output.
- All eval formats — EVAL.yaml, evals.json, JSONL datasets. Not just evals.json.
- Agent-mode AND CLI-mode — Agent mode (no API keys, uses eval-candidate + eval-judge agents) and CLI mode (end-to-end with API keys).
The combined skill is an EXPANSION of eval-orchestrator's scope (adding grading, comparison, analysis, review phases), not a REPLACEMENT with narrower skill-creator patterns.
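To make the tool-trajectory capability above concrete, here is a sketch of what such a scorer checks: expected tools used in order, with a penalty for unnecessary calls. This is not AgentV's actual scorer; the function, weights, and tool names are illustrative:

```python
def score_trajectory(expected: list[str], actual: list[str]) -> float:
    """Illustrative tool-trajectory score: rewards covering the expected
    tool sequence in order, penalizes extra calls. Weights are assumptions."""
    # Ordered coverage: how much of the expected sequence appears in order
    i = 0
    for tool in actual:
        if i < len(expected) and tool == expected[i]:
            i += 1
    coverage = i / len(expected) if expected else 1.0
    # Penalty for unnecessary calls beyond the expected count
    extra = max(0, len(actual) - len(expected))
    penalty = extra / max(len(actual), 1)
    return max(0.0, coverage - 0.5 * penalty)
```

A code judge doing this kind of check over the recorded tool calls is something an LLM judge handles poorly, which is why preserving first-class evaluator scripts matters.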
Relationship to Skill-Creator
The combined skill is NOT a replacement for skill-creator. They are complementary:
- Skill-creator writes `evals.json` for a skill and runs simple paired evals via `claude -p`
- AgentV's lifecycle skill can run the SAME `evals.json` with richer evaluation: workspace isolation, code judges, multi-provider comparison, tool trajectory scoring
- AgentV outputs grading/benchmark artifacts that skill-creator's `eval-viewer` can read
The upgrade path: a user starts with skill-creator's simple eval loop, then graduates to AgentV's lifecycle skill when they need environment-level evaluation. Artifacts are portable between the two.
The combined skill should accept evals.json as input (already supported by eval-orchestrator) and produce artifacts compatible with skill-creator's tooling (ensured by #565's superset schemas).