feat: unified agent-evaluation lifecycle skill (combine eval-orchestrator + optimizer) #573

Objective

Expand agentv-optimizer into a unified agent-evaluation lifecycle skill that covers the full evaluation improvement loop: run → grade → compare → analyze → review → optimize → re-run. Unlike skill-creator (which evaluates single skills in isolation), the lifecycle skill preserves AgentV's full environment capabilities: workspace isolation, multi-turn conversations, multi-provider comparison, tool trajectory scoring, code judges, and integration testing in real repos. This combines the currently fragmented eval-orchestrator and optimizer workflows into a single skill that matches Anthropic's skill-creator lifecycle pattern, while keeping agentv-eval-builder separate for test case authoring.

Evaluation Modes

The lifecycle skill handles two evaluation modes with the same phases:

| | Agent/Workspace Evaluation (PRIMARY) | Skill Evaluation (SECONDARY) |
| --- | --- | --- |
| Input | EVAL.yaml with workspace config | evals.json (drop-in from skill-creator) |
| Tests | Agent working in cloned repos, multi-turn conversations | Single-prompt skill output |
| Evaluators | Code judges, tool trajectory, LLM judge, deterministic | LLM judge, deterministic assertions |
| Targets | Multiple providers (Claude, GPT, Copilot, custom CLI) | With-skill vs. without-skill |
| Optimizing | Agent task prompts, system prompts, tool configs | SKILL.md content |

The lifecycle phases (run → grade → compare → analyze → review → optimize) are the same for both modes. The skill auto-detects the mode from the input format (EVAL.yaml vs evals.json).
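A minimal sketch of that auto-detection, assuming it keys off the input file name (the actual heuristic is the implementer's choice; all names below are illustrative assumptions):

```ts
// Hedged sketch only — not the actual AgentV implementation.
type EvalMode = "agent-workspace" | "skill";

function detectMode(inputPath: string): EvalMode {
  // EVAL.yaml drives the primary agent/workspace mode;
  // evals.json is the skill-creator drop-in (secondary) mode.
  return /\.ya?ml$/i.test(inputPath) ? "agent-workspace" : "skill";
}
```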

Existing EVAL.yaml users should see zero disruption — the combined skill expands the optimizer with richer grading, comparison, and analysis phases. It does NOT narrow scope to skill-only evaluation.

Architecture Boundary

external-first (skill + agent prompt changes in plugins/agentv-dev/)

Why combine

AgentV currently splits the agent evaluation lifecycle across four skills:

| Step in the loop | Current AgentV skill | Problem |
| --- | --- | --- |
| Create test cases | agentv-eval-builder | Fine — distinct task |
| Run evals | agentv-eval-orchestrator | User must switch skills |
| Analyze results | agentv-trace-analyst | User must switch skills |
| Optimize prompts | agentv-optimizer | Owns a 5-phase workflow but doesn't own run/grade/compare |

Anthropic's skill-creator handles the entire lifecycle in one SKILL.md, dispatching to sub-agents (grader, comparator, analyzer) as needed. The user invokes one skill and the skill orchestrates the phases.

Problems with the current split:

  1. Users must know which skill to invoke at each step
  2. Context is lost between skill invocations (eval results, iteration history, optimization strategy)
  3. Trigger keywords overlap between the four skills (#572: disambiguate agentv eval skill triggers from skill-creator)
  4. The optimizer already has a 5-phase workflow with 4 sub-agents — it's naturally the orchestration layer but stops short of owning run, grade, compare, and review

Proposed Phases

Expand the optimizer's 5-phase workflow to 8 phases covering the full lifecycle:

Phase 1: Discovery (existing)

Agent: optimizer-discovery

  • Analyze eval file, understand agent purpose
  • Challenge eval quality assumptions
  • Triage failures (must-fix / nice-to-have / eval-issue)
  • Scope the optimization

Phase 2: Run Baseline (absorb from eval-orchestrator)

  • Run agentv eval run or agentv prompt eval against baseline and candidate
  • Support both EVAL.yaml and evals.json formats
  • Multi-target evaluation for paired comparison
  • Document baseline isolation (discovery-path contamination)
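
For example (only the subcommand names below are taken from this issue; file arguments beyond `agentv eval run evals.json` are assumptions):

```sh
# Hedged sketch — flags and file names are illustrative, not confirmed CLI syntax.
agentv eval run EVAL.yaml    # agent/workspace mode (primary)
agentv eval run evals.json   # skill-creator drop-in (secondary)
```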

Phase 3: Grade (enhanced eval-judge — #570)

Agent: eval-judge (enhanced)

Phase 4: Compare (blind A/B — #571)

Agents: blind-comparator (new), comparison-analyzer (new)

  • Blind comparison — judge doesn't know which is baseline
  • Dynamic task-specific rubric generation
  • Multi-dimensional scoring (content + structure)
  • Post-comparison analysis (unblinding, instruction-following scoring)
  • Skip this phase when only one run exists
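
A minimal sketch of the blinding step above, assuming the judge sees anonymized "A"/"B" labels (names are illustrative, not the actual agent protocol):

```ts
// Hedged sketch: randomize which run is shown as "a" so the judge
// cannot tell baseline from candidate; unblind only after scoring.
function blindPair(baseline: string, candidate: string) {
  const flip = Math.random() < 0.5;
  return {
    a: flip ? candidate : baseline,
    b: flip ? baseline : candidate,
    unblind: (winner: "a" | "b") =>
      (winner === "a") === flip ? "candidate" : "baseline",
  };
}
```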

Phase 5: Analyze (standalone analyzer — #567 + existing SIMBA/GEPA)

Agent: optimizer-reflector (enhanced)

  • Existing SIMBA pattern (self-introspective failure analysis)
  • Existing GEPA pattern (trace reflection with diagnosis categories)
  • NEW: Deterministic-upgrade suggestions (LLM-judge → contains/regex/is_json)
  • NEW: Benchmark pattern analysis (always-pass, always-fail, flaky detection; sketched after this list)
  • NEW: Weak assertion identification
  • Trend analysis across iterations (existing stagnation detection)
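
A minimal sketch of the benchmark pattern classification, assuming per-eval pass/fail results are collected across runs (the names are assumptions):

```ts
// Hedged sketch of always-pass / always-fail / flaky detection across runs.
type Pattern = "always-pass" | "always-fail" | "flaky";

function classify(passes: boolean[]): Pattern {
  if (passes.every((p) => p)) return "always-pass";
  if (passes.every((p) => !p)) return "always-fail";
  return "flaky"; // mixed outcomes across runs
}
```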

Phase 6: Review (human review checkpoint — #568)

Phase 7: Optimize (existing optimizer-curator + optimizer-polish)

Agents: optimizer-curator, optimizer-polish

  • Existing surgical prompt editing (ADD/UPDATE/DELETE/NEGATIVE CONSTRAINT)
  • Existing generalization and polish
  • Existing variant tracking (2-3 promising variants)

Phase 8: Re-run + Iterate

  • Loop back to Phase 2 with the modified skill/prompt
  • Compare against previous iteration's baseline
  • Exit when: target pass rate reached, human approves, or stagnation detected
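
The exit logic, as a hedged sketch (thresholds and function names are assumptions, not the skill's actual contract):

```ts
// Hedged sketch of Phase 8's exit conditions: target pass rate reached,
// human approval, or stagnation (no improvement across recent iterations).
function shouldStop(
  passRate: number,
  target: number,
  humanApproved: boolean,
  recentRates: number[],
): boolean {
  const stagnant =
    recentRates.length >= 3 &&
    Math.max(...recentRates) - Math.min(...recentRates) < 0.01;
  return passRate >= target || humanApproved || stagnant;
}
```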

Agent Inventory

Existing agents (keep):

  • optimizer-discovery (Phase 1)
  • eval-candidate and eval-judge (Phases 2–3)
  • optimizer-reflector (Phase 5)
  • optimizer-curator and optimizer-polish (Phase 7)

New agents:

  • blind-comparator (Phase 4)
  • comparison-analyzer (Phase 4)

Total: 8 agents, dispatched on-demand by the skill. The SKILL.md is the orchestration layer.

What stays separate

  • agentv-eval-builder — Creating eval files is a distinct task. Users often create evals without running the lifecycle.
  • agentv-trace-analyst — Ad-hoc analysis outside the lifecycle loop. Useful for one-off investigations.
  • agentv-chat-to-eval — Conversation conversion, unrelated to the lifecycle.
  • agentv-onboarding — Setup, unrelated.

What gets deprecated

  • agentv-eval-orchestrator — Its "run evals" capability is absorbed into Phase 2 of the combined skill. The skill file can be kept temporarily with a description pointing to the combined skill, then removed.

Design Latitude

  • Choose the skill name: expand agentv-optimizer (rename to agentv-skill-eval or similar) vs. create a new skill and deprecate the optimizer
  • Choose how to handle progressive disclosure: the SKILL.md should describe all 8 phases but load agent references on-demand
  • Choose which phases are skippable (e.g., Phase 4 blind comparison, Phase 6 human review)
  • Choose whether to implement all phases at once or incrementally (add phases to the existing optimizer skill over multiple PRs)
  • Phase 2 can delegate to agentv eval run CLI or use the agent-mode approach — implementer's choice

Relationship to Existing Issues

This issue is the orchestration layer that ties the individual capability issues together:

| Issue | Becomes | Phase |
| --- | --- | --- |
| #570 (eval-judge enhancement) | Enhanced eval-judge agent | Phase 3 |
| #571 (blind comparison) | New comparator + analyzer agents | Phase 4 |
| #567 (eval analyzer) | Enhanced optimizer-reflector agent | Phase 5 |
| #568 (review checkpoint) | Review phase in the skill | Phase 6 |
| #565 (companion artifacts) | Artifact output across phases | Phases 3, 5 |
| #572 (trigger disambiguation) | Simpler — one skill to disambiguate, not four | |
| #564 (workflow guide) | Documents this skill's workflow for end users | |

Individual issues can still be implemented as separate PRs (one agent enhancement per PR). This issue adds the SKILL.md that orchestrates them into the lifecycle.

Migration Reference in Skill

The combined skill's references/ directory should include a migration reference (e.g., references/migrating-from-skill-creator.md) that the skill can load when a user asks about migration or is working with skill-creator's evals.json. This reference should cover:

  1. Drop-in replacement: agentv eval run evals.json runs skill-creator evals directly — no conversion
  2. What you gain: workspace isolation, code judges, tool trajectory, N-way provider comparison, multi-turn evaluation
  3. Artifact compatibility: grading.json/benchmark.json output is readable by skill-creator's eval-viewer
  4. Graduating to EVAL.yaml: when and how to convert evals.json to EVAL.yaml for richer evaluation features
  5. What stays in skill-creator: trigger optimization, .skill packaging, skill authoring — AgentV does not replace these

Acceptance Signals

  • A single skill invocation runs the complete agent evaluation lifecycle (run → grade → compare → analyze → review → optimize → re-run)
  • The skill dispatches to specialized agents at each phase (not a monolithic prompt)
  • Users can enter the lifecycle at any phase (e.g., "I already have results, just analyze them")
  • The skill produces companion artifacts (#565: grading, timing, benchmark) at relevant phases (grading.json after Phase 3, benchmark.json after Phase 5)
  • Phase 6 (human review) can be skipped with a flag or prompt
  • The skill handles both evals.json and EVAL.yaml input formats
  • agentv-eval-orchestrator is deprecated or merged
  • The trigger description clearly disambiguates from skill-creator (#572)

Non-Goals

  • Absorbing agentv-eval-builder (test case creation stays separate)
  • Absorbing agentv-trace-analyst (ad-hoc analysis stays separate)
  • Adding trigger optimization (future work, needs skill discovery story)
  • Adding packaging/distribution (.skill bundles)
  • Core runtime changes (this is all skill/agent prompt work)

Source Material

  • Anthropic skill-creator SKILL.md — unified lifecycle skill pattern
  • Current AgentV optimizer: plugins/agentv-dev/skills/agentv-optimizer/SKILL.md (467 lines, 5-phase workflow)
  • Current AgentV eval-orchestrator: plugins/agentv-dev/skills/agentv-eval-orchestrator/SKILL.md (77 lines)

Critical: eval-orchestrator capabilities that must be preserved

When absorbing eval-orchestrator, the combined skill must NOT lose these capabilities:

  1. Workspace evaluation — Clone repos, run setup/teardown scripts, evaluate agent behavior in real project contexts. This is AgentV's key differentiator vs. skill-creator.
  2. Multi-provider targets — Run the same eval against Claude, GPT, Copilot, Gemini, custom CLI agents simultaneously. Not just with-skill vs without-skill.
  3. Multi-turn conversation evaluation — Test multi-turn agent interactions with conversation_id tracking, not just single-prompt evaluation.
  4. Code judges — First-class Python/TypeScript evaluator scripts via defineCodeJudge(), not just LLM-based grading (see the sketch after this list).
  5. Tool trajectory scoring — Evaluate tool call sequences (correct tools used, correct order, no unnecessary calls).
  6. Workspace file change tracking — Evaluate agent output by diffing workspace files, not just text output.
  7. All eval formats — EVAL.yaml, evals.json, JSONL datasets. Not just evals.json.
  8. Agent-mode AND CLI-mode — Agent mode (no API keys, uses eval-candidate + eval-judge agents) and CLI mode (end-to-end with API keys).
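
A minimal TypeScript sketch of a code judge. defineCodeJudge() is named in this issue, but its real signature is unknown here, so the import path, callback shape, and return type are all assumptions:

```ts
// Hedged sketch — everything below is an illustrative assumption,
// not the actual AgentV API.
import { defineCodeJudge } from "agentv"; // import path is an assumption

export default defineCodeJudge(async ({ workspace }) => {
  // Judge from workspace state (file changes), not just the agent's text output.
  const created = await workspace.exists("migrations/001_init.sql");
  return {
    score: created ? 1 : 0,
    reason: created ? "migration file present" : "migration file missing",
  };
});
```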

The combined skill is an EXPANSION of eval-orchestrator's scope (adding grading, comparison, analysis, review phases), not a REPLACEMENT with narrower skill-creator patterns.

Relationship to Skill-Creator

The combined skill is NOT a replacement for skill-creator. They are complementary:

  • Skill-creator writes evals.json for a skill and runs simple paired evals via claude -p
  • AgentV's lifecycle skill can run the SAME evals.json with richer evaluation: workspace isolation, code judges, multi-provider comparison, tool trajectory scoring
  • AgentV outputs grading/benchmark artifacts that skill-creator's eval-viewer can read

The upgrade path: a user starts with skill-creator's simple eval loop, then graduates to AgentV's lifecycle skill when they need environment-level evaluation. Artifacts are portable between the two.

The combined skill should accept evals.json as input (already supported by eval-orchestrator) and produce artifacts compatible with skill-creator's tooling (ensured by #565's superset schemas).
