feat: unified agent-evaluation lifecycle skill (combine eval-orchestrator + optimizer) #573
Description
Objective
Expand agentv-optimizer into a unified agent-evaluation lifecycle skill that covers the full evaluation improvement loop: run → grade → compare → analyze → review → optimize → re-run. Unlike skill-creator (which evaluates single skills in isolation), the lifecycle skill preserves AgentV's full environment capabilities: workspace isolation, multi-turn conversations, multi-provider comparison, tool trajectory scoring, code judges, and integration testing in real repos. This combines the currently fragmented eval-orchestrator and optimizer workflows into a single skill that matches Anthropic's skill-creator lifecycle pattern, while keeping agentv-eval-builder separate for test case authoring.
Evaluation Modes
The lifecycle skill handles two evaluation modes with the same phases:
| | Agent/Workspace Evaluation (PRIMARY) | Skill Evaluation (SECONDARY) |
|---|---|---|
| Input | EVAL.yaml with workspace config | evals.json (drop-in from skill-creator) |
| Tests | Agent working in cloned repos, multi-turn conversations | Single-prompt skill output |
| Evaluators | Code judges, tool trajectory, LLM judge, deterministic | LLM judge, deterministic assertions |
| Targets | Multiple providers (Claude, GPT, Copilot, custom CLI) | With-skill vs without-skill |
| Optimizing | Agent task prompts, system prompts, tool configs | SKILL.md content |
The lifecycle phases (run → grade → compare → analyze → review → optimize) are the same for both modes. The skill auto-detects the mode from the input format (EVAL.yaml vs evals.json).
Existing EVAL.yaml users should see zero disruption — the combined skill expands the optimizer with richer grading, comparison, and analysis phases. It does NOT narrow scope to skill-only evaluation.
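The auto-detection described above could be as simple as dispatching on the input file's format. A minimal sketch, assuming detection is file-name based (the function name and mode labels are illustrative, not AgentV's actual API):

```python
from pathlib import Path

def detect_eval_mode(eval_path: str) -> str:
    """Illustrative mode detection: YAML inputs (EVAL.yaml) imply the
    agent/workspace mode; JSON/JSONL inputs (skill-creator's evals.json,
    JSONL datasets) imply the skill mode. Names are assumptions."""
    name = Path(eval_path).name.lower()
    if name.endswith((".yaml", ".yml")):
        return "agent-workspace"   # EVAL.yaml with workspace config
    if name.endswith((".json", ".jsonl")):
        return "skill"             # evals.json drop-in from skill-creator
    raise ValueError(f"Unrecognized eval input format: {eval_path}")
```

Both modes then flow through the same run → grade → compare → analyze → review → optimize phases; only the inputs, evaluators, and targets differ per the table above.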
Architecture Boundary
external-first (skill + agent prompt changes in plugins/agentv-dev/)
Why combine
AgentV currently splits the agent evaluation lifecycle across four skills:
| Step in the loop | Current AgentV skill | Problem |
|---|---|---|
| Create test cases | agentv-eval-builder | Fine — distinct task |
| Run evals | agentv-eval-orchestrator | User must switch skills |
| Analyze results | agentv-trace-analyst | User must switch skills |
| Optimize prompts | agentv-optimizer | Owns a 5-phase workflow but doesn't own run/grade/compare |
Anthropic's skill-creator handles the entire lifecycle in one SKILL.md, dispatching to sub-agents (grader, comparator, analyzer) as needed. The user invokes one skill and the skill orchestrates the phases.
Problems with the current split:
- Users must know which skill to invoke at each step
- Context is lost between skill invocations (eval results, iteration history, optimization strategy)
- Trigger keywords overlap between the four skills (fix: disambiguate agentv eval skill triggers from skill-creator #572)
- The optimizer already has a 5-phase workflow with 4 sub-agents — it's naturally the orchestration layer but stops short of owning run, grade, compare, and review
Proposed Phases
Expand the optimizer's 5-phase workflow to 8 phases covering the full lifecycle:
Phase 1: Discovery (existing)
Agent: optimizer-discovery
- Analyze eval file, understand agent purpose
- Challenge eval quality assumptions
- Triage failures (must-fix / nice-to-have / eval-issue)
- Scope the optimization
Phase 2: Run Baseline (absorb from eval-orchestrator)
- Run `agentv eval run` or `agentv prompt eval` against baseline and candidate
- Support both EVAL.yaml and evals.json formats
- Multi-target evaluation for paired comparison
- Document baseline isolation (discovery-path contamination)
Phase 3: Grade (enhanced eval-judge — #570)
Agent: eval-judge (enhanced)
- Per-assertion structured evidence ({text, passed, evidence})
- Claims extraction and verification
- Eval self-critique ("weak assertion = false confidence")
- Surface vs substance guards
- User notes integration
- Output grading.json artifact (feat: skill-eval companion artifacts (grading, timing, benchmark) #565)
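Only the per-assertion `{text, passed, evidence}` shape comes from the proposal above; a sketch of what a grading.json artifact might look like, where the surrounding fields (`eval_id`, `verdict`) are assumptions:

```python
import json

# Illustrative grading.json payload. The {text, passed, evidence}
# assertion shape is from the proposal; eval_id and verdict are
# hypothetical fields for the sketch.
grading = {
    "eval_id": "workspace-refactor-001",
    "assertions": [
        {"text": "Output mentions the renamed function",
         "passed": True,
         "evidence": "Line 12: 'renamed parse_config to load_config'"},
        {"text": "No unrelated files were modified",
         "passed": False,
         "evidence": "Diff touches README.md, which the task did not require"},
    ],
    "verdict": "fail",  # in this sketch, any failed assertion fails the run
}

with open("grading.json", "w") as f:
    json.dump(grading, f, indent=2)
```

The evidence string is what makes the grade auditable: the judge must quote where in the output each assertion passed or failed, rather than emit a bare boolean.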
Phase 4: Compare (blind A/B — #571)
Agents: blind-comparator (new), comparison-analyzer (new)
- Blind comparison — judge doesn't know which is baseline
- Dynamic task-specific rubric generation
- Multi-dimensional scoring (content + structure)
- Post-comparison analysis (unblinding, instruction-following scoring)
- Skip this phase when only one run exists
Phase 5: Analyze (standalone analyzer — #567 + existing SIMBA/GEPA)
Agent: optimizer-reflector (enhanced)
- Existing SIMBA pattern (self-introspective failure analysis)
- Existing GEPA pattern (trace reflection with diagnosis categories)
- NEW: Deterministic-upgrade suggestions (LLM-judge → contains/regex/is_json)
- NEW: Benchmark pattern analysis (always-pass, always-fail, flaky detection)
- NEW: Weak assertion identification
- Trend analysis across iterations (existing stagnation detection)
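The benchmark pattern analysis above (always-pass, always-fail, flaky) reduces to classifying each scenario's pass/fail history across runs. A minimal sketch with assumed thresholds (the flip-count cutoff and label names are illustrative):

```python
def classify_scenarios(history: dict[str, list[bool]]) -> dict[str, str]:
    """Classify each scenario from its pass/fail history across runs.
    Thresholds and labels are assumptions for illustration."""
    labels = {}
    for scenario, results in history.items():
        if all(results):
            labels[scenario] = "always-pass"   # candidate for deterministic upgrade
        elif not any(results):
            labels[scenario] = "always-fail"   # must-fix, or possibly an eval-issue
        else:
            # Count pass/fail flips between consecutive runs
            flips = sum(a != b for a, b in zip(results, results[1:]))
            labels[scenario] = "flaky" if flips > 1 else "trending"
    return labels
```

Always-pass scenarios are the natural targets for the deterministic-upgrade suggestion (LLM-judge → contains/regex/is_json), since their expected output is stable.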
Phase 6: Review (human review checkpoint — #568)
- Present results to human reviewer (reference the self-contained HTML dashboard from #562 if available)
- Collect structured feedback (feedback.json artifact)
- Gate: human approves iteration or requests changes
- This phase may be skipped in automated/CI mode
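The proposal names a feedback.json artifact but not its schema; one possible shape, where every field is an assumption for illustration:

```python
import json

# Hypothetical feedback.json collected at the human review gate.
# The proposal only names the artifact; these fields are assumptions.
feedback = {
    "iteration": 3,
    "decision": "request-changes",  # or "approve"
    "notes": [
        "Variant B over-fits scenario 7; keep the negative constraint instead",
        "Assertion 'mentions tests' is too weak, upgrade to a regex check",
    ],
}

with open("feedback.json", "w") as f:
    json.dump(feedback, f, indent=2)
```

Structured feedback (rather than free-form chat) lets Phase 7's optimizer-curator consume the reviewer's notes directly in the next iteration.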
Phase 7: Optimize (existing optimizer-curator + optimizer-polish)
Agents: optimizer-curator, optimizer-polish
- Existing surgical prompt editing (ADD/UPDATE/DELETE/NEGATIVE CONSTRAINT)
- Existing generalization and polish
- Existing variant tracking (2-3 promising variants)
Phase 8: Re-run + Iterate
- Loop back to Phase 2 with the modified skill/prompt
- Compare against previous iteration's baseline
- Exit when: target pass rate reached, human approves, or stagnation detected
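Phases 2 through 8 can be sketched as a single loop with the three exit conditions above. The phase functions, thresholds, and stagnation window here are all assumptions; the skill would dispatch each phase to its sub-agent rather than call Python functions:

```python
# Illustrative orchestration loop for Phases 2-8 (a sketch, not the skill).
def lifecycle(run, grade, compare, analyze, review, optimize,
              target_pass_rate=0.9, max_iters=10, stall_window=3):
    history = []
    prev = None
    for _ in range(max_iters):
        results = run()                      # Phase 2: run baseline/candidate
        graded = grade(results)              # Phase 3: grade with evidence
        if prev is not None:
            compare(prev, graded)            # Phase 4: skip when only one run exists
        analyze(graded, history)             # Phase 5: reflection + pattern analysis
        if not review(graded):               # Phase 6: human gate (skippable in CI)
            break
        history.append(graded["pass_rate"])
        if graded["pass_rate"] >= target_pass_rate:
            return "target-reached"
        if len(history) >= stall_window and len(set(history[-stall_window:])) == 1:
            return "stagnated"               # no movement across the window
        prev = graded
        optimize(graded)                     # Phase 7: surgical prompt edits
    return "stopped"                         # Phase 8 loops back to run()
```

The loop makes the exit conditions explicit: target pass rate, reviewer rejection, or a flat pass-rate window (the optimizer's existing stagnation detection).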
Agent Inventory
Existing agents (keep):
- `optimizer-discovery` — Phase 1
- `eval-candidate` — Phase 2 (candidate LLM for agent-mode evals)
- `eval-judge` — Phase 3 (enhanced with the #570 patterns: claims extraction, eval critique, evidence format)
- `optimizer-reflector` — Phase 5 (enhanced with the #567 patterns: weak assertions, flaky scenarios)
- `optimizer-curator` — Phase 7
- `optimizer-polish` — Phase 7
New agents:
- `blind-comparator` — Phase 4 (from #571, blind A/B comparison with dynamic rubrics)
- `comparison-analyzer` — Phase 4 (from #571, post-comparison unblinding)
Total: 8 agents, dispatched on-demand by the skill. The SKILL.md is the orchestration layer.
What stays separate
- agentv-eval-builder — Creating eval files is a distinct task. Users often create evals without running the lifecycle.
- agentv-trace-analyst — Ad-hoc analysis outside the lifecycle loop. Useful for one-off investigations.
- agentv-chat-to-eval — Conversation conversion, unrelated to the lifecycle.
- agentv-onboarding — Setup, unrelated.
What gets deprecated
- agentv-eval-orchestrator — Its "run evals" capability is absorbed into Phase 2 of the combined skill. The skill file can be kept temporarily with a description pointing to the combined skill, then removed.
Design Latitude
- Choose the skill name: expand `agentv-optimizer` (rename to `agentv-skill-eval` or similar) vs. create a new skill and deprecate the optimizer
- Choose how to handle progressive disclosure: the SKILL.md should describe all 8 phases but load agent references on-demand
- Choose which phases are skippable (e.g., Phase 4 blind comparison, Phase 6 human review)
- Choose whether to implement all phases at once or incrementally (add phases to the existing optimizer skill over multiple PRs)
- Phase 2 can delegate to the `agentv eval run` CLI or use the agent-mode approach — implementer's choice
Relationship to Existing Issues
This issue is the orchestration layer that ties the individual capability issues together:
| Issue | Becomes | Phase |
|---|---|---|
| #570 (eval-judge enhancement) | Enhanced eval-judge agent | Phase 3 |
| #571 (blind comparison) | New comparator + analyzer agents | Phase 4 |
| #567 (eval analyzer) | Enhanced optimizer-reflector agent | Phase 5 |
| #568 (review checkpoint) | Review phase in the skill | Phase 6 |
| #565 (companion artifacts) | Artifact output across phases | Phases 3, 5 |
| #572 (trigger disambiguation) | Simpler — one skill to disambiguate, not four | — |
| #564 (workflow guide) | Documents this skill's workflow for end users | — |
Individual issues can still be implemented as separate PRs (one agent enhancement per PR). This issue adds the SKILL.md that orchestrates them into the lifecycle.
Migration Reference in Skill
The combined skill's references/ directory should include a migration reference (e.g., references/migrating-from-skill-creator.md) that the skill can load when a user asks about migration or is working with skill-creator's evals.json. This reference should cover:
- Drop-in replacement: `agentv eval run evals.json` runs skill-creator evals directly — no conversion
- What you gain: workspace isolation, code judges, tool trajectory, N-way provider comparison, multi-turn evaluation
- Artifact compatibility: grading.json/benchmark.json output is readable by skill-creator's eval-viewer
- Graduating to EVAL.yaml: when and how to convert evals.json to EVAL.yaml for richer evaluation features
- What stays in skill-creator: trigger optimization, .skill packaging, skill authoring — AgentV does not replace these
Acceptance Signals
- A single skill invocation runs the complete agent evaluation lifecycle (run → grade → compare → analyze → review → optimize → re-run)
- The skill dispatches to specialized agents at each phase (not a monolithic prompt)
- Users can enter the lifecycle at any phase (e.g., "I already have results, just analyze them")
- The skill produces companion artifacts (feat: skill-eval companion artifacts (grading, timing, benchmark) #565) at relevant phases (grading.json after Phase 3, benchmark.json after Phase 5)
- Phase 6 (human review) can be skipped with a flag or prompt
- The skill handles both evals.json and EVAL.yaml input formats
- agentv-eval-orchestrator is deprecated or merged
- The trigger description clearly disambiguates from skill-creator (fix: disambiguate agentv eval skill triggers from skill-creator #572)
Non-Goals
- Absorbing agentv-eval-builder (test case creation stays separate)
- Absorbing agentv-trace-analyst (ad-hoc analysis stays separate)
- Adding trigger optimization (future work, needs skill discovery story)
- Adding packaging/distribution (.skill bundles)
- Core runtime changes (this is all skill/agent prompt work)
Source Material
- Anthropic skill-creator SKILL.md — unified lifecycle skill pattern
- Current AgentV optimizer: `plugins/agentv-dev/skills/agentv-optimizer/SKILL.md` (467 lines, 5-phase workflow)
- Current AgentV eval-orchestrator: `plugins/agentv-dev/skills/agentv-eval-orchestrator/SKILL.md` (77 lines)
Critical: eval-orchestrator capabilities that must be preserved
When absorbing eval-orchestrator, the combined skill must NOT lose these capabilities:
- Workspace evaluation — Clone repos, run setup/teardown scripts, evaluate agent behavior in real project contexts. This is AgentV's key differentiator vs. skill-creator.
- Multi-provider targets — Run the same eval against Claude, GPT, Copilot, Gemini, custom CLI agents simultaneously. Not just with-skill vs without-skill.
- Multi-turn conversation evaluation — Test multi-turn agent interactions with `conversation_id` tracking, not just single-prompt evaluation.
- Code judges — First-class Python/TypeScript evaluator scripts via `defineCodeJudge()`, not just LLM-based grading.
- Tool trajectory scoring — Evaluate tool call sequences (correct tools used, correct order, no unnecessary calls).
- Workspace file change tracking — Evaluate agent output by diffing workspace files, not just text output.
- All eval formats — EVAL.yaml, evals.json, JSONL datasets. Not just evals.json.
- Agent-mode AND CLI-mode — Agent mode (no API keys, uses eval-candidate + eval-judge agents) and CLI mode (end-to-end with API keys).
The combined skill is an EXPANSION of eval-orchestrator's scope (adding grading, comparison, analysis, review phases), not a REPLACEMENT with narrower skill-creator patterns.
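To make the tool-trajectory capability above concrete, here is a sketch of what such a scorer checks: expected tools used in order, with a penalty for unnecessary calls. This is not AgentV's actual scorer; the function, weights, and tool names are illustrative:

```python
def score_trajectory(expected: list[str], actual: list[str]) -> float:
    """Illustrative tool-trajectory score: rewards covering the expected
    tool sequence in order, penalizes extra calls. Weights are assumptions."""
    # Ordered coverage: how much of the expected sequence appears in order
    i = 0
    for tool in actual:
        if i < len(expected) and tool == expected[i]:
            i += 1
    coverage = i / len(expected) if expected else 1.0
    # Penalty for unnecessary calls beyond the expected count
    extra = max(0, len(actual) - len(expected))
    penalty = extra / max(len(actual), 1)
    return max(0.0, coverage - 0.5 * penalty)
```

A code judge doing this kind of check over the recorded tool calls is something an LLM judge handles poorly, which is why preserving first-class evaluator scripts matters.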
Relationship to Skill-Creator
The combined skill is NOT a replacement for skill-creator. They are complementary:
- Skill-creator writes `evals.json` for a skill and runs simple paired evals via `claude -p`
- AgentV's lifecycle skill can run the SAME `evals.json` with richer evaluation: workspace isolation, code judges, multi-provider comparison, tool trajectory scoring
- AgentV outputs grading/benchmark artifacts that skill-creator's `eval-viewer` can read
The upgrade path: a user starts with skill-creator's simple eval loop, then graduates to AgentV's lifecycle skill when they need environment-level evaluation. Artifacts are portable between the two.
The combined skill should accept evals.json as input (already supported by eval-orchestrator) and produce artifacts compatible with skill-creator's tooling (ensured by #565's superset schemas).