feat: unified agent-evaluation lifecycle skill (combine eval-orchestrator + optimizer)#583

Merged
christso merged 2 commits into main from feat/573-unified-lifecycle-skill
Mar 14, 2026

Conversation

@christso
Collaborator

Closes #573

Changes

  • Expanded agentv-optimizer from 5-phase to 8-phase unified lifecycle skill
  • Deprecated eval-orchestrator (redirects to unified skill)
  • Added migration reference for skill-creator users

8 Phases

  1. Discovery — optimizer-discovery agent analyzes eval, challenges assumptions, triages failures
  2. Run Baseline — absorbed from eval-orchestrator: workspace eval, multi-provider, multi-turn, code judges, all formats, agent+CLI modes
  3. Grade — enhanced eval-judge with per-assertion evidence, claims extraction, self-critique (feat: adopt skill-creator grading patterns in eval-judge (claims extraction, eval critique, evidence format) #570)
  4. Compare — blind N-way comparison with dynamic rubrics, post-comparison analysis (feat: blind A/B comparison with dynamic rubrics and post-comparison analysis #571)
  5. Analyze — SIMBA/GEPA + deterministic-upgrade suggestions, weak assertion detection, benchmark patterns (feat: eval analyzer pass for weak assertions and flaky scenarios #567)
  6. Review — human review checkpoint with structured feedback.json (feat: human review checkpoint and feedback artifact for skill iteration #568), skippable in CI
  7. Optimize — curator surgical edits + polish generalization, variant tracking
  8. Re-run + Iterate — loop with exit conditions (target pass rate, human approval, stagnation)
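
The exit conditions in phase 8 could be sketched roughly as follows. This is a minimal illustration only, not the skill's actual logic: `ExitPolicy`, `exit_reason`, `iterate`, every threshold value, and the `max_iterations` safety cap are hypothetical names and assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ExitPolicy:
    # All values are illustrative assumptions, not taken from the skill.
    target_pass_rate: float = 0.9   # stop once a run reaches this pass rate
    stagnation_rounds: int = 2      # how many recent runs to inspect
    min_improvement: float = 0.01   # gain required to count as progress
    max_iterations: int = 10        # added safety cap (assumption)


def exit_reason(history: List[float], approved: bool,
                policy: ExitPolicy) -> Optional[str]:
    """Return why the loop should stop, or None to keep iterating.

    history holds the pass rate of each completed run, oldest first.
    """
    if not history:
        return None
    if approved:
        return "human_approval"
    if history[-1] >= policy.target_pass_rate:
        return "target_pass_rate"
    n = policy.stagnation_rounds
    if len(history) > n:
        # Compare the best of the last n runs with the best before them.
        best_before = max(history[:-n])
        best_recent = max(history[-n:])
        if best_recent - best_before < policy.min_improvement:
            return "stagnation"
    if len(history) >= policy.max_iterations:
        return "max_iterations"
    return None


def iterate(run_eval: Callable[[int], float],
            review: Callable[[float], bool],
            policy: ExitPolicy) -> Tuple[str, List[float]]:
    """Re-run the eval until one of the exit conditions fires."""
    history: List[float] = []
    while True:
        rate = run_eval(len(history))   # phases 2-5 and 7 collapsed into one call
        history.append(rate)
        reason = exit_reason(history, review(rate), policy)
        if reason:
            return reason, history
```

In CI, where the review checkpoint is skipped, `review` would simply return `False`, so only the pass-rate and stagnation checks (plus the cap) can end the loop.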

Key Capabilities Preserved

All EVAL.yaml workspace evaluation capabilities:

  • Workspace isolation (clone repos, setup/teardown scripts)
  • Multi-provider targets (Claude, GPT, Copilot, Gemini, custom CLI)
  • Multi-turn conversation evaluation
  • Code judges (Python/TypeScript via defineCodeJudge())
  • Tool trajectory scoring
  • Workspace file change tracking
  • All eval formats (EVAL.yaml, evals.json, JSONL)
  • Agent-mode + CLI-mode
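
To make the tool trajectory item concrete, here is one possible scoring scheme: a greedy in-order match of expected tool names against the observed calls. How AgentV actually scores trajectories is not described in this PR; the function name `trajectory_score` and the scoring rule are assumptions made purely for illustration.

```python
from typing import List


def trajectory_score(expected: List[str], actual: List[str]) -> float:
    """Fraction of expected tool calls found in `actual`, in order.

    Hypothetical scoring rule: each expected tool must appear after the
    previously matched one; extra calls in between are ignored.
    """
    if not expected:
        return 1.0  # nothing was required, so nothing can be missed
    matched = 0
    pos = 0  # search only after the last matched call
    for tool in expected:
        try:
            idx = actual.index(tool, pos)
        except ValueError:
            continue  # this expected call is missing or out of order
        matched += 1
        pos = idx + 1
    return matched / len(expected)
```

Under this rule, extra exploratory calls do not lower the score, while calls made in the wrong order do.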

New Features

Files Changed

  • plugins/agentv-dev/skills/agentv-optimizer/SKILL.md — rewritten (159→346 lines)
  • plugins/agentv-dev/skills/agentv-eval-orchestrator/SKILL.md — deprecated (77→25 lines)
  • plugins/agentv-dev/skills/agentv-optimizer/references/migrating-from-skill-creator.md — new (96 lines)

Expand agentv-optimizer into 8-phase lifecycle skill covering the full
evaluation improvement loop: discover → run → grade → compare → analyze →
review → optimize → re-run. Absorb eval-orchestrator into Phase 2.
Reference all enhanced agents from Wave 1+2.

Preserves all EVAL.yaml capabilities: workspace isolation, multi-provider,
multi-turn, code judges, tool trajectory, workspace file tracking.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Mar 14, 2026

Deploying agentv with Cloudflare Pages

Latest commit: ba856de
Status: ⚡️ Build in progress...

@christso christso marked this pull request as ready for review March 14, 2026 05:36
…cycle-skill

# Conflicts:
#	plugins/agentv-dev/skills/agentv-eval-orchestrator/SKILL.md
#	plugins/agentv-dev/skills/agentv-optimizer/SKILL.md
@christso christso merged commit f7b35d3 into main Mar 14, 2026
1 check was pending
@christso christso deleted the feat/573-unified-lifecycle-skill branch March 14, 2026 05:41