diff --git a/plugins/agentv-dev/skills/agentv-eval-orchestrator/SKILL.md b/plugins/agentv-dev/skills/agentv-eval-orchestrator/SKILL.md index 77d8ceec4..4172fcc3b 100644 --- a/plugins/agentv-dev/skills/agentv-eval-orchestrator/SKILL.md +++ b/plugins/agentv-dev/skills/agentv-eval-orchestrator/SKILL.md @@ -1,5 +1,6 @@ --- name: agentv-eval-orchestrator +description: "[DEPRECATED] This skill has been absorbed into the unified agentv-optimizer lifecycle skill. Use agentv-optimizer instead — it covers the full evaluation lifecycle: run → grade → compare → analyze → review → optimize → re-run." description: >- Run AgentV evaluations against EVAL.yaml / .eval.yaml / evals.json files using the `agentv prompt eval` and `agentv eval` CLI commands. Use when asked to run AgentV evals, evaluate agent output quality with AgentV, execute an AgentV evaluation suite, @@ -8,74 +9,23 @@ description: >- or measuring skill-creator performance — those tasks belong to the skill-creator skill. --- -# AgentV Eval Orchestrator +# AgentV Eval Orchestrator — DEPRECATED -Run AgentV evaluations using the orchestration prompt system. +> **This skill has been merged into the unified `agentv-optimizer` lifecycle skill.** +> +> All eval-orchestrator capabilities (workspace evaluation, multi-provider targets, multi-turn conversations, code judges, tool trajectory, agent/CLI modes, all eval formats) are now in **Phase 2 (Run Baseline)** of the `agentv-optimizer` skill. +> +> **Use `agentv-optimizer` instead.** It runs the same evaluations and adds grading, comparison, analysis, human review, and optimization phases on top. -## Supported Formats +## Quick Migration -AgentV accepts evaluation files in multiple formats: +| Before (eval-orchestrator) | After (agentv-optimizer) | +|---------------------------|-------------------------| +| "Run evals on this file" | Same prompt — agentv-optimizer handles it | +| "Evaluate my agent" | Same prompt — starts at Phase 2 automatically | +| `agentv prompt eval ` | Same command — used in Phase 2 | +| `agentv eval run ` | Same command — used in Phase 2 | -- **EVAL YAML** (`.eval.yaml`) — Full-featured AgentV native format -- **JSONL** (`.jsonl`) — One test per line, with optional YAML sidecar -- **Agent Skills evals.json** (`.json`) — Open standard format from Agent Skills +## Why the change -All commands below work with any of these formats. - -## Usage - -```bash -agentv prompt eval -``` - -This outputs a complete orchestration prompt with mode-specific instructions and all test IDs. **Follow its instructions exactly.** - -The orchestration mode is controlled by the `AGENTV_PROMPT_EVAL_MODE` environment variable: - -- **`agent`** (default) — Act as the candidate LLM and judge via two agents (`eval-candidate`, `eval-judge`). No API keys needed. -- **`cli`** — The CLI runs the evaluation end-to-end. Requires API keys. - -## How It Works - -1. Run `agentv prompt eval ` to get orchestration instructions -2. The output tells you exactly what to do based on the current mode -3. Follow the instructions — dispatch agents (agent mode) or run CLI commands (cli mode) -4. Results are written to `.agentv/results/` in JSONL format - -## Agent Skills evals.json - -When running an `evals.json` file, AgentV automatically: - -- Promotes `prompt` → input messages, `expected_output` → reference answer -- Converts `assertions` → llm-judge evaluators -- Resolves `files[]` paths relative to the evals.json directory and copies them into the workspace -- Sets agent mode by default (since evals.json targets agent workflows) - -```bash -# Run directly -agentv prompt eval evals.json - -# Or convert to YAML first for full feature access -agentv convert evals.json -agentv prompt eval evals.eval.yaml -``` - -## Benchmark Output - -After running evaluations, generate an Agent Skills-compatible `benchmark.json` summary: - -```bash -agentv eval evals.json --benchmark-json benchmark.json -``` - -This produces aggregate pass rates, timing, and token statistics in the Agent Skills benchmark format. - -## Converting Formats - -To unlock AgentV-specific features (workspace setup, code judges, rubrics, retry policies), convert evals.json to YAML: - -```bash -agentv convert evals.json -``` - -See the [convert command docs](https://agentv.dev/tools/convert/) for details. +The eval-orchestrator ran evaluations but stopped there. Users had to manually switch to other skills for analysis and optimization. The unified lifecycle skill runs evaluations as part of a complete improvement loop — run, grade, compare, analyze, review, optimize, and re-run — without losing any eval-orchestrator capability. diff --git a/plugins/agentv-dev/skills/agentv-optimizer/SKILL.md b/plugins/agentv-dev/skills/agentv-optimizer/SKILL.md index 24bb46f12..55f0dfb31 100644 --- a/plugins/agentv-dev/skills/agentv-optimizer/SKILL.md +++ b/plugins/agentv-dev/skills/agentv-optimizer/SKILL.md @@ -1,5 +1,6 @@ --- name: agentv-optimizer +description: "Run the full agent-evaluation lifecycle: discover → run → grade → compare → analyze → review → optimize → re-run. Use when asked to evaluate an agent, optimize prompts against evals, run EVAL.yaml or evals.json evaluations, compare agent outputs, analyze eval results, or improve agent performance. Supports workspace evaluation with real repos, multi-provider targets, multi-turn conversations, code judges, tool trajectory scoring, and workspace file change tracking." description: >- Optimize agent task prompts through AgentV evaluation-driven refinement using `agentv prompt eval` and EVAL.yaml files. Five-phase workflow (Discovery → Planning → Optimization → Polish → Handoff) that iteratively improves prompts @@ -10,17 +11,50 @@ description: >- those tasks belong to the skill-creator skill. --- -# AgentV Optimizer +# AgentV Agent-Evaluation Lifecycle ## Overview -An agent **is** its prompts. This skill teaches patterns for agent self-improvement: using AgentV evaluations to iteratively refine the task prompts that drive agent behavior. Unlike static evaluation, this enables continuous agent improvement grounded in actual measurement. +An agent **is** its prompts. This skill orchestrates the complete agent-evaluation lifecycle — from running evaluations through grading, comparison, analysis, human review, optimization, and re-running — in a single invocation. It dispatches specialized agents at each phase and produces structured artifacts throughout. -The workflow is structured into five phases: Discovery, Planning, Optimization, Polish, and Handoff. This ensures the optimizer understands what it is optimizing before touching prompts, keeps the user in control at key decision points, and delivers a professional result rather than a collection of patches. +The workflow is structured into eight phases: **Discovery → Run → Grade → Compare → Analyze → Review → Optimize → Re-run**. Users can enter at any phase (e.g., "I already have results, just analyze them") and skip optional phases (e.g., comparison when there's only one run, or human review in CI mode). + +### How this differs from skill-creator + +| | AgentV Lifecycle Skill (this) | Anthropic skill-creator | +|--|-------------------------------|------------------------| +| **Primary input** | EVAL.yaml (workspace evals) | evals.json (skill evals) | +| **Also accepts** | evals.json, JSONL datasets | — | +| **Environment** | Clone repos, setup/teardown scripts, real project contexts | Isolated single-prompt | +| **Targets** | Multiple providers (Claude, GPT, Copilot, Gemini, custom CLI) | With-skill vs without-skill | +| **Evaluators** | Code judges, tool trajectory, LLM judge, deterministic | LLM judge, deterministic | +| **Conversations** | Multi-turn with conversation_id tracking | Single-turn | +| **Workspace** | File change tracking via workspace diffs | Text output only | +| **Modes** | Agent mode (no API keys) + CLI mode (end-to-end) | CLI only | + +For users migrating from skill-creator, see `references/migrating-from-skill-creator.md`. ## Input Variables -- `eval-path`: Path or glob pattern to the AgentV evaluation file(s) to optimize against + +- `eval-path`: Path or glob pattern to the AgentV evaluation file(s) — EVAL.yaml, evals.json, or JSONL - `optimization-log-path` (optional): Path where optimization progress should be logged +- `entry-phase` (optional): Phase to start from (1-8) — defaults to 1 (Discovery) +- `results-path` (optional): Path to existing results for mid-lifecycle entry (phases 3-6) +- `skip-review` (optional): Skip Phase 6 human review (for CI/automated mode) +- `target-pass-rate` (optional): Exit threshold — stop iterating when reached (default: 100%) +- `max-iterations` (optional): Maximum optimization iterations (default: 10) + +## Mode Detection + +The skill auto-detects the evaluation mode from the input format: + +| Input file | Detected mode | Behavior | +|-----------|---------------|----------| +| `*.eval.yaml` | Workspace/Agent evaluation | Full feature set: workspace isolation, code judges, multi-provider, multi-turn, tool trajectory | +| `evals.json` | Skill evaluation (compat) | Auto-promotes prompt/expected_output/assertions; resolves files[] paths; agent mode default | +| `*.jsonl` | Dataset evaluation | One test per line with optional YAML sidecar | + +All modes flow through the same 8 phases. EVAL.yaml unlocks the richest feature set. ## Evaluation Integrity Constraint @@ -37,129 +71,284 @@ The workflow is structured into five phases: Discovery, Planning, Optimization, - If a prompt file is referenced in both locations, optimize for the task purpose only - Document which prompts were modified in the optimization log -**Why this matters:** Optimizing judge prompts makes your agent *appear* better without actually improving it. Evaluation must remain an independent measure of agent quality. - ## Workflow ### Phase 1: Discovery -Before optimizing, understand what you are working with. +Before running or optimizing, understand what you are working with. **Dispatch the `optimizer-discovery` agent** with the eval path. It will: -1. **Load the Evaluation** — verify `` targets the correct system, read the eval file and all referenced test cases. -2. **Identify Prompt Files** — infer task prompts from `file:` references in `input` fields only, run integrity checks against evaluator configs, and recursively resolve prompt dependencies. -3. **Identify Optimization Log** — use `` if provided, otherwise create `optimization-[timestamp].md` in the eval's parent directory. -4. **Challenge Assumptions** — assess eval quality, flag ambiguous or contradictory test cases, triage failures into must-fix vs nice-to-have, and surface eval issues before proceeding. +1. **Load the Evaluation** — verify `` targets the correct system, read the eval file and all referenced test cases. Supports EVAL.yaml, evals.json, and JSONL formats. +2. **Identify Prompt Files** — infer task prompts from `file:` references in `input` fields only, run integrity checks against evaluator configs, and recursively resolve prompt dependencies. +3. **Identify Optimization Log** — use `` if provided, otherwise create `optimization-[timestamp].md` in the eval's parent directory. +4. **Challenge Assumptions** — assess eval quality, flag ambiguous or contradictory test cases, triage failures into must-fix / nice-to-have / eval-issue, and surface eval issues before proceeding. +5. **Integrity Check** — verify that task prompts referenced in `input` fields are not also present in evaluator configs. Flag any overlap. **Review the discovery report** before moving to Phase 2. If the agent flags eval issues, fix the eval first. -### Phase 2: Planning +### Phase 2: Run Baseline + +Run evaluations to establish baseline measurements. This phase absorbs the functionality of the former `agentv-eval-orchestrator` skill. + +**Execution modes:** + +The mode is controlled by the `AGENTV_PROMPT_EVAL_MODE` environment variable: + +- **`agent`** (default) — Dispatches `eval-candidate` and `eval-judge` agents. No API keys needed. +- **`cli`** — Runs `agentv eval run ` end-to-end. Requires API keys. + +**Steps:** + +1. **Run baseline evaluation:** -Propose a strategy before touching any prompts. + ```bash + # CLI mode + agentv eval run -1. **Run Baseline** - - Execute `agentv prompt eval ` to establish the current pass rate. - - Record baseline score in the optimization log. + # Agent mode — get orchestration prompt and follow it + agentv prompt eval + ``` -2. **Assess Complexity** - - **Simple**: Prompt needs clarification or missing constraints (expect 1-3 iterations). - - **Moderate**: Prompt structure needs reorganization or multiple concerns are entangled (expect 3-6 iterations). - - **Fundamental**: Agent's approach is wrong, needs rethinking (consider whether prompt optimization alone is sufficient). +2. **For multi-target comparison:** Run the same eval against multiple providers/configurations to produce paired results for Phase 4. -3. **Propose Strategy** - - Identify the top failure patterns from the baseline run. - - Propose an optimization approach: which failures to tackle first, what kind of changes to make. - - Identify dependencies and risks (e.g., fixing one failure pattern may break passing tests). +3. **For evals.json input:** AgentV automatically promotes `prompt` → input messages, `expected_output` → reference answer, converts `assertions` → evaluators, and resolves `files[]` paths. -4. **Get User Alignment** - - Present the strategy to the user before proceeding. - - If the agent needs fundamental restructuring, say so — don't just patch. - - Confirm the user is aligned on approach before entering the optimization loop. +4. **Record baseline** in the optimization log: score, pass rate, per-test breakdown, and results file path (`.agentv/results/eval_...jsonl`). -### Phase 3: Optimization Loop +**Capabilities preserved from eval-orchestrator:** +- Workspace isolation — clone repos, run setup/teardown scripts +- Multi-provider targets — same eval against Claude, GPT, Copilot, Gemini, custom CLI agents +- Multi-turn conversation evaluation — conversation_id tracking across turns +- Code judges — Python/TypeScript evaluator scripts via `defineCodeJudge()` +- Tool trajectory scoring — evaluate tool call sequences +- Workspace file change tracking — evaluate by diffing workspace files +- All eval formats — EVAL.yaml, evals.json, JSONL +- Agent-mode AND CLI-mode — agent mode (no API keys) and CLI mode (end-to-end) -Max 10 iterations. This is the core refinement cycle. +**Baseline isolation:** Discovery-phase analysis should not contaminate baseline results. Run the baseline before deep-diving into failure patterns to ensure the optimizer's understanding of failures comes from actual eval data, not assumptions. -1. **Execute (The Generator)** - - Run `agentv prompt eval ` and follow its orchestration instructions. - - *Targeted Run*: If iterating on specific stubborn failures, pass `--test-id ` to filter to specific tests. +### Phase 3: Grade -2. **Analyze — Dispatch `optimizer-reflector` agent** - - Provide the results file path (`.agentv/results/eval_...jsonl`) and the current iteration number. - - The reflector performs self-introspective analysis (SIMBA pattern) and natural language trace reflection (GEPA pattern). - - Returns a structured reflection report with: score, root cause analysis, high-variability cases, strategy, and stagnation check. +Produce structured grading with per-assertion evidence. -3. **Decide** - - If **100% pass**: Proceed to Phase 4 (Polish). - - If **Score decreased**: Revert last change, try different approach. - - If **No improvement** (2x): STOP and report stagnation, or try a fundamentally different approach. - - **Human checkpoint**: At iterations 3, 6, and 9, present progress to the user and confirm direction. Push back if the optimization is going down a bad path (e.g., accumulating contradictory rules, overfitting to specific test cases). - - **Variant tracking**: When a change improves some tests but regresses others, consider maintaining 2-3 promising prompt variants rather than single-path iteration. Compare variants to find the best overall approach before converging. +**Dispatch the `eval-judge` agent** (enhanced with claims extraction, #570). For each test case: -4. **Refine — Dispatch `optimizer-curator` agent** - - Provide the reflector's strategy and the prompt file path(s). - - The curator applies surgical, atomic operations (ADD / UPDATE / DELETE / NEGATIVE CONSTRAINT) to the task prompt. - - Returns a log entry documenting the operation, target, change, trigger, rationale, score, and insight. +1. **Per-assertion structured evidence** — each assertion produces `{text, passed, evidence}` with specific quotes or measurements backing the verdict. +2. **Claims extraction** — extract factual claims from the candidate response and verify each against reference material. +3. **Eval self-critique** — the judge flags its own weak assertions ("this passed, but the assertion is too loose to be meaningful"). +4. **Surface vs substance guards** — detect when a response looks good superficially but fails on substance (format compliance ≠ content quality). +5. **User notes integration** — if the user provided notes or context, incorporate them into grading. -5. **Log Result** - - Append the **Log Entry** returned by the Curator to the optimization log file. +**Output:** Write `grading.json` artifact to `.agentv/artifacts/grading.json` (#565). -### Phase 4: Polish +```json +{ + "eval_path": "", + "timestamp": "", + "results": [ + { + "test_id": "...", + "score": 0.85, + "assertions": [ + {"text": "Response includes error handling", "passed": true, "evidence": "Lines 12-15 contain try/catch block"}, + {"text": "Uses async/await pattern", "passed": false, "evidence": "Uses .then() callback pattern instead"} + ], + "claims": [...], + "self_critique": ["Assertion 'mentions error handling' is too loose — should check for specific error types"] + } + ] +} +``` -Before handing off, clean up the prompt so it reads as a coherent document. +### Phase 4: Compare -**Dispatch the `optimizer-polish` agent** with the prompt file(s) and the optimization log. It will: +Blind N-way comparison when multiple runs exist. **Skip this phase when only one run exists.** -1. **Generalize Patches into Principles** — consolidate specific fixes into broad guidelines. -2. **Remove Redundancy and Contradictions** — eliminate overlapping or conflicting rules. -3. **Ensure Prompt Quality** — verify clear persona, specific task, measurable success criteria, and manageable length (<200 lines). +**Step 1 — Dispatch `blind-comparator` agent** (#571): -**Review the polish report**, then: +1. **Blind presentation** — the comparator receives responses labeled "Response A", "Response B", etc. without knowing which is baseline. +2. **Dynamic rubric generation** — generate task-specific evaluation criteria based on the test case requirements, not a generic rubric. +3. **Multi-dimensional scoring** — evaluate on content quality AND structural quality independently. +4. **N-way comparison** — compare 2+ responses simultaneously, not just binary A/B. +5. **Per-response verdicts** with dimensional breakdowns. -4. **Verify Polish Didn't Regress** - - Run the eval one final time after polish changes. - - If score decreased, revert polish changes and keep the working (if messy) version. +**Step 2 — Dispatch `comparison-analyzer` agent** (#571): -### Phase 5: Handoff +1. **Unblinding** — reveal which response was baseline vs candidate. +2. **Improvement attribution** — identify what specific changes drove improvements or regressions. +3. **Instruction-following scoring** — score how well each response followed the original task instructions. +4. **Actionable suggestions** — produce concrete optimization suggestions from the comparison. -Ensure the user understands what changed and can maintain the optimized agent. +**Output:** Comparison results are included in the grading artifact and fed into Phase 5. -1. **Document All Changes** - - Summarize what was changed and why in the optimization log. - - For each significant change, include the rationale (not just "fixed test X" but "the agent was hallucinating Y because the prompt lacked constraint Z"). +### Phase 5: Analyze -2. **Report Final Results** - - Report final score and comparison to baseline. - - Highlight any test cases that still fail and why. +Deep failure analysis combining existing patterns with new capabilities. -3. **Suggest Future Improvements** - - Identify improvements beyond current eval coverage (v2 ideas). - - Note any areas where the eval itself should be expanded. - - Flag any fragile optimizations that may break with future changes. +**Dispatch `optimizer-reflector` agent** (enhanced with #567 patterns) and `eval-analyzer` agent: -4. **Finalize Optimization Log** - - Add a summary header to the optimization log file indicating session completion, baseline score, final score, and key decisions made. +1. **SIMBA pattern** (existing) — self-introspective failure analysis. For each failure: "What specific instruction or lack of instruction caused this?" +2. **GEPA pattern** (existing) — natural language trace reflection. Compare actual vs expected output, diagnose: knowledge gap, instruction ambiguity, hallucination, or wrong approach. +3. **Deterministic-upgrade suggestions** (new, #567) — identify LLM-judge assertions that could be replaced with deterministic evaluators: + - "Response contains X" → `contains` evaluator + - "Output matches pattern Y" → `regex` evaluator + - "Output is valid JSON" → `is-json` evaluator +4. **Weak assertion identification** (new) — flag assertions that always pass or are too vague to meaningfully test anything. +5. **Benchmark pattern analysis** (new) — detect always-pass tests (assertion too loose), always-fail tests (task impossible or assertion wrong), and flaky tests (non-deterministic). +6. **Trend analysis** (existing) — across iterations, detect improving / plateauing / regressing patterns, stagnation, overfitting. + +**Output:** Write `benchmark.json` artifact to `.agentv/artifacts/benchmark.json` (#565). + +```json +{ + "eval_path": "", + "timestamp": "", + "aggregate": {"pass_rate": 0.82, "total_tests": 11, "passed": 9, "failed": 2}, + "patterns": { + "always_pass": ["test-id-1"], + "always_fail": ["test-id-7"], + "flaky": [], + "deterministic_upgrade_candidates": [ + {"test_id": "test-id-3", "current": "llm-judge", "suggested": "contains", "pattern": "checks for keyword presence"} + ] + }, + "iteration_trend": [{"iteration": 1, "pass_rate": 0.72}, {"iteration": 2, "pass_rate": 0.82}] +} +``` + +### Phase 6: Review + +Human review checkpoint. **Skip this phase when `skip-review` is set or in CI/automated mode.** + +1. **Present results** — show the human reviewer: + - Current pass rate and delta from baseline + - Per-test breakdown (pass/fail with evidence) + - Comparison results (if Phase 4 ran) + - Analysis insights (deterministic upgrade candidates, weak assertions, pattern analysis) + - If the HTML dashboard (#562) is available, reference it for interactive exploration. + +2. **Collect structured feedback** — prompt the human for: + - Approve: continue to optimization + - Reject: stop, the eval or agent needs rethinking + - Redirect: change optimization strategy or focus area + - Notes: free-form feedback to incorporate into subsequent phases + +3. **Output:** Write `feedback.json` artifact to `.agentv/artifacts/feedback.json` (#568). + + ```json + { + "timestamp": "", + "iteration": 2, + "decision": "approve", + "notes": "Focus on test-id-7, the error handling case is critical", + "redirect": null + } + ``` + +4. **Gate:** Do not proceed to Phase 7 without human approval (unless `skip-review` is set). If the reviewer redirects, return to the appropriate phase with updated context. + +### Phase 7: Optimize + +Apply surgical prompt refinements based on analysis. + +**Step 1 — Dispatch `optimizer-curator` agent:** + +1. Provide the reflector's strategy, comparison insights, and human feedback (if any). +2. The curator applies atomic operations to task prompts: + - **ADD** — insert a new rule for a missing constraint + - **UPDATE** — refine an existing rule for clarity or generality + - **DELETE** — remove obsolete, redundant, or harmful rules + - **NEGATIVE CONSTRAINT** — explicitly state what NOT to do +3. Returns a log entry: operation, target, change, trigger, rationale, score, insight. + +**Step 2 — Dispatch `optimizer-polish` agent** (when nearing convergence): + +1. Generalize specific patches into broad principles. +2. Remove redundancy and contradictions. +3. Ensure prompt quality: clear persona, specific task, measurable success criteria, <200 lines. + +**Step 3 — Verify polish didn't regress:** +- Run the eval one final time after polish changes. +- If score decreased, revert polish and keep the working version. + +**Variant tracking:** When a change improves some tests but regresses others, maintain 2-3 promising prompt variants. Compare variants to find the best overall approach before converging. + +**Log result:** Append the curator's log entry to the optimization log file. + +### Phase 8: Re-run + Iterate + +Loop back to Phase 2 with the modified prompts. + +1. **Re-run evaluation** with the optimized prompts. The new results become the comparison baseline for the next iteration. +2. **Compare against previous iteration** — Phase 4 now compares current vs previous iteration (not just original baseline). +3. **Exit conditions** — stop iterating when ANY of: + - `target-pass-rate` is reached + - Human approves the result in Phase 6 + - Stagnation detected (no improvement for 2 consecutive iterations) + - `max-iterations` exhausted +4. **On exit:** Proceed to handoff — document all changes, report final vs baseline score, suggest future improvements, and finalize the optimization log. + +**Human checkpoints:** At iterations 3, 6, and 9, present progress to the user regardless of `skip-review`. Push back if optimization is accumulating contradictory rules or overfitting. + +## Entering Mid-Lifecycle + +Users can start at any phase by providing existing data: + +| Entry point | Required input | Example prompt | +|------------|---------------|----------------| +| Phase 1 (Discovery) | `eval-path` | "Optimize my agent against evals/support.yaml" | +| Phase 2 (Run) | `eval-path` | "Run this eval and show me results" | +| Phase 3 (Grade) | `eval-path` + `results-path` | "Grade these eval results" | +| Phase 4 (Compare) | Two or more result sets | "Compare these two eval runs" | +| Phase 5 (Analyze) | `results-path` | "Analyze why my agent is failing on these results" | +| Phase 6 (Review) | `results-path` + analysis | "Review these eval results before I optimize" | +| Phase 7 (Optimize) | `eval-path` + analysis/strategy | "Apply these optimization suggestions" | + +When entering mid-lifecycle, the skill runs only the requested phase and subsequent phases. It does NOT re-run earlier phases unless the user requests a full loop. ## Agent Dispatch Reference -This skill orchestrates four predefined agents. The skill handles coordination and decision-making; agents handle autonomous work. +This skill orchestrates up to eight specialized agents. The skill handles phase transitions, decision-making, and iteration control; agents handle autonomous work within each phase. -| Agent | Dispatched in | Input | Output | -|-------|--------------|-------|--------| -| `optimizer-discovery` | Phase 1 | Eval path | Discovery report (targets, triage, eval quality) | -| `optimizer-reflector` | Phase 3 (each iteration) | Results JSONL path, iteration number | Reflection report (scores, root causes, strategy) | -| `optimizer-curator` | Phase 3 (each iteration) | Reflector strategy, prompt file path(s) | Log entry (operation, change, rationale) | -| `optimizer-polish` | Phase 4 | Prompt file(s), optimization log | Polish report (changes made, quality assessment) | +| Agent | Phase | Input | Output | +|-------|-------|-------|--------| +| `optimizer-discovery` | 1 (Discovery) | Eval path | Discovery report (targets, triage, eval quality) | +| `eval-candidate` | 2 (Run) | Eval path, test ID | Candidate response (agent mode only) | +| `eval-judge` | 2–3 (Run, Grade) | Eval path, test ID, answer | Structured grading with evidence | +| `blind-comparator` | 4 (Compare) | Blinded responses, task context | Blind dimensional scores | +| `comparison-analyzer` | 4 (Compare) | Blind results, response mapping | Unblinded analysis with suggestions | +| `eval-analyzer` | 5 (Analyze) | Results, eval config | Deterministic-upgrade suggestions, weak assertions, patterns | +| `optimizer-reflector` | 5 (Analyze) | Results JSONL, iteration number | SIMBA/GEPA analysis, strategy, stagnation check | +| `optimizer-curator` | 7 (Optimize) | Strategy, prompt file paths | Log entry (operation, change, rationale) | +| `optimizer-polish` | 7 (Optimize) | Prompt files, optimization log | Polish report (generalizations, quality) | **What the skill handles directly** (not delegated to agents): -- Phase 2 (Planning): Running baseline, assessing complexity, getting user alignment -- Phase 3 (Decide): Evaluating scores, reverting changes, human checkpoints, variant tracking -- Phase 5 (Handoff): Documenting changes, reporting results, suggesting v2 improvements +- Phase 2: Choosing execution mode (agent vs CLI), multi-target orchestration +- Phase 6: Human interaction, collecting feedback, gate decisions +- Phase 8: Iteration control, exit condition evaluation, baseline comparison +- Cross-phase: Artifact collection, optimization log maintenance, variant tracking + +## Companion Artifacts + +The skill produces structured artifacts at key phases (#565): + +| Artifact | Phase | Path | Description | +|----------|-------|------|-------------| +| `grading.json` | 3 (Grade) | `.agentv/artifacts/grading.json` | Per-assertion evidence, claims, self-critique | +| `benchmark.json` | 5 (Analyze) | `.agentv/artifacts/benchmark.json` | Aggregate metrics, patterns, upgrade candidates | +| `feedback.json` | 6 (Review) | `.agentv/artifacts/feedback.json` | Human reviewer decision and notes | +| Results JSONL | 2 (Run) | `.agentv/results/eval_*.jsonl` | Raw per-test results (existing format) | +| Optimization log | All | `` | Running narrative of all changes and decisions | + +Artifacts use schemas compatible with skill-creator's eval-viewer where applicable. ## Guidelines + - **Generalization First**: Prefer broad, principle-based guidelines over specific examples or "hotfixes". Only use specific rules if generalized instructions fail to achieve the desired score. - **Simplicity ("Less is More")**: Avoid overfitting to the test set. If a specific rule doesn't significantly improve the score compared to a general one, choose the general one. -- **Structure**: Maintain existing Markdown headers/sections. +- **Structure**: Maintain existing Markdown headers/sections in optimized prompts. - **Progressive Disclosure**: If the prompt grows too large (>200 lines), consider moving specialized logic into a separate file or skill. - **Quality Criteria**: Ensure the prompt defines a clear persona, specific task, and measurable success criteria. +- **Isolation**: Never let discovery-phase knowledge contaminate baseline runs. Run first, analyze second. +- **Integrity**: Never optimize judge prompts. Evaluation must remain an independent measure of agent quality. diff --git a/plugins/agentv-dev/skills/agentv-optimizer/references/migrating-from-skill-creator.md b/plugins/agentv-dev/skills/agentv-optimizer/references/migrating-from-skill-creator.md new file mode 100644 index 000000000..a562d0502 --- /dev/null +++ b/plugins/agentv-dev/skills/agentv-optimizer/references/migrating-from-skill-creator.md @@ -0,0 +1,96 @@ +# Migrating from Skill-Creator to AgentV Lifecycle Skill + +This reference covers how to use AgentV's unified agent-evaluation lifecycle skill (`agentv-optimizer`) with evals.json files originally created for Anthropic's skill-creator. + +## Drop-in Replacement + +AgentV runs skill-creator's evals.json directly — no conversion required: + +```bash +# Run evals.json with AgentV +agentv eval run evals.json + +# Or in agent mode (no API keys) +agentv prompt eval evals.json +``` + +AgentV automatically: +- Promotes `prompt` → input messages +- Promotes `expected_output` → reference answer +- Converts `assertions` → LLM-judge evaluators +- Resolves `files[]` paths relative to the evals.json directory + +## What You Gain + +Moving from skill-creator's eval loop to AgentV's lifecycle skill gives you: + +| Capability | skill-creator | AgentV lifecycle skill | +|-----------|---------------|----------------------| +| Workspace isolation | ❌ | ✅ Clone repos, run setup/teardown scripts | +| Code judges | ❌ | ✅ Python/TypeScript evaluator scripts via `defineCodeJudge()` | +| Tool trajectory scoring | ❌ | ✅ Evaluate tool call sequences | +| Multi-provider comparison | with-skill vs without-skill | N-way: Claude, GPT, Copilot, Gemini, custom CLI | +| Multi-turn evaluation | ❌ | ✅ Conversation tracking with `conversation_id` | +| Blind comparison | ❌ | ✅ Judge doesn't know which is baseline | +| Deterministic upgrade suggestions | ❌ | ✅ LLM-judge → contains/regex/is-json | +| Human review checkpoint | ❌ | ✅ Structured feedback gate | +| Workspace file tracking | ❌ | ✅ Evaluate by diffing workspace files | +| Agent mode (no API keys) | ❌ | ✅ Uses eval-candidate + eval-judge agents | + +## Artifact Compatibility + +AgentV's companion artifacts are compatible with skill-creator's eval-viewer: + +| Artifact | Format | Compatible with eval-viewer | +|----------|--------|---------------------------| +| `grading.json` | Per-assertion evidence with claims | ✅ Superset of skill-creator's grading format | +| `benchmark.json` | Aggregate pass rates, timing, patterns | ✅ Superset of Agent Skills benchmark format | +| Results JSONL | Per-test results | ✅ Standard JSONL format | + +AgentV's schemas are supersets — they include all fields skill-creator expects, plus additional fields (claims extraction, pattern analysis, deterministic upgrade candidates). Tools that read skill-creator artifacts will read AgentV artifacts correctly, ignoring the extra fields. + +## Graduating to EVAL.yaml + +When evals.json becomes limiting, convert to EVAL.yaml for the full feature set: + +```bash +# Convert evals.json to EVAL.yaml +agentv convert evals.json + +# Edit the generated YAML to add workspace config, code judges, etc. +# Then run with the full lifecycle +agentv prompt eval evals.eval.yaml +``` + +EVAL.yaml unlocks: +- **Workspace setup/teardown** — clone repos, install dependencies, clean up after tests +- **Code judges** — write evaluators in Python or TypeScript, not just LLM prompts +- **Rubric-based grading** — multi-dimensional scoring with weighted criteria +- **Retry policies** — automatic retries for flaky tests with configurable backoff +- **Test groups** — organize tests by category with shared config +- **Multi-turn conversations** — test agent interactions across multiple turns + +## What Stays in Skill-Creator + +AgentV does NOT replace these skill-creator capabilities: + +- **Trigger optimization** — optimizing when/how a skill is triggered +- **.skill packaging** — bundling skills for distribution +- **Skill authoring** — creating new SKILL.md files from scratch +- **Skill discovery** — finding and installing skills + +AgentV focuses on the **evaluation and optimization loop**. Skill-creator focuses on **skill authoring and packaging**. They are complementary — use skill-creator to write the skill, use AgentV to evaluate and optimize it. + +## Example Workflow + +``` +1. Author a skill with skill-creator +2. skill-creator generates evals.json +3. Run evals.json through AgentV's lifecycle skill for richer evaluation: + - Workspace isolation (test in a real repo) + - Multi-provider comparison (does the skill work with GPT too?) + - Blind comparison (is the new version actually better?) + - Deterministic upgrades (replace vague LLM judges with precise checks) +4. Use AgentV's optimization loop to refine the skill's prompts +5. Return to skill-creator for packaging and distribution +```