feat(evals): set default targets so all evals work out of the box by christso · Pull Request #898 · EntityProcess/agentv

christso · 2026-04-01T09:51:46Z

Summary

Every eval file under examples/ and evals/ now declares its own target, so agentv eval run works without a global --target flag
Added copilot, vscode, and copilot-log targets to root .agentv/targets.yaml so matrix evals and specialized evals resolve correctly
Updated the CI workflow (evals.yml) to discover all eval files by default and make --target optional — each eval uses its own target

Details

17 eval files gained target: default (LLM-only evals)
1 eval file gained target: copilot-log (copilot transcript evaluation)
Fixed invalid name field in benchmark-tooling eval (spaces → kebab-case)
Workflow default patterns now cover evals/**/*.eval.yaml, examples/**/*.eval.yaml, examples/**/*.EVAL.yaml, examples/**/EVAL.yaml

Test plan

bun run validate:examples — all 53 example evals valid
Dry-run all 278 tests across all eval files — no target resolution errors
Pre-push hooks pass (build, typecheck, lint, test, validate)

🤖 Generated with Claude Code

cloudflare-workers-and-pages · 2026-04-01T09:52:20Z

Deploying agentv with Cloudflare Pages

Latest commit:	`707761b`
Status:	✅ Deploy successful!
Preview URL:	https://dee5d456.agentv.pages.dev
Branch Preview URL:	https://feat-default-targets.agentv.pages.dev

View logs

Every eval file under examples/ and evals/ now declares its own target, so running `agentv eval run` no longer requires a global --target flag. This lets the CI workflow run all evals without forcing a single target (like copilot-cli) that may not suit every eval. Changes: - Add `target: default` to 17 eval files that were missing a target - Add `target: copilot-log` to the copilot-log eval - Add copilot, vscode, and copilot-log targets to root targets.yaml - Update evals.yml workflow: default patterns cover all eval files, --target is now optional (each eval uses its own) - Fix invalid name in benchmark-tooling eval (spaces → kebab-case) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Every eval file now declares its own target: - `target: default` — LLM-only evals (grading, text generation) - `target: agent` — coding agent evals (env-var-driven via AGENT_PROVIDER + AGENT_MODEL, defaults to copilot-cli) - Specialized targets (mock_agent, copilot-log, batch_cli, etc.) resolve via per-example .agentv/targets.yaml Added env-var-driven `agent` target to root targets.yaml so CI and local dev can control which coding agent runs without editing eval files. Tags: - `tags: [agent]` on evals requiring a coding agent or infrastructure - `tags: [multi-provider]` on multi-model-benchmark (excluded from CI) Workflow changes: - Default patterns discover all eval files across examples/ and evals/ - --target is now optional (each eval uses its own) - AGENT_PROVIDER/AGENT_MODEL written to .env for agent target resolution - Multi-model-benchmark excluded from default CI sweep Other fixes: - Removed deprecated vscode target references - Fixed invalid name in benchmark-tooling eval (spaces → kebab-case) - Converted matrix-evaluation from multi-target to single agent target Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The `default` target in root targets.yaml now resolves via AGENT_PROVIDER + AGENT_MODEL env vars (defaults to copilot-cli in CI). Evals without an explicit target automatically use default, so no target field is needed. Evals with specialized targets (copilot-log, batch_cli, mock_agent, etc.) keep their explicit `execution.target` — these resolve via per-example .agentv/targets.yaml files. Tags: - `tags: [agent]` on evals requiring a coding agent or infrastructure - `tags: [multi-provider]` on multi-model-benchmark (excluded from CI) Workflow: - Default patterns discover all eval files - --target is optional (each eval uses its own or falls back to default) - AGENT_PROVIDER/AGENT_MODEL written to .env - Only multi-model-benchmark excluded from default CI sweep Other: - Removed deprecated vscode target references - Converted matrix-evaluation from multi-target to single default target - Fixed invalid name in benchmark-tooling eval Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The CLI doesn't support !glob negation. List showcase subdirectories explicitly, excluding only multi-model-benchmark. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Patterns prefixed with ! are now treated as exclusions, passed to fast-glob's ignore option. This lets CI workflows exclude specific eval directories: agentv eval run 'examples/**/*.eval.yaml' '!examples/showcase/multi-model-benchmark/**' Updated the evals workflow to use this instead of explicit include lists. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The explicit --targets flag forces the root targets.yaml and prevents per-example targets (batch_cli, mock_agent, etc.) from being found. Let the CLI auto-discover targets.yaml by walking up from each eval file. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The workspace_template field was removed from target definitions. These mock targets relied on it but the eval files already define workspace.template at the eval level. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The psychotherapy evals use target: gemini-llm which needs GOOGLE_GENERATIVE_AI_API_KEY and GEMINI_MODEL_NAME. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Added `llm` target to root targets.yaml (GH Models, no agent binary) - LLM-only evals now set `execution.target: llm` - Agent evals omit target (falls back to default = copilot via env vars) - export-screening uses its per-example mock target (no change needed) - Added pi-cli install to CI workflow - Added Gemini credentials to CI .env Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Changed agent-plugin-review from pi-cli to default target (copilot). Added OPENROUTER credentials to CI .env for evals that need them. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

agent-skills-evals (missing echo.ts), batch-cli (custom runner script), code-grader-sdk and local-cli (need uv + mock_cli.py) all require local setup that isn't available on the CI runner. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Created .agentv/providers/echo.ts for agent-skills-evals (was never committed — convention-based provider that echoes input back) - Installed uv on CI runner so local-cli and code-grader-sdk evals can run their Python mock scripts - Removed CI exclusions for local script evals (all deps now available) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Strengthened system prompts so assertions pass with gpt-5-mini: - JSON evals: explicit "no markdown, no code blocks, raw JSON only" - equals evals: "respond with ONLY the number, nothing else" - starts-with evals: "you MUST start every response with X" - icontains-all evals: system prompt lists required phrases - Removed expected_output where it served no assertion purpose - Changed azure-llm override in basic eval to llm target Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

GH Models rate limits (429) were failing most LLM evals. OpenRouter has higher rate limits and built-in provider fallback. Also excluded code-grader-sdk from CI (needs Azure keys in its per-example targets.yaml). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Per-example targets.yaml files referenced azure-llm or azure_grader as grader targets, requiring Azure API keys. Switched to the root `grader` target (now OpenRouter) so all evals work with a single OPENROUTER_API_KEY. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ching Targets can now use `alias` to redirect to another named target: - name: default alias: ${{ AGENT_TARGET }} # e.g. "copilot-cli" or "claude" provider: mock # placeholder, alias takes precedence Setting AGENT_TARGET=copilot-cli makes `default` resolve to the full copilot-cli target definition (provider, model, auth, grader_target). Switching to claude is just AGENT_TARGET=claude — no config changes. This sets precedent for eval frameworks: one env var switches the entire provider config, unlike promptfoo/LiteLLM which require per-field parameterization that breaks across different auth shapes. Implementation: - Added `alias` field to TargetDefinition interface and BASE_TARGET_SCHEMA - resolveAlias() in CLI follows alias chains (max 5 depth, cycle-safe) - Supports ${{ ENV_VAR }} syntax in alias values - Updated root targets.yaml: default now aliases to AGENT_TARGET - Replaced AGENT_PROVIDER/AGENT_MODEL with single AGENT_TARGET env var Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Targets can delegate to another named target via use_target: - name: default use_target: ${{ AGENT_TARGET }} provider: mock Setting AGENT_TARGET=copilot-cli makes default resolve to the full copilot-cli definition. Consistent with grader_target naming convention. Snake_case only — no camelCase variant (YAML convention). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Both llm and grader now delegate via use_target: ${{ GRADER_TARGET }} instead of hardcoding openrouter. Switch grader provider with one env var: GRADER_TARGET=openrouter or GRADER_TARGET=gemini-llm. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Targets with use_target delegate to another target and don't need their own provider. Removed redundant provider: mock from delegation targets in root targets.yaml. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Updated both the Zod schema (BASE_TARGET_SCHEMA) and the targets validator to accept targets without a provider field when use_target handles delegation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Third place that validated provider as required. This is exactly the brittle duplication that #909 will fix. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

before_all hook crashes entire eval run when workspace-setup.mjs fails. copilot-log-eval also needs copilot session files on disk. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

When a before_all hook fails, mark all tests in that eval file as setup errors and continue running remaining eval files. Previously the entire eval run would abort. Closes #910 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The orchestrator's resolveTargetByName() now follows use_target chains before calling resolveTargetDefinition(). This fixes grader resolution when the grader target uses use_target delegation (e.g., grader → GRADER_TARGET → openrouter). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- file-changes, file-changes-graders, functional-grading: added workspace.template to eval files (was previously in target config via the now-removed workspace_template field) - agent-skills-evals: removed broken echo provider — these evals need a real agent (skill-trigger), so they use root default target Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

batch-cli: batch output format mismatch (#911) file-changes-graders: workspace cwd not preserved on retries (#912) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- offline-grader-benchmark: switched grader_target from azure to root grader - file-changes: rm -f instead of rm for idempotent retries - cross-repo-sync: excluded from CI (needs tsx package) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Verbose output was truncating the eval summary. JUnit file wasn't being generated — make that step continue-on-error so it doesn't fail the overall run. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Short flag -o may conflict with positional arg parsing when many glob patterns expand. Use explicit --output flag. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Created scripts/ci-summary.ts that reads JSONL results and outputs markdown with pass rate, mean score, stddev, per-suite breakdown, and collapsible details for failures and errors. Inspired by WiseTechGlobal/sdd#26 ci-summary pattern, ported to TS. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

These azure/openrouter grader definitions were causing warnings and are no longer needed — fixture_replay now uses root grader. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The curl installer was producing corrupted binaries. npm install @github/copilot is more reliable and version-pinnable. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot's runtime package blob may require Node 22+. The default ubuntu-latest runner ships Node 20 which causes SyntaxError on the downloaded index.js. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The tee pipe was truncating output — summary never appeared. Temporarily limit to 2 eval sets to verify summary prints. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

rubrics assertion requires criteria as array, not string. Also relaxed contains to icontains for case-insensitive matching. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

skill-trigger is the whole point of agent-skills-evals. Copilot-cli doesn't reliably trigger custom skills, so these evals are tagged [agent, skill-trigger] and excluded from default CI patterns. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…plate The csv-analyzer eval was failing skill-trigger because: 1. The csv-analyzer skill was missing from the workspace template 2. The eval had no workspace: block so the agent couldn't see skills Added csv-analyzer SKILL.md to .claude/, .agents/, .github/ skill directories and added workspace: template: workspace/ to the eval. Verified locally: 1.000 PASS with all assertions including skill-trigger. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Non-deterministic skill-trigger results need log inspection. Added .agentv/logs/ to artifact upload. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The skill now contains a "seasonal weighted revenue formula" that the agent must apply. Without reading the skill, the agent would report raw revenue — which fails the rubrics and icontains assertions. This ensures skill-trigger passes reliably: the agent must read the skill to answer correctly. Verified 3/3 passes locally. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

christso and others added 23 commits April 1, 2026 22:08

fix(ci): use explicit include patterns instead of negated globs

d2102dc

The CLI doesn't support !glob negation. List showcase subdirectories explicitly, excluding only multi-model-benchmark. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(ci): add Gemini credentials to workflow .env

1191250

The psychotherapy evals use target: gemini-llm which needs GOOGLE_GENERATIVE_AI_API_KEY and GEMINI_MODEL_NAME. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(evals): use default (copilot) instead of pi-cli for agent evals

b2c6a78

Changed agent-plugin-review from pi-cli to default target (copilot). Added OPENROUTER credentials to CI .env for evals that need them. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chore(ci): increase eval workers from 1 to 3

0b04cf9

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(core): allow provider to be omitted when use_target is set

fddd943

Updated both the Zod schema (BASE_TARGET_SCHEMA) and the targets validator to accept targets without a provider field when use_target handles delegation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(core): allow use_target in targets-file.ts parser

3c39f70

Third place that validated provider as required. This is exactly the brittle duplication that #909 will fix. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(ci): exclude copilot-log-eval from CI

7650b51

before_all hook crashes entire eval run when workspace-setup.mjs fails. copilot-log-eval also needs copilot session files on disk. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

christso force-pushed the feat/default-targets branch from a71b297 to 7650b51 Compare April 1, 2026 22:10

christso and others added 5 commits April 1, 2026 22:14

fix(ci): exclude evals with pre-existing workspace/batch bugs

595fc16

batch-cli: batch output format mismatch (#911) file-changes-graders: workspace cwd not preserved on retries (#912) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

christso and others added 6 commits April 1, 2026 23:51

fix(ci): use --output instead of -o for JUnit path

2cf1004

Short flag -o may conflict with positional arg parsing when many glob patterns expand. Use explicit --output flag. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: remove unused grader targets from offline-grader-benchmark

29ea7c1

These azure/openrouter grader definitions were causing warnings and are no longer needed — fixture_replay now uses root grader. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(ci): use npm package for copilot CLI instead of curl installer

a852e0b

The curl installer was producing corrupted binaries. npm install @github/copilot is more reliable and version-pinnable. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(ci): add Node 22 for copilot CLI compatibility

d8c9f8d

Copilot's runtime package blob may require Node 22+. The default ubuntu-latest runner ships Node 20 which causes SyntaxError on the downloaded index.js. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

christso force-pushed the feat/default-targets branch 3 times, most recently from eafd27b to e212546 Compare April 2, 2026 01:27

debug(ci): remove tee pipe and limit to 2 eval sets for debugging

17431c2

The tee pipe was truncating output — summary never appeared. Temporarily limit to 2 eval sets to verify summary prints. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

christso force-pushed the feat/default-targets branch from e212546 to 17431c2 Compare April 2, 2026 01:29

christso and others added 5 commits April 2, 2026 01:35

fix(evals): fix csv-analyzer rubrics criteria format

99c2f33

rubrics assertion requires criteria as array, not string. Also relaxed contains to icontains for case-insensitive matching. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(ci): include copilot logs in artifacts for debugging

61c1b74

Non-deterministic skill-trigger results need log inspection. Added .agentv/logs/ to artifact upload. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

christso marked this pull request as ready for review April 2, 2026 02:16

christso merged commit de04689 into main Apr 2, 2026
6 checks passed

christso deleted the feat/default-targets branch April 2, 2026 02:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(evals): set default targets so all evals work out of the box#898

feat(evals): set default targets so all evals work out of the box#898
christso merged 40 commits intomainfrom
feat/default-targets

christso commented Apr 1, 2026

Uh oh!

cloudflare-workers-and-pages bot commented Apr 1, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

christso commented Apr 1, 2026

Summary

Details

Test plan

Uh oh!

cloudflare-workers-and-pages bot commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deploying agentv with Cloudflare Pages

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

cloudflare-workers-and-pages bot commented Apr 1, 2026 •

edited

Loading