Feat: support custom judges in transpile to evals.json #610
Description
Objective
Support custom judges (code-judge and .agentv/judges/ discovered types) in the EVAL.yaml → evals.json transpiler by emitting executable NL instruction strings and adding composable CLI commands.
Strategy by judge type:
- Code judges (deterministic, zero cost) → execute them directly via a new `agentv eval assert` subcommand
- LLM judges (expensive) → surface their criteria as NL for the grader agent (no change needed)
Deliverable 1: `agentv eval assert <name>`
New CLI command that runs a single code-judge assertion from .agentv/judges/. Makes agentv's judge system composable for external consumers.
```shell
# Inline mode
agentv eval assert trigger-judge --agent-output "agent said this" --agent-input "user asked this"

# File mode — reads JSON with { output, input }
agentv eval assert trigger-judge --file result.json
```

Behavior:
- Resolve judge name → script path via `.agentv/judges/` discovery (walks up directories)
- Build the evaluation payload from `--agent-output`/`--agent-input` or `--file`
- Execute the script via `executeScript`, passing a full CodeEvaluator-compatible payload via stdin
- Print the raw script JSON to stdout: `{ "score": 0-1, "reasoning": "..." }`
- Exit code: 0 on score >= 0.5, 1 otherwise
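The behavior above reduces to a few small pieces. This is a minimal sketch with hypothetical helper names (`buildPayload`, `exitCodeFor`), not the real CLI wiring:

```typescript
// Shape of the JSON a judge script prints to stdout.
interface JudgeResult {
  score: number;     // 0-1
  reasoning: string;
}

// Build the CodeEvaluator-compatible payload written to the script's stdin.
// (Hypothetical helper; field names mirror the --file format below.)
function buildPayload(agentOutput: string, agentInput: string): { output: string; input: string } {
  return { output: agentOutput, input: agentInput };
}

// Map the judge's score to the process exit code: pass at score >= 0.5.
function exitCodeFor(result: JudgeResult): number {
  return result.score >= 0.5 ? 0 : 1;
}
```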
Input format (`--file`):

```json
{
  "output": "the agent's full response text",
  "input": "the original user prompt"
}
```

Deliverable 2: `agentv eval prompt eval --grading-brief`
Outputs a human-readable summary of grading criteria for a specific test, with type-prefixed assertion tags.
```shell
agentv eval prompt eval --grading-brief evals/dataset.eval.yaml --test-id csv-top-months
```

Output:

```
Input: "I have a CSV of monthly sales data. Find the top 3 months by revenue."
Expected: "The top 3 months by revenue are November ($22,500), September ($20,100), and December ($19,400)."

Criteria:
- Output contains '$22,500'
- [llm-judge] The answer is clear and concise
- [code-judge] format-checker: Validates output CSV format
- [skill-trigger] should_trigger: true for csv-analyzer
```
Behavior:
- Parse EVAL.yaml, find the test by `--test-id`
- Extract input, expected_output, and all assertions
- Render criteria with type prefixes: `[llm-judge]`, `[code-judge]`, `[skill-trigger]`
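A criterion renderer matching the brief above could look like this. It is a sketch only; the assertion field names (`value`, `prompt`, `skill`, `should_trigger`) are assumed from the EVAL.yaml example later in this issue:

```typescript
type Assertion = {
  type: string;
  value?: string;           // contains
  prompt?: string;          // llm-judge
  name?: string;            // code-judge
  description?: string;     // code-judge
  skill?: string;           // skill-trigger
  should_trigger?: boolean; // skill-trigger
};

// Render one grading-brief line; non-trivial types get a [type] prefix.
function renderCriterion(a: Assertion): string {
  switch (a.type) {
    case "contains":
      return `Output contains '${a.value}'`;
    case "llm-judge":
      return `[llm-judge] ${a.prompt}`;
    case "code-judge":
      return `[code-judge] ${a.name}: ${a.description}`;
    case "skill-trigger":
      return `[skill-trigger] should_trigger: ${a.should_trigger} for ${a.skill}`;
    default:
      return `[${a.type}] (unrecognized assertion)`;
  }
}
```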
Deliverable 3: Transpiler Changes
File: packages/core/src/evaluation/loaders/eval-yaml-transpiler.ts
`assertions` stays `string[]` — no schema change. Code judge assertions become NL instruction strings telling the grader agent to shell out.
Code judge assertion format:

```
Run `agentv eval assert <judge-name> --agent-output <agent_output> --agent-input <original_prompt>` and check the result. This judge: <description>. The command accepts --agent-output (the agent's full response text) and --agent-input (the original user prompt). It returns JSON on stdout: {"score": 0-1, "reasoning": "..."}. A score >= 0.5 means pass (exit 0); below 0.5 means fail (exit 1).
```
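As a sketch, this template can be produced by a small builder (hypothetical function name; the real transpiler may structure this differently):

```typescript
// Build the NL instruction emitted for a code-judge assertion.
function codeJudgeInstruction(judgeName: string, description: string): string {
  return (
    `Run \`agentv eval assert ${judgeName} --agent-output <agent_output> ` +
    `--agent-input <original_prompt>\` and check the result. ` +
    `This judge: ${description}. The command accepts --agent-output ` +
    `(the agent's full response text) and --agent-input (the original user prompt). ` +
    `It returns JSON on stdout: {"score": 0-1, "reasoning": "..."}. ` +
    `A score >= 0.5 means pass (exit 0); below 0.5 means fail (exit 1).`
  );
}
```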
Logic changes in assertionToNaturalLanguage:
| Assertion type | Current behavior | New behavior |
|---|---|---|
| `code-judge` / `code_judge` | `"{name}: {description}"` | NL instruction with `agentv eval assert` command |
| Unknown type WITH command | `"{type} assertion"` | Same NL instruction format |
| Unknown type WITHOUT command | fallback to criteria/prompt | No change |
| `llm-judge` | Prompt string verbatim | No change |
| All other built-in types | NL conversion | No change |
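The table rows can be sketched as a dispatch. This is an assumed shape; the real function lives in `eval-yaml-transpiler.ts`, the instruction builder is abbreviated, and built-in types other than the ones shown (e.g. `skill-trigger`, which maps to the top-level `should_trigger` field) are elided:

```typescript
interface Assertion {
  type: string;
  name?: string;
  description?: string;
  command?: string[];
  prompt?: string;
  criteria?: string;
  value?: string;
}

const BUILT_IN = new Set(["contains", "rubrics", "llm-judge", "skill-trigger", "code-judge"]);

// Abbreviated stand-in for the full instruction template shown above.
function nlInstruction(name: string, description: string): string {
  return `Run \`agentv eval assert ${name} ...\` and check the result. This judge: ${description}.`;
}

function assertionToNaturalLanguage(a: Assertion): string {
  const type = a.type.replace(/_/g, "-"); // code_judge → code-judge
  // New: code judges, and unknown types that carry a command, become
  // executable NL instructions.
  if (type === "code-judge" || (!BUILT_IN.has(type) && a.command)) {
    return nlInstruction(a.name ?? type, a.description ?? "");
  }
  if (type === "llm-judge") return a.prompt ?? ""; // verbatim (unchanged)
  if (type === "contains") return `Output contains '${a.value}'`;
  // Unknown type without a command: fall back to criteria/prompt (unchanged).
  return a.criteria ?? a.prompt ?? `${a.type} assertion`;
}
```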
Example — EVAL.yaml:

```yaml
assertions:
  - type: skill-trigger
    skill: csv-analyzer
    should_trigger: true
  - type: rubrics
    criteria: "Output identifies November as highest revenue month"
  - type: contains
    value: "$22,500"
  - type: code-judge
    name: format-checker
    description: "Validates output CSV format"
    command: ["bun", "run", ".agentv/judges/format-checker.ts"]
  - type: llm-judge
    prompt: "The answer is clear and concise"
```

Example — evals.json output:
```json
{
  "id": 1,
  "prompt": "...",
  "should_trigger": true,
  "assertions": [
    "Output identifies November as highest revenue month",
    "Output contains '$22,500'",
    "Run `agentv eval assert format-checker --agent-output <agent_output> --agent-input <original_prompt>` and check the result. This judge: Validates output CSV format. The command accepts --agent-output (the agent's full response text) and --agent-input (the original user prompt). It returns JSON on stdout: {\"score\": 0-1, \"reasoning\": \"...\"}. A score >= 0.5 means pass (exit 0); below 0.5 means fail (exit 1).",
    "The answer is clear and concise"
  ]
}
```

Acceptance Signals
- `agentv eval assert <name>` discovers and executes a judge from `.agentv/judges/`, returns score JSON
- `agentv eval prompt eval --grading-brief` outputs the full grading brief with typed criteria
- Transpiler emits executable NL instruction strings for code judges
- Transpiler emits NL criteria strings for LLM judges (unchanged)
- `assertions` remains `string[]` — no breaking schema change
- Existing transpiler tests continue to pass
Non-goals
- Running LLM judges from the transpiled output (too expensive; use criteria as NL)
- Changing the `run_eval.py` consumer — it ignores assertion strings it doesn't understand
- Supporting a `--file` format beyond `{ output, input }` in v1
Design latitude
- Exit code threshold (0.5) can be adjusted if a different default makes more sense
- `--grading-brief` output format is flexible as long as it's human-readable and includes type prefixes
- NL instruction wording for code judge assertions can be refined
Related
- feat: create EVAL.yaml to evals.json transpiler #598 — EVAL.yaml → evals.json transpiler (parent, closed)
- feat(providers,evaluators): claude-cli provider + trigger-judge evaluator #597 — claude-cli provider + trigger-judge code-judge
- feat: promote skill-trigger to built-in evaluator (rename from trigger-judge) #609 — promote skill-trigger to built-in