
Feat: support custom judges in transpile to evals.json #610

@christso

Description

Objective

Support custom judges (code-judge assertions and judge types discovered from .agentv/judges/) in the EVAL.yaml → evals.json transpiler by emitting executable NL instruction strings and adding composable CLI commands.

Strategy by judge type:

  • Code judges (deterministic, zero cost) → execute them directly via new agentv eval assert subcommand
  • LLM judges (expensive) → surface their criteria as NL for the grader agent (no change needed)

Deliverable 1: agentv eval assert <name>

New CLI command that runs a single code-judge assertion from .agentv/judges/. Makes agentv's judge system composable for external consumers.

# Inline mode
agentv eval assert trigger-judge --agent-output "agent said this" --agent-input "user asked this"

# File mode — reads JSON with { output, input }
agentv eval assert trigger-judge --file result.json

Behavior:

  1. Resolve judge name → script path via .agentv/judges/ discovery (walks up directories)
  2. Build evaluation payload from --agent-output/--agent-input or --file
  3. Execute script via executeScript, passing full CodeEvaluator-compatible payload via stdin
  4. Print raw script JSON to stdout: { "score": 0-1, "reasoning": "..." }
  5. Exit code: 0 on score >= 0.5, 1 otherwise
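
The steps above can be sketched as follows. This is a minimal illustration of the score/exit-code contract only; `JudgeResult` and `exitCodeForResult` are names invented for this sketch, not the actual implementation.

```typescript
// Shape of the JSON a code-judge script prints to stdout, per this spec.
interface JudgeResult {
  score: number;     // 0–1
  reasoning: string;
}

// Exit-code rule from step 5: pass (0) when score >= threshold, else fail (1).
// The 0.5 default is the proposed threshold; "Design latitude" notes it may change.
function exitCodeForResult(result: JudgeResult, threshold = 0.5): number {
  return result.score >= threshold ? 0 : 1;
}

const sample: JudgeResult = { score: 0.8, reasoning: "output matches expected format" };
console.log(JSON.stringify(sample)); // step 4: print raw script JSON to stdout
process.exitCode = exitCodeForResult(sample);
```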

Input format (--file):

{
  "output": "the agent's full response text",
  "input": "the original user prompt"
}
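
A sketch of validating that --file input, assuming strict rejection of anything beyond string `output`/`input` fields (the v1 scope stated under Non-goals); `parseFileInput` is a hypothetical helper name.

```typescript
// The --file payload defined in this spec: { output, input }, both strings.
interface AssertFileInput {
  output: string; // the agent's full response text
  input: string;  // the original user prompt
}

function parseFileInput(raw: string): AssertFileInput {
  const data = JSON.parse(raw) as Partial<AssertFileInput>;
  if (typeof data.output !== "string" || typeof data.input !== "string") {
    throw new Error("--file JSON must contain string fields { output, input }");
  }
  return { output: data.output, input: data.input };
}

console.log(parseFileInput('{"output":"agent said this","input":"user asked this"}'));
```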

Deliverable 2: agentv eval prompt eval --grading-brief

Outputs a human-readable summary of grading criteria for a specific test, with type-prefixed assertion tags.

agentv eval prompt eval --grading-brief evals/dataset.eval.yaml --test-id csv-top-months

Output:

Input: "I have a CSV of monthly sales data. Find the top 3 months by revenue."
Expected: "The top 3 months by revenue are November ($22,500), September ($20,100), and December ($19,400)."
Criteria:
  - Output contains '$22,500'
  - [llm-judge] The answer is clear and concise
  - [code-judge] format-checker: Validates output CSV format
  - [skill-trigger] should_trigger: true for csv-analyzer

Behavior:

  1. Parse EVAL.yaml, find test by --test-id
  2. Extract input, expected_output, all assertions
  3. Render criteria with type prefixes: [llm-judge], [code-judge], [skill-trigger]
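
Step 3 could look roughly like this; the `Assertion` union is an assumption reconstructed from the EVAL.yaml example later in this issue, and the line formats mirror the sample output above.

```typescript
// Assumed assertion shapes, based on the EVAL.yaml example in this issue.
type Assertion =
  | { type: "contains"; value: string }
  | { type: "llm-judge"; prompt: string }
  | { type: "code-judge"; name: string; description: string }
  | { type: "skill-trigger"; skill: string; should_trigger: boolean };

// Render one criteria line, prefixing judge-backed types per the spec.
function renderCriterion(a: Assertion): string {
  switch (a.type) {
    case "contains":
      return `- Output contains '${a.value}'`;
    case "llm-judge":
      return `- [llm-judge] ${a.prompt}`;
    case "code-judge":
      return `- [code-judge] ${a.name}: ${a.description}`;
    case "skill-trigger":
      return `- [skill-trigger] should_trigger: ${a.should_trigger} for ${a.skill}`;
  }
}

console.log(renderCriterion({ type: "contains", value: "$22,500" }));
```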

Deliverable 3: Transpiler Changes

File: packages/core/src/evaluation/loaders/eval-yaml-transpiler.ts

assertions stays string[] — no schema change. Code judge assertions become NL instruction strings telling the grader agent to shell out.

Code judge assertion format:

Run `agentv eval assert <judge-name> --agent-output <agent_output> --agent-input <original_prompt>` and check the result. This judge: <description>. The command accepts --agent-output (the agent's full response text) and --agent-input (the original user prompt). It returns JSON on stdout: {"score": 0-1, "reasoning": "..."}. A score >= 0.5 means pass (exit 0); below 0.5 means fail (exit 1).

Logic changes in assertionToNaturalLanguage:

| Assertion type | Current behavior | New behavior |
| --- | --- | --- |
| code-judge / code_judge | "{name}: {description}" | NL instruction with agentv eval assert command |
| Unknown type WITH command | "{type} assertion" | Same NL instruction format |
| Unknown type WITHOUT command | Fallback to criteria/prompt | No change |
| llm-judge | Prompt string verbatim | No change |
| All other built-in types | NL conversion | No change |
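
The new branching, as a sketch. The `RawAssertion` shape and helper name are assumptions drawn from this issue, not the actual transpiler source; the instruction wording follows the format specified above.

```typescript
// Loose assertion shape as it arrives from parsed EVAL.yaml (assumed).
interface RawAssertion {
  type: string;
  name?: string;
  description?: string;
  command?: string[];
  criteria?: string;
  prompt?: string;
}

// Builds the executable NL instruction defined in this issue.
function codeJudgeInstruction(name: string, description: string): string {
  return (
    `Run \`agentv eval assert ${name} --agent-output <agent_output> ` +
    `--agent-input <original_prompt>\` and check the result. ` +
    `This judge: ${description}. The command accepts --agent-output ` +
    `(the agent's full response text) and --agent-input (the original ` +
    `user prompt). It returns JSON on stdout: {"score": 0-1, ` +
    `"reasoning": "..."}. A score >= 0.5 means pass (exit 0); ` +
    `below 0.5 means fail (exit 1).`
  );
}

function assertionToNaturalLanguage(a: RawAssertion): string {
  const isCodeJudge = a.type === "code-judge" || a.type === "code_judge";
  if (isCodeJudge || a.command !== undefined) {
    // Code judges, and unknown types WITH a command, get the NL instruction.
    return codeJudgeInstruction(a.name ?? a.type, a.description ?? "");
  }
  if (a.type === "llm-judge" && a.prompt !== undefined) {
    return a.prompt; // prompt string passed through verbatim
  }
  // Unknown type WITHOUT command: fall back to criteria, then a generic tag.
  return a.criteria ?? `${a.type} assertion`;
}

console.log(
  assertionToNaturalLanguage({
    type: "code-judge",
    name: "format-checker",
    description: "Validates output CSV format",
  })
);
```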

Example — EVAL.yaml:

assertions:
  - type: skill-trigger
    skill: csv-analyzer
    should_trigger: true
  - type: rubrics
    criteria: "Output identifies November as highest revenue month"
  - type: contains
    value: "$22,500"
  - type: code-judge
    name: format-checker
    description: "Validates output CSV format"
    command: ["bun", "run", ".agentv/judges/format-checker.ts"]
  - type: llm-judge
    prompt: "The answer is clear and concise"

Example — evals.json output:

{
  "id": 1,
  "prompt": "...",
  "should_trigger": true,
  "assertions": [
    "Output identifies November as highest revenue month",
    "Output contains '$22,500'",
    "Run `agentv eval assert format-checker --agent-output <agent_output> --agent-input <original_prompt>` and check the result. This judge: Validates output CSV format. The command accepts --agent-output (the agent's full response text) and --agent-input (the original user prompt). It returns JSON on stdout: {\"score\": 0-1, \"reasoning\": \"...\"}. A score >= 0.5 means pass (exit 0); below 0.5 means fail (exit 1).",
    "The answer is clear and concise"
  ]
}

Acceptance Signals

  • agentv eval assert <name> discovers and executes a judge from .agentv/judges/, returns score JSON
  • agentv eval prompt eval --grading-brief outputs full grading brief with typed criteria
  • Transpiler emits executable NL instruction strings for code judges
  • Transpiler emits NL criteria strings for LLM judges (unchanged)
  • assertions remains string[] — no schema breaking change
  • Existing transpiler tests continue to pass

Non-goals

  • Running LLM judges from the transpiled output (too expensive; use criteria as NL)
  • Changing run_eval.py consumer — it ignores assertion strings it doesn't understand
  • Supporting --file format beyond { output, input } in v1

Design latitude

  • Exit code threshold (0.5) can be adjusted if a different default makes more sense
  • --grading-brief output format is flexible as long as it's human-readable and includes type prefixes
  • NL instruction wording for code judge assertions can be refined
