
Feat: support custom judges in transpile to evals.json #610

@christso

Description

Objective

Support custom judges (code-judge assertions and judge types discovered from .agentv/judges/) in the EVAL.yaml → evals.json transpiler by emitting executable NL instruction strings and adding composable CLI commands.

Strategy by judge type:

  • Code judges (deterministic, zero cost) → execute them directly via new agentv eval assert subcommand
  • LLM judges (expensive) → surface their criteria as NL for the grader agent (no change needed)

Deliverable 1: agentv eval assert <name>

New CLI command that runs a single code-judge assertion from .agentv/judges/. Makes agentv's judge system composable for external consumers.

# Inline mode
agentv eval assert trigger-judge --agent-output "agent said this" --agent-input "user asked this"

# File mode — reads JSON with { output, input }
agentv eval assert trigger-judge --file result.json

Behavior:

  1. Resolve judge name → script path via .agentv/judges/ discovery (walks up directories)
  2. Build evaluation payload from --agent-output/--agent-input or --file
  3. Execute script via executeScript, passing full CodeEvaluator-compatible payload via stdin
  4. Print raw script JSON to stdout: { "score": 0-1, "reasoning": "..." }
  5. Exit code: 0 on score >= 0.5, 1 otherwise
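
The steps above can be sketched as follows. This is a minimal illustration of the score/exit-code contract only; `JudgeResult` and `exitCodeForResult` are names invented for this sketch, not the actual implementation.

```typescript
// Shape of the JSON a code-judge script prints to stdout, per this spec.
interface JudgeResult {
  score: number;     // 0–1
  reasoning: string;
}

// Exit-code rule from step 5: pass (0) when score >= threshold, else fail (1).
// The 0.5 default is the proposed threshold; "Design latitude" notes it may change.
function exitCodeForResult(result: JudgeResult, threshold = 0.5): number {
  return result.score >= threshold ? 0 : 1;
}

const sample: JudgeResult = { score: 0.8, reasoning: "output matches expected format" };
console.log(JSON.stringify(sample)); // step 4: print raw script JSON to stdout
process.exitCode = exitCodeForResult(sample);
```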

Input format (--file):

{
  "output": "the agent's full response text",
  "input": "the original user prompt"
}
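
A sketch of validating that --file input, assuming strict rejection of anything beyond string `output`/`input` fields (the v1 scope stated under Non-goals); `parseFileInput` is a hypothetical helper name.

```typescript
// The --file payload defined in this spec: { output, input }, both strings.
interface AssertFileInput {
  output: string; // the agent's full response text
  input: string;  // the original user prompt
}

function parseFileInput(raw: string): AssertFileInput {
  const data = JSON.parse(raw) as Partial<AssertFileInput>;
  if (typeof data.output !== "string" || typeof data.input !== "string") {
    throw new Error("--file JSON must contain string fields { output, input }");
  }
  return { output: data.output, input: data.input };
}

console.log(parseFileInput('{"output":"agent said this","input":"user asked this"}'));
```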

Deliverable 2: agentv eval prompt eval --grading-brief

Outputs a human-readable summary of grading criteria for a specific test, with type-prefixed assertion tags.

agentv eval prompt eval --grading-brief evals/dataset.eval.yaml --test-id csv-top-months

Output:

Input: "I have a CSV of monthly sales data. Find the top 3 months by revenue."
Expected: "The top 3 months by revenue are November ($22,500), September ($20,100), and December ($19,400)."
Criteria:
  - Output contains '$22,500'
  - [llm-judge] The answer is clear and concise
  - [code-judge] format-checker: Validates output CSV format
  - [skill-trigger] should_trigger: true for csv-analyzer

Behavior:

  1. Parse EVAL.yaml, find test by --test-id
  2. Extract input, expected_output, all assertions
  3. Render criteria with type prefixes: [llm-judge], [code-judge], [skill-trigger]
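
Step 3 could look roughly like this; the `Assertion` union is an assumption reconstructed from the EVAL.yaml example later in this issue, and the line formats mirror the sample output above.

```typescript
// Assumed assertion shapes, based on the EVAL.yaml example in this issue.
type Assertion =
  | { type: "contains"; value: string }
  | { type: "llm-judge"; prompt: string }
  | { type: "code-judge"; name: string; description: string }
  | { type: "skill-trigger"; skill: string; should_trigger: boolean };

// Render one criteria line, prefixing judge-backed types per the spec.
function renderCriterion(a: Assertion): string {
  switch (a.type) {
    case "contains":
      return `- Output contains '${a.value}'`;
    case "llm-judge":
      return `- [llm-judge] ${a.prompt}`;
    case "code-judge":
      return `- [code-judge] ${a.name}: ${a.description}`;
    case "skill-trigger":
      return `- [skill-trigger] should_trigger: ${a.should_trigger} for ${a.skill}`;
  }
}

console.log(renderCriterion({ type: "contains", value: "$22,500" }));
```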

Deliverable 3: Transpiler Changes

File: packages/core/src/evaluation/loaders/eval-yaml-transpiler.ts

assertions stays string[] — no schema change. Code judge assertions become NL instruction strings telling the grader agent to shell out.

Code judge assertion format:

Run `agentv eval assert <judge-name> --agent-output <agent_output> --agent-input <original_prompt>` and check the result. This judge: <description>. The command accepts --agent-output (the agent's full response text) and --agent-input (the original user prompt). It returns JSON on stdout: {"score": 0-1, "reasoning": "..."}. A score >= 0.5 means pass (exit 0); below 0.5 means fail (exit 1).

Logic changes in assertionToNaturalLanguage:

| Assertion type | Current behavior | New behavior |
| --- | --- | --- |
| code-judge / code_judge | "{name}: {description}" | NL instruction with agentv eval assert command |
| Unknown type WITH command | "{type} assertion" | Same NL instruction format |
| Unknown type WITHOUT command | Fallback to criteria/prompt | No change |
| llm-judge | Prompt string verbatim | No change |
| All other built-in types | NL conversion | No change |
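
The new branching, as a sketch. The `RawAssertion` shape and helper name are assumptions drawn from this issue, not the actual transpiler source; the instruction wording follows the format specified above.

```typescript
// Loose assertion shape as it arrives from parsed EVAL.yaml (assumed).
interface RawAssertion {
  type: string;
  name?: string;
  description?: string;
  command?: string[];
  criteria?: string;
  prompt?: string;
}

// Builds the executable NL instruction defined in this issue.
function codeJudgeInstruction(name: string, description: string): string {
  return (
    `Run \`agentv eval assert ${name} --agent-output <agent_output> ` +
    `--agent-input <original_prompt>\` and check the result. ` +
    `This judge: ${description}. The command accepts --agent-output ` +
    `(the agent's full response text) and --agent-input (the original ` +
    `user prompt). It returns JSON on stdout: {"score": 0-1, ` +
    `"reasoning": "..."}. A score >= 0.5 means pass (exit 0); ` +
    `below 0.5 means fail (exit 1).`
  );
}

function assertionToNaturalLanguage(a: RawAssertion): string {
  const isCodeJudge = a.type === "code-judge" || a.type === "code_judge";
  if (isCodeJudge || a.command !== undefined) {
    // Code judges, and unknown types WITH a command, get the NL instruction.
    return codeJudgeInstruction(a.name ?? a.type, a.description ?? "");
  }
  if (a.type === "llm-judge" && a.prompt !== undefined) {
    return a.prompt; // prompt string passed through verbatim
  }
  // Unknown type WITHOUT command: fall back to criteria, then a generic tag.
  return a.criteria ?? `${a.type} assertion`;
}

console.log(
  assertionToNaturalLanguage({
    type: "code-judge",
    name: "format-checker",
    description: "Validates output CSV format",
  })
);
```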

Example — EVAL.yaml:

assertions:
  - type: skill-trigger
    skill: csv-analyzer
    should_trigger: true
  - type: rubrics
    criteria: "Output identifies November as highest revenue month"
  - type: contains
    value: "$22,500"
  - type: code-judge
    name: format-checker
    description: "Validates output CSV format"
    command: ["bun", "run", ".agentv/judges/format-checker.ts"]
  - type: llm-judge
    prompt: "The answer is clear and concise"

Example — evals.json output:

{
  "id": 1,
  "prompt": "...",
  "should_trigger": true,
  "assertions": [
    "Output identifies November as highest revenue month",
    "Output contains '$22,500'",
    "Run `agentv eval assert format-checker --agent-output <agent_output> --agent-input <original_prompt>` and check the result. This judge: Validates output CSV format. The command accepts --agent-output (the agent's full response text) and --agent-input (the original user prompt). It returns JSON on stdout: {\"score\": 0-1, \"reasoning\": \"...\"}. A score >= 0.5 means pass (exit 0); below 0.5 means fail (exit 1).",
    "The answer is clear and concise"
  ]
}

Acceptance Signals

  • agentv eval assert <name> discovers and executes a judge from .agentv/judges/, returns score JSON
  • agentv eval prompt eval --grading-brief outputs full grading brief with typed criteria
  • Transpiler emits executable NL instruction strings for code judges
  • Transpiler emits NL criteria strings for LLM judges (unchanged)
  • assertions remains string[] — no schema breaking change
  • Existing transpiler tests continue to pass

Non-goals

  • Running LLM judges from the transpiled output (too expensive; use criteria as NL)
  • Changing run_eval.py consumer — it ignores assertion strings it doesn't understand
  • Supporting --file format beyond { output, input } in v1

Design latitude

  • Exit code threshold (0.5) can be adjusted if a different default makes more sense
  • --grading-brief output format is flexible as long as it's human-readable and includes type prefixes
  • NL instruction wording for code judge assertions can be refined
