feat: create EVAL.yaml to evals.json transpiler #598

@christso

Description

Objective

Build a transpiler from agentv EVAL.yaml → evals.json so that skill trigger evaluation cases authored in agentv's format can be consumed by skill-creator's pipeline.

Background: Two Formats

Source — EVAL.yaml (agentv)

Structured, machine-readable assertions. Files are attached via type: file content blocks inside a message (there is no input_files: shorthand — see sub-issue below):

tests:
  - id: csv-top-months
    criteria: Agent finds the top 3 months by revenue
    input:
      - role: user
        content:
          - type: file
            value: evals/files/sales.csv
          - type: text
            value: "I have a CSV of monthly sales data. Find the top 3 months by revenue."
    expected_output: "The top 3 months by revenue are November ($22,500), September ($20,100), and December ($19,400)."
    assert:
      - type: trigger-judge          # from .agentv/judges/trigger-judge.ts
        skill: csv-analyzer
        should_trigger: true
      - type: rubrics
        criteria: "Output identifies November as the highest revenue month"
      - type: contains
        value: "$22,500"

  - id: irrelevant-query
    criteria: Agent does not invoke csv-analyzer for an unrelated query
    input: "What time is it?"
    assert:
      - type: trigger-judge
        skill: csv-analyzer
        should_trigger: false

Target — evals.json (skill-creator / agentv skill pipeline)

{
  "skill_name": "csv-analyzer",
  "evals": [
    {
      "id": 1,
      "prompt": "I have a CSV of monthly sales data. Find the top 3 months by revenue.",
      "expected_output": "The top 3 months by revenue are November ($22,500), September ($20,100), and December ($19,400).",
      "files": ["evals/files/sales.csv"],
      "should_trigger": true,
      "assertions": [
        "Output identifies November as the highest revenue month",
        "Output contains '$22,500'"
      ]
    },
    {
      "id": 2,
      "prompt": "What time is it?",
      "should_trigger": false,
      "assertions": []
    }
  ]
}

skill-creator run_eval.py input format (trigger-only subset)

When feeding into run_eval.py --eval-set, only query and should_trigger are needed — unknown fields are ignored:

[
  { "query": "I have a CSV of monthly sales data...", "should_trigger": true },
  { "query": "What time is it?", "should_trigger": false }
]

The transpiler should produce the richer evals.json since run_eval.py ignores unknown fields.

Field Mappings

| EVAL.yaml field | evals.json field | Notes |
| --- | --- | --- |
| tests[].id (or index) | evals[].id | Use numeric index if id is a string |
| tests[].input (string) | evals[].prompt | Direct |
| tests[].input[role=user].content (string) | evals[].prompt | Extract last user text content |
| tests[].input[role=user].content[type=file][].value | evals[].files[] | Extract file paths from content blocks |
| tests[].expected_output (string or message) | evals[].expected_output | Flatten to string |
| tests[].assert[type=trigger-judge].should_trigger | evals[].should_trigger | Default true if absent |
| tests[].assert[type=trigger-judge].skill | top-level skill_name | All cases in one file must share the same skill |
| All other assert items | evals[].assertions[] | Convert to natural language (see below) |
| tests[].criteria | prepend to evals[].assertions[] | Treat criteria as a natural-language assertion |
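The per-test part of these mappings can be sketched as follows. This is an illustrative sketch, not the actual transpiler; `map_test` is a hypothetical helper name, and it only covers the scalar fields (prompt/files/assert handling is described in later sections):

```python
def map_test(test: dict, index: int) -> dict:
    """Map one EVAL.yaml test entry to an evals.json eval entry (sketch)."""
    entry = {}
    # tests[].id -> evals[].id: fall back to a 1-based numeric index
    # when the YAML id is a string like "csv-top-months"
    raw_id = test.get("id")
    entry["id"] = raw_id if isinstance(raw_id, int) else index + 1
    # A plain-string tests[].input maps directly to evals[].prompt
    if isinstance(test.get("input"), str):
        entry["prompt"] = test["input"]
    # expected_output flattens to a string when present
    if "expected_output" in test:
        entry["expected_output"] = str(test["expected_output"])
    return entry

result = map_test({"id": "csv-top-months", "input": "What time is it?"}, 1)
```

Here `result` would be `{"id": 2, "prompt": "What time is it?"}` since the string id falls back to the numeric index.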

File attachment extraction

agentv has no input_files: shorthand at the test level. Files are embedded as content blocks:

input:
  - role: user
    content:
      - type: file        # ← this is how files are attached
        value: path/to/file.csv
      - type: text
        value: "The prompt text"

The transpiler must:

  1. Walk all input messages
  2. Extract content blocks where type === 'file' → add block.value to files[]
  3. Concatenate remaining type === 'text' blocks → prompt
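The three steps above can be sketched as a single walk over the input messages (function name and the newline join are illustrative assumptions, not the actual implementation):

```python
def extract_prompt_and_files(messages):
    """Walk input messages: pull type:file paths into files[] and
    concatenate type:text blocks into the prompt (sketch)."""
    files, texts = [], []
    for msg in messages:
        if msg.get("role") != "user":
            continue
        content = msg.get("content")
        if isinstance(content, str):  # plain-string message shorthand
            texts.append(content)
            continue
        for block in content or []:
            if block.get("type") == "file":
                files.append(block["value"])
            elif block.get("type") == "text":
                texts.append(block["value"])
    return "\n".join(texts), files

prompt, files = extract_prompt_and_files([
    {"role": "user", "content": [
        {"type": "file", "value": "evals/files/sales.csv"},
        {"type": "text", "value": "Find the top 3 months by revenue."},
    ]}
])
```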

An input_files: syntactic sugar (expanding to type: file blocks) is tracked in #602.

Natural Language Conversion per Assert Type

Every non-trigger assert becomes a string in the assertions array:

| Assert type | Natural language template |
| --- | --- |
| rubrics with criteria string | Use criteria string verbatim |
| contains: { value } | "Output contains '{{value}}'" |
| regex: { value } | "Output matches regex: {{value}}" |
| equals: { value } | "Output exactly equals: {{value}}" |
| is-json | "Output is valid JSON" |
| llm-judge with prompt | Use prompt string verbatim |
| agent-judge with rubrics[] | Expand each rubric item to its own assertion |
| tool-trajectory | "Agent called tools in order: {{expected[].tool}}" |
| code-judge | "{{name or command}}: {{description if present, else omit}}" |
| field-accuracy | "Fields {{fields[].path}} match expected values" |
| latency | "Response time under {{threshold}}ms" |
| cost | "Cost under ${{budget}}" |
| token-usage | "Token usage within limits" |
| execution-metrics | "Execution within metric bounds" |

Root-level assert items (applying to all tests) should be appended to every test's assertions array.
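A few of the templates above can be sketched as a dispatch function. This is a partial sketch (several assert types are omitted, and the helper name is illustrative); it returns None for trigger-judge, which is handled via should_trigger instead:

```python
def assert_to_text(a: dict):
    """Convert one structured assert to a natural-language assertion
    string, following the template table above (partial sketch)."""
    t = a.get("type")
    if t == "trigger-judge":
        return None  # mapped to should_trigger, never to assertions[]
    if t == "rubrics":
        return a["criteria"]  # criteria string used verbatim
    if t == "contains":
        return f"Output contains '{a['value']}'"
    if t == "regex":
        return f"Output matches regex: {a['value']}"
    if t == "equals":
        return f"Output exactly equals: {a['value']}"
    if t == "is-json":
        return "Output is valid JSON"
    if t == "latency":
        return f"Response time under {a['threshold']}ms"
    return None  # remaining types omitted in this sketch
```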

trigger-judge special handling

The trigger-judge assert type maps to should_trigger (bool), NOT to assertions. Rules:

  • should_trigger: true (default) → "should_trigger": true
  • should_trigger: false → "should_trigger": false
  • If no trigger-judge in a test's asserts → omit should_trigger from that eval entry
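These rules reduce to a small extraction helper (a sketch with an illustrative name; returning a (skill, should_trigger) pair where both are None signals "omit the field"):

```python
def extract_should_trigger(asserts):
    """Find the first trigger-judge assert and return its
    (skill, should_trigger); (None, None) means omit the field (sketch)."""
    for a in asserts or []:
        if a.get("type") == "trigger-judge":
            # should_trigger defaults to true when the key is absent
            return a.get("skill"), a.get("should_trigger", True)
    return None, None
```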

Skill name extraction

skill_name (top-level in evals.json) comes from trigger-judge.skill. Rules:

  • If all tests share the same skill → one evals.json with that skill_name
  • If tests reference different skills → produce one evals.json per skill, each containing only its relevant tests
  • If a test has no trigger-judge → include in the file for the dominant skill, or in a separate _no-skill.json
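The grouping rules can be sketched as a bucketing pass (illustrative names; this sketch uses the _no-skill bucket for tests without a trigger-judge rather than the dominant-skill option):

```python
from collections import defaultdict

def group_by_skill(evals, skill_of):
    """Bucket transpiled eval entries into one evals.json payload per
    skill; `skill_of` maps an entry to its trigger-judge skill or None."""
    buckets = defaultdict(list)
    for entry in evals:
        buckets[skill_of(entry) or "_no-skill"].append(entry)
    return {
        skill: {"skill_name": skill, "evals": entries}
        for skill, entries in buckets.items()
    }
```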

Multi-skill / workspace evaluation

When multiple skills coexist in a workspace setup:

  • Each skill gets its own evals.json grouped by skill_name
  • A test with two trigger-judge asserts (one should_trigger: true for skill A, one should_trigger: false for skill B) should appear in both files with the appropriate should_trigger value
  • This lets run_eval.py evaluate each skill independently in isolation

Critical implementation details from skill-creator run_eval.py

  1. UUID isolation: skill-creator generates {skill_name}-skill-{uuid[:8]} as clean_name for the temp command file. The transpiler doesn't need to replicate this — it's handled by run_eval.py at runtime. The skill_name in evals.json is the base name (e.g. csv-analyzer).

  2. Trigger detection fields (what run_eval.py actually checks):

    • Skill tool → tool_input["skill"] contains clean_name
    • Read tool → tool_input["file_path"] contains clean_name
    • First tool call only — any other tool as first = not triggered
    • Matching is case-sensitive substring (clean_name in field_value)
  3. trigger_threshold: defaults to 0.5 across runs_per_query (default 3). evals.json doesn't encode this — it's a runtime parameter. Consider adding optional runs_per_query and trigger_threshold to evals.json metadata for passthrough.

  4. agentv trigger-judge (PR feat(providers,evaluators): claude-cli provider + trigger-judge evaluator #597) performs post-hoc detection from toolCalls using the same logic as the fallback path in run_eval.py — first tool only, case-sensitive, input.skill / input.file_path.
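For reference, the trigger-detection rules in point 2 can be restated as a small predicate. This is a sketch of the described behavior, not code from run_eval.py itself; the tool names and input shape follow the description above:

```python
def detects_trigger(first_tool_call: dict, clean_name: str) -> bool:
    """Mirror the described run_eval.py check: only the FIRST tool call
    counts, and matching is a case-sensitive substring test."""
    tool = first_tool_call.get("name")
    tool_input = first_tool_call.get("input", {})
    if tool == "Skill":
        return clean_name in tool_input.get("skill", "")
    if tool == "Read":
        return clean_name in tool_input.get("file_path", "")
    return False  # any other tool as the first call = not triggered
```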

Acceptance signals

  • Given an EVAL.yaml with mixed trigger and quality assertions, the transpiler produces valid evals.json
  • should_trigger is present and correct on all tests that have a trigger-judge assert
  • Tests without trigger-judge omit should_trigger
  • type: file content blocks are correctly extracted into files[]; text blocks are concatenated into prompt
  • All structured assert types produce readable natural language in assertions[]
  • Root-level assert entries are distributed to all tests
  • Multi-skill EVAL.yaml produces one evals.json per skill
  • The output evals.json is accepted by run_eval.py --eval-set without errors

Non-goals

  • Generating the UUID clean_name — that's run_eval.py's responsibility
  • Executing the eval — transpiler is purely a format conversion step
  • Reversing evals.json back to EVAL.yaml
