feat: create EVAL.yaml to evals.json transpiler #598

@christso

Description

Objective

Build a transpiler from agentv EVAL.yaml → evals.json so that skill trigger evaluation cases authored in agentv's format can be consumed by skill-creator's pipeline.

Background: Two Formats

Source — EVAL.yaml (agentv)

Structured, machine-readable assertions. Files are attached via type: file content blocks inside a message (there is no input_files: shorthand — see sub-issue below):

tests:
  - id: csv-top-months
    criteria: Agent finds the top 3 months by revenue
    input:
      - role: user
        content:
          - type: file
            value: evals/files/sales.csv
          - type: text
            value: "I have a CSV of monthly sales data. Find the top 3 months by revenue."
    expected_output: "The top 3 months by revenue are November ($22,500), September ($20,100), and December ($19,400)."
    assert:
      - type: trigger-judge          # from .agentv/judges/trigger-judge.ts
        skill: csv-analyzer
        should_trigger: true
      - type: rubrics
        criteria: "Output identifies November as the highest revenue month"
      - type: contains
        value: "$22,500"

  - id: irrelevant-query
    criteria: Agent does not invoke csv-analyzer for an unrelated query
    input: "What time is it?"
    assert:
      - type: trigger-judge
        skill: csv-analyzer
        should_trigger: false

Target — evals.json (skill-creator / agentv skill pipeline)

{
  "skill_name": "csv-analyzer",
  "evals": [
    {
      "id": 1,
      "prompt": "I have a CSV of monthly sales data. Find the top 3 months by revenue.",
      "expected_output": "The top 3 months by revenue are November ($22,500), September ($20,100), and December ($19,400).",
      "files": ["evals/files/sales.csv"],
      "should_trigger": true,
      "assertions": [
        "Output identifies November as the highest revenue month",
        "Output contains '$22,500'"
      ]
    },
    {
      "id": 2,
      "prompt": "What time is it?",
      "should_trigger": false,
      "assertions": []
    }
  ]
}

skill-creator run_eval.py input format (trigger-only subset)

When feeding into run_eval.py --eval-set, only query and should_trigger are needed — unknown fields are ignored:

[
  { "query": "I have a CSV of monthly sales data...", "should_trigger": true },
  { "query": "What time is it?", "should_trigger": false }
]

The transpiler should produce the richer evals.json since run_eval.py ignores unknown fields.

Field Mappings

| EVAL.yaml field | evals.json field | Notes |
| --- | --- | --- |
| tests[].id (or index) | evals[].id | Use numeric index if id is a string |
| tests[].input (string) | evals[].prompt | Direct |
| tests[].input[role=user].content (string) | evals[].prompt | Extract last user text content |
| tests[].input[role=user].content[type=file][].value | evals[].files[] | Extract file paths from content blocks |
| tests[].expected_output (string or message) | evals[].expected_output | Flatten to string |
| tests[].assert[type=trigger-judge].should_trigger | evals[].should_trigger | Default true if absent |
| tests[].assert[type=trigger-judge].skill | top-level skill_name | All cases in one file must share the same skill |
| All other assert items | evals[].assertions[] | Convert to natural language (see below) |
| tests[].criteria | prepend to evals[].assertions[] | Treat criteria as a natural-language assertion |
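The per-test part of these mappings can be sketched as follows. This is an illustrative sketch, not the actual transpiler; `map_test` is a hypothetical helper name, and it only covers the scalar fields (prompt/files/assert handling is described in later sections):

```python
def map_test(test: dict, index: int) -> dict:
    """Map one EVAL.yaml test entry to an evals.json eval entry (sketch)."""
    entry = {}
    # tests[].id -> evals[].id: fall back to a 1-based numeric index
    # when the YAML id is a string like "csv-top-months"
    raw_id = test.get("id")
    entry["id"] = raw_id if isinstance(raw_id, int) else index + 1
    # A plain-string tests[].input maps directly to evals[].prompt
    if isinstance(test.get("input"), str):
        entry["prompt"] = test["input"]
    # expected_output flattens to a string when present
    if "expected_output" in test:
        entry["expected_output"] = str(test["expected_output"])
    return entry

result = map_test({"id": "csv-top-months", "input": "What time is it?"}, 1)
```

Here `result` would be `{"id": 2, "prompt": "What time is it?"}` since the string id falls back to the numeric index.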

File attachment extraction

agentv has no input_files: shorthand at the test level. Files are embedded as content blocks:

input:
  - role: user
    content:
      - type: file        # ← this is how files are attached
        value: path/to/file.csv
      - type: text
        value: "The prompt text"

The transpiler must:

  1. Walk all input messages
  2. Extract content blocks where type === 'file' → add block.value to files[]
  3. Concatenate remaining type === 'text' blocks → prompt
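The three steps above can be sketched as a single walk over the input messages (function name and the newline join are illustrative assumptions, not the actual implementation):

```python
def extract_prompt_and_files(messages):
    """Walk input messages: pull type:file paths into files[] and
    concatenate type:text blocks into the prompt (sketch)."""
    files, texts = [], []
    for msg in messages:
        if msg.get("role") != "user":
            continue
        content = msg.get("content")
        if isinstance(content, str):  # plain-string message shorthand
            texts.append(content)
            continue
        for block in content or []:
            if block.get("type") == "file":
                files.append(block["value"])
            elif block.get("type") == "text":
                texts.append(block["value"])
    return "\n".join(texts), files

prompt, files = extract_prompt_and_files([
    {"role": "user", "content": [
        {"type": "file", "value": "evals/files/sales.csv"},
        {"type": "text", "value": "Find the top 3 months by revenue."},
    ]}
])
```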

An input_files: syntactic sugar (expanding to type: file blocks) is tracked in #602.

Natural Language Conversion per Assert Type

Every non-trigger assert becomes a string in the assertions array:

| Assert type | Natural language template |
| --- | --- |
| rubrics with criteria string | Use criteria string verbatim |
| contains: { value } | "Output contains '{{value}}'" |
| regex: { value } | "Output matches regex: {{value}}" |
| equals: { value } | "Output exactly equals: {{value}}" |
| is-json | "Output is valid JSON" |
| llm-judge with prompt | Use prompt string verbatim |
| agent-judge with rubrics[] | Expand each rubric item to its own assertion |
| tool-trajectory | "Agent called tools in order: {{expected[].tool}}" |
| code-judge | "{{name or command}}: {{description if present, else omit}}" |
| field-accuracy | "Fields {{fields[].path}} match expected values" |
| latency | "Response time under {{threshold}}ms" |
| cost | "Cost under ${{budget}}" |
| token-usage | "Token usage within limits" |
| execution-metrics | "Execution within metric bounds" |

Root-level assert items (applying to all tests) should be appended to every test's assertions array.
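A few of the templates above can be sketched as a dispatch function. This is a partial sketch (several assert types are omitted, and the helper name is illustrative); it returns None for trigger-judge, which is handled via should_trigger instead:

```python
def assert_to_text(a: dict):
    """Convert one structured assert to a natural-language assertion
    string, following the template table above (partial sketch)."""
    t = a.get("type")
    if t == "trigger-judge":
        return None  # mapped to should_trigger, never to assertions[]
    if t == "rubrics":
        return a["criteria"]  # criteria string used verbatim
    if t == "contains":
        return f"Output contains '{a['value']}'"
    if t == "regex":
        return f"Output matches regex: {a['value']}"
    if t == "equals":
        return f"Output exactly equals: {a['value']}"
    if t == "is-json":
        return "Output is valid JSON"
    if t == "latency":
        return f"Response time under {a['threshold']}ms"
    return None  # remaining types omitted in this sketch
```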

trigger-judge special handling

The trigger-judge assert type maps to should_trigger (bool), NOT to assertions. Rules:

  • should_trigger: true (default) → "should_trigger": true
  • should_trigger: false → "should_trigger": false
  • If no trigger-judge in a test's asserts → omit should_trigger from that eval entry
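These rules reduce to a small extraction helper (a sketch with an illustrative name; returning a (skill, should_trigger) pair where both are None signals "omit the field"):

```python
def extract_should_trigger(asserts):
    """Find the first trigger-judge assert and return its
    (skill, should_trigger); (None, None) means omit the field (sketch)."""
    for a in asserts or []:
        if a.get("type") == "trigger-judge":
            # should_trigger defaults to true when the key is absent
            return a.get("skill"), a.get("should_trigger", True)
    return None, None
```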

Skill name extraction

skill_name (top-level in evals.json) comes from trigger-judge.skill. Rules:

  • If all tests share the same skill → one evals.json with that skill_name
  • If tests reference different skills → produce one evals.json per skill, each containing only its relevant tests
  • If a test has no trigger-judge → include in the file for the dominant skill, or in a separate _no-skill.json
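The grouping rules can be sketched as a bucketing pass (illustrative names; this sketch uses the _no-skill bucket for tests without a trigger-judge rather than the dominant-skill option):

```python
from collections import defaultdict

def group_by_skill(evals, skill_of):
    """Bucket transpiled eval entries into one evals.json payload per
    skill; `skill_of` maps an entry to its trigger-judge skill or None."""
    buckets = defaultdict(list)
    for entry in evals:
        buckets[skill_of(entry) or "_no-skill"].append(entry)
    return {
        skill: {"skill_name": skill, "evals": entries}
        for skill, entries in buckets.items()
    }
```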

Multi-skill / workspace evaluation

When multiple skills coexist in a workspace setup:

  • Each skill gets its own evals.json grouped by skill_name
  • A test with two trigger-judge asserts (one should_trigger: true for skill A, one should_trigger: false for skill B) should appear in both files with the appropriate should_trigger value
  • This lets run_eval.py evaluate each skill independently in isolation

Critical implementation details from skill-creator run_eval.py

  1. UUID isolation: skill-creator generates {skill_name}-skill-{uuid[:8]} as clean_name for the temp command file. The transpiler doesn't need to replicate this — it's handled by run_eval.py at runtime. The skill_name in evals.json is the base name (e.g. csv-analyzer).

  2. Trigger detection fields (what run_eval.py actually checks):

    • Skill tool → tool_input["skill"] contains clean_name
    • Read tool → tool_input["file_path"] contains clean_name
    • First tool call only — any other tool as first = not triggered
    • Matching is case-sensitive substring (clean_name in field_value)
  3. trigger_threshold: defaults to 0.5 across runs_per_query (default 3). evals.json doesn't encode this — it's a runtime parameter. Consider adding optional runs_per_query and trigger_threshold to evals.json metadata for passthrough.

  4. agentv trigger-judge (PR feat(providers,evaluators): claude-cli provider + trigger-judge evaluator #597) performs post-hoc detection from toolCalls using the same logic as the fallback path in run_eval.py — first tool only, case-sensitive, input.skill / input.file_path.
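For reference, the trigger-detection rules in point 2 can be restated as a small predicate. This is a sketch of the described behavior, not code from run_eval.py itself; the tool names and input shape follow the description above:

```python
def detects_trigger(first_tool_call: dict, clean_name: str) -> bool:
    """Mirror the described run_eval.py check: only the FIRST tool call
    counts, and matching is a case-sensitive substring test."""
    tool = first_tool_call.get("name")
    tool_input = first_tool_call.get("input", {})
    if tool == "Skill":
        return clean_name in tool_input.get("skill", "")
    if tool == "Read":
        return clean_name in tool_input.get("file_path", "")
    return False  # any other tool as the first call = not triggered
```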

Acceptance signals

  • Given an EVAL.yaml with mixed trigger and quality assertions, the transpiler produces valid evals.json
  • should_trigger is present and correct on all tests that have a trigger-judge assert
  • Tests without trigger-judge omit should_trigger
  • type: file content blocks are correctly extracted into files[]; text blocks are concatenated into prompt
  • All structured assert types produce readable natural language in assertions[]
  • Root-level assert entries are distributed to all tests
  • Multi-skill EVAL.yaml produces one evals.json per skill
  • The output evals.json is accepted by run_eval.py --eval-set without errors

Non-goals

  • Generating the UUID clean_name — that's run_eval.py's responsibility
  • Executing the eval — transpiler is purely a format conversion step
  • Reversing evals.json back to EVAL.yaml
