feat: create EVAL.yaml to evals.json transpiler #598
Description
Objective
Build a transpiler from agentv EVAL.yaml → evals.json so that skill trigger evaluation cases authored in agentv's format can be consumed by skill-creator's pipeline.
Background: Two Formats
Source — EVAL.yaml (agentv)
Structured, machine-readable assertions. Files are attached via `type: file` content blocks inside a message (there is no `input_files:` shorthand — see sub-issue below):
```yaml
tests:
  - id: csv-top-months
    criteria: Agent finds the top 3 months by revenue
    input:
      - role: user
        content:
          - type: file
            value: evals/files/sales.csv
          - type: text
            value: "I have a CSV of monthly sales data. Find the top 3 months by revenue."
    expected_output: "The top 3 months by revenue are November ($22,500), September ($20,100), and December ($19,400)."
    assert:
      - type: trigger-judge  # from .agentv/judges/trigger-judge.ts
        skill: csv-analyzer
        should_trigger: true
      - type: rubrics
        criteria: "Output identifies November as the highest revenue month"
      - type: contains
        value: "$22,500"
  - id: irrelevant-query
    criteria: Agent does not invoke csv-analyzer for an unrelated query
    input: "What time is it?"
    assert:
      - type: trigger-judge
        skill: csv-analyzer
        should_trigger: false
```
Target — evals.json (skill-creator / agentv skill pipeline)
```json
{
  "skill_name": "csv-analyzer",
  "evals": [
    {
      "id": 1,
      "prompt": "I have a CSV of monthly sales data. Find the top 3 months by revenue.",
      "expected_output": "The top 3 months by revenue are November ($22,500), September ($20,100), and December ($19,400).",
      "files": ["evals/files/sales.csv"],
      "should_trigger": true,
      "assertions": [
        "Output identifies November as the highest revenue month",
        "Output contains '$22,500'"
      ]
    },
    {
      "id": 2,
      "prompt": "What time is it?",
      "should_trigger": false,
      "assertions": []
    }
  ]
}
```
skill-creator run_eval.py input format (trigger-only subset)
When feeding into `run_eval.py --eval-set`, only `query` and `should_trigger` are needed — unknown fields are ignored:
```json
[
  { "query": "I have a CSV of monthly sales data...", "should_trigger": true },
  { "query": "What time is it?", "should_trigger": false }
]
```
The transpiler should produce the richer evals.json since run_eval.py ignores unknown fields.
Field Mappings
| EVAL.yaml field | evals.json field | Notes |
|---|---|---|
| `tests[].id` (or index) | `evals[].id` | Use numeric index if `id` is a string |
| `tests[].input` (string) | `evals[].prompt` | Direct |
| `tests[].input[role=user].content` (string) | `evals[].prompt` | Extract last user text content |
| `tests[].input[role=user].content[type=file][].value` | `evals[].files[]` | Extract file paths from content blocks |
| `tests[].expected_output` (string or message) | `evals[].expected_output` | Flatten to string |
| `tests[].assert[type=trigger-judge].should_trigger` | `evals[].should_trigger` | Default `true` if absent |
| `tests[].assert[type=trigger-judge].skill` | top-level `skill_name` | All cases in one file must share the same skill |
| All other `assert` items | `evals[].assertions[]` | Convert to natural language (see below) |
| `tests[].criteria` | prepend to `evals[].assertions[]` | Treat `criteria` as a natural-language assertion |
File attachment extraction
agentv has no `input_files:` shorthand at the test level. Files are embedded as content blocks:
```yaml
input:
  - role: user
    content:
      - type: file            # ← this is how files are attached
        value: path/to/file.csv
      - type: text
        value: "The prompt text"
```
The transpiler must:
- Walk all `input` messages
- Extract content blocks where `type === 'file'` → add `block.value` to `files[]`
- Concatenate remaining `type === 'text'` blocks → `prompt`

An `input_files:` syntactic sugar (expanding to `type: file` content blocks) is tracked in #602.
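The walk/extract/concatenate steps above could be sketched as follows. This is a minimal illustration, not the transpiler itself; the function name and the `"\n"` join separator are assumptions.

```python
from typing import Any


def extract_prompt_and_files(input_value: Any) -> tuple[str, list[str]]:
    """Split a test's `input` into (prompt, files).

    Handles both the plain-string shorthand and the message-list form.
    """
    if isinstance(input_value, str):           # string shorthand: no files
        return input_value, []

    prompt_parts: list[str] = []
    files: list[str] = []
    for message in input_value:                # walk all input messages
        content = message.get("content", [])
        if isinstance(content, str):           # string content on a message
            prompt_parts.append(content)
            continue
        for block in content:
            if block.get("type") == "file":    # file attachment block
                files.append(block["value"])
            elif block.get("type") == "text":  # text block feeds the prompt
                prompt_parts.append(block["value"])
    return "\n".join(prompt_parts), files
```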
Natural Language Conversion per Assert Type
Every non-trigger assert becomes a string in the `assertions` array:
| Assert type | Natural language template |
|---|---|
| `rubrics` with `criteria` string | Use `criteria` string verbatim |
| `contains: { value }` | "Output contains '{{value}}'" |
| `regex: { value }` | "Output matches regex: {{value}}" |
| `equals: { value }` | "Output exactly equals: {{value}}" |
| `is-json` | "Output is valid JSON" |
| `llm-judge` with `prompt` | Use `prompt` string verbatim |
| `agent-judge` with `rubrics[]` | Expand each rubric item to its own assertion |
| `tool-trajectory` | "Agent called tools in order: {{expected[].tool}}" |
| `code-judge` | "{{name or command}}: {{description if present, else omit}}" |
| `field-accuracy` | "Fields {{fields[].path}} match expected values" |
| `latency` | "Response time under {{threshold}}ms" |
| `cost` | "Cost under ${{budget}}" |
| `token-usage` | "Token usage within limits" |
| `execution-metrics` | "Execution within metric bounds" |
Root-level `assert` items (applying to all tests) should be appended to every test's `assertions` array.
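The template table above could be implemented as a simple dispatch. A sketch covering a representative subset of the assert types (the remaining types follow the same pattern):

```python
def assert_to_text(a: dict) -> list[str]:
    """Render one non-trigger assert item as natural-language assertion(s)."""
    t = a.get("type")
    if t == "rubrics":
        return [a["criteria"]]                        # criteria verbatim
    if t == "contains":
        return [f"Output contains '{a['value']}'"]
    if t == "regex":
        return [f"Output matches regex: {a['value']}"]
    if t == "equals":
        return [f"Output exactly equals: {a['value']}"]
    if t == "is-json":
        return ["Output is valid JSON"]
    if t == "llm-judge":
        return [a["prompt"]]                          # judge prompt verbatim
    if t == "agent-judge":
        return list(a.get("rubrics", []))             # one assertion per rubric
    if t == "tool-trajectory":
        tools = ", ".join(step["tool"] for step in a.get("expected", []))
        return [f"Agent called tools in order: {tools}"]
    # code-judge, field-accuracy, latency, cost, token-usage, and
    # execution-metrics follow the same pattern; omitted for brevity.
    return []
```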
trigger-judge special handling
The `trigger-judge` assert type maps to `should_trigger` (bool), NOT to `assertions`. Rules:
- `should_trigger: true` (default) → `"should_trigger": true`
- `should_trigger: false` → `"should_trigger": false`
- If no `trigger-judge` in a test's asserts → omit `should_trigger` from that eval entry
Skill name extraction
`skill_name` (top-level in evals.json) comes from `trigger-judge.skill`. Rules:
- If all tests share the same skill → one evals.json with that `skill_name`
- If tests reference different skills → produce one evals.json per skill, each containing only its relevant tests
- If a test has no `trigger-judge` → include it in the file for the dominant skill, or in a separate `_no-skill.json`
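The grouping rules above could be sketched as a bucketing pass; the function name and the use of `""` as the no-skill bucket key are illustrative choices, not part of the spec.

```python
from collections import defaultdict


def group_tests_by_skill(tests: list[dict]) -> dict[str, list[dict]]:
    """Bucket tests by the skill(s) named in their trigger-judge asserts.

    A test with trigger-judge asserts for several skills lands in each
    bucket; a test with none goes under the "" key, which the caller can
    merge into the dominant skill's file or write to _no-skill.json.
    """
    buckets: dict[str, list[dict]] = defaultdict(list)
    for test in tests:
        skills = {a["skill"] for a in test.get("assert", [])
                  if a.get("type") == "trigger-judge" and "skill" in a}
        if not skills:
            buckets[""].append(test)
        for skill in sorted(skills):
            buckets[skill].append(test)
    return dict(buckets)
```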
Multi-skill / workspace evaluation
When multiple skills coexist in a workspace setup:
- Each skill gets its own evals.json grouped by `skill_name`
- A test with two `trigger-judge` asserts (one `should_trigger: true` for skill A, one `should_trigger: false` for skill B) should appear in both files with the appropriate `should_trigger` value
- This lets `run_eval.py` evaluate each skill independently in isolation
Critical implementation details from skill-creator run_eval.py
- UUID isolation: skill-creator generates `{skill_name}-skill-{uuid[:8]}` as `clean_name` for the temp command file. The transpiler doesn't need to replicate this — it's handled by `run_eval.py` at runtime. The `skill_name` in evals.json is the base name (e.g. `csv-analyzer`).
- Trigger detection fields (what run_eval.py actually checks):
  - `Skill` tool → `tool_input["skill"]` contains `clean_name`
  - `Read` tool → `tool_input["file_path"]` contains `clean_name`
  - First tool call only — any other tool as first = not triggered
  - Matching is case-sensitive substring (`clean_name in field_value`)
- `trigger_threshold`: defaults to `0.5` across `runs_per_query` (default 3). evals.json doesn't encode this — it's a runtime parameter. Consider adding optional `runs_per_query` and `trigger_threshold` to evals.json metadata for passthrough.
- agentv trigger-judge (PR feat(providers,evaluators): claude-cli provider + trigger-judge evaluator #597) performs post-hoc detection from `toolCalls` using the same logic as the fallback path in `run_eval.py` — first tool only, case-sensitive, `input.skill`/`input.file_path`.
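The detection rules above (first tool call only, case-sensitive substring on `skill`/`file_path`) could be mirrored like this. The tool-call dict shape (`name`, `input` keys) is an assumption based on the issue text, not confirmed run_eval.py internals.

```python
def detect_trigger(tool_calls: list[dict], clean_name: str) -> bool:
    """Post-hoc trigger detection: first tool call only, case-sensitive
    substring match on input.skill (Skill tool) or input.file_path (Read tool).
    """
    if not tool_calls:
        return False
    first = tool_calls[0]                  # only the first tool call counts
    tool_input = first.get("input", {})
    if first.get("name") == "Skill":
        return clean_name in tool_input.get("skill", "")
    if first.get("name") == "Read":
        return clean_name in tool_input.get("file_path", "")
    return False                           # any other first tool = not triggered
```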
Acceptance signals
- Given an EVAL.yaml with mixed trigger and quality assertions, the transpiler produces valid evals.json
- `should_trigger` is present and correct on all tests that have a `trigger-judge` assert
- Tests without `trigger-judge` omit `should_trigger`
- `type: file` content blocks are correctly extracted into `files[]`; text blocks are concatenated into `prompt`
- All structured assert types produce readable natural language in `assertions[]`
- Root-level `assert` entries are distributed to all tests
- A multi-skill EVAL.yaml produces one evals.json per skill
- The output evals.json is accepted by `run_eval.py --eval-set` without errors
Non-goals
- Generating the UUID `clean_name` — that's run_eval.py's responsibility
- Executing the eval — the transpiler is purely a format conversion step
- Reversing evals.json back to EVAL.yaml
Related
- feat(providers,evaluators): claude-cli provider + trigger-judge evaluator #597 — claude-cli provider + trigger-judge code-judge (`.agentv/judges/`)
- feat: input_files shorthand in EVAL.yaml test cases #602 — `input_files:` shorthand (syntactic sugar for `type: file` content blocks)
- tracking: Anthropic skill-creator eval framework alignment #569 — skill-creator eval framework alignment tracking
- skill-creator reference: `scripts/run_eval.py` (trigger detection logic)
- skill-creator reference: `scripts/improve_description.py` (consumes eval results, shows `pass`/`should_trigger` result schema)