feat: adopt skill-creator grading patterns in eval-judge (claims extraction, eval critique, evidence format) #570
Description
Objective
Enhance AgentV's eval-judge agent with three evaluation techniques from Anthropic's skill-creator grader.md that AgentV currently lacks: claims extraction and verification, eval self-critique, and surface-vs-substance guards. These make the judge produce richer, more reliable results without requiring new core runtime features.
Architecture Boundary
external-first (agent prompt enhancement in `plugins/agentv-dev/agents/eval-judge.md`)
What skill-creator's grader does that eval-judge doesn't
1. Claims extraction and verification
The grader doesn't just check predefined assertions — it extracts implicit claims from the agent's output and independently verifies them:
"Extract implicit claims from the outputs and verify them:
- Factual statements ('The form has 12 fields')
- Process claims ('Used pypdf to fill the form')
- Quality claims ('All fields were filled correctly')
Flag unverifiable claims."
This catches issues that predefined assertions miss. AgentV's eval-judge only evaluates the assertions configured in the eval file.
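The extract-and-verify step can be sketched as a small data shape plus an unverifiable-claims filter. This is an illustrative sketch, not existing AgentV code; the `Claim` fields and helper name are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Claim:
    kind: str         # "factual" | "process" | "quality"
    text: str         # the claim as stated in the agent's output
    verifiable: bool  # whether the judge found evidence to check it
    verified: bool    # verification result (meaningful only if verifiable)

def flag_unverifiable(claims: list[Claim]) -> list[Claim]:
    """Return claims the judge could not independently check."""
    return [c for c in claims if not c.verifiable]

# Example claims mirroring the grader prompt's three categories
claims = [
    Claim("factual", "The form has 12 fields", True, True),
    Claim("process", "Used pypdf to fill the form", True, False),
    Claim("quality", "All fields were filled correctly", False, False),
]
print([c.text for c in flag_unverifiable(claims)])
# → ['All fields were filled correctly']
```

A claim that is verifiable but failed verification (the process claim above) is a distinct signal from a claim that cannot be checked at all; both are worth surfacing in the judge's output.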
2. Eval self-critique (inline)
The grader actively critiques the quality of the assertions it's evaluating:
"You have two jobs: grade the outputs, and critique the evals themselves.
A passing grade on a weak assertion is worse than useless —
it creates false confidence."
It produces `eval_feedback` with concrete suggestions:

```json
{
  "eval_feedback": {
    "suggestions": [
      {
        "assertion": "The output includes the name 'John Smith'",
        "reason": "A hallucinated document that mentions the name would also pass — consider checking it appears as the primary contact with matching phone and email"
      }
    ],
    "overall": "Assertions check presence but not correctness. Consider adding content verification."
  }
}
```

AgentV's eval-judge scores but never questions whether the assertions themselves are good.
3. Surface vs substance guards
The grader explicitly guards against surface-level compliance:
"PASS when: The evidence reflects genuine substance, not just surface compliance
(e.g., a file exists AND contains correct content, not just the right filename)"
"FAIL when: The evidence is superficial — the assertion is technically satisfied
but the underlying task outcome is wrong or incomplete"
AgentV's eval-judge doesn't make this distinction — it takes assertion satisfaction at face value.
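A minimal sketch of the guard, assuming a file-based assertion; the helper name is hypothetical:

```python
from pathlib import Path
import tempfile

def substantively_passes(path: Path, required_text: str) -> bool:
    """Surface check (file exists) AND substance check (correct content)."""
    return path.exists() and required_text in path.read_text()

with tempfile.TemporaryDirectory() as d:
    out = Path(d) / "report.txt"
    out.write_text("placeholder")        # right filename, wrong content
    print(substantively_passes(out, "SUM(B1:B9)"))   # surface-only: False
    out.write_text("Total: SUM(B1:B9)")  # now genuine substance
    print(substantively_passes(out, "SUM(B1:B9)"))   # True
```

The point is that each assertion's verdict should conjoin the surface condition with at least one substance condition, rather than passing on the surface condition alone.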
4. Structured evidence format
The grader produces per-assertion evidence with cited quotes:
```json
{
  "text": "The spreadsheet has a SUM formula in cell B10",
  "passed": false,
  "evidence": "No spreadsheet was created. The output was a text file."
}
```

AgentV's eval-judge produces `{score, hits, misses, reasoning}` — the reasoning is a single blob, not per-assertion evidence. This matters for #565's grading.json artifact compatibility.
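One way to keep the richer format additive: fold per-assertion records into the existing `{score, hits, misses, reasoning}` shape and carry the records in a new field that old consumers ignore. The converter and the extra `evidence` field are assumptions, not current AgentV behavior:

```python
# Per-assertion records in the skill-creator grader's evidence format
evidence = [
    {"text": "The spreadsheet has a SUM formula in cell B10",
     "passed": False,
     "evidence": "No spreadsheet was created. The output was a text file."},
    {"text": "The output file is named report.xlsx",
     "passed": True,
     "evidence": "Workspace listing shows report.xlsx."},
]

def to_legacy(records):
    """Fold evidence records into the existing eval-judge result shape."""
    hits = [r["text"] for r in records if r["passed"]]
    misses = [r["text"] for r in records if not r["passed"]]
    return {
        "score": len(hits) / len(records) if records else 0.0,
        "hits": hits,
        "misses": misses,
        "reasoning": " ".join(r["evidence"] for r in records),
        "evidence": records,  # new additive field; legacy consumers ignore it
    }

result = to_legacy(evidence)
print(result["score"])  # → 0.5
```

Because the four existing keys keep their meaning, the JSONL append workflow and existing eval configs continue working unchanged.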
5. User notes integration
The grader reads {outputs_dir}/user_notes.md if it exists — notes from the executor about uncertainties, workarounds, and issues encountered during execution. This surfaces executor-side concerns that pass/fail scores miss. AgentV's eval-judge doesn't read any executor notes. Include user notes integration in the eval-judge enhancement if executor notes or workspace hook output is available.
What to keep from AgentV's eval-judge
- The `agentv prompt eval judge` command integration (deterministic + prompt_ready evaluator dispatch)
- The weighted average scoring across evaluators
- The JSONL append workflow
- The `mode: "agent"` distinction
Design Latitude
- Choose how to integrate the three new capabilities into the existing eval-judge prompt without making it unwieldy
- May use progressive disclosure: core grading always, claims extraction and eval critique as optional depth levels
- Choose whether eval feedback goes into the JSONL result (new field) or a companion artifact
- May add the structured evidence format as a new output shape alongside the existing `{score, hits, misses, reasoning}`, or replace it
- Choose whether claims extraction runs on every eval or only when explicitly requested
Acceptance Signals
- eval-judge extracts and verifies at least factual and process claims from agent output, beyond predefined assertions
- eval-judge produces `eval_feedback` with concrete suggestions when assertions are weak (trivially satisfiable, missing important checks, or unverifiable)
- eval-judge distinguishes surface compliance from genuine task completion in its verdicts
- Per-assertion evidence with cited quotes is available in the output (compatible with skill-creator's `grading.json` expectations format)
- Existing eval configs and JSONL format continue working (enhancements are additive)
Non-Goals
- Changing the core evaluation engine or evaluator registry
- Adding new evaluator types
- Modifying deterministic evaluators (contains, regex, etc.)
- Building a standalone analyzer tool (that's feat: eval analyzer pass for weak assertions and flaky scenarios #567)
- Trigger evaluation
Source Material
- Anthropic skill-creator grader.md — full prompt with claims extraction, eval critique, surface/substance guards
- Current AgentV eval-judge: `plugins/agentv-dev/agents/eval-judge.md`