
feat: adopt skill-creator grading patterns in eval-judge (claims extraction, eval critique, evidence format) #570

@christso

Description

Objective

Enhance AgentV's eval-judge agent with three evaluation techniques from Anthropic's skill-creator grader.md that AgentV currently lacks: claims extraction and verification, eval self-critique, and surface-vs-substance guards. These make the judge produce richer, more reliable results without requiring new core runtime features.

Architecture Boundary

external-first (agent prompt enhancement in plugins/agentv-dev/agents/eval-judge.md)

What skill-creator's grader does that eval-judge doesn't

1. Claims extraction and verification

The grader doesn't just check predefined assertions — it extracts implicit claims from the agent's output and independently verifies them:

"Extract implicit claims from the outputs and verify them:
 - Factual statements ('The form has 12 fields')
 - Process claims ('Used pypdf to fill the form')
 - Quality claims ('All fields were filled correctly')
 Flag unverifiable claims."

This catches issues that predefined assertions miss. AgentV's eval-judge only evaluates the assertions configured in the eval file.

2. Eval self-critique (inline)

The grader actively critiques the quality of the assertions it's evaluating:

"You have two jobs: grade the outputs, and critique the evals themselves.
 A passing grade on a weak assertion is worse than useless —
 it creates false confidence."

It produces eval_feedback with concrete suggestions:

{
  "eval_feedback": {
    "suggestions": [
      {
        "assertion": "The output includes the name 'John Smith'",
        "reason": "A hallucinated document that mentions the name would also pass — consider checking it appears as the primary contact with matching phone and email"
      }
    ],
    "overall": "Assertions check presence but not correctness. Consider adding content verification."
  }
}

AgentV's eval-judge scores but never questions whether the assertions themselves are good.

3. Surface vs substance guards

The grader explicitly guards against surface-level compliance:

"PASS when: The evidence reflects genuine substance, not just surface compliance
 (e.g., a file exists AND contains correct content, not just the right filename)"

"FAIL when: The evidence is superficial — the assertion is technically satisfied
 but the underlying task outcome is wrong or incomplete"

AgentV's eval-judge lacks this distinction: it takes assertion satisfaction at face value.
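The file-existence example from the PASS/FAIL guards can be made concrete: surface compliance is the file existing, substance is the file also containing the expected content. A minimal sketch; the function name and the example strings are hypothetical:

```python
import tempfile
from pathlib import Path

def substantive_pass(path: Path, required_text: str) -> bool:
    """PASS only when the file exists (surface) AND contains the expected
    content (substance); a correctly named file with wrong content fails."""
    return path.exists() and required_text in path.read_text()

with tempfile.TemporaryDirectory() as d:
    report = Path(d) / "report.txt"
    report.write_text("placeholder")                  # right filename, wrong content
    surface_only = report.exists()                    # surface check passes
    verdict = substantive_pass(report, "Q3 revenue")  # substance check fails
```

A surface-only judge would stop at `report.exists()`; the guard requires the second check before emitting PASS.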

4. Structured evidence format

The grader produces per-assertion evidence with cited quotes:

{
  "text": "The spreadsheet has a SUM formula in cell B10",
  "passed": false,
  "evidence": "No spreadsheet was created. The output was a text file."
}

AgentV's eval-judge produces {score, hits, misses, reasoning} — the reasoning is a single blob, not per-assertion evidence. This matters for #565's grading.json artifact compatibility.
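One additive path from the existing blob format to per-assertion evidence is to keep the current fields and append an evidence list derived from hits and misses. A sketch under that assumption; the `assertions` field name and the empty-evidence placeholders are hypothetical:

```python
def to_structured(result: dict) -> dict:
    """Extend an existing {score, hits, misses, reasoning} result with
    per-assertion evidence entries, keeping the original fields intact
    so the JSONL append workflow keeps working."""
    evidence = (
        [{"text": h, "passed": True, "evidence": ""} for h in result["hits"]]
        + [{"text": m, "passed": False, "evidence": ""} for m in result["misses"]]
    )
    return {**result, "assertions": evidence}

legacy = {
    "score": 0.5,
    "hits": ["A text file was created"],
    "misses": ["The spreadsheet has a SUM formula in cell B10"],
    "reasoning": "...",
}
extended = to_structured(legacy)
```

In a real enhancement the judge would populate `evidence` with cited quotes rather than empty strings; the point here is only that the extension can be additive.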

5. User notes integration

The grader reads {outputs_dir}/user_notes.md if it exists: notes from the executor about uncertainties, workarounds, and issues encountered during execution. This surfaces executor-side concerns that pass/fail scores miss. AgentV's eval-judge reads no executor notes today; if executor notes or workspace hook output is available, the enhancement should fold them in.

What to keep from AgentV's eval-judge

  • The agentv prompt eval judge command integration (deterministic + prompt_ready evaluator dispatch)
  • The weighted average scoring across evaluators
  • The JSONL append workflow
  • The mode: "agent" distinction
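For reference, the weighted-average scoring being kept reduces to combining per-evaluator (score, weight) pairs. A sketch only; the actual eval-judge signature and data shapes are assumptions:

```python
def weighted_average(evaluator_results: list[tuple[float, float]]) -> float:
    """Combine per-evaluator scores as a weighted average, guarding
    against a zero total weight."""
    total = sum(w for _, w in evaluator_results)
    if total == 0:
        return 0.0
    return sum(s * w for s, w in evaluator_results) / total

# A deterministic evaluator (weight 2.0) and a prompt evaluator (weight 1.0):
score = weighted_average([(1.0, 2.0), (0.5, 1.0)])  # (2.0 + 0.5) / 3.0
```

Any additive output change from this issue must leave this aggregation untouched for existing eval configs.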

Design Latitude

  • Choose how to integrate the three new capabilities into the existing eval-judge prompt without making it unwieldy
  • May use progressive disclosure: core grading always, claims extraction and eval critique as optional depth levels
  • Choose whether eval feedback goes into the JSONL result (new field) or a companion artifact
  • May add the structured evidence format as a new output shape alongside the existing {score, hits, misses, reasoning}, or replace it
  • Choose whether claims extraction runs on every eval or only when explicitly requested

Acceptance Signals

  • eval-judge extracts and verifies at least factual and process claims from agent output, beyond predefined assertions
  • eval-judge produces eval_feedback with concrete suggestions when assertions are weak (trivially satisfiable, missing important checks, or unverifiable)
  • eval-judge distinguishes surface compliance from genuine task completion in its verdicts
  • Per-assertion evidence with cited quotes is available in the output (compatible with skill-creator's grading.json expectations format)
  • Existing eval configs and JSONL format continue working (enhancements are additive)

Non-Goals

Source Material

  • Anthropic skill-creator grader.md — full prompt with claims extraction, eval critique, surface/substance guards
  • Current AgentV eval-judge: plugins/agentv-dev/agents/eval-judge.md
