
feat: adopt skill-creator grading patterns in eval-judge (claims extraction, eval critique, evidence format) #570

@christso

Description

Objective

Enhance AgentV's eval-judge agent with three evaluation techniques from Anthropic's skill-creator grader.md that AgentV currently lacks: claims extraction and verification, eval self-critique, and surface-vs-substance guards. These make the judge produce richer, more reliable results without requiring new core runtime features.

Architecture Boundary

external-first (agent prompt enhancement in plugins/agentv-dev/agents/eval-judge.md)

What skill-creator's grader does that eval-judge doesn't

1. Claims extraction and verification

The grader doesn't just check predefined assertions — it extracts implicit claims from the agent's output and independently verifies them:

"Extract implicit claims from the outputs and verify them:
 - Factual statements ('The form has 12 fields')
 - Process claims ('Used pypdf to fill the form')
 - Quality claims ('All fields were filled correctly')
 Flag unverifiable claims."

This catches issues that predefined assertions miss. AgentV's eval-judge only evaluates the assertions configured in the eval file.

2. Eval self-critique (inline)

The grader actively critiques the quality of the assertions it's evaluating:

"You have two jobs: grade the outputs, and critique the evals themselves.
 A passing grade on a weak assertion is worse than useless —
 it creates false confidence."

It produces eval_feedback with concrete suggestions:

{
  "eval_feedback": {
    "suggestions": [
      {
        "assertion": "The output includes the name 'John Smith'",
        "reason": "A hallucinated document that mentions the name would also pass — consider checking it appears as the primary contact with matching phone and email"
      }
    ],
    "overall": "Assertions check presence but not correctness. Consider adding content verification."
  }
}

AgentV's eval-judge scores but never questions whether the assertions themselves are good.

3. Surface vs substance guards

The grader explicitly guards against surface-level compliance:

"PASS when: The evidence reflects genuine substance, not just surface compliance
 (e.g., a file exists AND contains correct content, not just the right filename)"

"FAIL when: The evidence is superficial — the assertion is technically satisfied
 but the underlying task outcome is wrong or incomplete"

AgentV's eval-judge lacks this distinction: it takes assertion satisfaction at face value.
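The file-existence example from the PASS/FAIL guards can be made concrete: surface compliance is the file existing, substance is the file also containing the expected content. A minimal sketch; the function name and the example strings are hypothetical:

```python
import tempfile
from pathlib import Path

def substantive_pass(path: Path, required_text: str) -> bool:
    """PASS only when the file exists (surface) AND contains the expected
    content (substance); a correctly named file with wrong content fails."""
    return path.exists() and required_text in path.read_text()

with tempfile.TemporaryDirectory() as d:
    report = Path(d) / "report.txt"
    report.write_text("placeholder")                  # right filename, wrong content
    surface_only = report.exists()                    # surface check passes
    verdict = substantive_pass(report, "Q3 revenue")  # substance check fails
```

A surface-only judge would stop at `report.exists()`; the guard requires the second check before emitting PASS.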

4. Structured evidence format

The grader produces per-assertion evidence with cited quotes:

{
  "text": "The spreadsheet has a SUM formula in cell B10",
  "passed": false,
  "evidence": "No spreadsheet was created. The output was a text file."
}

AgentV's eval-judge produces {score, hits, misses, reasoning} — the reasoning is a single blob, not per-assertion evidence. This matters for #565's grading.json artifact compatibility.
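One additive path from the existing blob format to per-assertion evidence is to keep the current fields and append an evidence list derived from hits and misses. A sketch under that assumption; the `assertions` field name and the empty-evidence placeholders are hypothetical:

```python
def to_structured(result: dict) -> dict:
    """Extend an existing {score, hits, misses, reasoning} result with
    per-assertion evidence entries, keeping the original fields intact
    so the JSONL append workflow keeps working."""
    evidence = (
        [{"text": h, "passed": True, "evidence": ""} for h in result["hits"]]
        + [{"text": m, "passed": False, "evidence": ""} for m in result["misses"]]
    )
    return {**result, "assertions": evidence}

legacy = {
    "score": 0.5,
    "hits": ["A text file was created"],
    "misses": ["The spreadsheet has a SUM formula in cell B10"],
    "reasoning": "...",
}
extended = to_structured(legacy)
```

In a real enhancement the judge would populate `evidence` with cited quotes rather than empty strings; the point here is only that the extension can be additive.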

5. User notes integration

The grader reads {outputs_dir}/user_notes.md if it exists: notes from the executor about uncertainties, workarounds, and issues encountered during execution. This surfaces executor-side concerns that pass/fail scores miss. AgentV's eval-judge reads no executor notes today; if executor notes or workspace hook output is available, the enhancement should fold them in.

What to keep from AgentV's eval-judge

  • The agentv prompt eval judge command integration (deterministic + prompt_ready evaluator dispatch)
  • The weighted average scoring across evaluators
  • The JSONL append workflow
  • The mode: "agent" distinction
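For reference, the weighted-average scoring being kept reduces to combining per-evaluator (score, weight) pairs. A sketch only; the actual eval-judge signature and data shapes are assumptions:

```python
def weighted_average(evaluator_results: list[tuple[float, float]]) -> float:
    """Combine per-evaluator scores as a weighted average, guarding
    against a zero total weight."""
    total = sum(w for _, w in evaluator_results)
    if total == 0:
        return 0.0
    return sum(s * w for s, w in evaluator_results) / total

# A deterministic evaluator (weight 2.0) and a prompt evaluator (weight 1.0):
score = weighted_average([(1.0, 2.0), (0.5, 1.0)])  # (2.0 + 0.5) / 3.0
```

Any additive output change from this issue must leave this aggregation untouched for existing eval configs.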

Design Latitude

  • Choose how to integrate the three new capabilities into the existing eval-judge prompt without making it unwieldy
  • May use progressive disclosure: core grading always, claims extraction and eval critique as optional depth levels
  • Choose whether eval feedback goes into the JSONL result (new field) or a companion artifact
  • May add the structured evidence format as a new output shape alongside the existing {score, hits, misses, reasoning}, or replace it
  • Choose whether claims extraction runs on every eval or only when explicitly requested

Acceptance Signals

  • eval-judge extracts and verifies at least factual and process claims from agent output, beyond predefined assertions
  • eval-judge produces eval_feedback with concrete suggestions when assertions are weak (trivially satisfiable, missing important checks, or unverifiable)
  • eval-judge distinguishes surface compliance from genuine task completion in its verdicts
  • Per-assertion evidence with cited quotes is available in the output (compatible with skill-creator's grading.json expectations format)
  • Existing eval configs and JSONL format continue working (enhancements are additive)

Non-Goals

Source Material

  • Anthropic skill-creator grader.md — full prompt with claims extraction, eval critique, surface/substance guards
  • Current AgentV eval-judge: plugins/agentv-dev/agents/eval-judge.md
