
feat: adopt skill-creator grading patterns in eval-judge (claims extraction, eval critique, evidence format)#578

Merged
christso merged 3 commits into main from feat/570-eval-judge-enhancement
Mar 14, 2026

Conversation

@christso
Collaborator

Closes #570

Changes

Enhanced eval-judge agent prompt with five capabilities from Anthropic's skill-creator grader:

  1. Claims extraction and verification — Extracts and verifies implicit claims beyond predefined assertions
  2. Eval self-critique — Critiques assertion quality inline, flags weak/trivial assertions
  3. Surface vs substance guards — Distinguishes genuine task completion from superficial compliance
  4. Structured evidence format — Per-assertion {text, passed, evidence} compatible with grading.json
  5. User notes integration — Reads executor notes when available
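The structured evidence format above (capability 4) can be sketched as follows. This is an illustrative shape only: the field names beyond `{text, passed, evidence}` and the example payload are assumptions, not the project's actual types.

```typescript
// Hypothetical per-assertion evidence record, as described in the PR;
// only {text, passed, evidence} comes from the PR description.
interface AssertionEvidence {
  text: string;     // the assertion being checked
  passed: boolean;  // whether the output satisfied it
  evidence: string; // excerpt from the output supporting the verdict
}

// Example grading.json-style payload built from judged assertions.
const assertions: AssertionEvidence[] = [
  {
    text: "Response cites the config file path",
    passed: true,
    evidence: "updated settings in config/eval.yaml",
  },
  {
    text: "Response avoids fabricated APIs",
    passed: false,
    evidence: "mentions a nonexistent helper function",
  },
];

console.log(JSON.stringify({ assertions }, null, 2));
```

One record per assertion keeps verdicts auditable: a reviewer can check each `evidence` excerpt against the judged output directly.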

What's preserved

  • `agentv prompt eval judge` command integration
  • Weighted average scoring
  • JSONL append workflow
  • Deterministic + prompt_ready evaluator dispatch
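The preserved weighted-average scoring might look like the following minimal sketch; the `Score` shape and weight semantics here are assumptions, not the project's real implementation.

```typescript
// Assumed per-assertion score shape (not the project's actual type).
interface Score {
  name: string;
  value: number;  // per-assertion score in [0, 1]
  weight: number; // relative importance
}

// Weighted average: sum of value*weight divided by total weight.
function weightedAverage(scores: Score[]): number {
  const totalWeight = scores.reduce((sum, s) => sum + s.weight, 0);
  if (totalWeight === 0) return 0; // guard against an empty or zero-weight list
  return scores.reduce((sum, s) => sum + s.value * s.weight, 0) / totalWeight;
}

const overall = weightedAverage([
  { name: "correctness", value: 1, weight: 2 },
  { name: "style", value: 0.5, weight: 1 },
]);
console.log(overall); // (1*2 + 0.5*1) / 3 ≈ 0.833
```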

Enhance eval-judge with claims extraction, eval self-critique,
surface/substance guards, per-assertion evidence format, and user
notes integration from Anthropic's skill-creator grader patterns.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Mar 14, 2026

Deploying agentv with Cloudflare Pages

Latest commit: d744fd3
Status: ⚡️ Build in progress...

View logs

christso and others added 2 commits March 14, 2026 05:41
Restructure enhanced output fields to use existing schema fields
(reasoning, scores[].reasoning, scores[].details) and extensions
pattern for new data.

- Per-assertion evidence → scores[].reasoning + scores[].details
- Verified claims → structured section in top-level reasoning
- User notes → structured section in top-level reasoning
- Eval feedback, claims, user notes summary → extensions object

Core output shape (score, hits, misses, reasoning, answer, mode,
scores[]) remains unchanged. New structured data is additive via
the extensions pattern, which the JSONL writer serializes via
toSnakeCaseDeep().
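The serialization step can be illustrated with a sketch of a deep snake_case key transform like the `toSnakeCaseDeep()` the commit mentions. This is a plausible reimplementation, not the project's actual code, and the `extensions` payload below is a made-up example.

```typescript
// Illustrative deep camelCase -> snake_case key transform; the real
// toSnakeCaseDeep() in the codebase may differ in edge cases.
function toSnakeCaseDeep(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(toSnakeCaseDeep);
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([k, v]) => [
        k.replace(/([a-z0-9])([A-Z])/g, "$1_$2").toLowerCase(),
        toSnakeCaseDeep(v),
      ]),
    );
  }
  return value;
}

// Additive data rides in an extensions object; core fields are untouched.
const result = {
  score: 0.9,
  extensions: {
    evalFeedback: "assertion 2 is trivially satisfiable",
    verifiedClaims: [{ claimText: "output parses as JSON", verified: true }],
  },
};

console.log(JSON.stringify(toSnakeCaseDeep(result)));
// keys become: score, extensions, eval_feedback, verified_claims, claim_text
```

Because new fields live under `extensions` and are snake_cased only at the JSONL boundary, existing consumers of the core shape (`score`, `hits`, `misses`, ...) are unaffected.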

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@christso christso marked this pull request as ready for review March 14, 2026 05:42
@christso christso merged commit ef99a1f into main Mar 14, 2026
1 check was pending
@christso christso deleted the feat/570-eval-judge-enhancement branch March 14, 2026 05:42