
feat: eval analyzer pass for weak assertions and flaky scenarios #582

Merged
christso merged 3 commits into main from feat/567-eval-analyzer
Mar 14, 2026


@christso
Collaborator

Closes #567

Changes

  • New agent: eval-analyzer.md — standalone eval-quality analysis agent
  • New skill: agentv-eval-analyzer/SKILL.md — skill for invoking the analyzer

Capabilities

  • Deterministic-upgrade suggestions: Identifies LLM-judge evaluators doing work a deterministic assertion could handle (contains, regex, is-json, starts-with, etc.)
  • Weak assertion detection: Flags vague, tautological, and compound assertions with specific improvement suggestions
  • Cost/quality flagging: Surfaces always-pass, always-fail, expensive binary checks, and redundant evaluators
  • Multi-provider variance: Flags evaluators with high score variance across targets
  • Works with all evaluator types: code-judge, tool-trajectory, llm-judge, agent-judge, rubrics, composite, and all deterministic types
  • EVAL.yaml aware: Reads both JSONL results and EVAL.yaml config for full-context analysis
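
As an illustration of the cost/quality and multi-provider variance checks listed above, here is a minimal Python sketch. It is not the analyzer's implementation (the analyzer is an agent defined in eval-analyzer.md). The JSONL field names `evaluator`, `target`, and `score`, the function name `flag_evaluators`, and the `variance_threshold` default are all assumptions made for this example and may not match agentv's actual result schema.

```python
import json
from collections import defaultdict
from statistics import pstdev


def flag_evaluators(jsonl_text, variance_threshold=0.25):
    """Flag always-pass, always-fail, and high-variance evaluators.

    Assumes (hypothetically) one JSON object per line with the keys
    "evaluator", "target", and "score" (a float between 0 and 1).
    """
    # evaluator name -> target name -> list of scores
    by_eval = defaultdict(lambda: defaultdict(list))
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the results file
        rec = json.loads(line)
        by_eval[rec["evaluator"]][rec["target"]].append(float(rec["score"]))

    flags = {}
    for name, per_target in by_eval.items():
        all_scores = [s for scores in per_target.values() for s in scores]
        found = []
        if all(s == 1.0 for s in all_scores):
            found.append("always-pass")
        elif all(s == 0.0 for s in all_scores):
            found.append("always-fail")
        # Compare the mean score per target; a large spread suggests the
        # evaluator behaves differently depending on the provider.
        means = [sum(v) / len(v) for v in per_target.values()]
        if len(means) > 1 and pstdev(means) > variance_threshold:
            found.append("high-variance-across-targets")
        if found:
            flags[name] = found
    return flags
```

An evaluator that scores 1.0 on every record carries no signal and is a candidate for removal or tightening, while one whose per-target means diverge widely may need a stricter prompt or a deterministic replacement.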

Architecture

External-first: new agent in plugins/agentv-dev/agents/ and skill in plugins/agentv-dev/skills/, following existing patterns (eval-judge, agentv-trace-analyst).

…gestions (#567)

Add eval-analyzer agent that identifies LLM-judge evaluations replaceable
with deterministic assertions, flags weak/vague assertions, and surfaces
cost/quality improvement opportunities from JSONL results.

New files:
- agents/eval-analyzer.md: standalone analysis agent
- skills/agentv-eval-analyzer/SKILL.md: skill for invoking the analyzer

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Mar 14, 2026

Deploying agentv with Cloudflare Pages

Latest commit: 5767436
Status: ⚡️ Build in progress...


christso and others added 2 commits March 14, 2026 05:35
Fix schema accuracy: llm-judge uses prompt not criteria, remove
non-existent types (icontains, starts-with, ends-with, contains-all),
use correct alternatives (contains, regex).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
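
To make the schema fix above concrete, the sketch below shows one way the removed check types could be rewritten as regex patterns. The helper name `to_regex` is hypothetical, and the patterns use Python's `re` syntax. Whether agentv's regex evaluator accepts the same syntax (in particular the inline `(?i)` flag) is an assumption that should be checked against the actual schema.

```python
import re


def to_regex(kind, value):
    """Return a regex pattern equivalent to an unsupported string check.

    Hypothetical helper for illustration; "kind" is one of the type names
    the fix commit removed.
    """
    escaped = re.escape(value)  # treat the value as a literal string
    if kind == "starts-with":
        return "^" + escaped
    if kind == "ends-with":
        return escaped + "$"
    if kind == "icontains":
        return "(?i)" + escaped  # case-insensitive substring match
    raise ValueError(f"no regex equivalent defined for {kind!r}")


# Usage: each pattern is applied with a search, not a full match.
print(bool(re.search(to_regex("starts-with", "Hello"), "Hello world")))
print(bool(re.search(to_regex("icontains", "ERROR"), "an error occurred")))
```

A contains-all check has no single simple pattern. It could presumably be split into one contains assertion per required value.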
@christso christso marked this pull request as ready for review March 14, 2026 05:42
@christso christso merged commit 976a000 into main Mar 14, 2026
1 check was pending
@christso christso deleted the feat/567-eval-analyzer branch March 14, 2026 05:42
