Find what your AI agent gets wrong — before you have a rubric. Qualitative eval for PMs.
Updated May 26, 2026 · Python
Local-first LLM evaluation runner with baselines, caching, markdown reports, and CI-friendly quality, latency, and cost gates.
Binary safety verdicts (SAFE/HELD/LEAK/MISS/BROKE) + persona fan-out for LLM pipeline evals
Open-source evaluation framework for AI agents. Define test suites with rubrics, run your agent, get LLM-as-judge scores against criteria, inspect full execution traces, and diff runs to catch behavioral regressions.