eval-framework

Here are 4 public repositories matching this topic...

aws-samples / sample-GEDD

Find what your AI agent gets wrong — before you have a rubric. Qualitative eval for PMs.

python product-management ai-agents grounded-theory prompt-engineering ai-testing ai-quality amazon-bedrock llm-evaluation eval-framework

Updated May 26, 2026
Python

abhijeetnardele24-hash / dev-eval-innovator

Local-first LLM evaluation runner with baselines, caching, markdown reports, and CI-friendly quality, latency, and cost gates.

python ci developer-tools prompt-engineering llm-testing llm-evals openai-compatible eval-framework

Updated Apr 13, 2026
Python

svetkis / triage-voice-eval

Binary safety verdicts (SAFE/HELD/LEAK/MISS/BROKE) + persona fan-out for LLM pipeline evals

python testing jailbreak evaluation safety safety-critical guardrails llm prompt-injection crisis-detection eval-framework verdicts

Updated May 5, 2026
Python

Open-source evaluation framework for AI agents. Define test suites with rubrics, run your agent, get LLM-as-judge scores against criteria, inspect full execution traces, and diff runs to catch behavioral regressions.

python open-source ai nextjs agents llm anthropic llm-as-judge agentic-ai agent-evals eval-framework agent-tracing

Updated May 22, 2026
Python
