🐢 Open-Source Evaluation & Testing library for LLM Agents
Evaluation and Tracking for LLM Experiments and AI Agents
A single interface to use and evaluate different agent frameworks
Mathematical benchmark exposing the massive performance gap between real agents and LLM wrappers. Rigorous multi-dimensional evaluation with statistical validation (95% CI, Cohen's h) and reproducible methodology. Separates architectural theater from real systems through stress testing, network resilience, and failure analysis.
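The statistical validation this benchmark advertises (95% confidence intervals and Cohen's h) uses standard formulas for comparing two pass rates. A minimal sketch under those formulas is shown below; the sample counts and variable names are illustrative, not taken from the repository.

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Effect size between two proportions via the arcsine transform."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def ci_95(successes: int, trials: int) -> tuple[float, float]:
    """Normal-approximation 95% confidence interval for a pass rate."""
    p = successes / trials
    half_width = 1.96 * math.sqrt(p * (1 - p) / trials)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# Illustrative numbers only: a "real agent" vs. an "LLM wrapper" on 200 tasks.
agent_pass, wrapper_pass, n = 162, 91, 200
print("agent 95% CI:", ci_95(agent_pass, n))
print("wrapper 95% CI:", ci_95(wrapper_pass, n))
print("Cohen's h:", cohens_h(agent_pass / n, wrapper_pass / n))
```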
Tune your AI agent to best meet its KPIs through a cyclic process of analysis, improvement, and simulation
EvalView: pytest-style test harness for AI agents - YAML scenarios, tool-call checks, cost/latency & safety evals, CI-friendly reports
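The pytest-style, YAML-driven pattern described here can be sketched as follows; the scenario schema and the `run_agent` stub are assumptions for illustration, not EvalView's actual format or API.

```python
# Hypothetical YAML-scenario check: load a scenario, run the agent,
# and assert on tool calls and latency. Requires pyyaml (pip install pyyaml).
import yaml

SCENARIO = """
name: refund_request
input: "I want a refund for order 1234"
expect:
  tools: ["lookup_order", "issue_refund"]
  max_latency_s: 10
"""

def run_agent(prompt: str) -> dict:
    # Stand-in for the agent under test; returns its tool calls and latency.
    return {"tools": ["lookup_order", "issue_refund"], "latency_s": 3.2}

def test_refund_scenario():
    scenario = yaml.safe_load(SCENARIO)
    result = run_agent(scenario["input"])
    assert result["tools"] == scenario["expect"]["tools"]
    assert result["latency_s"] <= scenario["expect"]["max_latency_s"]
```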
Intelligent Context Engineering Assistant for Multi-Agent Systems. Analyze, optimize, and enhance your AI agent configurations with AI-powered insights
Multi-agent customer support system with Google ADK & Gemini 2.5 Flash Lite. Kaggle capstone demonstrating 11+ concepts. Automates 80%+ of queries with <10s response times.
A minimal sandbox to run, score, and compare AI agent outputs locally.
Neurosim is a Python framework for building, running, and evaluating AI agent systems. It provides core primitives for agent evaluation, cloud storage integration, and an LLM-as-a-judge system for automated scoring.
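The LLM-as-a-judge pattern mentioned here generally means prompting a second model to grade an agent's output against a rubric. A minimal sketch of that pattern is below; `call_llm` is a placeholder stub, not Neurosim's actual API, and the rubric prompt is illustrative.

```python
import json

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM client call; replace with the judge model of choice.

    A canned response keeps this sketch runnable offline.
    """
    return '{"score": 4, "rationale": "Correct answer, slightly verbose."}'

def judge(task: str, answer: str) -> dict:
    """Ask a judge model to grade an agent's answer and return structured JSON."""
    prompt = (
        "You are grading an AI agent's answer.\n"
        f"Task: {task}\n"
        f"Answer: {answer}\n"
        'Reply with JSON only: {"score": <1-5>, "rationale": "<one sentence>"}'
    )
    return json.loads(call_llm(prompt))

print(judge("Summarize the ticket", "The customer wants a refund for order 1234."))
```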
Benchmark framework for evaluating LLM agent continual learning in stateful environments. Features production-realistic CRM workflows with multi-turn conversations, state mutations, and cross-entity relationships. Extensible to additional domains
Experiments and analysis on reflection timing in reinforcement learning agents — exploring self-evaluation, meta-learning, and adaptive reflection intervals.
Pondera is a lightweight, YAML-first framework to evaluate AI models and agents with pluggable runners and an LLM-as-a-judge.
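"Pluggable runners" typically means evaluation targets registered behind a common interface so new models or agents can be swapped in without changing the harness. The registry sketch below illustrates that general pattern only; the names are assumptions, not Pondera's interfaces.

```python
# Illustrative runner registry: each runner maps a prompt to an output string.
from typing import Callable, Dict

RUNNERS: Dict[str, Callable[[str], str]] = {}

def register_runner(name: str):
    """Decorator that registers a runner under a lookup name."""
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        RUNNERS[name] = fn
        return fn
    return decorator

@register_runner("echo")
def echo_runner(prompt: str) -> str:
    return prompt

def run(runner_name: str, prompt: str) -> str:
    return RUNNERS[runner_name](prompt)

print(run("echo", "hello"))
```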
Qualitative benchmark suite for evaluating AI coding agents and orchestration paradigms on realistic, complex development tasks
A safety-first multi-agent mental health companion with real-time distress tracking, triple-layer guardrails, and evidence-based grounding techniques. Built for Kaggle × Google Agents Intensive 2025 Capstone (Agents for Good Track)
🤖 Benchmark AI agent capabilities, bridging the gap between hype and reality with clear metrics and insights for informed development decisions.
Decision-oriented evaluation toolkit for LLM & agent systems, focusing on trust, failure modes, and deployment readiness in enterprise environments.
Visual dashboard to evaluate multi-agent & RAG-based AI apps. Compare models on accuracy, latency, token usage, and trust metrics - powered by NVIDIA AgentIQ
Browser automation agent for the Bunnings website using the browser-use library, orchestrated via the laminar framework, managed with uv for Python environments, and running in Brave Browser for stealth and CAPTCHA bypass.