A generic framework for evaluating MCP server implementations side-by-side using Braintrust.
The harness runs an agent loop against each MCP server in a suite, then scores responses on factuality, completeness, and efficiency via Braintrust.
You can test your MCP servers with three different model sets:
- OpenAI models
- Anthropic models
- Anthropic models with advanced tool-use betas

Models can be specified per MCP server from the following options:
- "opus"
- "sonnet"
- "haiku"
- "sonnet-code"
- "opus-code"
- "gpt-4o"
- "gpt-4o-mini"
- "o3"
- "o4-mini"
All domain-specific content — servers, test cases, system prompt, project name — lives in a suite config directory. The generic infrastructure (agent runners, scorers, eval loop) is shared across suites.
```
src/
  suite.ts                    # SuiteConfig type + Zod schema, loadSuite(), getTestCasesForServer()
  eval.ts                     # runEvals() — importable eval orchestrator
  suites/
    index.ts                  # Auto-generated barrel (do not edit)
    stripe/
      suite.ts                # Stripe servers, 12 test cases
      fixtures.json           # Stripe CLI fixtures for seeding test data
    increase/
      suite.ts                # Increase servers, 30 test cases
    increase-search-docs/
      suite.ts                # Increase doc search, 20 test cases
    gemini-search-docs/
      suite.ts                # Gemini API doc search, 20 test cases
  evals/
    e2e.eval.ts               # Entry point for the Braintrust CLI
    run-all.ts                # Re-exports e2e.eval.ts
  agent/
    anthropic-runner.ts       # Agent SDK runner (standard Anthropic models)
    anthropic-code-runner.ts  # Raw SDK runner (code-mode models — defer_loading, tool_search, code_execution)
    openai-runner.ts          # OpenAI runner (GPT / o-series)
    models.ts                 # Model registry + resolveModel()
    types.ts                  # AgentRunner, AgentResult, ToolCallRecord, ModelConfig, Provider
    index.ts                  # Runner factory + re-exports
  scorers/
    completeness.ts           # Heuristic: checks expected text/fields in output
    efficiency.ts             # Heuristic: penalizes high turn count / token usage
    correctness.ts            # LLM-as-judge factuality (via autoevals)
```
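To make the heuristic scorers concrete, here is a rough sketch of what a completeness-style check can look like (illustrative only; the shipped scorer in `scorers/completeness.ts` may differ):

```ts
// Illustrative completeness heuristic: score the agent's final output by the
// fraction of expected `containsText` snippets it actually contains.
interface CompletenessInput {
  output: string;                          // agent's final answer
  expected?: { containsText?: string[] };  // the test case's `expected` block
}

export function completeness({ output, expected }: CompletenessInput) {
  const snippets = expected?.containsText ?? [];
  if (snippets.length === 0) return { name: "completeness", score: null };

  const haystack = output.toLowerCase();
  const found = snippets.filter((s) => haystack.includes(s.toLowerCase())).length;

  return { name: "completeness", score: found / snippets.length };
}
```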
- Braintrust account
- Stripe Account (optional, for the Stripe suite)
- Stripe Secret API key for your sandbox
- Stripe CLI (`brew install stripe/stripe-cli/stripe`)
- Increase Account (optional, for the Increase suite)
```bash
./scripts/bootstrap
```

Or manually: `cp .env.example .env` and fill in your keys, then `npm install`.
```bash
# Run a built-in suite
./scripts/run-eval stripe
./scripts/run-eval increase
./scripts/run-eval gemini-search-docs

# Run an external suite file from another repo
./scripts/run-eval ../my-repo/suites/my-suite.ts

# Or via npm
EVAL_SUITE=stripe npm run eval
npm run eval:stripe
npm run eval:increase
```

- Create a suite directory at `src/suites/<name>/` with a `suite.ts` that default-exports a `SuiteConfig`:
```ts
import type { SuiteConfig } from "../../suite.js";

const suite: SuiteConfig = {
  projectName: "my-project", // Braintrust project name
  systemPrompt: "You are a helpful assistant with access to ...",
  setup: "my-cli setup-command", // Optional: command to be run before evals to seed test data
  servers: [
    {
      id: "my-server",
      displayName: "My MCP Server",
      command: "node",
      args: ["path/to/server.js"],
      env: { API_KEY: process.env.MY_API_KEY! },
      capabilities: { write: true },
      mode: "tools",
    },
  ],
  testCases: [
    {
      id: "test-1",
      prompt: "How many items are there?",
      expected: {
        description: "Returns the count of items",
        containsText: ["42"],
      },
      tags: ["read"],
    },
  ],
};

export default suite;
```
- Optionally add supporting files (e.g. `fixtures.json`) in the same directory. You can also provide a `setup` command to seed test data in your sandbox account if you would like.
- Set the required environment variables for your servers.
- Run: `EVAL_SUITE=<name> npm run eval`

The harness can be imported directly from other repos. Install it from git, then use `runEvals` to run a suite with all the scoring/tagging/metrics plumbing handled for you:
```ts
import { runEvals } from "mcp-evals-harness/eval";
import type { SuiteConfig } from "mcp-evals-harness";

const suite: SuiteConfig = {
  projectName: "my-project",
  systemPrompt: "You are a helpful assistant...",
  servers: [
    {
      id: "my-server",
      transport: "http",
      url: "https://my-mcp-server.example.com",
      capabilities: { write: false },
      mode: "code",
      models: ["sonnet-code"],
    },
  ],
  testCases: [
    {
      id: "test-1",
      prompt: "How many items are there?",
      expected: {
        description: "Returns the count of items",
        containsText: ["42"],
      },
      tags: [],
    },
  ],
};

runEvals(suite, {
  tags: (process.env.EVAL_TAGS ?? "").split(",").filter(Boolean),
});
```

Run with `npx braintrust eval your-eval-file.ts`.
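Installing from git can be done with a plain npm git dependency; the repository path below is a placeholder, and only the `mcp-evals-harness` package name is taken from the imports above:

```bash
# Placeholder repository path; substitute the real one for this harness
npm install git+https://github.com/<your-org>/mcp-evals-harness.git
```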
Tags let you label and filter experiments and records in Braintrust. There are three sources:
- Test case tags — defined in `tags` on each test case (applied per-record)
- Server tags — add `tags` to a server config (applied to the experiment)
- CLI tags — set `EVAL_TAGS` (comma-separated) to tag all experiments in a run
Server and CLI tags are applied at the experiment level. Test case tags are applied to individual records.
```ts
// Test case tags (per-record)
testCases: [
  {
    id: "test-1",
    prompt: "How many items are there?",
    expected: { description: "Returns the count of items" },
    tags: ["read", "basic"],
  },
],
```

```ts
// Server tags (experiment-level)
servers: [
  {
    id: "my-server",
    // ...
    tags: ["official", "production"],
  },
],
```

```bash
# CLI tags (experiment-level)
EVAL_TAGS=nightly,regression npm run eval:stripe
```

For help/bug reports, please reach out to support@stainless.com.