stainless-api/mcp-evals-harness
Stainless MCP Evaluation Harness

A generic framework for evaluating MCP server implementations side-by-side using Braintrust.

How It Works

The harness runs an agent loop against each MCP server in a suite, then scores responses on factuality, completeness, and efficiency via Braintrust.
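Conceptually, the loop is a server × test-case matrix. The sketch below is illustrative only — the types and function names are stand-ins, not the harness's actual API:

```typescript
// Hypothetical sketch of the harness's core loop: for every (server, test case)
// pair, run the agent and collect scores. Real scorers live in src/scorers/.
type TestCase = { id: string; prompt: string };
type Server = { id: string };
type ScoredResult = { server: string; testCase: string; scores: Record<string, number> };

// Stand-in for the real agent runner (Anthropic/OpenAI runners in src/agent/).
async function runAgent(server: Server, tc: TestCase): Promise<string> {
  return `answer from ${server.id} for ${tc.prompt}`;
}

// Stand-in for the real scorers (LLM-as-judge factuality plus heuristics).
function scoreOutput(output: string): Record<string, number> {
  return { factuality: output.length > 0 ? 1 : 0, completeness: 1, efficiency: 1 };
}

async function evalMatrix(servers: Server[], cases: TestCase[]): Promise<ScoredResult[]> {
  const results: ScoredResult[] = [];
  for (const server of servers) {
    for (const tc of cases) {
      const output = await runAgent(server, tc);
      results.push({ server: server.id, testCase: tc.id, scores: scoreOutput(output) });
    }
  }
  return results;
}
```

In the real harness, Braintrust records each (server, test case) result as an experiment record with its scores attached.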

You can test your MCP servers with three different model sets:

  • OpenAI models
  • Anthropic models
  • Anthropic models with advanced tool-use betas

Models can be specified per MCP server from the following options:

  • "opus"
  • "sonnet"
  • "haiku"
  • "sonnet-code"
  • "opus-code"
  • "gpt-4o"
  • "gpt-4o-mini"
  • "o3"
  • "o4-mini"
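A server entry can pin which of these models it is evaluated with via its `models` field, as shown in the SuiteConfig examples in this README (whether omitting `models` falls back to a suite-wide default is an assumption here):

```typescript
servers: [
  {
    id: "my-server",
    displayName: "My MCP Server",
    command: "node",
    args: ["path/to/server.js"],
    // Only these models will be run against this server.
    models: ["sonnet", "gpt-4o"],
  },
],
```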

All domain-specific content — servers, test cases, system prompt, project name — lives in a suite config directory. The generic infrastructure (agent runners, scorers, eval loop) is shared across suites.

src/
  suite.ts                        # SuiteConfig type + Zod schema, loadSuite(), getTestCasesForServer()
  eval.ts                         # runEvals() — importable eval orchestrator
  suites/
    index.ts                      # Auto-generated barrel (do not edit)
    stripe/
      suite.ts                    # Stripe servers, 12 test cases
      fixtures.json               # Stripe CLI fixtures for seeding test data
    increase/
      suite.ts                    # Increase servers, 30 test cases
    increase-search-docs/
      suite.ts                    # Increase doc search, 20 test cases
    gemini-search-docs/
      suite.ts                    # Gemini API doc search, 20 test cases
  evals/
    e2e.eval.ts                   # Entry point for Braintrust CLI
    run-all.ts                    # Re-exports e2e.eval.ts
  agent/
    anthropic-runner.ts           # Agent SDK runner (standard Anthropic models)
    anthropic-code-runner.ts      # Raw SDK runner (code-mode models — defer_loading, tool_search, code_execution)
    openai-runner.ts              # OpenAI runner (GPT / o-series)
    models.ts                     # Model registry + resolveModel()
    types.ts                      # AgentRunner, AgentResult, ToolCallRecord, ModelConfig, Provider
    index.ts                      # Runner factory + re-exports
  scorers/
    completeness.ts               # Heuristic: checks expected text/fields in output
    efficiency.ts                 # Heuristic: penalizes high turn count / token usage
    correctness.ts                # LLM-as-judge factuality (via autoevals)
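The completeness heuristic can be pictured as a simple "fraction of expected strings found" check. This is an illustrative sketch, not the actual scorer code:

```typescript
// Sketch of a completeness heuristic: score is the fraction of expected
// substrings (from a test case's `containsText`) found in the agent's output.
function completeness(output: string, containsText: string[]): number {
  if (containsText.length === 0) return 1; // nothing expected, trivially complete
  const hits = containsText.filter((t) => output.includes(t)).length;
  return hits / containsText.length;
}
```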

Prerequisites

  • Braintrust account
  • Stripe account (optional, for the Stripe suite)
    • Stripe secret API key for your sandbox
    • Stripe CLI (brew install stripe/stripe-cli/stripe)
  • Increase account (optional, for the Increase suite)

Setup

./scripts/bootstrap

Or manually: cp .env.example .env and fill in your keys, then npm install.
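A typical .env will hold provider and suite credentials along these lines (the variable names below are assumptions for illustration — .env.example is the authoritative list):

```shell
BRAINTRUST_API_KEY=...
ANTHROPIC_API_KEY=...
OPENAI_API_KEY=...
STRIPE_SECRET_KEY=...   # only needed for the Stripe suite
```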

Run

# Run a built-in suite
./scripts/run-eval stripe
./scripts/run-eval increase
./scripts/run-eval gemini-search-docs

# Run an external suite file from another repo
./scripts/run-eval ../my-repo/suites/my-suite.ts

# Or via npm
EVAL_SUITE=stripe npm run eval
npm run eval:stripe
npm run eval:increase

Adding a New Suite

  1. Create a suite directory at src/suites/<name>/ with a suite.ts that default-exports a SuiteConfig:
import type { SuiteConfig } from "../../suite.js";

const suite: SuiteConfig = {
  projectName: "my-project", // Braintrust project name
  systemPrompt: "You are a helpful assistant with access to ...",
  setup: "my-cli setup-command", // Optional: command to be run before evals to seed test data
  servers: [
    {
      id: "my-server",
      displayName: "My MCP Server",
      command: "node",
      args: ["path/to/server.js"],
      env: { API_KEY: process.env.MY_API_KEY! },
      capabilities: { write: true },
      mode: "tools",
    },
  ],
  testCases: [
    {
      id: "test-1",
      prompt: "How many items are there?",
      expected: {
        description: "Returns the count of items",
        containsText: ["42"],
      },
      tags: ["read"],
    },
  ],
};

export default suite;
  2. Optionally add supporting files (e.g. fixtures.json) in the same directory. The optional setup command in the suite config can seed test data in your sandbox account before evals run.

  3. Set the required environment variables for your servers.

  4. Run:

EVAL_SUITE=<name> npm run eval

Using as a Library

The harness can be imported directly from other repos. Install from git, then use runEvals to run a suite with all scoring/tagging/metrics plumbing handled for you:

import { runEvals } from "mcp-evals-harness/eval";
import type { SuiteConfig } from "mcp-evals-harness";

const suite: SuiteConfig = {
  projectName: "my-project",
  systemPrompt: "You are a helpful assistant...",
  servers: [
    {
      id: "my-server",
      transport: "http",
      url: "https://my-mcp-server.example.com",
      capabilities: { write: false },
      mode: "code",
      models: ["sonnet-code"],
    },
  ],
  testCases: [
    {
      id: "test-1",
      prompt: "How many items are there?",
      expected: {
        description: "Returns the count of items",
        containsText: ["42"],
      },
      tags: [],
    },
  ],
};

runEvals(suite, {
  tags: (process.env.EVAL_TAGS ?? "").split(",").filter(Boolean),
});

Run with npx braintrust eval your-eval-file.ts.

Tags

Tags let you label and filter experiments and records in Braintrust. There are three sources:

  • Test case tags — defined in tags on each test case (applied per-record)
  • Server tags — add tags to a server config (applied to the experiment)
  • CLI tags — set EVAL_TAGS (comma-separated) to tag all experiments in a run

Server and CLI tags are applied at the experiment level. Test case tags are applied to individual records.

// Test case tags (per-record)
testCases: [
  {
    id: "test-1",
    prompt: "How many items are there?",
    expected: { description: "Returns the count of items" },
    tags: ["read", "basic"],
  },
],
// Server tags (experiment-level)
servers: [
  {
    id: "my-server",
    // ...
    tags: ["official", "production"],
  },
],
# CLI tags (experiment-level)
EVAL_TAGS=nightly,regression npm run eval:stripe

Support

For help or bug reports, please reach out to support@stainless.com.
