stainless-api/mcp-evals-harness
Stainless MCP Evaluation Harness

A generic framework for evaluating MCP server implementations side-by-side using Braintrust.

How It Works

The harness runs an agent loop against each MCP server in a suite, then scores responses on factuality, completeness, and efficiency via Braintrust.
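Conceptually, the loop is a server × test-case matrix. The sketch below is illustrative only — the types and function names are stand-ins, not the harness's actual API:

```typescript
// Hypothetical sketch of the harness's core loop: for every (server, test case)
// pair, run the agent and collect scores. Real scorers live in src/scorers/.
type TestCase = { id: string; prompt: string };
type Server = { id: string };
type ScoredResult = { server: string; testCase: string; scores: Record<string, number> };

// Stand-in for the real agent runner (Anthropic/OpenAI runners in src/agent/).
async function runAgent(server: Server, tc: TestCase): Promise<string> {
  return `answer from ${server.id} for ${tc.prompt}`;
}

// Stand-in for the real scorers (LLM-as-judge factuality plus heuristics).
function scoreOutput(output: string): Record<string, number> {
  return { factuality: output.length > 0 ? 1 : 0, completeness: 1, efficiency: 1 };
}

async function evalMatrix(servers: Server[], cases: TestCase[]): Promise<ScoredResult[]> {
  const results: ScoredResult[] = [];
  for (const server of servers) {
    for (const tc of cases) {
      const output = await runAgent(server, tc);
      results.push({ server: server.id, testCase: tc.id, scores: scoreOutput(output) });
    }
  }
  return results;
}
```

In the real harness, Braintrust records each (server, test case) result as an experiment record with its scores attached.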

You can test your MCP servers with three different model sets:

  • OpenAI models
  • Anthropic models
  • Anthropic models with advanced tool-use betas

Models can be specified per MCP server from the following options:

  • "opus"
  • "sonnet"
  • "haiku"
  • "sonnet-code"
  • "opus-code"
  • "gpt-4o"
  • "gpt-4o-mini"
  • "o3"
  • "o4-mini"
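A server entry can pin which of these models it is evaluated with via its `models` field, as shown in the SuiteConfig examples in this README (whether omitting `models` falls back to a suite-wide default is an assumption here):

```typescript
servers: [
  {
    id: "my-server",
    displayName: "My MCP Server",
    command: "node",
    args: ["path/to/server.js"],
    // Only these models will be run against this server.
    models: ["sonnet", "gpt-4o"],
  },
],
```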

All domain-specific content — servers, test cases, system prompt, project name — lives in a suite config directory. The generic infrastructure (agent runners, scorers, eval loop) is shared across suites.

src/
  suite.ts                        # SuiteConfig type + Zod schema, loadSuite(), getTestCasesForServer()
  eval.ts                         # runEvals() — importable eval orchestrator
  suites/
    index.ts                      # Auto-generated barrel (do not edit)
    stripe/
      suite.ts                    # Stripe servers, 12 test cases
      fixtures.json               # Stripe CLI fixtures for seeding test data
    increase/
      suite.ts                    # Increase servers, 30 test cases
    increase-search-docs/
      suite.ts                    # Increase doc search, 20 test cases
    gemini-search-docs/
      suite.ts                    # Gemini API doc search, 20 test cases
  evals/
    e2e.eval.ts                   # Entry point for Braintrust CLI
    run-all.ts                    # Re-exports e2e.eval.ts
  agent/
    anthropic-runner.ts           # Agent SDK runner (standard Anthropic models)
    anthropic-code-runner.ts      # Raw SDK runner (code-mode models — defer_loading, tool_search, code_execution)
    openai-runner.ts              # OpenAI runner (GPT / o-series)
    models.ts                     # Model registry + resolveModel()
    types.ts                      # AgentRunner, AgentResult, ToolCallRecord, ModelConfig, Provider
    index.ts                      # Runner factory + re-exports
  scorers/
    completeness.ts               # Heuristic: checks expected text/fields in output
    efficiency.ts                 # Heuristic: penalizes high turn count / token usage
    correctness.ts                # LLM-as-judge factuality (via autoevals)
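The completeness heuristic can be pictured as a simple "fraction of expected strings found" check. This is an illustrative sketch, not the actual scorer code:

```typescript
// Sketch of a completeness heuristic: score is the fraction of expected
// substrings (from a test case's `containsText`) found in the agent's output.
function completeness(output: string, containsText: string[]): number {
  if (containsText.length === 0) return 1; // nothing expected, trivially complete
  const hits = containsText.filter((t) => output.includes(t)).length;
  return hits / containsText.length;
}
```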

Prerequisites

  • Braintrust account
  • Stripe account (optional, for the Stripe suite)
    • Stripe secret API key for your sandbox
    • Stripe CLI (brew install stripe/stripe-cli/stripe)
  • Increase account (optional, for the Increase suite)

Setup

./scripts/bootstrap

Or manually: cp .env.example .env and fill in your keys, then npm install.
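A typical .env will hold provider and suite credentials along these lines (the variable names below are assumptions for illustration — .env.example is the authoritative list):

```shell
BRAINTRUST_API_KEY=...
ANTHROPIC_API_KEY=...
OPENAI_API_KEY=...
STRIPE_SECRET_KEY=...   # only needed for the Stripe suite
```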

Run

# Run a built-in suite
./scripts/run-eval stripe
./scripts/run-eval increase
./scripts/run-eval gemini-search-docs

# Run an external suite file from another repo
./scripts/run-eval ../my-repo/suites/my-suite.ts

# Or via npm
EVAL_SUITE=stripe npm run eval
npm run eval:stripe
npm run eval:increase

Adding a New Suite

  1. Create a suite directory at src/suites/<name>/ with a suite.ts that default-exports a SuiteConfig:
import type { SuiteConfig } from "../../suite.js";

const suite: SuiteConfig = {
  projectName: "my-project", // Braintrust project name
  systemPrompt: "You are a helpful assistant with access to ...",
  setup: "my-cli setup-command", // Optional: command to be run before evals to seed test data
  servers: [
    {
      id: "my-server",
      displayName: "My MCP Server",
      command: "node",
      args: ["path/to/server.js"],
      env: { API_KEY: process.env.MY_API_KEY! },
      capabilities: { write: true },
      mode: "tools",
    },
  ],
  testCases: [
    {
      id: "test-1",
      prompt: "How many items are there?",
      expected: {
        description: "Returns the count of items",
        containsText: ["42"],
      },
      tags: ["read"],
    },
  ],
};

export default suite;
  2. Optionally add supporting files (e.g. fixtures.json) in the same directory. The optional setup command in the suite config can seed test data in your sandbox account before evals run.

  3. Set the required environment variables for your servers.

  4. Run:

EVAL_SUITE=<name> npm run eval

Using as a Library

The harness can be imported directly from other repos. Install from git, then use runEvals to run a suite with all scoring/tagging/metrics plumbing handled for you:

import { runEvals } from "mcp-evals-harness/eval";
import type { SuiteConfig } from "mcp-evals-harness";

const suite: SuiteConfig = {
  projectName: "my-project",
  systemPrompt: "You are a helpful assistant...",
  servers: [
    {
      id: "my-server",
      transport: "http",
      url: "https://my-mcp-server.example.com",
      capabilities: { write: false },
      mode: "code",
      models: ["sonnet-code"],
    },
  ],
  testCases: [
    {
      id: "test-1",
      prompt: "How many items are there?",
      expected: {
        description: "Returns the count of items",
        containsText: ["42"],
      },
      tags: [],
    },
  ],
};

runEvals(suite, {
  tags: (process.env.EVAL_TAGS ?? "").split(",").filter(Boolean),
});

Run with npx braintrust eval your-eval-file.ts.

Tags

Tags let you label and filter experiments and records in Braintrust. There are three sources:

  • Test case tags — defined in tags on each test case (applied per-record)
  • Server tags — add tags to a server config (applied to the experiment)
  • CLI tags — set EVAL_TAGS (comma-separated) to tag all experiments in a run

Server and CLI tags are applied at the experiment level. Test case tags are applied to individual records.

// Test case tags (per-record)
testCases: [
  {
    id: "test-1",
    prompt: "How many items are there?",
    expected: { description: "Returns the count of items" },
    tags: ["read", "basic"],
  },
],
// Server tags (experiment-level)
servers: [
  {
    id: "my-server",
    // ...
    tags: ["official", "production"],
  },
],
# CLI tags (experiment-level)
EVAL_TAGS=nightly,regression npm run eval:stripe

Support

For help or bug reports, please reach out to support@stainless.com.
