agentsnap

Deterministic snapshot testing for AI agents.

agentsnap wraps your agent's LLM and tool calls, records a golden run, and writes a snapshot file that you commit. On every subsequent run it replays the same inputs and compares the new trace against the snapshot along three dimensions:

| Dimension | What it checks | How |
| --- | --- | --- |
| Structural | Tool call names and order | Levenshtein edit distance on the tool sequence |
| Arguments | Tool call arguments | `deepdiff` (if installed) or plain dict diff, with configurable ignored fields |
| Semantic | LLM responses and final output | Cosine similarity via `all-MiniLM-L6-v2`, or an LLM judge for higher accuracy |

If any dimension drifts beyond its threshold, agentsnap raises `AgentRegressionError` with a structured diff report.
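To make the structural check concrete, here is a minimal sketch of Levenshtein edit distance applied to tool-name sequences. It illustrates the technique only; the function name `tool_edit_distance` is hypothetical and is not part of the agentsnap API.

```python
from typing import Sequence


def tool_edit_distance(golden: Sequence[str], current: Sequence[str]) -> int:
    """Levenshtein distance between two tool-name sequences.

    Inserting, deleting, or substituting one whole tool call costs 1.
    """
    # prev[j] is the distance between the golden prefix seen so far
    # (before the current row) and current[:j].
    prev = list(range(len(current) + 1))
    for i, g in enumerate(golden, start=1):
        row = [i]  # distance from golden[:i] to an empty sequence
        for j, c in enumerate(current, start=1):
            cost = 0 if g == c else 1
            row.append(min(
                prev[j] + 1,         # drop g from the golden sequence
                row[j - 1] + 1,      # insert c into the golden sequence
                prev[j - 1] + cost,  # keep (cost 0) or substitute (cost 1)
            ))
        prev = row
    return prev[-1]


golden = ["search", "fetch", "summarize"]
current = ["search", "summarize"]
print(tool_edit_distance(golden, current))  # 1: the "fetch" call was dropped
```

A distance of 0 means the tool sequence is unchanged; any larger value counts how many calls were added, removed, or swapped.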

3-minute quickstart

1 — Install

pip install agentsnap

2 — Wrap your client and record a golden run

from agentsnap import AgentRecorder
from agentsnap.adapters.anthropic import AnthropicAdapter
from agentsnap.adapters.tool import ToolAdapter
import anthropic

def search(query: str) -> str:
    return f"Results for: {query}"

client = AnthropicAdapter(anthropic.Anthropic())
search_tool = ToolAdapter(search, name="search")

# my_agent is your own agent function; it must use the wrapped client and tool
with AgentRecorder("my_agent") as rec:
    result = my_agent(client, search_tool, input="What is Python?")
    rec.output = result
# Writes __agent_snapshots__/my_agent.json

Commit the snapshot file. It is the contract for what the agent does.

3 — Assert on future runs

from agentsnap import AgentAsserter

with AgentAsserter("my_agent") as a:
    result = my_agent(client, search_tool, input="What is Python?")
    a.output = result
# Raises AgentRegressionError if behavior drifted

4 — Use the pytest fixture

def test_my_agent(snapshot):
    with snapshot.assert_agent("my_agent") as a:
        result = my_agent(client, search_tool, input="What is Python?")
        a.output = result

pytest

Supported providers

| Provider | Adapter | Intercepts |
| --- | --- | --- |
| Anthropic | `AnthropicAdapter` | `.messages.create()` |
| OpenAI | `OpenAIAdapter` | `.chat.completions.create()` |
| Google Gemini | `GeminiAdapter` | `.models.generate_content()` |
| Cohere | `CohereAdapter` | `.chat()` |
| Mistral | `MistralAdapter` | `.chat.complete()` |
| Groq | `GroqAdapter` | `.chat.completions.create()` |
| OpenRouter | `OpenRouterAdapter` | `.chat.completions.create()` |
| LangGraph | `LangGraphAdapter` | `.invoke()` |
| Any callable | `ToolAdapter` | direct call |

Install provider SDKs as needed:

# Quote the extras so shells such as zsh do not treat the brackets as a glob
pip install "agentsnap[google]"    # google-genai
pip install "agentsnap[cohere]"    # cohere
pip install "agentsnap[mistral]"   # mistralai
pip install "agentsnap[groq]"      # groq
pip install "agentsnap[all-providers]"

Configuration

API key for the LLM judge (optional)

The LLM judge uses a small language model to compare outputs instead of embeddings, which is more accurate for factual content.

agentsnap resolves the API key automatically, so you do not need a separate key. It checks in this order:

1. `AGENTSNAP_JUDGE_API_KEY`: explicit override, always wins
2. The provider-specific key that matches `judge_base_url`:

| `judge_base_url` contains | Key used automatically |
| --- | --- |
| `openrouter.ai` (default) | `OPENROUTER_API_KEY` |
| `api.openai.com` | `OPENAI_API_KEY` |
| `anthropic.com` | `ANTHROPIC_API_KEY` |
| `api.groq.com` | `GROQ_API_KEY` |
| `api.mistral.ai` | `MISTRAL_API_KEY` |
| `api.cohere.com` | `COHERE_API_KEY` |

So if you already have `OPENROUTER_API_KEY` in your environment and `judge_base_url` is left at its default, the judge works with zero additional config.

To use a different provider, change `judge_base_url` in `pyproject.toml` and set the matching env var:

# Use OpenAI directly instead of OpenRouter
export OPENAI_API_KEY=sk-...

[tool.agentsnap]
judge_base_url = "https://api.openai.com/v1"
judge_model    = "gpt-4o-mini"

Once any matching key is found, the `snapshot` pytest fixture enables the LLM judge automatically; no code changes are needed in tests.
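The resolution order described in this section can be sketched in a few lines of Python. This is an illustration of the documented behavior, not agentsnap's actual implementation: the helper name `resolve_judge_key` and the `PROVIDER_KEYS` mapping are hypothetical, and the substring match on the URL is an assumption based on the "contains" wording in the table.

```python
import os
from typing import Mapping, Optional

# Substring of judge_base_url -> environment variable, copied from the table above.
PROVIDER_KEYS = {
    "openrouter.ai": "OPENROUTER_API_KEY",
    "api.openai.com": "OPENAI_API_KEY",
    "anthropic.com": "ANTHROPIC_API_KEY",
    "api.groq.com": "GROQ_API_KEY",
    "api.mistral.ai": "MISTRAL_API_KEY",
    "api.cohere.com": "COHERE_API_KEY",
}


def resolve_judge_key(
    base_url: str, env: Mapping[str, str] = os.environ
) -> Optional[str]:
    """Hypothetical helper: return the judge API key, or None if none is set."""
    # 1. The explicit override always wins.
    if env.get("AGENTSNAP_JUDGE_API_KEY"):
        return env["AGENTSNAP_JUDGE_API_KEY"]
    # 2. Otherwise use the provider key whose host appears in base_url.
    for host, var in PROVIDER_KEYS.items():
        if host in base_url and env.get(var):
            return env[var]
    return None


# A fake environment keeps the example independent of your real shell.
fake_env = {"OPENAI_API_KEY": "sk-openai", "OPENROUTER_API_KEY": "sk-or"}
print(resolve_judge_key("https://api.openai.com/v1", fake_env))     # sk-openai
print(resolve_judge_key("https://openrouter.ai/api/v1", fake_env))  # sk-or
print(resolve_judge_key("https://api.groq.com/openai/v1", fake_env))  # None
```

Passing the environment as a parameter makes the lookup easy to test without touching `os.environ`.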

Project settings (`pyproject.toml`)

[tool.agentsnap]
judge_model        = "openai/gpt-4o-mini"
judge_base_url     = "https://openrouter.ai/api/v1"
semantic_threshold = 0.92   # final agent output (strict)
llm_threshold      = 0.75   # intermediate LLM responses (tolerant)

These can also be set as pytest ini options:

[tool.pytest.ini_options]
agentsnap_judge_model        = "openai/gpt-4o-mini"
agentsnap_judge_base_url     = "https://openrouter.ai/api/v1"
agentsnap_semantic_threshold = "0.92"
agentsnap_llm_threshold      = "0.75"

API reference

`AgentRecorder(test_name, snapshot_dir="__agent_snapshots__", model="unknown")`

Context manager. Intercepts all adapter calls and writes a snapshot on clean exit.

with AgentRecorder("name", model="claude-haiku-4-5") as rec:
    rec.input_data = {"query": "hello"}   # optional metadata
    result = my_agent(wrapped_client, ...)
    rec.output = result

`AgentAsserter(test_name, snapshot_dir, semantic_threshold, llm_threshold, ignored_fields, embed_fn, judge)`

Context manager. Reads the snapshot, intercepts calls, runs the three-layer diff on exit.

| Parameter | Default | Description |
| --- | --- | --- |
| `semantic_threshold` | `0.92` | Min similarity for final output |
| `llm_threshold` | `0.75` | Min similarity for intermediate LLM responses |
| `ignored_fields` | `None` | Tool arg keys to exclude from argument diff |
| `embed_fn` | `None` | Custom embedding function (for testing) |
| `judge` | `None` | `LLMJudge` instance; overrides embedding comparison |

with AgentAsserter("name", semantic_threshold=0.95, ignored_fields=["timestamp"]) as a:
    result = my_agent(wrapped_client, ...)
    a.output = result
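To show what `ignored_fields` does conceptually, here is a minimal sketch of a plain dict diff that skips ignored keys. It is not agentsnap's implementation: `diff_args` and `MISSING` are hypothetical names, and the sketch compares top-level keys only, whereas the real argument diff (especially with `deepdiff`) may handle nested values differently.

```python
from typing import Any, Dict, Iterable, Tuple

MISSING = object()  # sentinel for a key that is absent on one side


def diff_args(
    golden: Dict[str, Any],
    current: Dict[str, Any],
    ignored_fields: Iterable[str] = (),
) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (golden_value, current_value)} for every changed key."""
    ignored = set(ignored_fields)
    changes = {}
    # Look at every key present on either side, minus the ignored ones.
    for key in sorted((golden.keys() | current.keys()) - ignored):
        old = golden.get(key, MISSING)
        new = current.get(key, MISSING)
        if old != new:
            changes[key] = (old, new)
    return changes


golden_args = {"query": "Python", "timestamp": 1700000000}
current_args = {"query": "Python 3", "timestamp": 1700000500}

# The timestamp changes on every run, so it is excluded from the comparison.
print(diff_args(golden_args, current_args, ignored_fields=["timestamp"]))
# {'query': ('Python', 'Python 3')}
```

Volatile values such as timestamps or request IDs are typical candidates for `ignored_fields`, since they differ between runs without indicating a behavior change.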

`LLMJudge(api_key, model, base_url)`

Uses an LLM to score semantic equivalence instead of embeddings. Returns a 0.0–1.0 score and a one-sentence reason explaining any difference.

from agentsnap import LLMJudge

# Explicit construction
judge = LLMJudge(api_key="sk-or-...", model="openai/gpt-4o-mini")

# From environment / pyproject.toml
judge = LLMJudge.from_env()  # returns None if no judge API key is found in the environment

with AgentAsserter("name", judge=judge) as a:
    ...

`snapshot` pytest fixture

Auto-wired from `[tool.agentsnap]` and the judge API key in the environment (`AGENTSNAP_JUDGE_API_KEY` or a matching provider key). No imports needed.

def test_agent(snapshot):
    # Record
    with snapshot.record_agent("name") as rec:
        rec.output = run_agent(...)

    # Assert — judge enabled automatically if API key is set
    with snapshot.assert_agent("name") as a:
        a.output = run_agent(...)

    # Override per-test
    with snapshot.assert_agent("name", judge=False) as a:      # force embeddings
        a.output = run_agent(...)

    with snapshot.assert_agent("name", semantic_threshold=0.98) as a:  # tighter
        a.output = run_agent(...)

Exceptions

| Exception | When raised |
| --- | --- |
| `AgentRegressionError(message, diff_report)` | Behavior drifted beyond threshold |
| `SnapshotNotFoundError(test_name)` | No snapshot found; record first |
| `AdapterNotWrappedError` | Unwrapped client used inside a recording context |

`AgentRegressionError.diff_report` is a `DiffReport` dataclass with `structural_diff`, `argument_diffs`, `semantic_scores`, `semantic_reasons`, and `failed_checks`.

CLI

agentsnap list                        # list all snapshots
agentsnap diff __agent_snapshots__/my_agent.json   # pretty-print a snapshot
agentsnap update my_agent            # approve last run as new golden
agentsnap record <test_file>         # run file in record mode
agentsnap run <test_file>            # run file in assert mode

Snapshot format

{
  "version": "1.0",
  "recorded_at": "2026-01-01T00:00:00+00:00",
  "model": "claude-haiku-4-5",
  "input": { "query": "What is Python?" },
  "trace": [
    { "step": 0, "type": "llm_call", "messages": [...], "response": "...", "tokens": 350 },
    { "step": 1, "type": "tool_call", "name": "search", "args": {"query": "Python"}, "result": "..." }
  ],
  "output": "Python is a high-level programming language..."
}

Golden snapshots live in `__agent_snapshots__/` and are committed to git. The `.last_run/` subdirectory is written on every assert run and should be gitignored; it is only used by `agentsnap update`.
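Because a snapshot is plain JSON, it can be inspected with the standard library alone. The sketch below parses a small snapshot shaped like the example above and pulls out the tool sequence and token count. The inline sample data is made up for illustration, and the assumption that only `llm_call` steps carry `tokens` comes from the example, not from a published schema.

```python
import json

# A small, valid snapshot in the format shown above (sample values only).
snapshot_text = """
{
  "version": "1.0",
  "model": "claude-haiku-4-5",
  "input": {"query": "What is Python?"},
  "trace": [
    {"step": 0, "type": "llm_call", "messages": [],
     "response": "I will search.", "tokens": 350},
    {"step": 1, "type": "tool_call", "name": "search",
     "args": {"query": "Python"}, "result": "Results for: Python"}
  ],
  "output": "Python is a high-level programming language."
}
"""

snapshot = json.loads(snapshot_text)

# Ordered tool names: the sequence the structural check compares.
tool_names = [s["name"] for s in snapshot["trace"] if s["type"] == "tool_call"]

# Total tokens across LLM calls; steps without a "tokens" field count as 0.
total_tokens = sum(
    s.get("tokens", 0) for s in snapshot["trace"] if s["type"] == "llm_call"
)

print(tool_names)    # ['search']
print(total_tokens)  # 350
```

To inspect a real snapshot, replace `json.loads(snapshot_text)` with `json.load` on a file opened from `__agent_snapshots__/`.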

CI integration (GitHub Actions)

name: Agent regression tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: pip

      - name: Install
        run: pip install -e ".[dev]"

      - name: Run agent snapshot tests
        run: pytest tests/ -v
        env:
          # Optional: enables LLM judge for higher-accuracy semantic comparison
          AGENTSNAP_JUDGE_API_KEY: ${{ secrets.AGENTSNAP_JUDGE_API_KEY }}

Snapshots are committed to the repo. CI only runs the asserter — no real agent API calls needed unless your tests explicitly make them.

How to approve an intentional regression

When you intentionally change agent behavior (new prompt, model upgrade, new tool):

# 1. Run tests — they fail, new trace saved to .last_run/
pytest tests/test_my_agent.py

# 2. Inspect what changed
agentsnap diff __agent_snapshots__/my_agent.json

# 3. Approve — promote last run to golden
agentsnap update my_agent

# 4. Commit the new baseline
git add __agent_snapshots__/my_agent.json
git commit -m "approve: updated golden after Sonnet upgrade"

Thresholds

Two independent thresholds control the semantic layer:

| Threshold | Default | Applies to |
| --- | --- | --- |
| `semantic_threshold` | `0.92` | Final `output`: the agent's actual answer |
| `llm_threshold` | `0.75` | Intermediate `llm_call[n]` responses; tolerates natural phrasing variance |

Tune per-test:

# Critical factual agent — hold output tightly
with AgentAsserter("rag_agent", semantic_threshold=0.97) as a: ...

# Creative agent — allow more paraphrasing
with AgentAsserter("writer_agent", semantic_threshold=0.75) as a: ...
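The sketch below shows how a cosine similarity score relates to the two thresholds. It uses toy 2-dimensional vectors rather than real `all-MiniLM-L6-v2` embeddings, and the helper names are hypothetical. Treating a score equal to the threshold as a pass is an assumption; the tables describe the thresholds only as minimum similarities.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


def passes(score: float, threshold: float) -> bool:
    """Assumed rule: a check passes when the score reaches the threshold."""
    return score >= threshold


golden_vec = [1.0, 0.0]   # stand-in for the golden response embedding
current_vec = [0.8, 0.6]  # stand-in for the new response embedding

score = cosine_similarity(golden_vec, current_vec)
print(round(score, 2))      # 0.8
print(passes(score, 0.75))  # True: acceptable for an intermediate LLM response
print(passes(score, 0.92))  # False: too far off for the final output
```

The same score can therefore pass as an intermediate response and fail as a final output, which is why the two thresholds are tuned separately.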
