CacheEval

A 2,000-row pair-equivalence benchmark for LLM request-level prompt caches.

CacheEval measures whether a cache makes the correct HIT-vs-MISS decision across 10 domains — not just whether it has a high hit rate. The safety-critical False-Hit Rate (FHR) is the headline number: how often the cache returns a wrong cached answer for a query that only looks similar to a previous one.

CacheEval is the evaluation and benchmarking suite for the BudCache project. It is self-contained: it ships the dataset, the scoring harness, the reference baselines, and the methodology notes. It does not contain the BudCache cache implementation or its trained models, only the benchmark used to grade them.

Why CacheEval

The default "semantic cache" failure mode is to silently return wrong answers on lexically-similar-but-semantically-different queries. GPTCache-style caches score 60–70% precision on Quora paraphrases and collapse to ~30% on PAWS-style adversarial pairs. The danger isn't a low hit rate — it's a confident wrong hit.

No existing public benchmark exercises all the failure axes at once:

Axis	Covered today by	CacheEval
Per-embedding threshold variance	vCache	✅
Tool-intent canonicalization	W5H2	✅
Adversarial robustness	SAFE-CACHE	✅
Domain-specific contrastive pairs	LangCache-Embed	✅
Multi-turn context	ContextCache	✅

CacheEval is the first benchmark that exercises all five axes in one suite, with a 5-class equivalence taxonomy and a deterministic-first verification protocol that keeps LLM-judge usage under ~5% of rows.

The core insight: domain-aware equivalence

The anchor example: in math, x + 2 = 8 and a + 2 = 8 are semantically equivalent, but a prompt cache must not collapse them — because the response will reference the variable name. This rule generalizes:

If the response will contain a token from the prompt, that token is answer-determining and must match exactly.

So the following are all MISS, even at ~1.0 embedding similarity:

Variable / identifier rename in math, attached code, or named narratives
Direction swap (NYC→FL vs FL→NYC)
Operator swap (+/−, encode/decode, sort asc/desc)
Operand swap (25% vs 20%)
Cross-language ("sort a list" in Python vs JavaScript)
Cross-user (same query, different system_prompt / user identity)
Same final message, different prior turns (multi-turn context matters)
Mutations (delete_user, send_email) → policy: never serve from cache
Time-sensitive (weather, stock, news) → policy: short TTL or never cache

CacheEval dedicates ~410 ADVERSARIAL rows to these patterns specifically.
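The answer-determining-token rule can be approximated with a crude lexical guard. The sketch below is illustrative only and is not part of CacheEval or BudCache: `answer_determining_tokens` and `must_miss` are hypothetical names, and the regex covers only numbers, single-letter variables, and basic operators. It would not catch direction swaps, cross-language pairs, or cross-user pairs, and it will misfire on the English article "a".

```python
import re

# Hypothetical token pattern: numbers (with an optional % sign), single-letter
# identifiers such as x or a, and basic arithmetic/comparison operators.
_TOKEN_RE = re.compile(r"\d+(?:\.\d+)?%?|\b[A-Za-z]\b|[+\-*/=<>]")


def answer_determining_tokens(query: str) -> list[str]:
    """Return, in order, the tokens a response is likely to echo back."""
    return _TOKEN_RE.findall(query)


def must_miss(query_a: str, query_b: str) -> bool:
    """True when the two queries differ in any answer-determining token."""
    return answer_determining_tokens(query_a) != answer_determining_tokens(query_b)


print(must_miss("Solve x + 2 = 8", "Solve a + 2 = 8"))         # variable rename
print(must_miss("Solve x + 2 = 8", "Please solve x + 2 = 8"))  # harmless rewording
print(must_miss("What is 25% of 80?", "What is 20% of 80?"))   # operand swap
```

A guard like this would sit in front of an embedding comparison and veto a hit, rather than replace the similarity check.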

Quickstart

CacheEval has no dependencies for scoring — the harness is pure stdlib. You only need extra packages to re-build the dataset or run the GPTCache reference baseline (see requirements.txt, which is layered).

1. Score a built-in baseline

python scripts/eval_harness.py --bench cachebench.jsonl --baseline always_miss  --out baseline_floor.md
python scripts/eval_harness.py --bench cachebench.jsonl --baseline always_hit   --out baseline_ceiling.md
python scripts/eval_harness.py --bench cachebench.jsonl --baseline exact_match   --out exact_match.md

always_miss is the FHR floor (0%), always_hit is the recall ceiling (and the worst-case FHR), and exact_match is the trivial verbatim-only cache. Pre-computed copies live in baseline_always_miss.md, baseline_always_hit.md, and baseline_exact_match.md.

2. Score your own cache

Implement a callable Row -> Verdict and hand it to evaluate:

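from difflib import SequenceMatcher
from scripts.eval_harness import evaluate, load_bench, Verdict, Row  # run from repo root

def my_similarity(a: str, b: str) -> float:
    # Placeholder so the example runs; swap in your embedding model's score.
    return SequenceMatcher(None, a, b).ratio()

def my_cache(row: Row) -> Verdict:
    # Decide whether the cache would serve query_a's answer for query_b.
    is_hit = my_similarity(row.query_a, row.query_b) > 0.85
    return Verdict(is_hit=is_hit, confidence=0.9, tier_used="embedding")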

rows   = load_bench("cachebench.jsonl")
report = evaluate(my_cache, rows, cache_name="my_cache")
print(report.to_markdown())

Or point the CLI at a module that exposes a cache callable (or a class):

python scripts/eval_harness.py --bench cachebench.jsonl \
    --cache-module my_pkg.my_cache --cache-class MyCache \
    --cache-name "my_cache v2" --out my_report.md

The harness wraps each call in try/except (--on-error miss by default), so a cache that raises is scored as a MISS and the error surfaces in the tier report rather than aborting the run.

3. Run the tests

pip install pytest
pytest tests/    # 21 tests: metric algebra, schema/quota, policy collapse, roundtrip

The test_roundtrip.py check enforces that the committed cachebench.jsonl is bit-identical to a fresh assemble.py rebuild.

What's in a row

Each row is a (query_a, query_b, label) triple with metadata. Full spec in SCHEMA.md.

{
  "id": "cb-math-0007",
  "domain": "math",
  "subcategory": "variable_rename",
  "label": "ADVERSARIAL",
  "binary_label": "MISS",
  "difficulty": "hard",
  "query_a": "Solve x + 2 = 8",
  "query_b": "Solve a + 2 = 8",
  "context_a": null, "context_b": null,
  "system_prompt_a": null, "system_prompt_b": null,
  "tools_a": null, "tools_b": null,
  "expected_answer_a": "x = 6", "expected_answer_b": "a = 6",
  "rationale": "Answer cites the variable name, so the rename is answer-determining.",
  "verification_method": "sympy",
  "verification_payload": {"tolerance": 0.001},
  "construction_method": "adversarial_perturbation",
  "source": "curator",
  "real_world_grounding": "Students reuse the same template with different letters."
}

The 5-class label (with binary collapse)

Label	Binary	Meaning
`EQUIV`	HIT	Same question, same answer expected.
`PARA_SAFE`	HIT	Paraphrase / surface variation; answer still equivalent.
`RELATED_UNSAFE`	MISS	Topically related but the answer differs materially.
`ADVERSARIAL`	MISS	High lexical/embedding similarity, different answer — the hard cases.
`UNRELATED`	MISS	Obviously different questions.

A policy override can flip an otherwise-HIT pair to MISS:

if verification_method == "policy_no_cache":
    binary_label = "MISS"            # creative output, mutations, cross-user personalized
else:
    binary_label = "HIT" if label in {"EQUIV", "PARA_SAFE"} else "MISS"

Post-v1.1 class balance: 675 HIT (33.75%) / 1325 MISS (66.25%).

Sampling matrix (2,000 rows)

Domain ↓ \ Label →	EQUIV	PARA_SAFE	RELATED_UNSAFE	ADVERSARIAL	UNRELATED	Total
qa_factual	50	50	50	40	40	230
qa_open	60	60	50	40	50	260
math	30	30	30	50	20	160
code	30	30	30	50	20	160
conversational	60	60	50	40	50	260
tool	60	60	60	50	50	280
creative	40	40	40	30	30	180
personalized	30	30	30	30	20	140
multi_turn	30	30	50	40	30	180
multilingual	30	20	30	40	30	150
Total	420	410	420	410	340	2,000

qa_factual + qa_open ship together in domains/qa.jsonl (490 rows).

Why 2,000? Statistical power. A 95% Wilson binomial CI on a 2k-pair test set has half-width ~2.2pp at p=0.5 and ~1.0pp at p=0.95 — enough to resolve the inter-system gaps reported in the literature (vCache 12.5×, MeanCache +17% F1, SAFE-CACHE 38pp ASR, LangCache +20% precision) at p<0.01. It matches PAWS-X per-language (2k), MT-Bench (2,400), and AmbigQA test (2,002).
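The half-widths quoted above can be reproduced with the standard Wilson score interval. This is a minimal sketch, not the harness's own code; `wilson_half_width` is a hypothetical name and z = 1.96 is the usual two-sided 95% critical value.

```python
import math


def wilson_half_width(p_hat: float, n: int, z: float = 1.96) -> float:
    """Half-width of the Wilson score interval for a binomial proportion."""
    denom = 1.0 + z * z / n
    radius = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n))
    return radius / denom


for p in (0.5, 0.95):
    # Report the half-width in percentage points for a 2,000-pair test set.
    print(f"p={p}: +/- {100 * wilson_half_width(p, 2000):.2f} pp")
```

The same function shows why small per-cell counts are noisy: at n = 30 and p = 0.5 the half-width is well over 10 percentage points.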

Difficulty (empirically re-derived in v1.1)

Difficulty is the per-row k-of-5 score from an ensemble of five bi-encoders (MiniLM-L6, mpnet-base, bge-small, e5-small, multi-qa-MiniLM):

Easy (≥4/5 embedders correct) — mostly UNRELATED + obvious EQUIV
Medium (2–3/5 correct) — most PARA_SAFE / RELATED_UNSAFE
Hard (0–1/5 correct) — concentrated in ADVERSARIAL + tool + multi-turn
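The k-of-5 bucketing above amounts to a small lookup. The sketch below assumes each embedder's HIT/MISS call has already been compared with the gold binary label; `difficulty_bucket` is a hypothetical name, not a function from the repository.

```python
def difficulty_bucket(embedder_correct: list[bool]) -> str:
    """Map five per-embedder correctness flags to easy / medium / hard."""
    if len(embedder_correct) != 5:
        raise ValueError("expected exactly five embedder votes")
    k = sum(embedder_correct)  # number of embedders that got the pair right
    if k >= 4:
        return "easy"
    if k >= 2:
        return "medium"
    return "hard"


print(difficulty_bucket([True, True, True, True, False]))     # 4 of 5 correct
print(difficulty_bucket([True, True, False, False, False]))   # 2 of 5 correct
print(difficulty_bucket([False, False, False, False, True]))  # 1 of 5 correct
```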

Agreement with a held-out model (gte-small): κ = 0.532 (was 0.055 in v1.0). See audits/CALIBRATION_NOTES.md.

Construction methods

Method	Count	Role
Real-traffic mined	520	Highest external validity (WildChat / Arena / ShareGPT / Claude Code)
Intent-bottom-up	338	Deterministic labels (MASSIVE / W5H2-style)
Adversarial perturbation	473	Only source of ADVERSARIAL (PAWS / MeTMaP / operator-swap)
LM-generated	490	Domain-balancing floor (LangCache-Embed style)
Existing benchmark	179	Backward-comparable subscores (QQP/PAWS/MRPC/AmbigQA subsets)

Splits

train + dev (public, labeled): 1,500
test (public queries, hidden labels — leaderboard auto-grading): 500
adversarial-evolving slice: 50/quarter, regenerated against current SOTA to keep the benchmark from saturating (Dynabench-style).

Metrics

Headline (binary HIT/MISS classification):

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1
False-Hit Rate (FHR) = FP / (FP + TN) — the safety-critical number
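As a worked illustration of the four headline formulas, the sketch below computes them from raw confusion counts. `headline_metrics` is a hypothetical helper, and returning 0.0 when a denominator is zero is an assumed convention, not necessarily what the harness does.

```python
def _ratio(numerator: float, denominator: float) -> float:
    # Assumed convention: an undefined ratio (zero denominator) reports as 0.0.
    return numerator / denominator if denominator else 0.0


def headline_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Precision, recall, F1 and False-Hit Rate for a binary HIT/MISS task."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    f1 = _ratio(2 * precision * recall, precision + recall)
    fhr = _ratio(fp, fp + tn)  # share of true-MISS pairs served from cache
    return {"precision": precision, "recall": recall, "f1": f1, "fhr": fhr}


# Toy counts: 10 true-HIT pairs (8 served), 20 true-MISS pairs (2 wrongly served).
print(headline_metrics(tp=8, fp=2, tn=18, fn=2))
```

Note that FHR is normalized by the MISS population (FP + TN), so it is independent of how many HIT pairs the benchmark contains.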

Stratified:

Per-domain F1 + FHR (10 domains)
Per-label FHR on the MISS classes: RELATED_UNSAFE, ADVERSARIAL (the hardest test), UNRELATED
Per-difficulty FHR

Performance & cost:

Latency p50 / p95 / p99 of the decision
Cost per 1k decisions (includes any LLM-judge call)
Verification-tier usage — how many decisions used each tier (exact / programmatic / rubric / judge), exposing a cache's judge dependency

The harness emits all of these plus a Wilson-CI'd accuracy via report.to_markdown(). Full reporting template in SCHEMA.md.

Verification routing (LLM-judge minimization)

Each domain dictates a deterministic check before any LLM-judge is consulted (see design/judge-minimization.md):

Domain	Primary	Secondary	Last-resort
qa_factual	exact_match (alias set)	regex	LLM-judge (rare)
qa_open	rubric (3 atomic criteria)	–	LLM-judge (≤5%)
math	sympy equivalence + numeric tolerance	regex	LLM-judge
code	AST diff + sandboxed pytest	regex	LLM-judge
tool	json_schema + canonical-form match (W5H2)	–	–
conversational	rubric	–	LLM-judge (rare)
creative	policy_no_cache	–	–
personalized	policy_no_cache	–	–
multi_turn	exact_match + context-equiv rubric	–	LLM-judge
multilingual	exact_match on translated answer	LaBSE similarity	LLM-judge

Expected tier distribution: ~70% deterministic, ~25% programmatic/rubric, ~5% LLM-judge.
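The deterministic-first idea can be sketched as a cascade in which each check either returns a conclusive verdict or defers to the next tier. Everything below is an assumption for illustration: the check functions, the tier names, and the `verify` helper are hypothetical and far simpler than the per-domain checks in the table, and `stub_judge` merely stands in for an LLM-judge call.

```python
import re
from collections import Counter
from typing import Callable, List, Optional, Tuple

# A check returns True/False when it is conclusive and None when it must defer.
Check = Callable[[str, str], Optional[bool]]

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def exact_match(a: str, b: str) -> Optional[bool]:
    # Conclusive only on a normalized string match; otherwise defer.
    return True if a.strip().lower() == b.strip().lower() else None


def numeric_tolerance(a: str, b: str, tol: float = 1e-3) -> Optional[bool]:
    # Applies only when both answers are bare numbers; otherwise defer.
    if not (_NUMBER.fullmatch(a.strip()) and _NUMBER.fullmatch(b.strip())):
        return None
    return abs(float(a) - float(b)) <= tol


def stub_judge(a: str, b: str) -> Optional[bool]:
    # Placeholder for an LLM-judge call; conservatively says "not equivalent".
    return False


CHAIN: List[Tuple[str, Check]] = [
    ("exact_match", exact_match),
    ("numeric_tolerance", numeric_tolerance),
    ("llm_judge", stub_judge),
]


def verify(a: str, b: str) -> Tuple[str, bool]:
    """Return (tier_used, equivalent) from the first conclusive check."""
    for tier, check in CHAIN:
        verdict = check(a, b)
        if verdict is not None:
            return tier, verdict
    raise RuntimeError("no check was conclusive")


pairs = [("Paris", "paris"), ("6", "6.0"), ("x = 6", "a = 6")]
tier_usage = Counter(verify(a, b)[0] for a, b in pairs)
print(tier_usage)
```

Counting which tier produced each verdict, as `tier_usage` does here, is the same bookkeeping that the verification-tier usage metric reports.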

Reference results

GPTCache baseline — the bar to beat

Real evaluation of GPTCache 0.1.44 on all 2,000 rows of v1.1 (Protocol A-prime, per-pair isolated, faithfully reproducing GPTCache's rank_threshold = (max−min) × similarity_threshold decision rule). Full detail in reports/SUMMARY.md; per-row scores in reports/gptcache_*.csv.
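The decision rule named above can be written out in a few lines. This is a sketch of the formula as stated here, not GPTCache's source: `static_threshold_hit` is a hypothetical name, and treating a score at or above `rank_threshold` as a hit is an assumption about the comparison direction.

```python
def static_threshold_hit(score: float, score_min: float, score_max: float,
                         similarity_threshold: float) -> bool:
    """Apply rank_threshold = (max - min) * similarity_threshold to one score."""
    rank_threshold = (score_max - score_min) * similarity_threshold
    # Assumed direction: a score at or above the threshold is served as a HIT.
    return score >= rank_threshold


# With an evaluator whose scores range over [0, 1], a threshold of 0.80 admits 0.9.
print(static_threshold_hit(0.9, 0.0, 1.0, 0.80))
# With scores ranging over [0, 4], the same setting requires a score of 3.2 or more.
print(static_threshold_hit(3.0, 0.0, 4.0, 0.80))
```

Because the cut-off is a fixed fraction of the evaluator's score range, every pair is judged against the same bar regardless of domain or query type.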

Config	F1	FHR	ADVERSARIAL-FHR	Precision	Recall	p95 ms
minilm-l6 @ 0.80 (realistic default)	0.571	0.682	0.963	0.411	0.933	44
minilm-l6 @ 0.85	0.582	0.614	0.912	0.429	0.905	40
mpnet-base @ 0.80 (better embedder)	0.575	0.674	0.963	0.415	0.938	234
minilm-l6 + ONNX cross-encoder	0.508	0.280	0.405	0.490	0.527	159
onnx-default (out-of-box)	0.571	0.619	0.929	0.421	0.884	154

Takeaways:

GPTCache's realistic default has a 68% false-hit rate — and 96% on adversarial pairs.
A better embedding model does not fix it (mpnet ≈ minilm). The bottleneck is the decision rule, not the embedder.
Only the cross-encoder evaluator meaningfully helps (FHR 28%), but it gives up about 41 points of recall (0.933 to 0.527) and costs 3.4× more.
No similarity threshold escapes the trade-off: even at 0.95, FHR is still ~35% while recall collapses to ~0.66 (see the threshold sweep in reports/SUMMARY.md). This confirms vCache's thesis that static-threshold semantic caching cannot be made safe.
Catastrophic domains: personalized FHR = 100% (cross-user leakage), tool 80%, math 79%, creative 62%.

BudCache head-to-head

CacheEval was built to drive an accuracy-first cache. On the same 2,000 rows, BudCache (MiniLM-L6 recall gate + cheap structured/lexical features + LightGBM verifier, reported on leakage-free 5-fold out-of-fold predictions at the FHR≤5% operating point) reaches:

System	Accuracy	F1	FHR	ADVERSARIAL-FHR	Precision	Recall
BudCache	0.911	0.864	0.049	0.080	0.896	0.834
GPTCache (minilm-l6 @ 0.80)	0.526	0.571	0.682	0.963	0.411	0.933

A ~14× reduction in dangerous wrong-answer hits, without the recall collapse the cross-encoder suffers. Full per-domain breakdown and live latency numbers in reports/BUDCACHE_vs_GPTCACHE.md.
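The accuracy column and the ~14× figure can be cross-checked from recall and FHR alone. The sketch below assumes the post-v1.1 class balance of 675 HIT / 1325 MISS applies to both rows of the table; `implied_accuracy` is a hypothetical helper, and small differences from the table come from the three-decimal rounding of its inputs.

```python
N_HIT, N_MISS = 675, 1325  # post-v1.1 class balance quoted in this README


def implied_accuracy(recall: float, fhr: float,
                     n_hit: int = N_HIT, n_miss: int = N_MISS) -> float:
    """Accuracy implied by recall (on HIT pairs) and FHR (on MISS pairs)."""
    true_positives = recall * n_hit
    true_negatives = (1.0 - fhr) * n_miss
    return (true_positives + true_negatives) / (n_hit + n_miss)


print(f"BudCache accuracy ~ {implied_accuracy(0.834, 0.049):.3f}")
print(f"GPTCache accuracy ~ {implied_accuracy(0.933, 0.682):.3f}")
print(f"FHR ratio         ~ {0.682 / 0.049:.1f}x")
```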

Repository layout

CacheEval/
├── cachebench.jsonl          ← the merged 2,000-row dataset (built by assemble.py)
├── SCHEMA.md                 ← row schema + sampling matrix + metric spec
├── CHANGELOG.md              ← version history (v1.0 → v1.1)
├── LICENSE                   ← CC BY-SA 4.0 (dataset); scripts are MIT
├── ATTRIBUTION.md            ← upstream datasets/papers + license propagation
├── requirements.txt          ← layered, pinned deps (scoring needs none)
│
├── domains/                  ← per-domain JSONL shards (assemble.py merges these)
│   ├── qa.jsonl (490)  tool.jsonl (280)  conversational.jsonl (260)
│   ├── multi_turn.jsonl (180)  creative.jsonl (180)  code.jsonl (160)
│   └── math.jsonl (160)  multilingual.jsonl (150)  personalized.jsonl (140)
│
├── scripts/
│   ├── eval_harness.py       ← run any cache against the benchmark (stdlib only)
│   ├── assemble.py           ← merge shards + validate schema / quotas / provenance
│   ├── gen_*.py, qa_build.py ← per-domain generators (reproducibility)
│   ├── gptcache_adapter.py   ← GPTCache reference baseline adapter (Protocol A-prime)
│   ├── run_gptcache.py       ← driver for the 5-config GPTCache sweep
│   ├── embed_rows.py, recalibrate_difficulty.py, repair_adversarial.py
│   └── improve_*.py, fix_*.py, stage1_fixes.py  ← v1.1 remediation passes
│
├── design/                   ← methodology notes that informed the benchmark
│   ├── pair-eval-methodology.md   judge-minimization.md   domain-semantics.md
│   ├── eval-best-practices.md     discrimination-litreview.md
│   ├── traffic-distributions.md
│   └── scripts/              ← WildChat / Arena / ShareGPT traffic analysis
│
├── sources/                  ← fixtures used by the generators
├── tests/                    ← 21-test pytest suite (metrics, schema, roundtrip)
├── audits/                   ← calibration trail (difficulty, adversarial sims, leakage)
├── AUDIT_*.md                ← the four v1.0 brutal-review audits + master summary
├── baseline_*.md             ← always_hit / always_miss / exact_match reports
└── reports/                  ← GPTCache 5-config baseline + BudCache head-to-head

Re-building the dataset (scripts/gen_*.py, assemble.py) requires the upstream corpora (TriviaQA, QQP, PAWS-X, BFCL, xLAM, BANKING77, WildChat, …), which are not redistributed here. The generators currently hardcode local dataset paths — see AUDIT_code.md for the known reproducibility caveats and the proposed --datasets-dir / $CACHEBENCH_DATASETS fix. Scoring an existing cache against the shipped cachebench.jsonl needs none of this.

Audit status (v1.1, post-remediation)

CacheEval went through four independent brutal-review audits (master synthesis in AUDIT_SUMMARY.md), then a full 10-stage remediation pass (recorded in CHANGELOG.md).

Axis	v1.0	v1.1	What changed
Integrity (`AUDIT_integrity.md`)	A−	A	736/736 tool schemas JSON-Schema-valid; sympy answers parseable; sentinels nulled
Code (`AUDIT_code.md`)	B−	A−	reproducibility fixed; construction-method audit added; 21-test pytest suite
Content (`AUDIT_dataset_content.md`)	C+	A−	174 case-flips fixed; monoculture rationales eliminated; 60 identities; honest source tags
Coverage (`AUDIT_coverage.md`)	D	B+	difficulty empirically re-derived (κ 0.03 → 0.53); adversarial slice verified strong

Deferred to v1.2: tighter construction-method targets, a per-generator --datasets-dir CLI flag, and expanding n≤30 cells to n=40 for tighter per-cell CIs.

License

The dataset (cachebench.jsonl, domains/) is licensed under CC BY-SA 4.0 — the share-alike clause is inherited from PAWS-X (CC BY-SA 4.0). The generation/eval scripts under scripts/ are additionally MIT-licensed (scripts/LICENSE-MIT).

See ATTRIBUTION.md for the full upstream source list.

Commercial use: ~80 rows derived from Quora QQP (restrictive ToS) and ~30 from xLAM-irrelevance (CC BY-NC 4.0) are tagged in each row's source field. Filter them out for commercial deployments. See ATTRIBUTION.md for the exact rows.
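Filtering on the source field could look like the sketch below. The tag strings are placeholders, not the real values: the exact source tags for the QQP-derived and xLAM-derived rows are listed in ATTRIBUTION.md and should be substituted in. `iter_rows` and `commercial_subset` are hypothetical helpers, and exact string matching on `source` is an assumption.

```python
import json
from typing import Iterable, Iterator, Set


def iter_rows(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def commercial_subset(rows: Iterable[dict], excluded_sources: Set[str]) -> list[dict]:
    """Keep only rows whose source tag is not in the excluded set."""
    return [row for row in rows if row.get("source") not in excluded_sources]


# Placeholder tags only; take the real ones from ATTRIBUTION.md.
EXCLUDED = {"<qqp-source-tag>", "<xlam-irrelevance-source-tag>"}

demo_rows = [
    {"id": "demo-1", "source": "curator"},
    {"id": "demo-2", "source": "<qqp-source-tag>"},
]
print([row["id"] for row in commercial_subset(demo_rows, EXCLUDED)])

# Against the shipped file, run from the repo root:
# kept = commercial_subset(iter_rows("cachebench.jsonl"), EXCLUDED)
```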

Citation

@misc{cacheeval-v1,
  title  = {CacheEval: A Pair-Equivalence Benchmark for LLM Prompt Caches},
  year   = {2026},
  note   = {A 2,000-row benchmark across 10 domains with 5-class equivalence labels.}
}

Methodology builds on PAWS (Zhang et al. 2019), MASSIVE (FitzGerald et al. 2023), AmbigQA (Min et al. 2020), MT-Bench (Zheng et al. 2023), vCache (2502.03771), W5H2 (2602.18922), MeanCache (2403.02694), SCALM, LangCache-Embed (2504.02268), MeTMaP (2402.14480), SAFE-CACHE, ContextCache (2506.22791), Krites/Async Verified (2602.13165), SemCacheOLAP (2602.19811), and GenCache (2511.17565). Full citations in ATTRIBUTION.md.
