A 2,000-row pair-equivalence benchmark for LLM request-level prompt caches.
CacheEval measures whether a cache makes the correct HIT-vs-MISS decision across 10 domains — not just whether it has a high hit rate. The safety-critical False-Hit Rate (FHR) is the headline number: how often the cache returns a wrong cached answer for a query that only looks similar to a previous one.
CacheEval is the evaluation / benchmarking suite for the BudCache project. It is self-contained: the dataset, the scoring harness, the reference baselines, and the methodology notes. It does not contain the BudCache cache implementation or its trained models — only the benchmark used to grade them.
The default "semantic cache" failure mode is to silently return wrong answers on lexically-similar-but-semantically-different queries. GPTCache-style caches score 60–70% precision on Quora paraphrases and collapse to ~30% on PAWS-style adversarial pairs. The danger isn't a low hit rate — it's a confident wrong hit.
No existing public benchmark exercises all the failure axes at once:
| Axis | Covered today by | CacheEval |
|---|---|---|
| Per-embedding threshold variance | vCache | ✅ |
| Tool-intent canonicalization | W5H2 | ✅ |
| Adversarial robustness | SAFE-CACHE | ✅ |
| Domain-specific contrastive pairs | LangCache-Embed | ✅ |
| Multi-turn context | ContextCache | ✅ |
CacheEval is the first benchmark that exercises all five axes in one suite, with a 5-class equivalence taxonomy and a deterministic-first verification protocol that keeps LLM-judge usage under ~5% of rows.
The anchor example: in math, x + 2 = 8 and a + 2 = 8 are semantically
equivalent, but a prompt cache must not collapse them — because the response
will reference the variable name. This rule generalizes:
If the response will contain a token from the prompt, that token is answer-determining and must match exactly.
So the following are all MISS, even at ~1.0 embedding similarity:
- Variable / identifier rename in math, attached code, or named narratives
- Direction swap (NYC→FL vs FL→NYC)
- Operator swap (+/−, encode/decode, sort asc/desc)
- Operand swap (25% vs 20%)
- Cross-language ("sort a list" in Python vs JavaScript)
- Cross-user (same query, different system_prompt / user identity)
- Same final message, different prior turns (multi-turn context matters)
- Mutations (delete_user, send_email) → policy: never serve from cache
- Time-sensitive (weather, stock, news) → policy: short TTL or never cache
CacheEval dedicates ~410 ADVERSARIAL rows to these patterns specifically.
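The answer-determining-token rule above can be approximated with a cheap lexical guard that runs before any embedding comparison. The sketch below is illustrative only: `answer_determining_tokens` and `guard_allows_hit` are hypothetical names, not part of the CacheEval harness, and the regex covers only numbers, single-letter variables, and arithmetic operators. It would not catch multi-letter swaps such as NYC→FL vs FL→NYC.

```python
import re
from typing import List

# Numbers (optionally with a % sign), single-letter identifiers, arithmetic operators.
# This token set is an assumption chosen for illustration, not CacheEval's definition.
_TOKEN_RE = re.compile(r"\d+(?:\.\d+)?%?|\b[a-zA-Z]\b|[+\-−*/=<>]")


def answer_determining_tokens(query: str) -> List[str]:
    """Ordered list of tokens that a response is likely to echo back."""
    return _TOKEN_RE.findall(query)


def guard_allows_hit(query_a: str, query_b: str) -> bool:
    """Allow a HIT only when both queries carry the same tokens in the same order."""
    return answer_determining_tokens(query_a) == answer_determining_tokens(query_b)


print(guard_allows_hit("Solve x + 2 = 8", "Solve a + 2 = 8"))
print(guard_allows_hit("Solve x + 2 = 8", "Please solve x + 2 = 8"))
```

The guard is deliberately conservative: it can only veto a HIT, never grant one, so it trades some recall for a lower false-hit rate.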
CacheEval has no dependencies for scoring — the harness is pure stdlib. You only
need extra packages to re-build the dataset or run the GPTCache reference baseline
(see requirements.txt, which is layered).
python scripts/eval_harness.py --bench cachebench.jsonl --baseline always_miss --out baseline_floor.md
python scripts/eval_harness.py --bench cachebench.jsonl --baseline always_hit --out baseline_ceiling.md
python scripts/eval_harness.py --bench cachebench.jsonl --baseline exact_match --out exact_match.md

always_miss is the FHR floor (0%), always_hit is the recall ceiling (and the
worst-case FHR), and exact_match is the trivial verbatim-only cache. Pre-computed
copies live in baseline_always_miss.md,
baseline_always_hit.md, and
baseline_exact_match.md.
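For intuition, the three baselines can be pictured as one-line decision functions. This is a sketch under assumptions, not the harness's own code: `Row` and `Verdict` here are minimal stand-ins for the classes in scripts/eval_harness.py (which carry more fields), the `tier_used="baseline"` value is a placeholder, and the real exact_match baseline may compare more than the two query strings.

```python
from dataclasses import dataclass


@dataclass
class Row:  # stand-in: only the two fields this sketch needs
    query_a: str
    query_b: str


@dataclass
class Verdict:  # stand-in with assumed field names
    is_hit: bool
    confidence: float
    tier_used: str


def always_miss(row: Row) -> Verdict:
    # Never serves from cache: zero false hits, zero recall.
    return Verdict(is_hit=False, confidence=1.0, tier_used="baseline")


def always_hit(row: Row) -> Verdict:
    # Always serves from cache: full recall, worst-case false-hit rate.
    return Verdict(is_hit=True, confidence=1.0, tier_used="baseline")


def exact_match(row: Row) -> Verdict:
    # Serves only when the second query is a verbatim repeat of the first.
    return Verdict(is_hit=row.query_a == row.query_b, confidence=1.0, tier_used="baseline")
```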
Implement a callable Row -> Verdict and hand it to evaluate:
from scripts.eval_harness import evaluate, load_bench, Verdict, Row  # run from repo root

def my_cache(row: Row) -> Verdict:
    # Decide whether the cache would serve query_a's answer for query_b.
    is_hit = my_similarity(row.query_a, row.query_b) > 0.85
    return Verdict(is_hit=is_hit, confidence=0.9, tier_used="embedding")

rows = load_bench("cachebench.jsonl")
report = evaluate(my_cache, rows, cache_name="my_cache")
print(report.to_markdown())

Or point the CLI at a module that exposes a cache callable (or a class):
python scripts/eval_harness.py --bench cachebench.jsonl \
    --cache-module my_pkg.my_cache --cache-class MyCache \
    --cache-name "my_cache v2" --out my_report.md

The harness wraps each call in try/except (--on-error miss by default), so a
cache that raises is scored as a MISS and the error surfaces in the tier report
rather than aborting the run.
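The error-handling behaviour described above can be pictured roughly as follows. This is a sketch, not the harness's actual code: `safe_decide` is a hypothetical name, the cache is reduced to a callable returning a bool, and re-raising for any mode other than "miss" is an assumption about what a non-default setting might do.

```python
from typing import Any, Callable, Optional, Tuple


def safe_decide(
    cache: Callable[[Any], bool], row: Any, on_error: str = "miss"
) -> Tuple[bool, Optional[str]]:
    """Return (is_hit, error_message); a raising cache is scored as a MISS."""
    try:
        return bool(cache(row)), None
    except Exception as exc:
        if on_error == "miss":
            # Record the failure instead of aborting the whole run.
            return False, f"{type(exc).__name__}: {exc}"
        raise


def broken_cache(row: Any) -> bool:
    raise ValueError("index not loaded")


print(safe_decide(broken_cache, {"id": "cb-math-0007"}))
```

Scoring a crash as a MISS keeps the false-hit rate honest: a cache cannot improve its FHR by failing on hard rows, but it does lose recall.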
pip install pytest
pytest tests/  # 21 tests: metric algebra, schema/quota, policy collapse, roundtrip

The test_roundtrip.py check enforces that the committed cachebench.jsonl is
bit-identical to a fresh assemble.py rebuild.
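One simple way to check bit-identity between two files is to compare their SHA-256 digests. The sketch below only illustrates the idea. How test_roundtrip.py actually performs the comparison is not documented here, and `sha256_of` and `is_bit_identical` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def sha256_of(path) -> str:
    """Hex digest of the file's raw bytes (no newline or encoding normalization)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def is_bit_identical(committed, rebuilt) -> bool:
    """True only if the two files match byte for byte."""
    return sha256_of(committed) == sha256_of(rebuilt)
```

Because the comparison is on raw bytes, differences in key order, trailing whitespace, or line endings all count as a mismatch, which is the point of a roundtrip check.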
Each row is a (query_a, query_b, label) triple with metadata. Full spec in
SCHEMA.md.
{
  "id": "cb-math-0007",
  "domain": "math",
  "subcategory": "variable_rename",
  "label": "ADVERSARIAL",
  "binary_label": "MISS",
  "difficulty": "hard",
  "query_a": "Solve x + 2 = 8",
  "query_b": "Solve a + 2 = 8",
  "context_a": null, "context_b": null,
  "system_prompt_a": null, "system_prompt_b": null,
  "tools_a": null, "tools_b": null,
  "expected_answer_a": "x = 6", "expected_answer_b": "a = 6",
  "rationale": "Answer cites the variable name, so the rename is answer-determining.",
  "verification_method": "sympy",
  "verification_payload": {"tolerance": 0.001},
  "construction_method": "adversarial_perturbation",
  "source": "curator",
  "real_world_grounding": "Students reuse the same template with different letters."
}

| Label | Binary | Meaning |
|---|---|---|
| EQUIV | HIT | Same question, same answer expected. |
| PARA_SAFE | HIT | Paraphrase / surface variation; answer still equivalent. |
| RELATED_UNSAFE | MISS | Topically related but the answer differs materially. |
| ADVERSARIAL | MISS | High lexical/embedding similarity, different answer: the hard cases. |
| UNRELATED | MISS | Obviously different questions. |
A policy override can flip an otherwise-HIT pair to MISS:
if verification_method == "policy_no_cache":
    binary_label = "MISS"  # creative output, mutations, cross-user personalized
else:
    binary_label = "HIT" if label in {"EQUIV", "PARA_SAFE"} else "MISS"

Post-v1.1 class balance: 675 HIT (33.75%) / 1325 MISS (66.25%).
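The override rule can also be used as a consistency check over loaded rows. The sketch below assumes each row is a dict with the `id`, `label`, `binary_label`, and `verification_method` fields shown in the schema example. `expected_binary_label` and `inconsistent_rows` are hypothetical helper names, and the second toy row is invented purely to show a violation.

```python
from typing import Dict, Iterable, List

HIT_LABELS = {"EQUIV", "PARA_SAFE"}


def expected_binary_label(label: str, verification_method: str) -> str:
    """Apply the policy override first, then the 5-class to binary mapping."""
    if verification_method == "policy_no_cache":
        return "MISS"
    return "HIT" if label in HIT_LABELS else "MISS"


def inconsistent_rows(rows: Iterable[Dict[str, str]]) -> List[str]:
    """IDs of rows whose stored binary_label disagrees with the rule."""
    return [
        row["id"]
        for row in rows
        if row["binary_label"]
        != expected_binary_label(row["label"], row["verification_method"])
    ]


toy_rows = [
    {"id": "cb-math-0007", "label": "ADVERSARIAL", "binary_label": "MISS",
     "verification_method": "sympy"},
    # Invented row: an EQUIV pair under policy_no_cache must not be labeled HIT.
    {"id": "toy-0001", "label": "EQUIV", "binary_label": "HIT",
     "verification_method": "policy_no_cache"},
]
print(inconsistent_rows(toy_rows))
```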
| Domain ↓ \ Label → | EQUIV | PARA_SAFE | RELATED_UNSAFE | ADVERSARIAL | UNRELATED | Total |
|---|---|---|---|---|---|---|
| qa_factual | 50 | 50 | 50 | 40 | 40 | 230 |
| qa_open | 60 | 60 | 50 | 40 | 50 | 260 |
| math | 30 | 30 | 30 | 50 | 20 | 160 |
| code | 30 | 30 | 30 | 50 | 20 | 160 |
| conversational | 60 | 60 | 50 | 40 | 50 | 260 |
| tool | 60 | 60 | 60 | 50 | 50 | 280 |
| creative | 40 | 40 | 40 | 30 | 30 | 180 |
| personalized | 30 | 30 | 30 | 30 | 20 | 140 |
| multi_turn | 30 | 30 | 50 | 40 | 30 | 180 |
| multilingual | 30 | 20 | 30 | 40 | 30 | 150 |
| Total | 420 | 410 | 420 | 410 | 340 | 2,000 |
qa_factual + qa_open ship together in domains/qa.jsonl (490 rows).
Why 2,000? Statistical power. A 95% Wilson binomial CI on a 2k-pair test set has half-width ~2.2pp at p=0.5 and ~1.0pp at p=0.95 — enough to resolve the inter-system gaps reported in the literature (vCache 12.5×, MeanCache +17% F1, SAFE-CACHE 38pp ASR, LangCache +20% precision) at p<0.01. It matches PAWS-X per-language (2k), MT-Bench (2,400), and AmbigQA test (2,002).
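The half-width figures can be reproduced with the standard Wilson score interval. The snippet below is a minimal sketch using only the standard library. The function name is illustrative and z = 1.959964 is the usual two-sided 95% normal quantile.

```python
import math


def wilson_half_width(p_hat: float, n: int, z: float = 1.959964) -> float:
    """Half-width of the Wilson score interval for an observed proportion p_hat."""
    denom = 1.0 + z * z / n
    spread = math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n))
    return (z / denom) * spread


for p in (0.5, 0.95):
    print(f"p={p}: +/- {100 * wilson_half_width(p, 2000):.2f} pp")
```

With n = 2000 this gives roughly 2.2 pp at p = 0.5 and roughly 1.0 pp at p = 0.95, in line with the figures quoted above. Per-cell estimates on the 20 to 60 row cells of the sampling matrix are correspondingly much wider.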
Difficulty is the per-row k-of-5 score from an ensemble of five bi-encoders
(MiniLM-L6, mpnet-base, bge-small, e5-small, multi-qa-MiniLM):
- Easy (≥4/5 embedders correct) — mostly UNRELATED + obvious EQUIV
- Medium (2–3/5 correct) — most PARA_SAFE / RELATED_UNSAFE
- Hard (0–1/5 correct) — concentrated in ADVERSARIAL + tool + multi-turn
Agreement with a held-out model (gte-small): κ = 0.532 (was 0.055 in v1.0).
See audits/CALIBRATION_NOTES.md.
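The k-of-5 bucketing itself is a small mapping. The sketch below assumes each embedder's vote has already been reduced to a boolean (True when that embedder classified the pair correctly). How "correct" is decided per embedder is covered in the calibration notes and is not modeled here, and `difficulty_bucket` is a hypothetical name.

```python
from typing import Sequence


def difficulty_bucket(votes: Sequence[bool]) -> str:
    """Map five per-embedder correctness votes to easy / medium / hard."""
    if len(votes) != 5:
        raise ValueError("expected exactly five embedder votes")
    k = sum(bool(v) for v in votes)
    if k >= 4:
        return "easy"    # 4 or 5 embedders correct
    if k >= 2:
        return "medium"  # 2 or 3 embedders correct
    return "hard"        # 0 or 1 embedders correct


print(difficulty_bucket([True, False, False, False, False]))
```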
| Method | Count | Role |
|---|---|---|
| Real-traffic mined | 520 | Highest external validity (WildChat / Arena / ShareGPT / Claude Code) |
| Intent-bottom-up | 338 | Deterministic labels (MASSIVE / W5H2-style) |
| Adversarial perturbation | 473 | Only source of ADVERSARIAL (PAWS / MeTMaP / operator-swap) |
| LM-generated | 490 | Domain-balancing floor (LangCache-Embed style) |
| Existing benchmark | 179 | Backward-comparable subscores (QQP/PAWS/MRPC/AmbigQA subsets) |
- train + dev (public, labeled): 1,500
- test (public queries, hidden labels — leaderboard auto-grading): 500
- adversarial-evolving slice: 50/quarter, regenerated against current SOTA to keep the benchmark from saturating (Dynabench-style).
Headline (binary HIT/MISS classification):
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- F1
- False-Hit Rate (FHR) = FP / (FP + TN) — the safety-critical number
Stratified:
- Per-domain F1 + FHR (10 domains)
- Per-label FHR on the MISS classes: RELATED_UNSAFE, ADVERSARIAL (the hardest test), UNRELATED
- Per-difficulty FHR
Performance & cost:
- Latency p50 / p95 / p99 of the decision
- Cost per 1k decisions (includes any LLM-judge call)
- Verification-tier usage — how many decisions used each tier (exact / programmatic / rubric / judge), exposing a cache's judge dependency
The harness emits all of these plus a Wilson-CI'd accuracy via
report.to_markdown(). Full reporting template in SCHEMA.md.
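As a reference for the headline definitions, the sketch below computes precision, recall, F1, and FHR from (gold, predicted) label pairs. It is a standalone illustration, not the harness's implementation. `headline_metrics` is a hypothetical name, the five pairs are toy data, and returning 0.0 for an empty denominator is a convention chosen here.

```python
from typing import Dict, Iterable, Tuple


def headline_metrics(pairs: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """pairs: (gold, predicted) labels, each either "HIT" or "MISS"."""
    tp = fp = fn = tn = 0
    for gold, pred in pairs:
        if pred == "HIT":
            if gold == "HIT":
                tp += 1
            else:
                fp += 1  # false hit: a cached answer served when it should not be
        elif gold == "HIT":
            fn += 1
        else:
            tn += 1

    def ratio(num: float, den: float) -> float:
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "fhr": ratio(fp, fp + tn),
    }


toy = [("HIT", "HIT"), ("HIT", "MISS"), ("MISS", "HIT"), ("MISS", "MISS"), ("MISS", "MISS")]
print(headline_metrics(toy))
```

Note that FHR is computed over gold-MISS rows only, so it is independent of how many HIT rows the benchmark contains. Precision, by contrast, shifts with class balance.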
Each domain dictates a deterministic check before any LLM-judge is consulted
(see design/judge-minimization.md):
| Domain | Primary | Secondary | Last-resort |
|---|---|---|---|
| qa_factual | exact_match (alias set) | regex | LLM-judge (rare) |
| qa_open | rubric (3 atomic criteria) | – | LLM-judge (≤5%) |
| math | sympy equivalence + numeric tolerance | regex | LLM-judge |
| code | AST diff + sandboxed pytest | regex | LLM-judge |
| tool | json_schema + canonical-form match (W5H2) | – | – |
| conversational | rubric | – | LLM-judge (rare) |
| creative | policy_no_cache | – | – |
| personalized | policy_no_cache | – | – |
| multi_turn | exact_match + context-equiv rubric | – | LLM-judge |
| multilingual | exact_match on translated answer | LaBSE similarity | LLM-judge |
Expected tier distribution: ~70% deterministic, ~25% programmatic/rubric, ~5% LLM-judge.
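The deterministic-first idea amounts to a cascade: try the cheapest check, and escalate only when it cannot decide. The sketch below shows the control flow with three toy tiers. The tier functions are simplified stand-ins (the judge is a stub, not an LLM call), all names are hypothetical, and the 1e-3 tolerance mirrors the `tolerance` value in the schema example.

```python
from typing import Callable, Optional, Sequence, Tuple

# A checker returns True/False when it can decide, or None to defer to the next tier.
Checker = Callable[[str, str], Optional[bool]]


def exact_tier(expected: str, candidate: str) -> Optional[bool]:
    same = expected.strip().lower() == candidate.strip().lower()
    return True if same else None  # a string mismatch alone proves nothing


def numeric_tier(expected: str, candidate: str) -> Optional[bool]:
    try:
        return abs(float(expected) - float(candidate)) <= 1e-3
    except ValueError:
        return None  # not plain numbers, so defer


def judge_stub(expected: str, candidate: str) -> Optional[bool]:
    # Placeholder for an LLM-judge call; always answers "not equivalent" here.
    return False


TIERS: Sequence[Tuple[str, Checker]] = (
    ("exact", exact_tier),
    ("programmatic", numeric_tier),
    ("judge", judge_stub),
)


def verify(expected: str, candidate: str, tiers=TIERS) -> Tuple[Optional[bool], str]:
    """Return (verdict, tier_name) from the first tier that can decide."""
    for name, check in tiers:
        verdict = check(expected, candidate)
        if verdict is not None:
            return verdict, name
    return None, "unresolved"


print(verify("6", "6.0"))
print(verify("x = 6", "a = 6"))
```

Returning the tier name alongside the verdict is what makes a tier-usage report possible: counting the names shows how often the judge was actually needed.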
Real evaluation of GPTCache 0.1.44 on all 2,000 rows of v1.1 (Protocol A-prime,
per-pair isolated, faithfully reproducing GPTCache's
rank_threshold = (max−min) × similarity_threshold decision rule). Full detail in
reports/SUMMARY.md; per-row scores in reports/gptcache_*.csv.
| Config | F1 | FHR | ADVERSARIAL-FHR | Precision | Recall | p95 ms |
|---|---|---|---|---|---|---|
| minilm-l6 @ 0.80 (realistic default) | 0.571 | 0.682 | 0.963 | 0.411 | 0.933 | 44 |
| minilm-l6 @ 0.85 | 0.582 | 0.614 | 0.912 | 0.429 | 0.905 | 40 |
| mpnet-base @ 0.80 (better embedder) | 0.575 | 0.674 | 0.963 | 0.415 | 0.938 | 234 |
| minilm-l6 + ONNX cross-encoder | 0.508 | 0.280 | 0.405 | 0.490 | 0.527 | 159 |
| onnx-default (out-of-box) | 0.571 | 0.619 | 0.929 | 0.421 | 0.884 | 154 |
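The rank_threshold rule quoted above can be written out in a few lines. This is a sketch under assumptions rather than GPTCache's code: it assumes evaluator scores fall in a known [score_min, score_max] range and that a candidate counts as a hit when its score is at least the threshold. The 0-to-4 range in the example is chosen only for illustration.

```python
def rank_threshold(similarity_threshold: float, score_min: float, score_max: float) -> float:
    """Scale the user-facing threshold onto the evaluator's score range."""
    return (score_max - score_min) * similarity_threshold


def is_hit(score: float, similarity_threshold: float,
           score_min: float = 0.0, score_max: float = 1.0) -> bool:
    # Assumed comparison direction: higher score means more similar.
    return score >= rank_threshold(similarity_threshold, score_min, score_max)


# Example: an evaluator whose scores span 0 to 4, with similarity_threshold = 0.80.
print(rank_threshold(0.80, 0.0, 4.0))
print(is_hit(3.5, 0.80, 0.0, 4.0), is_hit(3.0, 0.80, 0.0, 4.0))
```

Whatever the range, the rule is a single global cutoff on one similarity score, which is why swapping the embedder changes the numbers in the table so little.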
Takeaways:
- GPTCache's realistic default has a 68% false-hit rate — and 96% on adversarial pairs.
- A better embedding model does not fix it (mpnet ≈ minilm). The bottleneck is the decision rule, not the embedder.
- Only the cross-encoder evaluator meaningfully helps (FHR 28%), but recall falls from 0.933 to 0.527 (about 41 points) and it costs 3.4× more.
- No similarity threshold escapes the trade-off: even at 0.95, FHR is still ~35% while recall collapses to ~0.66 (see the threshold sweep in reports/SUMMARY.md). This confirms vCache's thesis that static-threshold semantic caching cannot be made safe.
- Catastrophic domains: personalized FHR = 100% (cross-user leakage), tool 80%, math 79%, creative 62%.
CacheEval was built to drive an accuracy-first cache. On the same 2,000 rows, BudCache (MiniLM-L6 recall gate + cheap structured/lexical features + LightGBM verifier, reported on leakage-free 5-fold out-of-fold predictions at the FHR≤5% operating point) reaches:
| System | Accuracy | F1 | FHR | ADVERSARIAL-FHR | Precision | Recall |
|---|---|---|---|---|---|---|
| BudCache | 0.911 | 0.864 | 0.049 | 0.080 | 0.896 | 0.834 |
| GPTCache (minilm-l6 @ 0.80) | 0.526 | 0.571 | 0.682 | 0.963 | 0.411 | 0.933 |
A ~14× reduction in dangerous wrong-answer hits, without the recall collapse the
cross-encoder suffers. Full per-domain breakdown and live latency numbers in
reports/BUDCACHE_vs_GPTCACHE.md.
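The head-to-head numbers can be sanity-checked from the reported rates alone. The sketch below assumes the 675 HIT / 1325 MISS class balance applies to both runs. It rebuilds accuracy from recall and FHR and computes the FHR ratio. Because the inputs are rounded to three decimals, agreement is expected only to within about 0.001.

```python
HITS, MISSES = 675, 1325  # v1.1 class balance


def accuracy_from_rates(recall: float, fhr: float,
                        hits: int = HITS, misses: int = MISSES) -> float:
    """Accuracy implied by recall (on HIT rows) and false-hit rate (on MISS rows)."""
    true_positives = recall * hits
    true_negatives = (1.0 - fhr) * misses
    return (true_positives + true_negatives) / (hits + misses)


reported = {
    "BudCache": {"recall": 0.834, "fhr": 0.049, "accuracy": 0.911},
    "GPTCache (minilm-l6 @ 0.80)": {"recall": 0.933, "fhr": 0.682, "accuracy": 0.526},
}

for name, r in reported.items():
    implied = accuracy_from_rates(r["recall"], r["fhr"])
    print(f"{name}: implied accuracy {implied:.4f}, reported {r['accuracy']}")

fhr_ratio = reported["GPTCache (minilm-l6 @ 0.80)"]["fhr"] / reported["BudCache"]["fhr"]
print(f"FHR ratio: {fhr_ratio:.1f}x")
```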
CacheEval/
├── cachebench.jsonl ← the merged 2,000-row dataset (built by assemble.py)
├── SCHEMA.md ← row schema + sampling matrix + metric spec
├── CHANGELOG.md ← version history (v1.0 → v1.1)
├── LICENSE ← CC BY-SA 4.0 (dataset); scripts are MIT
├── ATTRIBUTION.md ← upstream datasets/papers + license propagation
├── requirements.txt ← layered, pinned deps (scoring needs none)
│
├── domains/ ← per-domain JSONL shards (assemble.py merges these)
│ ├── qa.jsonl (490) tool.jsonl (280) conversational.jsonl (260)
│ ├── multi_turn.jsonl (180) creative.jsonl (180) code.jsonl (160)
│ └── math.jsonl (160) multilingual.jsonl (150) personalized.jsonl (140)
│
├── scripts/
│ ├── eval_harness.py ← run any cache against the benchmark (stdlib only)
│ ├── assemble.py ← merge shards + validate schema / quotas / provenance
│ ├── gen_*.py, qa_build.py ← per-domain generators (reproducibility)
│ ├── gptcache_adapter.py ← GPTCache reference baseline adapter (Protocol A-prime)
│ ├── run_gptcache.py ← driver for the 5-config GPTCache sweep
│ ├── embed_rows.py, recalibrate_difficulty.py, repair_adversarial.py
│ └── improve_*.py, fix_*.py, stage1_fixes.py ← v1.1 remediation passes
│
├── design/ ← methodology notes that informed the benchmark
│ ├── pair-eval-methodology.md judge-minimization.md domain-semantics.md
│ ├── eval-best-practices.md discrimination-litreview.md
│ ├── traffic-distributions.md
│ └── scripts/ ← WildChat / Arena / ShareGPT traffic analysis
│
├── sources/ ← fixtures used by the generators
├── tests/ ← 21-test pytest suite (metrics, schema, roundtrip)
├── audits/ ← calibration trail (difficulty, adversarial sims, leakage)
├── AUDIT_*.md ← the four v1.0 brutal-review audits + master summary
├── baseline_*.md ← always_hit / always_miss / exact_match reports
└── reports/ ← GPTCache 5-config baseline + BudCache head-to-head
Re-building the dataset (scripts/gen_*.py, assemble.py) requires the upstream corpora (TriviaQA, QQP, PAWS-X, BFCL, xLAM, BANKING77, WildChat, …), which are not redistributed here. The generators currently hardcode local dataset paths; see AUDIT_code.md for the known reproducibility caveats and the proposed --datasets-dir / $CACHEBENCH_DATASETS fix. Scoring an existing cache against the shipped cachebench.jsonl needs none of this.
CacheEval went through four independent brutal-review audits (master synthesis in
AUDIT_SUMMARY.md), then a full 10-stage remediation pass
(recorded in CHANGELOG.md).
| Axis | v1.0 | v1.1 | What changed |
|---|---|---|---|
| Integrity (AUDIT_integrity.md) | A− | A | 736/736 tool schemas JSON-Schema-valid; sympy answers parseable; sentinels nulled |
| Code (AUDIT_code.md) | B− | A− | reproducibility fixed; construction-method audit added; 21-test pytest suite |
| Content (AUDIT_dataset_content.md) | C+ | A− | 174 case-flips fixed; monoculture rationales eliminated; 60 identities; honest source tags |
| Coverage (AUDIT_coverage.md) | D | B+ | difficulty empirically re-derived (κ 0.03 → 0.53); adversarial slice verified strong |
Deferred to v1.2: tighter construction-method targets, a per-generator
--datasets-dir CLI flag, and expanding n≤30 cells to n=40 for tighter per-cell CIs.
The dataset (cachebench.jsonl, domains/) is licensed under
CC BY-SA 4.0 — the share-alike clause is inherited from PAWS-X
(CC BY-SA 4.0). The generation/eval scripts under scripts/ are additionally
MIT-licensed (scripts/LICENSE-MIT).
See ATTRIBUTION.md for the full upstream source list.
Commercial use: ~80 rows derived from Quora QQP (restrictive ToS) and ~30 from xLAM-irrelevance (CC BY-NC 4.0) are tagged in each row's source field. Filter them out for commercial deployments. See ATTRIBUTION.md for the exact rows.
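Filtering on the source field takes a few lines of standard-library Python. The sketch below is illustrative: `commercial_safe_rows` is a hypothetical helper, and the tag strings "qqp" and "xlam_irrelevance" are placeholders. The actual tag values should be taken from ATTRIBUTION.md.

```python
import json
from typing import Dict, Iterable, Iterator, Set


def commercial_safe_rows(jsonl_lines: Iterable[str],
                         restricted_sources: Set[str]) -> Iterator[Dict]:
    """Yield parsed rows whose source tag is not in the restricted set."""
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        if row.get("source") not in restricted_sources:
            yield row


# Placeholder tags; check ATTRIBUTION.md for the real values.
RESTRICTED = {"qqp", "xlam_irrelevance"}

sample = [
    '{"id": "r1", "source": "curator"}',
    '{"id": "r2", "source": "qqp"}',
    '{"id": "r3", "source": "xlam_irrelevance"}',
]
print([row["id"] for row in commercial_safe_rows(sample, RESTRICTED)])
```

To apply it to the shipped file, pass an open file handle for cachebench.jsonl in place of `sample`. Scores computed on the filtered subset are not directly comparable to full-benchmark scores.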
@misc{cacheeval-v1,
  title = {CacheEval: A Pair-Equivalence Benchmark for LLM Prompt Caches},
  year  = {2026},
  note  = {A 2,000-row benchmark across 10 domains with 5-class equivalence labels.}
}

Methodology builds on PAWS (Zhang et al. 2019), MASSIVE (FitzGerald et al. 2023),
AmbigQA (Min et al. 2020), MT-Bench (Zheng et al. 2023), vCache (2502.03771),
W5H2 (2602.18922), MeanCache (2403.02694), SCALM, LangCache-Embed (2504.02268),
MeTMaP (2402.14480), SAFE-CACHE, ContextCache (2506.22791), Krites/Async Verified
(2602.13165), SemCacheOLAP (2602.19811), and GenCache (2511.17565). Full citations
in ATTRIBUTION.md.