BudEcosystem/CacheEval

CacheEval

A 2,000-row pair-equivalence benchmark for LLM request-level prompt caches.

CacheEval measures whether a cache makes the correct HIT-vs-MISS decision across 10 domains — not just whether it has a high hit rate. The safety-critical False-Hit Rate (FHR) is the headline number: how often the cache returns a wrong cached answer for a query that only looks similar to a previous one.

CacheEval is the evaluation / benchmarking suite for the BudCache project. It is self-contained: the dataset, the scoring harness, the reference baselines, and the methodology notes. It does not contain the BudCache cache implementation or its trained models — only the benchmark used to grade them.


Why CacheEval

The default "semantic cache" failure mode is to silently return wrong answers on lexically-similar-but-semantically-different queries. GPTCache-style caches score 60–70% precision on Quora paraphrases and collapse to ~30% on PAWS-style adversarial pairs. The danger isn't a low hit rate — it's a confident wrong hit.

No existing public benchmark exercises all the failure axes at once:

| Axis | Covered today by | CacheEval |
|---|---|---|
| Per-embedding threshold variance | vCache | yes |
| Tool-intent canonicalization | W5H2 | yes |
| Adversarial robustness | SAFE-CACHE | yes |
| Domain-specific contrastive pairs | LangCache-Embed | yes |
| Multi-turn context | ContextCache | yes |

CacheEval is the first benchmark that exercises all five axes in one suite, with a 5-class equivalence taxonomy and a deterministic-first verification protocol that keeps LLM-judge usage under ~5% of rows.


The core insight: domain-aware equivalence

The anchor example: in math, x + 2 = 8 and a + 2 = 8 are semantically equivalent, but a prompt cache must not collapse them — because the response will reference the variable name. This rule generalizes:

If the response will contain a token from the prompt, that token is answer-determining and must match exactly.

So the following are all MISS, even at ~1.0 embedding similarity:

  • Variable / identifier rename in math, attached code, or named narratives
  • Direction swap (NYC→FL vs FL→NYC)
  • Operator swap (+/, encode/decode, sort asc/desc)
  • Operand swap (25% vs 20%)
  • Cross-language ("sort a list" in Python vs JavaScript)
  • Cross-user (same query, different system_prompt / user identity)
  • Same final message, different prior turns (multi-turn context matters)
  • Mutations (delete_user, send_email) → policy: never serve from cache
  • Time-sensitive (weather, stock, news) → policy: short TTL or never cache

CacheEval dedicates ~410 ADVERSARIAL rows to these patterns specifically.
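
As a rough illustration of the rule above, the sketch below forces a MISS whenever the numeric literals or single-letter identifiers in two queries differ. The helper names (`answer_tokens`, `must_miss`) and both regexes are assumptions made for this example and are not part of CacheEval. The single-letter heuristic also picks up the English article "a", so treat it as a starting point rather than a complete guard.

```python
import re
from collections import Counter

# Hypothetical patterns: numeric literals (optionally with %) and
# standalone single letters, which often act as variable names.
NUMBER = re.compile(r"\d+(?:\.\d+)?%?")
SINGLE_LETTER = re.compile(r"\b[a-zA-Z]\b")


def answer_tokens(query: str):
    """Collect tokens that are likely to be echoed in the response."""
    numbers = Counter(NUMBER.findall(query))
    variables = Counter(m.lower() for m in SINGLE_LETTER.findall(query))
    return numbers, variables


def must_miss(query_a: str, query_b: str) -> bool:
    """True when answer-determining tokens differ, whatever the similarity."""
    return answer_tokens(query_a) != answer_tokens(query_b)


print(must_miss("Solve x + 2 = 8", "Solve a + 2 = 8"))         # variable rename
print(must_miss("Solve x + 2 = 8", "Please solve x + 2 = 8"))  # surface variation
```

A guard like this would sit in front of an embedding comparison: it can only veto a hit, never create one.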


Quickstart

CacheEval has no dependencies for scoring — the harness is pure stdlib. You only need extra packages to re-build the dataset or run the GPTCache reference baseline (see requirements.txt, which is layered).

1. Score a built-in baseline

python scripts/eval_harness.py --bench cachebench.jsonl --baseline always_miss  --out baseline_floor.md
python scripts/eval_harness.py --bench cachebench.jsonl --baseline always_hit   --out baseline_ceiling.md
python scripts/eval_harness.py --bench cachebench.jsonl --baseline exact_match   --out exact_match.md

always_miss is the FHR floor (0%), always_hit is the recall ceiling (and the worst-case FHR), and exact_match is the trivial verbatim-only cache. Pre-computed copies live in baseline_always_miss.md, baseline_always_hit.md, and baseline_exact_match.md.
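
To make the three baselines concrete, here is a minimal sketch of what each one decides per row. The `Row` and `Verdict` classes below are simplified stand-ins defined for this example (the real ones live in scripts/eval_harness.py and carry more fields), and the shipped `exact_match` baseline may normalize or compare more than the raw query strings.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """Stand-in for the harness Row; only the two queries are modeled."""
    query_a: str
    query_b: str


@dataclass
class Verdict:
    """Stand-in for the harness Verdict."""
    is_hit: bool
    confidence: float = 1.0
    tier_used: str = "exact"


def always_miss(row: Row) -> Verdict:
    # Never serves from cache: zero false hits, zero recall.
    return Verdict(is_hit=False)


def always_hit(row: Row) -> Verdict:
    # Always serves from cache: full recall, every MISS row is a false hit.
    return Verdict(is_hit=True)


def exact_match(row: Row) -> Verdict:
    # Serves from cache only when the two queries are identical strings.
    return Verdict(is_hit=row.query_a == row.query_b)


pair = Row("Solve x + 2 = 8", "Solve a + 2 = 8")
print(always_miss(pair).is_hit, always_hit(pair).is_hit, exact_match(pair).is_hit)
```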

2. Score your own cache

Implement a callable Row -> Verdict and hand it to evaluate:

from difflib import SequenceMatcher

from scripts.eval_harness import evaluate, load_bench, Verdict, Row  # run from repo root

def my_similarity(a: str, b: str) -> float:
    # Placeholder scorer so the example runs; swap in your own embedding similarity.
    return SequenceMatcher(None, a, b).ratio()

def my_cache(row: Row) -> Verdict:
    # Decide whether the cache would serve query_a's answer for query_b.
    is_hit = my_similarity(row.query_a, row.query_b) > 0.85
    return Verdict(is_hit=is_hit, confidence=0.9, tier_used="embedding")

rows   = load_bench("cachebench.jsonl")
report = evaluate(my_cache, rows, cache_name="my_cache")
print(report.to_markdown())

Or point the CLI at a module that exposes a cache callable (or a class):

python scripts/eval_harness.py --bench cachebench.jsonl \
    --cache-module my_pkg.my_cache --cache-class MyCache \
    --cache-name "my_cache v2" --out my_report.md

The harness wraps each call in try/except (--on-error miss by default), so a cache that raises is scored as a MISS and the error surfaces in the tier report rather than aborting the run.
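
The error handling described above can be pictured roughly as follows. This is a sketch under assumptions, not the harness's actual code: the function name `call_with_fallback`, the dict-shaped verdict, and the `"error"` tier label are all invented for illustration.

```python
def call_with_fallback(cache, row, on_error="miss"):
    """Call the cache; on failure either score a MISS or re-raise.

    Returns (verdict, error_text). error_text is None on success.
    """
    try:
        return cache(row), None
    except Exception as exc:
        if on_error != "miss":
            raise
        # Assumed fallback shape: a MISS tagged with a dedicated tier
        # so failures remain visible in the tier breakdown.
        fallback = {"is_hit": False, "confidence": 0.0, "tier_used": "error"}
        return fallback, repr(exc)


def broken_cache(row):
    raise ValueError("embedding service unavailable")


verdict, error = call_with_fallback(broken_cache, {"query_a": "a", "query_b": "b"})
print(verdict["is_hit"], error)
```

Scoring a failure as a MISS is the conservative choice: it costs recall but can never add a false hit.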

3. Run the tests

pip install pytest
pytest tests/    # 21 tests: metric algebra, schema/quota, policy collapse, roundtrip

The test_roundtrip.py check enforces that the committed cachebench.jsonl is bit-identical to a fresh assemble.py rebuild.
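
One simple way to check bit-identity is to compare file digests. The sketch below only illustrates the idea; how test_roundtrip.py actually performs the comparison is not shown here, and the helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path) -> str:
    """Hex digest of the file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def is_bit_identical(committed, rebuilt) -> bool:
    """True only if both files contain exactly the same bytes."""
    return sha256_of(committed) == sha256_of(rebuilt)
```

A byte-level comparison catches differences that a parsed-JSON comparison would hide, such as key order or trailing newlines.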


What's in a row

Each row is a (query_a, query_b, label) triple with metadata. Full spec in SCHEMA.md.

{
  "id": "cb-math-0007",
  "domain": "math",
  "subcategory": "variable_rename",
  "label": "ADVERSARIAL",
  "binary_label": "MISS",
  "difficulty": "hard",
  "query_a": "Solve x + 2 = 8",
  "query_b": "Solve a + 2 = 8",
  "context_a": null, "context_b": null,
  "system_prompt_a": null, "system_prompt_b": null,
  "tools_a": null, "tools_b": null,
  "expected_answer_a": "x = 6", "expected_answer_b": "a = 6",
  "rationale": "Answer cites the variable name, so the rename is answer-determining.",
  "verification_method": "sympy",
  "verification_payload": {"tolerance": 0.001},
  "construction_method": "adversarial_perturbation",
  "source": "curator",
  "real_world_grounding": "Students reuse the same template with different letters."
}
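
If you want to read rows without the harness, plain stdlib JSON parsing is enough. The sketch below is an assumption-laden example: `parse_rows` is a hypothetical helper, and the `REQUIRED` tuple is a small subset chosen for illustration (SCHEMA.md holds the authoritative field list).

```python
import json

# Illustrative subset of fields; not the full schema.
REQUIRED = ("id", "domain", "label", "binary_label", "query_a", "query_b")


def parse_rows(jsonl_text: str):
    """Parse JSONL text into dicts, failing loudly on missing fields."""
    rows = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        missing = [key for key in REQUIRED if key not in row]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        rows.append(row)
    return rows


sample = (
    '{"id": "cb-math-0007", "domain": "math", "label": "ADVERSARIAL", '
    '"binary_label": "MISS", "query_a": "Solve x + 2 = 8", '
    '"query_b": "Solve a + 2 = 8"}\n'
)
rows = parse_rows(sample)
print(rows[0]["id"], rows[0]["binary_label"])
```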

The 5-class label (with binary collapse)

| Label | Binary | Meaning |
|---|---|---|
| EQUIV | HIT | Same question, same answer expected. |
| PARA_SAFE | HIT | Paraphrase / surface variation; answer still equivalent. |
| RELATED_UNSAFE | MISS | Topically related but the answer differs materially. |
| ADVERSARIAL | MISS | High lexical/embedding similarity, different answer: the hard cases. |
| UNRELATED | MISS | Obviously different questions. |

A policy override can flip an otherwise-HIT pair to MISS:

def binary_label(label, verification_method):
    # Policy override: creative output, mutations, cross-user personalized.
    if verification_method == "policy_no_cache":
        return "MISS"
    # Otherwise collapse the 5-class label to HIT/MISS.
    return "HIT" if label in {"EQUIV", "PARA_SAFE"} else "MISS"

Post-v1.1 class balance: 675 HIT (33.75%) / 1325 MISS (66.25%).


Sampling matrix (2,000 rows)

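| Domain ↓ \ Label → | EQUIV | PARA_SAFE | RELATED_UNSAFE | ADVERSARIAL | UNRELATED | Total |
|---|---|---|---|---|---|---|
| qa_factual | 50 | 50 | 50 | 40 | 40 | 230 |
| qa_open | 60 | 60 | 50 | 40 | 50 | 260 |
| math | 30 | 30 | 30 | 50 | 20 | 160 |
| code | 30 | 30 | 30 | 50 | 20 | 160 |
| conversational | 60 | 60 | 50 | 40 | 50 | 260 |
| tool | 60 | 60 | 60 | 50 | 50 | 280 |
| creative | 40 | 40 | 40 | 30 | 30 | 180 |
| personalized | 30 | 30 | 30 | 30 | 20 | 140 |
| multi_turn | 30 | 30 | 50 | 40 | 30 | 180 |
| multilingual | 30 | 20 | 30 | 40 | 30 | 150 |
| Total | 420 | 410 | 420 | 410 | 340 | 2,000 |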

qa_factual + qa_open ship together in domains/qa.jsonl (490 rows).

Why 2,000? Statistical power. A 95% Wilson binomial CI on a 2k-pair test set has half-width ~2.2pp at p=0.5 and ~1.0pp at p=0.95, enough to resolve the inter-system gaps reported in the literature (vCache 12.5×, MeanCache +17% F1, SAFE-CACHE 38pp ASR, LangCache +20% precision) at p<0.01. The size is also in line with PAWS-X per-language (2k), MT-Bench (2,400), and AmbigQA test (2,002).
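
The half-widths quoted above can be reproduced with the standard Wilson score interval. The sketch below uses only the stdlib; the function names are chosen for this example, and z = 1.96 is the usual two-sided 95% value.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion."""
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def half_width_pp(p_hat: float, n: int) -> float:
    """Half-width of the interval in percentage points."""
    low, high = wilson_interval(p_hat, n)
    return 100.0 * (high - low) / 2.0


for p in (0.5, 0.95):
    print(f"p={p}: +/- {half_width_pp(p, 2000):.2f} pp")
```

Run with n = 500 instead of 2,000 to see how much wider the interval gets on a smaller slice.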

Difficulty (empirically re-derived in v1.1)

Difficulty is the per-row k-of-5 score from an ensemble of five bi-encoders (MiniLM-L6, mpnet-base, bge-small, e5-small, multi-qa-MiniLM):

  • Easy (≥4/5 embedders correct) — mostly UNRELATED + obvious EQUIV
  • Medium (2–3/5 correct) — most PARA_SAFE / RELATED_UNSAFE
  • Hard (0–1/5 correct) — concentrated in ADVERSARIAL + tool + multi-turn
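
The bucketing above amounts to a small threshold function over the number of embedders that got the row right. A minimal sketch, with a function name chosen for this example:

```python
def difficulty_bucket(n_correct: int) -> str:
    """Map a k-of-5 ensemble score to easy / medium / hard."""
    if not 0 <= n_correct <= 5:
        raise ValueError("n_correct must be between 0 and 5")
    if n_correct >= 4:
        return "easy"
    if n_correct >= 2:
        return "medium"
    return "hard"


print([difficulty_bucket(k) for k in range(6)])
```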

Agreement with a held-out model (gte-small): κ = 0.532 (was 0.055 in v1.0). See audits/CALIBRATION_NOTES.md.
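
For readers unfamiliar with the statistic, Cohen's κ compares observed agreement between two labelers against the agreement expected by chance. The sketch below is a generic stdlib implementation with made-up labels; it is not the calibration script, and the exact procedure used for the reported figure is documented in audits/CALIBRATION_NOTES.md.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of marginal frequencies, summed over classes.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical class
    return (observed - expected) / (1.0 - expected)


# Toy labels, invented for illustration only.
ensemble = ["easy", "easy", "medium", "hard"]
held_out = ["easy", "medium", "medium", "hard"]
print(round(cohens_kappa(ensemble, held_out), 3))
```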

Construction methods

| Method | Count | Role |
|---|---|---|
| Real-traffic mined | 520 | Highest external validity (WildChat / Arena / ShareGPT / Claude Code) |
| Intent-bottom-up | 338 | Deterministic labels (MASSIVE / W5H2-style) |
| Adversarial perturbation | 473 | Only source of ADVERSARIAL (PAWS / MeTMaP / operator-swap) |
| LM-generated | 490 | Domain-balancing floor (LangCache-Embed style) |
| Existing benchmark | 179 | Backward-comparable subscores (QQP/PAWS/MRPC/AmbigQA subsets) |

Splits

  • train + dev (public, labeled): 1,500
  • test (public queries, hidden labels — leaderboard auto-grading): 500
  • adversarial-evolving slice: 50/quarter, regenerated against current SOTA to keep the benchmark from saturating (Dynabench-style).

Metrics

Headline (binary HIT/MISS classification):

  • Precision = TP / (TP + FP)
  • Recall = TP / (TP + FN)
  • F1
  • False-Hit Rate (FHR) = FP / (FP + TN) — the safety-critical number
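
The headline formulas translate directly into code. The sketch below is a stand-alone illustration rather than the harness implementation; the two example calls use the 675 HIT / 1,325 MISS class balance stated earlier to show what the always_hit and always_miss baselines work out to.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall, F1 and False-Hit Rate from confusion counts."""
    def ratio(num, den):
        return num / den if den else 0.0  # define 0/0 as 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    fhr = ratio(fp, fp + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "fhr": fhr}


always_hit = binary_metrics(tp=675, fp=1325, fn=0, tn=0)
always_miss = binary_metrics(tp=0, fp=0, fn=675, tn=1325)
print(always_hit)
print(always_miss)
```

Note that FHR has the MISS rows in its denominator, so it is independent of how many HIT rows the benchmark contains.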

Stratified:

  • Per-domain F1 + FHR (10 domains)
  • Per-label FHR on the MISS classes: RELATED_UNSAFE, ADVERSARIAL (the hardest test), UNRELATED
  • Per-difficulty FHR
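
Stratified FHR is the same ratio computed within each group of MISS rows. The sketch below assumes rows are plain dicts carrying `binary_label` plus the grouping field, and that predictions are booleans (True = cache says HIT); `fhr_by_group` is a name made up for this example.

```python
from collections import defaultdict


def fhr_by_group(rows, predictions, key):
    """False-Hit Rate per value of row[key], over ground-truth MISS rows."""
    false_hits = defaultdict(int)
    negatives = defaultdict(int)
    for row, predicted_hit in zip(rows, predictions):
        if row["binary_label"] != "MISS":
            continue  # FHR is defined over true MISS rows only
        negatives[row[key]] += 1
        if predicted_hit:
            false_hits[row[key]] += 1
    return {group: false_hits[group] / n for group, n in negatives.items()}


# Toy rows, invented for illustration.
toy_rows = [
    {"domain": "math", "label": "ADVERSARIAL", "binary_label": "MISS"},
    {"domain": "math", "label": "UNRELATED", "binary_label": "MISS"},
    {"domain": "tool", "label": "ADVERSARIAL", "binary_label": "MISS"},
    {"domain": "tool", "label": "EQUIV", "binary_label": "HIT"},
]
toy_preds = [True, False, True, True]
print(fhr_by_group(toy_rows, toy_preds, "domain"))
print(fhr_by_group(toy_rows, toy_preds, "label"))
```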

Performance & cost:

  • Latency p50 / p95 / p99 of the decision
  • Cost per 1k decisions (includes any LLM-judge call)
  • Verification-tier usage — how many decisions used each tier (exact / programmatic / rubric / judge), exposing a cache's judge dependency
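
Latency percentiles and tier usage can both be derived from per-decision logs. The sketch below uses the nearest-rank percentile definition, which is one of several common conventions and may not be the one the harness uses; the function names and sample data are illustrative.

```python
from collections import Counter


def percentile(values, pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    # Integer ceiling of pct/100 * n avoids floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def tier_usage(tiers):
    """Share of decisions made by each verification tier."""
    counts = Counter(tiers)
    total = sum(counts.values())
    return {tier: count / total for tier, count in counts.items()}


latencies_ms = list(range(1, 101))  # synthetic 1..100 ms
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95), percentile(latencies_ms, 99))
print(tier_usage(["exact", "exact", "rubric", "judge"]))
```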

The harness emits all of these plus a Wilson-CI'd accuracy via report.to_markdown(). Full reporting template in SCHEMA.md.


Verification routing (LLM-judge minimization)

Each domain dictates a deterministic check before any LLM-judge is consulted (see design/judge-minimization.md):

| Domain | Primary | Secondary | Last-resort |
|---|---|---|---|
| qa_factual | exact_match (alias set) | regex | LLM-judge (rare) |
| qa_open | rubric (3 atomic criteria) | | LLM-judge (≤5%) |
| math | sympy equivalence + numeric tolerance | regex | LLM-judge |
| code | AST diff + sandboxed pytest | regex | LLM-judge |
| tool | json_schema + canonical-form match (W5H2) | | |
| conversational | rubric | | LLM-judge (rare) |
| creative | policy_no_cache | | |
| personalized | policy_no_cache | | |
| multi_turn | exact_match + context-equiv | rubric | LLM-judge |
| multilingual | exact_match on translated answer | LaBSE similarity | LLM-judge |

Expected tier distribution: ~70% deterministic, ~25% programmatic/rubric, ~5% LLM-judge.
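
The deterministic-first idea can be sketched as an ordered fallthrough: try the cheapest check, and move to the next tier only when the current one is inconclusive. Everything below is an assumption for illustration: the `ROUTES` mapping covers only a few domains with simplified tier names, and the checker callables (returning True, False, or None for "inconclusive") are hypothetical. The actual routing is described in design/judge-minimization.md.

```python
# Simplified, partial routing table; tier names are shorthand for this example.
ROUTES = {
    "qa_factual": ["exact_match", "regex", "llm_judge"],
    "math": ["sympy", "regex", "llm_judge"],
    "code": ["ast_diff_pytest", "regex", "llm_judge"],
    "creative": ["policy_no_cache"],
    "personalized": ["policy_no_cache"],
}


def route(domain, checkers):
    """Return (tier_used, equivalent) from the first decisive tier.

    checkers maps tier name -> zero-argument callable returning
    True / False, or None when that tier cannot decide.
    """
    for tier in ROUTES[domain]:
        if tier == "policy_no_cache":
            return tier, False  # policy: never treat as cacheable
        outcome = checkers[tier]()
        if outcome is not None:
            return tier, outcome
    return "unresolved", None


checks = {
    "sympy": lambda: None,     # e.g. expression could not be parsed
    "regex": lambda: True,     # regex fallback found matching answers
    "llm_judge": lambda: True, # would only run if earlier tiers abstain
}
print(route("math", checks))
print(route("creative", {}))
```

Because the judge sits last in every list, its share of decisions is bounded by how often the deterministic tiers abstain.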


Reference results

GPTCache baseline — the bar to beat

Real evaluation of GPTCache 0.1.44 on all 2,000 rows of v1.1 (Protocol A-prime, per-pair isolated, faithfully reproducing GPTCache's rank_threshold = (max−min) × similarity_threshold decision rule). Full detail in reports/SUMMARY.md; per-row scores in reports/gptcache_*.csv.
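
The decision rule quoted above can be written out as a few lines of Python. This is a sketch of the formula as stated in this README, not a copy of GPTCache or of gptcache_adapter.py; the clamping step and the `rank >= threshold` comparison are assumptions about how the threshold is applied, and the function name is invented.

```python
def gptcache_style_hit(rank: float, min_rank: float, max_rank: float,
                       similarity_threshold: float) -> bool:
    """Apply rank_threshold = (max - min) * similarity_threshold."""
    rank_threshold = (max_rank - min_rank) * similarity_threshold
    # Assumed: keep the threshold inside the evaluator's score range.
    rank_threshold = min(max(rank_threshold, min_rank), max_rank)
    return rank >= rank_threshold


# With a 0..1 score range the rule reduces to a plain cutoff.
print(gptcache_style_hit(0.85, 0.0, 1.0, 0.80))
print(gptcache_style_hit(0.79, 0.0, 1.0, 0.80))
```

The rule is a single static cutoff on one similarity score, which is why swapping the embedder alone changes little in the table below.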

| Config | F1 | FHR | ADVERSARIAL-FHR | Precision | Recall | p95 ms |
|---|---|---|---|---|---|---|
| minilm-l6 @ 0.80 (realistic default) | 0.571 | 0.682 | 0.963 | 0.411 | 0.933 | 44 |
| minilm-l6 @ 0.85 | 0.582 | 0.614 | 0.912 | 0.429 | 0.905 | 40 |
| mpnet-base @ 0.80 (better embedder) | 0.575 | 0.674 | 0.963 | 0.415 | 0.938 | 234 |
| minilm-l6 + ONNX cross-encoder | 0.508 | 0.280 | 0.405 | 0.490 | 0.527 | 159 |
| onnx-default (out-of-box) | 0.571 | 0.619 | 0.929 | 0.421 | 0.884 | 154 |

Takeaways:

  1. GPTCache's realistic default has a 68% false-hit rate — and 96% on adversarial pairs.
  2. A better embedding model does not fix it (mpnet ≈ minilm). The bottleneck is the decision rule, not the embedder.
  3. Only the cross-encoder evaluator meaningfully helps (FHR 28%), but it gives up about 40 points of recall (0.933 to 0.527) and costs 3.4× more.
  4. No similarity threshold escapes the trade-off: even at 0.95, FHR is still ~35% while recall collapses to ~0.66 (see the threshold sweep in reports/SUMMARY.md). This confirms vCache's thesis that static-threshold semantic caching cannot be made safe.
  5. Catastrophic domains: personalized FHR = 100% (cross-user leakage), tool 80%, math 79%, creative 62%.

BudCache head-to-head

CacheEval was built to drive an accuracy-first cache. On the same 2,000 rows, BudCache (MiniLM-L6 recall gate + cheap structured/lexical features + LightGBM verifier, reported on leakage-free 5-fold out-of-fold predictions at the FHR≤5% operating point) reaches:

| System | Accuracy | F1 | FHR | ADVERSARIAL-FHR | Precision | Recall |
|---|---|---|---|---|---|---|
| BudCache | 0.911 | 0.864 | 0.049 | 0.080 | 0.896 | 0.834 |
| GPTCache (minilm-l6 @ 0.80) | 0.526 | 0.571 | 0.682 | 0.963 | 0.411 | 0.933 |

A ~14× reduction in dangerous wrong-answer hits, without the recall collapse the cross-encoder suffers. Full per-domain breakdown and live latency numbers in reports/BUDCACHE_vs_GPTCACHE.md.


Repository layout

CacheEval/
├── cachebench.jsonl          ← the merged 2,000-row dataset (built by assemble.py)
├── SCHEMA.md                 ← row schema + sampling matrix + metric spec
├── CHANGELOG.md              ← version history (v1.0 → v1.1)
├── LICENSE                   ← CC BY-SA 4.0 (dataset); scripts are MIT
├── ATTRIBUTION.md            ← upstream datasets/papers + license propagation
├── requirements.txt          ← layered, pinned deps (scoring needs none)
│
├── domains/                  ← per-domain JSONL shards (assemble.py merges these)
│   ├── qa.jsonl (490)  tool.jsonl (280)  conversational.jsonl (260)
│   ├── multi_turn.jsonl (180)  creative.jsonl (180)  code.jsonl (160)
│   └── math.jsonl (160)  multilingual.jsonl (150)  personalized.jsonl (140)
│
├── scripts/
│   ├── eval_harness.py       ← run any cache against the benchmark (stdlib only)
│   ├── assemble.py           ← merge shards + validate schema / quotas / provenance
│   ├── gen_*.py, qa_build.py ← per-domain generators (reproducibility)
│   ├── gptcache_adapter.py   ← GPTCache reference baseline adapter (Protocol A-prime)
│   ├── run_gptcache.py       ← driver for the 5-config GPTCache sweep
│   ├── embed_rows.py, recalibrate_difficulty.py, repair_adversarial.py
│   └── improve_*.py, fix_*.py, stage1_fixes.py  ← v1.1 remediation passes
│
├── design/                   ← methodology notes that informed the benchmark
│   ├── pair-eval-methodology.md   judge-minimization.md   domain-semantics.md
│   ├── eval-best-practices.md     discrimination-litreview.md
│   ├── traffic-distributions.md
│   └── scripts/              ← WildChat / Arena / ShareGPT traffic analysis
│
├── sources/                  ← fixtures used by the generators
├── tests/                    ← 21-test pytest suite (metrics, schema, roundtrip)
├── audits/                   ← calibration trail (difficulty, adversarial sims, leakage)
├── AUDIT_*.md                ← the four v1.0 brutal-review audits + master summary
├── baseline_*.md             ← always_hit / always_miss / exact_match reports
└── reports/                  ← GPTCache 5-config baseline + BudCache head-to-head

Re-building the dataset (scripts/gen_*.py, assemble.py) requires the upstream corpora (TriviaQA, QQP, PAWS-X, BFCL, xLAM, BANKING77, WildChat, …), which are not redistributed here. The generators currently hardcode local dataset paths — see AUDIT_code.md for the known reproducibility caveats and the proposed --datasets-dir / $CACHEBENCH_DATASETS fix. Scoring an existing cache against the shipped cachebench.jsonl needs none of this.


Audit status (v1.1, post-remediation)

CacheEval went through four independent brutal-review audits (master synthesis in AUDIT_SUMMARY.md), then a full 10-stage remediation pass (recorded in CHANGELOG.md).

| Axis | v1.0 | v1.1 | What changed |
|---|---|---|---|
| Integrity (AUDIT_integrity.md) | A− | A | 736/736 tool schemas JSON-Schema-valid; sympy answers parseable; sentinels nulled |
| Code (AUDIT_code.md) | B− | A− | reproducibility fixed; construction-method audit added; 21-test pytest suite |
| Content (AUDIT_dataset_content.md) | C+ | A− | 174 case-flips fixed; monoculture rationales eliminated; 60 identities; honest source tags |
| Coverage (AUDIT_coverage.md) | D | B+ | difficulty empirically re-derived (κ 0.03 → 0.53); adversarial slice verified strong |

Deferred to v1.2: tighter construction-method targets, a per-generator --datasets-dir CLI flag, and expanding n≤30 cells to n=40 for tighter per-cell CIs.


License

The dataset (cachebench.jsonl, domains/) is licensed under CC BY-SA 4.0 — the share-alike clause is inherited from PAWS-X (CC BY-SA 4.0). The generation/eval scripts under scripts/ are additionally MIT-licensed (scripts/LICENSE-MIT).

See ATTRIBUTION.md for the full upstream source list.

Commercial use: ~80 rows derived from Quora QQP (restrictive ToS) and ~30 from xLAM-irrelevance (CC BY-NC 4.0) are tagged in each row's source field. Filter them out for commercial deployments. See ATTRIBUTION.md for the exact rows.
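
Filtering on the `source` field could look like the sketch below. The marker strings are hypothetical placeholders: this README does not state the exact tag values, so check ATTRIBUTION.md and replace `RESTRICTED_MARKERS` with the real ones before relying on it.

```python
# Hypothetical substrings; the actual source tags are listed in ATTRIBUTION.md.
RESTRICTED_MARKERS = ("qqp", "xlam")


def commercially_usable(rows, markers=RESTRICTED_MARKERS):
    """Drop rows whose source tag contains any restricted marker."""
    kept = []
    for row in rows:
        source = str(row.get("source") or "").lower()
        if not any(marker in source for marker in markers):
            kept.append(row)
    return kept


# Toy rows with invented source values, for illustration only.
toy = [
    {"id": "r1", "source": "curator"},
    {"id": "r2", "source": "QQP"},
    {"id": "r3", "source": "xlam-irrelevance"},
]
print([row["id"] for row in commercially_usable(toy)])
```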


Citation

@misc{cacheeval-v1,
  title  = {CacheEval: A Pair-Equivalence Benchmark for LLM Prompt Caches},
  year   = {2026},
  note   = {A 2,000-row benchmark across 10 domains with 5-class equivalence labels.}
}

Methodology builds on PAWS (Zhang et al. 2019), MASSIVE (FitzGerald et al. 2023), AmbigQA (Min et al. 2020), MT-Bench (Zheng et al. 2023), vCache (2502.03771), W5H2 (2602.18922), MeanCache (2403.02694), SCALM, LangCache-Embed (2504.02268), MeTMaP (2402.14480), SAFE-CACHE, ContextCache (2506.22791), Krites/Async Verified (2602.13165), SemCacheOLAP (2602.19811), and GenCache (2511.17565). Full citations in ATTRIBUTION.md.

About

CacheEval is a prompt-level caching accuracy evaluation suite for checking the accuracy and performance of prompt caching systems such as GPTCache.
