Kakeya KV cache: bug fix, Gemma 4 benchmark standard, cross-model reports (4 models) by FluffyAIcode · Pull Request #1 · FluffyAIcode/LLM-KV--Cache-compress

FluffyAIcode · 2026-04-17T02:57:06Z

Summary

This PR contains the full body of work on the Kakeya KV cache compression layer:

- Critical correctness fix in KakeyaKVCache for a bug that made it a no-op on Gemma 4's default config.
- Generalized cache + factory (build_kakeya_cache(model)) that works for any HF transformers decoder-only model: Gemma 2/3/4, Llama, Mistral, Qwen 2/3, SmolLM2, Cohere2, and similar.
- Benchmark harness (kakeya_benchmark.py) with chunked prefill, analytic baseline, and per-layer JSON reports.
- Byte-exact extrapolator (kakeya_extrapolate.py) that projects any report to arbitrary context lengths (cross-validated to ≤ 0.003 absolute ratio error).
- Benchmark standard + Gemma 4 reference run (reports/STANDARD.md): measured 2k–32k plus projected 64k–256k.
- Cross-model reports for four open-source models (reports/{gemma4_e2b, qwen2_5_0_5b, smollm2_1_7b, qwen3_0_6b}/), each with a narrative REPORT.md and all raw JSON.
- Cross-model comparison (reports/CROSS_MODEL.md) and top-level README.md.

Headline numbers

Under the standard codec preset (block_size=512, residual_length=256, d_res=8, K=16, variance_ratio=0.95), at a 131 072-token context with the bf16 store projection:

Model	Baseline KV	Kakeya KV	Total ratio	Absolute saved
`Qwen/Qwen3-0.6B`	14.00 GiB	3.10 GiB	4.51×	10.90 GiB
`google/gemma-4-E2B-it`	774 MiB	180 MiB	4.29×	594 MiB
`HuggingFaceTB/SmolLM2-1.7B-Instruct`	24.00 GiB	10.65 GiB	2.25×	13.35 GiB
`Qwen/Qwen2.5-0.5B-Instruct`	1.50 GiB	714 MiB	2.15×	786 MiB

All four models use the same build_kakeya_cache(model) factory and the same run_all_benchmarks.sh orchestrator. No model-specific code.

What changed in the codec

Bug fix

Gemma4TextConfig.num_kv_shared_layers defaults to 0, which made the old code evaluate layer_types[:-0] → empty list → cache with zero layers. The slice is now guarded so it only runs when num_kv_shared_layers > 0.
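The slicing pitfall can be reproduced in isolation. The snippet below is a minimal sketch, not the code from kakeya_kv_codec.py: the helper names strip_shared_unguarded and strip_shared_guarded and the three-element layer_types list are made up for illustration.

```python
# Hypothetical layer plan; the labels follow the usual HF naming.
layer_types = ["sliding_attention", "sliding_attention", "full_attention"]


def strip_shared_unguarded(layer_types, num_kv_shared_layers):
    # Buggy pattern: -0 == 0, so [:-0] is [:0], which is always empty.
    return layer_types[:-num_kv_shared_layers]


def strip_shared_guarded(layer_types, num_kv_shared_layers):
    # Only slice when at least one trailing layer is actually shared.
    if num_kv_shared_layers > 0:
        return layer_types[:-num_kv_shared_layers]
    return list(layer_types)


print(strip_shared_unguarded(layer_types, 0))  # [] (every layer dropped)
print(strip_shared_guarded(layer_types, 0))    # all three layers kept
print(strip_shared_guarded(layer_types, 1))    # last layer stripped
```

With num_kv_shared_layers=0 the unguarded version silently returns an empty plan, which matches the zero-layer cache described above.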

Model-agnostic factory

Layer-plan inference moved to _resolve_layer_plan(config) and handles all the standard HF patterns:

Config pattern	Handling
`layer_types` exists (Gemma 2/3/4, Cohere2, Qwen3 hybrid)	Per-layer dispatch
`num_kv_shared_layers > 0` (Gemma 4)	Strip trailing shared layers
No `layer_types`, `sliding_window=None` (Llama, SmolLM2, Qwen2, Qwen3-dense)	Every layer is full attention
`sliding_window` only / `attention_chunk_size` (Llama 4)	Every layer is sliding
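
The dispatch in the table can be sketched roughly as follows. This is an illustrative approximation under stated assumptions, not the actual _resolve_layer_plan: the function name resolve_layer_plan_sketch, the "full_attention" / "sliding_attention" labels, and the order in which the checks run are all assumptions, and the configs are stand-in SimpleNamespace objects rather than real HF config classes.

```python
from types import SimpleNamespace


def resolve_layer_plan_sketch(config):
    """Return one attention-type label per layer that needs its own KV cache."""
    num_layers = config.num_hidden_layers
    layer_types = getattr(config, "layer_types", None)

    if layer_types is not None:
        # Per-layer dispatch: trust the config's explicit list.
        plan = list(layer_types)
    elif (getattr(config, "sliding_window", None) is not None
          or getattr(config, "attention_chunk_size", None) is not None):
        # Only a window / chunk size is given: treat every layer as sliding.
        plan = ["sliding_attention"] * num_layers
    else:
        # No layer_types and no window: every layer is full attention.
        plan = ["full_attention"] * num_layers

    # Strip trailing KV-shared layers, guarded against the [:-0] pitfall.
    shared = getattr(config, "num_kv_shared_layers", 0) or 0
    if shared > 0:
        plan = plan[:-shared]
    return plan


llama_like = SimpleNamespace(num_hidden_layers=4, sliding_window=None)
gemma_like = SimpleNamespace(
    num_hidden_layers=4,
    layer_types=["sliding_attention", "sliding_attention",
                 "sliding_attention", "full_attention"],
    num_kv_shared_layers=0,
)
print(resolve_layer_plan_sketch(llama_like))
print(resolve_layer_plan_sketch(gemma_like))
```

The gemma_like case with num_kv_shared_layers=0 is the configuration that used to produce an empty plan before the guard was added.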

What the benchmark does

kakeya_benchmark.py loads the target model, runs a long-prompt prefill through both DynamicCache (baseline) and KakeyaKVCache, and reports byte counts + ratios per layer and per model section (full-attn vs sliding-window). It supports:

- --skip-baseline-prefill: compute the baseline bytes analytically (exact, since it's determined by config + seq_len); saves half the CPU budget for long contexts.
- --prefill-chunk N: chunked prefill to bound activation memory on CPU.
- --skip-generation: skip greedy-decode sanity check (for very long contexts).
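
For a model where every layer is full attention, the analytic baseline reduces to a single product. The sketch below is an assumption about what such a calculation looks like, not the code behind --skip-baseline-prefill; the function name baseline_kv_bytes is hypothetical, and the layer / head / head_dim values are taken from the models' public configs rather than from this PR. It does not cover sliding-window or KV-shared layers (as in Gemma 4), which need per-layer handling.

```python
GIB = 1024 ** 3


def baseline_kv_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                      bytes_per_elem=2):
    """Uncompressed KV bytes for an all-full-attention model (bf16 by default)."""
    # Two tensors (K and V) per layer, each of shape
    # [num_kv_heads, seq_len, head_dim].
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


CTX = 131072
# (num_layers, num_kv_heads, head_dim): assumed config values.
qwen3_0_6b = baseline_kv_bytes(28, 8, 128, CTX)
smollm2_1_7b = baseline_kv_bytes(24, 32, 64, CTX)
qwen2_5_0_5b = baseline_kv_bytes(24, 2, 64, CTX)

print(qwen3_0_6b / GIB)    # 14.0
print(smollm2_1_7b / GIB)  # 24.0
print(qwen2_5_0_5b / GIB)  # 1.5
```

Under these assumed configs the results agree with the "Baseline KV" column of the headline table for the three dense models.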

What the extrapolator does

For fixed codec params, per-block skeleton bytes and per-token encoded bytes are deterministic constants — they do not depend on context length. kakeya_extrapolate.py reads any measured report and scales it exactly to arbitrary target contexts. Cross-validation summary:

Prediction	Predicted	Measured	Abs error
8k → 16k full (f32 / bf16)	2.151 / 4.030	2.149 / 4.027	≤ 0.003
8k → 32k full (f32 / bf16)	2.191 / 4.237	2.189 / 4.234	≤ 0.003
16k → 32k full (f32 / bf16)	2.189 / 4.234	2.189 / 4.234	0.000

So the 64k / 128k / 256k rows in every report are byte-exact under the preset, not statistical fits.
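
The reason a projection can be exact rather than fitted is that, for fixed codec params, the compressed size is a piecewise-linear function of context length. The sketch below illustrates that idea only. The block layout (whole blocks compressed, the remainder kept raw), the function names, and the three byte constants are hypothetical placeholders, not values or logic taken from kakeya_extrapolate.py or from any report; only block_size=512 and residual_length=256 come from the preset.

```python
def project_compressed_bytes(ctx, *, block_size, residual_length,
                             skeleton_bytes_per_block,
                             encoded_bytes_per_token,
                             raw_bytes_per_token):
    """Project compressed KV bytes for one full-attention layer (assumed layout)."""
    # Assumption: only whole blocks beyond the recent residual window are
    # compressed; everything else stays as a raw, uncompressed tail.
    n_blocks = max(0, (ctx - residual_length) // block_size)
    compressed_tokens = n_blocks * block_size
    tail_tokens = ctx - compressed_tokens
    return (n_blocks * skeleton_bytes_per_block
            + compressed_tokens * encoded_bytes_per_token
            + tail_tokens * raw_bytes_per_token)


def project_ratio(ctx, **params):
    baseline = ctx * params["raw_bytes_per_token"]
    return baseline / project_compressed_bytes(ctx, **params)


# Placeholder constants; a real run would read them from a measured report.
PARAMS = dict(block_size=512, residual_length=256,
              skeleton_bytes_per_block=65536,
              encoded_bytes_per_token=100,
              raw_bytes_per_token=1024)

ratios = {ctx: project_ratio(ctx, **PARAMS)
          for ctx in (2048, 8192, 32768, 131072)}
# As ctx grows, the ratio approaches raw / (skeleton / block_size + encoded).
asymptote = PARAMS["raw_bytes_per_token"] / (
    PARAMS["skeleton_bytes_per_block"] / PARAMS["block_size"]
    + PARAMS["encoded_bytes_per_token"])

for ctx, ratio in ratios.items():
    print(ctx, round(ratio, 3))
print("asymptote", round(asymptote, 3))
```

In this toy model the ratio rises with context and saturates, which is the same qualitative shape as the measured 2k to 32k rows: the fixed raw tail is amortized over more and more compressed blocks.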

Benchmark standard

reports/STANDARD.md is the formal methodology doc. Key points:

- Fixed codec preset across all runs (for direct comparability).
- Measured sweep: 2 048 / 4 096 / 8 192 tokens (real prefill + decode).
- Projected sweep: 16 384 / 32 768 / 65 536 / 131 072 / 262 144 tokens (byte-exact extrapolation).
- Two reported ratios per context: "f32 store" (as the codec is implemented today) and "bf16 store" (projected with the same codec storing compressed tensors in bf16).
- Headline number for cross-model comparisons: total_ratio_bf16_store at 32k tokens.

Reproducing any row

pip install -U torch transformers accelerate huggingface_hub

python3 -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen3-0.6B', local_dir='models/Qwen3-0.6B', allow_patterns=['*.json','*.safetensors','tokenizer*','*.model'])"

./run_all_benchmarks.sh models/Qwen3-0.6B qwen3_0_6b

Produces:

reports/qwen3_0_6b/bench_2048.json
reports/qwen3_0_6b/bench_4096.json
reports/qwen3_0_6b/bench_8192.json
reports/qwen3_0_6b/extrapolation.json  # contains the 16k–256k rows

Environment used

CPU-only x86_64, 15 GB RAM, no GPU, torch==2.11.0+cu130, transformers==5.5.4. Public (not gated) open-source models only.

Follow-up

The only pending optimization explicitly called out in the reports is the bf16-store conversion inside kakeya_kv_codec.py (currently compressed tensors are float32 on CPU). The reports make this visible by always showing the "bf16 store, projected" column next to the "f32 store" column, so it's clear what absolute gain that optimization would unlock.

- layer_types[:-0] yields an empty list, so the default Gemma 4 config (num_kv_shared_layers=0) produced a Cache with zero layers and broke every attention update during generation. Guard the slice so it only applies when some layers are actually shared.
- Make the residual index tensor contiguous in the d_res >= d_eff branch so scatter_ avoids an implicit copy from the expanded stride pattern.
- Clarify shared_layers comment: Gemma 4 builds a per-forward shared_kv_states dict inside the model, so the cache attribute exists only for compatibility and stays empty.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

The benchmark (kakeya_benchmark.py) downloads/loads google/gemma-4-E2B-it, runs a prefill twice (DynamicCache baseline vs KakeyaKVCache) on the same long prompt, and measures exact bytes per layer. It reports:
- baseline vs kakeya total KV bytes, split into full-attention and sliding-window layers (sliding layers are capped by sliding_window, so the codec intentionally leaves them alone)
- compression ratio both as the codec currently stores things (float32 on CPU) and with a dtype-matched projection (bf16 store, which an optimized version would use)
- sanity-check generation comparing greedy outputs

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Benchmark additions:
- chunked prefill to bound activation memory (--prefill-chunk)
- --skip-baseline-prefill: derive the DynamicCache byte count analytically (it's a deterministic function of context length + config, so we do not need to waste 20 minutes re-running it on CPU for every context length)
- --skip-generation: skip the sanity-check greedy decode when we only want the byte numbers
- analytic baseline summarizer that produces the same report shape

New tool: kakeya_extrapolate.py
- Reads an existing benchmark report
- Extracts per-block skeleton bytes and per-token encoded/tail bytes
- Projects byte count and compression ratio to arbitrary target context lengths (exact under fixed codec params, no statistical fitting)
- Cross-validated against measured 16k and 32k reports: max error 0.1%.

Measured compression ratios on gemma-4-E2B-it (bf16, CPU):

ctx	full f32	full bf16	total f32	total bf16
2048	1.61x	2.30x	1.34x	1.60x
4096	1.88x	3.05x	1.60x	2.16x
8192	2.07x	3.67x	1.85x	2.83x
16384	2.15x	4.03x	2.01x	3.42x
32768	2.19x	4.23x	2.11x	3.86x

Projected (byte-exact under the same codec params):

ctx	full f32	full bf16	total f32	total bf16
65536	2.21x	4.35x	2.17x	4.13x
131072	2.22x	4.40x	2.20x	4.29x
262144	2.23x	4.43x	2.22x	4.37x

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Codec changes (kakeya_kv_codec.py):
- Factor out layer-plan inference into _resolve_layer_plan(config), which handles: models with layer_types (Gemma 2/3/4, Cohere2, SmolLM3), pure full-attention models (Llama, Mistral, Qwen2, SmolLM2), models with only sliding_window / attention_chunk_size, and Gemma 4's num_kv_shared_layers convention.
- Rename KakeyaKVCache's constructor to use that helper so it works uniformly across model families.
- Add build_kakeya_cache(model, ...) as the new canonical factory. build_gemma4_kakeya_cache is kept as an alias for backward compatibility.

Benchmark changes (kakeya_benchmark.py):
- Drop hard-coded Gemma 4 expectations in main(): fall back to inferring layer_types from num_hidden_layers + sliding_window when the config does not expose it (Llama-family).
- Robust head_dim fallback to hidden_size // num_attention_heads.
- Richer model metadata in the JSON report (head counts, head_dim, global_head_dim, sliding_window) so downstream tools can reconstruct the architecture without reloading the model.
- --model-name flag for clean per-report labeling.

Orchestration (run_all_benchmarks.sh):
- Runs the standard 2k/4k/8k measured sweep + 16k-256k projection with the reference codec preset for any model path.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Per-model reports under reports/<model>/:
- bench_2048.json / bench_4096.json / bench_8192.json: real measured runs on the model weights, both full + total ratios, both f32-store and bf16-store projections.
- bench_16384.json / bench_32768.json (Gemma 4 only): real measured long-context runs that cross-validate the extrapolator.
- extrapolation.json: byte-exact projections to 16k / 32k / 64k / 128k / 256k tokens, derived from the 8k per-block statistics.
- REPORT.md: per-model narrative with tables + architecture notes.

Headline 128k (bf16 store) totals:

Qwen3-0.6B	4.51x	(14.00 GiB -> 3.10 GiB)
Gemma 4 E2B	4.29x	(774 MiB -> 180 MiB)
SmolLM2-1.7B	2.25x	(24.00 GiB -> 10.65 GiB)
Qwen2.5-0.5B	2.15x	(1.50 GiB -> 714 MiB)

All four models run through the same build_kakeya_cache(model) factory and the same run_all_benchmarks.sh orchestrator with no model-specific code.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

- README.md: project overview, quick-start snippet, headline results, repository layout.
- reports/STANDARD.md: formal benchmark methodology (codec preset, context sweep, metrics, extrapolator validation) with the Gemma 4 reference numbers.
- reports/CROSS_MODEL.md: single-table comparison of all four models plus per-architecture analysis of what drives the ratio differences.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…L floor

Investigating the catastrophic e2e PPL finding from PR #12. Two distinct issues identified, ranked by quantitative impact:

Issue 1 (real bug, fixed in this commit): WHT scaling inconsistency
===================================================================

The codec's rotate() function uses an UNNORMALIZED Walsh-Hadamard transform (butterfly with no 1/sqrt(N) factor), so

    ||rotated||^2 = wht_len * ||res||^2

But encode_block was computing scale = sqrt(wht_len) / ||res||, which gave the Lloyd-Max quantiser input values with per-coord variance wht_len, not 1 as the N(0,1) codebook expects.

For d_eff=26, wht_len=32: scaled values had per-coord std ~5.66, with 21 of 32 coords outside the b=3 Lloyd-Max max centroid of +/-2.15. Almost all residual coordinates were saturating to the extreme centroid, losing essentially all information.

Fix: scale = 1.0 / res_norm in codec.rs line 249 (was sqrt(wht_len) / res_norm). Decoder unchanged (inv_scale = 1/scale already stored correctly). All 153 existing tests still pass.

Effect on stage-4 K-stream reconstruction of real Qwen2.5-0.5B layer-5 K tensor:

    b=3 exact: SNR 10.1x -> 50.0x (correl 0.950 -> 0.990)
    b=2 exact: SNR  8.4x -> 32.7x (correl 0.939 -> 0.985)
    b=2 rsvd : SNR  8.4x -> 32.6x (correl 0.939 -> 0.985)

V stream essentially unchanged (residuals were small enough to stay within the Lloyd-Max range even pre-fix).

Issue 2 (structural, NOT fixable by parameters): per-layer PPL floor
===================================================================

Even after fix #1, end-to-end PPL on real WikiText-103 shows that the codec is not PPL-ACCEPT at any parameter setting.

Depth compounding test on Qwen2.5-0.5B (24 layers):

k layers compressed | paper default | v1.2 b=3 exact | max fidelity
--------------------|--------------:|---------------:|-------------:
1                   |         +3.9% |          +3.7% |        +2.5%
4                   |        +35.5% |         +39.6% |       +38.2%
8                   |       +147.9% |        +149.4% |      +141.5%
16                  |       +846.4% |        +927.5% |     +1169.0%
24                  |      +9341.0% |       +6671.8% |    +15647.5%

Even max fidelity (b=4, vr=1.0 so d_eff=D, exact PCA, no RSVD truncation) incurs +2.5% PPL per layer. Across 24 layers this compounds super-linearly to +15 648%. The MSE-based ACCEPT framework cannot predict this because the MSE-to-PPL relationship at multi-layer compounding is non-monotone in the low-noise regime.

Candidate causes (each probably ~0.5-1% of the 2.5% floor):
- bf16 PCA basis storage (~0.1% per coord, accumulates across d_eff ~ 10-30 basis vectors)
- fp16 t-scalar in k-means
- shared / pooled PCA basis not matching per-block structure
- universal Lloyd-Max codebook not adapted to per-block residual distribution

This means the codec cannot be saved to PPL-ACCEPT by parameter changes. Full details in reports/v1_3_rsvd_rope/codec_root_cause/DIAGNOSIS.md

New tooling added:
- kakeyaturbo/src/bin/stage-by-stage-decode.rs : emits per-stage reconstructions so error can be attributed to PCA / kmeans / WHT / Lloyd-Max stages.
- benchmarks/stage_ablation_driver.py : Python driver for the above, on real captured KV tensors.
- benchmarks/depth_compounding_test.py : measures per-layer PPL inflation at increasing compression depth.

Remediation options documented in DIAGNOSIS.md:
A. Architectural replacement on K (e.g. KIVI-style per-channel int4/int8), keep skeleton+residual only for V.
B. Fine-tuning adapter per layer (abandons training-free claim).
C. Per-block adaptive codebook (replace universal Lloyd-Max).
D. Withdraw compression-with-ACCEPT claim from paper.

Recommend A or a combination of A + C. Until a remediation lands, the paper's quality claims must not be promoted.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
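
The WHT scaling mismatch described in Issue 1 can be checked numerically. The snippet below is a pure-Python sketch, not the Rust code in codec.rs: the function wht_unnormalized, the zero-padding of the d_eff=26 residual up to wht_len=32, and the random test vector are assumptions made for illustration. It reports root-mean-square per coordinate, which equals the per-coordinate std when the mean is zero.

```python
import math
import random


def wht_unnormalized(values):
    """Butterfly Walsh-Hadamard transform with no 1/sqrt(N) factor."""
    out = list(values)
    n = len(out)
    assert n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


def rms(xs):
    return math.sqrt(sum(x * x for x in xs) / len(xs))


random.seed(0)
d_eff, wht_len = 26, 32
# Assumed layout: residual of length d_eff, zero-padded to wht_len.
res = [random.gauss(0.0, 1.0) for _ in range(d_eff)] + [0.0] * (wht_len - d_eff)
res_norm = math.sqrt(sum(x * x for x in res))

rotated = wht_unnormalized(res)
rotated_sq = sum(x * x for x in rotated)
# The unnormalized transform inflates the squared norm by wht_len.
print(rotated_sq / (res_norm ** 2))  # ~32

old_scaled = [x * math.sqrt(wht_len) / res_norm for x in rotated]  # pre-fix
new_scaled = [x / res_norm for x in rotated]                       # post-fix

print(rms(old_scaled))  # ~5.657 (sqrt(32)), far outside a unit-variance codebook
print(rms(new_scaled))  # ~1.0, what an N(0,1) codebook expects
```

The pre-fix RMS of sqrt(32) ≈ 5.66 is the per-coordinate std quoted in the commit message, and the post-fix value of 1 follows directly from the norm identity.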

Buckets on the HF (+7.82%) vs vLLM (+35.33%) 27-pp gap:

#1 Engine baseline shift: ~10 pp (clean-model PPL disagreement; 0.145 KL; 18% top-1 disagreement)
#2 Codec residual magnitude: ~0 (codec is engine-agnostic; mse ratio 1.01)
#3 Noise-sensitivity curve: HF MORE sensitive per σ in linear regime; not the cause
#4 Boundary layers: already skipped (+69 pp saved by SPRINT_CLOSEOUT boundary policy)
#5 Cross-layer non-linear compound: +39 pp (joint-cell - Σ singletons over 22 quiet layers)

Localised root cause: vLLM's single-forward bf16 residual-stream accumulation through Flash-Attention compounds per-layer codec residuals ~39 pp above their sum, while HF eager's f32-accumulate + teacher-force over DynamicCache compounds them less aggressively. Each per-layer residual is small on both engines (Phase 4 matched); what differs is the accumulation path.

Deployment recommendations:
1. Extend vLLM boundary skip to {2, 6, 11} on top of the existing {0,1,7,14,26,27}; cuts ~10-15 pp off the joint Delta-ppl.
2. Adaptive per-layer bit-width: K b=4 on the hot layers, b=3 elsewhere; preserves 19/28 of the ratio benefit.

Phase 3 ran only on vLLM (reused production harness); the HF per-layer curve is left as a follow-up if someone wants to confirm that HF's cross-layer interaction is the ~+10 pp we infer here.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…ady text

Per user instruction, committing directly to main (no PR) with three items from the 2026-04-27 retrieval audit:

1. README.md: top-of-file restructure:
   - New one-line tagline matches the description users will paste into repo Settings: 'Drop-in transformers.DynamicCache subclass. 2.4-2.8x KV cache compression at <1% perplexity loss on Qwen3, Llama-3, DeepSeek, GLM-4, Gemma.'
   - PyPI badge, GitHub stars badge, HF Space badge.
   - 'If this helps your inference stack, please give us a star' CTA immediately after the badges. This is the single highest-ROI copy change given the audit finding that 4 stars is a hard GEO-authority drag and the PyPI already has 103 dl/week that could convert to stars if prompted.
   - Kept the existing 'GPU-native lattice-quantisation codec for transformer KV caches. Measured across 4 open-source model families...' paragraph below unchanged so the body numbers continue to match the public JSON.

2. docs/GEO_AUDIT_2026-04-27.md: full retrieval audit across 10 queries in 3 tiers:
   * Tier A (name lookup): passes on 3 of 4 queries via the libraries.io PyPI mirror. One failure: Perplexity misattributed 'KakeyaLattice' to arxiv.org/pdf/2504.05646 (a different 'Lattice' paper on orthogonal state recurrence). This is the strongest argument for submitting our own arXiv.
   * Tier B (topic lookup, user doesn't know the name): we are absent from 'best LLM KV cache compression library Python 2026' (top 5 is all NVIDIA/kvpress) and 'E8 lattice KV cache vLLM' (top 5 is NexusQuant + vLLM FP8 docs). We rank #1 on the narrow 'nested lattice KV compression Qwen3 DeepSeek' but only via the libraries.io PyPI mirror, not our GitHub repo.
   * Tier C (peer comparison): AI engines list us with correct numbers ('2.4-3x better than TurboQuant') but the hyperlinks they emit point at NexusQuant/TurboQuant, not us.
   * Landing-surface probe: PyPI OK, Libraries.io OK, HF Space OK; GitHub repo page has a typo in description ('efficent'), empty topics field, and 4 stars, all three penalising.
   * Competitive snapshot: NexusQuant is running a full launch playbook and has taken the 'lattice KV' query cluster. Their one gap is 'no arXiv preprint', which is the one lever we can use to overtake.
   * Three ROI-ranked fixes with exact paste-ready text for the GitHub repo Settings (description, homepage, 11 topics) plus the equivalent gh CLI one-liner.
   * Baseline for a 2026-05-25 re-audit with expected deltas.

3. No new data claims. All numeric references in the audit point back to existing reports under reports/v1_4_release/ and reports/v1_5_release/.

The one GEO fix this commit cannot perform (editing the GitHub repo's Settings: description + homepage + topics) is NOT a repo file; it is UI / API state on github.com. The cloud agent's gh CLI is read-only. The audit file has the exact text to paste; expected completion time is under 5 minutes in the Settings UI.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

cursoragent and others added 3 commits April 17, 2026 02:56

Add 4k and 8k token compression benchmark reports

a511bac

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

cursor Bot changed the title ~~Fix empty KV cache when num_kv_shared_layers is 0~~ Fix empty KV cache bug and add real-weights Gemma 4 compression benchmark Apr 17, 2026

cursor Bot changed the title ~~Fix empty KV cache bug and add real-weights Gemma 4 compression benchmark~~ Fix empty KV cache bug, benchmark Kakeya on real Gemma 4, extrapolate to 128k Apr 17, 2026

cursoragent and others added 3 commits April 17, 2026 05:03

cursor Bot changed the title ~~Fix empty KV cache bug, benchmark Kakeya on real Gemma 4, extrapolate to 128k~~ Kakeya KV cache: bug fix, Gemma 4 benchmark standard, cross-model reports (4 models) Apr 17, 2026

cursor Bot marked this pull request as ready for review April 17, 2026 05:05

cursor Bot merged commit f2f3e35 into main Apr 17, 2026

FluffyAIcode mentioned this pull request Apr 17, 2026

Add DeepSeek and GLM benchmark reports (+ cleanup of legacy top-level reports) #2

Merged

cursor Bot mentioned this pull request Apr 21, 2026

Outlier compensation + Besicovitch-product skeleton — diagnostic sprint #13

Closed

FluffyAIcode deleted the cursor/kakeya-kv-review-12f5 branch April 23, 2026 15:52

This was referenced Apr 24, 2026

paper: fix protocol self-contradiction + bit-accounting inconsistencies #41

Merged

HF Space demo: fix compression display + repetition-loop prompt, add Docker SDK manifest #53

Draft

cursor[bot] merged 7 commits into main from cursor/kakeya-kv-review-12f5
