Kakeya KV cache: bug fix, Gemma 4 benchmark standard, cross-model reports (4 models) #1
Merged
- layer_types[:-0] yields an empty list, so the default Gemma 4 config
  (num_kv_shared_layers=0) produced a Cache with zero layers and broke every
  attention update during generation. Guard the slice so it only applies when
  some layers are actually shared.
- Make the residual index tensor contiguous in the d_res >= d_eff branch so
  scatter_ avoids an implicit copy from the expanded stride pattern.
- Clarify shared_layers comment: Gemma 4 builds a per-forward shared_kv_states
  dict inside the model, so the cache attribute exists only for compatibility
  and stays empty.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
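The slicing pitfall behind the first fix is easy to reproduce in isolation. The sketch below is illustrative only: drop_shared_layers is a hypothetical helper name, not the function used in kakeya_kv_codec.py.

```python
# Hypothetical helper illustrating the guard; not the repository's actual code.
def drop_shared_layers(layer_types, num_kv_shared_layers):
    """Return the layer types that own a KV cache entry."""
    # [:-0] is the same slice as [:0], which is always empty,
    # so only slice when at least one layer is actually shared.
    if num_kv_shared_layers > 0:
        return layer_types[:-num_kv_shared_layers]
    return list(layer_types)


layer_types = ["sliding_attention", "sliding_attention", "full_attention"]

print(layer_types[:-0])                    # the pitfall: an empty list
print(drop_shared_layers(layer_types, 0))  # all three layers kept
print(drop_shared_layers(layer_types, 1))  # last layer dropped
```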
The benchmark (kakeya_benchmark.py) downloads/loads google/gemma-4-E2B-it,
runs a prefill twice (DynamicCache baseline vs KakeyaKVCache) on the
same long prompt, and measures exact bytes per layer.
It reports:
- baseline vs kakeya total KV bytes, split into full-attention and
sliding-window layers (sliding layers are capped by sliding_window,
so the codec intentionally leaves them alone)
- compression ratio both as the codec currently stores things
(float32 on CPU) and with a dtype-matched projection (bf16 store,
which an optimized version would use)
- sanity-check generation comparing greedy outputs
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
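One way to read the two ratios in this report: the codec's payload is currently held in float32, so a bf16 store would halve that part, while bytes already kept in the model dtype stay the same. The sketch below assumes exactly that; the split into "codec" and "passthrough" bytes and all numbers are hypothetical, not taken from kakeya_benchmark.py.

```python
def compression_ratios(baseline_bytes, codec_f32_bytes, passthrough_bytes):
    """Return (ratio as stored in f32, ratio projected to a bf16 store)."""
    as_stored = baseline_bytes / (codec_f32_bytes + passthrough_bytes)
    # Assumption: a bf16 store uses 2 bytes per element instead of 4,
    # so only the float32 codec payload shrinks.
    projected = baseline_bytes / (codec_f32_bytes / 2 + passthrough_bytes)
    return as_stored, projected


# Made-up byte counts, for illustration only.
f32_ratio, bf16_ratio = compression_ratios(1_000_000, 400_000, 50_000)
print(f"f32 store: {f32_ratio:.2f}x, bf16 store (projected): {bf16_ratio:.2f}x")
```

Because the passthrough part does not shrink, the projected ratio is less than twice the stored ratio.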
Benchmark additions:
- chunked prefill to bound activation memory (--prefill-chunk)
- --skip-baseline-prefill: derive the DynamicCache byte count analytically
(it's a deterministic function of context length + config, so we do not
need to waste 20 minutes re-running it on CPU for every context length)
- --skip-generation: skip the sanity-check greedy decode when we only
want the byte numbers
- analytic baseline summarizer that produces the same report shape
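The analytic baseline can be sketched as a closed-form count over layers. This is a minimal illustration, not the benchmark's own summarizer: dynamic_cache_bytes is a hypothetical name, and the Qwen2.5-0.5B shape used in the example (24 full-attention layers, 2 KV heads, head_dim 64, bf16) is an assumption.

```python
def dynamic_cache_bytes(ctx_len, layer_types, num_kv_heads, head_dim,
                        sliding_window=None, dtype_bytes=2):
    """Analytic KV byte count for an uncompressed cache (K and V, all layers)."""
    total = 0
    for kind in layer_types:
        tokens = ctx_len
        # Sliding-window layers never hold more than `sliding_window` tokens.
        if kind == "sliding_attention" and sliding_window is not None:
            tokens = min(ctx_len, sliding_window)
        total += 2 * num_kv_heads * head_dim * tokens * dtype_bytes  # 2 = K + V
    return total


# Assumed Qwen2.5-0.5B shape; values are not read from the benchmark reports.
n = dynamic_cache_bytes(131072, ["full_attention"] * 24, num_kv_heads=2, head_dim=64)
print(n / 2**30, "GiB")
```

Under these assumed values the count is 1.5 GiB at 131072 tokens, the same figure as the Qwen2.5-0.5B baseline in the 128k headline totals.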
New tool: kakeya_extrapolate.py
- Reads an existing benchmark report
- Extracts per-block skeleton bytes and per-token encoded/tail bytes
- Projects byte count and compression ratio to arbitrary target context
lengths (exact under fixed codec params, no statistical fitting)
- Cross-validated against measured 16k and 32k reports: max error 0.1%.
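The projection idea can be sketched as follows. The block layout (whole blocks encoded, the newest residual_length tokens and any leftover kept raw) is an assumption about the codec, project is a hypothetical function name, and the byte constants are invented for illustration; the real tool reads them from a measured report.

```python
def project(ctx_len, skeleton_bytes_per_block, encoded_bytes_per_token,
            raw_bytes_per_token, block_size=512, residual_length=256):
    """Project compressed bytes and ratio for one full-attention layer."""
    # Assumption: the newest `residual_length` tokens stay uncompressed and the
    # rest is encoded in whole blocks; leftover tokens also stay raw.
    n_blocks = max(0, ctx_len - residual_length) // block_size
    encoded_tokens = n_blocks * block_size
    tail_tokens = ctx_len - encoded_tokens
    compressed = (n_blocks * skeleton_bytes_per_block
                  + encoded_tokens * encoded_bytes_per_token
                  + tail_tokens * raw_bytes_per_token)
    baseline = ctx_len * raw_bytes_per_token
    return compressed, baseline / compressed


# Hypothetical constants: 20 000 skeleton bytes per block, 100 encoded bytes
# per token, 512 raw bytes per token.
for ctx in (2048, 8192, 32768, 131072):
    size, ratio = project(ctx, skeleton_bytes_per_block=20_000,
                          encoded_bytes_per_token=100, raw_bytes_per_token=512)
    print(f"{ctx:>7} tokens: {size:>10} bytes, {ratio:.2f}x")
```

In this model the ratio rises with context length and flattens out, because the fixed uncompressed tail matters less and less while the per-block cost stays constant.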
Measured compression ratios on gemma-4-E2B-it (bf16, CPU):
ctx      full f32   full bf16   total f32   total bf16
2048     1.61x      2.30x       1.34x       1.60x
4096     1.88x      3.05x       1.60x       2.16x
8192     2.07x      3.67x       1.85x       2.83x
16384    2.15x      4.03x       2.01x       3.42x
32768    2.19x      4.23x       2.11x       3.86x
Projected (byte-exact under the same codec params):
65536    2.21x      4.35x       2.17x       4.13x
131072   2.22x      4.40x       2.20x       4.29x
262144   2.23x      4.43x       2.22x       4.37x
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Codec changes (kakeya_kv_codec.py):
- Factor out layer-plan inference into _resolve_layer_plan(config),
which handles: models with layer_types (Gemma 2/3/4, Cohere2, SmolLM3),
pure full-attention models (Llama, Mistral, Qwen2, SmolLM2), models
with only sliding_window / attention_chunk_size, and Gemma 4's
num_kv_shared_layers convention.
- Rework KakeyaKVCache's constructor to use that helper so it works
uniformly across model families.
- Add build_kakeya_cache(model, ...) as the new canonical factory.
build_gemma4_kakeya_cache is kept as an alias for backward
compatibility.
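A rough sketch of what layer-plan inference over these config patterns might look like. resolve_layer_plan_sketch is a hypothetical stand-in, not the body of _resolve_layer_plan; in particular, treating a model that only exposes sliding_window or attention_chunk_size as uniformly windowed is a simplifying assumption.

```python
from types import SimpleNamespace


def resolve_layer_plan_sketch(config):
    """Return one attention-type label per layer that owns KV state."""
    layer_types = getattr(config, "layer_types", None)
    if layer_types is None:
        # No explicit plan: infer a uniform one from the window settings.
        windowed = (getattr(config, "sliding_window", None) is not None
                    or getattr(config, "attention_chunk_size", None) is not None)
        kind = "sliding_attention" if windowed else "full_attention"
        layer_types = [kind] * config.num_hidden_layers
    shared = getattr(config, "num_kv_shared_layers", 0) or 0
    if shared > 0:  # trailing layers that reuse earlier KV get no cache slot
        layer_types = layer_types[:-shared]
    return list(layer_types)


# Stand-in configs with made-up sizes, not real model configs.
llama_like = SimpleNamespace(num_hidden_layers=4, sliding_window=None)
gemma_like = SimpleNamespace(
    num_hidden_layers=4,
    layer_types=["sliding_attention"] * 3 + ["full_attention"],
    num_kv_shared_layers=0,
)
print(resolve_layer_plan_sketch(llama_like))
print(resolve_layer_plan_sketch(gemma_like))
```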
Benchmark changes (kakeya_benchmark.py):
- Drop hard-coded Gemma 4 expectations in main(): fall back to
inferring layer_types from num_hidden_layers + sliding_window when
the config does not expose it (Llama-family).
- Robust head_dim fallback to hidden_size // num_attention_heads.
- Richer model metadata in the JSON report (head counts, head_dim,
global_head_dim, sliding_window) so downstream tools can reconstruct
the architecture without reloading the model.
- --model-name flag for clean per-report labeling.
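The head_dim fallback mentioned above could look like the sketch below. infer_head_dim is a hypothetical name and the config objects are stand-ins with illustrative values; the point is that an explicit head_dim wins, since it need not equal hidden_size // num_attention_heads.

```python
from types import SimpleNamespace


def infer_head_dim(config):
    """Prefer an explicit head_dim; otherwise derive it from the hidden size."""
    head_dim = getattr(config, "head_dim", None)
    if head_dim is None:
        head_dim = config.hidden_size // config.num_attention_heads
    return head_dim


# Illustrative shapes only, not read from any real model config.
implicit = SimpleNamespace(hidden_size=2048, num_attention_heads=32)
explicit = SimpleNamespace(hidden_size=1024, num_attention_heads=16, head_dim=128)
print(infer_head_dim(implicit))  # derived from hidden_size
print(infer_head_dim(explicit))  # explicit value, differs from 1024 // 16
```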
Orchestration (run_all_benchmarks.sh):
- Runs the standard 2k/4k/8k measured sweep + 16k-256k projection
with the reference codec preset for any model path.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Per-model reports under reports/<model>/:
- bench_2048.json / bench_4096.json / bench_8192.json: real measured
runs on the model weights, both full + total ratios, both
f32-store and bf16-store projections.
- bench_16384.json / bench_32768.json (Gemma 4 only): real measured
long-context runs that cross-validate the extrapolator.
- extrapolation.json: byte-exact projections to 16k / 32k / 64k /
128k / 256k tokens, derived from the 8k per-block statistics.
- REPORT.md: per-model narrative with tables + architecture notes.
Headline 128k (bf16 store) totals:
Qwen3-0.6B 4.51x (14.00 GiB -> 3.10 GiB)
Gemma 4 E2B 4.29x (774 MiB -> 180 MiB)
SmolLM2-1.7B 2.25x (24.00 GiB -> 10.65 GiB)
Qwen2.5-0.5B 2.15x (1.50 GiB -> 714 MiB)
All four models run through the same build_kakeya_cache(model) factory
and the same run_all_benchmarks.sh orchestrator with no model-specific
code.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
- README.md: project overview, quick-start snippet, headline results,
  repository layout.
- reports/STANDARD.md: formal benchmark methodology (codec preset, context
  sweep, metrics, extrapolator validation) with the Gemma 4 reference numbers.
- reports/CROSS_MODEL.md: single-table comparison of all four models plus
  per-architecture analysis of what drives the ratio differences.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
cursor Bot pushed a commit that referenced this pull request on Apr 20, 2026
…L floor

Investigating the catastrophic e2e PPL finding from PR #12. Two distinct
issues identified, ranked by quantitative impact:

Issue 1 (real bug, fixed in this commit): WHT scaling inconsistency
===================================================================
The codec's rotate() function uses an UNNORMALIZED Walsh-Hadamard transform
(butterfly with no 1/sqrt(N) factor), so

    ||rotated||^2 = wht_len * ||res||^2

But encode_block was computing scale = sqrt(wht_len) / ||res||, which gave the
Lloyd-Max quantiser input values with per-coord variance wht_len, not 1 as the
N(0,1) codebook expects.

For d_eff=26, wht_len=32: scaled values had per-coord std ~5.66, with 21 of 32
coords outside the b=3 Lloyd-Max max centroid of +/-2.15. Almost all residual
coordinates were saturating to the extreme centroid, losing essentially all
information.

Fix: scale = 1.0 / res_norm in codec.rs line 249 (was sqrt(wht_len) /
res_norm). Decoder unchanged (inv_scale = 1/scale already stored correctly).
All 153 existing tests still pass.

Effect on stage-4 K-stream reconstruction of real Qwen2.5-0.5B layer-5 K
tensor:
    b=3 exact: SNR 10.1x -> 50.0x (correl 0.950 -> 0.990)
    b=2 exact: SNR  8.4x -> 32.7x (correl 0.939 -> 0.985)
    b=2 rsvd : SNR  8.4x -> 32.6x (correl 0.939 -> 0.985)
V stream essentially unchanged (residuals were small enough to stay within the
Lloyd-Max range even pre-fix).

Issue 2 (structural, NOT fixable by parameters): per-layer PPL floor
===================================================================
Even after fix #1, end-to-end PPL on real WikiText-103 shows that the codec is
not PPL-ACCEPT at any parameter setting.

Depth compounding test on Qwen2.5-0.5B (24 layers):

k layers compressed | paper default | v1.2 b=3 exact | max fidelity
--------------------|--------------:|---------------:|-------------:
1                   |         +3.9% |          +3.7% |        +2.5%
4                   |        +35.5% |         +39.6% |       +38.2%
8                   |       +147.9% |        +149.4% |      +141.5%
16                  |       +846.4% |        +927.5% |     +1169.0%
24                  |      +9341.0% |       +6671.8% |    +15647.5%

Even max fidelity (b=4, vr=1.0 so d_eff=D, exact PCA, no RSVD truncation)
incurs +2.5% PPL per layer. Across 24 layers this compounds super-linearly to
+15 648%. The MSE-based ACCEPT framework cannot predict this because the
MSE-to-PPL relationship at multi-layer compounding is non-monotone in the
low-noise regime.

Candidate causes (each probably ~0.5-1% of the 2.5% floor):
- bf16 PCA basis storage (~0.1% per coord, accumulates across d_eff ~ 10-30
  basis vectors)
- fp16 t-scalar in k-means
- shared / pooled PCA basis not matching per-block structure
- universal Lloyd-Max codebook not adapted to per-block residual distribution

This means the codec cannot be saved to PPL-ACCEPT by parameter changes.
Full details in reports/v1_3_rsvd_rope/codec_root_cause/DIAGNOSIS.md

New tooling added:
- kakeyaturbo/src/bin/stage-by-stage-decode.rs : emits per-stage
  reconstructions so error can be attributed to PCA / kmeans / WHT /
  Lloyd-Max stages.
- benchmarks/stage_ablation_driver.py : Python driver for the above, on real
  captured KV tensors.
- benchmarks/depth_compounding_test.py : measures per-layer PPL inflation at
  increasing compression depth.

Remediation options documented in DIAGNOSIS.md:
A. Architectural replacement on K (e.g. KIVI-style per-channel int4/int8),
   keep skeleton+residual only for V.
B. Fine-tuning adapter per layer (abandons training-free claim).
C. Per-block adaptive codebook (replace universal Lloyd-Max).
D. Withdraw compression-with-ACCEPT claim from paper.

Recommend A or a combination of A + C. Until a remediation lands, the paper's
quality claims must not be promoted.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
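The scaling argument in Issue 1 can be checked numerically with a few lines of plain Python. This is an independent sketch, not the Rust code in codec.rs: wht_unnormalized is a hypothetical re-implementation of a butterfly transform without the 1/sqrt(N) factor, and the residual is random data zero-padded from 26 to 32 coordinates to mirror the d_eff=26, wht_len=32 case.

```python
import math
import random


def wht_unnormalized(values):
    """Butterfly Walsh-Hadamard transform with no 1/sqrt(N) normalisation."""
    out = list(values)
    n = len(out)
    assert (n & (n - 1)) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


random.seed(0)
d_eff, wht_len = 26, 32
res = [random.gauss(0.0, 1.0) for _ in range(d_eff)] + [0.0] * (wht_len - d_eff)
res_norm = math.sqrt(sum(x * x for x in res))
rotated = wht_unnormalized(res)
rot_sq = sum(x * x for x in rotated)


def rms_after_scaling(scale):
    """Root-mean-square of the scaled coordinates fed to the quantiser."""
    return math.sqrt(sum((scale * x) ** 2 for x in rotated) / wht_len)


rms_old = rms_after_scaling(math.sqrt(wht_len) / res_norm)  # pre-fix scale
rms_new = rms_after_scaling(1.0 / res_norm)                 # post-fix scale
print(f"||rotated||^2 / ||res||^2 = {rot_sq / res_norm**2:.1f}")
print(f"RMS per coordinate: old {rms_old:.2f}, new {rms_new:.2f}")
```

With the old scale the per-coordinate RMS equals sqrt(wht_len), about 5.66 for wht_len=32; with the fixed scale it is 1, which is what a unit-variance codebook expects.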
cursor Bot pushed a commit that referenced this pull request on Apr 22, 2026
Buckets on the HF (+7.82%) vs vLLM (+35.33%) 27-pp gap:

  #1 Engine baseline shift             ~10 pp (clean-model PPL disagreement; 0.145 KL; 18% top-1 disagreement)
  #2 Codec residual magnitude          ~0 (codec is engine-agnostic; mse ratio 1.01)
  #3 Noise-sensitivity curve           HF MORE sensitive per σ in linear regime; not the cause
  #4 Boundary layers already skipped   +69 pp saved by SPRINT_CLOSEOUT boundary policy
  #5 Cross-layer non-linear compound   +39 pp (joint-cell - Σ singletons over 22 quiet layers)

Localised root cause: vLLM's single-forward bf16 residual-stream accumulation
through Flash-Attention compounds per-layer codec residuals ~39 pp above their
sum, while HF eager's f32-accumulate + teacher-force over DynamicCache
compounds them less aggressively. Each per-layer residual is small on both
engines (Phase 4 matched); what differs is the accumulation path.

Deployment recommendations:
  1. Extend vLLM boundary skip to {2, 6, 11} on top of the existing
     {0,1,7,14,26,27}; cuts ~10-15 pp off the joint Delta-ppl.
  2. Adaptive per-layer bit-width: K b=4 on the hot layers, b=3 elsewhere;
     preserves 19/28 of the ratio benefit.

Phase 3 ran only on vLLM (reused production harness); the HF per-layer curve
is left as a follow-up if someone wants to confirm that HF's cross-layer
interaction is the ~+10 pp we infer here.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
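The two deployment recommendations can be combined into a single per-layer plan. The sketch below is hypothetical throughout: layer_bit_plan is not a function from the repository, the 28-layer count is inferred from the 22 quiet layers plus the six already-skipped boundary layers, and the hot_layers set is a placeholder because the commit does not list which layers are hot.

```python
def layer_bit_plan(num_layers, skip_layers, hot_layers, hot_bits=4, default_bits=3):
    """Map layer index -> K bit-width, or None when the layer is left uncompressed."""
    plan = {}
    for idx in range(num_layers):
        if idx in skip_layers:
            plan[idx] = None          # boundary skip: keep this layer uncompressed
        elif idx in hot_layers:
            plan[idx] = hot_bits      # more bits where the codec hurts most
        else:
            plan[idx] = default_bits
    return plan


existing_skip = {0, 1, 7, 14, 26, 27}
extended_skip = existing_skip | {2, 6, 11}
# Assumption: 28 layers; hot_layers={3, 4} is a placeholder, not a measured set.
plan = layer_bit_plan(28, extended_skip, hot_layers={3, 4})
compressed = [idx for idx, bits in plan.items() if bits is not None]
print(len(compressed), "of 28 layers compressed")
```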
This was referenced Apr 24, 2026
cursor Bot pushed a commit that referenced this pull request on Apr 27, 2026
…ady text
Per user instruction, committing directly to main (no PR) with three
items from the 2026-04-27 retrieval audit:
1. README.md — top-of-file restructure:
- New one-line tagline matches the description users will paste into
repo Settings: 'Drop-in transformers.DynamicCache subclass.
2.4-2.8x KV cache compression at <1% perplexity loss on Qwen3,
Llama-3, DeepSeek, GLM-4, Gemma.'
- PyPI badge, GitHub stars badge, HF Space badge.
- 'If this helps your inference stack, please give us a star' CTA
immediately after the badges. This is the single highest-ROI
copy change given the audit finding that 4 stars is a hard
GEO-authority drag and the PyPI already has 103 dl/week that
could convert to stars if prompted.
- Kept the existing 'GPU-native lattice-quantisation codec for
transformer KV caches. Measured across 4 open-source model
families...' paragraph below unchanged so the body numbers
continue to match the public JSON.
2. docs/GEO_AUDIT_2026-04-27.md — full retrieval audit across 10
queries in 3 tiers:
* Tier A (name lookup): passes on 3 of 4 queries via the
libraries.io PyPI mirror. One failure: Perplexity misattributed
'KakeyaLattice' to arxiv.org/pdf/2504.05646 (a different
'Lattice' paper on orthogonal state recurrence). This is the
strongest argument for submitting our own arXiv.
* Tier B (topic lookup, user doesn't know the name): we are
absent from 'best LLM KV cache compression library Python 2026'
(top 5 is all NVIDIA/kvpress) and 'E8 lattice KV cache vLLM'
(top 5 is NexusQuant + vLLM FP8 docs). We rank #1 on the narrow
'nested lattice KV compression Qwen3 DeepSeek' but only via the
libraries.io PyPI mirror, not our GitHub repo.
* Tier C (peer comparison): AI engines list us with correct
numbers ('2.4-3x better than TurboQuant') but the hyperlinks
they emit point at NexusQuant/TurboQuant, not us.
* Landing-surface probe: PyPI OK, Libraries.io OK, HF Space OK;
GitHub repo page has a typo in description ('efficent'), empty
topics field, and 4 stars — all three penalising.
* Competitive snapshot: NexusQuant is running a full launch
playbook and has taken the 'lattice KV' query cluster. Their
one gap is 'no arXiv preprint', which is the one lever we can
use to overtake.
* Three ROI-ranked fixes with exact paste-ready text for the
GitHub repo Settings (description, homepage, 11 topics) plus
the equivalent gh CLI one-liner.
* Baseline for a 2026-05-25 re-audit with expected deltas.
3. No new data claims. All numeric references in the audit point
back to existing reports under reports/v1_4_release/ and
reports/v1_5_release/.
The one GEO fix this commit cannot perform — editing the GitHub
repo's Settings (description + homepage + topics) — is NOT a repo
file; it is UI / API state on github.com. The cloud agent's gh CLI
is read-only. The audit file has the exact text to paste; expected
completion time is under 5 minutes in the Settings UI.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Summary
This PR contains the full body of work on the Kakeya KV cache compression layer:
- A fix for a bug in KakeyaKVCache that made it a no-op on Gemma 4's default config.
- A model-agnostic factory (build_kakeya_cache(model)) that works for any HF transformers decoder-only model: Gemma 2/3/4, Llama, Mistral, Qwen 2/3, SmolLM2, Cohere2, and similar.
- A benchmark harness (kakeya_benchmark.py) with chunked prefill, analytic baseline, and per-layer JSON reports.
- An extrapolator (kakeya_extrapolate.py) that projects any report to arbitrary context lengths (cross-validated ≤ 0.003 absolute ratio error).
- A benchmark standard (reports/STANDARD.md): measured 2k–32k plus projected 64k–256k.
- Per-model reports (reports/{gemma4_e2b, qwen2_5_0_5b, smollm2_1_7b, qwen3_0_6b}/), each with a narrative REPORT.md and all raw JSON.
- A cross-model comparison (reports/CROSS_MODEL.md) and top-level README.md.

Headline numbers
Under the standard codec preset (block_size=512, residual_length=256, d_res=8, K=16, variance_ratio=0.95), at 131 072 token context with bf16 store projection:

Model                                 Baseline KV   Kakeya KV   Total ratio
Qwen/Qwen3-0.6B                       14.00 GiB     3.10 GiB    4.51x
google/gemma-4-E2B-it                 774 MiB       180 MiB     4.29x
HuggingFaceTB/SmolLM2-1.7B-Instruct   24.00 GiB     10.65 GiB   2.25x
Qwen/Qwen2.5-0.5B-Instruct            1.50 GiB      714 MiB     2.15x

All four models use the same build_kakeya_cache(model) factory and the same run_all_benchmarks.sh orchestrator. No model-specific code.

What changed in the codec
Bug fix

Gemma4TextConfig.num_kv_shared_layers defaults to 0, which made the old code do layer_types[:-0] → empty list → cache with zero layers. Guarded with > 0.

Model-agnostic factory
Layer-plan inference moved to _resolve_layer_plan(config) and handles all the standard HF patterns:
- layer_types exists (Gemma 2/3/4, Cohere2, Qwen3 hybrid)
- num_kv_shared_layers > 0 (Gemma 4)
- no layer_types, sliding_window=None (Llama, SmolLM2, Qwen2, Qwen3-dense)
- sliding_window only / attention_chunk_size (Llama 4)

What the benchmark does
kakeya_benchmark.py loads the target model, runs a long-prompt prefill through both DynamicCache (baseline) and KakeyaKVCache, and reports byte counts + ratios per layer and per model section (full-attn vs sliding-window). It supports:
- --skip-baseline-prefill: compute the baseline bytes analytically (exact, since it's determined by config + seq_len); saves half the CPU budget for long contexts.
- --prefill-chunk N: chunked prefill to bound activation memory on CPU.
- --skip-generation: skip greedy-decode sanity check (for very long contexts).

What the extrapolator does
For fixed codec params, per-block skeleton bytes and per-token encoded bytes are deterministic constants; they do not depend on context length. kakeya_extrapolate.py reads any measured report and scales it exactly to arbitrary target contexts. Cross-validation summary:

So the 64k / 128k / 256k rows in every report are byte-exact under the preset, not statistical fits.
Benchmark standard

reports/STANDARD.md is the formal methodology doc. Key points:
- total_ratio_bf16_store at 32k tokens.

Reproducing any row
    pip install -U torch transformers accelerate huggingface_hub
    python3 -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen3-0.6B', local_dir='models/Qwen3-0.6B', allow_patterns=['*.json','*.safetensors','tokenizer*','*.model'])"
    ./run_all_benchmarks.sh models/Qwen3-0.6B qwen3_0_6b

Produces:
Environment used

CPU-only x86_64, 15 GB RAM, no GPU, torch==2.11.0+cu130, transformers==5.5.4. Public (not gated) open-source models only.

Follow-up
The only pending optimization explicitly called out in the reports is the bf16-store conversion inside kakeya_kv_codec.py (currently compressed tensors are float32 on CPU). The reports make this visible by always showing the "bf16 store, projected" column next to the "f32 store" column, so it's clear what absolute gain that optimization would unlock.