Kakeya vs TurboQuant+ head-to-head on the same open-source models #3
Merged
Cross-compresses the same captured KV tensors with both methods and
reports byte-exact side-by-side compression ratios + reconstruction
MSE for 7 open-source models at multiple context lengths.
Headline findings (128k bf16, head_dim=128 models):
- TurboQuant turbo2 / turbo3 / turbo4: 7.5 / 5.1 / 3.9x (constant,
context-independent per-token scalar quantization).
- Kakeya at same context: 2.1-4.5x depending on model (grows with
context as block skeleton amortizes).
- TurboQuant wins on raw ratio for ctx < 32k on all models.
- Kakeya wins on raw ratio on Qwen3 @ 32k+, Gemma 4 / Qwen3 @ 128k.
- On K cache MSE Kakeya is 8-400x lower than turbo3 on Qwen /
DeepSeek-R1-Distill (the models TurboQuant's own README flags
for the q8_0-K+turbo-V rescue config).
- On V cache MSE TurboQuant is 2-10x lower.
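The constant TurboQuant ratios quoted above can be reproduced with simple byte math. This is a minimal sketch, assuming each 128-dim vector stores `bits` per coordinate plus one 16-bit norm; that overhead model is an assumption chosen because it matches the quoted figures, not something taken from the TurboQuant+ source, and `scalar_quant_ratio` is a hypothetical helper name.

```python
def scalar_quant_ratio(bits, head_dim=128, baseline_bits=16, norm_bits=16):
    """Compression ratio vs bf16 for per-token scalar quantization.

    Assumed layout (hypothetical): `bits` per coordinate plus one
    `norm_bits`-wide norm per head_dim-sized vector.
    """
    stored_bits = bits * head_dim + norm_bits
    return baseline_bits * head_dim / stored_bits


for name, bits in [("turbo2", 2), ("turbo3", 3), ("turbo4", 4)]:
    # The result does not depend on context length: every token costs the same.
    print(f"{name}: {scalar_quant_ratio(bits):.1f}x")
```

Because the cost per token is fixed, the ratio is the same at 2k and 128k tokens, which is the context independence the headline refers to.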
Files:
  compare_kakeya_vs_turboquant.py             harness: load model -> capture KV -> run both codecs -> JSON
  run_comparison_matrix.sh                    one-shot orchestrator for the 7-model table
  reports/compare/<model>/compare_<ctx>.json  per-row byte-level JSON report
  reports/compare/SUMMARY.md                  side-by-side tables + analysis
Notes:
- TurboQuant+ Python prototype was cloned into turboquant_plus/
from github.com/TheTom/turboquant_plus and installed with
'pip install -e'. The directory is gitignored; repo users
clone it themselves.
- Kakeya codec, benchmark harness and extrapolator are unchanged.
- Comparison is memory/MSE only. End-to-end quality (PPL, NIAH)
would need a shared evaluation protocol.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
cursor Bot pushed a commit that referenced this pull request on Apr 18, 2026
Before implementing the byte-math projection that said 'aggressive PCA
(variance_ratio=0.85) + outlier channel (d_res=4) closes 29% of the
v1.2 vs turbo3 gap', ran the quality ablation.
New:
- kakeyaturbo/src/bin/kakeyaturbo-deff-outlier-ablation.rs: Rust
binary that fits PCA at a grid of variance_ratio values and
reconstructs blocks with 0/2/4/8 outlier channels, reporting
per-block MSE inflation vs the v1.2 baseline (vr=0.95, d_res=0).
- benchmarks/run_deff_outlier_ablation.py: Python driver that
captures real K tensors from all 7 HF models at ctx=4096 and
drives the Rust binary on every full-attention layer's K stream.
Same methodology as PR #6.
- benchmarks/aggregate_deff_ablation.py: Decision-table aggregator.
- reports/k_deff_outlier_ablation/<model>/*.json: 190 measurements
(7 models × ~30 full-attn layers × 1 K stream per layer) across
the 5 × 4 grid.
- reports/k_deff_outlier_ablation/DECISION.md: full decision write-up.
Headline finding (cross-model mean K MSE inflation over v1.2 baseline):
  variance_ratio  d_res=0  d_res=2  d_res=4  d_res=8
  0.95            1.00x    0.82x    0.71x    0.57x    (ACCEPT)
  0.90            2.00x    1.66x    1.45x    1.16x    (REJECT/REJECT/REJECT/MARGINAL)
  0.85            3.00x    2.49x    2.19x    1.76x    (all REJECT)
  0.80            3.99x    3.32x    2.92x    2.36x    (all REJECT)
  0.70            5.96x    4.96x    4.35x    3.52x    (all REJECT)
Per-model breakdown of the proposed 'Option B' (vr=0.85, d_res=4):
  Qwen2.5-0.5B    1.80x  (below 30% threshold only because hd=64)
  Qwen3-0.6B      2.11x
  gemma-4-E2B-it  2.67x  (hd=512)
  DeepSeek-R1     2.28x
  GLM-Edge-1.5B   2.31x
  SmolLM2-1.7B    1.83x  (hd=64)
  GLM-Edge-4B     2.31x
Every model except hd=64 ones exceeds 2× inflation.
Verdict: REJECT 'Option B'. The byte saving of +17% K ratio costs
+119% K MSE -- the same regime that breaks symmetric turbo3 on
Qwen-family (PR #3 documented). Proceed only with quality-preserving
alternatives listed in DECISION.md (cross-layer skeleton sharing,
smaller K-means codebook, larger block_size, or accept the gap).
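
The +119% figure in the verdict follows directly from the per-model Option B numbers. The short check below uses only the values listed in this message; the dictionary name `option_b` is illustrative.

```python
from statistics import mean

# Per-model K MSE inflation for Option B (vr=0.85, d_res=4), copied from above.
option_b = {
    "Qwen2.5-0.5B": 1.80,
    "Qwen3-0.6B": 2.11,
    "gemma-4-E2B-it": 2.67,
    "DeepSeek-R1": 2.28,
    "GLM-Edge-1.5B": 2.31,
    "SmolLM2-1.7B": 1.83,
    "GLM-Edge-4B": 2.31,
}

cross_model_mean = mean(option_b.values())
extra_mse_pct = (cross_model_mean - 1.0) * 100
below_2x = sorted(m for m, x in option_b.items() if x < 2.0)

print(f"mean inflation: {cross_model_mean:.2f}x")   # matches the 2.19x table cell
print(f"extra K MSE:    +{extra_mse_pct:.0f}%")     # matches the +119% in the verdict
print(f"below 2x:       {below_2x}")                # the two hd=64 models
```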
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
cursor Bot pushed a commit that referenced this pull request on Apr 22, 2026
Buckets on the HF (+7.82%) vs vLLM (+35.33%) 27-pp gap:
  #1 Engine baseline shift: ~10 pp (clean-model PPL disagreement; 0.145 KL; 18% top-1 disagreement)
  #2 Codec residual magnitude: ~0 (codec is engine-agnostic; mse ratio 1.01)
  #3 Noise-sensitivity curve: HF MORE sensitive per σ in linear regime; not the cause
  #4 Boundary layers already skipped: +69 pp saved by SPRINT_CLOSEOUT boundary policy
  #5 Cross-layer non-linear compound: +39 pp (joint-cell - Σ singletons over 22 quiet layers)
Localised root cause: vLLM's single-forward bf16 residual-stream
accumulation through Flash-Attention compounds per-layer codec
residuals ~39 pp above their sum, while HF eager's f32-accumulate +
teacher-force over DynamicCache compounds them less aggressively.
Each per-layer residual is small on both engines (Phase 4 matched);
what differs is the accumulation path.
Deployment recommendations:
  1. Extend vLLM boundary skip to {2, 6, 11} on top of the existing
     {0,1,7,14,26,27}; cuts ~10-15 pp off the joint Delta-ppl.
  2. Adaptive per-layer bit-width: K b=4 on the hot layers, b=3
     elsewhere; preserves 19/28 of the ratio benefit.
Phase 3 ran only on vLLM (reused production harness); the HF
per-layer curve is left as a follow-up if someone wants to confirm
that HF's cross-layer interaction is the ~+10 pp we infer here.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
FluffyAIcode pushed a commit that referenced this pull request on Apr 22, 2026
…_LAYERS
CPU-reference snapshot harness supports boundary skip via its
--boundary-skip-layers CLI flag. The in-backend equivalent — which
kicks in when vLLM's engine calls _seal_and_write_block from the
attention forward — was missing. Add it.
Environment-var driven so a single backend binary can serve
different recipes without rebuild:
KAKEYA_BOUNDARY_SKIP_LAYERS="0,1,7,14,26,27" # PR #17 DeepSeek-1.5B
KAKEYA_BOUNDARY_SKIP_LAYERS="0,1,34,35" # Qwen3-4B
KAKEYA_BOUNDARY_SKIP_LAYERS="" # no skip (default)
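
A parser for this variable could look like the sketch below. It is not the code from this commit: `parse_skip_layers` and `boundary_skip_layers` are hypothetical names, and the behaviour (empty or unset gives no layers, whitespace is tolerated, malformed input raises, the result is cached) is inferred from the test list further down.

```python
import os
from functools import lru_cache

ENV_VAR = "KAKEYA_BOUNDARY_SKIP_LAYERS"


def parse_skip_layers(raw):
    """Parse a comma-separated list of layer indices into a frozenset."""
    if raw is None or raw.strip() == "":
        return frozenset()  # unset or empty string: no boundary skip
    layers = set()
    for token in raw.split(","):
        token = token.strip()  # tolerate "0, 1 ,7"
        if not token.isdecimal():
            raise ValueError(f"{ENV_VAR}: bad layer index {token!r}")
        layers.add(int(token))
    return frozenset(layers)


@lru_cache(maxsize=None)
def boundary_skip_layers():
    """Read the environment once per process and cache the parsed result."""
    return parse_skip_layers(os.environ.get(ENV_VAR))
```

Caching means a change to the environment after the first call is not picked up, which is usually what a long-running backend process wants.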
Semantics: when _seal_and_write_block sees a layer on the skip list,
the codec is bypassed and the bf16 K/V are stashed in
state.bf16_shadow (per-layer dict, already existed for the
KAKEYA_DEBUG_BF16_SHADOW probe path). _decode_sealed consults the
shadow FIRST and returns the bf16 tensors verbatim — bit-exact
round-trip, zero codec distortion on boundary layers. The paged-
cache byte slot for boundary blocks is zeroed as a deterministic
placeholder (never read).
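
The shadow-first control flow described above can be illustrated with a toy model. Everything here is a stand-in: plain Python lists replace bf16 tensors, `toy_encode` / `toy_decode` replace the real codec, and `State`, `seal_and_write_block` and `decode_sealed` only mirror the names in this message, not their actual signatures.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    skip_layers: frozenset
    bf16_shadow: dict = field(default_factory=dict)  # layer -> {block_id: values}
    paged_cache: dict = field(default_factory=dict)  # (layer, block_id) -> bytes


def toy_encode(values):
    # Stand-in lossy codec: round each value into one signed byte.
    return bytes(int(round(v)) & 0xFF for v in values)


def toy_decode(blob):
    return [float(b - 256 if b > 127 else b) for b in blob]


def seal_and_write_block(state, layer, block_id, values):
    if layer in state.skip_layers:
        # Boundary layer: bypass the codec and stash the raw values.
        state.bf16_shadow.setdefault(layer, {})[block_id] = list(values)
        # Zeroed placeholder in the cache slot; it is never read back.
        state.paged_cache[(layer, block_id)] = bytes(len(values))
    else:
        state.paged_cache[(layer, block_id)] = toy_encode(values)


def decode_sealed(state, layer, block_id):
    shadow = state.bf16_shadow.get(layer, {})
    if block_id in shadow:  # consult the shadow FIRST
        return shadow[block_id]
    return toy_decode(state.paged_cache[(layer, block_id)])


state = State(skip_layers=frozenset({0}))
seal_and_write_block(state, 0, 0, [1.25, -2.5])  # boundary layer
seal_and_write_block(state, 1, 0, [1.25, -2.5])  # codec layer
exact = decode_sealed(state, 0, 0)
lossy = decode_sealed(state, 1, 0)
```

In this toy run the boundary layer returns its input unchanged while the codec layer returns rounded values, and only the boundary layer has an entry in the shadow.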
Distinct from KAKEYA_SKIP_LAYERS (plugin.py) which tells the Σ_q /
centroid calibration bundle loader which layers to exclude. The
two happen to coincide in the PR #17 recipe but are semantically
independent: a deploy can skip layer 0 from calibration while still
running codec on it, or vice versa.
Off-cache memory cost: ~128 KB per boundary block per layer
(n × n_kv × head_size × 2 bytes, 4 boundary layers × up to ctx /
512 blocks). At the snapshot-harness workload (~4 blocks / layer)
this is a few MB total — negligible. The paged-cache allocator
doesn't see this memory; noted in the docstring for when long-ctx
workloads make it worth tracking.
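
The off-cache cost can be estimated with the formula quoted above. In this sketch the shape values (`n_kv=1`, `head_size=128`) are assumptions picked only because they reproduce the ~128 KB figure; the message does not state the actual shape, nor whether the formula counts K and V separately. The helper names are hypothetical.

```python
def shadow_bytes_per_block(n, n_kv, head_size, bytes_per_elem=2):
    """n x n_kv x head_size x 2 bytes, as quoted in the commit message."""
    return n * n_kv * head_size * bytes_per_elem


def shadow_total_bytes(ctx, boundary_layers, n_kv, head_size, block_size=512):
    """Shadow memory for all boundary layers at a given context length."""
    blocks = ctx // block_size  # up to ctx / 512 blocks per layer
    per_block = shadow_bytes_per_block(block_size, n_kv, head_size)
    return boundary_layers * blocks * per_block


# Assumed example shape (not from the source): n_kv=1, head_size=128.
per_block = shadow_bytes_per_block(n=512, n_kv=1, head_size=128)
snapshot = shadow_total_bytes(2048, boundary_layers=4, n_kv=1, head_size=128)
long_ctx = shadow_total_bytes(131072, boundary_layers=4, n_kv=1, head_size=128)

print(per_block // 1024, "KiB per block")
print(snapshot / 2**20, "MiB at 2k tokens (4 blocks per layer)")
print(long_ctx / 2**20, "MiB at 128k tokens (256 blocks per layer)")
```

Under these assumed values the cost grows linearly with context, from a few MiB at snapshot scale to over a hundred MiB at 128k tokens, which is the regime where tracking it would start to matter.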
Tests (vllm_backend/kakeya_v1_3_ppl/tests/test_boundary_skip.py):
* Parser: 7 tests on CPU (unset → empty, single layer, DS-1.5B
recipe, whitespace-tolerant, malformed raises, cached).
* Seal/decode on CUDA: 3 tests (boundary layer bf16-exact
roundtrip, non-boundary layer runs codec unchanged, mixed
boundary/codec layers don't cross-contaminate shadows).
* 10/10 PASS on H200.
Existing test suite: 52/52 pass (the pre-existing
test_seal_exactly_one_full_block failure in
test_phase_b_end_to_end.py is unchanged by this commit —
verified by running the suite at 7e109ea both with and without
this patch; same rel_err=0.678 assertion failure in both).
Summary
Byte-for-byte comparison of Kakeya KV cache compression vs TurboQuant+ (PolarQuant + WHT + scalar quantization, from TheTom/turboquant_plus) on the same captured KV tensors, same models, same context lengths.
Both methods were run on the same machine (CPU-only, 15 GiB RAM, BF16, eager attention) using the same harness. TurboQuant+ was installed from its official Python prototype; Kakeya uses the codec and standard preset already in this repo (block_size=512, residual_length=256, d_res=8, K=16, variance_ratio=0.95).
Headline (bf16 baseline, total cache)
TurboQuant+'s ratio is context-independent (per-token scalar
quantization). Kakeya's ratio grows with context (block skeleton
amortizes). Both facts are visible in the same table:
2k tokens
128k tokens (Kakeya = byte-exact extrapolation)
Bold Kakeya entries = it surpasses turbo4 there. Kakeya never surpasses turbo3.
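One part of the context dependence can be read off the standard preset: tokens that are not yet in a sealed block stay uncompressed. The sketch below counts sealed blocks under an assumed sealing rule (a block is sealed once block_size + residual_length tokens exist, consistent with the ~768-token threshold noted under Limitations). It does not model the skeleton byte layout, and `sealed_blocks` is a hypothetical helper.

```python
def sealed_blocks(ctx, block_size=512, residual_length=256):
    """Number of fully sealed blocks at a given context length (assumed rule)."""
    return max(0, (ctx - residual_length) // block_size)


def sealed_fraction(ctx, block_size=512, residual_length=256):
    """Share of tokens that live in compressed blocks."""
    return sealed_blocks(ctx, block_size, residual_length) * block_size / ctx


for ctx in (767, 768, 2048, 131072):
    print(f"ctx={ctx:>6}: {sealed_blocks(ctx):>3} blocks, "
          f"{sealed_fraction(ctx):.1%} of tokens compressed")
```

Under this rule nothing is compressed below 768 tokens, three quarters of the tokens are compressed at 2k, and nearly all of them at 128k, so the uncompressed tail weighs less and less on the total ratio as context grows.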
Where Kakeya is better (reconstruction quality on K)
Measured on the same captured KV tensors — first full-attention layer,
first compressed block (512 rows):
On the exact models TurboQuant+'s own README flags for the q8_0-K + turbo-V rescue config, Kakeya handles K cache reconstruction orders of magnitude better than symmetric turbo3. The reason is simple: TurboQuant+'s codebooks are calibrated for unit-norm vectors, and the K norms in the Qwen family are large and heterogeneous. Kakeya's PCA on raw values absorbs that variance without per-vector renormalization.
On V cache, TurboQuant+ is 2–10× lower MSE, a consequence of its per-coordinate bit-depth doing well on near-Gaussian rotated V values, while Kakeya's d_res=8 sparse residual discards per-row information V cache actually uses.
Architecture observations
This should land around 8–10× combined ratio with better quality than either alone.
What's in this PR
- compare_kakeya_vs_turboquant.py
- run_comparison_matrix.sh
- reports/compare/<model>/compare_<ctx>.json
- reports/compare/SUMMARY.md
- .gitignore (excludes turboquant_plus/ from this repo)
Reproducing
Limitations
- Kakeya needs block_size + residual_length tokens to begin compressing. For prompts shorter than ~768 tokens Kakeya is effectively DynamicCache.
- transformers.Cache subclass. The "zero kernel work" advantage of Kakeya is real but comes with a CPU-side runtime cost the TurboQuant+ C port does not pay.
Environment
CPU-only x86_64, 15 GB RAM, BF16, eager attention, torch==2.11.0+cu130, transformers==5.5.4. TurboQuant+ Python prototype version captured at clone time.