Unified KV Cache Compression Methods for Auto-Regressive Models
Updated Jan 4, 2025 - Python
LLM KV cache compression made easy
Awesome-LLM-KV-Cache: A curated list of 📙 Awesome LLM KV Cache papers with code.
[NeurIPS'25 Oral] Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3)
llama.cpp fork with TurboQuant WHT-rotated KV cache & weight compression + Gemma 4 MTP and Qwen 3.6 NextN speculative decoding (+30-50% throughput).
Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024)
[ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection
First open-source implementation of Google TurboQuant (ICLR 2026) -- near-optimal KV cache compression for LLM inference. 5x compression with near-zero quality loss.
PyTorch implementation of "Compressed Context Memory For Online Language Model Interaction" (ICLR'24)
This is the official repo of "QuickLLaMA: Query-aware Inference Acceleration for Large Language Models"
xKV: Cross-Layer SVD for KV-Cache Compression
[ACL 2025 Oral] SCOPE: Optimizing KV Cache Compression in Long-context Generation
Accurate and fast KV cache compression with a gating mechanism
Native Windows build of vLLM 0.21.0 — no WSL, no Docker. Pre-built wheels + 36-file Windows patch + 10 KV cache compression dtypes (6 Multi-TurboQuant + 4 upstream TurboQuant). PyTorch 2.11 + CUDA 12.6 + Triton + Flash-Attention 2.
First open-source KVTC implementation (NVIDIA, ICLR 2026) -- 8-32x KV cache compression via PCA + adaptive quantization + entropy coding
A list of awesome papers on compression and acceleration of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs).
Official repo for AMD hybrid models training and inference workflow
Discrete Kakeya cover for LLM KV cache: D4/E8 nested-lattice quantisation realising a Kakeya-style tube-cover over the direction sphere. 2.4x-2.8x compression at <1% perplexity loss on Qwen3, Llama-3, DeepSeek, GLM-4, Gemma. Drop-in transformers.DynamicCache. pip install kakeyalattice.
LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation
AI agent skill implementing Google's TurboQuant compression algorithm (ICLR 2026) — 6x KV cache memory reduction, 8x speedup, zero accuracy loss. Compatible with Claude Code, Codex CLI, and all Agent Skills-compatible tools.