Scenario A: snapshot-mode KV compression on vLLM: Δppl +29% / top-1 74% (improves top-1 by 15 pp, Δppl only by 6 pp) #17

Merged
FluffyAIcode merged 31 commits into main from AgentMemory/v1-3-ppl-snapshot-mode-vllm-102e on Apr 23, 2026

@FluffyAIcode FluffyAIcode commented Apr 22, 2026

Summary

Reproduces HF's two-pass DynamicCache semantics inside vLLM
(clean prefill → codec snapshot → teacher-force with codec'd cache).
This is the right semantics for the production use case of
"compress the already-populated KV cache to save memory during
generation" (Scenario A).

Result on DS-Distill-Qwen-1.5B / vLLM 0.7.3 / 4 WikiText-103 passages / ctx=2048:

| Mode | Harness | Δppl | top-1 | Verdict |
|---|---|---|---|---|
| HF 2-pass DynamicCache (SPRINT_CLOSEOUT reference) | HF eager | +7.82 % | 78.97 % | MARGINAL |
| vLLM snapshot-mode (this PR) | vLLM FA | +29.07 % | 74.22 % | REJECT |
| vLLM in-forward (PR #15 production) | vLLM FA | +35.33 % | 59.38 % | REJECT |

  • vs in-forward: −6 pp Δppl, +15 pp top-1 (from 59 % to 74 %,
    within 5 pp of HF's 79 %).
  • vs HF: still +21 pp Δppl, but now only −5 pp top-1.
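
The percentage-point deltas quoted in the two bullets can be re-derived from the three result rows. A minimal check, using only the figures from the table (the dictionary keys are illustrative labels, not harness identifiers):

```python
# (Δppl %, top-1 %) per mode, copied from the results table above.
results = {
    "hf_two_pass": (7.82, 78.97),
    "vllm_snapshot": (29.07, 74.22),
    "vllm_in_forward": (35.33, 59.38),
}

def pp_delta(a, b):
    """Percentage-point difference (a - b) for Δppl and top-1."""
    return (round(results[a][0] - results[b][0], 2),
            round(results[a][1] - results[b][1], 2))

print(pp_delta("vllm_snapshot", "vllm_in_forward"))  # vs in-forward
print(pp_delta("vllm_snapshot", "hf_two_pass"))      # vs HF
```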

Comparison vs upstream vLLM PR #38479 (TurboQuant, merged 2026-04-15)

Full write-up: reports/v1_3_ppl/snapshot_mode/COMPARISON_VLLM_PR_38479.md.

PR #38479's headline vs community-verified numbers

The upstream PR's description presents Qwen3-4B GSM8K going from
0.900 (baseline) → 0.720 at 4.9× compression (tq3). Independent
reproductions in the PR comment thread tell a different story:

  • @mgoin (Neural Magic): tq3 on Qwen3-4B → GSM8K 0.009
    (0.9 %, effectively broken). tq4 crashes.
  • @varjoranta (A100): default tq3 produces garbage; coherent
    output only after TQ_VALUE_BITS=8 (FP8 V, drops compression to
    ~2×).
  • @MidasMining (A4000): tq3 default at 71 % accuracy, FP8
    values required for 100 %.

Reliable operating point: k8v4 (FP8 keys + 4-bit uniform V +
norm correction + per-head quantization), at ~2.6× compression and
96 % of baseline GSM8K. That is one compression tier less aggressive
than our 4.61× target. When the community pushed their codec to
match our compression ratio (3-bit K + 2-3-bit V), quality collapsed
just as badly as ours does on vLLM.

Algorithmic comparison

| Feature | PR #38479 | Our PR #17 |
|---|---|---|
| Compression point | Online, fused Triton store+decode | Offline snapshot via Rust CPU subprocess |
| K transform | WHT on raw K, Gaussian Lloyd-Max | PCA rank=D/2 + WHT + calibrated Lloyd-Max |
| V transform | Uniform quant + norm correction | PCA + WHT + share_basis Lloyd-Max (calibrated) |
| Q-preconditioning | | ✅ (Chol Σ_q on K) |
| Outlier compensation | | ✅ (T=2.0 on ~4.5 % K coords) |
| Norm correction | ✅ (re-normalise centroid before inverse rotate) | |
| Per-head quant | | ❌ (shared block-level PCA/K-means) |
| Verified stable compression | ~2.6× (k8v4) | 4.61× on HF; 4.61× on vLLM REJECT |

Our codec has more algorithmic guardrails; their engineering is
substantially tighter. PR #38479's decompression runs inside the FA
Triton kernel, so there's no fp32↔bf16 round-trip and the codec's
error integrates through the attention kernel with minimal extra
numerical noise. PR #17 goes GPU bf16 → CPU fp32 → Rust subprocess →
disk → numpy → GPU fp32 → bf16 → FA, which introduces the "intrinsic
engine" bucket we measured at ~11 pp in PR #16's revised decomposition.
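
To give a sense of the precision involved in a single fp32 → bf16 hop, here is a small stdlib-only sketch. It is not code from the harness: `to_bf16` is a hypothetical helper that rounds to nearest-even by manipulating the fp32 bit pattern, and it ignores NaN/inf handling.

```python
import struct

def to_bf16(x: float) -> float:
    """Round a float to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bf16 keeps the top 16 bits of the fp32 pattern; add a rounding bias first.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

def max_rel_error(values):
    """Largest relative error introduced by one fp32 -> bf16 rounding."""
    return max(abs(to_bf16(v) - v) / abs(v) for v in values if v != 0.0)

samples = [0.1 * i + 0.013 for i in range(1, 200)]
print(max_rel_error(samples))  # roughly bounded by 2**-8 for normal values
```

Rounding is idempotent, so repeated bf16 ↔ fp32 conversions of an already-rounded tensor add no further error; only the first conversion of freshly decoded fp32 values costs precision.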

Concrete ways to close the gap (ordered by effort)

Since PR #38479 has landed upstream, the productive target is
combining their engineering with our algorithm:

  1. Add norm correction to our snapshot-mode harness (small code
    change; ~1-2 pp Δppl improvement expected).
  2. Switch to per-head quantization (Python-side restructuring
    of the block pooling; ~1-3 pp).
  3. Port our codec math into a fused Triton decode kernel (big
    eng project; should close most of the remaining ~11 pp
    engine bucket).
  4. Contribute Q-precond + calibrated Lloyd-Max + outlier
    compensation back to PR #38479's backend: the long-term
    convergence point where both efforts become one.

The expected end state is Δppl ~+10 %, top-1 > 78 % at 4.6×
compression on vLLM: closing the HF↔vLLM gap without giving up
compression ratio, by combining the two efforts' best pieces.

What this PR ships

  • benchmarks/e2e_ppl_validation_vllm_snapshot.py: three-phase
    hook (capture, replace, off) that emulates HF's
    DynamicCache pattern inside vLLM.
  • benchmarks/run_v1_3_ppl_snapshot_vllm.sh: driver with the
    same env-var knobs as the production harness.
  • reports/v1_3_ppl/snapshot_mode/FINDINGS.md: per-mode results.
  • reports/v1_3_ppl/snapshot_mode/COMPARISON_VLLM_PR_38479.md:
    full head-to-head analysis against the merged upstream PR.
  • reports/v1_3_ppl/snapshot_mode/ds_distill_qwen_1_5b_snapshot_vllm_snapshot.json:
    per-passage raw metrics.
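
The three-phase hook (capture, replace, off) can be pictured as a small state machine. The sketch below is illustrative only: `Phase`, `SnapshotHook` and the `codec` callable are hypothetical names, and the real harness operates on vLLM tensors rather than Python lists.

```python
from enum import Enum

class Phase(Enum):
    CAPTURE = "capture"   # clean prefill: record K/V, pass through unchanged
    REPLACE = "replace"   # teacher-force: substitute the codec'd snapshot
    OFF = "off"           # reference run: hook is a no-op

class SnapshotHook:
    """Hypothetical per-layer hook; `codec` is any callable mapping K/V -> K/V."""

    def __init__(self, codec):
        self.codec = codec
        self.phase = Phase.OFF
        self.snapshot = {}

    def __call__(self, layer_name, k, v):
        if self.phase is Phase.CAPTURE:
            # Compress once, after the clean prefill produced this layer's K/V.
            self.snapshot[layer_name] = (self.codec(k), self.codec(v))
            return k, v
        if self.phase is Phase.REPLACE and layer_name in self.snapshot:
            return self.snapshot[layer_name]
        return k, v
```

The point of the split is that the prefill itself never sees codec error; only the later teacher-forced pass reads the compressed snapshot.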

What this PR does NOT change on the repo

The standing production harness (PR #15, in-forward hook) is not
modified. This PR stands alongside as an alternative scenario-A
implementation. PR #15 stays the research harness for studying
end-to-end Δppl under in-forward codec injection.

Relationship to other PRs


cursoragent and others added 30 commits April 21, 2026 15:13
Brings the v1.3 pieces that are shared by the e2e validation branch
onto main as a clean baseline for vLLM integration:

- kakeyaturbo/src/codec.rs: PcaMethod enum (Exact | Randomized{...}),
  CodecParams gets pca_method field, fit_pca_dispatch routes to
  exact or randomized PCA path.
- kakeyaturbo/src/pca.rs: adds fit_weighted_pca_randomized()
  (Halko-Martinsson-Tropp randomized SVD with power iterations).
- kakeyaturbo/src/lib.rs: re-export PcaMethod.
- kakeyaturbo/src/bin/kakeyaturbo-bench.rs: new CLI flags
    --pca-method exact|randomized
    --rsvd-target-rank N --rsvd-oversample N --rsvd-power-iters N
    --dump-decoded PATH   (write decoded KKTV for external drivers)
- benchmarks/e2e_ppl_validation.py: HF-transformers harness that
  prefills DynamicCache, round-trips every full-attention layer
  through the Rust codec, teacher-forces continuation tokens, and
  reports Δppl / top1 / KL.
- kakeyaturbo/tests/integration.rs: update CodecParams initializer
  for the new field.

All 153 Rust unit + integration + proptest cases pass.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
benchmarks/e2e_ppl_validation_vllm.py is a drop-in alternative to
e2e_ppl_validation.py that routes the forward pass through vLLM
rather than HF eager attention, so the measured Δppl reflects the
codec's behaviour under the production inference engine.

Integration approach:

- Monkey-patch vllm.attention.layer.Attention.forward (installed
  before LLM construction).
- The patched forward, when CodecState.active is True, round-trips
  the K and V tensors through the v1.3 Rust codec (kakeyaturbo-bench
  --dump-decoded) before passing them on to the underlying
  paged-attention kernel. K uses inner_product metric, V uses mse +
  share_basis (asymmetric config matching the HF harness).
- Each passage is evaluated twice: once with the codec OFF
  (reference) and once ON (alt), using vLLM's prompt_logprobs=1.
  Per-position PPL and top-1 agreement over [ctx_len, ctx_len+n_eval)
  are compared with the same ACCEPT/MARGINAL/REJECT verdict as the
  HF harness, so the two engines' numbers can be placed side-by-side.
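
A minimal sketch of the metrics described above, assuming Δppl is the relative change in perplexity and top-1 is agreement with the reference run's argmax token. The ACCEPT rule is the one quoted in this PR's smoke findings (|Δppl| <= 1% and top-1 >= 95%); the MARGINAL boundary is not stated here, so it is not modelled. Function names are illustrative.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-probability over the eval window."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def delta_ppl_pct(ref_logprobs, alt_logprobs):
    """Relative PPL change of the codec run vs the reference, in percent."""
    ref, alt = perplexity(ref_logprobs), perplexity(alt_logprobs)
    return 100.0 * (alt - ref) / ref

def top1_agreement_pct(ref_top1_ids, alt_top1_ids):
    """Share of positions where both runs pick the same argmax token."""
    hits = sum(r == a for r, a in zip(ref_top1_ids, alt_top1_ids))
    return 100.0 * hits / len(ref_top1_ids)

def is_accept(dppl_pct, top1_pct):
    """ACCEPT rule: |Δppl| <= 1 % and top-1 >= 95 %."""
    return abs(dppl_pct) <= 1.0 and top1_pct >= 95.0
```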

benchmarks/run_v1_3_ppl_vllm.sh: convenience driver that builds the
Rust bench binary and launches the harness with the standard smoke
config (Qwen2.5-0.5B, ctx=1024, 2 passages, b=2, rsvd).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
- Patched forward now has the exact (query, key, value, kv_cache,
  attn_metadata) signature vLLM's custom-op dispatcher expects.
- Use self.head_size from the Attention module to reshape the KV
  tensor correctly regardless of whether it enters as 2D
  [num_tokens, num_kv_heads * head_size] or already-reshaped 3D.
- Use self.layer_name (vLLM's stable per-layer identifier) as the
  default layer_id; fall back to a module-scope counter only when
  not present.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
On Vast.ai vLLM is installed into /venv/main, not into the system
python3 on PATH. Default PYTHON_BIN to the venv python when that
venv exists, and allow override via the PYTHON_BIN env var.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Adds reports/v1_3_rsvd_rope/e2e_ppl_vllm_smoke/:
- qwen2_5_0_5b_vllm.json: per-passage metrics (Qwen2.5-0.5B, ctx=1024,
  n_eval=64, 2 passages, b=2 rsvd, randomized PCA, vr=0.95).
- FINDINGS.md: engine setup, cross-engine comparison against the HF
  harness from PR #12, reproduction instructions.

Result summary:
  Δppl mean = +291.9 %   (passage 1: +192 %, passage 2: +391 %)
  top1 mean = 46.9 %
  verdict   = REJECT (threshold is |Δppl|<=1% AND top1>=95%)

This confirms on the production inference engine (vLLM 0.7.3 with
Flash-Attention on H200) what PR #12 found on HF eager: the v1.3
codec at its tier-1 setting does not preserve downstream quality.
The magnitude of the degradation is smaller on vLLM than on HF
(+292% vs +29,086%), but both clearly REJECT.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Per SPRINT_CLOSEOUT.md (PR #13), the production recipe is "v1.3 PPL"
= v1.3 RSVD + 4 guardrails (Q-preconditioning, calibrated Lloyd-Max
K codebook, 6-layer boundary protection, outlier compensation T=2.0).

The smoke result landed in this PR's last commit (+292% Δppl, 47%
top-1 on Qwen2.5-0.5B + bare v1.3 b=2) is the V0 baseline under the
sprint's ladder, NOT the production v1.3 PPL. Its number aligns with
the ladder's V0 cell (+355% Δppl, 42% top-1 on HF / DS-Distill).

Remove that datapoint from this PR; this PR now scopes only the
reusable vLLM integration scaffolding (codec port + Attention.forward
monkey-patch + harness skeleton). The production-recipe integration
is moved to a follow-up branch:

    AgentMemory/v1-3-ppl-full-guardrails-vllm-102e

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Brings the calibrated Lloyd-Max codebook + outlier compensation into
the Rust codec so the full 'v1.3 PPL' production recipe can be
driven from the vLLM harness.

New CodecParams fields:
  - custom_centroids: Option<Vec<f32>>    calibrated Lloyd-Max table
  - outlier_threshold: Option<f32>        residual threshold T
  - exact_rank_cap: Option<usize>         caps d_eff on exact PCA
  - skeleton_dtype fp16|fp32

New CLI flags on kakeyaturbo-bench:
  --centroids-file PATH     load 2^bits LE-f32 sorted centroid table
  --outlier-threshold T     extract post-WHT residual coords with
                            |scaled_residual| > T as (u16 idx, f16
                            val) sparse overrides at decode

Internal wiring: *_with_centroids variants of encode_block /
decode_block / encode_layer / decode_layer thread the centroid table
and outlier buffer through the block pipeline without changing the
wire format when neither is set (bit-compatible default).
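
The outlier path can be sketched in a few lines of Python. This is not the Rust implementation: `extract_outliers` and `apply_overrides` are hypothetical helpers, and the f16 round-trip only mimics the (u16 idx, f16 val) storage described above.

```python
import struct

def extract_outliers(scaled_residual, threshold):
    """Return [(index, value)] for coordinates with |value| > threshold."""
    out = []
    for idx, val in enumerate(scaled_residual):
        if abs(val) > threshold:
            # Round-trip through f16 to mimic the stored precision.
            f16 = struct.unpack("<e", struct.pack("<e", val))[0]
            out.append((idx, f16))
    return out

def apply_overrides(decoded, outliers):
    """Overwrite decoded coordinates with their stored outlier values."""
    patched = list(decoded)
    for idx, val in outliers:
        patched[idx] = val
    return patched
```

With T = 2.0, only coordinates that the codebook would clip badly are stored sparsely; everything else is reconstructed from centroids as usual.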

153 Rust unit + 5 integration + 6 proptest = 164 tests pass.

Source: cherry-pick of 521e97b ('outlier compensation: Rust codec
support + Python harness + 4-passage PPL validation on DS-Distill')
onto the v1.3 scaffolding branch (includes the preceding Steps 3+4
Lloyd-Max + boundary infrastructure from 05dfbc5).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Artifacts needed for the 'v1.3 PPL' production recipe:

  reports/v1_4_q_pca/flagship/deepseek_distill_q_calib.{safetensors,json}
    Σ_q Cholesky factors per (layer, kv-head) for
    DeepSeek-R1-Distill-Qwen-1.5B (28 layers × [2, 128, 128]).

  reports/v1_4_q_pca/calibrated_codebook/ds_K_b{2,3}_centroids.f32
  reports/v1_4_q_pca/calibrated_codebook/ds_V_b2_centroids.f32
    Empirical Lloyd-Max centroid tables (2^bits LE-f32 floats each)
    trained on pooled 25M post-WHT residual samples from DS-Distill.
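
For orientation, a toy 1-D Lloyd-Max fit and the little-endian f32 table layout might look as follows. This is a sketch under simplifying assumptions (quantile initialisation, fixed iteration count, plain Python lists); it is not the code in benchmarks/lloyd_max_calibration.py, and the helper names are hypothetical.

```python
import struct

def lloyd_max(samples, bits, iters=50):
    """Fit 2**bits sorted centroids to 1-D samples with Lloyd's algorithm."""
    k = 2 ** bits
    xs = sorted(samples)
    # Initialise from evenly spaced quantiles of the data.
    centroids = [xs[(2 * i + 1) * len(xs) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for x in xs:
            j = min(range(k), key=lambda c: abs(x - centroids[c]))
            buckets[j].append(x)
        # Each centroid moves to the mean of its bucket (unchanged if empty).
        centroids = sorted(sum(b) / len(b) if b else centroids[i]
                           for i, b in enumerate(buckets))
    return centroids

def pack_centroids(centroids):
    """Serialise as consecutive little-endian f32 values."""
    return struct.pack(f"<{len(centroids)}f", *centroids)

def unpack_centroids(blob):
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))
```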

Python helpers:

  benchmarks/q_precondition.py     K-stream whitening K_tilde = K @ L
  benchmarks/q_calibration.py      offline Σ_q Cholesky calibration
  benchmarks/lloyd_max_calibration.py  offline residual codebook fitting

These are the pieces the forthcoming vLLM harness extension needs
to drive the full v1.3 PPL recipe (Q-precond + calibrated Lloyd-Max
+ boundary skip + outlier T=2.0).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
benchmarks/e2e_ppl_validation_vllm_full.py drives the full recipe:

  1. Q-preconditioning  K_tilde = K @ L (per layer, per kv-head)
  2. Calibrated Lloyd-Max  via --centroids-file to Rust codec
  3. 6-layer boundary skip  [0, 1, 7, 14, 26, 27] stay bf16
  4. Outlier compensation  T = 2.0 on K residual (--outlier-threshold)

Unlike the scaffolding harness (e2e_ppl_validation_vllm.py) which
hooks at vllm.attention.layer.Attention.forward (post-RoPE), this
harness patches vllm.model_executor.models.qwen2.Qwen2Attention.forward
to intercept K/V immediately after the QKV projection, BEFORE RoPE:

  qkv → split(q, k, v)
  k   ← unwhiten(codec_roundtrip(whiten(k), centroids, outlier))  [pre-RoPE]
  v   ← codec_roundtrip(v, centroids)
  q, k ← rotary_emb(positions, q, k)
  … rest of normal attention runs on the repaired K and V

This matches the PR #13 HF harness (benchmarks/e2e_ppl_pre_rope.py)
semantically: Σ_q is calibrated on pre-RoPE Q distributions, so the
whitening must be applied to pre-RoPE K for the Σ_q-MSE
equivalence to hold.
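
The Σ_q-MSE equivalence can be checked on a toy 2-D example: with Σ_q = L Lᵀ, plain squared error measured on K_tilde = K @ L equals the Σ_q-weighted error on K. The matrices and helper names below are made up for illustration; only the K_tilde = K @ L convention comes from the recipe above.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

# Toy lower-triangular Cholesky factor and its inverse; Sigma_q = L L^T.
L = [[2.0, 0.0], [0.5, 1.0]]
L_inv = [[0.5, 0.0], [-0.25, 1.0]]
sigma_q = matmul(L, transpose(L))

def whiten(k_rows):      # K_tilde = K @ L
    return matmul(k_rows, L)

def unwhiten(k_tilde):   # K = K_tilde @ L^{-1}
    return matmul(k_tilde, L_inv)

def sq_norm(rows):
    return sum(x * x for r in rows for x in r)

k = [[1.0, 2.0]]
k_hat = [[0.75, 2.5]]    # some lossy reconstruction of k
err = [[a - b for a, b in zip(k[0], k_hat[0])]]

# Plain MSE in whitened space vs Sigma_q-weighted error in the original space.
mse_whitened = sq_norm(matmul(err, L))
sigma_weighted = matmul(matmul(err, sigma_q), transpose(err))[0][0]
print(mse_whitened, sigma_weighted)
```

This is why an MSE-optimal codec applied to whitened K minimises the error that matters for the q·k score, provided the whitening frame matches the frame Σ_q was calibrated in.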

benchmarks/run_v1_3_ppl_full_vllm.sh: driver with env-overridable
defaults matching the SPRINT_CLOSEOUT production cell:

  DS-Distill D=128, K b=3 + V b=2, T=2.0, 6 bdry →
  target Δppl +7.82% / top-1 78.97% / ratio 4.61× (MARGINAL)

Syntax-check clean; end-to-end on GPU pending.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Full production recipe (K b=3 Q-precond + calibrated Lloyd-Max +
outlier T=2.0 + 6-layer boundary skip; V b=2 calibrated + share-basis)
runs end-to-end on vLLM 0.7.3 for DeepSeek-R1-Distill-Qwen-1.5B.

Result (ctx=2048, n_eval=64, 4 WikiText-103 passages):

  Passage 1: Δppl -8.86%  top1 56.2%
  Passage 2: Δppl +32.79% top1 51.6%
  Passage 3: Δppl +40.46% top1 65.6%
  Passage 4: Δppl +76.92% top1 64.1%

  Mean Δppl = +35.33 %, mean top-1 = 59.4 %  → REJECT

Guardrails move bare v1.3 b=2 from +292% on vLLM (PR #14) → +35%
on vLLM (this PR), an ~8× Δppl improvement, in directional agreement
with the HF ladder (+356% → +8% on DS-Distill). However vLLM ends
~4.5× worse in Δppl than HF at the same codec config on the same
model family. Two likely causes (Flash-Attention numerics vs. HF
eager, and CPU/GPU fp32<->bf16 boundary noise) are documented with
follow-up sweeps in FINDINGS.md.

Artifact: reports/v1_3_ppl/vllm/ds_distill_qwen_1_5b_vllm_full.json
Full write-up: reports/v1_3_ppl/vllm/FINDINGS.md

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Motivation: PR #15 showed vLLM v1.3 PPL full-recipe gives Δppl
+35.3% vs HF's +7.82% on the same codec config on DS-Distill. Two
hypotheses were named but not separated: (1) Σ_q was calibrated on
pre-RoPE Q, but Flash-Attention computes Q@K.T on post-RoPE Q,
breaking the Σ_q → K_tilde metric equivalence; (2) the per-forward
CPU↔GPU and fp32↔bf16 round-trip itself accumulates
enough numerical noise to degrade PPL.

This harness runs four cells against a single shared reference,
pair-wise per passage so all cells observe the same ref PPL:

  identity-pre_qp   whiten → identity codec → unwhiten
                      isolates hypothesis (2): everything except
                      compression
  codec-no_qp       real codec, no whitening
                      isolates "codec only"
  codec-pre_qp      production recipe (matches PR #15)
  codec-post_qp     codec + post-RoPE Σ_q_post self-calibrated
                      online from this run's own post-RoPE Q tensors
                      isolates hypothesis (1): correct whitening
                      under FA

Key implementation details:

- Qwen2Attention.forward is patched once; the patch branches on
  CodecState.mode/qp_mode to pick the right hook.
- PostRopeQCalib accumulates Sum(q q.T) per (layer, kv-head) during
  a cheap dedicated calibration forward pass (codec off), then
  Cholesky-factors with a small ridge for stability. For GQA models
  (num_heads > num_kv_heads) we pool Q heads in the same KV group
  before accumulating, matching the metric FA actually computes.
- All cells share the same reference, computed once; each cell runs
  its own alt_pls and compares per passage.
- Identity codec does skip the kakeyaturbo-bench subprocess but still
  goes through the full fp32↦numpy↦CPU↦numpy↦fp32 path, so it
  measures the complete CPU↔GPU noise floor.
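
The accumulate-then-factor step can be sketched without any tensor library. The code below pools q qᵀ over all Q heads of one KV group and Cholesky-factors the result with a small ridge; normalising by the sample count and the default ridge value are assumptions of this sketch, and the function names are hypothetical rather than taken from PostRopeQCalib.

```python
import math

def accumulate_sigma(q_heads, dim):
    """Pool sum(q q^T) over every Q head in one KV group, divided by count."""
    sigma = [[0.0] * dim for _ in range(dim)]
    n = 0
    for head in q_heads:              # heads sharing one kv-head
        for q in head:                # one query vector per token
            n += 1
            for i in range(dim):
                for j in range(dim):
                    sigma[i][j] += q[i] * q[j]
    return [[v / n for v in row] for row in sigma]

def cholesky_with_ridge(sigma, ridge=1e-6):
    """Lower-triangular L with L L^T = sigma + ridge * I."""
    d = len(sigma)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sigma[i][j] + (ridge if i == j else 0.0)
            s -= sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L
```

The ridge keeps the factorisation defined when the pooled covariance is rank-deficient, which can happen with few calibration tokens.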

Syntax-clean; GPU run on Vast.ai H200 pending in the same turn.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…irection

Paired 4-cell ablation on DS-Distill + vLLM H200 (shared ref):

  identity-pre_qp    Δppl   -0.29%  top1 98.83%  ACCEPT
  codec-no_qp        Δppl +152.78%  top1 59.38%  REJECT
  codec-pre_qp       Δppl  +35.33%  top1 59.38%  REJECT  (= PR #15)
  codec-post_qp      Δppl  +54.28%  top1 57.03%  REJECT

Findings:

- H2 (CPU↔GPU + fp32↔bf16 noise) is ruled out. The identity cell
  walks the complete production hook pipeline minus compression and
  records -0.29% Δppl / 98.83% top-1.
- H1 (Σ_q was calibrated on pre-RoPE Q but FA operates on post-RoPE
  Q) as a direct fix-up is wrong. Online self-calibrated Σ_q^post
  makes things STRICTLY WORSE (+54% vs +35%). Math: RoPE is
  position-dependent; pooling post-RoPE Q over tokens averages away
  the per-token rotations and collapses anisotropy, giving a flatter
  pooled Σ than the true per-token FA metric.
- Pre-RoPE whitening IS the FA-correct thing (R_t L L^T R_t^T =
  R_t Σ_q R_t^T commutes with the per-token rotation). The Q-precond
  architectural choice in PR #13 is verified correct for vLLM too.
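
The "pooling averages away the rotations" argument can be illustrated with a toy 2-D case: vectors that are strongly anisotropic before rotation yield an almost isotropic pooled second moment once each token is rotated by a position-dependent angle. The rotation schedule and values are made up for illustration and are far simpler than real RoPE.

```python
import math

def pooled_second_moment(vectors):
    """Return (E[x^2], E[y^2], E[xy]) pooled over all vectors."""
    n = len(vectors)
    sxx = sum(x * x for x, _ in vectors) / n
    syy = sum(y * y for _, y in vectors) / n
    sxy = sum(x * y for x, y in vectors) / n
    return sxx, syy, sxy

def rotate(vec, angle):
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

# Strongly anisotropic pre-rotation vectors: almost all energy on axis 0.
pre = [(1.0, 0.01)] * 64
# RoPE-like position-dependent rotation: token t is rotated by t * theta.
theta = 2 * math.pi / 64
post = [rotate(v, t * theta) for t, v in enumerate(pre)]

print(pooled_second_moment(pre))    # E[x^2] much larger than E[y^2]
print(pooled_second_moment(post))   # E[x^2] close to E[y^2], E[xy] close to 0
```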

The remaining +35% gap is not Q-precond placement but almost
certainly calibration-distribution drift: Σ_q + centroids were all
fit on HF DynamicCache prefill snapshots, but vLLM's Qwen2 layer
produces slightly different prefill K/V distributions (different bf16
accumulation / RoPE impl / attn bias). The codec has to eat that
mismatch. Next experiment: re-fit Σ_q and Lloyd-Max centroids on
vLLM prefill snapshots and re-run codec-pre_qp.

Artifacts:
  reports/v1_3_ppl/vllm_ablation/FINDINGS.md
  reports/v1_3_ppl/vllm_ablation/ds_distill_qwen_1_5b_vllm_ablation.json

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Follow-up to the ablation in reports/v1_3_ppl/vllm_ablation/:
H2 (noise) and post-RoPE Q-precond hypothesis both ruled out; the
remaining +35% Δppl gap on vLLM vs HF is most likely calibration-
distribution drift (Σ_q and Lloyd-Max centroids were fit on HF
DynamicCache snapshots, not vLLM prefill snapshots).

benchmarks/vllm_calibration_refit.py:

  1. Spins up vLLM LLM (bf16, enforce_eager).
  2. Installs a capture-only monkey patch on
     Qwen2Attention.forward that records the pre-RoPE q/k/v
     tensors without modifying the forward.
  3. Runs N calibration passages from the WikiText-103 TRAIN split
     by default (disjoint from the TEST split the PPL measurement
     uses), so no leakage.
  4. Factors Σ_q per (layer, kv-head) by pooling the Q heads in
     each GQA KV group. Matches the format used by
     benchmarks/q_precondition.QPrecond exactly:
       layer_<l>_chol      [n_kv, D, D]  fp32
       layer_<l>_inv_chol  [n_kv, D, D]  fp32
       layer_<l>_sigma     [n_kv, D, D]  fp32
  5. Re-runs the Lloyd-Max residual pipeline on the captured K
     (whitened with the fresh Σ_q) and V streams, producing
     drop-in replacements for the ds_K_b{2,3}_centroids.f32 /
     ds_V_b2_centroids.f32 tables.

Outputs at --out-dir (default reports/v1_3_ppl/vllm_recalibrated/):
  q_calib.safetensors
  q_calib.json
  K_b2_centroids.f32, K_b3_centroids.f32, V_b2_centroids.f32
  SUMMARY.json

benchmarks/run_vllm_calibration_refit.sh is the driver.

Next step (not in this commit): re-run the codec-pre_qp ablation cell
with --q-calib-pre-rope=.../q_calib.safetensors and --k-centroids /
--v-centroids pointing at the new .f32 files.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
lloyd_max_calibration.py has top-level 'from transformers import …'
and 'import benchmarks.pre_rope_cache …', which were blocking our
tool from being usable in a vLLM-only environment (HF transformers
IS present, but pre_rope_cache is a HF-eager patch that we don't
need here — we just want four math utilities).

Load the pure-numpy functions by stripping the HF imports from the
source before exec()'ing it, then picking up fit_pca_simple /
next_pow2 / wht_rotate / lloyd_max_iterate from the resulting
namespace. Semantics unchanged.
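
The strip-then-exec approach can be sketched as below. This is a simplified illustration, not the commit's code: it only drops single-line imports matched by prefix, `load_pure_functions` is a hypothetical name, and the `next_pow2` body in the demo source is a stand-in.

```python
def load_pure_functions(source: str, blocked_prefixes, wanted):
    """exec() `source` with blocked import lines removed; return wanted names."""
    kept = [
        line for line in source.splitlines()
        if not line.lstrip().startswith(tuple(blocked_prefixes))
    ]
    namespace = {}
    exec(compile("\n".join(kept), "<stripped>", "exec"), namespace)
    return {name: namespace[name] for name in wanted}

demo_source = '''
from transformers import AutoModel
import benchmarks.pre_rope_cache as prc

def next_pow2(n):
    p = 1
    while p < n:
        p *= 2
    return p
'''

funcs = load_pure_functions(
    demo_source,
    blocked_prefixes=("from transformers import", "import benchmarks.pre_rope_cache"),
    wanted=("next_pow2",),
)
print(funcs["next_pow2"](100))
```

The trade-off is fragility: any helper that actually uses a stripped import fails at call time rather than at load time, so this only suits functions known to be pure math.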

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Captured pre-RoPE Q/K/V from vLLM 0.7.3 on
deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B using 8 disjoint
WikiText-103 TRAIN split passages of 2048 tokens each (16k tokens
per layer per kv-head pool).

Artifacts (drop-in compatible with QPrecond / kakeyaturbo-bench):

  q_calib.safetensors      28 layers × [2, 128, 128] (chol/inv/sigma)
                           Σ_q anisotropy (cond): median 4506, max 109076
                           (off-diag max / diag mean: median 15.5, max 34.8):
                           strongly anisotropic, so Q-precond has something
                           to do.
  K_b2_centroids.f32       Gaussian MSE 0.1143 → calibrated 0.0721 (1.59×)
  K_b3_centroids.f32       Gaussian MSE 0.0318 → calibrated 0.0214 (1.48×)
  V_b2_centroids.f32       Gaussian MSE 0.1140 → calibrated 0.1140 (1.00×)
  q_calib.json             per-(layer, kv-head) diagnostics
  SUMMARY.json             centroid comparison

These replace the HF-calibrated tables in
reports/v1_4_q_pca/{flagship,calibrated_codebook}/ when the
vLLM harness is run with --q-calib-pre-rope=.../q_calib.safetensors
and --k-centroids / --v-centroids pointing at the new .f32 files.

K improvement ratios match the SPRINT_CLOSEOUT HF-calibrated numbers
(1.47× at b=2 / 1.40× at b=3) closely: the vLLM post-WHT residual
distribution looks quantitatively similar to HF's, just slightly
shifted. Whether this shift is what causes the +35% ↦ ??? Δppl gap
is the next ablation cell.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Hypothesis H3 ('the +35% vLLM Δppl gap vs HF's +7.82% comes from
Σ_q + Lloyd-Max being calibrated on HF, not vLLM, prefill
distributions') was the leading candidate after the H1/H2 ablation.

This commit tests and rules it out.

Procedure:
  1. Capture pre-RoPE Q/K/V from vLLM on 8 disjoint WikiText-103
     train passages (2048 tokens each).
  2. Fit Σ_q and K/V Lloyd-Max tables on THAT data.
  3. Re-run the 4-cell ablation with the drop-in replacements.

Result comparison (HF-calibrated vs vLLM-calibrated, same test passages):

  identity-pre_qp    -0.29% →   +0.15%  (noise)
  codec-no_qp      +152.78% → +144.56%  (noise)
  codec-pre_qp      +35.33% →  +38.69%  (+3 pp, noise)
  codec-post_qp     +54.28% →  +58.24%  (noise)

Calibration drift does NOT explain the HF vs vLLM gap. vLLM-origin
calibration does not measurably beat HF-origin calibration, because
the pre-RoPE Q/K/V distributions on vLLM are close enough to HF's
that the HF tables are already well-matched.

Lloyd-Max improvement ratios corroborate: HF-calibrated K b=2 is
1.47x better than Gaussian; vLLM-calibrated K b=2 is 1.59x. Close.

Remaining candidates:
  H4: Flash-Attention bf16 softmax/score reduction amplifies
      codec residuals more than HF eager's f32-accumulate path.
      Engine-level numerical sensitivity, not fixable by
      re-calibration.
  H5: vLLM's prompt_logprobs=1 forward path integrates compression
      residuals through a different numerical trajectory than HF's
      prefill + teacher-force two-pass.

Both would require a different class of experiment (e.g. a
near-exact codec run vs identity, or port of the HF harness to run
teacher-forcing on a vLLM-reconstructed cache) to distinguish.

Artifacts:
  reports/v1_3_ppl/vllm_recalibrated_run/FINDINGS.md
  reports/v1_3_ppl/vllm_recalibrated_run/ds_distill_qwen_1_5b_vllm_ablation.json

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Three of the four cells (codec-no_qp, identity-pre_qp, codec-post_qp)
served only to falsify hypotheses H1, H2, H3 for the HF↔vLLM Δppl
gap. All three hypotheses are now closed:

  H1 (Σ_q in wrong frame)              ruled out: post-RoPE Σ_q
                                       strictly worse
  H2 (CPU↔GPU + fp32↔bf16 noise)       ruled out: identity cell
                                       Δppl ≈ 0, top-1 99%
  H3 (calibration distribution drift)  ruled out: vLLM-origin
                                         calibration indistinguishable
                                         from HF-origin

Only the production cell (codec-pre_qp) carries a standing datapoint
going forward. Delete the 4-cell harness + runner and the two
ablation report directories that contain the now-obsolete cells. The
surviving datapoints (HF-calib production + vLLM-calib production)
will be re-recorded with a clean, 1-cell format from
e2e_ppl_validation_vllm_full.py, which is what this PR has always
shipped for production.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
After ruling out H1 / H2 / H3 in the previous ablation rounds, only
one production-relevant datapoint remains:

  HF-calibrated  (as shipped)  Δppl +35.33%  top1 59.38%
  vLLM-calibrated (this PR)    Δppl +38.69%  top1 61.33%

Both reject; difference is passage noise. This commit:

- Removes the obsolete per-subdir FINDINGS.md.
- Adds a single reports/v1_3_ppl/FINDINGS.md holding the two rows +
  passage detail + what H1/H2/H3 ruled out + the H4/H5 residual
  hypotheses to test next.
- Keeps vllm/{.json}, vllm_calibrated/{.json}, vllm_recalibrated/
  as the backing artifacts.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Needed for the H4 ablation: rerun the same codec-pre_qp cell under
a non-Flash-Attention backend. vLLM 0.7.3 picks the attention
backend from the VLLM_ATTENTION_BACKEND env var; we expose it as
ATTN_BACKEND for the driver.
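
The mapping from the driver's knob to the engine's variable amounts to one dictionary update before the engine is built. The actual driver is a shell script; the Python below is only a sketch, `engine_env` is a hypothetical helper, and the FLASH_ATTN fallback is an assumption based on the production cell's backend.

```python
def engine_env(driver_env, default_backend="FLASH_ATTN"):
    """Return a copy of the environment with VLLM_ATTENTION_BACKEND set."""
    env = dict(driver_env)
    # An empty or missing ATTN_BACKEND falls back to the default backend.
    env["VLLM_ATTENTION_BACKEND"] = driver_env.get("ATTN_BACKEND") or default_backend
    return env

print(engine_env({"ATTN_BACKEND": "XFORMERS"})["VLLM_ATTENTION_BACKEND"])
```

The resulting environment has to be in place before the vLLM engine is constructed, since the backend is selected at engine start-up.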

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
H4 setup:
  - ATTN_BACKEND env var in the driver → VLLM_ATTENTION_BACKEND on
    the engine; switch from FLASH_ATTN to XFORMERS.
  - Reports under reports/v1_3_ppl/vllm_h4_xformers/.

H5 setup:
  - New CodecState.prefix_only_tokens + --prefix-only-tokens CLI.
  - When set, the codec only round-trips the first N rows of each
    layer's K/V tensor; the tail is pass-through. Mirrors the HF
    harness's two-pass 'codec only touched the prefill cache,
    teacher-force saw exact K/V' semantics inside vLLM's single-
    forward path.
  - Driver forwards PREFIX_ONLY_TOKENS env var as --prefix-only-tokens.
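
The prefix-only behaviour reduces to slicing the token axis before the codec call. A minimal sketch, with rows standing in for per-token K or V vectors and `roundtrip_prefix_only` as a hypothetical helper name:

```python
def roundtrip_prefix_only(rows, codec, prefix_only_tokens=None):
    """Apply `codec` to the first N rows only; pass the tail through as-is."""
    if prefix_only_tokens is None:
        return codec(rows)            # default: codec touches every position
    n = min(prefix_only_tokens, len(rows))
    return codec(rows[:n]) + list(rows[n:])
```

With the prefix length set to ctx_len, the eval window keeps exact K/V, which is what makes the single-forward vLLM path comparable to HF's two-pass setup.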

H4 result: Δppl +37.82% / top-1 60.16% (XFORMERS) vs +35.33% /
59.38% (FLASH_ATTN); within passage noise. H4 FALSIFIED: the
residual amplification is not specific to Flash-Attention's bf16
softmax. (TORCH_SDPA is unsupported in the vLLM 0.7.3 V0 engine on
CUDA; FLASHINFER is not installed on this image.)

H5 pending in the next commit.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…sified

With PREFIX_ONLY_TOKENS=2048 (codec only touches positions < ctx_len
= 2048, eval window [2048, 2112) stays uncompressed), mirroring HF's
two-pass 'prefill cache → teacher-force' semantics inside vLLM's
single-forward path:

  Δppl  +35.41 %  (baseline was +35.33 %)
  top-1  59.77 %  (baseline 59.38 %)
  → REJECT

Essentially identical to the full-prefix baseline. The HF↔vLLM gap is
NOT a measurement-path artifact: compressing only the prefill window
and leaving the eval window clean does not change the result.

Combined with H4 (XFORMERS gave +37.82%), both residual hypotheses
are closed. The next step is the engineering fallback: sweep K b=4
(more K headroom).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
For K b=4 the SPRINT_CLOSEOUT notes calibrated Lloyd-Max centroids do
not help (slightly degrade in fact), so ds_K_b4_centroids.f32 is not
shipped. Let the driver gracefully omit --k-centroids / --v-centroids
when the env var is empty or 'none'.
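
The "omit when empty or 'none'" rule can be written as a small argument builder. The driver itself is a shell script, so this Python version is only a sketch of the logic; `centroid_args` is a hypothetical name, and V_CENTROIDS is assumed by symmetry with the K_CENTROIDS variable.

```python
def centroid_args(env):
    """Build the centroid CLI flags, skipping unset, empty or 'none' values."""
    args = []
    for var, flag in (("K_CENTROIDS", "--k-centroids"),
                      ("V_CENTROIDS", "--v-centroids")):
        value = env.get(var, "").strip()
        if value and value.lower() != "none":
            args += [flag, value]
    return args
```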

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
… bottleneck on vLLM

After closing H1/H2 earlier and H3 in the previous commit, this
iteration closes H4 and H5 and then executes the engineering fallback
(sweep K b=4, then K b=4 + V b=4) on the codec-pre_qp production
cell.

Results on DS-Distill-Qwen-1.5B / vLLM 0.7.3 H200 / 4 WikiText-103
test passages / ctx=2048 / n_eval=64:

  production      K b=3, V b=2, FLASH_ATTN, HF calib             +35.33 %
  H3 vLLM-calib   same but Σ_q+centroids re-fit on vLLM          +38.69 %  noise
  H4 XFORMERS     same except VLLM_ATTENTION_BACKEND=XFORMERS    +37.82 %  noise
  H5 prefix-only  same except codec only touches pos<2048        +35.41 %  noise
  strategy K b=4  K bits 3→4 (Gaussian K centroids)              +37.30 %  no improvement
  strategy K+V 4  K+V bits to 4, no calibrated centroids         +27.32 %  −10 pp vs K b=4

H4 falsified: swapping FA for XFormers does not close the gap. The
Δppl amplification is not specific to Flash-Attention's bf16 softmax.

H5 falsified: restricting the codec to positions <ctx_len (eval window
sees clean K/V) does not close the gap. HF's two-pass and vLLM's
single-forward paths integrate codec residuals similarly at PPL
resolution.

Strategy: K headroom alone doesn't help (+35.33 → +37.30 at b=4).
Doubling V rate alone (b=2 → b=4, no outlier, no V calibration)
buys ~10 pp Δppl (+37.30 → +27.32), the FIRST knob that shifts
the number meaningfully.

Interpretation: the residual HF↔vLLM gap is a V-stream failure mode
that is specific to vLLM's FA-family integration of b=2 V. On HF,
V residuals are 'natively Gaussian' so b=2 Lloyd-Max is near-optimal;
under FA's bf16 softmax(QK^T/√d) @ V accumulation, that
approximation is less forgiving.

For deployment:
  HF users:   SPRINT_CLOSEOUT MARGINAL cell holds. No change.
  vLLM users: honest in-engine Δppl is +35% at the standard config.
    Cheapest fix: V b=2 → V b=4 (−10 pp Δppl, halves V compression).
    Proper fix: V-side codec redesign targeting FA's score@V accumulation.

Artifacts:
  reports/v1_3_ppl/FINDINGS.md (consolidated, 6 rows)
  reports/v1_3_ppl/vllm_kb4/ds_distill_qwen_1_5b_kb4_vllm_full.json
  reports/v1_3_ppl/vllm_kv4/ds_distill_qwen_1_5b_kb4_vb4_vllm_full.json

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
… knob

Drop all H3/H4/H5/kb4/kv4 artifacts and the refit tool. The single
standing cell on this branch is:

  codec-pre_qp (production)  DS-Distill / vLLM 0.7.3 / FLASH_ATTN
    K b=3 + V b=2 + HF-calibrated Lloyd-Max + outlier T=2.0 +
    6-layer boundary skip + pre-RoPE Q-preconditioning
  → Δppl +35.33 %, top-1 59.38 %, REJECT

All retired cells were confirmed to leave Δppl within passage noise
of this baseline (or, for KV b=4, moved it by only ~10 pp while still
REJECT).

Replace their per-stream debugging surface with a single clean knob:
--compress-stream {kv, k, v} (also COMPRESS_STREAM env var on the
driver). 'k' routes only K through the codec and leaves V bf16;
'v' is the mirror; 'kv' (default) is the production cell. This
exposes the per-channel Δppl attribution that PR #15's V-vs-K
localisation pointed at, cleanly, from the production harness.
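
A minimal sketch of how such a knob could be wired, assuming an argparse-based driver. Only the flag name, its three values, and the COMPRESS_STREAM variable come from the message above; `parse_stream_args`, `streams_to_compress`, and the precedence (CLI value over environment over 'kv') are hypothetical.

```python
import argparse
import os

VALID_STREAMS = ("kv", "k", "v")


def parse_stream_args(argv=None, env=None):
    """Parse --compress-stream, falling back to COMPRESS_STREAM, then 'kv'.

    Hypothetical helper: the real driver may resolve precedence differently.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--compress-stream",
        choices=VALID_STREAMS,
        default=env.get("COMPRESS_STREAM", "kv"),
    )
    return parser.parse_args(argv)


def streams_to_compress(mode):
    """Map a stream mode to (compress_k, compress_v).

    argparse does not check defaults against `choices`, so a bad value
    coming from the environment is rejected here instead.
    """
    if mode not in VALID_STREAMS:
        raise ValueError(f"unknown stream mode: {mode!r}")
    return ("k" in mode, "v" in mode)


if __name__ == "__main__":
    args = parse_stream_args()
    print(streams_to_compress(args.compress_stream))
```

With this layout, 'k' yields (True, False), so V would stay in bf16, and 'v' is the mirror case.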

Remove the now-obsolete --prefix-only-tokens, ATTN_BACKEND, and
K_CENTROIDS=none overrides from the driver; the streams knob covers
the remaining question we want to answer.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…Δppl

Three rows on DS-Distill-Qwen-1.5B / vLLM 0.7.3 H200, 4 WikiText-103
test passages, ctx=2048, n_eval=64, shared reference (paired):

  production (K+V)  Δppl +35.33 %  top1 59.38 %  REJECT
  K-only (V bf16)   Δppl +22.55 %  top1 69.14 %  REJECT
  V-only (K bf16)   Δppl +11.10 %  top1 74.22 %  REJECT

K-only + V-only = +33.65 ≈ +35.33 − 1.68 pp interaction. So K and V
contribute roughly additively at this codec config, with K carrying
about two-thirds of the Δppl (+22.55 / +35.33 ≈ 64 %) and V carrying
about one-third (+11.10 / +35.33 ≈ 31 %).
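
The additivity check above can be reproduced from the three table rows. This is only a worked-arithmetic sketch; the variable names are illustrative.

```python
# Δppl (percent) for the three rows in the table above.
joint = 35.33    # production (K+V)
k_only = 22.55   # K-only (V bf16)
v_only = 11.10   # V-only (K bf16)

additive = k_only + v_only       # what pure additivity would predict
interaction = joint - additive   # residual when both streams are compressed
k_share = k_only / joint         # fraction of the joint Δppl seen with K alone
v_share = v_only / joint         # fraction of the joint Δppl seen with V alone

print(f"additive sum  {additive:+.2f} pp")
print(f"interaction   {interaction:+.2f} pp")
print(f"K share {k_share:.0%}, V share {v_share:.0%}")
```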

This is a material deviation from the HF picture: SPRINT_CLOSEOUT's
v1.3 PPL recipe spends ALL four guardrails (Q-precond, K Lloyd-Max
centroids, outlier T=2.0, 6-layer boundary skip) on the K stream
precisely because HF's V b=2 with share_basis is 'natively Gaussian'
and near-lossless. On vLLM+FA the same K-side stack still leaves
+22.55 pp Δppl on K alone, so K is the bigger lever; the cheapest
path to closing the HF↔vLLM gap is a vLLM-specific K-side redesign.

Top-1 attribution is slightly different from Δppl attribution:
K-only drops top-1 by 30.86 pp and V-only by 25.78 pp (joint K+V:
40.62 pp loss). Both channels' logit distortions reorder the
one-best similarly at full context; V distortions tend to preserve
top-1 better than they preserve the log-prob of the true token.

Artifacts:
  reports/v1_3_ppl/vllm_k_only/ds_distill_qwen_1_5b_k_only_vllm_full.json
  reports/v1_3_ppl/vllm_v_only/ds_distill_qwen_1_5b_v_only_vllm_full.json
  reports/v1_3_ppl/FINDINGS.md  (consolidated, 3 rows)

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
In SPRINT_CLOSEOUT v1.3 PPL, outlier compensation T=2.0 is K-side
only; V has calibrated Lloyd-Max and 6-layer boundary but no outlier
threshold. The Rust codec already supports --outlier-threshold on
any stream — it's just that our Python harness hardcoded V's to
None to match the HF recipe.

Expose it:
  --v-outlier-threshold T   Python CLI
  V_OUTLIER_THRESHOLD=T     driver env var (unset = V has no
                            outlier compensation, matching the HF
                            v1.3 PPL recipe)

Semantics of the 'four guardrails' for V under this PR's per-channel
attribution question:

  (1) Q-preconditioning          N/A for V (V does not contract with
                                  Q; there is no Σ_q metric on V)
  (2) calibrated Lloyd-Max        already on  (ds_V_b2_centroids.f32)
  (3) 6-layer boundary skip       already on  (same layers as K)
  (4) outlier compensation        now optional via --v-outlier-threshold
                                  (was hardcoded None)

The V-only baseline row in FINDINGS.md (Δppl +11.10 %) already has
(2) and (3) active. This commit lets the next V-only run add (4)
too, to answer the question 'what is V-only PPL with all four
guardrails that APPLY to V turned on?'

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
On DS-Distill + vLLM FLASH_ATTN / 4 WikiText-103 test passages /
shared paired reference, running the production cell with
COMPRESS_STREAM=v and V_OUTLIER_THRESHOLD=2.0 gives:

  V-only (SPRINT_CLOSEOUT recipe, no V outlier)   +11.10 %  top1 74.22 %
  V-only (+ outlier T=2.0)                         +7.04 %  top1 75.39 %
                                                   −4.06 pp Δppl

Four SPRINT_CLOSEOUT guardrails and their applicability to V:
  (1) Q-precond   N/A (V does not contract with Q; no Σ_q metric)
  (2) Lloyd-Max   already on (ds_V_b2_centroids.f32)
  (3) 6-bdry      already on
  (4) outlier T   was off in SPRINT_CLOSEOUT; this commit enables it

So row 4 is 'V with all APPLICABLE guardrails'. Outlier compensation
is a cheap V add-on worth ~4 pp on vLLM. On HF the SPRINT_CLOSEOUT
reasoning held (V residual already near-Gaussian), but under
FLASH_ATTN's bf16 accumulation the remaining V outliers apparently
still matter.

Consolidated table in reports/v1_3_ppl/FINDINGS.md now has 4 rows:

  production (K+V)                            +35.33 %   59.38 %
  K-only                                      +22.55 %   69.14 %
  V-only (SPRINT_CLOSEOUT V recipe)           +11.10 %   74.22 %
  V-only + outlier (all applicable guards)    +7.04 %    75.39 %

K is still the bigger lever under vLLM; V outlier compensation is a
cheap add that shaves ~4 pp off V's standalone cost on top.
Artifact: reports/v1_3_ppl/vllm_v_only_full_guards/*.json

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
PR #16's gap decomposition localised +39 pp of the HF↔vLLM Δppl gap
to cross-layer non-linear compounding introduced by the production
harness running the codec INSIDE the forward at every layer. In
production scenario A ('compress KV after a clean prefill, to save
memory during generation'), the codec is applied ONCE to a snapshot
of the already-populated KV tensors — just like HF's DynamicCache
pattern.

This harness reproduces that semantic inside vLLM:

  Phase capture:  Qwen2Attention.forward hook records the per-layer
                  pre-RoPE K, V for the entire prompt on a clean
                  forward pass (codec OFF).
  Offline:        run the production v1.3 codec on every (layer,
                  stream) snapshot — Q-precond + calibrated
                  Lloyd-Max + outlier T=2.0 + 6-layer boundary
                  skip on K; Lloyd-Max + share-basis on V.
  Phase replace:  second forward through the SAME engine. Hook
                  ignores each layer's live k, v projections and
                  substitutes the pre-codec'd tensor from the
                  snapshot. Q still comes from the running residual
                  stream (matches HF teacher-force semantics).
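
The three phases can be sketched with a toy hook, assuming plain Python lists in place of tensors. `HookState`, `attention_kv`, and `toy_codec` here are hypothetical stand-ins for illustration, not the harness's actual code or the real codec.

```python
from dataclasses import dataclass, field


@dataclass
class HookState:
    """State consulted by every patched attention layer (toy version)."""
    mode: str = "off"                               # "off" | "capture" | "replace"
    snapshots: dict = field(default_factory=dict)   # layer index -> (k, v)


def attention_kv(state, layer, live_k, live_v):
    """Return the (k, v) a layer should attend over in the current phase."""
    if state.mode == "capture":
        # Clean forward pass: record K/V and pass them through untouched.
        state.snapshots[layer] = (list(live_k), list(live_v))
        return live_k, live_v
    if state.mode == "replace":
        # Second forward pass: the live projections are ignored.
        return state.snapshots[layer]
    return live_k, live_v


def toy_codec(xs, step=0.5):
    """Stand-in for the real codec: snap each value to a coarse grid."""
    return [round(x / step) * step for x in xs]


state = HookState()

# Phase capture: codec OFF, record the clean K/V.
state.mode = "capture"
attention_kv(state, 0, [0.1, 0.9], [1.2, -0.4])

# Offline: run the codec once over every captured snapshot.
state.snapshots = {
    layer: (toy_codec(k), toy_codec(v))
    for layer, (k, v) in state.snapshots.items()
}

# Phase replace: whatever the layer projects now is discarded.
state.mode = "replace"
k, v = attention_kv(state, 0, [9.9, 9.9], [9.9, 9.9])
print(k, v)
```

The point of the design is that the codec sees only K/V produced by a clean forward pass, so its error cannot feed back into later layers' codec inputs.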

Expected outcome:

  if Δppl ≈ +8 %  →  the entire +39 pp non-linear compounding
                      was harness-integration; scenario A works on
                      vLLM at parity with HF.
  if Δppl still materially > +8 %  →  there is an intrinsic
                      vLLM/FA component on top of the harness effect.

Artifacts:
  benchmarks/e2e_ppl_validation_vllm_snapshot.py
  benchmarks/run_v1_3_ppl_snapshot_vllm.sh

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…3/59.38)

Scenario A (compress KV after a clean prefill, HF two-pass semantics
inside vLLM) gives:

  passage 1: Δppl +11.52%  top1 75.00%
  passage 2: Δppl +13.01%  top1 71.88%
  passage 3: Δppl +35.59%  top1 73.44%
  passage 4: Δppl +56.17%  top1 76.56%

  mean Δppl +29.07%  top1 74.22%  verdict REJECT

Harness delta vs in-forward mode:
  Mode                   Δppl      top-1   verdict
  HF 2-pass              +7.82%     78.97%  MARGINAL
  vLLM snapshot (this)  +29.07%     74.22%  REJECT
  vLLM in-forward       +35.33%     59.38%  REJECT

PR #16 Phase 6 predicted snapshot-mode vLLM would land near HF's
+7.82% because the sum of 22 non-boundary single-layer \u0394ppl
contributions was -3.9% and the 'extra +39 pp' was attributed to
in-forward cross-layer compounding. That prediction was WRONG.

Actual harness-integration contribution: ~6 pp of the 27 pp gap
(not 39 pp). The top-1 DOES jump substantially (59 → 74, within
5 pp of HF's 79), confirming that in-forward pollution was changing
the argmax — but the residual Δppl stays far from HF.

Revised bucket attribution of the 27 pp HF↔vLLM gap:

  in-forward vs snapshot (harness)           ~ 6 pp
  engine baseline shift (Phase 1 clean)      ~10 pp
  intrinsic engine compounding (FA bf16)     ~11 pp
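
As a cross-check, the mean row and the bucket totals follow directly from the numbers quoted in this message. The snippet below is a worked-arithmetic sketch; the dictionary keys are just labels.

```python
from statistics import mean

# Per-passage results for vLLM snapshot mode, as listed above.
passage_dppl = [11.52, 13.01, 35.59, 56.17]
passage_top1 = [75.00, 71.88, 73.44, 76.56]
mean_dppl = mean(passage_dppl)
mean_top1 = mean(passage_top1)

# Δppl (percent) per mode.
hf_two_pass = 7.82
vllm_snapshot = 29.07
vllm_in_forward = 35.33

total_gap = vllm_in_forward - hf_two_pass          # HF vs vLLM in-forward
harness_effect = vllm_in_forward - vllm_snapshot   # in-forward vs snapshot

# Rounded buckets from the revised attribution (pp).
buckets = {
    "in-forward vs snapshot (harness)": 6,
    "engine baseline shift": 10,
    "intrinsic engine compounding": 11,
}

print(f"mean Δppl {mean_dppl:+.2f}%  mean top-1 {mean_top1:.2f}%")
print(f"total gap {total_gap:.2f} pp, harness effect {harness_effect:.2f} pp")
print(f"bucket sum {sum(buckets.values())} pp")
```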

The dominant term is NOT the harness; it is the engine itself. FA
bf16 attention + softmax propagates the same codec residual
through 28 layers differently than HF eager's fp32-accumulate
softmax. This is the real root cause.

Deployment implication: Scenario A is the correct semantics to
deploy (compress already-filled KV cache for memory relief), and
it IS better than the in-forward harness, but on this codec recipe
it does not reach HF MARGINAL parity on vLLM. Top-1 preservation
(74%) is the first positive vLLM datapoint though; the remaining
gap is in the logit distribution, not the argmax.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…Quant)

Head-to-head analysis of the two approaches to KV cache compression
on vLLM. Covers:

  • PR-description vs community-verified quality numbers — PR
    #38479's 'tq3' at 4.9x compression shows GSM8K 0.72 in the PR
    header, but @mgoin (Neural Magic) independently measured 0.009
    and @varjoranta (A100) reports garbage output at the default
    config. The actually-stable operating point is k8v4 at 2.6x
    compression (FP8 keys + 4-bit uniform values), which is one
    compression tier less aggressive than our 4.61x target.

  • Algorithmic comparison — PR #38479 uses WHT on raw K (no
    PCA), Gaussian-default Lloyd-Max, uniform V, norm correction,
    per-head quantization. No Q-preconditioning, no outlier
    compensation, no calibrated centroids. Our v1.3 PPL has all
    three. Conversely, their engineering is substantially tighter:
    fused Triton store/decode kernels keep dequant in-kernel with
    the FA computation; no CPU round-trip; CUDAGraph compatible.

  • Quality comparison at matched compression — their verified
    stable cell (k8v4, 2.6x, GSM8K 96% of baseline) is not
    directly comparable to our v1.3 PPL production cell (4.61x,
    WikiText-103 Δppl +29%). When the community pushed their
    codec to 3-bit K + 2-bit V (matching our compression ratio),
    quality collapsed just as badly as ours does on vLLM.

  • Concrete recommendations ordered by cost/benefit:
    1. Add norm correction to our snapshot-mode harness (~1-2 pp).
    2. Per-head quantization instead of per-block pooling
       (~1-3 pp).
    3. Fuse our codec into a Triton decode kernel (closes the
       intrinsic engine bucket; large eng project).
    4. Port our algorithm (Q-precond + calibrated Lloyd-Max +
       outlier) into PR #38479's merged backend — long-term play
       where the two efforts converge.

Artifact: reports/v1_3_ppl/snapshot_mode/COMPARISON_VLLM_PR_38479.md

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
FluffyAIcode pushed a commit that referenced this pull request Apr 22, 2026
New crate kakeyaturbo-py/ — a thin pyo3 wrapper around the existing
kakeyaturbo Rust library that exposes a single function

    roundtrip_layer(array, **kwargs) -> (decoded, report)

which is a byte-for-byte drop-in replacement for

    subprocess.run([kakeyaturbo-bench, --input, --output,
                    --dump-decoded, --verify, ...])
    + write_kktv(in_path, arr) + read_kktv_f32(dec_path)
    + json.loads(rep_path.read_text())

End of the CPU subprocess + tmpfs KKTV round-trip that used to dominate
the PR #15 / PR #17 harness runtime (one fork per layer per stream per
forward pass).

Semantic contract (tests/test_roundtrip_cli_parity.py, 13/13 passing):

  * decoded tensor is np.testing.assert_array_equal vs the CLI path
    (not approx-equal — genuinely the same bits) for every CLI config
    the harness uses: mse / inner_product / linf metrics, exact vs
    randomized PCA, per-block vs --share-basis, with/without
    --centroids-file, with/without --outlier-threshold, and both
    list-of-floats and filesystem-path centroid input forms.
  * mean_block_mse matches within 1e-9 absolute — the gap is purely
    because kakeyaturbo-bench.rs writes its JSON report with '{:.10}'
    truncation; the underlying f64 is identical.
  * Input validation: 2D float32 C-contiguous required (non-contig
    rejected at the boundary, matching numpy 0.28's ReadonlyArray
    semantics); bit_width in 1..=4; centroids strictly ascending;
    metric in {mse, inner_product, linf}.
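
A rough sketch of the scalar-argument checks listed above, written in pure Python. `validate_codec_args` is a hypothetical helper, not the wrapper's API; the array checks (2D, float32, C-contiguous) are omitted to keep the example dependency-free.

```python
ALLOWED_METRICS = {"mse", "inner_product", "linf"}


def validate_codec_args(bit_width, metric, centroids=None):
    """Raise ValueError if an argument violates the contract described above."""
    if not 1 <= bit_width <= 4:
        raise ValueError(f"bit_width must be in 1..=4, got {bit_width}")
    if metric not in ALLOWED_METRICS:
        raise ValueError(f"metric must be one of {sorted(ALLOWED_METRICS)}")
    if centroids is not None:
        # Strictly ascending: every centroid must exceed its predecessor.
        pairs = zip(centroids, centroids[1:])
        if any(later <= earlier for earlier, later in pairs):
            raise ValueError("centroids must be strictly ascending")


# Accepted: 2-bit codec, mse metric, ascending centroids.
validate_codec_args(2, "mse", [-1.5, -0.5, 0.5, 1.5])
```

Rejecting bad input at the boundary keeps the failure in Python, with a readable message, rather than inside the Rust hot path.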

Implementation notes:

  * Rust hot path drops the GIL via Python::detach (pyo3 0.28's
    renamed allow_threads).  Multiple layers can be pipelined from a
    thread pool in the caller if desired.
  * PcaMethod::Randomized uses seed_offset = 0x9E37_79B9_7F4A_7C15,
    the same salt the kakeyaturbo-bench binary uses.  Two call paths
    are byte-identical modulo Rust RNG state, which is driven entirely
    by rotation_seed + seed_offset.
  * Cargo.lock checked in: pyo3 0.28 + numpy 0.28 pinned exactly to
    prevent silent API drift.
  * Crate declares unsafe_code=forbid on our glue; pyo3/numpy bring
    their own unsafe code as normal for FFI crates.

Harness wiring:

  * benchmarks/e2e_ppl_validation_vllm_full.py (the actual PR #15
    production-cell driver that produced the +35.33 % Delta-ppl
    number in reports/v1_3_ppl/vllm/): rust_roundtrip(arr, ...) now
    calls kakeyaturbo_py.roundtrip_layer directly.  No KKTV I/O, no
    subprocess.  Raises RuntimeError with a build hint if the wheel
    is missing (no silent fallback — ban-list clause 'no fallback
    paths' preserved).
  * benchmarks/e2e_ppl_validation_vllm.py (the simpler sibling
    harness): same swap.
  * Both harnesses keep their full CLI arg surface — the change is
    purely internal plumbing.  run_v1_3_ppl_full_vllm.sh is unchanged.

Build command:

    cd kakeyaturbo-py
    maturin build --release --strip --interpreter python3
    pip install target/wheels/kakeyaturbo_py-*.whl

Tested on rust 1.83, Python 3.12, ubuntu 24.04.  The wheel is abi3
(cp38+), so it works across all supported Python versions without
rebuilding.

Next milestone (M3 exit criterion): re-run the +35.33 % Delta-ppl cell
on H200 via run_v1_3_ppl_full_vllm.sh and verify the number
reproduces.  That is the semantic anchor before M4/M5 port the codec
into Triton kernels.
cursor Bot pushed a commit that referenced this pull request Apr 22, 2026
The full microscopic reasoning behind the +39 pp 'cross-layer
non-linear compounding' estimate from PR #16 Phase 6 existed only
in conversation, never in any FINDINGS.md. This commit preserves
that reasoning as an audit trail so future agents can recheck it
against new evidence (Option C being the decisive one).

Content:

  * Two coherent error channels in vLLM in-forward path:
      - direct: ε_l in layer l's attention output
      - indirect: ε_l → δ_l → W_k · δ_l shifts layer l+1's
                  pre-codec K/V input, so layer l+1's codec runs
                  on a 'wrong K'
  * These are sign-correlated across layers, so variance grows
    as (Nσ)² instead of Nσ² — √22 ≈ 4.7× amplification over
    linear composition
  * bf16 residual stream aggravates because correlated errors
    can't cancel sub-ULP; HF eager upcasts softmax to fp32,
    vLLM FA stays bf16
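
The √22 figure can be checked numerically. The sketch below assumes N equal-variance per-layer errors and compares the fully correlated case with the independent case; `accumulated_std` is an illustrative helper.

```python
import math


def accumulated_std(n_layers, sigma, correlated):
    """Std of the sum of n_layers errors, each with standard deviation sigma.

    Fully correlated errors add in amplitude: variance (N * sigma)^2.
    Independent errors add in variance:       variance N * sigma^2.
    """
    if correlated:
        variance = (n_layers * sigma) ** 2
    else:
        variance = n_layers * sigma ** 2
    return math.sqrt(variance)


n = 22  # non-boundary layers
ratio = accumulated_std(n, 1.0, True) / accumulated_std(n, 1.0, False)
print(f"amplification in standard deviation: {ratio:.2f}x")
```

The ratio is independent of sigma, so the amplification depends only on the number of layers that carry correlated error.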

This hypothesis predicts snapshot-mode on vLLM should reach ~+8%
(HF parity). PR #17's actual measurement was only -6 pp (+35 →
+29), so the mechanism logic is real but over-estimated by ~6x;
the ~11 pp 'intrinsic engine' bucket is where the actual root
cause lives.

Evidence FOR the hypothesis (Phase 2, Phase 4 signals, PR #17
top-1 +15 pp recovery) is listed alongside evidence AGAINST (PR
#17 Δppl stays high). The document explicitly says it's a draft
hypothesis with partial support, NOT a verified conclusion.

Status set so Option C (fully in-kernel backend) will conclusively
discriminate between (a) CPU round-trip vs (b) FA bf16 softmax
as the source of the remaining 11 pp. Corrections are pre-listed
for whichever outcome lands.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
FluffyAIcode pushed a commit that referenced this pull request Apr 22, 2026
… report

Port of PR #17's Qwen2 snapshot harness to Qwen3-4B under the current
vLLM v1 installation.  Three engineering changes on top of the PR #17
code:

1. Qwen3Attention has qk-norm between the qkv projection and RoPE;
   the capture/replace hook intercepts K, V POST-qk-norm, PRE-RoPE
   (matching the distribution the M2 Σ_q calibration was built on).

2. vLLM v1 runs the engine core in a subprocess by default, so
   module-level HookState set in the harness wouldn't reach the
   model-forward process.  Set VLLM_ENABLE_V1_MULTIPROCESSING=0 to
   force InprocClient — keeps model inference in the harness process
   where the monkey-patch is installed.

3. Install the Qwen3Attention patch via the existing kakeya_v1_3_ppl
   vllm.general_plugins entry point, gated on KAKEYA_SNAPSHOT_QWEN3=1
   so a regular vLLM run is unaffected.  Runs in the parent (same
   process as the harness due to InprocClient).
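
Changes 2 and 3 amount to environment-variable gating, which could look roughly like the sketch below. `select_plugin_branch` is hypothetical and only returns a label; the fail-fast check on VLLM_ENABLE_V1_MULTIPROCESSING is an added assumption, not something the commit claims to do.

```python
import os


def select_plugin_branch(env=None):
    """Decide which branch a plugin entry point would take (illustrative).

    Returns "snapshot-patch" only when the snapshot gate is set AND the
    engine core is forced in-process, so the monkey-patch and the model
    forward live in the same process.
    """
    env = os.environ if env is None else env
    if env.get("KAKEYA_SNAPSHOT_QWEN3") != "1":
        # Regular vLLM run: the snapshot hook is never installed.
        return "production-backend"
    if env.get("VLLM_ENABLE_V1_MULTIPROCESSING", "1") != "0":
        # Assumed safeguard: a subprocess engine core would not see
        # module-level hook state set in the harness process.
        raise RuntimeError(
            "snapshot mode needs VLLM_ENABLE_V1_MULTIPROCESSING=0"
        )
    return "snapshot-patch"


print(select_plugin_branch({}))
```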

Files:
  * benchmarks/e2e_ppl_validation_vllm_snapshot_qwen3.py — three-phase
    harness with --disable-q-precond / --disable-centroids /
    --disable-outlier ablation switches
  * vllm_backend/kakeya_v1_3_ppl/snapshot_hook.py — HookState +
    install_qwen3_snapshot_patch + per-layer diagnostics
  * vllm_backend/kakeya_v1_3_ppl/plugin.py — KAKEYA_SNAPSHOT_QWEN3
    branch (early-returns before installing the production backend)
  * reports/v1_3_ppl/snapshot_mode_qwen3/ — per-ablation JSONs +
    QWEN3_SCENARIO_A_REPORT.md

Measured on H200, Qwen3-4B, WikiText-103 test, 2048 ctx × 64 eval:

  Identity replace (skip all 36 layers):  Δppl = +0.00%,  top1 = 100%  (plumbing OK)
  Full production recipe (Σ_q + centroids + outlier, skip 0/1/34/35):
                                           Δppl = +8859%, top1 = 31%   REJECT
  Bare codec (Σ_q / centroids / outlier all off, 4 passages):
                                           Δppl = +619%,  top1 = 55%   REJECT

Root cause (see report): Qwen3-4B's Σ_q Cholesky factors have
median cond(L) = 252 vs DeepSeek-1.5B's 65.  Unwhiten side amplifies
codec error by the condition number.  Plumbing is correct; the M2
calibration + current codec recipe are too lossy for Qwen3-4B under
the snapshot-mode protocol.  Next steps (out of scope for this commit):
regularised Σ_q recalibration OR codec-budget uplift (b_K=4 / k=32 /
d_eff=96).
FluffyAIcode pushed a commit that referenced this pull request Apr 22, 2026
…203%

Measured CPU-vs-GPU codec side-by-side on captured Qwen3-4B layer-15
K (see /tmp/strict_cpu_vs_gpu.py on H200):

  CPU (Rust CLI, pooled-heads, reshape [n,H,D] → [n*H,D]):  L2 rel-err 0.3766
  GPU (kakeyaturbo_py.gpu_skeleton, per-head [H,n,D]):      L2 rel-err 0.1667

The 2.26x codec-quality gap stems purely from data grouping: the CPU
path's 512-row block sees 64 tokens × 8 kv-heads interleaved, so each
block's PCA basis has to compromise across 8 Qwen3-style per-head
qk-norm distributions.  The GPU path batches over kv-heads, so each
block's basis fits a single head's spectrum.

PR #17 didn't hit this because DeepSeek-1.5B has only 2 kv-heads;
Qwen3-4B has 8, amplifying the pooled-heads penalty.
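
The grouping difference can be illustrated with index bookkeeping alone. The sketch below assumes 2048 tokens, 8 kv-heads, and 512-row blocks on both paths; the per-head block length is an assumption, and both helper names are hypothetical.

```python
def pooled_blocks(n_tokens, n_heads, block_rows):
    """Row-major [n, H, D] flattened to [n*H, D], then cut into row blocks."""
    rows = [(t, h) for t in range(n_tokens) for h in range(n_heads)]
    return [rows[i:i + block_rows] for i in range(0, len(rows), block_rows)]


def per_head_blocks(n_tokens, n_heads, block_rows):
    """[H, n, D] layout: each head's tokens are blocked separately."""
    return [
        [(t, h) for t in range(start, min(start + block_rows, n_tokens))]
        for h in range(n_heads)
        for start in range(0, n_tokens, block_rows)
    ]


pooled = pooled_blocks(2048, 8, 512)
heads_in_first = {h for _, h in pooled[0]}
tokens_in_first = {t for t, _ in pooled[0]}
print(len(tokens_in_first), "tokens x", len(heads_in_first), "heads per pooled block")

per_head = per_head_blocks(2048, 8, 512)
print(all(len({h for _, h in block}) == 1 for block in per_head))
```

In the pooled layout every block mixes all heads, so a single PCA basis has to serve every head's distribution; in the per-head layout each block belongs to exactly one head.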

Added --gpu-codec to benchmarks/e2e_ppl_validation_vllm_snapshot_qwen3.py:
  * gpu_roundtrip() mirrors rust_roundtrip's signature but operates on
    the original [n, H, D] shape without flattening.
  * _gpu_codec_per_head() drives the exact same GPU kernels the vLLM
    attention backend uses in production: fit_skeleton_batched +
    encode_and_pack_batched + _unpack_slot_into_parts +
    decode_block_torch_from_parts.
  * share_basis=True (V-stream default) still falls back to Rust
    because GPU path doesn't implement pooled-across-blocks basis; use
    --no-share-basis-v to force GPU on both streams.

Measured on H200, Qwen3-4B, WikiText-103, 2048 ctx × 64 eval × 4 passages
(KAKEYA_SNAPSHOT_QWEN3=1, VLLM_ENABLE_V1_MULTIPROCESSING=0, bare codec
--- Σ_q / centroids / outlier all off, no share_basis_v):

  CPU pooled-heads bare codec:         Δppl = +619.42%, top1 = 55.08%
  GPU per-head     bare codec:         Δppl = +202.75%, top1 = 66.02%

3.05x PPL reduction and +11 pp top-1 agreement from an algorithmically
equivalent codec (same RSVD parameters, same bit widths, same
Gaussian-default Lloyd-Max) — the whole delta is data grouping.

Updated reports/v1_3_ppl/snapshot_mode_qwen3/QWEN3_SCENARIO_A_REPORT.md
with the new row and the revised next-steps list (share_basis_v on GPU,
regularised Σ_q measured on GPU layout, codec-budget uplift).
FluffyAIcode pushed a commit that referenced this pull request Apr 22, 2026
…_LAYERS

CPU-reference snapshot harness supports boundary skip via its
--boundary-skip-layers CLI flag.  The in-backend equivalent — which
kicks in when vLLM's engine calls _seal_and_write_block from the
attention forward — was missing.  Add it.

Environment-var driven so a single backend binary can serve
different recipes without rebuild:
    KAKEYA_BOUNDARY_SKIP_LAYERS="0,1,7,14,26,27"  # PR #17 DeepSeek-1.5B
    KAKEYA_BOUNDARY_SKIP_LAYERS="0,1,34,35"       # Qwen3-4B
    KAKEYA_BOUNDARY_SKIP_LAYERS=""                # no skip (default)

Semantics: when _seal_and_write_block sees a layer on the skip list,
the codec is bypassed and the bf16 K/V are stashed in
state.bf16_shadow (per-layer dict, already existed for the
KAKEYA_DEBUG_BF16_SHADOW probe path).  _decode_sealed consults the
shadow FIRST and returns the bf16 tensors verbatim — bit-exact
round-trip, zero codec distortion on boundary layers.  The paged-
cache byte slot for boundary blocks is zeroed as a deterministic
placeholder (never read).

Distinct from KAKEYA_SKIP_LAYERS (plugin.py) which tells the Σ_q /
centroid calibration bundle loader which layers to exclude.  The
two happen to coincide in the PR #17 recipe but are semantically
independent: a deploy can skip layer 0 from calibration while still
running codec on it, or vice versa.

Off-cache memory cost: ~128 KB per boundary block per layer
(n × n_kv × head_size × 2 bytes, 4 boundary layers × up to ctx /
512 blocks).  At the snapshot-harness workload (~4 blocks / layer)
this is a few MB total — negligible.  The paged-cache allocator
doesn't see this memory; noted in the docstring for when long-ctx
workloads make it worth tracking.

Tests (vllm_backend/kakeya_v1_3_ppl/tests/test_boundary_skip.py):
  * Parser: 7 tests on CPU (unset → empty, single layer, DS-1.5B
    recipe, whitespace-tolerant, malformed raises, cached).
  * Seal/decode on CUDA: 3 tests (boundary layer bf16-exact
    roundtrip, non-boundary layer runs codec unchanged, mixed
    boundary/codec layers don't cross-contaminate shadows).
  * 10/10 PASS on H200.
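
A parser with the behaviours named in the test list (unset gives an empty set, whitespace-tolerant, malformed input raises, result cached) might look like the sketch below. The function names and the choice to reject empty tokens are assumptions, not the backend's actual implementation.

```python
import functools
import os


def parse_skip_layers(raw):
    """Parse a comma-separated layer list such as "0,1,34,35"."""
    if raw is None or raw.strip() == "":
        return frozenset()  # unset or empty: no layers skipped
    layers = set()
    for token in raw.split(","):
        token = token.strip()  # tolerate spaces around commas
        if not token.isdigit():
            raise ValueError(f"malformed layer index: {token!r}")
        layers.add(int(token))
    return frozenset(layers)


@functools.lru_cache(maxsize=None)
def boundary_skip_layers():
    """Read KAKEYA_BOUNDARY_SKIP_LAYERS once and cache the parsed result."""
    return parse_skip_layers(os.environ.get("KAKEYA_BOUNDARY_SKIP_LAYERS"))


print(sorted(parse_skip_layers("0,1,7,14,26,27")))
```

Caching means the environment is read once per process, so changing the variable after the first call has no effect in this sketch.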

Existing test suite: 52/52 pass (the pre-existing
test_seal_exactly_one_full_block failure in
test_phase_b_end_to_end.py is unchanged by this commit —
verified by running the suite at 7e109ea both with and without
this patch; same rel_err=0.678 assertion failure in both).
FluffyAIcode pushed a commit that referenced this pull request Apr 22, 2026
Record the full ablation matrix for the GPU per-head codec path on
Qwen3-4B post-qk-norm K, as a companion to the DS-1.5B CPU findings
in reports/v1_3_ppl/FINDINGS.md and the harness architecture write-up
in reports/v1_3_ppl/snapshot_mode_qwen3/QWEN3_SCENARIO_A_REPORT.md.

Three levers helped:
  (a) per-head GPU codec (pooled-heads CPU → per-head GPU):
      +619.42% → +202.75%  (-417 pp, biggest single lever)
  (b) boundary-layer skip depth (4 → 14):
      +202.75% → +96.86%   (-106 pp, monotonic except 8-layer)
  (c) Σ_q whitening OFF (vs pre-RoPE Σ_q on):
      +8 858% / reg50 +492% → +202%  (-290-8600 pp)

Five levers did NOT help (measured on the 6-layer-boundary base):
  (d) Σ_q Tikhonov regularisation cond ≤ 50: +492%  (any L rotation
      hurts, not just bad conditioning)
  (e) M2 calibrated Lloyd-Max centroids: noise-level vs Gaussian
      (pre-RoPE / post-qk-norm distribution mismatch)
  (f) outlier compensation T=2.0: noise-level vs off
      (Qwen3 post-qk-norm K residuals lack the heavy tail)
  (g) Query-subspace codec (Approach B) at r ∈ {32..128}: r=32
      +17 758%, r=128 +292% — all worse than bare +202%
      (Qwen3 post-qk-norm Q is not low-rank on WikiText)
  (h) share_basis on V (K GPU + V CPU fallback): +226.79% vs 169.36%
      without share_basis

Current best on Qwen3-4B: 14-layer symmetric boundary skip + bare
GPU per-head codec + no Σ_q / calibrated centroids / outlier:
Δppl = +96.86 %, top-1 = 76.56 %.  Verdict REJECT (still ≥ +20%),
but top-1 above the PR #17 DS-1.5B MARGINAL line of ~74%.

Symmetric boundary-skip depth sweep:

  depth    layers                                    Δppl     top-1
  ----- ---------------------------------------- --------- --------
      4 [0,1,34,35]                               +202.75%   66.02%
      6 [0,1,2,33,34,35]                          +169.36%   67.97%
      7 [0,1,2,3,33,34,35] (asymmetric)           +169.51%   68.36%
      8 [0,1,2,3,32,33,34,35]                     +200.55%   66.41%  (non-monotonic outlier)
     10 [0..4, 31..35]                            +145.79%   71.48%
     12 [0..5, 30..35]                            +134.71%   75.39%
     14 [0..6, 29..35]                             +96.86%   76.56%
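
The symmetric rows of the sweep can be generated mechanically. A small sketch, assuming 36 layers for Qwen3-4B as in the identity-replace run; `symmetric_skip` is an illustrative helper and does not cover the asymmetric depth-7 row.

```python
def symmetric_skip(n_layers, depth):
    """Skip depth/2 layers at the start and depth/2 at the end of the stack."""
    if depth % 2 != 0 or not 0 <= depth <= n_layers:
        raise ValueError("depth must be even and between 0 and n_layers")
    half = depth // 2
    return list(range(half)) + list(range(n_layers - half, n_layers))


for depth in (4, 6, 8, 10, 12, 14):
    layers = symmetric_skip(36, depth)
    # Render in the comma-separated form the skip-layer knobs accept.
    print(depth, ",".join(str(layer) for layer in layers))
```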

11 new JSONs committed under
reports/v1_3_ppl/snapshot_mode_qwen3/bdry_sweep/; each carries
--boundary-skip-layers in its sidecar and is paired with the plain
--gpu-codec --no-share-basis-v --disable-q-precond --disable-centroids
--disable-outlier base.

Next planned: codec-budget sweep (b_K=3→4, k=16→32, d_eff=64→96) on
the 14-layer-boundary base to see if MARGINAL (≤+20%) is reachable.
FluffyAIcode pushed a commit that referenced this pull request Apr 22, 2026
Adds three codec-budget CLI knobs to the Qwen3 snapshot harness:
  --rsvd-target-rank-factor     d_eff = max(2, int(D * factor))
  --k-kmeans-k                  K stream spherical K-means count
  --v-kmeans-k                  V stream K-means count (GPU path only)

Threaded through codec_layer → gpu_roundtrip → _gpu_codec_per_head,
recorded in each run's JSON summary.  V-stream CPU-fallback path
(share_basis=True) is unaffected — its kmeans_k is still hardwired
to 16 in kakeyaturbo-bench via --k.
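
The rank formula quoted for --rsvd-target-rank-factor is easy to tabulate. The sketch assumes a head dimension D = 128, consistent with d_eff = 128 being described as full rank in the table below; `effective_rank` is an illustrative name.

```python
def effective_rank(head_dim, factor):
    """d_eff = max(2, int(D * factor)), as given for the CLI knob above."""
    return max(2, int(head_dim * factor))


for factor in (0.5, 0.75, 1.0):
    print(factor, effective_rank(128, factor))
```

The floor of 2 only matters for very small factors, where int() would otherwise truncate the rank to 0 or 1.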

Measurements on H200, Qwen3-4B, WikiText-103, 2048 ctx × 64 eval,
4 passages each, 14-layer boundary skip [0..6, 29..35], all other
guardrails off (Σ_q, calibrated centroids, outlier comp disabled).

  b_K  k   d_eff    Δppl      top-1   verdict
  --- --- -----  ---------  --------  -------
   3   16   64    +96.86%   76.56%   baseline
   4   16   64    +96.49%   77.73%   b_K alone: noise
   3   32   64    +74.10%   75.39%   k alone: **-22.8 pp**
   3   16   96   +100.04%   74.22%   d_eff alone: regresses
   4   32   64    +73.87%   75.78%
   4   16   96    +98.61%   73.83%
   4   32   96    +61.80%   78.12%   all three: **-35.1 pp**
   4   64   96    +61.84%   79.30%   **best**
   4   64  128    +69.78%   79.69%   d_eff=full rank regresses

K-means cluster count is the dominant budget knob; d_eff and b_K
only compound with larger k.  d_eff=128 saturates (codec is
PCA-saturated beyond 96 on post-qk-norm Qwen3-4B K).

Current best: Δppl = +61.84 %, top-1 = 79.30 %.  Still misses the
MARGINAL Δppl bar (≤+20 %), but top-1 is now above the PR #17 DS-1.5B
MARGINAL (74.22 %).

9 new JSONs under reports/v1_3_ppl/snapshot_mode_qwen3/budget_sweep/.
FINDINGS_GPU.md updated with the sweep table + file index.

Also reverts the WHT-skip + SRHT-sketch code (b5dffa4 from earlier
in this thread); those knobs were measured in
rsvd_wht_ablation/ at ≤3 pp variance-band changes, kept only as
committed report JSONs for reproducibility.
FluffyAIcode pushed a commit that referenced this pull request Apr 22, 2026
Attempted to push Δppl below +61.84% by 'shrink V, fund K':

  b_V  k_V  b_K  k_K    Δppl     top-1   Δ vs baseline
  ---  ---  ---  ---  ---------  -------  -------------
   2   16    4   64    +61.84%   79.30%   baseline
   1   16    4   64    +79.84%   76.56%   +18.0 pp worse
   2    8    4   64    +69.24%   78.52%   +7.4 pp worse
   1    8    4   64    +92.80%   74.22%   +31.0 pp worse  (worst)
   2   16    4  128    +65.98%   81.64%   +4.1 pp Δppl, +2.3 pp top-1 (Pareto)
   2    8    4  128    +67.95%   80.08%   +6.1 pp worse (strictly worse)

Findings:

 * V is at the Pareto frontier at (b_V=2, k_V=16).  Every shrink
   direction costs more Δppl than any K uplift can buy back.
   bit_width=1 (Lloyd-Max with 2 levels) is too coarse for V's
   flat-spectrum residuals.

 * k_K=64 → k_K=128 is the Pareto-optimal direction for top-1:
   spends 4 pp Δppl, gains 2.3 pp top-1 (new high of 81.64%).
   Useful for argmax / MMLU-style evals where top-1 matters more.

 * Pareto-incompatible trade: saving V bytes and spending them
   on K gives strictly worse Δppl AND strictly worse top-1.  V
   is paying in perplexity more than K is earning.

Two operational recipes now pinned in FINDINGS_GPU.md:

  Recipe A (Δppl-optimal, LM-eval):  b_K=4, k_K=64,  d_eff=96
                                     → +61.84%, 79.30%
  Recipe B (top-1-optimal, MCQ/MMLU): b_K=4, k_K=128, d_eff=96
                                     → +65.98%, 81.64%
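
The two recipes can be recovered as the Pareto-optimal rows of the sweep table above (lower Δppl is better, higher top-1 is better). The snippet is a sketch; the row labels are shorthand and `dominates` is an illustrative helper.

```python
# (Δppl %, top-1 %) for each row of the 'shrink V, fund K' table.
rows = {
    "b_V=2 k_V=16 k_K=64": (61.84, 79.30),
    "b_V=1 k_V=16 k_K=64": (79.84, 76.56),
    "b_V=2 k_V=8 k_K=64": (69.24, 78.52),
    "b_V=1 k_V=8 k_K=64": (92.80, 74.22),
    "b_V=2 k_V=16 k_K=128": (65.98, 81.64),
    "b_V=2 k_V=8 k_K=128": (67.95, 80.08),
}


def dominates(a, b):
    """True if a is no worse than b on both axes and differs from b."""
    return a[0] <= b[0] and a[1] >= b[1] and a != b


frontier = sorted(
    name
    for name, point in rows.items()
    if not any(dominates(other, point) for other in rows.values())
)
print(frontier)
```

Only the k_K=64 and k_K=128 rows at (b_V=2, k_V=16) survive, which matches Recipe A and Recipe B.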

5 new JSONs under reports/v1_3_ppl/snapshot_mode_qwen3/budget_sweep/.
This concludes the codec-budget ablation — V is on the frontier,
K has both a Δppl and a top-1 optimum at different points on the
same frontier, and neither Recipe hits MARGINAL (≤+20% Δppl) but
both sit well above the PR #17 DS-1.5B MARGINAL top-1 of 74.22%.
@FluffyAIcode FluffyAIcode merged commit bb6b823 into main Apr 23, 2026
@FluffyAIcode FluffyAIcode deleted the AgentMemory/v1-3-ppl-snapshot-mode-vllm-102e branch April 23, 2026 15:52