v1.3 PPL HF↔vLLM gap decomposition: localised to cross-layer non-linear compounding (+39 pp) and baseline shift (~10 pp) by FluffyAIcode · Pull Request #16 · FluffyAIcode/LLM-KV--Cache-compress

FluffyAIcode · 2026-04-22T00:28:33Z

Goal

PR #15 left a 27 pp Δppl gap unexplained between the v1.3 PPL
production cell on HF (+7.82 %, MARGINAL) and vLLM 0.7.3 FLASH_ATTN
(+35.33 %, REJECT) at the same codec config on the same DS-Distill
model. Five hypotheses were ruled out there; this PR measures the
remaining candidates with four targeted phases and assigns concrete
numbers to each bucket.

Δppl bucket table

#	Bucket	Estimate	Phase
1	Engine baseline shift (clean model)	~10 pp (−11 % PPL rel; 0.145 KL; 18 % top-1 disagree)	Phase 1+5
2	Codec residual magnitude (pre-hook I/O identity)	~0 (engine-agnostic; mse ratio 1.01±0.05)	Phase 4
3	Engine noise sensitivity (σ·rms(K)·randn linear)	HF more sensitive (−21 pp at σ=0.01) — doesn't explain the gap	Phase 2
4	Boundary-layer concentration (already skipped)	+69 pp saved by SPRINT_CLOSEOUT boundary policy	Phase 3
5	Cross-layer non-linear compounding	+39 pp (joint cell − Σ singletons over 22 quiet layers)	Phase 3

Executive summary

1. The codec is innocent. It sees statistically identical pre-RoPE
   K/V from HF and vLLM (Phase 4, mean input magnitudes within 1 %)
   and produces matched reconstruction errors (mse ratio median 1.01).
   HF-origin vs vLLM-origin calibration (PR #15, H3) is the same story.
2. The engines disagree on the clean model. With codec OFF, HF
   eager and vLLM FA give different logits on the same tokens: mean
   PPL rel gap −11 %, symmetric KL 0.145 on top-20, top-1 disagree
   18 %. This is a baseline shift in the denominator of the Δppl
   metric; it accounts for roughly 10 pp of the 27 pp gap.
3. Engine sensitivity to σ·randn goes the wrong way. In the
   linear regime (σ ≤ 0.01), HF eager is more sensitive to matched
   Gaussian noise than vLLM FA. The "FA bf16 softmax amplifies
   codec residuals more" theory is falsified; the gap is not about
   generic noise.
4. Root cause: cross-layer non-linear compounding in vLLM's
   residual stream. On the production recipe, the 22 "quiet" layers'
   single-layer Δppl contributions are individually small (all
   |Δppl| < 9 %; top-1 ≥ 86 %), and they sum to −3.9 %. But the
   joint forward measures +35.3 %. The difference, +39 pp
   (35.3 − (−3.9)), is a cross-layer interaction on vLLM that HF's
   teacher-force-over-DynamicCache path does not compound as
   aggressively.
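
The compounding figure above is a simple residual: the joint-cell Δppl minus the sum of the single-layer cells. A minimal sketch of that arithmetic follows; the function name is hypothetical, and because only the sum of the 22 singletons (−3.9 %) is stated here, the list holds that one aggregate value rather than per-layer numbers.

```python
def cross_layer_compounding(joint_dppl_pct, singleton_dppl_pcts):
    """Joint-cell dppl minus the additive prediction, in percentage points.

    A value near zero would mean the layers contribute additively;
    a large positive value means the joint forward is worse than the
    sum of its single-layer parts.
    """
    additive_prediction = sum(singleton_dppl_pcts)
    return joint_dppl_pct - additive_prediction


# Only the aggregate of the 22 quiet-layer singletons is reported (-3.9 %),
# so it is passed as a single entry; per-layer values are not invented here.
quiet_layer_singletons = [-3.9]
compounding_pp = cross_layer_compounding(35.33, quiet_layer_singletons)
print(f"cross-layer compounding: {compounding_pp:+.2f} pp")
```

With the per-layer values from the Phase 3 artifacts substituted for the aggregate, the same call would reproduce the bucket-5 estimate.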

Per-phase one-liners

Phase 1+5: codec OFF, HF and vLLM disagree on the clean model
(KL 0.145, PPL rel gap −11 %).
Phase 4: the codec sees and produces engine-agnostic K/V and
residuals (1 % delta).
Phase 2: matched σ·rms injection, HF more sensitive in the
linear regime; not the cause.
Phase 3: layer 0 alone → +56 % Δppl; 6-layer SPRINT_CLOSEOUT
boundary skip saves +69 pp; 22 quiet layers' singletons sum to
−3.9 %; joint cell = +35.33 % → +39 pp cross-layer compounding.
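
The matched-noise injection used in Phase 2 (σ·rms(K)·randn) can be sketched in a few lines. This is an illustrative stdlib-only version operating on a flat list of floats; the function names are hypothetical, and whether the real harness computes the rms per tensor, per head, or per token is an assumption not settled by this description.

```python
import math
import random


def rms(values):
    """Root-mean-square magnitude of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def inject_matched_noise(values, sigma, rng):
    """Add Gaussian noise whose std is sigma times the rms of the input.

    Scaling by rms(values) keeps the relative perturbation identical across
    engines even if their K/V magnitudes differ slightly.
    Assumption: one rms per call (i.e. per tensor passed in).
    """
    scale = sigma * rms(values)
    return [v + scale * rng.gauss(0.0, 1.0) for v in values]


rng = random.Random(0)
k_flat = [rng.gauss(0.0, 2.0) for _ in range(20000)]  # synthetic stand-in for K
k_noisy = inject_matched_noise(k_flat, sigma=0.01, rng=rng)

# Relative noise level actually injected; should sit close to sigma.
noise = [a - b for a, b in zip(k_noisy, k_flat)]
relative_noise = rms(noise) / rms(k_flat)
print(f"relative noise: {relative_noise:.4f}")
```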

Deployment implications

HF users

No change; SPRINT_CLOSEOUT +7.82 % MARGINAL reproduces.

vLLM users

Honest in-engine Δppl is +35 % REJECT at the SPRINT_CLOSEOUT recipe.
Two cheap measured mitigations:

1. Extend boundary skip to {2, 6, 11} in addition to the
   existing {0, 1, 7, 14, 26, 27}. Each of these layers costs 5-8 pp
   as a single-layer cell; expected savings on the joint cell are
   10-15 pp.
2. Adaptive per-layer bit-width: K b=3 globally except K b=4
   on the 3 hot layers. Addresses the non-linear compounding at
   its source; keeps 19/28 of the ratio benefit.

Both are follow-up experiments. The important contribution of this
PR is that the cause is now localised — no more unexplained gap.
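
The two mitigations differ only in what happens to the three hot layers, which a small per-layer plan makes explicit. The sketch below is hypothetical (the harness is not known to expose such a helper) and assumes the "3 hot layers" of the second mitigation are the same {2, 6, 11} named in the first.

```python
N_LAYERS = 28
BOUNDARY_LAYERS = {0, 1, 7, 14, 26, 27}   # existing SPRINT_CLOSEOUT skip set
HOT_LAYERS = {2, 6, 11}                   # assumption: same set for both mitigations


def k_bits_plan(n_layers, skip_layers, hot_layers, base_bits=3, hot_bits=4):
    """Map each layer to its K bit-width; None means the layer stays bf16."""
    plan = {}
    for layer in range(n_layers):
        if layer in skip_layers:
            plan[layer] = None
        elif layer in hot_layers:
            plan[layer] = hot_bits
        else:
            plan[layer] = base_bits
    return plan


# Mitigation 1: hot layers join the skip set and stay uncompressed.
extended_skip = k_bits_plan(N_LAYERS, BOUNDARY_LAYERS | HOT_LAYERS, set())
# Mitigation 2: hot layers stay compressed but get one extra K bit.
adaptive_bits = k_bits_plan(N_LAYERS, BOUNDARY_LAYERS, HOT_LAYERS)

layers_at_b3 = sum(1 for bits in adaptive_bits.values() if bits == 3)
print(f"{layers_at_b3}/{N_LAYERS} layers remain at K b=3")
```

Either way, 19 of the 28 layers keep the K b=3 setting, which is where the 19/28 figure comes from.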

What's in this branch

Harnesses (4)

benchmarks/gap_phase1_5_engine_baseline.py — Phase 1+5: runs HF
and vLLM on the same tokens with codec off; reports per-passage
PPL, top-K KL, top-1 agreement.
benchmarks/gap_phase2_noise_sensitivity.py — Phase 2: reimplements
Qwen2Attention.forward on each engine to inject
σ·rms(K/V)·randn at the pre-RoPE hook; sweeps σ.
benchmarks/gap_phase3_per_layer_vllm.py — Phase 3: reuses the
production harness, inverts boundary_skip_layers per cell to
activate codec on exactly one layer at a time.
benchmarks/gap_phase4_residual_magnitude.py — Phase 4: captures
pre-RoPE K/V from HF eager and vLLM with non-invasive patches,
runs the v1.3 codec on each, compares MSE / relative norm per
layer.
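
For readers unfamiliar with the Phase 1+5 metrics, here is one plausible way to compute a top-K symmetric KL and top-1 agreement from two logit vectors. It is a sketch, not the harness code: the choice of support (top-K of the first engine), the renormalisation over that support, and the use of the average of the two KL directions are all assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def topk_symmetric_kl(logits_a, logits_b, k=20):
    """Average of KL(a||b) and KL(b||a) restricted to the top-k tokens of a.

    Both distributions are renormalised over the shared support so each sums
    to one there. Extremely negative logits could underflow to zero
    probability and break the log; real code would clamp.
    """
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    support = sorted(range(len(p_a)), key=lambda i: p_a[i], reverse=True)[:k]
    z_a = sum(p_a[i] for i in support)
    z_b = sum(p_b[i] for i in support)
    kl_ab = sum((p_a[i] / z_a) * math.log((p_a[i] / z_a) / (p_b[i] / z_b)) for i in support)
    kl_ba = sum((p_b[i] / z_b) * math.log((p_b[i] / z_b) / (p_a[i] / z_a)) for i in support)
    return 0.5 * (kl_ab + kl_ba)


def top1_agree(logits_a, logits_b):
    """True when both engines rank the same token first."""
    best_a = max(range(len(logits_a)), key=lambda i: logits_a[i])
    best_b = max(range(len(logits_b)), key=lambda i: logits_b[i])
    return best_a == best_b


engine_a = [2.0, 1.0, 0.5, -1.0]   # toy logits, not measured data
engine_b = [1.0, 2.0, 0.5, -1.0]
kl_value = topk_symmetric_kl(engine_a, engine_b, k=4)
print(f"symmetric KL = {kl_value:.4f}, top-1 agree = {top1_agree(engine_a, engine_b)}")
```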

Reports (6 files)

reports/v1_3_ppl/gap_decomposition/FINDINGS.md — consolidated
bucket decomposition (this PR's deliverable).
reports/v1_3_ppl/gap_decomposition/phase{1,2,3,4}/FINDINGS.md
and corresponding .json artifacts for each phase.

Test status

Python: syntax-clean.
H200 runs: all four phases completed; results committed.

Relationship to other PRs

PR #12 (E2E PPL validation: codec REJECTs downstream on real Qwen2.5, major finding, paper claims must be revised): HF harness that first flagged the downstream-quality
problem on bare v1.3.
PR #13 (draft; Outlier compensation + Besicovitch-product skeleton, diagnostic sprint): SPRINT_CLOSEOUT definition of "v1.3 PPL".
PR #14 (scaffolding; vLLM integration scaffolding for the kakeyaturbo codec: codec port + Attention.forward hook + harness): Rust v1.3 port + post-RoPE hook.
PR #15 (v1.3 PPL on vLLM: production cell + per-channel attribution, K 64%, V 31%, V-outlier worth −4 pp): v1.3 PPL production recipe on vLLM + per-channel K/V
attribution; identified the remaining 27 pp gap that this PR
decomposes.
This PR: gap decomposition + deployment recommendation
(boundary skip extension + adaptive per-layer K bit-width).

Brings the v1.3 pieces that are shared by the e2e validation branch onto main as a clean baseline for vLLM integration:
- kakeyaturbo/src/codec.rs: PcaMethod enum (Exact | Randomized{...}), CodecParams gets pca_method field, fit_pca_dispatch routes to exact or randomized PCA path.
- kakeyaturbo/src/pca.rs: adds fit_weighted_pca_randomized() (Halko-Martinsson-Tropp randomized SVD with power iterations).
- kakeyaturbo/src/lib.rs: re-export PcaMethod.
- kakeyaturbo/src/bin/kakeyaturbo-bench.rs: new CLI flags --pca-method exact|randomized --rsvd-target-rank N --rsvd-oversample N --rsvd-power-iters N --dump-decoded PATH (write decoded KKTV for external drivers)
- benchmarks/e2e_ppl_validation.py: HF-transformers harness that prefills DynamicCache, round-trips every full-attention layer through the Rust codec, teacher-forces continuation tokens, and reports Δppl / top1 / KL.
- kakeyaturbo/tests/integration.rs: update CodecParams initializer for the new field.
All 153 Rust unit + integration + proptest cases pass.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

benchmarks/e2e_ppl_validation_vllm.py is a drop-in alternative to e2e_ppl_validation.py that routes the forward pass through vLLM rather than HF eager attention, so the measured Δppl reflects the codec's behaviour under the production inference engine.
Integration approach:
- Monkey-patch vllm.attention.layer.Attention.forward (installed before LLM construction).
- The patched forward, when CodecState.active is True, round-trips the K and V tensors through the v1.3 Rust codec (kakeyaturbo-bench --dump-decoded) before passing them on to the underlying paged-attention kernel. K uses inner_product metric, V uses mse + share_basis (asymmetric config matching the HF harness).
- Each passage is evaluated twice: once with the codec OFF (reference) and once ON (alt), using vLLM's prompt_logprobs=1. Per-position PPL and top-1 agreement over [ctx_len, ctx_len+n_eval) are compared with the same ACCEPT/MARGINAL/REJECT verdict as the HF harness, so the two engines' numbers can be placed side-by-side.
benchmarks/run_v1_3_ppl_vllm.sh: convenience driver that builds the Rust bench binary and launches the harness with the standard smoke config (Qwen2.5-0.5B, ctx=1024, 2 passages, b=2, rsvd).
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

- Patched forward now has the exact (query, key, value, kv_cache, attn_metadata) signature vLLM's custom-op dispatcher expects.
- Use self.head_size from the Attention module to reshape the KV tensor correctly regardless of whether it enters as 2D [num_tokens, num_kv_heads * head_size] or already-reshaped 3D.
- Use self.layer_name (vLLM's stable per-layer identifier) as the default layer_id; fall back to a module-scope counter only when not present.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

On Vast.ai vLLM is installed into /venv/main, not into the system python3 on PATH. Default PYTHON_BIN to the venv python when that venv exists, and allow override via the PYTHON_BIN env var. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Adds reports/v1_3_rsvd_rope/e2e_ppl_vllm_smoke/:
- qwen2_5_0_5b_vllm.json: per-passage metrics (Qwen2.5-0.5B, ctx=1024, n_eval=64, 2 passages, b=2 rsvd, randomized PCA, vr=0.95).
- FINDINGS.md: engine setup, cross-engine comparison against the HF harness from PR #12, reproduction instructions.
Result summary:
  Δppl mean = +291.9 % (passage 1: +192 %, passage 2: +391 %)
  top1 mean = 46.9 %
  verdict   = REJECT (threshold is |Δppl|<=1% AND top1>=95%)
This confirms on the production inference engine (vLLM 0.7.3 with Flash-Attention on H200) what PR #12 found on HF eager: the v1.3 codec at its tier-1 setting does not preserve downstream quality. The magnitude of the degradation is smaller on vLLM than on HF (+292% vs +29,086%), but both clearly REJECT.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
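
The verdict quoted above rests on two quantities derived from per-position log-probabilities over the evaluation window. The following sketch shows one way to compute them; it assumes Δppl is the relative change in perplexity between the codec-ON and codec-OFF runs, and it implements only the stated |Δppl| <= 1 % AND top1 >= 95 % test, since the MARGINAL boundary is not given in this text. All function names are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """exp of the negative mean log-probability of the true tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def delta_ppl_pct(ref_logprobs, alt_logprobs):
    """Relative perplexity change of the codec-ON run vs the reference, in %."""
    return 100.0 * (perplexity(alt_logprobs) / perplexity(ref_logprobs) - 1.0)


def passes_accept_threshold(dppl_pct, top1_pct):
    """The quoted threshold: |dppl| <= 1 % and top-1 agreement >= 95 %."""
    return abs(dppl_pct) <= 1.0 and top1_pct >= 95.0


# Toy log-probs (not measured data): the alt run is worse by log(2) per token,
# which doubles perplexity, i.e. dppl = +100 %.
ref = [-1.0, -2.0, -0.5, -1.5]
alt = [lp - math.log(2.0) for lp in ref]
toy_dppl = delta_ppl_pct(ref, alt)
print(f"toy dppl = {toy_dppl:+.1f} %")

# The smoke result reported above fails the threshold.
smoke_ok = passes_accept_threshold(291.9, 46.9)
```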

Per SPRINT_CLOSEOUT.md (PR #13), the production recipe is "v1.3 PPL" = v1.3 RSVD + 4 guardrails (Q-preconditioning, calibrated Lloyd-Max K codebook, 6-layer boundary protection, outlier compensation T=2.0).
The smoke result landed in this PR's last commit (+292% Δppl, 47% top-1 on Qwen2.5-0.5B + bare v1.3 b=2) is the V0 baseline under the sprint's ladder, NOT the production v1.3 PPL. Its number aligns with the ladder's V0 cell (+355% Δppl, 42% top-1 on HF / DS-Distill).
Remove that datapoint from this PR; this PR now scopes only the reusable vLLM integration scaffolding (codec port + Attention.forward monkey-patch + harness skeleton).
The production-recipe integration is moved to a follow-up branch: AgentMemory/v1-3-ppl-full-guardrails-vllm-102e
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Brings the calibrated Lloyd-Max codebook + outlier compensation into the Rust codec so the full 'v1.3 PPL' production recipe can be driven from the vLLM harness.
New CodecParams fields:
- custom_centroids: Option<Vec<f32>> (calibrated Lloyd-Max table)
- outlier_threshold: Option<f32> (residual threshold T)
- exact_rank_cap: Option<usize> (caps d_eff on exact PCA)
- skeleton_dtype fp16|fp32
New CLI flags on kakeyaturbo-bench:
  --centroids-file PATH    load 2^bits LE-f32 sorted centroid table
  --outlier-threshold T    extract post-WHT residual coords with |scaled_residual| > T as (u16 idx, f16 val) sparse overrides at decode
Internal wiring: *_with_centroids variants of encode_block / decode_block / encode_layer / decode_layer thread the centroid table and outlier buffer through the block pipeline without changing the wire format when neither is set (bit-compatible default).
153 Rust unit + 5 integration + 6 proptest = 164 tests pass.
Source: cherry-pick of 521e97b ('outlier compensation: Rust codec support + Python harness + 4-passage PPL validation on DS-Distill') onto the v1.3 scaffolding branch (includes the preceding Steps 3+4 Lloyd-Max + boundary infrastructure from 05dfbc5).
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
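
To make the two file-level conventions above concrete, here is a stdlib-only Python sketch of a centroid table stored as 2^bits little-endian float32 values and of threshold-based outlier extraction with (u16 idx, f16 val) packing. It mirrors the description only; the Rust codec's actual wire format and decode logic are not shown in this text, so the helper names and the "override replaces the decoded coordinate" behaviour are assumptions.

```python
import os
import struct
import tempfile


def write_centroids(path, centroids):
    """Write a sorted centroid table as little-endian float32 values."""
    table = sorted(centroids)
    with open(path, "wb") as fh:
        fh.write(struct.pack(f"<{len(table)}f", *table))


def read_centroids(path, bits):
    """Read exactly 2**bits little-endian float32 centroids; check ordering."""
    count = 1 << bits
    with open(path, "rb") as fh:
        raw = fh.read()
    if len(raw) != 4 * count:
        raise ValueError(f"expected {4 * count} bytes for b={bits}, got {len(raw)}")
    table = list(struct.unpack(f"<{count}f", raw))
    if table != sorted(table):
        raise ValueError("centroid table is not sorted")
    return table


def extract_outliers(scaled_residual, threshold):
    """Collect (index, value) pairs whose magnitude exceeds the threshold T."""
    return [(i, v) for i, v in enumerate(scaled_residual) if abs(v) > threshold]


def pack_outlier(idx, val):
    """Pack one override as (u16 index, f16 value), 4 bytes, little-endian."""
    return struct.pack("<He", idx, val)


def apply_outliers(decoded, overrides):
    """Assumed decode step: replace quantised coordinates by stored overrides."""
    repaired = list(decoded)
    for idx, val in overrides:
        repaired[idx] = val
    return repaired


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_b2_centroids.f32")  # hypothetical file name
    write_centroids(demo_path, [1.5, -0.5, 0.5, -1.5])
    table = read_centroids(demo_path, bits=2)

residual = [0.5, -2.5, 1.9, 3.0]           # toy scaled residual
overrides = extract_outliers(residual, threshold=2.0)
repaired = apply_outliers([0.5, -1.5, 1.5, 1.5], overrides)
```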

Artifacts needed for the 'v1.3 PPL' production recipe:
  reports/v1_4_q_pca/flagship/deepseek_distill_q_calib.{safetensors,json}
    Σ_q Cholesky factors per (layer, kv-head) for DeepSeek-R1-Distill-Qwen-1.5B (28 layers × [2, 128, 128]).
  reports/v1_4_q_pca/calibrated_codebook/ds_K_b{2,3}_centroids.f32
  reports/v1_4_q_pca/calibrated_codebook/ds_V_b2_centroids.f32
    Empirical Lloyd-Max centroid tables (2^bits LE-f32 floats each) trained on pooled 25M post-WHT residual samples from DS-Distill.
Python helpers:
  benchmarks/q_precondition.py           K-stream whitening K_tilde = K @ L
  benchmarks/q_calibration.py            offline Σ_q Cholesky calibration
  benchmarks/lloyd_max_calibration.py    offline residual codebook fitting
These are the pieces the forthcoming vLLM harness extension needs to drive the full v1.3 PPL recipe (Q-precond + calibrated Lloyd-Max + boundary skip + outlier T=2.0).
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

benchmarks/e2e_ppl_validation_vllm_full.py drives the full recipe:
1. Q-preconditioning K_tilde = K @ L (per layer, per kv-head)
2. Calibrated Lloyd-Max via --centroids-file to Rust codec
3. 6-layer boundary skip [0, 1, 7, 14, 26, 27] stay bf16
4. Outlier compensation T = 2.0 on K residual (--outlier-threshold)
Unlike the scaffolding harness (e2e_ppl_validation_vllm.py) which hooks at vllm.attention.layer.Attention.forward (post-RoPE), this harness patches vllm.model_executor.models.qwen2.Qwen2Attention.forward to intercept K/V immediately after the QKV projection, BEFORE RoPE:
  qkv → split(q, k, v)
  k ← unwhiten(codec_roundtrip(whiten(k), centroids, outlier))   [pre-RoPE]
  v ← codec_roundtrip(v, centroids)
  q, k ← rotary_emb(positions, q, k)
  … rest of normal attention runs on the repaired K and V
This matches the PR #13 HF harness (benchmarks/e2e_ppl_pre_rope.py) semantically: Σ_q is calibrated on pre-RoPE K distributions, so the whitening must be applied to pre-RoPE K for the Sigma_q-MSE equivalence to hold.
benchmarks/run_v1_3_ppl_full_vllm.sh: driver with env-overridable defaults matching the SPRINT_CLOSEOUT production cell:
  DS-Distill D=128, K b=3 + V b=2, T=2.0, 6 bdry → target Δppl +7.82% / top-1 78.97% / ratio 4.61× (MARGINAL)
Syntax-check clean; end-to-end on GPU pending.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
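
The Sigma_q-MSE equivalence invoked here can be written out in a few lines. This is a sketch under stated assumptions: q and k are row vectors, Σ_q = E[qᵀq] is the second moment of the query rows with Cholesky factor L (Σ_q = L Lᵀ), the codec quantises the whitened row k̃ = kL, and the unwhitened reconstruction is k̂ = \hat{\tilde{k}} L⁻¹. The symbols e and ẽ are introduced only for this illustration.

```latex
\begin{aligned}
e &= k - \hat{k}, \qquad
\tilde{k} = kL, \qquad
\tilde{e} = \tilde{k} - \hat{\tilde{k}} = eL, \\[4pt]
\mathbb{E}_q\!\left[\left(q\,e^{\top}\right)^{2}\right]
  &= e\,\mathbb{E}_q\!\left[q^{\top}q\right]e^{\top}
   = e\,\Sigma_q\,e^{\top}
   = e\,L L^{\top} e^{\top}
   = \lVert \tilde{e} \rVert_2^{2}.
\end{aligned}
```

Under these assumptions, minimising plain MSE on the whitened K rows is the same as minimising the expected squared error of the attention score q·k, which is why the whitening has to be applied in the same frame in which Σ_q was estimated.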

Full production recipe (K b=3 Q-precond + calibrated Lloyd-Max + outlier T=2.0 + 6-layer boundary skip; V b=2 calibrated + share-basis) runs end-to-end on vLLM 0.7.3 for DeepSeek-R1-Distill-Qwen-1.5B.
Result (ctx=2048, n_eval=64, 4 WikiText-103 passages):
  Passage 1: Δppl  -8.86%   top1 56.2%
  Passage 2: Δppl +32.79%   top1 51.6%
  Passage 3: Δppl +40.46%   top1 65.6%
  Passage 4: Δppl +76.92%   top1 64.1%
  Mean Δppl = +35.33 %, mean top-1 = 59.4 % → REJECT
Guardrails move bare v1.3 b=2 from +292% on vLLM (PR #14) → +35% on vLLM (this PR), an ~8× Δppl improvement, in directional agreement with the HF ladder (+356% → +8% on DS-Distill). However vLLM ends ~4.5× worse in Δppl than HF at the same codec config on the same model family. Two likely causes (Flash-Attention numerics vs. HF eager, and CPU/GPU fp32<->bf16 boundary noise) are documented with follow-up sweeps in FINDINGS.md.
Artifact: reports/v1_3_ppl/vllm/ds_distill_qwen_1_5b_vllm_full.json
Full write-up: reports/v1_3_ppl/vllm/FINDINGS.md
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Motivation: PR #15 showed vLLM v1.3 PPL full-recipe gives Δppl +35.3% vs HF's +7.82% on the same codec config on DS-Distill. Two hypotheses were named but not separated:
(1) Σ_q was calibrated on pre-RoPE Q, but Flash-Attention computes Q@K.T on post-RoPE Q, breaking the Sigma_q -> K_tilde metric equivalence;
(2) the per-forward CPU↔GPU and fp32↔bf16 round-trip itself accumulates enough numerical noise to degrade PPL.
This harness runs four cells against a single shared reference, pair-wise per passage so all cells observe the same ref PPL:
  identity-pre_qp   whiten → identity codec → unwhiten; isolates hypothesis (2): everything except compression
  codec-no_qp       real codec, no whitening; isolates "codec only"
  codec-pre_qp      production recipe (matches PR #15)
  codec-post_qp     codec + post-RoPE Σ_q_post self-calibrated online from this run's own post-RoPE Q tensors; isolates hypothesis (1): correct whitening under FA
Key implementation details:
- Qwen2Attention.forward is patched once; the patch branches on CodecState.mode/qp_mode to pick the right hook.
- PostRopeQCalib accumulates Sum(q q.T) per (layer, kv-head) during a cheap dedicated calibration forward pass (codec off), then Cholesky-factors with a small ridge for stability. For GQA models (num_heads > num_kv_heads) we pool Q heads in the same KV group before accumulating, matching the metric FA actually computes.
- All cells share the same reference, computed once; each cell runs its own alt_pls and compares per passage.
- Identity codec does skip the kakeyaturbo-bench subprocess but still goes through the full fp32↦numpy↦CPU↦numpy↦fp32 path, so it measures the complete CPU↔GPU noise floor.
Syntax-clean; GPU run on Vast.ai H200 pending in the same turn.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
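
The calibration step described for PostRopeQCalib (accumulate Sum(q q.T) over pooled GQA heads, then Cholesky-factor with a small ridge) can be illustrated with a dependency-free sketch. The function names are hypothetical, the normalisation by row count and the absolute ridge added to the diagonal are assumptions, and a real implementation would use a tensor library rather than nested lists.

```python
import math


def pool_gqa_group(q_heads):
    """Concatenate the rows of every Q head that shares one KV head."""
    return [row for head in q_heads for row in head]


def second_moment(q_rows):
    """Mean of the outer products q^T q over all rows (a D x D matrix)."""
    dim = len(q_rows[0])
    sigma = [[0.0] * dim for _ in range(dim)]
    for q in q_rows:
        for i in range(dim):
            for j in range(dim):
                sigma[i][j] += q[i] * q[j]
    return [[v / len(q_rows) for v in row] for row in sigma]


def cholesky_with_ridge(sigma, ridge=1e-6):
    """Lower-triangular L with L L^T = sigma + ridge * I."""
    dim = len(sigma)
    lower = [[0.0] * dim for _ in range(dim)]
    for i in range(dim):
        for j in range(i + 1):
            target = sigma[i][j] + (ridge if i == j else 0.0)
            acc = target - sum(lower[i][k] * lower[j][k] for k in range(j))
            lower[i][j] = math.sqrt(acc) if i == j else acc / lower[j][j]
    return lower


# Two toy Q heads (D=2) sharing one KV head; values are illustrative only.
head_0 = [[1.0, 0.0], [0.0, 1.0]]
head_1 = [[1.0, 1.0], [2.0, 0.0]]
sigma_q = second_moment(pool_gqa_group([head_0, head_1]))
chol = cholesky_with_ridge(sigma_q, ridge=1e-6)

# Reconstruct L L^T to confirm the factorisation.
recon = [[sum(chol[i][k] * chol[j][k] for k in range(2)) for j in range(2)] for i in range(2)]
```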

…irection
Paired 4-cell ablation on DS-Distill + vLLM H200 (shared ref):
  identity-pre_qp   Δppl   -0.29%   top1 98.83%   ACCEPT
  codec-no_qp       Δppl +152.78%   top1 59.38%   REJECT
  codec-pre_qp      Δppl  +35.33%   top1 59.38%   REJECT (= PR #15)
  codec-post_qp     Δppl  +54.28%   top1 57.03%   REJECT
Findings:
- H2 (CPU↔GPU + fp32↔bf16 noise) is ruled out. The identity cell walks the complete production hook pipeline minus compression and records -0.29% Δppl / 98.83% top-1.
- H1 (Σ_q was calibrated on pre-RoPE Q but FA operates on post-RoPE Q) as a direct fix-up is wrong. Online self-calibrated Σ_q^post makes things STRICTLY WORSE (+54% vs +35%). Math: RoPE is position-dependent; pooling post-RoPE Q over tokens averages away the per-token rotations and collapses anisotropy, giving a flatter pooled Σ than the true per-token FA metric.
- Pre-RoPE whitening IS the FA-correct thing (R_t L L^T R_t^T = R_t Σ_q R_t^T commutes with the per-token rotation). The Q-precond architectural choice in PR #13 is verified correct for vLLM too.
The remaining +35% gap is not Q-precond placement but almost certainly calibration-distribution drift: Σ_q + centroids were all fit on HF DynamicCache prefill snapshots, but vLLM's Qwen2 layer produces slightly different prefill K/V distributions (different bf16 accumulation / RoPE impl / attn bias). The codec has to eat that mismatch.
Next experiment: re-fit Σ_q and Lloyd-Max centroids on vLLM prefill snapshots and re-run codec-pre_qp.
Artifacts:
  reports/v1_3_ppl/vllm_ablation/FINDINGS.md
  reports/v1_3_ppl/vllm_ablation/ds_distill_qwen_1_5b_vllm_ablation.json
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Follow-up to the ablation in reports/v1_3_ppl/vllm_ablation/: H2 (noise) and post-RoPE Q-precond hypothesis both ruled out; the remaining +35% Δppl gap on vLLM vs HF is most likely calibration-distribution drift (Σ_q and Lloyd-Max centroids were fit on HF DynamicCache snapshots, not vLLM prefill snapshots).
benchmarks/vllm_calibration_refit.py:
1. Spins up vLLM LLM (bf16, enforce_eager).
2. Installs a capture-only monkey patch on Qwen2Attention.forward that records the pre-RoPE q/k/v tensors without modifying the forward.
3. Runs N calibration passages from the WikiText-103 TRAIN split by default (disjoint from the TEST split the PPL measurement uses), so no leakage.
4. Factors Σ_q per (layer, kv-head) by pooling the Q heads in each GQA KV group. Matches the format used by benchmarks/q_precondition.QPrecond exactly:
     layer_<l>_chol       [n_kv, D, D] fp32
     layer_<l>_inv_chol   [n_kv, D, D] fp32
     layer_<l>_sigma      [n_kv, D, D] fp32
5. Re-runs the Lloyd-Max residual pipeline on the captured K (whitened with the fresh Σ_q) and V streams, producing drop-in replacements for the ds_K_b{2,3}_centroids.f32 / ds_V_b2_centroids.f32 tables.
Outputs at --out-dir (default reports/v1_3_ppl/vllm_recalibrated/):
  q_calib.safetensors
  q_calib.json
  K_b2_centroids.f32, K_b3_centroids.f32, V_b2_centroids.f32
  SUMMARY.json
benchmarks/run_vllm_calibration_refit.sh is the driver.
Next step (not in this commit): re-run the codec-pre_qp ablation cell with --q-calib-pre-rope=.../q_calib.safetensors and --k-centroids / --v-centroids pointing at the new .f32 files.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

lloyd_max_calibration.py has top-level 'from transformers import …' and 'import benchmarks.pre_rope_cache …', which were blocking our tool from being usable in a vLLM-only environment (HF transformers IS present, but pre_rope_cache is a HF-eager patch that we don't need here — we just want four math utilities). Load the pure-numpy functions by stripping the HF imports from the source before exec()'ing it, then picking up fit_pca_simple / next_pow2 / wht_rotate / lloyd_max_iterate from the resulting namespace. Semantics unchanged. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Captured pre-RoPE Q/K/V from vLLM 0.7.3 on deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B using 8 disjoint WikiText-103 TRAIN split passages of 2048 tokens each (16k tokens per layer per kv-head pool).
Artifacts (drop-in compatible with QPrecond / kakeyaturbo-bench):
  q_calib.safetensors   28 layers × [2, 128, 128] (chol/inv/sigma)
                        Σ_q anisotropy (cond): median 4506, max 109076 (off-diag max / diag mean: median 15.5, max 34.8); strongly anisotropic, so Q-precond has something to do.
  K_b2_centroids.f32    Gaussian MSE 0.1143 → calibrated 0.0721 (1.59×)
  K_b3_centroids.f32    Gaussian MSE 0.0318 → calibrated 0.0214 (1.48×)
  V_b2_centroids.f32    Gaussian MSE 0.1140 → calibrated 0.1140 (1.00×)
  q_calib.json          per-(layer, kv-head) diagnostics
  SUMMARY.json          centroid comparison
These replace the HF-calibrated tables in reports/v1_4_q_pca/{flagship,calibrated_codebook}/ when the vLLM harness is run with --q-calib-pre-rope=.../q_calib.safetensors and --k-centroids / --v-centroids pointing at the new .f32 files.
K improvement ratios match the SPRINT_CLOSEOUT HF-calibrated numbers (1.47× at b=2 / 1.40× at b=3) closely: the vLLM post-WHT residual distribution looks quantitatively similar to HF's, just slightly shifted. Whether this shift is what causes the +35% ↦ ??? Δppl gap is the next ablation cell.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
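
The "Gaussian MSE → calibrated MSE" comparison above comes from fitting a scalar Lloyd-Max codebook to empirical residuals. The sketch below shows the generic 1-D Lloyd iteration on synthetic heavy-tailed samples; it is not the repository's lloyd_max_iterate (whose signature is not shown here), and the starting levels are the approximate 2-bit Lloyd-Max levels for a unit Gaussian, used only as an illustrative initialiser.

```python
import bisect
import random


def quantisation_mse(samples, centroids):
    """Mean squared error when each sample snaps to its nearest centroid."""
    return sum(min((x - c) ** 2 for c in centroids) for x in samples) / len(samples)


def lloyd_max_1d(samples, init_centroids, iters=20):
    """Alternate nearest-centroid assignment and per-cell mean updates."""
    centroids = sorted(init_centroids)
    for _ in range(iters):
        # Decision boundaries are midpoints between neighbouring centroids.
        bounds = [(a + b) / 2.0 for a, b in zip(centroids, centroids[1:])]
        sums = [0.0] * len(centroids)
        counts = [0] * len(centroids)
        for x in samples:
            cell = bisect.bisect_left(bounds, x)
            sums[cell] += x
            counts[cell] += 1
        # Empty cells keep their previous centroid.
        centroids = [s / n if n else c for s, n, c in zip(sums, counts, centroids)]
    return centroids


rng = random.Random(0)
# Synthetic Laplace-like residuals (heavier tails than a Gaussian); not real data.
samples = [rng.expovariate(1.0) * (1 if rng.random() < 0.5 else -1) for _ in range(5000)]

gaussian_levels = [-1.510, -0.4528, 0.4528, 1.510]  # approx. unit-Gaussian 2-bit levels
fitted_levels = lloyd_max_1d(samples, gaussian_levels)

mse_gaussian = quantisation_mse(samples, gaussian_levels)
mse_fitted = quantisation_mse(samples, fitted_levels)
print(f"improvement ratio: {mse_gaussian / mse_fitted:.2f}x")
```

A ratio near 1.00×, as reported for V at b=2, would indicate that the residuals are already close to the distribution the default table was designed for.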

Hypothesis H3 ('the +35% vLLM Δppl gap vs HF's +7.82% comes from Σ_q + Lloyd-Max being calibrated on HF, not vLLM, prefill distributions') was the leading candidate after the H1/H2 ablation. This commit tests and rules it out.
Procedure:
1. Capture pre-RoPE Q/K/V from vLLM on 8 disjoint WikiText-103 train passages (2048 tokens each).
2. Fit Σ_q and K/V Lloyd-Max tables on THAT data.
3. Re-run the 4-cell ablation with the drop-in replacements.
Result comparison (HF-calibrated vs vLLM-calibrated, same test passages):
  identity-pre_qp     -0.29% →   +0.15%   (noise)
  codec-no_qp       +152.78% → +144.56%   (noise)
  codec-pre_qp       +35.33% →  +38.69%   (+3 pp, noise)
  codec-post_qp      +54.28% →  +58.24%   (noise)
Calibration drift does NOT explain the HF vs vLLM gap. vLLM-origin calibration does not measurably beat HF-origin calibration, because the pre-RoPE Q/K/V distributions on vLLM are close enough to HF's that the HF tables are already well-matched. Lloyd-Max improvement ratios corroborate: HF-calibrated K b=2 is 1.47x better than Gaussian; vLLM-calibrated K b=2 is 1.59x. Close.
Remaining candidates:
  H4: Flash-Attention bf16 softmax/score reduction amplifies codec residuals more than HF eager's f32-accumulate path. Engine-level numerical sensitivity, not fixable by re-calibration.
  H5: vLLM's prompt_logprobs=1 forward path integrates compression residuals through a different numerical trajectory than HF's prefill + teacher-force two-pass.
Both would require a different class of experiment (e.g. a near-exact codec run vs identity, or port of the HF harness to run teacher-forcing on a vLLM-reconstructed cache) to distinguish.
Artifacts:
  reports/v1_3_ppl/vllm_recalibrated_run/FINDINGS.md
  reports/v1_3_ppl/vllm_recalibrated_run/ds_distill_qwen_1_5b_vllm_ablation.json
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Three of the four cells (codec-no_qp, identity-pre_qp, codec-post_qp) served only to falsify hypotheses H1, H2, H3 for the HF↔vLLM Δppl gap. All three hypotheses are now closed:
  H1 (Σ_q in wrong frame) ruled out: post-RoPE Σ_q strictly worse
  H2 (CPU↔GPU + fp32↔bf16 noise) ruled out: identity cell Δppl ≈ 0, top-1 99%
  H3 (calibration distribution drift) ruled out: vLLM-origin calibration indistinguishable from HF-origin
Only the production cell (codec-pre_qp) carries a standing datapoint going forward. Delete the 4-cell harness + runner and the two ablation report directories that contain the now-obsolete cells. The surviving datapoints (HF-calib production + vLLM-calib production) will be re-recorded with a clean, 1-cell format from e2e_ppl_validation_vllm_full.py, which is what this PR has always shipped for production.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

After ruling out H1 / H2 / H3 in the previous ablation rounds, only one production-relevant datapoint remains:
  HF-calibrated (as shipped)     Δppl +35.33%   top1 59.38%
  vLLM-calibrated (this PR)      Δppl +38.69%   top1 61.33%
Both reject; difference is passage noise.
This commit:
- Removes the obsolete per-subdir FINDINGS.md.
- Adds a single reports/v1_3_ppl/FINDINGS.md holding the two rows + passage detail + what H1/H2/H3 ruled out + the H4/H5 residual hypotheses to test next.
- Keeps vllm/{.json}, vllm_calibrated/{.json}, vllm_recalibrated/ as the backing artifacts.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Needed for the H4 ablation: rerun the same codec-pre_qp cell under a non-Flash-Attention backend. vLLM 0.7.3 picks the attention backend from the VLLM_ATTENTION_BACKEND env var; we expose it as ATTN_BACKEND for the driver. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

H4 setup:
- ATTN_BACKEND env var in the driver → VLLM_ATTENTION_BACKEND on the engine; switch from FLASH_ATTN to XFORMERS.
- Reports under reports/v1_3_ppl/vllm_h4_xformers/.
H5 setup:
- New CodecState.prefix_only_tokens + --prefix-only-tokens CLI.
- When set, the codec only round-trips the first N rows of each layer's K/V tensor; the tail is pass-through. Mirrors the HF harness's two-pass 'codec only touched the prefill cache, teacher-force saw exact K/V' semantics inside vLLM's single-forward path.
- Driver forwards PREFIX_ONLY_TOKENS env var as --prefix-only-tokens.
H4 result: Δppl +37.82% / top-1 60.16% (XFORMERS) vs +35.33% / 59.38% (FLASH_ATTN); within passage noise.
H4 FALSIFIED: the residual amplification is not specific to Flash-Attention's bf16 softmax. (TORCH_SDPA is unsupported in the vLLM 0.7.3 V0 engine on CUDA; FLASHINFER is not installed on this image.)
H5 pending in the next commit.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
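
The prefix-only behaviour set up for H5 amounts to a split, a round-trip of the head, and a concatenation. Here is a minimal sketch with rows as plain Python lists and a stand-in "codec" that just rounds values; the function name and the handling of N = 0 are assumptions, not the harness implementation.

```python
def roundtrip_prefix_only(rows, codec, prefix_only_tokens=None):
    """Round-trip only the first N rows through the codec; pass the tail through.

    rows: list of per-token vectors (one row per position).
    codec: callable taking and returning a list of rows.
    prefix_only_tokens: None means the whole tensor is round-tripped.
    """
    if prefix_only_tokens is None:
        return codec(rows)
    n = min(prefix_only_tokens, len(rows))
    if n == 0:
        return list(rows)
    return codec(rows[:n]) + rows[n:]


def toy_codec(rows):
    """Stand-in lossy codec: round every value to one decimal place."""
    return [[round(v, 1) for v in row] for row in rows]


kv_rows = [[0.12, 0.98], [0.55, 0.44], [0.31, 0.77]]   # 3 positions, toy values
mixed = roundtrip_prefix_only(kv_rows, toy_codec, prefix_only_tokens=2)
print(mixed)
```

With N equal to ctx_len, positions in the evaluation window keep their exact K/V, which is the two-pass HF semantics the experiment is meant to mirror.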

…sified
With PREFIX_ONLY_TOKENS=2048 (codec only touches positions < ctx_len = 2048, eval window [2048, 2112) stays uncompressed), mirroring HF's two-pass 'prefill cache → teacher-force' semantics inside vLLM's single-forward path:
  Δppl  +35.41 % (baseline was +35.33 %)
  top-1 59.77 % (baseline 59.38 %)
  → REJECT
Essentially identical to the full-prefix baseline. The HF↔vLLM gap is NOT a measurement-path artifact: compressing only the prefill window and leaving the eval window clean does not change the result.
Combined with H4 (XFORMERS gave +37.82%), both residual hypotheses are closed. The next step is the engineering fallback: sweep K b=4 (more K headroom).
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

For K b=4 the SPRINT_CLOSEOUT notes calibrated Lloyd-Max centroids do not help (slightly degrade in fact), so ds_K_b4_centroids.f32 is not shipped. Let the driver gracefully omit --k-centroids / --v-centroids when the env var is empty or 'none'. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

… bottleneck on vLLM
After closing H1/H2 earlier and H3 in the previous commit, this iteration closes H4 and H5 and then executes the engineering fallback (sweep K b=4, then K b=4 + V b=4) on the codec-pre_qp production cell.
Results on DS-Distill-Qwen-1.5B / vLLM 0.7.3 H200 / 4 WikiText-103 test passages / ctx=2048 / n_eval=64:
  production        (K b=3, V b=2, FLASH_ATTN, HF calib)            Δppl +35.33 %
  H3 vLLM-calib     same but Σ_q+centroids re-fit on vLLM                +38.69 %   noise
  H4 XFORMERS       same except VLLM_ATTENTION_BACKEND=XFORMERS          +37.82 %   noise
  H5 prefix-only    same except codec only touches pos<2048              +35.41 %   noise
  strategy K b=4    K bits 3→4 (Gaussian K centroids)                    +37.30 %   no improvement
  strategy K+V 4    K+V bits to 4, no calibrated centroids               +27.32 %   −10 pp
H4 falsified: swapping FA for XFormers does not close the gap. The Δppl amplification is not specific to Flash-Attention's bf16 softmax.
H5 falsified: restricting the codec to positions <ctx_len (eval window sees clean K/V) does not close the gap. HF's two-pass and vLLM's single-forward paths integrate codec residuals similarly at PPL resolution.
Strategy: K headroom alone doesn't help (+35.33 → +37.30 at b=4). Doubling V rate alone (b=2 → b=4, no outlier, no V calibration) buys ~10 pp Δppl (+37.30 → +27.32), the FIRST knob that shifts the number meaningfully.
Interpretation: the residual HF↔vLLM gap is a V-stream failure mode that is specific to vLLM's FA-family integration of b=2 V. On HF, V residuals are 'natively Gaussian' so b=2 Lloyd-Max is near-optimal; under FA's bf16 softmax(QK^T/√d) @ V accumulation, that approximation is less forgiving.
For deployment:
  HF users: SPRINT_CLOSEOUT MARGINAL cell holds. No change.
  vLLM users: honest in-engine Δppl is +35% at the standard config. Cheapest fix: V b=2 → V b=4 (−10 pp Δppl, loses ⋅1/2 V compression). Proper fix: V-side codec redesign targeting FA's score@V accumulation.
Artifacts:
  reports/v1_3_ppl/FINDINGS.md (consolidated, 6 rows)
  reports/v1_3_ppl/vllm_kb4/ds_distill_qwen_1_5b_kb4_vllm_full.json
  reports/v1_3_ppl/vllm_kv4/ds_distill_qwen_1_5b_kb4_vb4_vllm_full.json
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

… knob
Drop all H3/H4/H5/kb4/kv4 artifacts and the refit tool. The single standing cell on this branch is:
  codec-pre_qp (production)
    DS-Distill / vLLM 0.7.3 / FLASH_ATTN
    K b=3 + V b=2 + HF-calibrated Lloyd-Max + outlier T=2.0 + 6-layer boundary skip + pre-RoPE Q-preconditioning
    → Δppl +35.33 %, top-1 59.38 %, REJECT
All retired cells were confirmed to leave Δppl within passage noise of this baseline (or, for KV b=4, moved it by only ~10 pp while still REJECT).
Replace their per-stream debugging surface with a single clean knob:
  --compress-stream {kv, k, v}
(also COMPRESS_STREAM env var on the driver). 'k' routes only K through the codec and leaves V bf16; 'v' is the mirror; 'kv' (default) is the production cell. This exposes the per-channel Δppl attribution that PR #15's V-vs-K localisation pointed at, cleanly, from the production harness.
Remove the now-obsolete --prefix-only-tokens, ATTN_BACKEND, and K_CENTROIDS=none overrides from the driver; the streams knob covers the remaining question we want to answer.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
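
A knob of this shape is straightforward to express with argparse. The sketch below is illustrative rather than the harness code: in the commit the COMPRESS_STREAM variable lives on the shell driver, whereas here the environment default is read directly in Python for brevity, and the helper names are hypothetical.

```python
import argparse
import os

STREAM_CHOICES = ("kv", "k", "v")


def build_parser(env=None):
    """Parser with --compress-stream defaulting to COMPRESS_STREAM or 'kv'."""
    env = os.environ if env is None else env
    default = env.get("COMPRESS_STREAM", "kv")
    # argparse does not validate defaults against `choices`, so check here.
    if default not in STREAM_CHOICES:
        raise ValueError(f"COMPRESS_STREAM must be one of {STREAM_CHOICES}")
    parser = argparse.ArgumentParser()
    parser.add_argument("--compress-stream", choices=STREAM_CHOICES, default=default)
    return parser


def streams_to_compress(choice):
    """Translate the knob into per-stream booleans; False means leave bf16."""
    return {"k": "k" in choice, "v": "v" in choice}


args = build_parser(env={}).parse_args(["--compress-stream", "k"])
k_only_plan = streams_to_compress(args.compress_stream)
print(k_only_plan)
```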

…94ppl
Three rows on DS-Distill-Qwen-1.5B / vLLM 0.7.3 H200, 4 WikiText-103 test passages, ctx=2048, n_eval=64, shared reference (paired):
  production (K+V)    Δppl +35.33 %   top1 59.38 %   REJECT
  K-only (V bf16)     Δppl +22.55 %   top1 69.14 %   REJECT
  V-only (K bf16)     Δppl +11.10 %   top1 74.22 %   REJECT
K-only + V-only = +33.65 ≈ +35.33 − 1.68 pp interaction. So K and V contribute roughly additively at this codec config, with K carrying about two-thirds of the Δppl (+22.55 / +35.33 ≈ 64 %) and V carrying about one-third (+11.10 / +35.33 ≈ 31 %).
This is a material deviation from the HF picture: SPRINT_CLOSEOUT's v1.3 PPL recipe spends ALL four guardrails (Q-precond, K Lloyd-Max centroids, outlier T=2.0, 6-layer boundary skip) on the K stream precisely because HF's V b=2 with share_basis is 'natively Gaussian' and near-lossless. On vLLM+FA the same K-side stack still leaves +22.55 pp Δppl on K alone, so K is the bigger lever; the cheapest path to closing the HF↔vLLM gap is a vLLM-specific K-side redesign.
Top-1 attribution is slightly different from Δppl attribution: measured against 100 % agreement, K-only drops top-1 by 30.86 pp and V-only by 25.78 pp (joint K+V: 40.62 pp loss). Both channels' logit distortions reorder the one-best similarly at full context; V distortions tend to preserve top-1 better than they preserve the log-prob of the true token.
Artifacts:
  reports/v1_3_ppl/vllm_k_only/ds_distill_qwen_1_5b_k_only_vllm_full.json
  reports/v1_3_ppl/vllm_v_only/ds_distill_qwen_1_5b_v_only_vllm_full.json
  reports/v1_3_ppl/FINDINGS.md (consolidated, 3 rows)
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
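
The per-stream attribution quoted in this commit is plain arithmetic on the three paired cells, reproduced below so the shares and the interaction term can be checked. Variable names are illustrative.

```python
# Paired cells from the three-row table (dppl in %).
joint_kv = 35.33
k_only = 22.55
v_only = 11.10

additive_prediction = k_only + v_only          # what pure additivity would give
interaction_pp = joint_kv - additive_prediction

k_share = k_only / joint_kv                    # fraction of joint dppl carried by K
v_share = v_only / joint_kv                    # fraction carried by V

print(f"K-only + V-only = {additive_prediction:+.2f} %")
print(f"interaction     = {interaction_pp:+.2f} pp")
print(f"K share = {k_share:.0%}, V share = {v_share:.0%}")
```

The small interaction term at the stream level contrasts with the much larger cross-layer interaction reported in the PR description, which is what localises the non-additivity to layers rather than to K versus V.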

In SPRINT_CLOSEOUT v1.3 PPL, outlier compensation T=2.0 is K-side only; V has calibrated Lloyd-Max and 6-layer boundary but no outlier threshold. The Rust codec already supports --outlier-threshold on any stream; it's just that our Python harness hardcoded V's to None to match the HF recipe. Expose it:

  --v-outlier-threshold T     Python CLI
  V_OUTLIER_THRESHOLD=T       driver env var (unset = V has no outlier compensation, matching the HF v1.3 PPL recipe)

Semantics of the 'four guardrails' for V under this PR's per-channel attribution question:

  (1) Q-preconditioning       N/A for V (V does not contract with Q; there is no Σ_q metric on V)
  (2) calibrated Lloyd-Max    already on (ds_V_b2_centroids.f32)
  (3) 6-layer boundary skip   already on (same layers as K)
  (4) outlier compensation    now optional via --v-outlier-threshold (was hardcoded None)

The V-only baseline row in FINDINGS.md (Δppl +11.10 %) already has (2) and (3) active. This commit lets the next V-only run add (4) too, to answer the question 'what is V-only PPL with all four guardrails that APPLY to V turned on?'

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
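A sketch of the "unset means no V outlier compensation" semantics described above. The helper name resolve_v_outlier_threshold is hypothetical, and letting the CLI value take precedence over the env var is an assumption, not something this commit states.

```python
import os
from typing import Mapping, Optional


def resolve_v_outlier_threshold(
    cli_value: Optional[float],
    env: Mapping[str, str] = os.environ,
) -> Optional[float]:
    """Return the V outlier threshold, or None when V gets no outlier compensation."""
    # Assumption: an explicit CLI value wins over the environment.
    if cli_value is not None:
        return cli_value
    raw = env.get("V_OUTLIER_THRESHOLD", "").strip()
    # Unset or empty reproduces the HF v1.3 PPL recipe (no threshold on V).
    return float(raw) if raw else None


if __name__ == "__main__":
    print(resolve_v_outlier_threshold(None, {}))
    print(resolve_v_outlier_threshold(None, {"V_OUTLIER_THRESHOLD": "2.0"}))
```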

On DS-Distill + vLLM FLASH_ATTN / 4 WikiText-103 test passages / shared paired reference, running the production cell with COMPRESS_STREAM=v and V_OUTLIER_THRESHOLD=2.0 gives:

  V-only (SPRINT_CLOSEOUT recipe, no V outlier)   +11.10 %   top1 74.22 %
  V-only (+ outlier T=2.0)                         +7.04 %   top1 75.39 %   −4.06 pp Δppl

Four SPRINT_CLOSEOUT guardrails and their applicability to V:

  (1) Q-precond    N/A (V does not contract with Q; no Σ_q metric)
  (2) Lloyd-Max    already on (ds_V_b2_centroids.f32)
  (3) 6-bdry       already on
  (4) outlier T    was off in SPRINT_CLOSEOUT; this commit enables it

So row 4 is 'V with all APPLICABLE guardrails'. Outlier compensation is a cheap V add-on worth ~4 pp on vLLM. On HF the SPRINT_CLOSEOUT reasoning held (V residual already near-Gaussian), but under FLASH_ATTN's bf16 accumulation the remaining V outliers apparently still matter.

Consolidated table in reports/v1_3_ppl/FINDINGS.md now has 4 rows:

  production (K+V)                           +35.33 %   59.38 %
  K-only                                     +22.55 %   69.14 %
  V-only (SPRINT_CLOSEOUT V recipe)          +11.10 %   74.22 %
  V-only + outlier (all applicable guards)    +7.04 %   75.39 %

K is still the bigger lever under vLLM; V outlier compensation is a cheap add that shaves ~4 pp off V's standalone cost on top.

Artifact: reports/v1_3_ppl/vllm_v_only_full_guards/*.json

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Runs the same 4 WikiText-103 test passages through HF transformers (eager, fp32 accumulate) and vLLM (0.7.3, FLASH_ATTN, bf16) with the codec OFF on both, then reports per-passage:

* ppl_hf, ppl_vllm, rel_gap = (vllm - hf) / hf
* mean |Δ log P(true_token)| across the eval window
* mean symmetric KL over the top-K=20 logprob bucket (engines' reported alternatives at each position)
* top-1 agreement between engines

If rel_gap is small (< 1%) and KL is small (< 0.01), the engines give effectively the same logits on the clean model; any remaining HF↔vLLM Δppl gap on the codec cell must come from the engine's response to codec perturbation, not a baseline mismatch.

Single script / single process / reuses the GPU across engines (vLLM first, then HF after torch.cuda.empty_cache()).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
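The four per-passage metrics listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the script's code: the function names are hypothetical, symmetric KL is taken as the average of the two directed KLs, and the top-K buckets are renormalised on the tokens both engines report, which the commit message does not specify.

```python
import math
from typing import Dict, List


def rel_gap(ppl_hf: float, ppl_vllm: float) -> float:
    """Relative PPL gap, (vllm - hf) / hf."""
    return (ppl_vllm - ppl_hf) / ppl_hf


def mean_abs_delta_logp(lp_hf: List[float], lp_vllm: List[float]) -> float:
    """Mean |Δ log P(true_token)| over the eval window."""
    return sum(abs(a - b) for a, b in zip(lp_hf, lp_vllm)) / len(lp_hf)


def symmetric_kl(top_hf: Dict[str, float], top_vllm: Dict[str, float]) -> float:
    """0.5 * (KL(p||q) + KL(q||p)) on the shared support of two top-K logprob dicts."""
    shared = sorted(set(top_hf) & set(top_vllm))
    if not shared:
        return float("nan")
    p = [math.exp(top_hf[t]) for t in shared]
    q = [math.exp(top_vllm[t]) for t in shared]
    zp, zq = sum(p), sum(q)
    p = [x / zp for x in p]  # renormalise on the shared tokens (assumption)
    q = [x / zq for x in q]
    kl_pq = sum(a * math.log(a / b) for a, b in zip(p, q))
    kl_qp = sum(b * math.log(b / a) for a, b in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)


def top1_agreement(seq_hf: List[Dict[str, float]], seq_vllm: List[Dict[str, float]]) -> float:
    """Fraction of positions where both engines rank the same token first."""
    hits = sum(max(a, key=a.get) == max(b, key=b.get) for a, b in zip(seq_hf, seq_vllm))
    return hits / len(seq_hf)
```

Restricting the KL to the shared support keeps it finite when the two engines report different alternatives at a position; other conventions (for example a floor probability for missing tokens) would give different absolute values.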

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

With codec OFF, same tokens, DS-Distill, 4 WikiText test passages:

  mean PPL rel gap (vLLM-HF)/HF     = -11.44 %
  mean top-1 agreement (engines)    = 82.03 %
  mean symmetric KL on top-20       = 0.145 nats
  mean |Δ log P(true_token)|        = 0.347 nats

HF transformers (eager, bf16) consistently reports higher PPL than vLLM (FLASH_ATTN, bf16) on the same tokens through the same bf16 weights. Roughly 1 in 5 positions (18 %) disagrees on top-1. KL on the top-20 bucket is an order of magnitude larger than bf16 round-off.

This is a baseline mismatch: the 'HF +7.82% Δppl' and 'vLLM +35.33% Δppl' numbers are measured against different reference PPLs, so part of the 27 pp HF↔vLLM Δppl gap is the two engines answering slightly different questions about slightly different logit distributions. Phase 2 (noise sensitivity curves) will quantify how much of the 27 pp survives after normalising for that.

Caveat: HF emits 'Sliding Window Attention is enabled but not implemented for eager; unexpected results may be encountered.' on DS-Distill. DS-Distill's config has sliding_window: null so this warning is spurious, but the eager path may still take a different numeric route than vLLM's FA kernel, which is a possible contributor to the observed KL.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Captures pre-RoPE K/V from BOTH engines on the same WikiText-103 passages via non-invasive monkey patches:

  vLLM : Qwen2Attention.forward capture (reused from PR #15 tools)
  HF   : transformers.models.qwen2.modeling_qwen2.Qwen2Attention.forward emulates the eager pre-RoPE extraction, records k/v, then delegates to the original forward for the rest

For each layer runs the production v1.3 codec (Q-precond + calibrated Lloyd-Max + outlier T=2.0 on K; Lloyd-Max share_basis on V) and reports per-layer:

* codec_mean_block_mse (reported by kakeyaturbo-bench)
* mse_vs_ground_truth (numpy mean((K - K_hat)**2))
* relnorm (||K - K_hat||_F / ||K||_F)
* mean_K_abs, mean_V_abs (to see if input magnitudes are matched)

The output JSON pairs vLLM and HF rows per layer. If the (vllm / hf) MSE ratio is ~1 across layers, the codec sees statistically identical K/V distributions from both engines and the codec errors are matched; the HF↔vLLM Δppl gap cannot be blamed on 'codec saw different inputs on different engines'. If the ratio is consistently != 1, we have localised a concrete pre-codec mismatch.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
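The per-layer reconstruction metrics named above reduce to a few formulas. The sketch below works on flattened lists of floats so that it has no dependencies; the harness itself uses numpy according to the text, and the function names here are hypothetical.

```python
import math
from typing import Sequence


def mse_vs_ground_truth(x: Sequence[float], x_hat: Sequence[float]) -> float:
    """mean((K - K_hat)**2) over a flattened tensor."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def relnorm(x: Sequence[float], x_hat: Sequence[float]) -> float:
    """||K - K_hat||_F / ||K||_F over a flattened tensor."""
    err = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))
    return err / math.sqrt(sum(a * a for a in x))


def mean_abs(x: Sequence[float]) -> float:
    """Mean absolute value, used to check that input magnitudes match."""
    return sum(abs(a) for a in x) / len(x)


def mse_ratio(mse_vllm: float, mse_hf: float) -> float:
    """Per-layer (vllm / hf) ratio; ~1 means the codec errors are matched."""
    return mse_vllm / mse_hf
```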

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

22 full-attention layers of DS-Distill, 8192 pre-RoPE K/V vectors per layer per engine:

                            median    max      min
  K mse ratio (vLLM/HF)     1.012     1.056    0.98
  V mse ratio (vLLM/HF)     1.018     1.056    0.96

  mean |K|: vllm 0.9806   hf 0.9743   (Δ 0.64%)
  mean |V|: vllm 0.6765   hf 0.6690   (Δ 1.12%)

Codec sees statistically identical K/V from both engines and produces statistically identical reconstruction errors. Q-preconditioning, calibrated Lloyd-Max centroids, and outlier compensation are all engine-agnostic at the 1% resolution. This confirms Phase 1 from PR #15 (calibration drift H3) on a different metric.

Gap decomposition so far:

  Phase 1   engine baseline mismatch   -11% PPL rel, 0.145 KL
  Phase 4   codec residual mismatch    ~0 (1% noise)

The 27 pp HF↔vLLM Δppl gap cannot be explained by 'codec saw different inputs' or 'codec errors are different'. The rest must be in the engine's RESPONSE to the same fixed codec residual, which is exactly what Phase 2 (noise-sensitivity curve) is designed to measure.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Replaces the codec with K' = K + σ * rms(K) * randn at the pre-RoPE hook on each engine. Hook points are symmetric:

  vLLM : Qwen2Attention.forward (same point as production harness)
  HF   : transformers Qwen2Attention.forward, pre-RoPE via a monkey-patched q/k/v projection swap-out

Sweeps sigma ∈ {0.001, 0.01, 0.03, 0.1, 0.3}, with --mode controlling which streams are perturbed (both, k only, v only). Loads the model ONCE per engine, computes the reference ref_lps ONCE, then runs all sigmas as separate alt passes.

Seeded RNG is counter-based so that different sigmas and different engines use independent noise realisations but the sweep is reproducible.

Outputs JSON with per-sigma, per-passage Δppl and |Δ log P(true)|. Run the script twice (once --engine vllm, once --engine hf) to get both curves.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
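A dependency-free sketch of the perturbation K' = K + σ * rms(K) * randn with reproducible, mutually independent noise streams. Everything here is illustrative: the harness operates on GPU tensors, the names rms and perturb are hypothetical, rms is taken over the whole flattened input (an assumption), and a string-keyed random.Random stands in for the counter-based generator the commit mentions.

```python
import math
import random
from typing import List, Sequence


def rms(values: Sequence[float]) -> float:
    """Root-mean-square of a flattened tensor."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def perturb(
    values: Sequence[float],
    sigma: float,
    *,
    engine: str,
    layer: int,
    base_seed: int = 0,
) -> List[float]:
    """Return values + sigma * rms(values) * N(0, 1), keyed by (seed, engine, layer, sigma)."""
    # A distinct key per (engine, layer, sigma) gives independent realisations,
    # while the same key always reproduces the same noise.
    rng = random.Random(f"{base_seed}:{engine}:{layer}:{sigma}")
    scale = sigma * rms(values)
    return [v + scale * rng.gauss(0.0, 1.0) for v in values]


if __name__ == "__main__":
    k = [math.sin(i) for i in range(4096)]
    for sigma in (0.001, 0.01, 0.03, 0.1, 0.3):
        noisy = perturb(k, sigma, engine="vllm", layer=0)
        noise = [a - b for a, b in zip(noisy, k)]
        print(sigma, rms(noise) / rms(k))
```

The printed ratio should sit close to each σ, which is a quick check that the noise is scaled relative to the input's RMS rather than in absolute units.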

…setattr tricks) Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Linear regime (σ ≤ 0.01), matched noise on both engines, DS-Distill:

  σ        vLLM Δppl    HF Δppl
  0.001     -0.57 %     +1.74 %
  0.010    +12.37 %    +33.50 %

Above σ = 0.03 both engines are in the saturation regime (Δppl > 2000 %); no meaningful comparison there.

HF's eager forward amplifies matched relative-RMS noise MORE than vLLM's FA path in the linear regime, not less. This contradicts the working theory from PR #15 that 'FA bf16 softmax is the culprit', and is hard to reconcile with the observed codec cell where HF Δppl is +7.82% but vLLM Δppl is +35.33%.

Two remaining possibilities:

  (a) the codec produces the same fp32 residual on both engines (Phase 4 confirmed this), but HF and vLLM cross the fp32 → bf16 boundary at different points inside the attention module; HF may be upcasting to fp32 somewhere vLLM isn't, so HF's effective σ at the attention kernel is smaller than vLLM's.
  (b) the codec's error is structured (Lloyd-Max + WHT + outlier) and its projection onto the attention metric differs from random noise's projection. If that structure aligns with directions HF eager suppresses but FA doesn't, the engines swap their sensitivity order versus σ·randn.

Either way, 'engine noise sensitivity curve' does not explain the 27 pp codec-cell Δppl gap. The remaining candidate is structural: Phase 3 (per-layer codec attribution) should localise whether the extra vLLM damage is concentrated in a small layer set.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Reuses the production harness's hook and codec config; only inverts the boundary_skip_layers set per cell to enable codec on exactly one layer at a time. 28 cells on DS-Distill, each measures Δppl / top-1 against a shared codec-OFF reference.

Ranks layers by |Δppl| so Phase 6 can say whether the residual 27 pp HF↔vLLM gap is concentrated (small layer set) or uniform (layer-agnostic numerical path difference).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
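The "invert the skip set" trick can be sketched as follows: for each of the 28 layers, the skip set is every layer except the one under test. The function name single_layer_cells is hypothetical; only the idea of reusing boundary_skip_layers comes from the commit message.

```python
from typing import Dict, FrozenSet

NUM_LAYERS = 28  # DS-Distill layer count as stated above


def single_layer_cells(num_layers: int = NUM_LAYERS) -> Dict[int, FrozenSet[int]]:
    """Map each layer to the skip set that leaves the codec active on that layer only."""
    all_layers = frozenset(range(num_layers))
    return {layer: all_layers - {layer} for layer in range(num_layers)}


if __name__ == "__main__":
    cells = single_layer_cells()
    # Each cell would be passed to the harness as its boundary_skip_layers value.
    print(len(cells), sorted(cells[7])[:5])
```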

…o +39 pp

Per-layer single-layer codec attribution on vLLM (DS-Distill):

  L00   +56.49 %   (catastrophic)
  L07   +15.59 %
  L11    -8.45 %
  L06    +6.71 %
  L02    -6.68 %
  L15    -5.19 %
  ...rest in [-5, +5] %

Aggregates:

  Σ over 6 SPRINT_CLOSEOUT boundary layers {0,1,7,14,26,27}       : +68.97 %
  Σ over 22 'quiet' non-boundary layers                           :  -3.93 %
  production cell (boundary skipped, 22 codec-active, measured)   : +35.33 %

Two findings:

  1. SPRINT_CLOSEOUT's 6-layer boundary skip on vLLM is mandatory. Those 6 layers alone carry +69 pp Δppl; keeping them bf16 saves 69 pp.
  2. Non-linear cross-layer compounding: summing singletons over the 22 active layers gives only -3.93% (noise), but the joint cell is +35.33%. Joint action has a ~+39 pp cross-layer interaction.

This interaction term is the remaining candidate for the HF↔vLLM gap: HF's eager path with fp32 accumulation through the residual stream compounds per-layer codec residuals less aggressively than vLLM's FA path does. Each per-layer residual is small on both engines (Phase 4 confirmed matched magnitudes); their joint effect diverges across engines.

Deployment implication: two cheap interventions for vLLM, suggested by these measurements:

  (a) extend boundary skip to include the next-worst 3 layers {2, 6, 11}; individually they are 5-8 pp, compounding savings could be 10-15 pp off the joint +35 %.
  (b) adaptive per-layer bit-width: keep K b=3 on most layers, go b=4 only on the 3-4 hot layers.

Both aim to cut the non-linear compounding at its source on vLLM while preserving at least 19/28 of the ratio benefit (9 of 28 layers left uncompressed in case (a)).

Artifacts:
  reports/v1_3_ppl/gap_decomposition/phase3/FINDINGS.md
  reports/v1_3_ppl/gap_decomposition/phase3/ds_distill_qwen_1_5b_vllm_per_layer.json

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
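The two aggregates used above, the interaction term and the "next-worst layers" selection, can be sketched from the reported numbers. Only the six per-layer values quoted in this commit are included (the rest are stated to lie within [-5, +5] %), and the function names are hypothetical.

```python
from typing import Dict, List, Set

# Values copied from the per-layer list in this commit message (Δppl in %).
REPORTED: Dict[int, float] = {0: 56.49, 7: 15.59, 11: -8.45, 6: 6.71, 2: -6.68, 15: -5.19}
BOUNDARY: Set[int] = {0, 1, 7, 14, 26, 27}


def interaction_pp(joint: float, singleton_sum: float) -> float:
    """Cross-layer interaction: joint-cell Δppl minus the sum of single-layer Δppl."""
    return joint - singleton_sum


def next_worst(delta_ppl: Dict[int, float], already_skipped: Set[int], k: int) -> List[int]:
    """The k layers with the largest |Δppl| that are not already skipped."""
    candidates = [layer for layer in delta_ppl if layer not in already_skipped]
    candidates.sort(key=lambda layer: abs(delta_ppl[layer]), reverse=True)
    return candidates[:k]


if __name__ == "__main__":
    print(interaction_pp(joint=35.33, singleton_sum=-3.93))
    print(next_worst(REPORTED, BOUNDARY, 3))
```

Ranking by absolute value matches the commit's |Δppl| ordering, so layers whose single-layer Δppl is negative (L11, L02) still count as hot.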

Buckets on the HF (+7.82%) vs vLLM (+35.33%) 27-pp gap:

  #1 Engine baseline shift              ~10 pp (clean-model PPL disagreement; 0.145 KL; 18% top-1 disagreement)
  #2 Codec residual magnitude           ~0 (codec is engine-agnostic; mse ratio 1.01)
  #3 Noise-sensitivity curve            HF MORE sensitive per σ in linear regime; not the cause
  #4 Boundary layers already skipped    +69 pp saved by SPRINT_CLOSEOUT boundary policy
  #5 Cross-layer non-linear compound    +39 pp (joint-cell - Σ singletons over 22 quiet layers)

Localised root cause: vLLM's single-forward bf16 residual-stream accumulation through Flash-Attention compounds per-layer codec residuals ~39 pp above their sum, while HF eager's f32-accumulate + teacher-force over DynamicCache compounds them less aggressively. Each per-layer residual is small on both engines (Phase 4 matched); what differs is the accumulation path.

Deployment recommendations:

  1. Extend vLLM boundary skip to {2, 6, 11} on top of the existing {0,1,7,14,26,27}; cuts ~10-15 pp off the joint Δppl.
  2. Adaptive per-layer bit-width: K b=4 on the hot layers, b=3 elsewhere; preserves 19/28 of the ratio benefit.

Phase 3 ran only on vLLM (reused production harness); the HF per-layer curve is left as a follow-up if someone wants to confirm that HF's cross-layer interaction is the ~+10 pp we infer here.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…3/59.38)

Scenario A (compress KV after a clean prefill, HF two-pass semantics inside vLLM) gives:

  passage 1:   Δppl +11.52%   top1 75.00%
  passage 2:   Δppl +13.01%   top1 71.88%
  passage 3:   Δppl +35.59%   top1 73.44%
  passage 4:   Δppl +56.17%   top1 76.56%
  mean         Δppl +29.07%   top1 74.22%
  verdict      REJECT

Harness delta vs in-forward mode:

  Mode                    Δppl       top-1     verdict
  HF 2-pass               +7.82%     78.97%    MARGINAL
  vLLM snapshot (this)    +29.07%    74.22%    REJECT
  vLLM in-forward         +35.33%    59.38%    REJECT

PR #16 Phase 6 predicted snapshot-mode vLLM would land near HF's +7.82% because the sum of 22 non-boundary single-layer Δppl contributions was -3.9% and the 'extra +39 pp' was attributed to in-forward cross-layer compounding. That prediction was WRONG.

Actual harness-integration contribution: ~6 pp of the 27 pp gap (not 39 pp). The top-1 DOES jump substantially (59 → 74, within 5 pp of HF's 79), confirming that in-forward pollution was changing the argmax, but the residual Δppl stays far from HF.

Revised bucket attribution of the 27 pp HF↔vLLM gap:

  in-forward vs snapshot (harness)        ~ 6 pp
  engine baseline shift (Phase 1 clean)   ~10 pp
  intrinsic engine compounding (FA bf16)  ~11 pp

The dominant term is NOT the harness; it is the engine itself. FA bf16 attention + softmax propagates the same codec residual through 28 layers differently than HF eager's fp32-accumulate softmax. This is the real root cause.

Deployment implication: Scenario A is the correct semantics to deploy (compress already-filled KV cache for memory relief), and it IS better than the in-forward harness, but on this codec recipe it does not reach HF MARGINAL parity on vLLM. Top-1 preservation (74%) is the first positive vLLM datapoint though; the remaining gap is in the logit distribution, not the argmax.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
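The passage means and the revised bucket split can be rechecked from the numbers in this commit. This is a sketch: the ~10 pp baseline shift is taken as given from Phase 1, and computing the intrinsic bucket as the remainder after subtraction is an assumption about how the ~11 pp figure was obtained.

```python
# Per-passage (Δppl %, top-1 %) for Scenario A, copied from the table above.
PASSAGES = [(11.52, 75.00), (13.01, 71.88), (35.59, 73.44), (56.17, 76.56)]

HF_TWO_PASS = 7.82       # HF 2-pass Δppl (%)
VLLM_IN_FORWARD = 35.33  # vLLM in-forward Δppl (%)
BASELINE_SHIFT = 10.0    # Phase 1 estimate (pp), taken as given


def mean(values):
    values = list(values)
    return sum(values) / len(values)


snapshot_dppl = mean(d for d, _ in PASSAGES)
snapshot_top1 = mean(t for _, t in PASSAGES)

total_gap = VLLM_IN_FORWARD - HF_TWO_PASS              # the "27 pp" gap
harness = VLLM_IN_FORWARD - snapshot_dppl              # in-forward vs snapshot
intrinsic = total_gap - harness - BASELINE_SHIFT       # remainder (assumption)

if __name__ == "__main__":
    print(f"snapshot mean Δppl {snapshot_dppl:.2f} %, top-1 {snapshot_top1:.2f} %")
    print(f"gap {total_gap:.2f} pp = harness {harness:.2f} + baseline {BASELINE_SHIFT:.2f} + intrinsic {intrinsic:.2f}")
```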

The full microscopic reasoning behind the +39 pp 'cross-layer non-linear compounding' estimate from PR #16 Phase 6 existed only in conversation, never in any FINDINGS.md. This commit preserves that reasoning as an audit trail so future agents can recheck it against new evidence (Option C being the decisive one).

Content:

* Two coherent error channels in vLLM in-forward path:
  - direct: ε_l in layer l's attention output
  - indirect: ε_l → δ_l → W_k · δ_l shifts layer l+1's pre-codec K/V input, so layer l+1's codec runs on a 'wrong K'
* These are sign-correlated across layers, so variance grows as (Nσ)² instead of Nσ², i.e. a √22 ≈ 4.7× amplification in standard deviation over linear composition
* bf16 residual stream aggravates because correlated errors can't cancel sub-ULP; HF eager upcasts softmax to fp32, vLLM FA stays bf16

This hypothesis predicts snapshot-mode on vLLM should reach ~+8% (HF parity). PR #17's actual measurement was only -6 pp (+35 → +29), so the mechanism logic is real but over-estimated by ~6×; the ~11 pp 'intrinsic engine' bucket is where the actual root cause lives.

Evidence FOR the hypothesis (Phase 2, Phase 4 signals, PR #17 top-1 +15 pp recovery) is listed alongside evidence AGAINST (PR #17 Δppl stays high). The document explicitly says it's a draft hypothesis with partial support, NOT a verified conclusion.

Status set so Option C (fully in-kernel backend) will conclusively discriminate between (a) CPU round-trip vs (b) FA bf16 softmax as the source of the remaining 11 pp. Corrections are pre-listed for whichever outcome lands.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
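The (Nσ)² versus Nσ² claim follows from the variance of a sum of equally correlated terms. The sketch below evaluates that textbook formula; modelling the per-layer errors as equal-variance with a single pairwise correlation rho is a simplifying assumption made here, not something the audit document states.

```python
import math


def std_of_sum(n: int, sigma: float, rho: float) -> float:
    """Std of a sum of n terms, each with std sigma and pairwise correlation rho.

    Var = n * sigma^2 * (1 + (n - 1) * rho), so rho = 0 gives n * sigma^2
    and rho = 1 gives (n * sigma)^2.
    """
    return math.sqrt(n * sigma ** 2 * (1.0 + (n - 1) * rho))


if __name__ == "__main__":
    n = 22  # codec-active layers in the production cell
    independent = std_of_sum(n, 1.0, 0.0)
    correlated = std_of_sum(n, 1.0, 1.0)
    print(f"amplification at rho=1: {correlated / independent:.2f}x")
    # Partial correlation gives a smaller factor than the sqrt(22) upper bound.
    print(f"amplification at rho=0.2: {std_of_sum(n, 1.0, 0.2) / independent:.2f}x")
```

The √22 figure is therefore an upper bound that holds only for fully sign-correlated errors; any partial correlation lands between 1× and 4.7×.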

cursoragent and others added 30 commits April 21, 2026 15:13

vllm_calibration_refit: set __file__ in the synthetic namespace

f2065ff

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Phase 1+5: fix dtype kwarg (transformers 4.51 uses torch_dtype)

486c68a

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

cursoragent and others added 10 commits April 21, 2026 23:50

Phase 4: fix flat row counting; compare vs ground truth correctly

1e44959

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Phase 2: reimplement HF Qwen2Attention.forward cleanly (no nn.Module …

a7c7b23

…setattr tricks) Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

FluffyAIcode mentioned this pull request Apr 22, 2026

Scenario A — snapshot-mode KV compression on vLLM: Δppl +29% / top-1 74% (improves top-1 by 15 pp, Δppl only by 6 pp) #17

Merged

FluffyAIcode closed this Apr 23, 2026

FluffyAIcode deleted the AgentMemory/v1-3-ppl-gap-decomposition-102e branch April 23, 2026 15:52

FluffyAIcode wants to merge 40 commits into main from AgentMemory/v1-3-ppl-gap-decomposition-102e
