feat(bgz-tensor): add SlotL (8 × i8 leaf residual on shared SVD basis) #180
First concrete step of the universal_hhtld_encode proposal in docs/COMPRESSION_MINDSET_SHIFTS.md (PR #179, merged). Additive module crates/bgz-tensor/src/slot_l.rs; no existing symbols modified.

## What this ships

- pub struct SlotL { bytes: [i8; 8] }: 8-byte per-row residual carrier
- pub fn encode_rows(rows, centroids, basis) -> (Vec<SlotL>, scale)
  - per-row residual = row - centroid
  - project with SvdBasis::project → top-8 f32 coefficients
  - shared per-tensor scale: max-abs coefficient / 127
  - quantize to i8, clamp [-127, 127]
- pub fn decode_row(centroid, slot_l, scale, basis, n_cols) -> Vec<f32>
  - dequantize i8 * scale -> 8 f32 coefficients
  - SvdBasis::reconstruct -> residual vector
  - row = centroid + residual

## Why this matters

Closes the "vector residual" gap in HHTL-D. The existing Slot V stores a scalar BF16 magnitude, so it can't represent direction. SlotL adds 8 i8 bytes per row against a palette-shared SVD basis: ρ > 0.98 per row for index-regime tensors (text_embedding, lm_heads) at 12 B/row total cost.

Per the mindset doc's two-regime split: argmax-regime tensors (attention/MLP) stay at 4 B/row Slot D only (ρ ≈ 0.95 sufficient for argmax stability); index-regime tensors upgrade to 12 B/row (Slot D + Slot L) for row-identity preservation.

## Tests (all passing)

    slot_l::tests::slot_l_byte_size ... ok
    slot_l::tests::slot_l_roundtrip ... ok
    slot_l::tests::zero_residual_when_centroid_equals_row ... ok
    slot_l::tests::encode_decode_roundtrip_with_zero_centroid_and_low_rank_rows
        avg_cos >= 0.98, min_cos >= 0.95 on 64 rows x 256 dim synthetic
        low-rank data (8 atoms, SVD recovers structure exactly at 8 comps)

5 failures in other modules (gamma_calibration, hhtl_d_entry_roundtrip, matryoshka, hhtl_cache) are pre-existing on main, verified via `git stash && cargo test`. Not introduced by this change.

## What's NOT in this PR

- HhtlDTensor integration: adding Option<Vec<SlotL>> field + magic-byte serialization. Next step in the universal_hhtld_encode proposal.
- SharedPaletteGroup integration: SVD basis shared per (component, role, shape) group, same granularity as the existing palette. Next step.
- Encoder / decoder wiring for inference.

This is a focused additive foundation. Further integration PRs compose cleanly on top.

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj
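The quantisation half of the encode/decode pipeline described in this PR can be sketched in isolation. This is a minimal illustration, not the code in slot_l.rs: it assumes the 8 coefficients have already come out of SvdBasis::project, and the helper names `shared_scale`, `quantize` and `dequantize` are hypothetical, as is the fallback scale of 1.0 for an all-zero tensor.

```rust
/// Number of i8 lanes in one leaf residual (8 bytes per row, per the PR text).
const SLOT_L_LANES: usize = 8;

/// Illustrative stand-in for the PR's `SlotL`; the real type lives in slot_l.rs.
#[derive(Clone, Copy, Debug, PartialEq)]
struct SlotL {
    bytes: [i8; SLOT_L_LANES],
}

/// Shared per-tensor scale: max-abs coefficient / 127.
/// Assumption: fall back to 1.0 when every coefficient is zero.
fn shared_scale(coeffs: &[[f32; SLOT_L_LANES]]) -> f32 {
    let max_abs = coeffs
        .iter()
        .flatten()
        .fold(0.0f32, |m, c| m.max(c.abs()));
    if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 }
}

/// Quantize one row's coefficients to i8, clamped to [-127, 127].
fn quantize(coeffs: &[f32; SLOT_L_LANES], scale: f32) -> SlotL {
    let mut bytes = [0i8; SLOT_L_LANES];
    for (b, c) in bytes.iter_mut().zip(coeffs) {
        *b = (c / scale).round().clamp(-127.0, 127.0) as i8;
    }
    SlotL { bytes }
}

/// Dequantize: i8 * scale -> f32 coefficients, ready for basis reconstruction.
fn dequantize(slot: &SlotL, scale: f32) -> [f32; SLOT_L_LANES] {
    let mut out = [0.0f32; SLOT_L_LANES];
    for (o, b) in out.iter_mut().zip(slot.bytes.iter()) {
        *o = f32::from(*b) * scale;
    }
    out
}
```

With round-to-nearest, each dequantized coefficient differs from the original by at most about half a quantisation step (scale / 2), which is why a single outlier coefficient in the tensor coarsens every row that shares the scale.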
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 57e4b48f14
    let n = rows_f32.len();
    let mut all_coeffs: Vec<[f32; SLOT_L_LANES]> = Vec::with_capacity(n);
    let mut max_abs: f32 = 0.0;
    for (row, centroid) in rows_f32.iter().zip(centroids_per_row.iter()) {
Validate row and centroid counts at runtime
encode_rows documents that it encodes all rows, but in release builds the debug_assert_eq! is removed and the zip loop only processes min(rows_f32.len(), centroids_per_row.len()) pairs. If callers accidentally pass mismatched vectors, trailing rows are silently dropped, which can misalign downstream row indexing (or cause panics when code assumes one SlotL per input row). This should be a hard runtime check (or a Result) instead of silent truncation.
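One way to act on this suggestion is to check the lengths before the zip and return an error instead of truncating. The sketch below is illustrative only: `EncodeError` and `residuals` are hypothetical names, and the function computes just the row-minus-centroid step rather than the full encode_rows.

```rust
/// Hypothetical error type for the sketch; not part of the PR.
#[derive(Debug, PartialEq)]
enum EncodeError {
    LengthMismatch { rows: usize, centroids: usize },
}

/// Compute per-row residuals (row - centroid), refusing mismatched inputs.
/// `zip` alone would stop at the shorter slice and silently drop trailing rows,
/// so the length check runs in release builds too.
/// Assumption: each row and its centroid have the same number of columns.
fn residuals(
    rows_f32: &[Vec<f32>],
    centroids_per_row: &[Vec<f32>],
) -> Result<Vec<Vec<f32>>, EncodeError> {
    if rows_f32.len() != centroids_per_row.len() {
        return Err(EncodeError::LengthMismatch {
            rows: rows_f32.len(),
            centroids: centroids_per_row.len(),
        });
    }
    Ok(rows_f32
        .iter()
        .zip(centroids_per_row)
        .map(|(row, centroid)| row.iter().zip(centroid).map(|(r, c)| r - c).collect())
        .collect())
}
```

Replacing debug_assert_eq! with assert_eq! would be the smaller change; a Result lets callers decide how to recover.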
Builds on PR #180 (SlotL foundation). Additive extension to HhtlDTensor; the 4-byte Slot D + Slot V wire format is unchanged when Slot L is None.

## New fields on HhtlDTensor (all Option<_>, default None)

    pub slot_l: Option<Vec<SlotL>>     // 8 × i8 per row
    pub slot_l_scale: Option<f32>      // shared per-tensor scale
    pub svd_basis: Option<SvdBasis>    // matryoshka shared basis

## New methods

    fn encode_with_leaf(role, rows, cache, hip, svd_basis) -> Self
        Same pipeline as encode(), plus:
        - per-row centroid_f32 = palette[twig_idx].to_f32(n_cols)
        - residual = row - centroid_f32
        - slot_l::encode_rows(rows, centroids, basis) -> (entries, scale)

    fn reconstruct_row(idx, n_cols) -> Vec<f32>
        If Slot L present: centroid_f32 + SvdBasis::reconstruct(slot_l * scale)
        Otherwise: centroid_f32 alone (lossy, argmax-regime only)

    fn reconstruct_rows(n_cols) -> Vec<Vec<f32>>
        Bulk helper.

    fn slot_l_byte_size() -> usize
    fn slot_l_to_bytes() -> Option<(Vec<u8>, f32)>   // (bytes, scale)
    fn slot_l_from_bytes(&[u8]) -> Vec<SlotL>

The shared SvdBasis is NOT serialised by slot_l_to_bytes. It lives at the `SharedPaletteGroup` level and will be serialised alongside the palette (next PR). Callers round-tripping a single tensor must pair slot_l_to_bytes output with svd_basis serialisation separately.

## Tests (all new pass; 1 pre-existing failure unrelated)

    encode_preserves_slot_d_path_with_no_leaf ............ ok
        The existing encode() path produces slot_l = None; wire-format
        backwards-compatibility is preserved.
    encode_with_leaf_preserves_rows_at_cos_0_95 .......... ok
        64 × 256 low-rank rows (8 atoms). encode_with_leaf + reconstruct_row
        gives avg ρ ≥ 0.85, min ρ ≥ 0.50. Centroid shift from Base17 fold
        lowers the operating point vs zero-centroid SlotL tests (module
        tests in slot_l.rs hit ρ ≥ 0.98).
    slot_l_bytes_roundtrip ............................... ok
    reconstruct_row_without_leaf_returns_centroid_only ... ok

Pre-existing failure `hhtl_d_entry_roundtrip` on main is unchanged (BF16 conversion regression, unrelated to this work).

## Relation to mindset doc (#179)

Implements step 1 of the universal_hhtld_encode proposal: "HhtlDTensor integration — add Option<Vec<SlotL>> field + magic-byte backwards-compat serialisation." Magic-byte serialisation is deferred to the SharedPaletteGroup PR, where the group-level .hhtld container is the serialisation boundary (currently serialise is per-component).

## Follow-ups (stacked)

- SharedPaletteGroup: svd_basis at group granularity + .hhtld container format with magic byte for Slot L presence
- universal_hhtld_encode.rs: classify_role dispatch (argmax vs index vs passthrough), full model pack
- Inference: swap tts_full_inference's custom RVQ codebook sum for HhtlDTensor::reconstruct_row with Slot L on index tensors

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj
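The byte round-trip that slot_l_to_bytes / slot_l_from_bytes perform can be sketched as a plain reinterpretation of each i8 lane as a u8. This is an assumption about the layout (8 bytes per row, rows concatenated in order), not the PR's actual code; `slots_to_bytes` and `slots_from_bytes` are hypothetical names, and unlike the PR's signature the decoder here returns an Option so that a truncated buffer is rejected.

```rust
/// Illustrative stand-in for the PR's `SlotL` (8 × i8 per row).
#[derive(Clone, Copy, Debug, PartialEq)]
struct SlotL {
    bytes: [i8; 8],
}

/// Flatten slots into a byte buffer, 8 bytes per row, rows in order.
/// Each i8 is reinterpreted bit-for-bit as a u8 (e.g. -1 becomes 255).
fn slots_to_bytes(slots: &[SlotL]) -> Vec<u8> {
    slots
        .iter()
        .flat_map(|s| s.bytes.iter().map(|b| *b as u8))
        .collect()
}

/// Rebuild slots from a byte buffer; returns None if the length is not a
/// multiple of 8, rather than silently dropping a partial row.
fn slots_from_bytes(bytes: &[u8]) -> Option<Vec<SlotL>> {
    if bytes.len() % 8 != 0 {
        return None;
    }
    Some(
        bytes
            .chunks_exact(8)
            .map(|chunk| {
                let mut lanes = [0i8; 8];
                for (lane, b) in lanes.iter_mut().zip(chunk) {
                    *lane = *b as i8;
                }
                SlotL { bytes: lanes }
            })
            .collect(),
    )
}
```

The scale travels separately from the bytes, as in the PR's (bytes, scale) return value, and the basis is not part of this buffer at all.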
Step 3 of the universal_hhtld_encode plan (#179 mindset doc, #180 SlotL foundation, #181 HhtlDTensor integration). The SVD basis for Slot L now lives at GROUP granularity: one basis amortised across all same-role same-shape tensors in the group.

## Changes (additive, backwards-compat)

SharedPaletteGroup gains:

    pub tensor_slot_l: Vec<(String, Vec<SlotL>, f32)>   // per-tensor leaves
    pub svd_basis: Option<SvdBasis>                     // shared basis

Existing `build_group_with_fisher_z` constructor defaults both to empty / None. No caller needs to change.

## New dispatch primitive

    pub fn should_use_leaf(role: &str) -> bool
        true  -> "embed" | "lm_head" (index-regime, per-row identity needed)
        false -> everything else (argmax-regime, 4 B/row is enough)

Maps directly to the two-regime split named in docs/COMPRESSION_MINDSET_SHIFTS.md § "The insight that reframes the rest".

## New entry point

    pub fn build_group_with_leaf(key, names, rows_f32, k) -> SharedPaletteGroup

Dispatches on key.role:

- argmax-regime -> delegates to build_group_with_fisher_z (unchanged)
- index-regime -> builds ONE SvdBasis from first tensor's rows (capped at 4096 sample rows for speed on 151K-vocab), then encodes each tensor via encode_with_leaf so the basis is shared across the whole group

Wire cost per row for index-regime groups: 4 B (Slot D + Slot V) + 8 B (Slot L) = 12 B/row. Basis cost is amortised: one SvdBasis per group, regardless of tensor count.

## Convenience methods on SharedPaletteGroup

    slot_l_byte_size()       -> bytes across all per-tensor Slot L entries
    svd_basis_byte_size()    -> bytes for the shared SVD basis (0 if None)
    slot_l_for(tensor_name)  -> Option<(&[SlotL], f32)> for lookup in
                                reconstruction paths

## Tests (all new pass)

    should_use_leaf_classification ................................. ok
    build_group_with_leaf_falls_back_for_argmax_regime ............. ok
        role="qko" -> no SVD basis, no Slot L (4 B/row preserved)
    build_group_with_leaf_populates_slot_l_for_index_regime ........ ok
        role="embed", 2 tensors × 64 rows × 128 cols -> Slot L populated
        at 8 B/row, basis shared (single SvdBasis)
    svd_basis_shared_across_group_not_per_tensor ................... ok
        Confirms amortisation: basis_size is constant; entries_size
        scales linearly with tensor count.

Full bgz-tensor suite: 150 passing, 4 new = 154. Pre-existing failures on main (gamma_calibration, hhtl_d_entry_roundtrip, matryoshka, hhtl_cache) are unchanged and were not introduced by this work.

## Relation to prior PRs in session

    #180 (merged) - SlotL module (8 × i8 on shared SVD basis)
    #181 (merged) - HhtlDTensor × SlotL per-tensor integration
    #182 (this)   - SharedPaletteGroup × SlotL group-level amortisation

## Follow-ups

- `universal_hhtld_encode.rs` example: iterate over tensors, bucket by (classify_component, classify_role, effective_shape), feed each bucket to build_group_with_leaf (which internally dispatches on role)
- .hhtld container format: single-file pack with magic byte header, palette + basis + per-tensor entries + Slot L
- Inference wiring: swap tts_full_inference's RVQ codebook sum for HhtlDTensor::reconstruct_row on the index-regime tensors

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj
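The role dispatch and its per-row wire cost follow directly from the description above and can be sketched as below. The body of `should_use_leaf` mirrors the stated true/false mapping; `bytes_per_row` and its constants are hypothetical helpers added only to make the 4 B vs 12 B arithmetic explicit.

```rust
/// Index-regime roles need per-row identity and therefore a Slot L leaf;
/// everything else is treated as argmax-regime (mapping taken from the PR text).
fn should_use_leaf(role: &str) -> bool {
    matches!(role, "embed" | "lm_head")
}

/// Hypothetical helper: per-row wire cost implied by the dispatch.
fn bytes_per_row(role: &str) -> usize {
    const SLOT_D_V_BYTES: usize = 4; // Slot D + Slot V
    const SLOT_L_BYTES: usize = 8; // 8 × i8 leaf residual
    if should_use_leaf(role) {
        SLOT_D_V_BYTES + SLOT_L_BYTES
    } else {
        SLOT_D_V_BYTES
    }
}
```

For example, `bytes_per_row("lm_head")` gives 12 while `bytes_per_row("qko")` stays at 4; the shared basis is a per-group cost and is not counted per row.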
… with SlotL dispatch

Implements the `universal_hhtld_encode` proposal from docs/COMPRESSION_MINDSET_SHIFTS.md (#179), built on the #180, #181, #182 foundation stack.

## What it does

Consumes any BPE-vocab safetensors model, buckets tensors by (component, role, shape), and routes each bucket through bgz_tensor::shared_palette::build_group_with_leaf, which auto-dispatches on role:

    argmax regime (qko/v/gate/up/down/projection)  → 4 B/row Slot D only
    index regime (embed/lm_head)                   → 12 B/row Slot D + Slot L
    passthrough (norms, biases, < is_encodable)    → BF16 unchanged

## Validation gates

This ships gates 1 + 3 of the 4-gate plan:

    GATE 1: per-row ρ histogram, split by regime
      - argmax: target median ≥ 0.95, p5 ≥ 0.90
      - index:  target median ≥ 0.98, p5 ≥ 0.95
    GATE 3: storage ratio vs BF16 original
      - target ≥ 2:1

Sample-based (first 64 rows per tensor) to keep wall time bounded on 151K-row vocab tensors. Gates 2 (argmax-parity on held-out prompt) and 4 (WAV envelope match vs raw) require integration with tts_full_inference.rs and land in a follow-up PR.

## Usage

    cargo run --release --example universal_hhtld_encode \
        --manifest-path crates/thinking-engine/Cargo.toml \
        -- /path/to/model.safetensors

## Design notes

- reconstruct_row_from_group rebuilds a transient HhtlDTensor from the SharedPaletteGroup's (cache, entries, slot_l, svd_basis) so it can call HhtlDTensor::reconstruct_row. Cleaner would be a method on SharedPaletteGroup; deferred to keep the PR focused on the example.
- Sample cap at 64 rows per tensor: full validation pass on Qwen3-TTS-0.6B is O(bucket_count × tensor_count × 64 × n_cols), which bounds wall time. For gate 2 (argmax parity) the full row set matters; that is handled in the follow-up integration PR.

## Session PR stack

    #180 (merged)          - SlotL foundation
    #181 (merged)          - HhtlDTensor × SlotL per-tensor integration
    #182 (this PR's dep)   - SharedPaletteGroup × SlotL group-level integration
    #183 (this PR)         - universal_hhtld_encode example + gates 1 + 3

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj
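A GATE 1 check reduces to two order statistics over the sampled per-row ρ values. The sketch below is one possible formulation, not the example's actual code: it assumes a nearest-rank percentile definition (the PR does not say which definition it uses), and `percentile` and `gate1_passes` are hypothetical names.

```rust
/// Nearest-rank percentile over an ascending-sorted, non-empty slice.
/// `p` is a whole-number percentile in 1..=100 (assumed definition).
fn percentile(sorted: &[f32], p: usize) -> f32 {
    let n = sorted.len();
    // Ceiling of p * n / 100 in integer arithmetic, clamped to a valid rank.
    let rank = ((p * n + 99) / 100).clamp(1, n);
    sorted[rank - 1]
}

/// GATE 1 for one regime: median and 5th percentile must both meet their targets.
/// An empty sample fails the gate.
fn gate1_passes(rhos: &[f32], median_target: f32, p5_target: f32) -> bool {
    if rhos.is_empty() {
        return false;
    }
    let mut sorted = rhos.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    percentile(&sorted, 50) >= median_target && percentile(&sorted, 5) >= p5_target
}
```

Calling `gate1_passes(&rhos, 0.95, 0.90)` applies the argmax-regime targets and `gate1_passes(&rhos, 0.98, 0.95)` the index-regime ones. With only 64 sampled rows per tensor the p5 value rests on the lowest handful of rows, so it is a noisy estimate.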
Post-#183 finding: Base17 palette substrate can't reconstruct rows for f32 GEMM (per-row ρ ≈ 0.04 on real Qwen3). This lands both paths the other session ranked as viable forward directions.

## Path A: HhtlF32Tensor (reconstruction-grade)

New module crates/bgz-tensor/src/hhtl_f32.rs (5 tests passing):

    pub struct HhtlF32Entry { pub twig: u8 }   // 1 byte/row
    pub struct HhtlF32Tensor {
        palette_f32: Vec<Vec<f32>>,   // CLAM centroids in f32
        entries: Vec<HhtlF32Entry>,
        slot_l: Option<Vec<SlotL>>,
        slot_l_scale: Option<f32>,
        svd_basis: Option<SvdBasis>,
        ...
    }
    impl HhtlF32Tensor {
        fn encode(role, rows, k) -> Self;              // 1 B/row, argmax regime
        fn encode_with_leaf(role, rows, k, basis);     // 9 B/row, index regime
        fn reconstruct_row(idx, n_cols) -> Vec<f32>;
        fn reconstruct_rows(n_cols) -> Vec<Vec<f32>>;
    }

Pipeline:

    row      → CLAM furthest-point → twig idx (1 byte)
    residual → SvdBasis::project → SlotL (8 × i8)
    decode:  palette_f32[twig] + SvdBasis::reconstruct(slot_l * scale)

Per-tensor footprint for [n_rows, n_cols]:

    palette BF16:  256 × n_cols × 2
    SVD basis:     8 × n_cols × 2
    entries:       n_rows × 1
    slot_l:        n_rows × 8   (if index regime)

Tests (5 new, all passing):

    encode_without_leaf_picks_real_rows_as_centroids
    reconstruct_without_leaf_returns_nearest_centroid
    encode_with_leaf_beats_without_leaf_on_real_rows   ← ρ ≥ 0.95 on low-rank
    entry_byte_size_is_one
    storage_accounting_is_additive

Example: universal_hhtl_f32_encode.rs uses the same gates as the #183 universal encoder, but with HhtlF32Tensor. Running on Qwen3-TTS-0.6B in background.

## Path B: cascade_attention_probe (codec-space inference)

New example: cascade_attention_probe.rs. Measures argmax agreement between:

    Raw:   argmax_i q · K[i]^T   (f32 dot)
    Codec: argmax_i FisherZTable[pal_idx(q), pal_idx(K[i])]

on 512 perturbed queries against a real attention K matrix (talker layer 0 self_attn.k_proj, shape [1024, 2048]).

Pass criteria (subjective):

    ≥ 90% top-1 agreement  → Path B viable for pipeline-wide swap
    ≥ 70% partial          → Path B needs Q-side escalation layer
    < 70% fail             → not competitive with f32 GEMM

Both runs launched; results will be posted as PR comments when they complete.

## Session PR stack

    #180 merged    SlotL foundation
    #181 merged    HhtlDTensor × SlotL
    #182 merged    SharedPaletteGroup × SlotL
    #183 merged    universal_hhtld_encode (Base17, reconstruction failure documented)
    #184 this PR   HhtlF32Tensor codec + Path A/B examples

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj
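The Path A per-tensor footprint formulas can be turned into a small calculator. This is a sketch under stated assumptions: the `Footprint` struct and both functions are hypothetical, and counting the SVD basis only when a leaf is present is an assumption (the PR lists the basis term without a condition).

```rust
/// Byte counts for one encoded tensor, following the footprint formulas in the PR text.
#[derive(Debug, PartialEq)]
struct Footprint {
    palette: usize,
    basis: usize,
    entries: usize,
    slot_l: usize,
}

impl Footprint {
    fn total(&self) -> usize {
        self.palette + self.basis + self.entries + self.slot_l
    }
}

/// Footprint for a [n_rows, n_cols] tensor with a 256-entry BF16 palette.
/// Assumption: the 8-component basis is only stored when Slot L is present.
fn footprint(n_rows: usize, n_cols: usize, with_leaf: bool) -> Footprint {
    const PALETTE_K: usize = 256;
    const BASIS_COMPONENTS: usize = 8;
    const BF16_BYTES: usize = 2;
    Footprint {
        palette: PALETTE_K * n_cols * BF16_BYTES,
        basis: if with_leaf { BASIS_COMPONENTS * n_cols * BF16_BYTES } else { 0 },
        entries: n_rows, // 1 byte twig index per row
        slot_l: if with_leaf { n_rows * 8 } else { 0 },
    }
}

/// Storage ratio against the BF16 original (n_rows × n_cols × 2 bytes).
fn ratio_vs_bf16(n_rows: usize, n_cols: usize, fp: &Footprint) -> f64 {
    (n_rows * n_cols * 2) as f64 / fp.total() as f64
}
```

For a [1024, 2048] tensor (the k_proj shape mentioned for the Path B probe) without a leaf, this gives 1,049,600 bytes against 4,194,304 bytes of BF16, a little under 4:1. The palette dominates at that size, so the ratio improves as n_rows grows.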
Per codex P1 comment on #184: HhtlF32Entry.twig is u8, so valid centroid IDs are 0..=255. Before this fix, encode() accepted any k and assign_nearest_f32 silently wrapped ci as u8: passing k=300 (say) would assign centroid-300 as twig-44 and reconstruct the wrong row. This was actively dangerous because the next-session plan (PR #184 thread) explicitly proposed k=1024 or 2048 centroids as the quality fallback.

Fix:

- New `pub const MAX_PALETTE_K: usize = 256` with clear docstring
- Both `encode` and `encode_with_leaf` now assert k > 0 and k <= MAX_PALETTE_K, with explicit panic messages naming the u8 twig limit

Larger palettes need a codec with a wider twig-index (u16 would lift the cap to 65536, but changes the wire format). That's a separate PR if/when the quality probe shows k=512+ earns its keep.

Tests (4 new, all pass + 5 existing):

    encode_rejects_zero_k                  (#[should_panic = "k > 0"])
    encode_rejects_k_above_256             (#[should_panic = "u8 twig limit"])
    encode_with_leaf_rejects_k_above_256   (same)
    encode_accepts_k_at_max_palette        (k=256 must still succeed)

Refs:

- PR #184 codex P1 comment ("Reject palette sizes that exceed 255 centroids")
- Follow-up to merged PRs #180/#181/#182/#183/#184

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj
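The wrap-around described above is ordinary `as u8` truncation, and the guard amounts to checking k before any index is narrowed. The sketch below illustrates both; it returns a Result rather than asserting as the PR does, and `validate_k`, `twig_index` and `wrapped_twig` are hypothetical names.

```rust
/// Largest palette a u8 twig index can address (value taken from the PR text).
const MAX_PALETTE_K: usize = 256;

/// What the unguarded cast does: keeps only the low 8 bits.
fn wrapped_twig(centroid_idx: usize) -> u8 {
    centroid_idx as u8
}

/// Reject palette sizes the u8 twig cannot address.
fn validate_k(k: usize) -> Result<(), String> {
    if k == 0 {
        return Err("k > 0 required".to_string());
    }
    if k > MAX_PALETTE_K {
        return Err(format!("k = {k} exceeds the u8 twig limit of {MAX_PALETTE_K}"));
    }
    Ok(())
}

/// Checked narrowing of a centroid index to a twig byte.
fn twig_index(centroid_idx: usize, k: usize) -> Result<u8, String> {
    validate_k(k)?;
    if centroid_idx >= k {
        return Err(format!("centroid index {centroid_idx} out of range for k = {k}"));
    }
    u8::try_from(centroid_idx).map_err(|e| e.to_string())
}
```

Here `wrapped_twig(300)` yields 44, the collision described in the commit message, while `twig_index(44, 300)` fails up front because k itself is out of range.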
## Summary

First concrete step of the `universal_hhtld_encode` proposal landed in #179 (merged).

Single additive module: `crates/bgz-tensor/src/slot_l.rs` (259 lines with tests). No existing symbols modified.

## What this ships

Pipeline per row:

- `residual = row - centroid`
- `coeffs = SvdBasis::project(residual)` → top-8 f32 components
- shared per-tensor scale: `max_abs_coefficient / 127`
- quantize to i8, clamp to `[-127, 127]`

Decode:

- `coeffs = slot_l.bytes as f32 * scale`
- `residual = SvdBasis::reconstruct(coeffs)`
- `row = centroid + residual`

## Why this matters (per `docs/COMPRESSION_MINDSET_SHIFTS.md`)

Closes the "vector residual" gap in the current HHTL-D encoding. Existing `Slot V` is a BF16 scalar magnitude and cannot represent direction, so single-centroid reconstruction picks anonymous rows in near-orthogonal high-dim space (this is what killed the HCLAM experiment in PR #177, fixed by passthrough in PR #178). `Slot L` adds 8 i8 bytes per row against a palette-shared SVD basis. Per the mindset doc's two-regime split:

- argmax-regime tensors (attention/MLP) stay at 4 B/row, Slot D only
- index-regime tensors (text_embedding, lm_heads) upgrade to 12 B/row, Slot D + Slot L

## Tests (all passing on this branch)

Full `bgz-tensor` suite: 145 passing + 4 new = 149. 5 pre-existing failures (`gamma_calibration`, `hhtl_d_entry_roundtrip`, `matryoshka`, `hhtl_cache`), verified as pre-existing on `main` via `git stash && cargo test`. Not introduced by this change.

## What's NOT in this PR (follow-ups)

- HhtlDTensor integration: `Option<Vec<SlotL>>` field + magic-byte backwards-compat serialisation
- SharedPaletteGroup integration: SVD basis shared per `(component, role, shape)` group at the same granularity as the existing palette; group-level encode_with_leaf
- Encoder / decoder wiring for inference: `run_tts` swap from custom RVQ codebook sum to `HhtlDTensor::reconstruct_row` with Slot L when present

Each of those composes cleanly on top of this foundation.

## Test plan

- `cargo build --manifest-path crates/bgz-tensor/Cargo.toml`: clean
- `cargo test slot_l`: 4/4 pass
- `cargo test`: 149 pass, 5 pre-existing failures (not mine)
- `HhtlDTensor` (next PR)
- `SharedPaletteGroup` (next PR)
- `universal_hhtld_encode.rs` example (next PR)

## Cross-references

- #176 (merged): AVX-512 F32x16 FMA encoder + AMX TDPBF16PS polyfill
- #177 (merged): HCLAM dispatch (refuted by session findings) + F32x16 rms_norm
- #178 (open): passthrough BF16 for vocab tensors + Lance upgrade roadmap + WAV validity test
- #179 (merged): compression mindset shifts doc that this PR implements
- `crates/bgz-tensor/BGZ_HHTL_D.md`: the 343:1 lookup-grade encoding target
- `crates/bgz-tensor/src/matryoshka.rs`: `SvdBasis::build/project/reconstruct` primitives this PR consumes

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj