
feat(bgz-tensor): add SlotL — 8 × i8 leaf residual on shared SVD basis #180

Merged: AdaWorldAPI merged 1 commit into main from claude/bgz-tensor-slot-l on Apr 15, 2026
@AdaWorldAPI
Owner

Summary

First concrete step of the universal_hhtld_encode proposal landed in #179 (merged).

Single additive module: crates/bgz-tensor/src/slot_l.rs (259 lines with tests). No existing symbols modified.

What this ships

pub struct SlotL { pub bytes: [i8; 8] }                    // 8 B/row

pub fn encode_rows(rows, centroids, basis) -> (Vec<SlotL>, f32);
pub fn decode_row(centroid, slot_l, scale, basis, n_cols) -> Vec<f32>;

Pipeline per row:

  • residual = row - centroid
  • coeffs = SvdBasis::project(residual) → top-8 f32 components
  • shared per-tensor scale = max_abs_coefficient / 127
  • quantize to i8, clamp [-127, 127]

Decode:

  • coeffs = slot_l.bytes as f32 * scale
  • residual = SvdBasis::reconstruct(coeffs)
  • row = centroid + residual
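
The quantize and dequantize steps above can be sketched in isolation. This is a minimal illustration, not the code in slot_l.rs: the helper names `shared_scale`, `quantize` and `dequantize` are hypothetical, the SVD projection is left out (the coefficients are taken as given), and the fallback scale of 1.0 for an all-zero tensor is an assumption.

```rust
/// Number of i8 lanes per row (8 in the PR description).
const LANES: usize = 8;

/// Shared per-tensor scale: the largest absolute coefficient maps to 127.
fn shared_scale(coeffs: &[[f32; LANES]]) -> f32 {
    let max_abs = coeffs
        .iter()
        .flat_map(|c| c.iter())
        .fold(0.0_f32, |m, v| m.max(v.abs()));
    // Assumption: an all-zero tensor falls back to scale 1.0 to avoid 0/0.
    if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 }
}

/// Quantize one row of coefficients to i8, clamped to [-127, 127].
fn quantize(coeffs: &[f32; LANES], scale: f32) -> [i8; LANES] {
    let mut out = [0i8; LANES];
    for (o, c) in out.iter_mut().zip(coeffs.iter()) {
        *o = (c / scale).round().clamp(-127.0, 127.0) as i8;
    }
    out
}

/// Dequantize: i8 * scale gives back approximate f32 coefficients.
fn dequantize(bytes: &[i8; LANES], scale: f32) -> [f32; LANES] {
    let mut out = [0.0f32; LANES];
    for (o, b) in out.iter_mut().zip(bytes.iter()) {
        *o = f32::from(*b) * scale;
    }
    out
}
```

With round-to-nearest, each recovered coefficient differs from the original by at most half a quantization step (scale / 2), which is why one outlier row with a large coefficient coarsens the grid for every row sharing the scale.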

Why this matters (per docs/COMPRESSION_MINDSET_SHIFTS.md)

Closes the "vector residual" gap in the current HHTL-D encoding. Existing Slot V is a BF16 scalar magnitude — cannot represent direction, so single-centroid reconstruction picks anonymous rows in near-orthogonal high-dim space (this is what killed the HCLAM experiment in PR #177, fixed by passthrough in PR #178).

Slot L adds 8 i8 bytes per row against a palette-shared SVD basis. Per the mindset doc's two-regime split:

| Regime | Bytes/row | Expected ρ per row | Use for |
|---|---|---|---|
| Argmax (attention/MLP/logits) | 4 (Slot D only) | ≈ 0.95 | argmax-stable matmul |
| Index (text_embedding, lm_heads) | 12 (Slot D + Slot L) | ≳ 0.98 | row-identity preservation |

Tests (all passing on this branch)

slot_l::tests::slot_l_byte_size ............................................. ok
slot_l::tests::slot_l_roundtrip ............................................. ok
slot_l::tests::zero_residual_when_centroid_equals_row ....................... ok
slot_l::tests::encode_decode_roundtrip_with_zero_centroid_and_low_rank_rows . ok
  avg_cos >= 0.98, min_cos >= 0.95 on 64 rows x 256 dim synthetic
  (8-atom low-rank data — SVD recovers structure exactly at 8 components,
   which is the design target for SlotL on index-regime tensors)

Full bgz-tensor suite: 145 passing + 4 new = 149. 5 pre-existing failures (gamma_calibration, hhtl_d_entry_roundtrip, matryoshka, hhtl_cache) — verified as pre-existing on main via git stash && cargo test. Not introduced by this change.

What's NOT in this PR (follow-ups)

  1. HhtlDTensor integration — add Option<Vec<SlotL>> field + magic-byte backwards-compat serialisation
  2. SharedPaletteGroup integration — SVD basis stored per (component, role, shape) group at the same granularity as the existing palette; group-level encode_with_leaf
  3. universal_hhtld_encode.rs example — full pipeline: classify_role dispatch (argmax vs index vs passthrough) → HhtlDTensor with conditional SlotL → argmax-parity validation against raw inference
  4. Inference wiring: run_tts swap from custom RVQ codebook sum to HhtlDTensor::reconstruct_row with Slot L when present

Each of those composes cleanly on top of this foundation.

Test plan

  • cargo build --manifest-path crates/bgz-tensor/Cargo.toml — clean
  • cargo test slot_l — 4/4 pass
  • Full cargo test — 149 pass, 5 pre-existing failures (not mine)
  • Integrate into HhtlDTensor (next PR)
  • Integrate into SharedPaletteGroup (next PR)
  • universal_hhtld_encode.rs example (next PR)

Cross-references

  • #176 (merged) — AVX-512 F32x16 FMA encoder + AMX TDPBF16PS polyfill
  • #177 (merged) — HCLAM dispatch (refuted by session findings) + F32x16 rms_norm
  • #178 (open) — passthrough BF16 for vocab tensors + Lance upgrade roadmap + WAV validity test
  • #179 (merged) — compression mindset shifts doc that this PR implements
  • crates/bgz-tensor/BGZ_HHTL_D.md — the 343:1 lookup-grade encoding target
  • crates/bgz-tensor/src/matryoshka.rs: SvdBasis::build/project/reconstruct primitives this PR consumes

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj

First concrete step of the universal_hhtld_encode proposal in
docs/COMPRESSION_MINDSET_SHIFTS.md (PR #179, merged). Additive module
crates/bgz-tensor/src/slot_l.rs — no existing symbols modified.

## What this ships

- pub struct SlotL { bytes: [i8; 8] } — 8-byte per-row residual carrier
- pub fn encode_rows(rows, centroids, basis) -> (Vec<SlotL>, scale)
    - per-row residual = row - centroid
    - project onto SvdBasis::project → top-8 f32 coefficients
    - shared per-tensor scale: max-abs coefficient / 127
    - quantize to i8, clamp [-127, 127]
- pub fn decode_row(centroid, slot_l, scale, basis, n_cols) -> Vec<f32>
    - dequantize i8 * scale -> 8 f32 coefficients
    - SvdBasis::reconstruct -> residual vector
    - row = centroid + residual

## Why this matters

Closes the "vector residual" gap in HHTL-D. The existing Slot V stores a
scalar BF16 magnitude — can't represent direction. SlotL adds 8 i8 bytes
per row against a palette-shared SVD basis — ρ > 0.98 per row for
index-regime tensors (text_embedding, lm_heads) at 12 B/row total cost.

Per the mindset doc's two-regime split: argmax-regime tensors
(attention/MLP) stay at 4 B/row Slot D only (ρ ≈ 0.95 sufficient for
argmax stability); index-regime tensors upgrade to 12 B/row
(Slot D + Slot L) for row-identity preservation.

## Tests (all passing)

  slot_l::tests::slot_l_byte_size ... ok
  slot_l::tests::slot_l_roundtrip ... ok
  slot_l::tests::zero_residual_when_centroid_equals_row ... ok
  slot_l::tests::encode_decode_roundtrip_with_zero_centroid_and_low_rank_rows
    avg_cos >= 0.98, min_cos >= 0.95 on 64 rows x 256 dim synthetic
    low-rank data (8 atoms, SVD recovers structure exactly at 8 comps)

5 failures in other modules (gamma_calibration, hhtl_d_entry_roundtrip,
matryoshka, hhtl_cache) are pre-existing on main — verified via
`git stash && cargo test`. Not introduced by this change.

## What's NOT in this PR

- HhtlDTensor integration: adding Option<Vec<SlotL>> field + magic-byte
  serialization. Next step in the universal_hhtld_encode proposal.
- SharedPaletteGroup integration: SVD basis shared per (component, role,
  shape) group, same granularity as the existing palette. Next step.
- Encoder / decoder wiring for inference.

This is a focused additive foundation. Further integration PRs compose
cleanly on top.

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 57e4b48f14


let n = rows_f32.len();
let mut all_coeffs: Vec<[f32; SLOT_L_LANES]> = Vec::with_capacity(n);
let mut max_abs: f32 = 0.0;
for (row, centroid) in rows_f32.iter().zip(centroids_per_row.iter()) {

P2: Validate row and centroid counts at runtime

encode_rows documents that it encodes all rows, but in release builds the debug_assert_eq! is removed and the zip loop only processes min(rows_f32.len(), centroids_per_row.len()) pairs. If callers accidentally pass mismatched vectors, trailing rows are silently dropped, which can misalign downstream row indexing (or cause panics when code assumes one SlotL per input row). This should be a hard runtime check (or a Result) instead of silent truncation.
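
One way the suggested hard check could look is sketched below. It is an illustration only: `LengthMismatch`, `check_paired_lengths` and `residuals` are hypothetical names, and the real encode_rows signature may differ. The point is that the length comparison runs in release builds too, so `zip` can no longer truncate silently.

```rust
/// Hypothetical error type reporting both lengths.
#[derive(Debug, PartialEq)]
struct LengthMismatch {
    rows: usize,
    centroids: usize,
}

/// Runs in debug and release builds, unlike debug_assert_eq!.
fn check_paired_lengths<R, C>(rows: &[R], centroids: &[C]) -> Result<(), LengthMismatch> {
    if rows.len() != centroids.len() {
        return Err(LengthMismatch {
            rows: rows.len(),
            centroids: centroids.len(),
        });
    }
    Ok(())
}

/// Per-row residuals (row - centroid); fails instead of dropping trailing rows.
fn residuals(
    rows: &[Vec<f32>],
    centroids: &[Vec<f32>],
) -> Result<Vec<Vec<f32>>, LengthMismatch> {
    check_paired_lengths(rows, centroids)?;
    Ok(rows
        .iter()
        .zip(centroids)
        .map(|(r, c)| r.iter().zip(c).map(|(a, b)| a - b).collect())
        .collect())
}
```

Returning a `Result` leaves the decision to the caller; an `assert_eq!` would be the simpler alternative if a mismatch is always a programming error.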


@AdaWorldAPI AdaWorldAPI merged commit 4914dc2 into main Apr 15, 2026
AdaWorldAPI pushed a commit that referenced this pull request Apr 15, 2026
Builds on PR #180 (SlotL foundation). Additive extension to HhtlDTensor;
the 4-byte Slot D + Slot V wire format is unchanged when Slot L is None.

## New fields on HhtlDTensor (all Option<_>, default None)

  pub slot_l:      Option<Vec<SlotL>>    // 8 × i8 per row
  pub slot_l_scale: Option<f32>           // shared per-tensor scale
  pub svd_basis:    Option<SvdBasis>      // matryoshka shared basis

## New methods

  fn encode_with_leaf(role, rows, cache, hip, svd_basis) -> Self
     Same pipeline as encode(), plus:
       - per-row centroid_f32 = palette[twig_idx].to_f32(n_cols)
       - residual = row - centroid_f32
       - slot_l::encode_rows(rows, centroids, basis) -> (entries, scale)

  fn reconstruct_row(idx, n_cols) -> Vec<f32>
     If Slot L present: centroid_f32 + SvdBasis::reconstruct(slot_l * scale)
     Otherwise:         centroid_f32 alone (lossy, argmax-regime only)

  fn reconstruct_rows(n_cols) -> Vec<Vec<f32>>
     Bulk helper.

  fn slot_l_byte_size() -> usize
  fn slot_l_to_bytes() -> Option<(Vec<u8>, f32)>   // (bytes, scale)
  fn slot_l_from_bytes(&[u8]) -> Vec<SlotL>

The shared SvdBasis is NOT serialised by slot_l_to_bytes — it lives at
the `SharedPaletteGroup` level and will be serialised alongside the
palette (next PR). Callers round-tripping a single tensor must pair
slot_l_to_bytes output with svd_basis serialisation separately.
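
A byte-level round trip for the Slot L entries alone might look like the sketch below. `pack_slots` and `unpack_slots` are hypothetical stand-ins for slot_l_to_bytes / slot_l_from_bytes, the local `SlotL` mirrors the struct from #180, and returning `Option` on a length that is not a multiple of 8 is an assumption (the method listed above returns a plain Vec). The scale and the SVD basis are deliberately not part of this payload.

```rust
/// Local mirror of the 8 x i8 leaf from #180, for illustration.
#[derive(Clone, Copy, Debug, PartialEq)]
struct SlotL {
    bytes: [i8; 8],
}

/// Flatten entries to raw bytes; each i8 is reinterpreted as u8 (two's complement).
fn pack_slots(entries: &[SlotL]) -> Vec<u8> {
    entries
        .iter()
        .flat_map(|e| e.bytes.iter().map(|b| *b as u8))
        .collect()
}

/// Inverse of pack_slots. Assumption: a ragged buffer is rejected with None.
fn unpack_slots(raw: &[u8]) -> Option<Vec<SlotL>> {
    if raw.len() % 8 != 0 {
        return None;
    }
    Some(
        raw.chunks_exact(8)
            .map(|chunk| {
                let mut bytes = [0i8; 8];
                for (o, b) in bytes.iter_mut().zip(chunk) {
                    *o = *b as i8;
                }
                SlotL { bytes }
            })
            .collect(),
    )
}
```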

## Tests (all new pass; 1 pre-existing failure unrelated)

  encode_preserves_slot_d_path_with_no_leaf ............ ok
    The existing encode() path produces slot_l = None; wire-format
    backwards-compatibility is preserved.

  encode_with_leaf_preserves_rows_at_cos_0_95 .......... ok
    64 × 256 low-rank rows (8 atoms). encode_with_leaf + reconstruct_row
    gives avg ρ ≥ 0.85, min ρ ≥ 0.50. Centroid shift from Base17 fold
    lowers the operating point vs zero-centroid SlotL tests (module
    tests in slot_l.rs hit ρ ≥ 0.98).

  slot_l_bytes_roundtrip ............................... ok
  reconstruct_row_without_leaf_returns_centroid_only ... ok

Pre-existing failure `hhtl_d_entry_roundtrip` on main is unchanged
(BF16 conversion regression, unrelated to this work).

## Relation to mindset doc (#179)

Implements step 1 of the universal_hhtld_encode proposal:
"HhtlDTensor integration — add Option<Vec<SlotL>> field + magic-byte
backwards-compat serialisation."

Magic-byte serialisation is deferred to the SharedPaletteGroup PR
where the group-level .hhtld container is the serialisation boundary
(currently serialise is per-component).

## Follow-ups (stacked)

- SharedPaletteGroup: svd_basis at group granularity + .hhtld container
  format with magic byte for Slot L presence
- universal_hhtld_encode.rs: classify_role dispatch (argmax vs index vs
  passthrough), full model pack
- Inference: swap tts_full_inference's custom RVQ codebook sum for
  HhtlDTensor::reconstruct_row with Slot L on index tensors

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj
AdaWorldAPI pushed a commit that referenced this pull request Apr 15, 2026
Step 3 of the universal_hhtld_encode plan (#179 mindset doc, #180 SlotL
foundation, #181 HhtlDTensor integration). The SVD basis for Slot L now
lives at GROUP granularity — one basis amortised across all same-role
same-shape tensors in the group.

## Changes (additive, backwards-compat)

SharedPaletteGroup gains:
  - pub tensor_slot_l: Vec<(String, Vec<SlotL>, f32)>   // per-tensor leaves
  - pub svd_basis: Option<SvdBasis>                      // shared basis

Existing `build_group_with_fisher_z` constructor defaults both to empty /
None. No caller needs to change.

## New dispatch primitive

  pub fn should_use_leaf(role: &str) -> bool
    true  -> "embed" | "lm_head"   (index-regime, per-row identity needed)
    false -> everything else       (argmax-regime, 4 B/row is enough)

Maps directly to the two-regime split named in
docs/COMPRESSION_MINDSET_SHIFTS.md § "The insight that reframes the rest".
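
The dispatch rule as described fits in a few lines. The sketch below is written from the description above rather than copied from the crate, so the names carry a `_sketch` suffix, and `bytes_per_row_sketch` is a hypothetical helper added only to show the resulting wire cost.

```rust
/// Index-regime roles get a Slot L leaf; everything else stays at Slot D only.
fn should_use_leaf_sketch(role: &str) -> bool {
    matches!(role, "embed" | "lm_head")
}

/// Hypothetical helper: per-row wire cost implied by the dispatch
/// (4 B without a leaf, 4 B + 8 B with one).
fn bytes_per_row_sketch(role: &str) -> usize {
    if should_use_leaf_sketch(role) { 4 + 8 } else { 4 }
}
```

Matching on exact strings means any role name not listed falls through to the argmax regime, so a misspelled "lm_head" would silently lose its leaf.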

## New entry point

  pub fn build_group_with_leaf(key, names, rows_f32, k) -> SharedPaletteGroup

Dispatches on key.role:
  - argmax-regime  -> delegates to build_group_with_fisher_z (unchanged)
  - index-regime   -> builds ONE SvdBasis from first tensor's rows
                      (capped at 4096 sample rows for speed on 151K-vocab),
                      then encodes each tensor via encode_with_leaf so the
                      basis is shared across the whole group

Wire cost per row for index-regime groups: 4 B (Slot D + Slot V) + 8 B
(Slot L) = 12 B/row. Basis cost is amortised: one SvdBasis per group,
regardless of tensor count.

## Convenience methods on SharedPaletteGroup

  slot_l_byte_size()        -> bytes across all per-tensor Slot L entries
  svd_basis_byte_size()     -> bytes for the shared SVD basis (0 if None)
  slot_l_for(tensor_name)   -> Option<(&[SlotL], f32)> for lookup in
                                reconstruction paths

## Tests (all new pass)

  should_use_leaf_classification ................................. ok
  build_group_with_leaf_falls_back_for_argmax_regime ............. ok
    role="qko" -> no SVD basis, no Slot L (4 B/row preserved)
  build_group_with_leaf_populates_slot_l_for_index_regime ........ ok
    role="embed", 2 tensors × 64 rows × 128 cols -> Slot L populated
    at 8 B/row, basis shared (single SvdBasis)
  svd_basis_shared_across_group_not_per_tensor ................... ok
    Confirms amortisation: basis_size is constant; entries_size scales
    linearly with tensor count.

Full bgz-tensor suite: 150 passing, 4 new = 154. Pre-existing failures
on main (gamma_calibration, hhtl_d_entry_roundtrip, matryoshka,
hhtl_cache) are unchanged — not introduced by this work.

## Relation to prior PRs in session

  #180 (merged) - SlotL module (8 × i8 on shared SVD basis)
  #181 (merged) - HhtlDTensor × SlotL per-tensor integration
  #182 (this)   - SharedPaletteGroup × SlotL group-level amortisation

## Follow-ups

- `universal_hhtld_encode.rs` example: iterate over tensors, bucket by
  (classify_component, classify_role, effective_shape), feed each bucket
  to build_group_with_leaf (which internally dispatches on role)
- .hhtld container format: single-file pack with magic byte header,
  palette + basis + per-tensor entries + Slot L
- Inference wiring: swap tts_full_inference's RVQ codebook sum for
  HhtlDTensor::reconstruct_row on the index-regime tensors

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj
AdaWorldAPI pushed a commit that referenced this pull request Apr 15, 2026
… with SlotL dispatch

Implements the `universal_hhtld_encode` proposal from
docs/COMPRESSION_MINDSET_SHIFTS.md (#179), built on the #180, #181, #182
foundation stack.

## What it does

Consumes any BPE-vocab safetensors model, buckets tensors by
(component, role, shape), routes each bucket through
bgz_tensor::shared_palette::build_group_with_leaf which auto-dispatches
on role:

  argmax regime  (qko/v/gate/up/down/projection)  → 4 B/row Slot D only
  index  regime  (embed/lm_head)                   → 12 B/row Slot D + Slot L
  passthrough    (norms, biases, < is_encodable)   → BF16 unchanged

## Validation gates

This ships gates 1 + 3 of the 4-gate plan:

  GATE 1: per-row ρ histogram, split by regime
          - argmax: target median ≥ 0.95, p5 ≥ 0.90
          - index:  target median ≥ 0.98, p5 ≥ 0.95
  GATE 3: storage ratio vs BF16 original
          - target ≥ 2:1

Sample-based (first 64 rows per tensor) to keep wall time bounded on
151K-row vocab tensors.
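
A rough sketch of how a gate-1 check could be computed is given below. It assumes ρ is the cosine similarity between an original row and its reconstruction (the module tests report avg_cos / min_cos, but the example may use a different statistic), and it uses a nearest-rank percentile; `cosine`, `percentile` and `gate1` are hypothetical names.

```rust
/// Cosine similarity between an original row and its reconstruction.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Nearest-rank percentile over an ascending, non-empty slice.
fn percentile(sorted: &[f32], p: f32) -> f32 {
    let rank = ((p / 100.0) * sorted.len() as f32).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

/// Gate 1 for one regime: median and 5th percentile must both meet their targets.
fn gate1(mut rhos: Vec<f32>, median_target: f32, p5_target: f32) -> bool {
    if rhos.is_empty() {
        return false; // Assumption: no samples means the gate cannot pass.
    }
    rhos.sort_by(|a, b| a.total_cmp(b));
    percentile(&rhos, 50.0) >= median_target && percentile(&rhos, 5.0) >= p5_target
}
```

With the targets listed above, the index regime would be checked as `gate1(rhos, 0.98, 0.95)` and the argmax regime as `gate1(rhos, 0.95, 0.90)`. With only 64 sampled rows per tensor, the nearest-rank p5 sits at the fourth-lowest value, so a few bad rows decide the result.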

Gates 2 (argmax-parity on held-out prompt) and 4 (WAV envelope match
vs raw) require integration with tts_full_inference.rs and land in a
follow-up PR.

## Usage

  cargo run --release --example universal_hhtld_encode \
      --manifest-path crates/thinking-engine/Cargo.toml \
      -- /path/to/model.safetensors

## Design notes

- reconstruct_row_from_group rebuilds a transient HhtlDTensor from the
  SharedPaletteGroup's (cache, entries, slot_l, svd_basis) so it can
  call HhtlDTensor::reconstruct_row. Cleaner would be a method on
  SharedPaletteGroup; deferred to keep the PR focused on the example.
- Sample cap at 64 rows per tensor: full validation pass on Qwen3-TTS-0.6B
  is O(bucket_count × tensor_count × 64 × n_cols) which bounds wall time.
  For gate 2 (argmax parity) the full row set matters — handled in the
  follow-up integration PR.

## Session PR stack

  #180 (merged) - SlotL foundation
  #181 (merged) - HhtlDTensor × SlotL per-tensor integration
  #182 (this PR's dep) - SharedPaletteGroup × SlotL group-level integration
  #183 (this PR) - universal_hhtld_encode example + gates 1 + 3

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj
AdaWorldAPI pushed a commit that referenced this pull request Apr 15, 2026
Post-#183 finding: Base17 palette substrate can't reconstruct rows for
f32 GEMM (per-row ρ ≈ 0.04 on real Qwen3). This lands both paths the
other session ranked as viable forward directions.

## Path A — HhtlF32Tensor (reconstruction-grade)

New module crates/bgz-tensor/src/hhtl_f32.rs (5 tests passing):

  pub struct HhtlF32Entry { pub twig: u8 }             // 1 byte/row
  pub struct HhtlF32Tensor {
      palette_f32: Vec<Vec<f32>>,    // CLAM centroids in f32
      entries:     Vec<HhtlF32Entry>,
      slot_l:      Option<Vec<SlotL>>,
      slot_l_scale: Option<f32>,
      svd_basis:   Option<SvdBasis>,
      ...
  }

  impl HhtlF32Tensor {
      fn encode(role, rows, k) -> Self;           // 1 B/row, argmax regime
      fn encode_with_leaf(role, rows, k, basis);  // 9 B/row, index regime
      fn reconstruct_row(idx, n_cols) -> Vec<f32>;
      fn reconstruct_rows(n_cols) -> Vec<Vec<f32>>;
  }

Pipeline:
  row     →   CLAM furthest-point  →  twig idx (1 byte)
  residual →  SvdBasis::project    →  SlotL (8 × i8)
  decode:   palette_f32[twig] + SvdBasis::reconstruct(slot_l * scale)
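
The decode line can be illustrated with a small stand-alone function. This sketch assumes the basis is stored as up to 8 row vectors of length n_cols and that reconstruction is a plain linear combination of those vectors; the real SvdBasis::reconstruct may be laid out differently. `reconstruct_row_sketch` is a hypothetical name.

```rust
/// centroid + sum_k (byte_k * scale) * basis_k, or the centroid alone without a leaf.
fn reconstruct_row_sketch(
    palette: &[Vec<f32>],
    twig: u8,
    leaf: Option<(&[i8; 8], f32)>,
    basis: &[Vec<f32>],
) -> Vec<f32> {
    // Start from the palette centroid selected by the 1-byte twig index.
    let mut row = palette[twig as usize].clone();
    if let Some((bytes, scale)) = leaf {
        for (b, axis) in bytes.iter().zip(basis) {
            // Dequantize one coefficient and add its basis direction.
            let coeff = f32::from(*b) * scale;
            for (x, a) in row.iter_mut().zip(axis) {
                *x += coeff * a;
            }
        }
    }
    row
}
```

Passing `None` for the leaf reproduces the argmax-regime behaviour: the row is whatever centroid the twig points at.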

Per-tensor footprint for [n_rows, n_cols]:
  palette BF16: 256 × n_cols × 2
  SVD basis:    8 × n_cols × 2
  entries:      n_rows × 1
  slot_l:       n_rows × 8 (if index regime)
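
The footprint list translates directly into arithmetic. The sketch below follows the list literally (only slot_l is treated as conditional, and the palette is counted at 256 entries regardless of k), which is an assumption about how the accounting is meant to be read; `footprint_bytes` and `ratio_vs_bf16` are hypothetical names.

```rust
/// Per-tensor bytes for an [n_rows, n_cols] tensor, following the list above.
fn footprint_bytes(n_rows: usize, n_cols: usize, with_leaf: bool) -> usize {
    let palette = 256 * n_cols * 2; // 256 centroids stored as BF16
    let basis = 8 * n_cols * 2; // 8 basis vectors stored as BF16
    let entries = n_rows; // 1-byte twig index per row
    let slot_l = if with_leaf { n_rows * 8 } else { 0 }; // 8 x i8 per row
    palette + basis + entries + slot_l
}

/// Storage ratio against the BF16 original (2 bytes per element).
fn ratio_vs_bf16(n_rows: usize, n_cols: usize, with_leaf: bool) -> f64 {
    (n_rows * n_cols * 2) as f64 / footprint_bytes(n_rows, n_cols, with_leaf) as f64
}
```

For a [1024, 2048] tensor with a leaf this gives 1,090,560 bytes against 4,194,304 bytes of BF16, roughly 3.8:1. The palette dominates at that size, so the ratio improves as n_rows grows relative to 256.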

Tests (5 new, all passing):
  encode_without_leaf_picks_real_rows_as_centroids
  reconstruct_without_leaf_returns_nearest_centroid
  encode_with_leaf_beats_without_leaf_on_real_rows  ← ρ ≥ 0.95 on low-rank
  entry_byte_size_is_one
  storage_accounting_is_additive

Example: universal_hhtl_f32_encode.rs — same gates as #183 universal
encoder, but uses HhtlF32Tensor. Running on Qwen3-TTS-0.6B in background.

## Path B — cascade_attention_probe (codec-space inference)

New example: cascade_attention_probe.rs. Measures argmax agreement
between:
  Raw:    argmax_i  q · K[i]^T                          (f32 dot)
  Codec:  argmax_i  FisherZTable[pal_idx(q), pal_idx(K[i])]

on 512 perturbed queries against a real attention K matrix (talker
layer 0 self_attn.k_proj, shape [1024, 2048]).

Pass criteria (subjective):
  ≥ 90% top-1 agreement → Path B viable for pipeline-wide swap
  ≥ 70% partial         → Path B needs Q-side escalation layer
  <  70% fail           → not competitive with f32 GEMM
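
The agreement metric and the three bands can be sketched as follows. The score vectors are taken as given here (one per query, from the raw f32 path and from the codec path); how the codec scores are looked up in the FisherZTable is outside this sketch, and `argmax`, `top1_agreement`, `Verdict` and `classify` are hypothetical names.

```rust
/// Index of the largest score, or None for an empty slice.
fn argmax(scores: &[f32]) -> Option<usize> {
    scores
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}

/// Fraction of queries where both scoring paths pick the same key.
fn top1_agreement(raw: &[Vec<f32>], codec: &[Vec<f32>]) -> f32 {
    assert_eq!(raw.len(), codec.len(), "one score vector per query on both paths");
    if raw.is_empty() {
        return 0.0;
    }
    let hits = raw
        .iter()
        .zip(codec)
        .filter(|(r, c)| argmax(r) == argmax(c))
        .count();
    hits as f32 / raw.len() as f32
}

#[derive(Debug, PartialEq)]
enum Verdict {
    Viable,          // >= 90% top-1 agreement
    NeedsEscalation, // >= 70%
    Fail,            // < 70%
}

fn classify(agreement: f32) -> Verdict {
    if agreement >= 0.90 {
        Verdict::Viable
    } else if agreement >= 0.70 {
        Verdict::NeedsEscalation
    } else {
        Verdict::Fail
    }
}
```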

Both runs launched; results will be posted as PR comments when they
complete.

## Session PR stack

  #180 merged   SlotL foundation
  #181 merged   HhtlDTensor × SlotL
  #182 merged   SharedPaletteGroup × SlotL
  #183 merged   universal_hhtld_encode (Base17 — reconstruction failure documented)
  #184 this PR  HhtlF32Tensor codec + Path A/B examples

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj
AdaWorldAPI pushed a commit that referenced this pull request Apr 15, 2026
Per codex P1 comment on #184: HhtlF32Entry.twig is u8, so valid
centroid IDs are 0..=255. Before this fix, encode() accepted any k
and assign_nearest_f32 silently wrapped ci as u8 — passing k=300
(say) would assign centroid-300 as twig-44 and reconstruct the wrong
row. This was actively dangerous because the next-session plan (PR
#184 thread) explicitly proposed k=1024 or 2048 centroids as the
quality fallback.

Fix:
  - New `pub const MAX_PALETTE_K: usize = 256` with clear docstring
  - Both `encode` and `encode_with_leaf` now assert:
      k > 0
      k <= MAX_PALETTE_K
    with explicit panic messages naming the u8 twig limit

Larger palettes need a codec with a wider twig-index (u16 would lift
the cap to 65536, but changes the wire format). That's a separate PR
if/when the quality probe shows k=512+ earns its keep.
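
The wrap described above is ordinary `as` truncation, and a checked conversion avoids it. The sketch below is illustrative: `wrapping_twig`, `checked_twig` and `validate_k` are hypothetical helpers, and `validate_k` returns a `Result` where the fix itself asserts and panics.

```rust
use std::convert::TryFrom;

const MAX_PALETTE_K: usize = 256;

/// What the pre-fix code effectively did: `as u8` keeps only the low 8 bits.
fn wrapping_twig(ci: usize) -> u8 {
    ci as u8
}

/// Checked alternative: None when the centroid index does not fit in a u8.
fn checked_twig(ci: usize) -> Option<u8> {
    u8::try_from(ci).ok()
}

/// Reject palette sizes that a u8 twig index cannot address.
fn validate_k(k: usize) -> Result<usize, String> {
    if k == 0 {
        return Err("k > 0 required".to_string());
    }
    if k > MAX_PALETTE_K {
        return Err(format!(
            "k = {} exceeds the u8 twig limit of {}",
            k, MAX_PALETTE_K
        ));
    }
    Ok(k)
}
```

Here `wrapping_twig(300)` yields 44, matching the centroid-300-as-twig-44 case in the commit message, while `checked_twig(300)` yields `None`.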

Tests (4 new, all pass + 5 existing):
  encode_rejects_zero_k            (#[should_panic = "k > 0"])
  encode_rejects_k_above_256       (#[should_panic = "u8 twig limit"])
  encode_with_leaf_rejects_k_above_256  (same)
  encode_accepts_k_at_max_palette  (k=256 must still succeed)

Refs:
  - PR #184 codex P1 comment ("Reject palette sizes that exceed 255 centroids")
  - Follow-up to merged PRs #180/#181/#182/#183/#184

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj