fix: passthrough BF16 for vocab tensors + Lance upgrade roadmap + WAV validity test#178

Merged
AdaWorldAPI merged 3 commits into main from claude/teleport-session-setup-wMZfb
Apr 14, 2026

Conversation

@AdaWorldAPI
Owner

Summary

Follow-up to #177 (merged) with the correctness fix for the text-embedding collapse, a planning doc for Lance 2 → 4/5 migration, and an end-to-end validity test producing actual WAV audio.

| SHA | What |
|---|---|
| 3b084d1 | Passthrough BF16 for n_rows > 8192. HCLAM 256×256 (landed in #177) collapsed WORSE than RVQ on the vocab embedding (cos=0.004 vs 0.054). Root cause: vocab rows are near-orthogonal in 2048-d; single-centroid tree quantization cannot synthesize directions. Skip compression on the one vocab-sized tensor and ship it as BF16. Result: codec token match 225/225 = 100.0% on Qwen3-TTS-0.6B. |
| f6ed834 | docs/LANCE_UPGRADE_ROADMAP.md (161 lines). Planning doc for Lance 2→4/5 migration. 9 features relevant (IVF_RQ, IVF multi-split PR #6423, HNSW fp16 partition assignment, CacheBackend, distributed segment builds, BF16 PyTorch ingest, pre-transposed PQ SIMD, file format 2.3, Hamming HNSW). DataFusion 51→52.1 is the primary blocker. 5-phase plan. |
| 6e8ee98 | tts_full_inference mkdir_p fix. The example panicked on a fresh clone because the codebooks/ parent dir didn't exist. Now creates it. Surfaced while running the end-to-end validity test that produced the WAV. |

End-to-end validity test (new signal)

The #177 "225/225 codec token match" only proved that raw inference and RVQ-reconstructed inference agree, i.e. that the codec is stable. It did not prove that the raw path itself was correct: a bug that emits a single constant token would still show 225/225.
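The gap described above can be narrowed with a cheap degeneracy check on the generated codes. This is a minimal sketch, assuming the codec tokens are available as one `Vec<u16>` per autoregressive step; the function names `distinct_tokens` and `is_constant_output` are hypothetical and not part of the repo.

```rust
use std::collections::HashSet;

/// Count distinct codec token values across all frames.
/// A result of 1 means every position emitted the same token, which a
/// raw-vs-reconstructed match (e.g. 225/225) would not detect.
fn distinct_tokens(frames: &[Vec<u16>]) -> usize {
    frames
        .iter()
        .flatten()
        .copied()
        .collect::<HashSet<u16>>()
        .len()
}

/// True when the output is degenerate: no tokens, or a single repeated token.
fn is_constant_output(frames: &[Vec<u16>]) -> bool {
    distinct_tokens(frames) <= 1
}
```

A check like this only rules out the constant-token failure mode; it says nothing about whether the varied tokens are the right ones.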

This session ran tts_full_inference.rs end-to-end on Qwen3-TTS-0.6B with prompt "Hello world, this is a test of text to speech synthesis using a compressed neural network running entirely in Rust.":

[1] Tokenize: 22 tokens
[2] Embed: RMS=0.013 in 1.38s
[3] 28 talker layers: RMS climbs 0.66 → 1.01 (expected growth)
[3.5] codec_head + re-embed: RMS=0.026
[4] 5 code_predictor layers: RMS 0.23 → 1.37
[5] 128 autoregressive steps in 50.12s
    Step 0:   tokens=[68, 151, 102, 40]
    Step 32:  tokens=[254, 183, 154, 199]
    Step 64:  tokens=[92, 48, 158, 42]
    Step 96:  tokens=[31, 81, 196, 221]
    Step 127: tokens=[142, 251, 253, 178]
[6] Decoder latent RMS=2.29
[7] WAV: /home/user/models/tts_real_speech.wav (65580 bytes, 1.37s)
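The byte count and duration reported in step [7] can be cross-checked with simple arithmetic. The sketch below assumes a canonical 44-byte RIFF/WAVE header with no extra chunks and mono 16-bit PCM data; the helper names are illustrative.

```rust
/// Size of the canonical RIFF/WAVE PCM header (assumption: no extra chunks).
const WAV_HEADER_BYTES: u64 = 44;

/// Number of mono 16-bit samples in a WAV file of `file_bytes` bytes.
fn pcm16_mono_samples(file_bytes: u64) -> u64 {
    // Each sample occupies 2 bytes; saturating_sub guards against tiny files.
    file_bytes.saturating_sub(WAV_HEADER_BYTES) / 2
}

/// Duration in seconds at the given sample rate.
fn duration_secs(samples: u64, sample_rate_hz: u32) -> f64 {
    samples as f64 / sample_rate_hz as f64
}
```

With 65,580 bytes this gives 32,768 samples, and 32,768 / 24,000 is roughly 1.37 s, which matches the trace.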

WAV statistics (mono 24 kHz 16-bit, 32,768 samples):

| Metric | Value | Interpretation |
|---|---|---|
| RMS | 0.0865 (−21 dB) | normal speech loudness |
| Peak | 1.0000 | hits limiter (tanh/snake) |
| Zero-crossings | 11,968 /s | high; fricatives / unvoiced |
| Energy by time decile | 0.097, 0.092, 0.066, 0.087, 0.068, 0.059, 0.131, 0.075, 0.067, 0.096 | varies; real amplitude modulation |

Not silence, not constant, not noise. Codec tokens vary across all 128×16 positions. The wiring is real: the 28 talker layers, the codec head, the 5 code_predictor layers (33 transformer layers in total), RVQ dequant, and the conv decoder all work together and emit dynamic audio.
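Statistics of this kind can be computed directly from the decoded samples. The following is a sketch under stated assumptions: samples are normalized `f32` values in [-1, 1], and the per-decile figure is taken to be the RMS of each tenth of the signal (the reported decile values are consistent with that reading, since their root-mean-square is about 0.086). The struct and function names are hypothetical.

```rust
/// Summary statistics for normalized mono audio samples in [-1.0, 1.0].
struct AudioStats {
    rms: f32,
    peak: f32,
    zero_crossings_per_sec: f32,
    decile_rms: [f32; 10],
}

/// Root-mean-square of a slice; 0.0 for an empty slice.
fn rms(x: &[f32]) -> f32 {
    if x.is_empty() {
        return 0.0;
    }
    (x.iter().map(|s| s * s).sum::<f32>() / x.len() as f32).sqrt()
}

fn audio_stats(samples: &[f32], sample_rate_hz: f32) -> AudioStats {
    let peak = samples.iter().fold(0.0f32, |m, s| m.max(s.abs()));
    // A zero crossing is a sign change between consecutive samples.
    let crossings = samples
        .windows(2)
        .filter(|w| (w[0] < 0.0) != (w[1] < 0.0))
        .count();
    let secs = samples.len() as f32 / sample_rate_hz;
    // Split the signal into ten equal time slices and take the RMS of each.
    let mut decile_rms = [0.0f32; 10];
    for (i, d) in decile_rms.iter_mut().enumerate() {
        let lo = i * samples.len() / 10;
        let hi = (i + 1) * samples.len() / 10;
        *d = rms(&samples[lo..hi]);
    }
    AudioStats {
        rms: rms(samples),
        peak,
        zero_crossings_per_sec: if secs > 0.0 { crossings as f32 / secs } else { 0.0 },
        decile_rms,
    }
}
```

Silence would show RMS near zero, a constant output would show zero crossings near zero, and flat deciles would suggest stationary noise rather than speech.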

Storage note (deferred)

With the passthrough fix active, total storage is still 1:1.39, i.e. the compressed artifact is about 1.39× the size of the original. This is a net loss because the per-tensor RVQ codebooks at k=[256, 512, 1024, 4096] are individually larger than the BF16 tensors they compress. The correctness gate passed; storage ratio is a separate optimization track (see follow-up directions below).
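To see how a codebook can outweigh the tensor it compresses, here is a rough size comparison. It is a sketch only, and it assumes things the PR does not confirm: that all four stages are stored per tensor, that each stage keeps `k` full-width centroid rows, and that centroids are stored at 2 bytes per element. Per-row index storage is ignored.

```rust
/// Bytes needed to store one RVQ codebook set.
/// Assumptions (not confirmed by the PR): every stage keeps `k` full-width
/// centroid rows of `n_cols` elements, `bytes_per_elem` bytes each.
fn rvq_codebook_bytes(stage_ks: &[usize], n_cols: usize, bytes_per_elem: usize) -> usize {
    stage_ks.iter().sum::<usize>() * n_cols * bytes_per_elem
}

/// Bytes of the uncompressed tensor in BF16 (2 bytes per element).
fn bf16_tensor_bytes(n_rows: usize, n_cols: usize) -> usize {
    n_rows * n_cols * 2
}
```

Under these assumptions the four stages hold 256 + 512 + 1024 + 4096 = 5,888 centroid rows, so any tensor with fewer rows than that is smaller than its own codebooks.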

Follow-up directions (out of scope for this PR)

  • BGZ-HHTL-D schema extension — user flagged it. Current Slot D (16-bit tree) + Slot V (16-bit BF16 scalar residual) is 4 B/row but can't point in specific directions on 1024-d rows. Proposed: add Slot L (8 bytes of i8 or TurboQuant on shared SVD basis) → 12 B/row total, ρ > 0.99 per row. Reuses matryoshka.rs SVD basis already in repo. Would give ~50:1 ratio on the vocab embedding (1.8 MB vs 622 MB BF16) at cos > 0.98.
  • Family codebook restructure — the existing 2+4+8 = 14-bit address space already supports 16,384 effective entries if HIP-indexed sub-palettes replace the single flat 256-entry palette. Zero wire-format change.
  • Lance 4.0 IVF_RQ + multi-split adoption — see docs/LANCE_UPGRADE_ROADMAP.md. Multi-split (PR #6423) is a candidate fix for the skewed-partition problem we hit. Vendor the algorithm without the full Lance migration as a first step.
  • Python HF reference comparison — listen to the WAV manually, also diff our codec tokens against transformers.AutoModelForCausalLM output on the same prompt for token-exact wiring proof.
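The 12 B/row layout proposed in the first bullet can be sketched as a plain struct. Field names and the `RowSlots` type are hypothetical; only the slot widths (16-bit Slot D, 16-bit Slot V, 8 bytes of i8 for Slot L) come from the text above, and the shared SVD basis is not modelled.

```rust
/// Hypothetical 12-byte row layout for the proposed schema extension.
/// Field names are illustrative; only the slot widths come from the PR text.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
struct RowSlots {
    slot_d: u16,     // Slot D: 16-bit tree address
    slot_v: u16,     // Slot V: BF16 scalar residual, stored as raw bits
    slot_l: [i8; 8], // Slot L: 8 x i8 coefficients on a shared SVD basis
}

/// Total payload bytes for `n_rows` rows (excludes the shared basis itself).
fn payload_bytes(n_rows: usize) -> usize {
    n_rows * std::mem::size_of::<RowSlots>()
}
```

For the 151,936-row vocab embedding this gives 1,823,232 bytes, in line with the 1.8 MB figure quoted in the bullet.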

Test plan

  • [x] cargo build --release --example tts_rvq_e2e: clean
  • [x] cargo build --release --example tts_full_inference: clean
  • [x] Passthrough run: 478/478 cos = 1.0000, codec token match 225/225 = 100%
  • [x] Full inference run: 22-token prompt → 128 frames → 1.37s 24 kHz WAV; RMS / ZC / energy stats confirm real audio
  • [ ] Human listen-check of the WAV (not done on this remote VM)
  • [ ] Compare Rust codec tokens to HF Python reference (next session)
  • [ ] Apply proposed BGZ-HHTL-D schema extension to close the storage-ratio gap (next session)

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj

claude added 3 commits April 14, 2026 18:13

The hierarchical CLAM 256x256 dispatch added in 5047618 collapsed
worse than progressive residual RVQ on vocab embeddings:

  model.text_embedding.weight [151936x2048]
    HCLAM 256x256:   cos = 0.0046  (58.9s)
    RVQ before:      cos = 0.0544  (891.1s)
  Codec token match: 11/225 = 4.9% (was 80.4% with RVQ)

Root cause: vocab embedding rows are near-orthogonal in 2048-d space.
Single-centroid tree quantization can only pick one EXISTING row as
the reconstruction - that row is uncorrelated with the target.
Progressive residual RVQ could at least synthesize directions by
summing codebook vectors; HCLAM cannot. My 'cos ~= 1 at 2.32 rows per
leaf' claim in RVQ_K_LADDER_TUNING.md Section 3 assumed tight
micro-clusters that don't exist for vocab embeddings.
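The root cause can be illustrated with synthetic data. The sketch below uses pseudo-random 2048-d vectors as a stand-in for near-orthogonal embedding rows (an assumption; these are not real Qwen3 weights) and shows that replacing one row with another existing row yields a cosine close to zero. The SplitMix64 generator is included only to keep the example free of external crates.

```rust
/// SplitMix64 step: small deterministic PRNG so the sketch needs no crates.
fn next_u64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Pseudo-random vector with entries uniform in [-1, 1).
fn random_vec(dim: usize, state: &mut u64) -> Vec<f64> {
    (0..dim)
        .map(|_| (next_u64(state) >> 11) as f64 / (1u64 << 52) as f64 - 1.0)
        .collect()
}

/// Cosine similarity between two equal-length vectors.
fn cosine(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    dot / (na * nb)
}
```

For independent vectors in 2048 dimensions the cosine has a standard deviation of roughly 1/sqrt(2048) ≈ 0.022, the same order of magnitude as the collapsed values reported above.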

Remediation: skip compression entirely for n_rows > 8192, ship as
BF16. Pipeline achieves:
  - Cos = 1.000 on all tensors (no compression noise from the vocab tensor)
  - Codec token match ~ 100% (will re-run to confirm)
  - Storage: ~620 MB BF16 passthrough + compressed codebooks for the
    other 477 tensors. Total still a net gain vs 3.66 GB original.
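The remediation amounts to a shape-based dispatch. A minimal sketch follows, assuming a two-way choice between passthrough and RVQ; the `Codec` enum and helper names are hypothetical, and the BF16 conversion shown (round-to-nearest-even on the upper 16 bits, NaN handling omitted) is the standard technique rather than necessarily what the pipeline does.

```rust
/// Row-count threshold above which a tensor is shipped uncompressed
/// (value taken from the commit message).
const PASSTHROUGH_MIN_ROWS: usize = 8192;

#[derive(Debug, PartialEq)]
enum Codec {
    PassthroughBf16,
    Rvq,
}

/// Pick the codec from the tensor's row count alone.
fn choose_codec(n_rows: usize) -> Codec {
    if n_rows > PASSTHROUGH_MIN_ROWS {
        Codec::PassthroughBf16
    } else {
        Codec::Rvq
    }
}

/// Round an f32 to BF16 (round-to-nearest-even) and return the raw 16 bits.
/// NaN handling is omitted for brevity.
fn f32_to_bf16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    let rounding = 0x7FFF + ((bits >> 16) & 1);
    (bits.wrapping_add(rounding) >> 16) as u16
}

/// Widen raw BF16 bits back to f32 by zero-filling the low mantissa bits.
fn bf16_bits_to_f32(b: u16) -> f32 {
    f32::from_bits((b as u32) << 16)
}
```

Values that are exactly representable in BF16 survive the round trip unchanged, which is why passthrough contributes no compression noise.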

Proper long-term fix is bgz-tensor::HhtlDTensor + SharedPaletteGroup
+ FisherZTable for lookup-grade 343:1, but requires switching
inference from f32 GEMM to HHTL cascade - separate session.

Follow-up doc PR will mark the RVQ_K_LADDER_TUNING.md Section 3
claim as REFUTED and point readers at this commit + the bgz-tensor
machinery.

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj

Planning doc in docs/LANCE_UPGRADE_ROADMAP.md. Covers:

  - Current pins (Lance 2, DataFusion 51) with file:line
  - Why upgrade: 9 features in 4.0 / 5.0-rc.1 that overlap our
    compression stack (IVF_RQ, IVF multi-split PR #6423, HNSW fp16
    partition assignment, CacheBackend, distributed segment builds,
    BF16 PyTorch ingest, pre-transposed PQ SIMD, file format 2.3,
    hamming HNSW)
  - Blockers: DataFusion 51 -> 52.1 bump, file format default shift,
    namespace API cleanup
  - 5-phase plan (no-op baseline -> algorithm probe -> peripheral
    crates -> DF bump -> adopt features -> 5.0 stable)
  - Feature vs migration cost table with portability column
  - Recommended path: vendor algorithms + isolated probe crates,
    defer full migration until 5.0 stable or phase 4 demands it
  - 5 open questions for next session

Cross-references PRs #176, #177 and the three RVQ docs landed in #177.

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj

Fresh-clone runs panicked at line 269 with ENOENT because the
/home/user/models/qwen3-tts-0.6b/codebooks/ parent dir doesn't exist.
Now mkdir -p before the write, so the example is reproducible from
a clean model checkout.
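In Rust the `mkdir -p` step is typically `std::fs::create_dir_all` on the parent of the output path. This is a sketch of the pattern, not the actual change in the example; `write_with_parents` is a hypothetical helper name.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Write `bytes` to `path`, creating any missing parent directories first
/// (the equivalent of `mkdir -p` on the parent).
fn write_with_parents(path: &Path, bytes: &[u8]) -> io::Result<()> {
    if let Some(parent) = path.parent() {
        // create_dir_all succeeds if the directory already exists.
        fs::create_dir_all(parent)?;
    }
    fs::write(path, bytes)
}
```

Returning `io::Result` instead of unwrapping lets the caller report a clear error rather than panic when the target directory cannot be created.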

Surfaced while running end-to-end validity test for the RVQ pipeline:
text -> 28 talker layers -> codec_head -> 5 code_predictor layers ->
128-step autoregressive -> RVQ dequant -> conv decoder -> 24kHz WAV.

Full trace (Qwen3-TTS-0.6B, prompt 'Hello world, this is a test...'):
  [2] Embed:          RMS=0.0129 in 1.38s
  [3] Talker L0-L27:  RMS 0.66 -> 1.01
  [3.5] codec_head + re-embed: RMS=0.0263
  [4] CP L0-L4:       RMS 0.23 -> 1.37
  [5] 128 autoregressive steps in 50.12s
      Step 0:   tokens=[68, 151, 102, 40]  RMS=4.89
      Step 32:  tokens=[254, 183, 154, 199]
      Step 64:  tokens=[92, 48, 158, 42]
      Step 96:  tokens=[31, 81, 196, 221]
      Step 127: tokens=[142, 251, 253, 178] RMS=1.51
  [6] Decoder latent RMS=2.29
  [7] WAV: 65580 bytes, 1.37s at 24kHz

WAV statistics confirm real audio (not silence, constant, or noise):
  RMS=0.0865 (-21 dB, normal speech range)
  peak=1.0000 (hits limiter)
  zero-crossings=11968/s (consistent with fricatives / unvoiced)
  energy varies across time deciles (0.059 to 0.131)

This WAV is the end-to-end wiring proof that was missing from the
earlier 225/225 codec token match — that gate only proved raw ==
RVQ path, not that raw itself was correct. The varied codec tokens
(no single constant) plus the dynamic audio output confirm all 33
transformer layers + codec head + code predictor + decoder are
doing real work, not emitting silence.

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6e8ee987b8

weights.len() + 1, short, n_rows, n_cols, c, el);
// Passthrough: keep BF16-precision f32 rows as-is, no codebook.
// Cos = 1 trivially. Ship cost is BF16 (2 bytes per element).
codebook_bytes += n_rows * n_cols * 2; // BF16 shipping footprint

P2: Use source dtype for passthrough byte accounting

The passthrough branch charges codebook_bytes as n_rows * n_cols * 2 for every tensor with n_rows > 8192, but this dispatch is shape-based and can include F32 tensors. In that case the script underreports storage by 2x, so the printed compressed-size and ratio metrics become incorrect and can mislead experiment conclusions about whether this path actually saves space.
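One way to address this is to derive the byte count from the tensor's source dtype instead of hard-coding 2. The following is a sketch under assumptions: the `SourceDtype` enum and `passthrough_bytes` function are illustrative, and the script's real dtype representation may differ.

```rust
/// Source element types relevant to passthrough accounting.
/// This enum is illustrative; the script's real dtype type may differ.
#[derive(Clone, Copy, Debug, PartialEq)]
enum SourceDtype {
    Bf16,
    F16,
    F32,
}

impl SourceDtype {
    /// Storage width of one element in bytes.
    fn bytes_per_elem(self) -> usize {
        match self {
            SourceDtype::Bf16 | SourceDtype::F16 => 2,
            SourceDtype::F32 => 4,
        }
    }
}

/// Shipping footprint of a tensor passed through in its source dtype.
fn passthrough_bytes(n_rows: usize, n_cols: usize, dtype: SourceDtype) -> usize {
    n_rows * n_cols * dtype.bytes_per_elem()
}
```

If the pipeline instead down-converts F32 sources to BF16 before shipping, the original `* 2` is correct for size but the "Cos = 1 trivially" comment would no longer hold for those tensors.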

@AdaWorldAPI AdaWorldAPI merged commit 1f2a9a7 into main Apr 14, 2026
AdaWorldAPI pushed a commit that referenced this pull request Apr 14, 2026

After iterating RVQ -> HCLAM -> passthrough on Qwen3-TTS-0.6B across
PRs #176, #177, #178, step back and name the mindset expansions worth
more than the next local fix.

Content summary (doc is 185 lines):

1. What this session established vs did NOT establish
   - 225/225 codec token match proven (self-consistency, not product)
   - End-to-end WAV output validates wiring (varied tokens, realistic
     amplitude envelope)
   - Storage ratio is 1:1.39 net LOSS, not the shipping story we need

2. The BPE + argmax insight that reframes everything
   - Argmax-decoded regime (attention/MLP/logits) needs only top-1
     stability -> ρ ≈ 0.95 is plenty
   - Index-lookup regime (vocab_embed, lm_head, code_embed) needs
     per-row identity -> no argmax downstream to rescue errors
   - The two regimes want OPPOSITE codecs; current pipeline used one
     codec for both and was surprised when it failed on the index
     regime

3. Four mindset shifts, ranked by blast radius:
   (1) Compression as indexing (HEEL/HIP/TWIG semantic addresses),
       not as squeezing (anonymous codebook indices)
   (2) Inference in codec space (HHTL cascade Skip/Attend/Compose),
       not f32 GEMM on reconstructed weights
   (3) Model-generic encoder (classify_role dispatch per tensor),
       not Qwen3-TTS-specific pipeline
   (4) Integrate what exists (HhtlDTensor + matryoshka + SharedPalette
       + FisherZTable are already there), stop building codecs

4. Concrete proposal: universal_hhtld_encode.rs combining shifts 3+4
   - Input: any BPE-vocab safetensors model
   - Dispatch: HhtlDTensor Slot D only (argmax regime, 4 B/row)
     vs Slot D + Slot L Matryoshka SVD band 0 (index regime, 12 B/row)
     vs passthrough BF16 (norms/biases)
   - Validation: argmax-parity (225/225 or near), not cos
   - Estimate: ~29 MB for Qwen3-TTS-0.6B (~126:1) or 3.86 GB -> 11.2 MB
     for Qwen3-TTS-1.7B (343:1, matches BGZ_HHTL_D.md)

5. Alternative mindset expansion (shift 2 alone): migrate inference
   from f32 GEMM to distance-table lookups. Multi-session architecture
   pivot. Benefit: order-of-magnitude speedup on top of compression
   ratio. Cost: bigger scope, but closer to codebase architectural
   contract (ndarray = hardware / lance-graph = spine / ladybug-rs
   = brain).

6. Five open questions deferring concrete design decisions to the
   next session.
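The argmax-parity validation named in item 4 can be expressed as the fraction of decoding steps where the reference and candidate logits agree on the top-1 token. This is a minimal sketch, assuming logits are available as one `Vec<f32>` per step; the function names are hypothetical.

```rust
/// Index of the largest value; `None` for an empty slice.
fn argmax(xs: &[f32]) -> Option<usize> {
    xs.iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}

/// Fraction of steps where both runs pick the same top-1 token.
/// Steps beyond the shorter of the two inputs are ignored.
fn argmax_parity(reference: &[Vec<f32>], candidate: &[Vec<f32>]) -> f64 {
    let n = reference.len().min(candidate.len());
    if n == 0 {
        return 0.0;
    }
    let hits = (0..n)
        .filter(|&i| argmax(&reference[i]) == argmax(&candidate[i]))
        .count();
    hits as f64 / n as f64
}
```

This metric tolerates perturbed logits as long as the winner is unchanged, which is the property the argmax regime in item 2 relies on. It is not meaningful for index-lookup tensors, where there is no downstream argmax.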

Cross-references all prior session PRs and the relevant repo docs
(BGZ_HHTL_D.md, fisher-z-wiring/, RVQ guides, Lance roadmap,
CLAUDE.md architecture notes).

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj
AdaWorldAPI pushed a commit that referenced this pull request Apr 15, 2026

Session-end artefact for future déjà-vu. Catalogues every compression
approach tried in PRs #176-#185 and the lesson each one produced. No
approach is thrown away — each failed experiment carries information
about where the real boundary is.

## Structure

### Core invariants (6)
  I1. Two regimes, opposite needs (argmax vs index)
  I2. Near-orthogonality of weight rows in high dim
  I3. Direction vs amplitude cannot be merged into one scalar
  I4. Wire-format type widths are hard caps — assert at encode time
  I5. 'u8 can span u16/u64 effective' requires the right decoder
  I6. The ticket-for-curve model (SpiralAddress + shared curve)

### Approaches tried (7)
  A1. HhtlDTensor — Base17 + Slot D + Slot V (correct for cascade, wrong for f32 GEMM)
  A2. Progressive residual RVQ with k-ladder (works argmax, fails index)
  A3. Hierarchical CLAM 256x256 (REFUTED — cos 0.0046 on vocab)
  A4. Passthrough BF16 n_rows > 8192 (SHIPS for correctness, net loss for ratio)
  A5. SlotL 8 x i8 on SVD basis (correct algorithm, misapplied to Base17 centroid)
  A6. HhtlF32Tensor f32 palette + SlotL (right direction, 10x better, still short)
  A7. cascade_attention_probe Base17 palette (3.71% argmax agreement — palette doesn't preserve inner products)

### Abstractions that ARE the right primitive (3)
  R1. highheelbgz::rehydrate::SpiralEncoding (exists, untested on real Qwen3)
  R2. Per-role stride in NeuronPrint (q/k=3, v=5, gate=8, up=2, down=4)
  R3. HHTL cascade inference (hhtl_cache RouteAction)

### Open probes (4)
  P1. SpiralEncoding on real Qwen3 weights — claim rho >= 0.95 unproven
  P2. Shared anchors + i8 position per row — depends on P1
  P3. Palette preserves inner-product neighbourhoods — A7 refuted for Base17
  P4. Log-radial CLAM with magnitude split — hypothesised > linear CLAM

### Déjà-vu table

Lists 7 'if you're tempted to...' instincts with the PR that already
refuted them. Exists so future sessions hit the lesson before writing
the code.

### Structural checklist (5 questions)

Before shipping any new codec:
  1. What regime does this tensor belong to? (I1)
  2. Does the codec encode direction AND amplitude separately? (I3)
  3. Is the palette substrate inner-product-preserving? (I2, A7)
  4. Does the decoder evaluate the curve, or tile anchors? (I5)
  5. Are wire-format widths asserted at encode time? (I4)
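Question 5 (invariant I4) can be enforced with checked integer conversions at encode time. The sketch below assumes a `u8` wire field and is not taken from the repo; `WidthOverflow` and `encode_u8_indices` are hypothetical names.

```rust
use std::convert::TryFrom;

/// Error returned when an index does not fit the wire-format field.
#[derive(Debug, PartialEq)]
struct WidthOverflow {
    index: usize,
    max: usize,
}

/// Encode palette indices into a u8 wire field, refusing values above 255
/// instead of silently truncating them with `as u8`.
fn encode_u8_indices(indices: &[usize]) -> Result<Vec<u8>, WidthOverflow> {
    indices
        .iter()
        .map(|&i| {
            u8::try_from(i).map_err(|_| WidthOverflow {
                index: i,
                max: u8::MAX as usize,
            })
        })
        .collect()
}
```

Collecting into `Result<Vec<u8>, _>` stops at the first out-of-range index, so an oversized palette fails loudly during encoding rather than producing a corrupt file.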

## Why this doc matters

Every failed approach in this session taught something the next session
would otherwise re-learn the hard way. HCLAM (#177->#178) already has
its lesson buried in a passthrough commit. The Base17 reconstruction
failure (#183) is buried in a PR comment. The #184 Path A/B duality
(they aren't independent) is only visible if you read the probe results.

This doc surfaces all of it as a single index, structured for mutation:
each approach has 'mutation hooks' naming how it could evolve into
something that works, rather than being discarded.

## Next step blocked by token budget

The SpiralEncoding-on-real-Qwen3 probe (P1) is the obvious next
experiment and would have landed in this PR. Deferred to a fresh
session with budget. The doc leaves the probe fully specified so
re-entering cold loses no context.

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj
AdaWorldAPI pushed a commit that referenced this pull request Apr 17, 2026

Codifies 7 anti-patterns (AP1-AP7) learned from PRs #176-#188 into
an agent card that fires flags when the session repeats them:

  AP1: "225/225 feels like success" without gate 2 (#178)
  AP2: Projecting quality from docs instead of measuring (#177)
  AP3: Building new codec before benching existing ones (#184)
  AP4: Centroid-residual framing on near-orthogonal data (#177/#183)
  AP5: Python in the inference hot path
  AP6: Chained score multiplication without chain-collapse check (P5)
  AP7: Modifying ndarray without explicit permission (#176)

Invoked by adk-coordinator when pattern repetition is suspected, or
by human directly. Output: list of fired flags, max 7 lines.

Also audited all 29 agent cards across both repos:
  - All pin model: opus or model: sonnet (no hardcoded versions)
  - opus → Opus 4.7 automatically, sonnet → Sonnet 4.6
  - 3 ndarray agents on sonnet (l3-strategist, migration-tracker,
    product-engineer) — intentional for speed-over-depth roles
  - adk-coordinator missing Bash tool (by design — delegates)
  - sentinel-qa missing Edit/Write (by design — audit-only)

No agent changes needed for Opus 4.7 compatibility — model: opus
resolves correctly.

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj