D1+D2+D5: CAM-PQ calibration pipeline - honest negative result #220
Enforces invariant I1: index-regime tensors (embed_tokens, lm_head, token_embd, wte/wpe) MUST stay Passthrough, because an identity lookup can't survive any codec.

Routing:
  - Index regime (embed_tokens, lm_head, token_embd, wte/wpe) → Passthrough
  - Argmax regime (attention Q/K/V/O, MLP gate/up/down) → CamPq
  - Norms / conv / small → Skip

Order of rules matters: the index-regime match comes BEFORE the ambiguous-large-2D fallback so lm_head (2D, 151936×hidden) isn't misrouted. Covered by the lm_head_not_misrouted_as_campq test.

8 tests covering Qwen/Llama/GPT-2/GGUF naming conventions. 133/133 contract tests pass. Zero deps preserved.

First deliverable (D1) of the CAM-PQ production wiring plan merged in PR #219.
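
The rule ordering described above can be sketched as follows. This is a minimal illustration, not the code in cam.rs: the name-fragment lists and the size threshold for the large-2D fallback are assumptions made for the example, and only the three route names and the "index regime first" ordering come from the commit message.

```rust
/// Routing decision for one tensor (variant names follow the commit message).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Route {
    Passthrough,
    CamPq,
    Skip,
}

// Hypothetical name fragments; the real lists live in the contract crate.
const INDEX_REGIME: &[&str] = &["embed_tokens", "lm_head", "token_embd", "wte", "wpe"];
const ARGMAX_REGIME: &[&str] = &[
    "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",
];

/// Classify a tensor by name and shape. The order of the checks is the point.
pub fn route_tensor(name: &str, dims: &[u64]) -> Route {
    // Rule 1: index-regime tensors are lookups and must stay lossless (I1).
    if INDEX_REGIME.iter().any(|frag| name.contains(frag)) {
        return Route::Passthrough;
    }
    // Rule 2: named attention / MLP projections go to the codec.
    if ARGMAX_REGIME.iter().any(|frag| name.contains(frag)) {
        return Route::CamPq;
    }
    // Rule 3: ambiguous large 2D fallback (the threshold is an assumed value).
    if dims.len() == 2 && dims[0] * dims[1] >= (1 << 20) {
        return Route::CamPq;
    }
    // Rule 4: norms, conv kernels, and small tensors.
    Route::Skip
}
```

If rule 3 ran before rule 1, a 151936×hidden lm_head would match the large-2D fallback and be routed to CamPq, which is the misrouting the ordering prevents.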
…sult
D2 - cam_pq_calibrate binary: reads safetensors, classifies tensors via route_tensor (D1), trains a CamCodebook per argmax-regime tensor, encodes all rows to 6-byte fingerprints, measures ICC_3_1 and relative L2 error, writes codebooks + fingerprints + manifest.json.

D5 - full-size validation on Qwen3-TTS-0.6B: FAILS.
  - 234 argmax-regime tensors measured.
  - Mean ICC = 0.195; zero tensors meet the ≥0.99 gate.
  - Relative L2 error 0.70–0.90.

Root cause: PR #218 bench measured ICC 0.9998 on 128 rows, trained and measured on those same 128 rows. That is a trivially correct fit (128 ≤ 256 centroids → every row gets its own centroid). At production tensor sizes (1024–3072 rows), the 6×256 codebook is centroid-starved.

cam_pq_row_count_probe.rs demonstrates the collapse:
  n=128  → icc_train=1.000, icc_all=-0.304
  n=3072 → icc_train=-0.079

Also broadens route_tensor embedding match to catch codec_embedding, adding 2 new test cases (10 total, 133/133 contract tests pass).

Infrastructure (CLI, serialization, measurement) is sound. The negative result is in the codec's capacity vs tensor row counts, not the tooling. Plan needs revision before D6/D7 effort.
https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
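
The "every row gets its own centroid" effect can be reproduced in a few lines. The sketch below is a toy, not the CamCodebook trainer: it uses a single subspace, synthetic rows from a hand-rolled generator, and a codebook that simply copies the first k training rows instead of running k-means. All function names are hypothetical.

```rust
/// Deterministic pseudo-random rows in [-1, 1) from a 64-bit LCG (no crates needed).
fn make_rows(n: usize, dim: usize, seed: u64) -> Vec<Vec<f32>> {
    let mut state = seed;
    let mut rows = Vec::with_capacity(n);
    for _ in 0..n {
        let mut row = Vec::with_capacity(dim);
        for _ in 0..dim {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            // Use the top 24 bits, scaled to [0, 2), then shifted to [-1, 1).
            row.push((state >> 40) as f32 / (1u64 << 23) as f32 - 1.0);
        }
        rows.push(row);
    }
    rows
}

fn sq_dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Crude codebook: the first `k` training rows (a stand-in for k-means).
fn train_codebook(rows: &[Vec<f32>], k: usize) -> Vec<Vec<f32>> {
    rows.iter().take(k).cloned().collect()
}

/// Relative L2 error after snapping every row to its nearest centroid.
fn relative_l2_error(rows: &[Vec<f32>], centroids: &[Vec<f32>]) -> f32 {
    let mut err = 0.0f32;
    let mut norm = 0.0f32;
    for row in rows {
        let best = centroids
            .iter()
            .map(|c| sq_dist(row, c))
            .fold(f32::INFINITY, f32::min);
        err += best;
        norm += row.iter().map(|x| x * x).sum::<f32>();
    }
    (err / norm).sqrt()
}
```

With 128 rows and 256 centroids the training error is exactly zero because every row is its own centroid; with 3072 rows the same 256-entry codebook leaves a large residual. The zero says nothing about how the codebook behaves on rows it was not trained on, which is the gap the D5 run exposed.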
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a78131fa76
    let (row_dim, n_rows) = match row_layout(&dims_u64) {
        Some(v) => v,
        None => {
            eprintln!(" [skip: not a 2D matrix]");
            continue;
Handle non-2D CamPq routes without dropping tensors
When a tensor is classified as CamPq but is not 2D (for example, packed expert weights in MoE checkpoints), this branch logs a skip and continues before writing any artifact or manifest row. Because route_tensor matches attention/MLP names without checking rank (crates/lance-graph-contract/src/cam.rs), these tensors can be silently omitted, producing incomplete calibration outputs that downstream tooling cannot fully reconstruct.
    let bytes: Vec<u8> = data.iter().flat_map(|f| f.to_le_bytes()).collect();
    w.write_all(&bytes)?;
Stream passthrough tensor writes instead of materializing bytes
This code converts the entire &[f32] tensor into a separate Vec<u8> before writing, which roughly doubles peak memory for each passthrough tensor. For large embedding/lm_head tensors this can push calibration runs into OOM or severe memory pressure even though the data could be written incrementally in chunks.
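
One way to address this suggestion is to encode the tensor through a small reusable buffer. The sketch below is an illustration under assumptions: the function name and the chunk size are made up for the example, and it is not the fix that landed in the PR.

```rust
use std::io::{self, Write};

/// Write an f32 tensor as little-endian bytes in fixed-size chunks, so the
/// extra memory is one small buffer instead of a full byte copy of the tensor.
pub fn write_f32_le_chunked<W: Write>(w: &mut W, data: &[f32]) -> io::Result<()> {
    // Assumed chunk size: 16 Ki elements, i.e. 64 KiB of output per write.
    const CHUNK_ELEMS: usize = 16 * 1024;
    let mut buf: Vec<u8> = Vec::with_capacity(CHUNK_ELEMS * 4);
    for chunk in data.chunks(CHUNK_ELEMS) {
        buf.clear();
        for f in chunk {
            buf.extend_from_slice(&f.to_le_bytes());
        }
        w.write_all(&buf)?;
    }
    Ok(())
}
```

The output bytes are identical to the flat_map version, so the on-disk format does not change; only peak memory does.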
The LAB-ONLY surface isn't just quarantine scaffolding; it's the codec-research iteration testbed. Its reason for existing is the cost of the alternative: every codec candidate re-measured through a cargo build cycle burns minutes per iteration. With the lab REST/gRPC + wire DTOs, a single binary serves dozens of candidates against the same safetensors in seconds per call.

PR #220 falsified PR #219's ICC-0.9998 claim via exactly this path: the calibration CLI + /v1/shader/calibrate endpoint surfaced mean ICC 0.195 and a 0/234 pass rate on full Qwen3-TTS-0.6B tensors before any production consumer linked the codec.

Two purposes now named explicitly in the doc:
  1. Iteration velocity (positive): lab surface = curl-friendly research loop, no rebuild per candidate.
  2. Canonical firewall (guard): consumers still walk UnifiedStep via OrchestrationBridge; they never see Wire* per-op DTOs.

Changes:
  - New subsection "Why the Lab Surface Exists (positive purpose — not just quarantine)" with the #219 → #220 worked-example table.
  - Decision Procedure item 3 reframed: research ops and curl-friendly debug shortcuts are a legitimate use of the lab surface, with a graduation rule (full-size validation → new StepDomain variant; lab endpoint stays for continued iteration, production moves to bridge).
https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
… I11 measurability
The prior "positive purpose" framing was too narrow (codec iteration velocity). The actual architecture the lab surface buys is three-part:
  REST/gRPC API: no rebuild per codec candidate
  Planner: real dispatch path under test (not a toy bench)
  JIT: swap kernels at runtime without relinking

Two loads share this stack; neither is secondary:
  1. Codec certification. Reconstruction ICC on real safetensors is necessary but not sufficient: the cert gate is token agreement vs Passthrough on full decode. PR #219's 0.9998 was synthetic / overfit-on-training; PR #220's 0.195 was real-weight but still reconstruction-only. The next load-bearing measurement is the token-level comparison, which is only tractable on this stack. At 8-17 min/rebuild × ~200 codec invariants to tune, iteration without the API is infeasible.
  2. Thinking harvest (the AGI magic bullet). The same API + Planner + JIT externalises the planner's 36-style / 13-verb / NARS trace. POST a Cypher query, get {rows, thinking_trace} back. The trace is log / replay / NARS-revise-able, which is the architectural shape of a system that learns its own meta-inference. This is the REST/Cypher injection path we can revive at near-zero cost now that PR #221 landed the REST/gRPC scaffolding.

I11 (new invariant): Measurable stack, not a black box. Every layer (L0 ndarray → L4 planner) emits a harvest-ready trace through the lab surface. Proposed changes that shrink trace for perf/simplicity are rejected: the trace contract is what makes the feedback loop mechanisable.

Also refined: Decision Procedure item 3 (codec research is a legitimate positive use, not a grudging exception); rule-of-thumb measurement order (reconstruction error → reconstruction ICC → token agreement) with token agreement as the cert gate.
https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
Operationalises PR #220's "What's Needed to Fix" list (wider codebook, residual PQ, Hadamard pre-rotation, OPQ) as a parameter sweep through the lab endpoint: every codec difference is a JIT kernel, not a cargo rebuild. Phase 0 hardens the Wire surface once; Phases 1-4 run unlimited candidates without further rebuilds; Phase 5 graduates winners to the canonical OrchestrationBridge surface.

Structure:
  Phase 0 - API hardening (one rebuild, then frozen):
    D0.1 CodecParams in WireCalibrate
    D0.2 WireTokenAgreement endpoint (I11 cert gate)
    D0.3 WireSweep streaming + Lance append
    D0.4 surface freeze
  Phase 1 - JIT codec kernels (rebuild-free):
    D1.1 CodecKernelCache via JitCompiler (Cranelift)
    D1.2 Rotation primitives (Identity / Hadamard / OPQ)
    D1.3 Residual PQ via JIT composition
  Phase 2 - Token-agreement harness (the I11 cert gate):
    D2.1 Reference-model loader (ndarray safetensors)
    D2.2 Decode-and-compare loop (top-k, per-layer MSE)
    D2.3 Handler wiring
  Phase 3 - Sweep driver + Lance logger
  Phase 4 - DataFusion frontier analysis
  Phase 5 - Graduation to OrchestrationBridge (per winner only)

~1,920 LOC total; 1 upfront rebuild; unlimited candidates afterwards. Compare to naive path (4 fixes × 8-17 min × N tweaks = hundreds of hours). All work behind --features lab until graduation.

INTEGRATION_PLANS.md prepended per APPEND-ONLY rule, citing PR #224 dependency for the architectural framing.
https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
…hanges
Corrections after hand-grep vs curated knowledge (encoding-ecosystem.md,
codec-findings-2026-04-20.md, rotation_vs_error_correction.md) and
user directives "its all there, dont touch, just be aware how to use
crate::simd", "wire accordingly into the lab infra", "via struct of
arrays":
- slice::array_windows::<N>() IS real — stdlib, stable Rust 1.77,
const-generic. I conflated it with a missing ndarray::array_window
(singular); corrected.
- AMX in ndarray is INTEL (Sapphire Rapids TDPBUSD/TDPBF16PS via
stable inline asm on Rust 1.94, per src/simd_amx.rs header),
NOT Apple. rust-lang #126622 keeps AMX intrinsics nightly;
inline asm at src/hpc/amx_matmul.rs is the stable consumer path.
Verified on kernel 6.18.5 with XCR0 bits 17+18 set.
- Real primitive names (no hallucinated matmul_tiled /
hadamard_butterfly): tile_dpbusd, tile_dpbf16ps, vnni_pack_bf16
for tier-1 AMX; vnni_matvec / matvec_dispatch for tier-2 VNNI;
F32x16 / U8x64 / Fingerprint<N> for tier-3 AVX-512 baseline.
- Polyfill hierarchy per user directive
(simd_amx > simd_avx512 > simd_avx2 fallback):
Tier 1: Intel AMX tile (256 MACs/instr)
Tier 2: AVX-512 VNNI (64 MACs/instr)
Tier 3: AVX-512 baseline F32x16 (16 MACs/instr, mandatory
default per ndarray's .cargo/config.toml
target-cpu=x86-64-v4)
Tier 4: AVX-2 F32x8 fallback
Tier 5: scalar reference
- Rule A wires SoA: the &[u8] slice array_windows iterates comes
from a BindSpace column (FingerprintColumns / QualiaColumn /
MetaColumn / EdgeColumn) per the AGI-as-SoA identity. No new
data structures — the SoA column IS the input surface.
- Dropped all "Phase 0 ndarray prerequisite" language. Everything
the sweep needs exists in ndarray today; this plan wires the
existing surface into cognitive-shader-driver (REST handlers +
CodecKernelCache + CodecResearchBridge). Zero ndarray changes.
- Added reality-check against codec-findings-2026-04-20.md so the
sweep does NOT re-derive measured winners: Had-Q5×D-R already
ICC ≈ 0.99 with shared codebook; I8-Hadamard leads for per-row-
only at ICC ≈ 0.9; zipper serves bundling axis, not argmax;
fractal leaf descriptors are DEAD (sign-flip invariant). The
sweep focuses on #220's four unmeasured candidates (wider
codebook / residual PQ / Hadamard pre-rotation / OPQ) and on
the missing axis — token agreement, not reconstruction ICC.
https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
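
The polyfill hierarchy listed in this commit message amounts to a "highest available tier wins" selection. The sketch below illustrates that idea only: the CpuCaps struct, the SimdTier enum, and select_tier are hypothetical names, the capability flags are passed in by the caller instead of being detected at runtime, and gating the AMX tier on a matmul-heavy workload is an assumption, not something this message states.

```rust
/// CPU capabilities relevant to the hierarchy (hypothetical struct; a real
/// build would fill it from runtime feature detection).
#[derive(Debug, Clone, Copy, Default)]
pub struct CpuCaps {
    pub amx_tile: bool,
    pub avx512_vnni: bool,
    pub avx512f: bool,
    pub avx2: bool,
}

/// Tiers in the order given by the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SimdTier {
    AmxTile,    // Tier 1: Intel AMX tile
    Avx512Vnni, // Tier 2: AVX-512 VNNI
    Avx512,     // Tier 3: AVX-512 baseline F32x16
    Avx2,       // Tier 4: AVX-2 F32x8 fallback
    Scalar,     // Tier 5: scalar reference
}

/// Pick the highest tier the CPU supports. AMX is only considered for
/// matmul-heavy work (assumption made for this sketch).
pub fn select_tier(caps: CpuCaps, matmul_heavy: bool) -> SimdTier {
    if matmul_heavy && caps.amx_tile {
        SimdTier::AmxTile
    } else if caps.avx512_vnni {
        SimdTier::Avx512Vnni
    } else if caps.avx512f {
        SimdTier::Avx512
    } else if caps.avx2 {
        SimdTier::Avx2
    } else {
        SimdTier::Scalar
    }
}
```

Keeping the decision in one pure function makes the fallback order testable on any machine, independent of which instruction sets the test host has.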
Nine concrete YAML configs for configs/codec/*.yaml that Phase 0 will consume:
  00_baseline_passthrough - regression anchor (top1=1.000 exactly)
  01_pr220_baseline - negative control, reproduces #220 ICC 0.195
  02_pr219_overfit_reproducer - negative control, split-test must FAIL
  10_fix_a_wider_codebook - #220 (a) 1024 centroids
  11_fix_b_residual_pq - #220 (b) residual depth=1
  12_fix_c_hadamard_rotation - #220 (c) Hadamard pre-rotation
  13_fix_d_opq_rotation - #220 (d) OPQ learned rotation
  20_composite_a_plus_b - composition probe for combinatorial lift
  30_cross_product_sweep - SweepGrid for D3.1 initial sweep

Each YAML:
  - Names lane_width explicitly (Rule E) so the JIT compiles the right SIMD tier. BF16x32 for OPQ (AMX bf16 tile path); others default to F32x16.
  - Carries a notes: block stating the expected measurement outcome, so Phase 0's regression detection has ground truth to check against (e.g., baseline reproducer must produce ICC ≈ 0.195, overfit reproducer must FAIL the split-test).
  - Separates calibration_rows from measurement_rows where relevant (pr219_overfit_reproducer sets them equal so the pipeline refuses to report the ICC, demonstrating the guard that prevents PR #219's overfit-on-training artefact from recurring).

30_cross_product_sweep specifies the initial 54-candidate grid (1 subspace × 3 centroids × 3 residuals × 3 rotations × 1 distance × 2 lane widths). Expected JIT compile budget: ~800 ms one-time; everything after is cache hits per Rule A/B.

Operating principle reiterated at the end: adding a candidate is authoring a YAML; changing params is editing YAML; Rust reads YAML once at ingress (Rule F) and never re-serialises. Sweep logger appends result rows to Lance, the only egress beyond the REST response.
https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
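
The 54-candidate figure is the product of the axis sizes, 1 × 3 × 3 × 3 × 1 × 2. A short enumeration makes that concrete. Only the axis sizes come from the commit message; the axis values below (for example 256 / 512 / 1024 centroids, or residual depths 0 to 2) are placeholders, and Candidate and cross_product are hypothetical names.

```rust
/// One point in the sweep grid (hypothetical struct for illustration).
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub subspaces: u32,
    pub centroids: u32,
    pub residual_depth: u32,
    pub rotation: &'static str,
    pub distance: &'static str,
    pub lane_width: &'static str,
}

/// Enumerate the full cross product of the six axes.
pub fn cross_product() -> Vec<Candidate> {
    // Placeholder axis values; only the number of values per axis is sourced.
    let subspaces = [6u32];
    let centroids = [256u32, 512, 1024];
    let residual_depths = [0u32, 1, 2];
    let rotations = ["identity", "hadamard", "opq"];
    let distances = ["adc_u8"];
    let lane_widths = ["F32x16", "BF16x32"];

    let mut out = Vec::new();
    for &s in &subspaces {
        for &c in &centroids {
            for &r in &residual_depths {
                for &rot in &rotations {
                    for &d in &distances {
                        for &lw in &lane_widths {
                            out.push(Candidate {
                                subspaces: s,
                                centroids: c,
                                residual_depth: r,
                                rotation: rot,
                                distance: d,
                                lane_width: lw,
                            });
                        }
                    }
                }
            }
        }
    }
    out
}
```

This counts raw grid points only. Whether every point would also pass parameter validation (for instance an OPQ rotation paired with a non-BF16 lane width) is not stated here, so the number of candidates that actually run may be lower.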
…dation
First Phase 0 code deliverable from codec-sweep-via-lab-infra-v1 plan.
Zero-dep contract-side types the lab API (cognitive-shader-driver)
will carry into JIT compilation.
Adds to crates/lance-graph-contract/src/cam.rs (~290 LOC):
Enums (Rule E — Wire surface IS the SIMD surface, object-oriented):
LaneWidth { F32x16, U8x64, F64x8, BF16x32 } — mirrors ndarray::simd::*
Distance { AdcU8, AdcI8 } — CODING_PRACTICES gap 5
(sign-handling /
bipolar cancellation)
Rotation { Identity, Hadamard{dim}, Opq{blob,dim} }
Structs:
ResidualSpec { depth, centroids }
CodecParams { subspaces, centroids, residual, lane_width,
pre_rotation, distance, calibration_rows,
measurement_rows, seed }
Builder (CODING_PRACTICES gap 3 — fluent API, not raw-struct):
CodecParamsBuilder::new()
.subspaces(u32).centroids(u32).residual(ResidualSpec)
.lane_width(LaneWidth).rotation(Rotation).distance(Distance)
.calibration_rows(u32).measurement_rows(u32).seed(u64)
.build() -> Result<CodecParams, CodecParamsError>
Validation fires BEFORE any JIT compile (D0.7 precision ladder):
- ZeroDimension — subspaces == 0 or centroids == 0
- OpqRequiresBf16 — OPQ routes through tile_dpbf16ps;
only LaneWidth::BF16x32 is valid
- HadamardDimNotPow2 — Sylvester construction needs dim = 2^k
- CalibrationEqualsMeasurement — overfit guard: refuses to emit
ICC when calibration_rows ==
measurement_rows (reproduces PR #219's
128-row trained-and-tested artifact)
Methods on CodecParams:
kernel_signature() -> u64 — JIT cache key (Rule E); excludes
seed so calibration-sample changes
don't invalidate cached kernels
is_matmul_heavy() -> bool — true for OPQ or centroids > 512;
drives Tier-1 AMX dispatch decision
(Rule C polyfill hierarchy)
Rotation::is_matmul() -> bool — Identity and Hadamard are false
(butterfly stays on Tier-3 F32x16);
only Opq returns true
14 new tests covering:
- builder default matches PR #220 baseline shape
- each validation variant fires correctly
- OPQ + BF16x32 accepted; OPQ + F32x16 rejected with typed error
- Hadamard + non-pow2 dim rejected
- overfit guard fires on calibration == measurement
- kernel_signature stable across identical builds
- kernel_signature excludes seed (cache stays hot)
- kernel_signature changes with centroids / rotation kind
- is_matmul_heavy detects OPQ AND wide codebook (≥512 centroids)
Zero-dep preserved (stdlib only: std::collections::hash_map::
DefaultHasher for kernel_signature, core::fmt + core::error for
error types). No serde in the contract — YAML/JSON deserialisation
belongs to the consumer crate, which will produce CodecParams via
serde at the REST handler (Rule F — serialisation at edge only).
Tests: 147/147 contract suite passing (133 prior + 14 new).
https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
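
The validation order and the seed-free cache key described in this commit message can be sketched as below. This is a reduced illustration, not the contract code: the struct carries only a subset of the listed fields, the Opq blob is omitted, validate() replaces the fluent builder, and the choice of which fields feed the hash (beyond "seed is excluded") is an assumption.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum LaneWidth { F32x16, U8x64, F64x8, BF16x32 }

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Rotation {
    Identity,
    Hadamard { dim: u32 },
    Opq { dim: u32 }, // the blob field from the commit message is omitted here
}

#[derive(Debug, PartialEq, Eq)]
pub enum CodecParamsError {
    ZeroDimension,
    OpqRequiresBf16,
    HadamardDimNotPow2,
    CalibrationEqualsMeasurement,
}

#[derive(Debug, Clone)]
pub struct CodecParams {
    pub subspaces: u32,
    pub centroids: u32,
    pub lane_width: LaneWidth,
    pub pre_rotation: Rotation,
    pub calibration_rows: u32,
    pub measurement_rows: u32,
    pub seed: u64,
}

impl CodecParams {
    /// Run every check before anything is compiled.
    pub fn validate(self) -> Result<Self, CodecParamsError> {
        if self.subspaces == 0 || self.centroids == 0 {
            return Err(CodecParamsError::ZeroDimension);
        }
        match self.pre_rotation {
            Rotation::Opq { .. } if self.lane_width != LaneWidth::BF16x32 => {
                return Err(CodecParamsError::OpqRequiresBf16);
            }
            Rotation::Hadamard { dim } if !dim.is_power_of_two() => {
                return Err(CodecParamsError::HadamardDimNotPow2);
            }
            _ => {}
        }
        // Overfit guard: equal row counts mean train == test.
        if self.calibration_rows == self.measurement_rows {
            return Err(CodecParamsError::CalibrationEqualsMeasurement);
        }
        Ok(self)
    }

    /// Cache key over kernel-shaping fields only; `seed` is left out on purpose.
    pub fn kernel_signature(&self) -> u64 {
        let mut h = DefaultHasher::new();
        (self.subspaces, self.centroids, self.lane_width, self.pre_rotation).hash(&mut h);
        h.finish()
    }
}
```

Because the seed never reaches the hasher, re-running a candidate with a different calibration sample reuses the same cached kernel, while a change in centroids or rotation kind produces a different key.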
Retroactive hygiene for the recent PR arc + prospective enforcement so the gap never recurs. User directive: "should have happened to begin with."

LATEST_STATE.md:
  - Header: "Last updated 2026-04-20 post PR #224 (PR #225 open)"
  - Recently Shipped table: prepended rows for #225 (open), #224, and #223 with full shipped-content summaries
  - Contract Inventory: expanded cam:: entry with all new codec-sweep types (LaneWidth / Distance / Rotation / ResidualSpec / CodecParams / CodecParamsBuilder / CodecParamsError) including the precision-ladder-fires-before-JIT invariant
  - Active Branches: recorded claude/teleport-session-setup-wMZfb and its three merged PRs
  - Active Integration Plans: added codec-sweep-via-lab-infra-v1 alongside elegant-herding-rocket-v1
  - Immediate Next Work: codec-sweep Phase 0 remainder (D0.1/0.2/0.3/0.5) + the elegant-herding Phase 2 block

PR_ARC_INVENTORY.md (APPEND-ONLY, PREPEND only):
  - #225 entry: plan + CodecParams/Builder/precision validation + rules A-F locked + decisions for future PRs
  - #224 entry: three-part lab stack + thinking harvest + I11 measurability locked
  - #223 entry: LAB-ONLY firewall + AGI-as-SoA + I1-I10 invariants locked (the cross-cutting architectural ruleset this workspace now enforces)

STATUS_BOARD.md:
  - New section: codec-sweep-via-lab-infra-v1 with 18 D-ids across 5 phases (D0.6/D0.7 marked Shipped-in-#225; remainder Queued)

EPIPHANIES.md (APPEND-ONLY, PREPEND only, 6 new dated entries):
  - Board hygiene is the driving seat, not cleanup (this session's self-reflection turned into a rule)
  - Codec cert is token agreement, not synthetic ICC (#219 → #220 arc; #225 CalibrationEqualsMeasurement typed rejection)
  - Lab REST surface is three-part (API + Planner + JIT), not just scaffolding
  - Thinking harvest via REST/Cypher = the AGI magic bullet
  - SoA never scalarises without ndarray (iron rule Rule C)
  - AGI is the glove, not the oracle: four-axis SoA is what you wear

CLAUDE.md, new top-level § "The Stance — Driving Seat + AGI-as-Glove (P0, read first)":
  - Explicit driving-seat posture: the session STEERS the stack, doesn't observe it
  - AGI-as-glove doctrine concrete: topic → FingerprintColumns, angle → QualiaColumn, thinking → MetaColumn, planner → EdgeColumn. New capability lands as a new column, not a layer.
  - MANDATORY Board-Hygiene Rule as a table: every PR that adds a type / plan / D-id / epiphany / tech-debt / issue MUST update the corresponding board file IN THE SAME COMMIT. Retroactive hygiene (merge PR → later cleanup) is now an anti-pattern the rule forbids.
  - "Consult, don't guess" (agent/knowledge-first discipline): specialist-agent card → knowledge doc → board inventory → only then grep. Subagent spawn with curated docs beats main-thread grep.

147/147 contract suite still passing. Doc-only PR otherwise (Cargo.toml / src/* unchanged; the orphan serde_yaml/base64 deps from the timed-out bus-compiler subagent were reverted and will land with D0.1/D0.3 when the Wire code lands).
https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
Two more Phase 0 deliverables from codec-sweep-via-lab-infra-v1.
66/66 cognitive-shader-driver tests pass under --features serve (+11 new).
D0.5 — auto_detect.rs (~300 LOC, CODING_PRACTICES gap 1):
Reads <model_path>/config.json (HuggingFace layout) and returns
ModelFingerprint { architecture, hidden_size, n_layers,
tokenizer_class, vocab_size, default_lane_width, default_distance }.
Architecture routing:
llama / qwen / qwen2 / qwen3 / mistral / mixtral → BF16x32 (AMX)
bert / modernbert / xlm-roberta / generic → F32x16 (AVX-512)
torch_dtype override wins over architecture heuristic.
Typed errors: ConfigMissing / Io / Parse / MissingField {path, field}.
Best-effort tokenizer_class from tokenizer_config.json.
8 tests: llama / qwen3-with-tokenizer / bert / modernbert / xlm-roberta
(d_model alias) / generic fallback / missing-config / missing-field.
D0.2 — WireTokenAgreement stub (~100 LOC, the I11 cert gate):
DTOs:
WireBaseline { Passthrough } — default, extensible
WireTokenAgreement { model_path, reference, candidate (WireCodecParams),
prompt_set_blob_id, n_tokens }
WireTokenAgreementResult { top1_rate, top5_rate,
divergence_positions, per_layer_mse,
candidate_latency_us, reference_latency_us,
stub, backend }
Phase 0 handler stub (not shipped yet): returns stub:true /
backend:"stub" deterministic result. Phase 2 D2.1-D2.3 land the
real decode-and-compare loop (reference model load + top-k
comparison + per-layer MSE).
Pass gates (for when the harness lands):
top1_rate ≥ 0.99 + top5_rate ≥ 0.999 vs Passthrough baseline.
This is the ACTUAL codec cert gate — reconstruction ICC is
necessary-but-not-sufficient (per #219/#220 lesson).
3 round-trip serde tests: full payload + stub-backend default +
baseline default.
Board hygiene (CLAUDE.md Mandatory rule):
STATUS_BOARD.md updated:
D0.1 Queued → Shipped (PR #227 — was stale)
D0.2 Queued → In PR (this branch)
D0.5 Queued → In PR (this branch)
Phase 0 state after this commit:
✅ D0.1 WireCalibrate + WireTensorView (PR #227)
✅ D0.6 CodecParamsBuilder (PR #225)
✅ D0.7 precision-ladder validation (PR #225)
✅ D0.5 auto_detect (this PR)
✅ D0.2 WireTokenAgreement stub (this PR)
⏳ D0.3 WireSweep streaming endpoint (next PR)
⏳ D0.4 surface freeze (gates after D0.3)
Rules honored:
Rule D — JSON/YAML/REST only, CodecParams carried through via WireCodecParams
Rule E — Wire surface IS the SIMD surface (lane_width on candidate)
Rule F — serde mirrors at ingress only; TryFrom → CodecParams at handler
https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
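
The D0.5 architecture routing in this commit message reduces to a two-step lookup: an explicit dtype, if present, decides first, and the architecture name decides otherwise. The sketch below shows that shape only. The function name is hypothetical, config.json parsing is left out, and the mapping from torch_dtype strings to lane widths ("bfloat16" → BF16x32, "float32" → F32x16) is an assumption, since the message only says the override wins.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LaneWidth {
    F32x16,
    BF16x32,
}

/// Pick a default lane width from already-parsed config.json fields.
pub fn default_lane_width(architecture: &str, torch_dtype: Option<&str>) -> LaneWidth {
    // Step 1: an explicit dtype wins over the heuristic (mapping is assumed).
    match torch_dtype {
        Some("bfloat16") => return LaneWidth::BF16x32,
        Some("float32") => return LaneWidth::F32x16,
        _ => {}
    }
    // Step 2: architecture heuristic from the commit message.
    match architecture.to_ascii_lowercase().as_str() {
        "llama" | "qwen" | "qwen2" | "qwen3" | "mistral" | "mixtral" => LaneWidth::BF16x32,
        // bert / modernbert / xlm-roberta and anything unrecognised.
        _ => LaneWidth::F32x16,
    }
}
```

Unknown dtype strings fall through to the architecture heuristic in this sketch; a stricter variant could return a typed error instead.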
First Phase 2 deliverable: scaffold of the I11 cert gate harness. The PR #219 → #220 lesson landed as a typed-rejection wall: the stub result carries stub:true + backend:"stub" so no client can confuse Phase 0 stub output for a real measurement.

crates/cognitive-shader-driver/src/token_agreement.rs (~320 LOC):
  ReferenceModel { path, path_hash, stub_token_count }
    ::load(&Path) -> Result<Self, TokenAgreementError>
      D2.1 stub: validates path exists, hashes display; does NOT parse safetensors yet. D2.2 replaces with real loader driven by auto_detect::detect() → ModelFingerprint.
    ::stub(tag, n_tokens): builds stub model without touching fs
  TokenAgreementError:
    ModelPathMissing { path }
    EmptyPromptSet
    TokenCountMismatch { reference, candidate }
    NotImplementedYet { what }   ← measure_full() until D2.2
  TopKAgreement { top1_matches, top5_matches, total_positions, divergence_positions: Vec<u32> }
    ::compare(ref: &[Vec<u32>], cand: &[Vec<u32>]) -> Result<Self>
      Position-by-position: top1 = r[0] == c[0]; top5 = r[0] in c[..5]. Records divergence positions for failure-mode analysis (late-sequence drift vs random errors).
    ::top1_rate() / top5_rate() -> f32
    ::meets_cert_gate() -> bool (top1 ≥ 0.99 AND top5 ≥ 0.999)
    ::aggregate(per_prompt): sums counters; concatenates divergence with per-prompt offset so failures stay localised
  TokenAgreementHarness:
    ::new(reference, baseline, candidate, n_tokens)
    ::measure_stub() -> WireTokenAgreementResult { stub:true, .. }
    ::measure_full() -> NotImplementedYet (D2.2 scope)

Tests (13 new):
  - reference_model_stub_builds_without_filesystem
  - reference_model_load_missing_path_yields_typed_error
  - topk_compare_identical_streams_is_perfect (full cert gate pass)
  - topk_compare_all_different_fails_cert_gate
  - topk_top5_matches_when_top1_misses_but_in_top5 (ref top-1 = 7; cand has 7 at position 3 in top-5 → top5 counts)
  - topk_mismatched_stream_lengths_yield_typed_error
  - topk_aggregate_sums_counters_and_offsets_divergence (prompt 2's divergence at pos 4 → aggregate pos 14 after prompt 1's 10)
  - cert_gate_passes_at_exact_thresholds (990/1000 = 0.99, 999/1000 = 0.999; both boundaries hit)
  - cert_gate_fails_when_top1_below_threshold_even_if_top5_passes
  - cert_gate_fails_when_top5_below_threshold_even_if_top1_passes
  - harness_measure_stub_returns_machine_checkable_stub_flag (stub:true enforced; backend="stub"; all rates 0.0; zero latencies)
  - harness_measure_full_returns_not_implemented_pointing_at_d22
  - harness_measure_stub_rejects_zero_n_tokens

Board hygiene (CLAUDE.md Mandatory rule):
  STATUS_BOARD.md D2.1 Queued → In PR

Phase state:
  Phase 0 ✅ complete (D0.1-D0.7 all shipped)
  Phase 1 scaffold ✅ (D1.1, D1.2, D1.3 shipped; D1.1b queued)
  Phase 2 ⏳ D2.1 (this PR), D2.2 + D2.3 queued

Rules honored:
  Rule D: Measurement set comes from Wire DTOs (D0.2 WireTokenAgreement)
  Rule E: TopKAgreement exposes object-methods (top1_rate, meets_cert_gate)
  Rule F: No serialization between stages; per-prompt Vec<Vec<u32>> token streams are plain Rust owned; the serde happens at D2.3 handler entry / exit only
https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
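
The position-by-position comparison and the cert gate described in this commit message can be sketched as follows. This is an illustration under assumptions, not token_agreement.rs: the error enum is reduced to one variant, a divergence is recorded whenever top-1 disagrees, and the gate is checked with integer arithmetic so that the exact boundaries (990/1000 and 999/1000) do not depend on float rounding.

```rust
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct TopKAgreement {
    pub top1_matches: u32,
    pub top5_matches: u32,
    pub total_positions: u32,
    pub divergence_positions: Vec<u32>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum TokenAgreementError {
    TokenCountMismatch { reference: usize, candidate: usize },
}

impl TopKAgreement {
    /// Each inner Vec holds the ranked top-k token ids for one position.
    pub fn compare(
        reference: &[Vec<u32>],
        candidate: &[Vec<u32>],
    ) -> Result<Self, TokenAgreementError> {
        if reference.len() != candidate.len() {
            return Err(TokenAgreementError::TokenCountMismatch {
                reference: reference.len(),
                candidate: candidate.len(),
            });
        }
        let mut out = Self::default();
        for (pos, (r, c)) in reference.iter().zip(candidate).enumerate() {
            out.total_positions += 1;
            let ref_top = r.first();
            // top1: the two best tokens are identical.
            let top1 = ref_top.is_some() && ref_top == c.first();
            // top5: the reference's best token appears in the candidate's top 5.
            let top5 = ref_top.map_or(false, |t| c.iter().take(5).any(|x| x == t));
            if top1 {
                out.top1_matches += 1;
            } else {
                out.divergence_positions.push(pos as u32);
            }
            if top5 {
                out.top5_matches += 1;
            }
        }
        Ok(out)
    }

    /// top1 ≥ 0.99 AND top5 ≥ 0.999, evaluated without floating point.
    pub fn meets_cert_gate(&self) -> bool {
        let total = self.total_positions as u64;
        total > 0
            && self.top1_matches as u64 * 100 >= total * 99
            && self.top5_matches as u64 * 1000 >= total * 999
    }
}
```

Keeping the divergence positions alongside the counters is what allows a later analysis to tell late-sequence drift apart from scattered random misses.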
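The comparison and cert-gate rules listed in the commit message above can be sketched in a few lines of Rust. This is a minimal illustration, not the code in token_agreement.rs. The field names follow the commit message, but the error enum is cut down to one variant and the `ratio` helper is hypothetical. Two choices are assumptions: the gate is checked with integer arithmetic, and a "divergence" is taken to mean a top-1 miss. The parameter is named `reference` because `ref` is a Rust keyword.

```rust
/// Simplified, hypothetical mirror of the TopKAgreement described above.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct TopKAgreement {
    pub top1_matches: u32,
    pub top5_matches: u32,
    pub total_positions: u32,
    pub divergence_positions: Vec<u32>,
}

/// Only the variant this sketch needs; the enum in the PR has more.
#[derive(Debug, PartialEq)]
pub enum TokenAgreementError {
    TokenCountMismatch { reference: usize, candidate: usize },
}

impl TopKAgreement {
    /// Each inner Vec holds one position's ranked token ids, best first.
    pub fn compare(
        reference: &[Vec<u32>],
        candidate: &[Vec<u32>],
    ) -> Result<Self, TokenAgreementError> {
        if reference.len() != candidate.len() {
            return Err(TokenAgreementError::TokenCountMismatch {
                reference: reference.len(),
                candidate: candidate.len(),
            });
        }
        let mut out = Self::default();
        for (pos, (r, c)) in reference.iter().zip(candidate).enumerate() {
            out.total_positions += 1;
            let ref_top1 = r.first();
            // top1: both streams rank the same token first.
            let top1 = ref_top1.is_some() && ref_top1 == c.first();
            // top5: the reference's best token is among the candidate's first five.
            let top5 = ref_top1.map_or(false, |t| c.iter().take(5).any(|x| x == t));
            if top1 {
                out.top1_matches += 1;
            } else {
                // Assumption: a divergence is a top-1 miss.
                out.divergence_positions.push(pos as u32);
            }
            if top5 {
                out.top5_matches += 1;
            }
        }
        Ok(out)
    }

    pub fn top1_rate(&self) -> f32 {
        ratio(self.top1_matches, self.total_positions)
    }

    pub fn top5_rate(&self) -> f32 {
        ratio(self.top5_matches, self.total_positions)
    }

    /// top1 >= 0.99 AND top5 >= 0.999, checked on integers so the exact
    /// boundaries (990/1000, 999/1000) do not depend on float rounding.
    pub fn meets_cert_gate(&self) -> bool {
        let n = self.total_positions as u64;
        n > 0
            && self.top1_matches as u64 * 100 >= n * 99
            && self.top5_matches as u64 * 1000 >= n * 999
    }

    /// Sums counters and shifts each prompt's divergence positions by the
    /// number of positions that came before it.
    pub fn aggregate(per_prompt: &[TopKAgreement]) -> Self {
        let mut out = Self::default();
        for p in per_prompt {
            let offset = out.total_positions;
            out.divergence_positions
                .extend(p.divergence_positions.iter().map(|d| d + offset));
            out.top1_matches += p.top1_matches;
            out.top5_matches += p.top5_matches;
            out.total_positions += p.total_positions;
        }
        out
    }
}

/// Hypothetical helper: hits / total, defined as 0.0 for an empty set.
fn ratio(hits: u32, total: u32) -> f32 {
    if total == 0 { 0.0 } else { hits as f32 / total as f32 }
}
```

With this shape, a second prompt's divergence at position 4 lands at aggregate position 14 when the first prompt contributed 10 positions, which matches the offset behaviour named in the test list.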
…ft guard)
Final Phase 3 scaffold deliverable — curl-driven lab iteration against
the shipped /v1/shader/sweep endpoint.
Files:
configs/codec/README.md — inventory + DoS-ceiling note +
anti-#219 stub:true flag explanation
configs/codec/00_pr220_baseline.yaml
- PR #220 baseline regression: 6 subspaces × 256 centroids ×
identity rotation. Expected ICC ≈ 0.195 mean when D2.2 lands
real decode-and-compare.
configs/codec/10_wider_codebook.yaml
- PR #220 fix (a): centroids ∈ {256, 512, 1024}. Cardinality 3,
three distinct kernel signatures → warm cache after one pass.
configs/codec/12_hadamard_pre_rotation.yaml
- PR #220 fix (c): Hadamard × centroids cross-product (2×2 = 4).
Hadamard stays Tier-3 F32x16 per Rule C.
scripts/codec_sweep.sh
- yq YAML → JSON conversion
- POST to ${SHADER_LAB_URL}/v1/shader/sweep (default localhost:3001)
- jq-pretty request + response
- Stub honesty check: prints results[0].stub flag
→ verifies Phase 0 returns true (machine-checkable anti-#219)
- Requires: yq (mikefarah/yq ≥ v4), curl, jq
wire.rs +1 test: sweep_request_yaml_shape_deserializes_via_serde_json
- Inline JSON fixture mirroring the canonical YAML → JSON shape
- If this test breaks, the YAML configs are stale relative to
the Rust DTOs → scripts/codec_sweep.sh would fail at runtime
- Caught a real drift during development: PascalCase "Identity"
vs the DTO's rename_all="lowercase" (YAMLs correctly use
lowercase; test fixture had the typo)
Phase state:
Phase 0 ✅ complete
Phase 1 scaffold ✅ (D1.1 / D1.2 / D1.3 shipped; D1.1b queued)
Phase 2 scaffold ✅ (D2.1 harness + D2.3 handler; D2.2 queued)
Phase 3 scaffold ✅ — D3.1 batch handler + D3.2 client driver shipped
⏳ D3.1b real Lance append writer queued
DoS-ceiling note: sweep handler rejects grids with cardinality
> 10_000 before enumeration (PR #238 P1 fix). README documents the
ceiling so config authors can budget axis lengths.
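As a rough illustration of the ceiling, the check below multiplies axis lengths and bails out as soon as the running product exceeds 10_000, without materialising any candidate. It is a sketch under assumptions: `MAX_GRID_CARDINALITY`, `GridError`, and `checked_cardinality` are hypothetical names, and the real handler may structure the check differently.

```rust
/// Ceiling from the note above; the constant name is hypothetical.
pub const MAX_GRID_CARDINALITY: u64 = 10_000;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum GridError {
    /// `at_least` is the running product at the point of rejection.
    TooLarge { at_least: u64, ceiling: u64 },
}

/// Returns the grid cardinality (product of axis lengths), or rejects the
/// grid before any candidate is enumerated.
pub fn checked_cardinality(axis_lens: &[usize]) -> Result<u64, GridError> {
    let mut total: u64 = 1;
    for &len in axis_lens {
        // saturating_mul clamps to u64::MAX on overflow, which still
        // exceeds the ceiling, so an absurd request cannot wrap to a small value.
        total = total.saturating_mul(len as u64);
        if total > MAX_GRID_CARDINALITY {
            return Err(GridError::TooLarge {
                at_least: total,
                ceiling: MAX_GRID_CARDINALITY,
            });
        }
    }
    Ok(total)
}
```

Under this sketch, the three-value centroid axis of 10_wider_codebook.yaml gives cardinality 3 and the 2×2 grid of 12_hadamard_pre_rotation.yaml gives 4, both far below the ceiling.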
Rule D honored: adding a new codec candidate = authoring a new
YAML file in configs/codec/. Zero Rust changes. Zero rebuilds.
Rule F honored at the client boundary: YAML → JSON → HTTP ingress.
Single deserialisation at the shader-lab's handler; everything
after is in-memory Rust (WireSweepRequest → CodecParams → grid
enumerate() → per-candidate WireSweepResult).
https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
Summary
- route_tensor classifier in lance-graph-contract::cam: routes tensors to CamPq / Passthrough / Skip per invariant I1. 10 tests, 133/133 contract suite passes.
- cam_pq_calibrate CLI (--features calibrate): reads safetensors, trains per-tensor CamCodebook, encodes fingerprints, serializes codebooks + fingerprints + manifest.json with SHA256 + ICC + reconstruction error.
- … (cam_pq_row_count_probe) demonstrates the root cause.
Negative Result
PR #218's bench measured ICC 0.9998 on 128 rows trained and measured on those same 128 rows. With 256 centroids per subspace, 128 rows trivially fit — every row gets its own centroid. This does not generalize.
Full-size validation (234 argmax-regime tensors, Qwen3-TTS-0.6B):
Diagnostic on one gate_proj [3072, 1024]:
Root cause: 6×256 PQ is centroid-starved for production tensors (1024–3072 rows). The "128× compression" claim was extrapolated from a trivial in-training fit.
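The arithmetic behind "centroid-starved" is a pigeonhole count, sketched below. The function names are hypothetical, and the sketch assumes each of the 6 subspaces has its own independent 256-entry codebook, as in standard product quantization. It says nothing about the ICC values themselves, only about how many rows must share a centroid.

```rust
/// True when every row can receive a private centroid in a subspace,
/// i.e. a fit measured on the training rows can be exact by memorisation.
pub fn can_memorize(n_rows: usize, centroids_per_subspace: usize) -> bool {
    n_rows <= centroids_per_subspace
}

/// Average number of rows that must share one centroid in a subspace.
pub fn mean_rows_per_centroid(n_rows: usize, centroids_per_subspace: usize) -> f64 {
    n_rows as f64 / centroids_per_subspace as f64
}
```

For the 128-row bench, `can_memorize(128, 256)` holds, so the near-perfect ICC measured on the training rows carries no information about generalization. At 1024 and 3072 rows the averages are 4 and 12 rows per centroid, so each centroid has to stand in for several distinct sub-vectors.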
What's Sound
Infrastructure works correctly: the CLI, route classifier, codebook serialization format, ICC/reconstruction measurement harness. The negative result is in the codec's capacity, not the tooling.
What's Needed to Fix
Test plan
- cargo test -p lance-graph-contract --lib route_tests: 10/10 pass
- cargo test -p lance-graph-contract: 133/133 pass
- cam_pq_calibrate builds and runs on Qwen3-TTS-0.6B safetensors
- cam_pq_row_count_probe reproduces the 128-row artifact
https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh