fix: remove tokenizer.json files from git (757K LOC → GitHub Release)

Removed from git tracking:
  data/jina-v3-hdr/tokenizer.json      (8.7 MB, XLM-RoBERTa 250K)
  data/bge-m3-hdr/tokenizer.json       (8.7 MB, XLM-RoBERTa 250K)
  data/jina-v5-tokenizer.json          (11.4 MB, Qwen3 151K, 757K lines!)
  data/xlm-roberta-de/tokenizer.json   (8.7 MB, German NER)

Files stay on disk (gitignored) for local development.
tokenizer_registry.rs already has a from_pretrained() fallback that
downloads from HuggingFace if the local file is missing.
Upload to GitHub Release for offline environments.

https://claude.ai/code/session_019RzHP8tpJu55ESTxhfUy1A
#120
Merged
…8 path)

pooling.rs: Nucleus sampling with temperature + top-p (3 tests)
  Analytical: T=0.3, p=0.3  (precise, narrow)
  Creative:   T=1.2, p=0.95 (exploratory, wide)
  Balanced:   T=0.7, p=0.9  (metacognitive)
  Focused:    ArgMax (deterministic, no randomness)
  Deterministic with seed for reproducibility.

builder.rs: ThinkingPreset enum + .thinking_preset() method (1 test)
  Maps cognitive styles → pooling strategies.
  Temperature + nucleus sampling from HANDOVER_MAVERICK_SESSION.md.

signed_engine.rs: from_f32_cosines() method
  The CORRECT i8 path: f32 cosines → round(cos × 127) → i8.
  Does NOT go through CDF u8.
  WARNING comment on from_unsigned(): relabels CDF ranks, not gate signs.
  Per KNOWLEDGE_SYNC_SIGNED_SESSION.md: u8-128 is wrong.

222 tests pass.

https://claude.ai/code/session_019RzHP8tpJu55ESTxhfUy1A
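The nucleus sampling described for pooling.rs can be sketched as follows. This is a minimal illustration, not the code in this PR: the function name `nucleus_sample`, the `SplitMix64` helper, and the choice to seed a fresh generator per call are all assumptions made so the example needs no external crates.

```rust
/// Tiny deterministic PRNG (SplitMix64); a stand-in so the sketch has no dependencies.
struct SplitMix64(u64);

impl SplitMix64 {
    /// Returns a uniform value in [0, 1).
    fn next_f32(&mut self) -> f32 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^= z >> 31;
        // Keep the top 24 bits so the value is exactly representable in f32.
        (z >> 40) as f32 / (1u32 << 24) as f32
    }
}

/// Hypothetical helper: sample one index from `energy` using temperature + top-p.
fn nucleus_sample(energy: &[f32], temperature: f32, top_p: f32, seed: u64) -> usize {
    assert!(!energy.is_empty() && temperature > 0.0);

    // Temperature-scaled softmax weights (max subtracted for numerical stability).
    let max = energy.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let weights: Vec<f32> = energy.iter().map(|e| ((e - max) / temperature).exp()).collect();
    let total: f32 = weights.iter().sum();

    // Visit indices from most to least probable.
    let mut order: Vec<usize> = (0..energy.len()).collect();
    order.sort_by(|&a, &b| weights[b].total_cmp(&weights[a]));

    // The nucleus is the smallest prefix whose probability mass reaches top_p.
    let mut nucleus = Vec::new();
    let mut mass = 0.0f32;
    for &i in &order {
        nucleus.push(i);
        mass += weights[i] / total;
        if mass >= top_p {
            break;
        }
    }

    // Draw inside the nucleus, implicitly renormalised by scaling with `mass`.
    let mut r = SplitMix64(seed).next_f32() * mass;
    for &i in &nucleus {
        r -= weights[i] / total;
        if r < 0.0 {
            return i;
        }
    }
    *nucleus.last().unwrap()
}
```

With the Analytical settings (T=0.3, p=0.3) and one clearly dominant entry, the nucleus shrinks to a single index and the call behaves like ArgMax; the same seed always yields the same index.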
Forked from stream_hdr_lens.rs. Outputs THREE files per model:
distance_table_NxN.u8 CDF percentile rank (existing HDR encoding)
distance_table_NxN.i8 round(cos × 127) DIRECT from f32 (NEW)
cosine_matrix_NxN.f32 raw f32 cosines for re-encoding experiments
The i8 table is the CORRECT signed path:
f32 cosine → round(cos × 127) → i8 (values in [-127, +127])
NO CDF. NO u8 intermediate. Signs preserved from source cosines.
Per KNOWLEDGE_SYNC_SIGNED_SESSION.md: from_unsigned() (u8 - 128)
relabels CDF ranks, does NOT recover gate signs. This script
produces real signed tables directly from GGUF streaming.
Also saves f32 cosine matrix for calibration (H1-H5 hypotheses
need the raw cosines to measure encoding loss per path).
Reports E/I ratio from the signed table (pos/neg/zero distribution).
Usage:
cargo run --release --manifest-path crates/thinking-engine/Cargo.toml \
--example stream_signed_lens -- \
jinaai/jina-reranker-v3-GGUF jina-reranker-v3-BF16.gguf
https://claude.ai/code/session_019RzHP8tpJu55ESTxhfUy1A
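The direct signed path and the pos/neg/zero report can be sketched as below. The names `cosine_to_i8`, `sign_counts`, and `ei_ratio` are hypothetical, and defining the E/I ratio as positive count divided by negative count is an assumption; the commit message only says the ratio is derived from the pos/neg/zero distribution.

```rust
/// Quantise one cosine to i8 via round(cos × 127). No CDF, no u8 intermediate.
/// Input is clamped to [-1, 1], so the result lies in [-127, 127].
fn cosine_to_i8(cos: f32) -> i8 {
    (cos.clamp(-1.0, 1.0) * 127.0).round() as i8
}

/// Count (positive, negative, zero) entries in a signed table.
fn sign_counts(table: &[i8]) -> (usize, usize, usize) {
    let pos = table.iter().filter(|&&v| v > 0).count();
    let neg = table.iter().filter(|&&v| v < 0).count();
    (pos, neg, table.len() - pos - neg)
}

/// Assumed definition: excitatory/inhibitory ratio = positives / negatives.
/// Returns None when the table has no negative entries.
fn ei_ratio(table: &[i8]) -> Option<f32> {
    let (pos, neg, _zero) = sign_counts(table);
    if neg == 0 {
        None
    } else {
        Some(pos as f32 / neg as f32)
    }
}
```

A table produced this way keeps the sign of every source cosine, so a missing E/I ratio (`None`) indicates that the source matrix itself had no negative cosines.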
stream_signed_lens.rs now outputs 5 encoding lanes from one GGUF stream:
  Lane 1: u8 CDF            (current HDR, pure rank, no metadata)
  Lane 2: i8 direct         (round(cos×127), signs preserved)
  Lane 3: u8 γ+φ            (golden ratio redistribution, auto-calibrated γ)
  Lane 4: i8 γ+φ signed     (gamma+phi on signed range)
  Lane 5: SiLU correction Δ (cos(silu(gate)×up) - cos(raw), f32 per pair)

PLUS: encoding_metadata.json with FULL provenance per lane:
  - cosine_range, cosine_mean from f32 source
  - role_gamma, phi_scale for γ+φ decode
  - E/I ratio for signed tables
  - SiLU note (zeros for token_embd, real deltas when streaming ffn_up+gate)

PLUS: cosine_matrix_NxN.f32 (raw f32 cosines for calibration re-encoding)

CRITICAL: This script is the ENCODING side. The CALIBRATION side needs
Jina v5 ONNX ground truth FIRST (candle forward pass → f32 embedding cosines).
Without ground truth, these 5 lanes are 5 ways to be wrong.
Order: ground truth → encode → Spearman ρ per lane → ICC → bake winner.

https://claude.ai/code/session_019RzHP8tpJu55ESTxhfUy1A
…wnloaded

tokenizer_registry.rs (7 tests):
  ModelId: JinaV3, BgeM3, Reranker, JinaV5, ReaderLm, Qwopus
  TokenizerRegistry: load(), load_all(), encode(), encode_all()
  Cross-model: same text → different tokenizers → different token_ids
  tokenize_corpus(): batch tokenize through all loaded models

Tokenizer files on disk:
  Jina v3:  XLM-RoBERTa 250K (data/jina-v3-hdr/tokenizer.json, 8.7 MB)
  BGE-M3:   XLM-RoBERTa 250K (data/bge-m3-hdr/tokenizer.json, 8.7 MB)
  Reranker: Qwen2 151K (data/Qwopus3.5-27B-v3-BF16-silu/tokenizer.json, 7 MB)
  Jina v5:  Qwen3 151K (data/jina-v5-tokenizer.json, 11.4 MB) ← NEW

All 4 tokenizers load and produce valid token_ids within vocab range.
Cross-model test confirms XLM-RoBERTa and Qwen2 produce different tokens.

This is the INPUT side of the calibration chain:
  text → tokenizer → token_ids → codebook_index → centroids
The OUTPUT side needs candle forward pass → f32 embedding = ground truth.

229 tests pass.

https://claude.ai/code/session_019RzHP8tpJu55ESTxhfUy1A
ground_truth.rs (5 tests):
  GroundTruthEmbedding: f32 embedding + cosine similarity
  GroundTruthPair: text_a + text_b + model + f32 cosine
  CalibrationCorpus: pairs + cosines() for Spearman comparison
  GroundTruthSource: Candle | Precomputed | Api | ExpertAssigned
  spearman_vs_ground_truth(): baked distances vs ground truth cosines
  spearman_rank_correlation(): standalone Spearman ρ implementation

NOTE: Jina v5 = Qwen3-0.6B base (NOT XLM-RoBERTa, that's v3/old).
Candle forward pass is a placeholder: it needs the model weights downloaded.
Precomputed path works now (load embeddings from file).

234 tests pass total.

https://claude.ai/code/session_019RzHP8tpJu55ESTxhfUy1A
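A standalone Spearman ρ can be computed as the Pearson correlation of average ranks. The sketch below shows that standard approach; `average_ranks` and `spearman_rho` are hypothetical names, and the actual `spearman_rank_correlation()` in ground_truth.rs may handle ties or degenerate inputs differently.

```rust
/// Rank values from 1..=n, giving tied values the average of their positions.
fn average_ranks(values: &[f32]) -> Vec<f64> {
    let mut order: Vec<usize> = (0..values.len()).collect();
    order.sort_by(|&a, &b| values[a].total_cmp(&values[b]));

    let mut ranks = vec![0.0f64; values.len()];
    let mut i = 0;
    while i < order.len() {
        // Extend j over the run of values tied with position i.
        let mut j = i;
        while j + 1 < order.len() && values[order[j + 1]] == values[order[i]] {
            j += 1;
        }
        let avg = (i + j) as f64 / 2.0 + 1.0;
        for k in i..=j {
            ranks[order[k]] = avg;
        }
        i = j + 1;
    }
    ranks
}

/// Spearman ρ = Pearson correlation of the two rank vectors.
/// Returns None for mismatched lengths, fewer than 2 points, or constant input.
fn spearman_rho(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.len() < 2 {
        return None;
    }
    let (ra, rb) = (average_ranks(a), average_ranks(b));
    let n = ra.len() as f64;
    let ma = ra.iter().sum::<f64>() / n;
    let mb = rb.iter().sum::<f64>() / n;

    let (mut cov, mut va, mut vb) = (0.0f64, 0.0f64, 0.0f64);
    for (x, y) in ra.iter().zip(&rb) {
        cov += (x - ma) * (y - mb);
        va += (x - ma).powi(2);
        vb += (y - mb).powi(2);
    }
    if va == 0.0 || vb == 0.0 {
        None
    } else {
        Some(cov / (va * vb).sqrt())
    }
}
```

Because only ranks matter, a baked table that preserves the ordering of the ground-truth cosines scores ρ = 1 even if its absolute values differ, which is why ρ is a natural per-lane comparison metric.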
FP32 ground truth: Kijai/WanVideo_comfy (2.53 GB, no BF16 truncation)
BF16 production:   DeepBeepMeep/Wan2.1 (2.39 GB, combined CLIP)
Medical:           DICOM → ViT patches → codebook → SPO → NARS
Cross-modal:       text ↔ image in same CLIP space
Order: text ground truth FIRST → vision → cross-modal

https://claude.ai/code/session_019RzHP8tpJu55ESTxhfUy1A
tokenizer_registry.rs:
  ModelId::ModernBert: OLMo tokenizer, 50K vocab, code-friendly
  ModelId::ClipVision: XLM-RoBERTa text side, ViT-Huge-14 vision
  has_gate_modulation(): GeGLU flag (ModernBERT, Qwen, Qwopus = true)
  onnx_repo(): ONNX availability (ModernBERT, Jina v5)
  3 new tests: vocab sizes, gate flags, ONNX repos

auto_detect.rs: Architecture::ClipVit added to detection
builder.rs: Lens::ModernBert + Lens::ClipVision (not baked yet, errors clearly)

ModernBERT is NOT a replacement for Jina or Reranker; it is a THIRD
ground truth source. Different architecture (GeGLU, 28 deep+thin layers),
different tokenizer (OLMo, code-friendly), different training data.
Cross-validating 3+ models via Cronbach α tests whether lenses agree.

ONNX: ort-community/ModernBERT-large-ONNX-ORT has FP32/FP16/INT8/UINT8,
which directly tests H1 (BF16 truncation) and H3 (i8 vs u8 rank preservation).

234 tests pass.

https://claude.ai/code/session_019RzHP8tpJu55ESTxhfUy1A
cronbach.rs (8 tests):
  cronbach_alpha(): standard psychometric α for K items × N subjects
  quorum_scores(): per-centroid-pair agreement score (u8) across baked tables
  QuorumLevel: High/Medium/Low/Ambiguous with needs_leaf_validation()
  cronbach_analysis(): full report with per-pair variance + disagreement count

REAL DATA from 3 baked lenses (Jina v3 × BGE-M3 × Reranker, 256×256):
  High quorum:  2,658 pairs  (4.1%): all 3 lenses agree
  Medium:      16,000 pairs (24.4%): mostly agree
  Low:         22,024 pairs (33.6%): significant disagreement
  Ambiguous:   24,854 pairs (37.9%): no consensus

71.5% of centroid pairs have Low or Ambiguous quorum.
→ Multi-lens superposition is NOT redundant: lenses see genuinely different structure.

H5 prediction confirmed: α < 0.70 for most pairs.
→ Each pair should carry its quorum_score as HHTL metadata.
→ 4.1% High pairs can skip LEAF validation (fast cascade safe).
→ 71.5% need either more lenses (ModernBERT, CLIP) or LEAF.

242 tests pass.

https://claude.ai/code/session_019RzHP8tpJu55ESTxhfUy1A
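For reference, the standard Cronbach α for K items over N subjects is α = K/(K−1) × (1 − Σ item variances / variance of per-subject totals). The sketch below implements that textbook formula with lenses as items and centroid pairs as subjects; the signature and the use of sample variance (n−1) are assumptions and may not match `cronbach_alpha()` in cronbach.rs.

```rust
/// Sample variance (divides by n - 1). Caller guarantees at least 2 values.
fn variance(xs: &[f64]) -> f64 {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0)
}

/// Cronbach's α for `items[k][s]` = score of item (lens) k on subject (pair) s.
/// Returns None for fewer than 2 items, fewer than 2 subjects, ragged input,
/// or zero variance in the per-subject totals.
fn cronbach_alpha(items: &[Vec<f64>]) -> Option<f64> {
    let k = items.len();
    if k < 2 {
        return None;
    }
    let n = items[0].len();
    if n < 2 || items.iter().any(|it| it.len() != n) {
        return None;
    }

    // Sum of the individual item variances.
    let item_var_sum: f64 = items.iter().map(|it| variance(it)).sum();

    // Variance of each subject's total score across all items.
    let totals: Vec<f64> = (0..n)
        .map(|s| items.iter().map(|it| it[s]).sum::<f64>())
        .collect();
    let total_var = variance(&totals);
    if total_var == 0.0 {
        return None;
    }

    let k = k as f64;
    Some(k / (k - 1.0) * (1.0 - item_var_sum / total_var))
}
```

When every lens reports identical scores, α is 1; as lenses disagree, the item variances account for more of the total variance and α drops, which is the effect the α < 0.70 threshold in H5 is probing.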
…beled i8

All 7 pairs produce cos=1.000 (identical energy distributions).
Rumi↔Rumi = Rumi↔TCP = CRISPR↔Bach = all the same.

Root cause: from_unsigned(u8-128→i8) relabels CDF ranks.
CDF encoding forces uniform distribution → symmetrical around 0.
No real negative cosines → no real inhibition → no discrimination.

Temperature/Nucleus sampling (Pooling::Nucleus) does NOT help:
it samples FROM the collapsed peak, doesn't break the collapse.

NEXT: run stream_signed_lens.rs on real GGUF (Reranker cos[-0.886,+0.826])
to get i8 tables from f32 cosines with REAL negative values.
Needs HuggingFace HTTPS access (blocked in sandbox).

The pipeline (tokenizer→engine→pooling) is correctly wired.
The encoding (u8 CDF→i8 relabel) is wrong. Fix the encoding first.

https://claude.ai/code/session_019RzHP8tpJu55ESTxhfUy1A
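The root cause can be reproduced on a toy row. The sketch below assumes a simple CDF encoding (rank scaled to 0..=255); the real HDR encoder may differ in detail, and the names `cdf_rank_u8`, `relabel_unsigned`, and `direct_i8` are hypothetical.

```rust
/// Assumed CDF encoding: replace each cosine by its percentile rank in 0..=255.
fn cdf_rank_u8(cosines: &[f32]) -> Vec<u8> {
    let n = cosines.len();
    let mut order: Vec<usize> = (0..n).collect();
    order.sort_by(|&a, &b| cosines[a].total_cmp(&cosines[b]));

    let mut out = vec![0u8; n];
    for (rank, &idx) in order.iter().enumerate() {
        out[idx] = if n > 1 { (rank * 255 / (n - 1)) as u8 } else { 0 };
    }
    out
}

/// The relabelling path: shift a u8 rank into the i8 range (u8 - 128).
fn relabel_unsigned(u: u8) -> i8 {
    (u as i16 - 128) as i8
}

/// The direct path: round(cos × 127), sign taken from the source cosine.
fn direct_i8(cos: f32) -> i8 {
    (cos.clamp(-1.0, 1.0) * 127.0).round() as i8
}
```

For the all-positive row `[0.1, 0.2, 0.3, 0.4]`, the relabelled ranks come out as `[-128, -43, 42, 127]`: half the entries look inhibitory although no source cosine was negative. The direct path gives `[13, 25, 38, 51]`, with every sign matching the source.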
…es on CDF

think_with_temperature(max_cycles, T) on both engines:
  T=0.3: winner-take-all sharpening (analytical)
  T=1.0: standard normalization (balanced)
  T=2.0: uniform exploration (creative)

Applied as softmax(energy/T) AFTER accumulation, BEFORE normalization.
Log-sum-exp trick for numerical stability.

End-to-end smoke test with T=0.3: STILL cos=1.000 for all pairs.
Temperature exponentiates small differences, but CDF-encoded tables
have NO differences to exponentiate. All rows look identical after
CDF uniform redistribution.

CONFIRMED: The bottleneck is the TABLE ENCODING, not the engine.
  - Temperature: works correctly (sharpens peaks)
  - Inhibition: works correctly (clamps negatives)
  - Nucleus sampling: works correctly (samples from nucleus)
  - Tokenizer: works correctly (real XLM-RoBERTa SentencePiece)
  - Codebook lookup: works correctly (token→centroid mapping)

BUT: the distance table after CDF encoding is too uniform.
Fix: stream_signed_lens.rs on real GGUF → i8 from f32 cosines.
Needs HuggingFace HTTPS access (blocked in sandbox).

242 tests pass.

https://claude.ai/code/session_019RzHP8tpJu55ESTxhfUy1A
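A minimal sketch of softmax(energy/T) with the log-sum-exp trick follows. The function name `softmax_with_temperature` is hypothetical and this is not the engine's `think_with_temperature()`; it only illustrates the per-cycle transform described above.

```rust
/// softmax(energy / T), computed via log-sum-exp so large energies do not overflow.
fn softmax_with_temperature(energy: &[f32], temperature: f32) -> Vec<f32> {
    assert!(temperature > 0.0, "temperature must be positive");
    if energy.is_empty() {
        return Vec::new();
    }

    // Scale by temperature first: low T stretches differences, high T flattens them.
    let scaled: Vec<f32> = energy.iter().map(|e| e / temperature).collect();

    // log(sum(exp(x))) = max + log(sum(exp(x - max))), which keeps exp() bounded.
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let log_sum_exp = max + scaled.iter().map(|s| (s - max).exp()).sum::<f32>().ln();

    // p_i = exp(x_i - logsumexp(x)).
    scaled.iter().map(|s| (s - log_sum_exp).exp()).collect()
}
```

Feeding it a flat vector such as `[5.0; 4]` returns a uniform distribution at any temperature, which mirrors the smoke-test finding: temperature can only amplify differences that already exist in the energies.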
RoleTemperatures struct: t_gate, t_attn, t_ffn, t_down
  Gate IS the thermostat. Its temperature controls decision sharpness.
  Other roles follow but with independent T values.

ThinkingPreset now sets BOTH pooling AND per-role temperatures:
  Analytical: T_gate=0.1  → narrow gate → few features → focused (FLOW)
  Creative:   T_gate=1.5  → wide gate → many features → exploratory (HOLD)
  Balanced:   T_gate=0.7  → moderate → adaptive
  Focused:    T_gate=0.05 → near-zero → maximum discrimination

CollapseGate mapping:
  FLOW  = low T_gate (commit fast, sharp decisions)
  HOLD  = high T_gate (explore, soft decisions)
  BLOCK = T_gate → ∞ (uniform, no discrimination)

ConfiguredEngine::process() uses T_gate for think_with_temperature():
  T≈1.0: standard cycle (no softmax overhead)
  T≠1.0: softmax(energy/T) per cycle (temperature-as-excitation)

The gate table stored separately was ALWAYS for this: different T per
role during the cycle, not just at output sampling.

10 builder tests pass (2 new: per_role_temperature, creative_vs_analytical).
244 tests total.

https://claude.ai/code/session_019RzHP8tpJu55ESTxhfUy1A
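The preset-to-gate-temperature mapping can be sketched as a plain enum. Only the four T_gate values come from the commit message; the method name `t_gate()`, the helper `needs_softmax()`, and its tolerance of 1e-3 for "T≈1.0" are assumptions, and the temperatures of the other roles are omitted because the message does not state them.

```rust
/// Cognitive-style presets named in the commit message.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ThinkingPreset {
    Analytical,
    Creative,
    Balanced,
    Focused,
}

impl ThinkingPreset {
    /// Gate temperature per preset (values taken from the commit message).
    fn t_gate(self) -> f32 {
        match self {
            ThinkingPreset::Analytical => 0.1,
            ThinkingPreset::Creative => 1.5,
            ThinkingPreset::Balanced => 0.7,
            ThinkingPreset::Focused => 0.05,
        }
    }
}

/// Assumed rule: skip the softmax step when T is within 1e-3 of 1.0,
/// matching "T≈1.0: standard cycle (no softmax overhead)".
fn needs_softmax(temperature: f32) -> bool {
    (temperature - 1.0).abs() > 1e-3
}
```

Under this sketch all four presets take the softmax path, since none of their gate temperatures is close to 1.0; only an explicit T=1.0 would fall back to the standard cycle.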