202 changes: 202 additions & 0 deletions SESSION_VISION_SENSOR_VIT.md
@@ -0,0 +1,202 @@
# SESSION: Vision Sensor — ViT-Huge-14 for Medical Imaging + Multimodal

## THE ARCHITECTURE

```
Text sensor (current):
  text → tokenizer → token_ids → codebook_index → centroids → distance table → think

Vision sensor (planned):
  image → ViT patches (14×14 px) → patch embeddings → codebook_index → centroids → distance table → think

Same engine. Different sensor. Same MatVec. Same domino cascade.
Models are SENSORS. The matrix is the BRAIN.
```
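
The `codebook_index` step in both sensor paths maps an embedding to a centroid id. A minimal sketch of that lookup, assuming nearest-centroid assignment by squared Euclidean distance. The names `nearest_centroid` and `assign_centroids` are hypothetical, and the real engine may use a different metric or a CLAM tree search instead of a linear scan:

```rust
/// Squared Euclidean distance between two equal-length vectors.
fn squared_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Hypothetical codebook lookup: index of the centroid closest to `embedding`.
/// Returns `None` when the codebook is empty.
fn nearest_centroid(embedding: &[f32], centroids: &[Vec<f32>]) -> Option<u16> {
    centroids
        .iter()
        .enumerate()
        .min_by(|(_, a), (_, b)| {
            squared_distance(embedding, a).total_cmp(&squared_distance(embedding, b))
        })
        .map(|(idx, _)| idx as u16)
}

/// Map every token or patch embedding to its centroid id.
/// The same function serves text tokens and image patches.
fn assign_centroids(embeddings: &[Vec<f32>], centroids: &[Vec<f32>]) -> Vec<u16> {
    embeddings
        .iter()
        .filter_map(|e| nearest_centroid(e, centroids))
        .collect()
}
```

Because the lookup only sees vectors, swapping the sensor changes the embeddings that go in, not the code that indexes them.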

## GROUND TRUTH MODELS

### Text (Jina v5 — Qwen3-0.6B)

```
Model:      jinaai/jina-embeddings-v5-text-small-text-matching
Base:       Qwen3-0.6B
Format:     safetensors (1.19 GB) + ONNX f32 (2.39 GB) + GGUF F16 (1.2 GB)
Tokenizer:  Qwen3 BPE (151K vocab, 11.4 MB)
Dim:        1024
Pooling:    last-token
Tool:       candle (loads safetensors directly, no ONNX needed)
Status:     tokenizer downloaded, candle wired, forward pass TODO
```

### Vision (ViT-Huge-14 from CLIP)

```
FP32 ground truth:
  Repo:       Kijai/WanVideo_comfy
  File:       open-clip-xlm-roberta-large-vit-huge-14_visual_fp32.safetensors
  Size:       2.53 GB
  Precision:  FP32 (24-bit mantissa, NO BF16 truncation)
  Tool:       candle (loads safetensors) OR rten (after ONNX conversion)

BF16 production:
  Repo:       DeepBeepMeep/Wan2.1
  File:       models_clip_open-clip-xlm-roberta-large-vit-huge-14-bf16.safetensors
  Size:       2.39 GB
  Precision:  BF16 (7-bit mantissa, ±0.008 rank flips at boundaries)
  Includes:   BOTH text encoder (XLM-RoBERTa) + visual encoder (ViT-Huge-14)

Architecture:
  ViT-Huge-14:
    Patch size: 14×14 pixels
    Each patch = one "token" (like BPE subword for text)
    ~630M parameters
    Trained contrastively with XLM-RoBERTa (CLIP objective)
```
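
The precision gap between the two checkpoints can be seen with plain bit arithmetic. The sketch below is an illustration, not the conversion the checkpoints were produced with: it truncates an f32 to BF16 by zeroing the low 16 bits (real converters usually round to nearest even). Just above 1.0 the BF16 spacing is 2^-7 ≈ 0.0078, which is in line with the ±0.008 figure. `patch_count` assumes a square 224×224 input, which the notes do not state:

```rust
/// Truncate an f32 to BF16 precision by zeroing the low 16 bits,
/// then widen back to f32 so the loss is easy to inspect.
/// Assumption: plain truncation; production converters typically round.
fn truncate_to_bf16(x: f32) -> f32 {
    f32::from_bits(x.to_bits() & 0xFFFF_0000)
}

/// Number of non-overlapping square patches in a square image.
/// Pixels that do not fill a whole patch are ignored.
fn patch_count(image_side: u32, patch_side: u32) -> u32 {
    let per_side = image_side / patch_side;
    per_side * per_side
}
```

With these definitions, `truncate_to_bf16(1.0 + 0.00390625)` collapses to `1.0` while `1.0078125` survives, and `patch_count(224, 14)` gives 256 patch tokens per image.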

### Cross-Modal (CLIP — text ↔ image in same space)

```
The CLIP training objective:
  For (text, image) pairs:
    text_emb  = XLM-RoBERTa(text)
    image_emb = ViT-Huge-14(image)
    loss      = contrastive(text_emb, image_emb)

After training:
  cos(text_emb, image_emb) = semantic similarity across modalities
  "amyloid plaque in temporal lobe" ↔ brain MRI   = high cosine
  "amyloid plaque in temporal lobe" ↔ chest X-ray = low cosine

For our architecture:
  Text codebook (XLM-RoBERTa) and vision codebook (ViT) share embedding space
  Cross-modal distance table: text_centroid × image_centroid → similarity
  One CompositeEngine with text lens + vision lens → superposition
```
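
The cross-modal distance table described above can be sketched as a matrix of cosines between text centroids and image centroids. This is a sketch under the assumption that both centroid sets come from the same CLIP checkpoint and therefore share a dimension; `cosine` and `cross_modal_table` are hypothetical names, not functions from the crate:

```rust
/// Cosine similarity; returns 0.0 when either vector is (near) zero.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na > 1e-10 && nb > 1e-10 { dot / (na * nb) } else { 0.0 }
}

/// Hypothetical cross-modal table: table[t][i] = cos(text_centroid t, image_centroid i).
/// Unlike a single-modality table, this one is not symmetric and need not be square.
fn cross_modal_table(text: &[Vec<f32>], image: &[Vec<f32>]) -> Vec<Vec<f32>> {
    text.iter()
        .map(|t| image.iter().map(|i| cosine(t, i)).collect())
        .collect()
}
```

The f32 table produced here would still need the same encoding step (u8 CDF or i8 signed) as the text tables before the engine can consume it.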

## MEDICAL IMAGING PIPELINE

```
Phase 1: Image input
  DICOM → PNG/TIFF → resize to ViT resolution
  OR: direct from PACS/radiology viewer

Phase 2: ViT forward pass (rten, pure Rust)
  Image → 14×14 patches → ViT encoder → f32 embedding per patch
  Global: mean pool patch embeddings (1280D ViT-Huge width) → CLIP projection → 1024D image embedding
  Local: per-patch embeddings for segmentation

Phase 3: Codebook + distance table
  CLAM 256 centroids from ViT patch embeddings (same as text pipeline)
  256×256 distance table (same HDR CDF or i8 signed encoding)
  codebook_index: patch_embedding → centroid_id

Phase 4: ThinkingEngine
  perturb(patch_centroid_ids) → think(10 cycles) → commit()
  Same engine, same MatVec, same domino cascade
  Qualia from convergence = visual gestalt of the image

Phase 5: SPO extraction
  Dominant atoms → centroid labels → SPO triples
    (lesion, ADJACENT_TO, ventricle)
    (tumor, LARGER_THAN, 2cm)
  NARS truth values from convergence confidence

Phase 6: Cross-modal query
  Text: "Show me cases with amyloid plaques near the hippocampus"
  → XLM-RoBERTa (CLIP text encoder) tokenize → codebook → text_centroids
  → CLIP cross-modal similarity with image_centroids
  → Ranked retrieval from image database
```
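
The patch geometry of Phase 2 and the global pooling step can be sketched without any model weights. The code below only shows how a row-major grayscale image is cut into non-overlapping patches and how per-patch vectors are mean-pooled; a real ViT additionally applies a learned patch projection, position embeddings, and transformer layers. `split_into_patches` and `mean_pool` are hypothetical helper names:

```rust
/// Split a row-major grayscale image into non-overlapping square patches.
/// Trailing rows and columns that do not fill a whole patch are dropped.
fn split_into_patches(pixels: &[f32], width: usize, height: usize, patch: usize) -> Vec<Vec<f32>> {
    assert_eq!(pixels.len(), width * height, "pixel buffer size mismatch");
    assert!(patch > 0, "patch size must be positive");
    let mut patches = Vec::new();
    for py in 0..height / patch {
        for px in 0..width / patch {
            let mut p = Vec::with_capacity(patch * patch);
            for row in 0..patch {
                // Start of this patch row inside the full image buffer.
                let start = (py * patch + row) * width + px * patch;
                p.extend_from_slice(&pixels[start..start + patch]);
            }
            patches.push(p);
        }
    }
    patches
}

/// Mean-pool equal-length per-patch vectors into one global vector.
/// Returns an empty vector when there is nothing to pool.
fn mean_pool(vectors: &[Vec<f32>]) -> Vec<f32> {
    if vectors.is_empty() {
        return Vec::new();
    }
    let mut out = vec![0.0f32; vectors[0].len()];
    for v in vectors {
        for (o, x) in out.iter_mut().zip(v) {
            *o += x;
        }
    }
    let n = vectors.len() as f32;
    out.iter_mut().for_each(|o| *o /= n);
    out
}
```

The per-patch vectors feed the "Local" path and the codebook; the pooled vector corresponds to the "Global" path.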

## WHALE SONOGRAPHY (SESSION_WHALE_SONOGRAPHY.md)

```
Same pipeline applied to:
  Ultrasound images → ViT patches → codebook → think
  Age-cohort stratification via L4 experience
  Longitudinal tracking via trajectory (trajectory-cartographer agent)

The ViT sensor treats ultrasound frames as images.
No special medical preprocessing — the codebook learns the topology.
```

## OSINT INTEGRATION

```
WikiLeaks documents often contain:
  Text (cables, reports)            → Jina/BGE-M3 text sensor
  Images (maps, photos, diagrams)   → ViT vision sensor
  OCR'd text from images            → text sensor (after ocrs/rten OCR)

Cross-modal CLIP similarity:
  "drone strike coordinates" (text) ↔ satellite image (vision)
  Both in same embedding space → one distance table query
```
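
One way the "one distance table query" could look in practice: score each image by the average table entry between the query's text centroid ids and the image's centroid ids, then sort. This is a guess at the retrieval step, not the crate's implementation. `table_score` and `rank_images` are hypothetical, and the table is assumed to be laid out as `table[text_id][image_id]` in f32:

```rust
/// Average similarity between a query's text centroids and one image's
/// centroids, read from a precomputed table[text_id][image_id].
fn table_score(table: &[Vec<f32>], text_ids: &[u16], image_ids: &[u16]) -> f32 {
    if text_ids.is_empty() || image_ids.is_empty() {
        return 0.0;
    }
    let mut total = 0.0f32;
    for &t in text_ids {
        for &i in image_ids {
            total += table[t as usize][i as usize];
        }
    }
    total / (text_ids.len() * image_ids.len()) as f32
}

/// Rank images (given as lists of centroid ids) by descending score.
/// Returns (image index, score) pairs.
fn rank_images(table: &[Vec<f32>], text_ids: &[u16], images: &[Vec<u16>]) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = images
        .iter()
        .enumerate()
        .map(|(idx, ids)| (idx, table_score(table, text_ids, ids)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored
}
```

Averaging is the simplest aggregate; running the ids through the engine and comparing the converged energy vectors, as the end-to-end example does for text, is the other obvious option.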

## CALIBRATION (same pattern as text)

```
Vision ground truth:
  FP32 safetensors → candle forward pass → f32 patch embeddings
  Calibrate against: baked u8 CDF, i8 signed, γ+φ encoded tables
  Same 5-lane encoder, same Spearman ρ, same ICC profiles

Text ground truth:
  Jina v5 safetensors → candle forward pass → f32 text embeddings
  Same calibration pipeline

Cross-modal ground truth:
  CLIP FP32 → both encoders → cross-modal cosine
  Calibrate: cross-modal distance table vs CLIP cosine
```
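
Spearman ρ is the rank-agreement measure named in the calibration steps above: it checks whether an encoded table orders pairs the same way the FP32 ground truth does, regardless of scale. A minimal sketch, computed as the Pearson correlation of average ranks so that ties are handled; the function names are hypothetical and independent of whatever `src/ground_truth.rs` implements:

```rust
/// 1-based ranks; tied values share the average of their positions.
fn average_ranks(values: &[f32]) -> Vec<f32> {
    let mut order: Vec<usize> = (0..values.len()).collect();
    order.sort_by(|&a, &b| values[a].total_cmp(&values[b]));
    let mut ranks = vec![0.0f32; values.len()];
    let mut i = 0;
    while i < order.len() {
        // Extend j over the run of values equal to values[order[i]].
        let mut j = i;
        while j + 1 < order.len() && values[order[j + 1]] == values[order[i]] {
            j += 1;
        }
        let avg = (i + j) as f32 / 2.0 + 1.0;
        for &idx in &order[i..=j] {
            ranks[idx] = avg;
        }
        i = j + 1;
    }
    ranks
}

/// Pearson correlation; returns 0.0 when either input has zero variance.
fn pearson(x: &[f32], y: &[f32]) -> f32 {
    let n = x.len() as f32;
    let mx = x.iter().sum::<f32>() / n;
    let my = y.iter().sum::<f32>() / n;
    let (mut cov, mut vx, mut vy) = (0.0f32, 0.0f32, 0.0f32);
    for (a, b) in x.iter().zip(y) {
        let (dx, dy) = (a - mx, b - my);
        cov += dx * dy;
        vx += dx * dx;
        vy += dy * dy;
    }
    let denom = (vx * vy).sqrt();
    if denom > 0.0 { cov / denom } else { 0.0 }
}

/// Spearman rank correlation between ground-truth and encoded similarities.
fn spearman_rho(truth: &[f32], encoded: &[f32]) -> f32 {
    assert_eq!(truth.len(), encoded.len(), "inputs must have equal length");
    pearson(&average_ranks(truth), &average_ranks(encoded))
}
```

A ρ near 1.0 means the encoded table preserves the ground-truth ordering even if the absolute values differ, which is the property a rank-based CDF encoding is meant to keep.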

## THREE TOOLS FOR THREE SENSOR TYPES

```
Tool     Text sensor              Vision sensor              Cross-modal
────     ───────────              ─────────────              ───────────
candle   Jina v5 forward pass     ViT-Huge-14 forward        CLIP joint
ort      Reranker cross-encoder   -                          -
rten     -                        Medical ViT segmentation   -

candle loads safetensors (text + vision).
ort loads ONNX (reranker only, cross-encoder architecture).
rten loads ONNX (medical imaging, pure Rust, AdaWorldAPI fork).
```

## IMPLEMENTATION ORDER

```
1. [NOW] Jina v5 text ground truth (candle + Qwen3 tokenizer)
2. [NEXT] Cross-model text calibration (Jina v3 ↔ v5 ↔ Reranker ↔ BGE-M3)
3. [NEXT] 5-lane encoding + Spearman ρ + ICC profiles
4. [THEN] ViT-Huge-14 vision ground truth (candle + FP32 safetensors)
5. [THEN] Medical imaging codebook (CLAM on ViT patch embeddings)
6. [THEN] Cross-modal CLIP distance table
7. [THEN] OSINT multimodal query (text + image in same search)
```

## FILES

```
Ground truth models:
  jinaai/jina-embeddings-v5-text-small-text-matching    (text, Qwen3)
  Kijai/WanVideo_comfy/..._visual_fp32.safetensors      (vision, ViT-Huge-14, FP32)
  DeepBeepMeep/Wan2.1/..._bf16.safetensors              (combined CLIP, BF16)

Tokenizer:
  data/jina-v5-tokenizer.json        (Qwen3 BPE, 151K vocab, 11.4 MB)
  data/jina-v3-hdr/tokenizer.json    (XLM-RoBERTa, 250K vocab, 8.7 MB)

Code:
  src/tokenizer_registry.rs          (6 models, cross-model tokenization)
  src/ground_truth.rs                (calibration DTOs, Spearman ρ)
  src/composite_engine.rs            (multi-lens including future vision lens)
  src/tensor_bridge.rs               (F32/I8/U8/Tensor bridge for candle output)
  examples/stream_signed_lens.rs     (5-lane encoder with γ+φ metadata)

Agents:
  .claude/agents/family-codec-smith.md          (HEEL/HIP/BRANCH/TWIG/LEAF encoding)
  ndarray/.claude/agents/truth-architect.md     (BF16 truth, causality)
  ndarray/.claude/agents/cascade-architect.md   (3-stroke search)
```
2 changes: 2 additions & 0 deletions crates/thinking-engine/data/.gitignore
@@ -1,2 +1,4 @@
*.onnx
*.onnx_data
tokenizer.json
*-tokenizer.json
1 change: 0 additions & 1 deletion crates/thinking-engine/data/bge-m3-hdr/tokenizer.json

This file was deleted.

1 change: 0 additions & 1 deletion crates/thinking-engine/data/jina-v3-hdr/tokenizer.json

This file was deleted.

1 change: 0 additions & 1 deletion crates/thinking-engine/data/xlm-roberta-de/tokenizer.json

This file was deleted.

162 changes: 162 additions & 0 deletions crates/thinking-engine/examples/end_to_end_signed.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,162 @@
//! End-to-end test: real tokenizer → signed engine → nucleus sampling.
//!
//! Tests whether the full pipeline produces meaningful similarity:
//! Similar texts (Rumi↔Rumi) should have higher overlap than
//! unrelated texts (Rumi↔TCP).
//!
//! Uses: real XLM-RoBERTa tokenizer, Jina v3 HDR lens (converted to i8),
//! SignedThinkingEngine with Nucleus pooling (T=0.7, p=0.9).
//!
//! This is the SMOKE TEST before calibration. If this fails,
//! the 7-lane encoding and ONNX ICC are measuring noise.

use thinking_engine::jina_lens::{JINA_HDR_TABLE, jina_lookup_many, JINA_N_CENTROIDS};
use thinking_engine::signed_engine::SignedThinkingEngine;
use thinking_engine::pooling::Pooling;

fn main() {
    println!("═══════════════════════════════════════════════════════════");
    println!(" END-TO-END: real tokenizer → i8 signed → nucleus");
    println!("═══════════════════════════════════════════════════════════\n");

    // Load real XLM-RoBERTa tokenizer
    let tok = match tokenizers::Tokenizer::from_file(
        "crates/thinking-engine/data/jina-v3-hdr/tokenizer.json"
    ) {
        Ok(t) => t,
        Err(e) => { eprintln!("Tokenizer failed: {}. Aborting.", e); return; }
    };
    println!("Tokenizer: XLM-RoBERTa 250K loaded\n");

    // Build signed engine from Jina HDR table
    let signed_table: Vec<i8> = JINA_HDR_TABLE.iter()
        .map(|&v| (v as i16 - 128) as i8)
        .collect();
    // NOTE: This is from_unsigned (CDF rank relabeling, not true signed).
    // The real i8 path needs from_f32_cosines via stream_signed_lens.
    // But this tests the ENGINE + POOLING pipeline, not the encoding quality.
    let mut engine = SignedThinkingEngine::new(signed_table);

    let pooling = Pooling::Nucleus {
        temperature: 0.7,
        top_p: 0.9,
        seed: Some(42), // deterministic for comparison
    };

    // Calibration pairs (4 tiers)
    let pairs: Vec<(&str, &str, &str)> = vec![
        // TIER 1: should be MOST similar
        ("The wound is the place where the light enters you",
         "Where there is ruin there is hope for a treasure",
         "Rumi↔Rumi"),
        ("A federal judge ruled the surveillance program unconstitutional",
         "A US court declared the mass surveillance scheme violated the constitution",
         "STS-B paraphrase"),
        // TIER 2: moderate
        ("Palantir built Gotham for intelligence agencies to map human networks",
         "Edward Snowden revealed the NSA collected phone metadata of millions",
         "Palantir↔Snowden"),
        ("Amyloid plaques accumulate in the brains of Alzheimer patients",
         "Tau protein tangles disrupt neural communication in neurodegenerative disease",
         "Alzheimer↔Tau"),
        // TIER 3: weak
        ("Newton showed that gravity follows an inverse square law",
         "Quantum entanglement allows particles to share states across arbitrary distances",
         "Newton↔Quantum"),
        // TIER 4: should be LEAST similar
        ("You are not a drop in the ocean you are the entire ocean in a drop",
         "TCP uses a three-way handshake to establish a reliable connection between hosts",
         "Rumi↔TCP"),
        ("CRISPR-Cas9 enables precise editing of genomic sequences at targeted loci",
         "Bach composed the Well-Tempered Clavier as an exploration of all major and minor keys",
         "CRISPR↔Bach"),
    ];

    // The last column prints the pooled atom counts for A and B (not cycle counts).
    println!(" {:>20} {:>8} {:>8} {:>8} {:>6} {:>6}",
        "Pair", "Jaccard", "Cos(E)", "TopK∩", "Inhib", "Atoms");
    println!(" {:─>20} {:─>8} {:─>8} {:─>8} {:─>6} {:─>6}", "", "", "", "", "", "");

    let mut results: Vec<(String, f32, f32, usize)> = Vec::new();

    for (text_a, text_b, label) in &pairs {
        let enc_a = tok.encode(*text_a, true).expect("tokenizing text A failed");
        let enc_b = tok.encode(*text_b, true).expect("tokenizing text B failed");
        let ids_a: Vec<u32> = enc_a.get_ids().to_vec();
        let ids_b: Vec<u32> = enc_b.get_ids().to_vec();

        let centroids_a = jina_lookup_many(&ids_a);
        let centroids_b = jina_lookup_many(&ids_b);

        // Think text A with temperature excitation (T=0.3, sharp discrimination)
        engine.reset();
        engine.perturb(&centroids_a);
        engine.think_with_temperature(10, 0.3);
        let energy_a = engine.energy.clone();
        let pooled_a = pooling.pool(&energy_a);
        let inhib_a = engine.total_inhibitions;

        // Think text B
        engine.reset();
        engine.perturb(&centroids_b);
        engine.think_with_temperature(10, 0.3);
        let energy_b = engine.energy.clone();
        let pooled_b = pooling.pool(&energy_b);
        let inhib_b = engine.total_inhibitions;

        // Compare: Jaccard of pooled atoms
        let atoms_a: std::collections::HashSet<u16> = pooled_a.atoms.iter()
            .map(|&(idx, _)| idx).collect();
        let atoms_b: std::collections::HashSet<u16> = pooled_b.atoms.iter()
            .map(|&(idx, _)| idx).collect();
        let intersection = atoms_a.intersection(&atoms_b).count();
        let union = atoms_a.union(&atoms_b).count().max(1);
        let jaccard = intersection as f32 / union as f32;

        // Compare: cosine of full energy vectors
        let dot: f32 = energy_a.iter().zip(&energy_b).map(|(a, b)| a * b).sum();
        let na: f32 = energy_a.iter().map(|x| x * x).sum::<f32>().sqrt();
        let nb: f32 = energy_b.iter().map(|x| x * x).sum::<f32>().sqrt();
        let cos_e = if na > 1e-10 && nb > 1e-10 { dot / (na * nb) } else { 0.0 };

        // Compare: top-k overlap
        let top_a: Vec<u16> = pooled_a.atoms.iter().take(5).map(|&(idx, _)| idx).collect();
        let top_b: Vec<u16> = pooled_b.atoms.iter().take(5).map(|&(idx, _)| idx).collect();
        let topk_overlap = top_a.iter().filter(|x| top_b.contains(x)).count();

        println!(" {:>20} {:>8.3} {:>8.3} {:>5}/5 {:>6} {:>3}+{:<3}",
            label, jaccard, cos_e, topk_overlap,
            (inhib_a + inhib_b) / 2,
            pooled_a.atoms.len(), pooled_b.atoms.len());

        results.push((label.to_string(), jaccard, cos_e, topk_overlap));
    }

    // Verdict
    println!("\n═══════════════════════════════════════════════════════════");
    println!(" VERDICT");
    println!("═══════════════════════════════════════════════════════════");

    // Check monotonicity on energy cosine: tier 1 ≥ tier 2 ≥ tier 3 ≥ tier 4.
    // Indices follow the order of `pairs` above (2 + 2 + 1 + 2 entries).
    let tier1_avg = (results[0].2 + results[1].2) / 2.0;
    let tier2_avg = (results[2].2 + results[3].2) / 2.0;
    let tier3_avg = results[4].2;
    let tier4_avg = (results[5].2 + results[6].2) / 2.0;

    println!(" Tier 1 (paraphrase): cos={:.3}", tier1_avg);
    println!(" Tier 2 (thematic): cos={:.3}", tier2_avg);
    println!(" Tier 3 (weak): cos={:.3}", tier3_avg);
    println!(" Tier 4 (unrelated): cos={:.3}", tier4_avg);
    println!();

    let monotonic = tier1_avg >= tier2_avg && tier2_avg >= tier3_avg && tier3_avg >= tier4_avg;
    if monotonic {
        println!(" → MONOTONIC: tiers decrease correctly. Engine discriminates.");
        println!(" → Ready for 7-lane encoding + ONNX ICC calibration.");
    } else if tier1_avg > tier4_avg {
        println!(" → PARTIALLY DISCRIMINATIVE: tier1 > tier4 but not monotonic.");
        println!(" → Engine sees some signal. May improve with better encoding.");
    } else {
        println!(" → NOT DISCRIMINATIVE: tier1 ≤ tier4. Engine is confused.");
        println!(" → Fix encoding or table granularity before calibration.");
    }
}