From 03d9aa5a2a586a41d87f6ed671ec237ec02ba7db Mon Sep 17 00:00:00 2001 From: Claude Date: Sun, 5 Apr 2026 23:23:30 +0000 Subject: [PATCH] =?UTF-8?q?docs:=20signed=20i8=20formulas=20per=20role=20?= =?UTF-8?q?=E2=80=94=20Q/K/V/Gate/Up/Down=20encoding=20+=20MatVec?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Complete formulas for building signed i8 distance tables: Q: raw cosine → i8 (extern, no gate) K: silu(gate) × K → cosine → i8 (intern, gate-modulated) V: silu(gate) × V → cosine → i8 (intern, gate-modulated) Gate: raw cosine → i8 (IS the gate, topology reference) Up: silu(gate) × Up → cosine → i8 (strongest effect, 33% Δ) Down: raw cosine → i8 (funnel, receives gated result) Per-role scale factors from Qwopus BF16 measured ranges. Gate gets highest resolution (scale=552) because range is narrowest. Signed MatVec + clamp(0) = excitation/inhibition dynamics. Complete layer_forward_signed() showing gate as NARS trust modulator. https://claude.ai/code/session_01ChLvBfpJS8dQhHxRD4pYNp --- .claude/KNOWLEDGE_SYNC_SIGNED_SESSION.md | 196 +++++++++++++++++++++++ 1 file changed, 196 insertions(+) diff --git a/.claude/KNOWLEDGE_SYNC_SIGNED_SESSION.md b/.claude/KNOWLEDGE_SYNC_SIGNED_SESSION.md index 546ff6fa..d7aa3576 100644 --- a/.claude/KNOWLEDGE_SYNC_SIGNED_SESSION.md +++ b/.claude/KNOWLEDGE_SYNC_SIGNED_SESSION.md @@ -231,3 +231,199 @@ HANDOVER DOCS: .claude/HANDOVER_MAVERICK_SESSION.md → i8 architecture, Maverick plan, temperature fix .claude/HANDOVER_CALIBRATION_SESSION.md → H1-H5 hypotheses, Cronbach α protocol ``` + +--- + +## SIGNED i8 FORMULAS PER ROLE + +### The encoding formula + +For each weight row, the signed i8 value preserves the ACTUAL cosine polarity: + +``` +scale_factor = 127.0 / max(|cosine_values|) +i8_value = round(cosine × scale_factor).clamp(-128, +127) +``` + +Per-role scale factors (from Qwopus 27B L0 measured cosine ranges): + +``` +Role Cosine Range max(|cos|) Scale Factor +──── ──────────── ────────── ──────────── +attn_qkv [-0.62, +0.69] 0.69 184.1 +ffn_gate [-0.23, +0.18] 0.23 552.2 ← HIGHEST RESOLUTION +ffn_up [-0.08, +0.08] 0.08 1587.5 ← (but tiny range) +ffn_down [-0.18, +0.10] 0.18 705.6 +ssm_out [-0.20, +0.28] 0.28 453.6 +``` + +Gate gets the most resolution because its range is narrow and centered at zero — +exactly where the SiLU decision boundary lives. + +### What each role's sign MEANS + +``` +Q (Query) — "what is the world asking?" + EXTERN. Input-dependent. The world asks what it asks. + i8 encoding: round(cos(Q_row_i, Q_row_j) × scale) → i8 + + +i8: "query i and query j ask SIMILAR things" + -i8: "query i and query j ask OPPOSITE things" + 0: "unrelated queries" + + NO gate modulation. Q is raw. + Formula: table_Q[i][j] = i8(cos(Q_centroid_i, Q_centroid_j) × scale_Q) + + +K (Key) — "what do I know?" (gate-modulated) + INTERN. Self-filtered knowledge index. + i8 encoding: silu(gate) × K, THEN cosine, THEN i8 + + +i8: "knowledge i and knowledge j are CO-ACCESSIBLE through the gate" + -i8: "gate opens i but BLOCKS j (or vice versa)" + 0: "no gate relationship" + + Formula: + activated_K_i = silu(gate_centroid_i) ⊙ K_centroid_i (elementwise) + activated_K_j = silu(gate_centroid_j) ⊙ K_centroid_j + table_K[i][j] = i8(cos(activated_K_i, activated_K_j) × scale_K) + + WHY silu(gate) × K: + gate[d] = +0.3 → silu(0.3) = 0.16 → K[d] × 0.16 → feature d PASSES (reduced) + gate[d] = -0.1 → silu(-0.1) = -0.047 → K[d] × -0.047 → feature d INVERTED + gate[d] = 0.0 → silu(0.0) = 0.0 → K[d] × 0.0 → feature d MASKED + + Two keys with SAME gate opening pattern → positive cosine → excitation + Two keys where gate opens OPPOSITE features → negative cosine → inhibition + + +V (Value) — "what do I give?" (gate-modulated) + Same as K but for content: + Formula: + activated_V_i = silu(gate_centroid_i) ⊙ V_centroid_i + table_V[i][j] = i8(cos(activated_V_i, activated_V_j) × scale_V) + + +Gate — "what am I ALLOWED to activate?" + The gate IS the lens. Not a codebook entry. + i8 encoding: raw gate-to-gate cosine (how similar are two gate patterns?) + + +i8: "same gate opening pattern" (same features allowed) + -i8: "OPPOSITE gate patterns" (what one allows, the other blocks) + 0: "unrelated gate patterns" + + Formula: table_Gate[i][j] = i8(cos(Gate_centroid_i, Gate_centroid_j) × scale_Gate) + + NOTE: 68.9% of gate values are near zero. + This means most gate dimensions are in the SiLU decision zone. + The SIGN of these near-zero values is the entire gate decision. + i8 preserves this sign. u8 destroys it. + + +Up — "how do I expand?" (gate × SiLU modulated) + INTERN. The FFN expansion. Gate × SiLU × Up is the activation. + i8 encoding: silu(gate) × up, THEN cosine, THEN i8 + + Formula: + activated_Up_i = silu(gate_centroid_i) ⊙ Up_centroid_i + activated_Up_j = silu(gate_centroid_j) ⊙ Up_centroid_j + table_Up[i][j] = i8(cos(activated_Up_i, activated_Up_j) × scale_Up) + + This is where the 33% error lives: + Raw cos(Up_i, Up_j) std = 0.021 + cos(silu(gate)×Up_i, silu(gate)×Up_j) std = 0.051 ← 2.4× MORE SPREAD + 99.2% of table cells change. Mean Δ = 84.2 u8 levels. + + Without gate modulation: Up table is WRONG by 33%. + With gate modulation: Up table captures actual FFN activation topology. + + +Down — "how do I compress?" + EXTERN (funnel). Receives gate×up result, compresses back. + i8 encoding: raw cosine (no gate modulation needed) + + Formula: table_Down[i][j] = i8(cos(Down_centroid_i, Down_centroid_j) × scale_Down) + + NO gate modulation. Down receives the already-gated signal. + Like Q, it's a raw cosine encoding. +``` + +### The MatVec with signed tables + +```rust +/// Signed MatVec: positive entries excite, negative entries inhibit. +fn signed_matvec(table: &[i8], energy: &[f32], n: usize) -> Vec { + let mut next = vec![0.0f32; n]; + for i in 0..n { + if energy[i].abs() < 1e-8 { continue; } + let row = &table[i * n..(i + 1) * n]; + for j in 0..n { + // SIGNED: table[i][j] > 0 = excitation, < 0 = inhibition + next[j] += (row[j] as f32 / 127.0) * energy[i]; + } + } + // CLAMP: inhibited atoms die (negative energy → 0) + for e in &mut next { + *e = e.max(0.0); + } + next +} +``` + +### The complete forward pass per layer + +```rust +fn layer_forward_signed( + hidden: &mut [f32], + table_q: &[i8], // raw (extern) + table_gate: &[i8], // raw gate topology + table_up: &[i8], // silu(gate)×up (intern, gate-modulated) + table_down: &[i8], // raw (funnel) + residual_scale: f32, // 0.1 typical +) { + let n = hidden.len(); + + // 1. Attention sublayer (Q topology routes) + let mut attn = hidden.to_vec(); + rms_norm(&mut attn); + attn = signed_matvec(table_q, &attn, n); + + // 2. Gate modulates attention via NARS truth + // (gate topology tells us which attention paths to trust) + let gate_energy = signed_matvec(table_gate, &hidden, n); + for i in 0..n { + // Gate as confidence: high gate energy = trust this attention path + let gate_trust = gate_energy[i].max(0.0) / (gate_energy[i].abs() + 1.0); + attn[i] *= gate_trust; + } + + // 3. Residual connection + for i in 0..n { hidden[i] += attn[i] * residual_scale; } + + // 4. FFN sublayer (up is gate-modulated, down is raw) + let mut ffn_in = hidden.to_vec(); + rms_norm(&mut ffn_in); + let up_out = signed_matvec(table_up, &ffn_in, n); // ALREADY gate-corrected + let ffn_out = signed_matvec(table_down, &up_out, n); + + // 5. Residual connection + for i in 0..n { hidden[i] += ffn_out[i] * residual_scale; } +} +``` + +### Summary: which roles get gate × SiLU, which don't + +``` +Role Gate Modulation Formula for i8 table +──── ─────────────── ──────────────────── +Q NONE (extern) i8(cos(Q_i, Q_j) × scale) +Gate NONE (IS the gate) i8(cos(Gate_i, Gate_j) × scale) +K silu(gate) × K i8(cos(silu(g)⊙K_i, silu(g)⊙K_j) × scale) +V silu(gate) × V i8(cos(silu(g)⊙V_i, silu(g)⊙V_j) × scale) +Up silu(gate) × Up i8(cos(silu(g)⊙Up_i, silu(g)⊙Up_j) × scale) +Down NONE (funnel) i8(cos(Down_i, Down_j) × scale) + +⊙ = elementwise multiply +silu(x) = x / (1 + exp(-x)) +scale = 127.0 / max(|cosine_values|) +```