
feat: MoE architecture (4096 experts, top-128 sparse, 4-group hierarchy) #110

Merged: AdaWorldAPI merged 3 commits into main from claude/setup-embedding-pipeline-Fa65C on Apr 5, 2026

Conversation

@AdaWorldAPI
Owner

No description provided.

claude added 3 commits April 5, 2026 13:43
…ared

Three gate modes on Qwopus 27B (64 layers, real tokenizer):

  No Gate:     attn → up → down (no gate at all)
  Gate Filter: attn → gate×SiLU×up → down (multiplicative, current)
  Gate NARS:   gate modulates NARS truth → NARS gates attention+FFN
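
The first two modes can be sketched as a toy feed-forward step. This is a minimal illustration, not the PR's code: it assumes "gate×SiLU×up" means SiLU(gate·x) multiplied elementwise with up·x, and the helper names (`matvec`, `ffn_no_gate`, `ffn_gate_filter`) and the 2×2 weights are hypothetical.

```python
import math

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(matrix, vec):
    """Dense matrix-vector product over plain Python lists."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def ffn_no_gate(x, w_up, w_down):
    # No Gate: up projection followed directly by down projection.
    return matvec(w_down, matvec(w_up, x))

def ffn_gate_filter(x, w_gate, w_up, w_down):
    # Gate Filter (assumed reading): SiLU(gate) scales the up projection
    # elementwise before the down projection.
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])

identity = [[1, 0], [0, 1]]
zeros = [[0, 0], [0, 0]]
x = [1.0, -2.0]

ungated = ffn_no_gate(x, identity, identity)
gated = ffn_gate_filter(x, zeros, identity, identity)
```

With an all-zero gate matrix SiLU(0) is 0, so the gated path suppresses the signal entirely while the ungated path passes it through; that is the multiplicative filtering the second mode refers to.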

Gate NARS approach:
  - Gate topology → per-centroid agreement score with active neighbors
  - Agreement → NARS freq↑ conf↑ (trusted path, strengthen)
  - Disagreement → conf↓ (uncertain, explore)
  - NARS expectation modulates attention output
  - NARS confidence gates FFN magnitude
  - Confidence decays 1%/layer (prevents premature crystallization)
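
A rough sketch of the truth-value bookkeeping described above. The expectation formula c·(f − 0.5) + 0.5 is the standard NARS one; everything else is assumed for illustration: the signed `agreement` score in [-1, 1], the `step` size, and the exact update rule are hypothetical, and only the 1% per-layer decay comes from the commit message.

```python
def expectation(freq, conf):
    """Standard NARS expectation of a (frequency, confidence) truth value."""
    return conf * (freq - 0.5) + 0.5

def update_truth(freq, conf, agreement, step=0.1, decay=0.01):
    """One layer of the assumed update.

    agreement > 0: trusted path, raise frequency and confidence.
    agreement <= 0: uncertain path, lower confidence only.
    Confidence then decays by `decay` (1% per layer in the commit message).
    """
    if agreement > 0:
        freq += step * agreement * (1.0 - freq)
        conf += step * agreement * (1.0 - conf)
    else:
        conf *= 1.0 + step * agreement
    return freq, conf * (1.0 - decay)

# Agreement strengthens, disagreement weakens.
f_up, c_up = update_truth(0.5, 0.5, agreement=1.0)
f_down, c_down = update_truth(0.5, 0.5, agreement=-1.0)

# Decay alone across 64 layers with a neutral signal.
conf = 1.0
for _ in range(64):
    _, conf = update_truth(0.5, conf, agreement=0.0)
```

Under these assumptions, a path that receives no agreement signal keeps roughly half of its confidence after 64 layers (0.99^64 ≈ 0.53), which shows how the decay keeps early decisions from locking in.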

Results on "The meaning of life is":
  No Gate:     entropy=4.733, top=[85,5,53] (diverse, unfocused)
  Gate Filter: entropy=4.534, top=[97,121,81] (focused, code-heavy)
  Gate NARS:   entropy=4.639, top=[0,4,5] (intermediate, common tokens)
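
For reference, the entropy and top-3 figures above can be computed from an output distribution along these lines. This is a sketch under assumptions: the commit does not say whether entropy is in nats or bits (natural log is used here), and the logits are made up.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def top_k(probs, k=3):
    """Indices of the k largest probabilities, highest first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

probs = softmax([0.2, 1.5, 0.9, -0.3])  # hypothetical logits
h = entropy(probs)
top = top_k(probs)
```

Lower entropy means the distribution concentrates on fewer indices, which is the sense in which Gate Filter is "focused" relative to No Gate.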

All three discriminate between prompts. Gate NARS produces unique routing.

https://claude.ai/code/session_01ChLvBfpJS8dQhHxRD4pYNp
The system DRIVES ITSELF:
  16M compositions → interference pattern → peaks emerge
  peaks = TENSION (free energy = |hidden - ghost_prediction|²)
  high tension → must keep composing → next cycle
  ghost learns (EMA) → tension drops → thought resolves → stop

Autoregressive: each collapsed token re-enters context for next step.
Recency-weighted: recent tokens contribute more to hidden state.
Free energy decay: 1.0 → 0.12 across 8-12 steps = thought completes.
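
The stop condition can be illustrated with a toy loop. This is a simplified sketch, not the PR's implementation: the hidden state is held fixed instead of being recomposed each cycle, and `alpha`, `threshold`, and `max_steps` are hypothetical values.

```python
def squared_distance(a, b):
    """Free energy as defined above: |hidden - ghost_prediction|^2."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def run_until_resolved(hidden, ghost, alpha=0.2, threshold=0.15, max_steps=32):
    """Repeat cycles until tension drops below the threshold.

    Each cycle measures tension, then moves the ghost toward the hidden
    state with an exponential moving average (EMA).
    """
    trace = []
    for _ in range(max_steps):
        fe = squared_distance(hidden, ghost)
        trace.append(fe)
        if fe <= threshold:
            break  # thought resolved, stop generating
        ghost = [(1 - alpha) * g + alpha * h for g, h in zip(ghost, hidden)]
    return trace

trace = run_until_resolved(hidden=[1.0, 0.0], ghost=[0.0, 0.0])
```

With a fixed hidden state, each EMA step shrinks tension by a factor of (1 - alpha)², so the trace decays geometrically from 1.0 until it crosses the threshold. In the real loop the hidden state changes with every new token, so the decay would not be this smooth.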

Results (nonsense but mechanically correct):
  "meaning of life" → generates 9 tokens, stops on natural break
  "AI will" → 12 tokens, stops when tension resolves (fe=0.15)
  "once upon a time" → 8 tokens, tension resolves quickly (fe=0.12)

Output quality blocked by:
  - 256 routing centroids too coarse (need 4096 in loop)
  - First-token-in-cluster selection (need frequency-weighted)
  - No temperature sampling (argmax locks onto attractors)
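
The last blocker, argmax versus temperature sampling, can be sketched as follows. This is a generic illustration of temperature sampling, not code from the PR; the logits and function names are hypothetical.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature; lower temperature sharpens."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng):
    """Draw one index in proportion to its tempered probability."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three tokens
sharp = softmax_with_temperature(logits, 0.5)
flat = softmax_with_temperature(logits, 2.0)
token = sample_token(logits, 1.0, random.Random(0))
```

Argmax always returns index 0 for these logits. Sampling at a higher temperature gives the other indices a real chance, which is one way to escape an attractor.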

The metabolic loop IS the driving force. Resolution IS the answer.

https://claude.ai/code/session_01ChLvBfpJS8dQhHxRD4pYNp
4096 experts → router selects top-128 → each runs 16 internal layers
→ 4 hierarchical groups where expert outputs compose → collapse → token

Architecture maps Qwopus's actual MoE structure:
  Router: 4096×4096 input distance table (which experts respond?)
  Expert internals: 256×256 per-layer tables (how does each expert think?)
  Sparse activation: only top-128 of 4096 fire (97% sparsity)
  Hierarchical meeting: experts compose at 4 intermediate points
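
The sparse selection step can be sketched as a plain top-k over router scores. This is illustrative only: the scores are random stand-ins, and "higher score wins" is an assumption (with a distance table, the smallest distances would be selected instead, e.g. via `heapq.nsmallest`).

```python
import heapq
import random

NUM_EXPERTS = 4096
TOP_K = 128

def select_experts(scores, k=TOP_K):
    """Return the indices of the k highest-scoring experts."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)

rng = random.Random(0)
scores = [rng.random() for _ in range(NUM_EXPERTS)]  # hypothetical router output

active = select_experts(scores)
sparsity = 1 - len(active) / NUM_EXPERTS
```

Here `sparsity` evaluates to 0.96875, which the commit message rounds to 97%.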

Performance: 42s for 3 prompts in release (128×16 MatVec per token)
  → production needs SIMD batching or GPU
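
A back-of-envelope count of the per-token work, assuming each of the 128×16 MatVecs is a dense product against one 256×256 per-layer table (the commit message does not state the exact kernel shape):

```python
ACTIVE_EXPERTS = 128      # top-128 selected by the router
LAYERS_PER_EXPERT = 16    # internal layers per expert
TABLE_DIM = 256           # per-layer table is 256x256 (assumed dense)

matvecs_per_token = ACTIVE_EXPERTS * LAYERS_PER_EXPERT
macs_per_matvec = TABLE_DIM * TABLE_DIM
macs_per_token = matvecs_per_token * macs_per_matvec

print(matvecs_per_token, macs_per_matvec, macs_per_token)
```

Under that assumption each token costs 2,048 MatVecs, or about 134 million multiply-accumulates.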

Output quality still blocked by attractor collapse ("!" dominates).
Needs: thinking style temperature + persona routing + ONNX gate correction.

https://claude.ai/code/session_01ChLvBfpJS8dQhHxRD4pYNp
AdaWorldAPI merged commit 3d4cfc9 into main on Apr 5, 2026
AdaWorldAPI pushed a commit that referenced this pull request Apr 19, 2026
Append FINDING to EPIPHANIES.md. PR #213 + ndarray PR #110 demonstrated
the dumb-bookkeeper pattern: ~90 seconds, Haiku, enumerate+match+append.
Result is a grep-addressable index of every shipped artifact keyed by
the prompt-file brief that birthed it.

For every future "what did we ship about X" query the ledger replaces
a full-codebase grep with a single line: ~25 tokens vs ~25M tokens.
Six orders of magnitude cheaper.

https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
AdaWorldAPI pushed a commit that referenced this pull request Apr 19, 2026
Concrete three-pass recipe added to the cca2a skill:

  Pass 1 — Haiku bookkeeper (~90 s, mechanical): enumerate prompt files,
           match against git log, append one ledger line per pair.
  Pass 2 — Opus meta-synthesizer (read-only inputs): annotate the "none"
           rows from Pass 1 with superseded/open/stale classification.
  Pass 3 — Main thread consumer (sub-second per query): grep the ledger
           for every "what's open / shipped / about X" question.
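
Passes 1 and 3 can be sketched in a few lines. This is a hypothetical illustration: the ledger line format (`prompt | commit subject`), the rule of matching a prompt's file stem against commit subjects, and the sample file names are all assumptions, not the skill's actual format.

```python
def build_ledger(prompt_files, commit_subjects):
    """Pass 1 (bookkeeper): one ledger line per prompt file.

    A prompt is matched to the first commit subject containing its file
    stem; unmatched prompts get "none" for Pass 2 to classify.
    """
    ledger = []
    for prompt in prompt_files:
        stem = prompt.rsplit("/", 1)[-1].rsplit(".", 1)[0]
        match = next((c for c in commit_subjects if stem in c), "none")
        ledger.append(f"{prompt} | {match}")
    return ledger

def grep_ledger(ledger, keyword):
    """Pass 3 (consumer): case-insensitive substring search over the ledger."""
    return [line for line in ledger if keyword.lower() in line.lower()]

# Hypothetical inputs; a real run would read prompt files and `git log`.
prompts = ["prompts/moe_router.md", "prompts/gpu_batching.md"]
commits = ["feat: moe_router top-128 selection", "fix: tokenizer"]

ledger = build_ledger(prompts, commits)
shipped = grep_ledger(ledger, "router")
unmatched = grep_ledger(ledger, "| none")
```

The point of the design is that the consumer never touches the codebase: every query is a substring search over a short, append-only text file.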

Closes three token-waste channels simultaneously:
  - cold-start (20-30 turns → 3-5 turns)
  - find-code (~25M tokens → ~25 tokens, 10⁶×)
  - ambient arc knowledge (30-50% → 0%)

First deployments:
  - PR #213 (lance-graph, 41 prompts mapped, 90 s)
  - PR #110 (ndarray, 25 prompts mapped, 90 s)

Linked from SKILL.md under "What to read when".

https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh