All notable changes to nmoe are documented here.
- Speedrun leaderboard (`LEADERBOARD.json`): Canonical benchmarks for super (256 experts) and ultra (4096 experts) configs on B200 and H100
- Muon optimizer (`nmoe/opt.py`): Moonlight recipe (SGD-Nesterov + Newton-Schulz orthogonalization + consistent update RMS)
- CLI (`n` command): `n speedrun`, `n train`, `n mon`, `n viz`, `n research`
- Bootstrap script (`scripts/bootstrap.sh`): One-command setup for cloud GPU instances
- CORE eval integration: 22-task eval suite runs automatically after speedrun
- nmon TUI (`tools/nmon`): Terminal monitoring for training runs
- nviz dashboard (`nviz/`): Web-based metrics visualization
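The Muon entry above mentions Newton-Schulz orthogonalization. As an illustration only, here is a pure-Python sketch of the classic cubic Newton-Schulz iteration for computing the orthogonal (polar) factor of a gradient matrix; Muon's actual implementation in `nmoe/opt.py` is not shown here, and production Muon variants typically use a tuned quintic polynomial running in low precision on GPU.

```python
import math

def matmul(A, B):
    # Naive dense matmul for small illustration matrices.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def newton_schulz_orthogonalize(G, steps=30):
    # Cubic iteration X <- 1.5*X - 0.5*(X X^T) X converges to the orthogonal
    # factor of G when the start point's spectral norm is below sqrt(3);
    # dividing by the Frobenius norm (an upper bound on it) guarantees that.
    fro = math.sqrt(sum(x * x for row in G for x in row))
    X = [[x / fro for x in row] for row in G]
    for _ in range(steps):
        AX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * X[i][j] - 0.5 * AX[i][j] for j in range(len(X[0]))]
             for i in range(len(X))]
    return X
```

The update direction that results has unit singular values, which is what makes a consistent update RMS across layers possible.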
- Default attention: MLA (Multi-head Latent Attention) for all layers
- Tokenizer: `o200k_harmony` (vocab_size=201088)
- Expert optimizer: Muon for fp8/nvfp4, AdamW for bf16
```bash
git clone https://github.com/Noumena-Network/nmoe.git && cd nmoe
bash scripts/bootstrap.sh
n speedrun
```

- Choice-based eval (`nmoe/eval/choices.py`): MMLU, MMLU-Pro, ARC, HellaSwag, BoolQ, OpenBookQA, WinoGrande, COPA
- Generation-based eval (`nmoe/eval/run_gen.py`): GSM8K, HumanEval, GPQA
- Code execution sandbox (`nmoe/eval/execution.py`) for HumanEval
- HF dataset adapters (`nmoe/eval/adapters.py`) for all choice tasks
- Training integration via `eval_every` config in `nmoe/train.py`
- Choice-based eval uses chunked logsumexp to avoid materializing `(seq, vocab)` logits
- Multi-GPU eval keeps ranks in lockstep by construction (no barriers in inner loops)
- Requires the `datasets` library: `uv pip install datasets`
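The chunked logsumexp noted above streams over the vocabulary dimension instead of holding all `exp` terms at once. A minimal pure-Python sketch of the streaming algorithm on a single row (the repo's version operates on GPU logit tensors):

```python
import math

def chunked_logsumexp(row, chunk=4):
    # Streaming logsumexp: keep a running max `m` and a running sum `s` of
    # exp(x - m); when a new chunk raises the max, rescale `s` rather than
    # ever materializing all exponentials at once.
    m, s = float("-inf"), 0.0
    for i in range(0, len(row), chunk):
        block = row[i:i + chunk]
        new_m = max(m, max(block))
        s = s * math.exp(m - new_m) + sum(math.exp(x - new_m) for x in block)
        m = new_m
    return m + math.log(s)
```

Because only one chunk of exponentials is live at a time, peak memory is proportional to the chunk size rather than to `seq * vocab`.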
- Chunked reduce-scatter in ZeRO-2 (`nmoe/zero2.py`): configurable via `NMOE_ZERO2_RS_CHUNK_MB` (default 2GB)
- Third-party import path setup in `nmoe/runtime.py`
- `nmoe/checkpoint.py`: eager CPU copy before background queue; clean thread shutdown
- `nmoe/train.py`: support `--resume`/`--no-resume` boolean flags
- `nmoe/opt.py`: return dense LR for logging (multi-group compatibility)
- `nmoe/quant.py`: direct csrc import to avoid circular dependencies
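The chunked reduce-scatter entry above bounds communication buffers by a megabyte budget. A hypothetical sketch (the helper name is not nmoe's API) of how a flat gradient buffer can be sliced under such a budget before each slice is reduce-scattered:

```python
def plan_rs_chunks(numel, elem_bytes=4, chunk_mb=2048):
    # Split a flat buffer of `numel` elements into slices no larger than
    # chunk_mb megabytes, mirroring the NMOE_ZERO2_RS_CHUNK_MB knob
    # (default 2 GB = 2048 MB). Each (start, end) slice would then be
    # reduce-scattered independently, capping temporary buffer size.
    max_elems = max(1, (chunk_mb * 1024 * 1024) // elem_bytes)
    return [(start, min(start + max_elems, numel))
            for start in range(0, numel, max_elems)]
```

Smaller chunks trade a little launch overhead for a hard ceiling on transient communication memory.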
- Fused SwiGLU+quant epilogue in W13 GEMM via CuTeDSL/PTX (FP8/NVFP4)
- Capacity-bounded scratch allocation with fail-fast on undersized buffers
- `tests/test_blockscaled_expert_worst_case.py`: worst-case routing test (all tokens → single expert)
- `nmoe/blockscaled/grouped.py`: strict packed SFA validation `[M_pad, sf_k_pad]` (E× memory reduction vs strided)
- `nmoe/moe.py`: forward uses fused gather+quant path; removed dead backward quantization code
- Fixed CUDA OOM under worst-case routing at `batch_size=256, seq_len=4096` on 8×B200
- Eliminated large BF16 intermediate `[M_pad, 2*Dff]` in expert forward path
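To see what the eliminated intermediate was, here is an illustrative pure-Python sketch of the SwiGLU elementwise stage and the size of the BF16 buffer the unfused path would write (function names here are for illustration, not nmoe's API):

```python
import math

def silu(x):
    return x / (1.0 + math.exp(-x))

def swiglu_row(w13_row):
    # One row of the W13 GEMM output: first half is the gate projection,
    # second half the up projection. The unfused path materializes the full
    # [M_pad, 2*Dff] output before this elementwise stage; the fused
    # epilogue computes silu(gate) * up (plus quantization) inside the GEMM
    # and never writes that intermediate.
    half = len(w13_row) // 2
    gate, up = w13_row[:half], w13_row[half:]
    return [silu(g) * u for g, u in zip(gate, up)]

def unfused_intermediate_bytes(m_pad, d_ff, dtype_bytes=2):
    # Size of the BF16 [M_pad, 2*Dff] buffer the fusion eliminates.
    return m_pad * 2 * d_ff * dtype_bytes
```

Under worst-case routing, `M_pad` scales with all tokens landing on one expert, which is how this buffer could drive the OOM noted above.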
- Global/local attention mixing via `attn_global_every`, `attn_local`, and `attn_local_window`
- MLA backward workspace cache to reduce allocator churn (`nmoe/attention/mla.py`)
- `nmoe/model.py`: per-layer attention selection (last layer always global)
- `nmoe/attention/rope.py`: register RoPE `cos`/`sin` as non-persistent buffers (pre-cast, device-safe)
- `nmoe/attention/swa.py`: stricter shape/rope-dim validation; avoids per-forward `start_q` allocation
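The exact selection rule in `nmoe/model.py` is not shown in this changelog; the following is a hypothetical reconstruction of the pattern implied by `attn_global_every` and "last layer always global":

```python
def uses_global_attention(layer_idx, n_layers, attn_global_every=4):
    # Hypothetical rule (not nmoe's actual code): every Nth layer uses
    # global attention, the rest use local sliding-window attention, and
    # the final layer is forced global so the output head always sees
    # full context.
    if layer_idx == n_layers - 1:
        return True
    return layer_idx % attn_global_every == 0
```

Mixing cheap local layers with periodic global layers is a common way to cut attention cost while preserving long-range information flow.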
- Flow profiles and mixture configurations (`configs/flow_profiles.toml`, `configs/tasks.toml`)
- TOML env var expansion with security guardrails (`nmoe/config.load_toml()`)
- Data prep (HF → shards) utilities and CLI (`nmoe/data/*`)
- All config paths now use the `${NMOE_DATA_PATH:-/data}` pattern (env-driven)
- `nmoe/data/__init__.py`: lazy import refactor for dependency hygiene
- Env var expansion restricted to the `NMOE_*` prefix only
- Fail-fast on unresolved `${...}` placeholders
- No internal paths or credentials in tracked configs
- SonicMoE-style RDEP kernel fusion for expert-parallel dispatch/return
- Blockscaled FP8/NVFP4 GEMM support for Blackwell (sm_100a)
- Single-node communication modes: single-GPU and IPC (intra-node NVLink)
- GPU-side atomics for sync (zero NCCL in hot path)
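For readers unfamiliar with the dispatch/return terminology, here is an illustrative pure-Python sketch of the data-movement semantics only: tokens are gathered to their assigned expert ("dispatch"), processed per-expert in batches, and scattered back to their original slots scaled by the router gate ("return"). The RDEP kernels fuse these steps with communication on GPU; none of this code reflects nmoe's actual kernels.

```python
def dispatch_return(tokens, expert_ids, gates, experts):
    # Dispatch: bucket token slots by destination expert.
    buckets = [[] for _ in range(len(experts))]
    for slot, eid in enumerate(expert_ids):
        buckets[eid].append(slot)
    # Each expert processes its bucket as one batch; results are then
    # returned (scattered) to the original slots, scaled by the gate.
    out = [0.0] * len(tokens)
    for eid, slots in enumerate(buckets):
        results = experts[eid]([tokens[s] for s in slots])
        for s, r in zip(slots, results):
            out[s] = gates[s] * r
    return out
```

Keeping the gather/compute/scatter round trip on-device (with GPU-side atomics for synchronization) is what lets the hot path avoid NCCL entirely.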
- `nmoe/csrc/swizzle.cuh`: consolidated CUTLASS SF swizzle math
- `nmoe/bench_moe_e2e.py`: end-to-end MoE throughput/latency benchmark
- `tests/test_rdep_accuracy.py`: forward/backward accuracy tests
- `tests/bench_rdep_kernels.py`: kernel-level benchmarks
- `nmoe/moe.py`: simplified gather/scatter path with `*_return_scatter_from_pad_*`
- `nmoe/blockscaled/grouped.py`: CuTeDSL 4.3.4 JIT compilation with proper arg typing
- Backward pass optimization: dGate identity via new `gather_dy_nogate_*` + `dgate_from_adA_bf16`
- CuTeDSL 4.3.4 JIT arg marshalling (wrap pointers/strides with `Int64`/`Int32`)