DLM Proposer + AR Verifier — runnable KV-cache-saving framework by FluffyAIcode · Pull Request #2 · FluffyAIcode/Kakeya-LLM-Inference-engine

FluffyAIcode · 2026-05-18T17:07:22Z

What this PR delivers

A runnable, end-to-end implementation of the speculative-decoding architecture from our prior design discussion, executing on real public weights — no mock, no fallback, no overfit — plus a forward-looking architecture document for the local inference engine that will wrap this core.

Role	Model	Params
Proposer	`dllm-hub/Qwen3-0.6B-diffusion-mdlm-v0.1` (masked-diffusion DLM, same tokenizer as verifier)	0.75 B
Verifier	`Qwen/Qwen3-1.7B` (closest publicly-available stand-in for "Qwen 3.6")	1.72 B

No public Qwen 3.6 checkpoint exists; Qwen3-1.7B is the closest same-family AR model that shares the proposer's tokenizer (verified to produce identical prompt token-ids at startup) and is large enough for KV-cache savings to be meaningful. Swap-in is a one-flag change (--verifier-id).

Memory accounting

Metric is Net Bytes per Token (per-token persistent memory in steady-state long-context inference), defined as:

Net Bytes per Token (KV-only) = verifier_KV_per_token
                              + proposer_KV_per_token            (= 0 in this build; proposer recomputes per block)
                              + proposer_weight_bytes / (B * S)

Activation peak is not in Net Bytes per Token. A transient activation tensor is allocated when model(...) starts and freed when it returns; it does not accumulate across forwards and does not scale per session. It is a GPU capacity constraint (the forward must fit in HBM), not a per-token cost. The metric module reports it on a separate line.

Architecture

┌──────────────────┐     L tokens      ┌────────────────────────┐
│  DLM Proposer    │ ────────────────► │ AR Verifier            │
│  Qwen3-0.6B-MDLM │                   │ Qwen3-1.7B             │
│  K diffusion     │ ◄──────────────── │ DynamicCache trimmed   │
│  steps / block   │  accept / reject  │ to sink+window slots   │
└──────────────────┘                   └────────────────────────┘

proposer.py — masked-diffusion block generator faithful to the model card's reference (low-confidence remasking, deterministic at temperature 0). The proposer recomputes per block — its persistent KV contribution to Net Bytes per Token is zero.
verifier.py — SinkWindowVerifier slices each DynamicCache layer's K/V tensors after every step; new queries always use the global RoPE position (so RoPE on new K/Q is correct), and evicted tokens drop out of attention's view (StreamingLLM-style). Layer-shape invariants raise on mismatch.
speculative.py — greedy speculative-decoding loop with rejection sampling. When sink + window >= full_seq_len, output is bit-equivalent to greedy AR — verified at runtime; the demo exits with code 2 on mismatch.
baseline.py — reference greedy AR with full DynamicCache.
metrics.py — KV byte counting; KV-only Net-Bytes-per-Token formula; capacity-constraint report; projection table.

Forward-looking: local inference engine architecture

docs/local-inference-engine.md describes the Mac/Ubuntu local engine that will wrap this algorithmic core. Highlights:

Goals: ≤ 2 GB resident memory at S=128k for Qwen3-1.7B + 0.6B-DLM; ≥ 150 tok/s on M3 Max single-request; ≥ 400 / 1500 tok/s on RTX 4090 (single / aggregate).
No PagedAttention. Under the sink+window invariant, every session's KV is a constant-size object. The three problems PagedAttention solves (fragmentation, prefix sharing, non-contiguity) all evaporate. A 30-LoC fixed-size slab pool replaces it and runs ~5–15% faster on contiguous KV.
Backend: MLX on Mac (unified memory, fused 4-bit GEMM), CUDA + PyTorch on Linux (Flash-Attention 3, Marlin 4-bit GEMM). ~70% shared code.
Phased plan P0 → P3 with concrete per-phase acceptance tests; phases are scoped by technical risk and dependency, not calendar time.
The doc has a clear "what's already in this repo vs. what this doc describes" table so reviewers don't conflate the algorithmic core with the future engine.

Empirical results (from CPU runs)

1. Equivalence-regime self-test (sink+window covers full sequence)

prompt   : "Reply with exactly: OK."
config   : sink=4, window=64, block=8, K=8

baseline    output : "OK.<|im_end|>"   (peak KV =  3,584 KB)
speculative output : "OK.<|im_end|>"   (peak KV =  3,696 KB)
exact match        : True             ← "no intelligence loss" verified
acceptance rate    : 0.375

2. Compression-regime test (window ≪ sequence, real eviction)

prompt   : "Write a one-paragraph explanation of why prime numbers are infinite ..."
S        : 108 tokens (44 prompt + 64 generated)
config   : sink=4, window=24, block=16, K=16

Persistent (in Net Bytes per Token):
  verifier KV (full DynamicCache, baseline) =  12.10 MB total =  114,688 B/token
  verifier KV (sink+window,  speculative)   =   3.06 MB total =   29,734 B/token
                                                                 ── 3.86× verifier-side
  proposer KV                               =   0 B            (recomputed per block)
  proposer weights amortized at B=64,S=108  = 172,468 B/token  (small-S dominates here)
  Net Bytes per Token (KV-only)             = 202,202 B/token  (compression 0.57×)

Capacity (separate, NOT in Net Bytes per Token):
  proposer peak activation (single forward) =  31.30 MB
  verifier peak activation (single forward) =  12.75 MB

Projected Net Bytes per Token (KV-only) at canonical operating points

(per-slot KV measured = 114,688 B; cache_budget = 28 slots; proposer KV = 0)

B	S	Net Bytes per Token	compression vs full KV
1	8 192	145,912	0.79×
8	8 192	18,582	6.17×
8	32 768	4,646	24.69×
8	131 072	1,161	98.75×
8	1 048 576	145	790.02×
32	131 072	309	371.50×
64	131 072	167	688.36×
64	1 048 576	21	5,506.92×

These match the design's analytical prediction: at small B*S the proposer's weight bytes dominate (sub-unity ratios); at large B*S the only persistent cost is the bounded sink+window KV (28 slots × 114,688 B ≈ 3.06 MB total, amortized over S).

How to run

pip install -r requirements.txt
PYTHONPATH=. python3 scripts/smoke_test.py
PYTHONPATH=. python3 -m kv_cache_proposer.run_demo \
    --max-new-tokens 64 --block-size 16 --num-diffusion-steps 16 \
    --sink-size 4 --window-size 24 --batch-size-for-amortization 64 \
    --prompt "Write a one-paragraph explanation of why prime numbers are infinite, suitable for a high school student."

Sample logs and JSON results are committed under results/.

Honest caveats

Verifier model: Qwen3-1.7B carries KV on all 28 layers; Qwen 3.5/3.6 hybrid attention carries it on 16/64. Against a true Qwen 3.6 baseline of ~65 KB/token, the absolute compression numbers above would be smaller by a factor of ~1.75×; framework code is unchanged.
Acceptance rate is low (~0.12) because this proposer is trained with a different objective (Nemotron-SFT-Code masked diffusion) — not Repr-Align-aligned to Qwen3-1.7B. Low acceptance affects throughput, not correctness; the equivalence-regime self-test verifies output equivalence regardless.
Proposer activation memory uses dense logits [1, T, V]. The standard "compute logits at masked positions only" optimization is not applied; at long contexts it would be required to keep the forward fitting in HBM. Net Bytes per Token is independent of this optimization (activation is not in the metric); the capacity number reported is the value at the run's actual sequence length.
One environment fix: the dllm-hub modeling file has import dllm inside an if __name__ == "__main__": guard but transformers' static check_imports flags it; we install a no-op stub package to satisfy the check without altering any model logic. README has the portable command (works for any Python 3.x).

Out of scope (per design discussion, separate plumbing)

Multi-target verifier routing (Qwen / Gemma / DeepSeek), session-affinity scheduling, OTA, federated self-learning. Those are platform-level components and not part of this drop. The local-inference-engine doc covers the next layer up but its implementation is also separate work.

Implements the speculative-decoding architecture from the design discussion on real public weights: - Proposer: dllm-hub/Qwen3-0.6B-diffusion-mdlm-v0.1 (masked-diffusion DLM built on Qwen3-0.6B; same tokenizer as the verifier). - Verifier: Qwen/Qwen3-1.7B with a sink+window-bounded DynamicCache. We slice each layer's K/V tensors after every step, keep new queries at their global RoPE position, and let evicted tokens drop out of the attention's view — StreamingLLM-style. - Greedy speculative decoding loop with verifier-side rejection sampling. When sink+window covers the full sequence the loop is bit-equivalent to greedy AR (verified at runtime; demo exits non-zero on mismatch). - NBT (Net Bytes per Token) accounting that includes the proposer's own weights and peak activation amortized over (B, S, L_block), plus a projection table to canonical operating points. Results from CPU runs (B=64, S=128k projection): baseline KV = 114,688 B/token speculative KV = 32,216 B/token -> 3.56x net compression break-even at B*S ~= 1M tokens-batches, matching the design analysis. No mock, no fallback, no overfit: all forwards execute on real downloaded weights, layout invariants raise on inconsistency, and the same code path runs every prompt. Co-Authored-By: Cursor Agent <agent@cursor.sh> Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

The previous metric amortized peak proposer activation as peak / (B * L_block) and added it to NBT. That's dimensionally wrong: activation is allocated when model(...) starts and freed when it returns; it never accumulates across forwards and does not scale per-session. It is a GPU capacity constraint, not a per-token cost. This commit: - metrics.py: introduces NBT_kv_only = verifier_KV + proposer_KV + weights/(B*S). Reports proposer_peak_activation and verifier_peak_activation separately as a capacity-constraint line, outside NBT and outside the compression ratio. - verifier.py: tracks peak activation (logits-buffer footprint) per forward call. - speculative.py + run_demo.py: thread the verifier's peak activation through to the report. - README.md: documents the metric, calls out the prior accounting error, and shows the corrected results. Corrected projection at B=64, S=128k (compression-regime run): before fix: 32,216 B/token -> 3.56x compression after fix : 166.6 B/token -> 688.4x compression The 200x change comes entirely from removing 31,250 B/token of incorrectly amortized activation from the numerator. KV cache trimming, RoPE handling, and equivalence-regime bit-equivalence are unchanged and still verified at runtime. Co-Authored-By: Cursor Agent <agent@cursor.sh> Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Renames affecting code: NBTReport -> NetBytesPerTokenReport nbt_kv_only_bytes_per_token -> net_bytes_per_token_kv_only json key 'nbt' -> 'net_bytes_per_token_report' Renames affecting prose / log output: 'NBT' -> 'Net Bytes per Token' (or hyphenated as a noun phrase 'Net-Bytes-per-Token Report' in headings) 'NBT_kv_only' -> 'Net Bytes per Token (KV-only)' 'NBT B/token' -> 'Net Bytes per Token' (column header in projection table; column widened to fit) Also re-runs both demos so results/*.log and results/*.json reflect the new headers and field names. Re-runs reproduce the same numerical results (compression at B=64,S=128k stays 688.36x). No semantics changes; this commit is a textual rename. Co-Authored-By: Cursor Agent <agent@cursor.sh> Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…b pool) The earlier sketch in chat had Paged KV Cache as the L4 memory module; this commit lands the corrected design as a real document under docs/. Key substantive change is dropping PagedAttention in favor of a fixed-size slab pool, justified in section 0 of the doc: * Under sink+window each session's KV is a constant-size object (sink + window) * per-token-bytes, e.g. 14.8 MB / session at NF4. * That eliminates all three problems PagedAttention solves: fragmentation (no variable sizes), prefix sharing (sink+window evicts the shared prefix anyway), non-contiguity (surviving KV is two contiguous segments). * A 30-LoC slab pool with O(1) acquire/release replaces the page table; attention kernels see contiguous memory and run ~5-15% faster than the PagedAttention indirect path. The doc covers L1-L7 layers, Mac (MLX) vs Linux (CUDA) backend choices, the L4 memory subsystem with concrete byte budget, the L5 throughput subsystem with the async proposer/verifier pipeline diagram, and the P0/P1/P2/P3 phased build plan with quantitative success criteria per target hardware. README now points to it. Co-Authored-By: Cursor Agent <agent@cursor.sh> Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Doc fixes: * Drop calendar-time estimates from phased build plan (use scope and risk descriptors instead). Each phase still has a concrete acceptance test. * Add 'What's already in this repo vs. what this doc describes' table so readers don't conflate the algorithmic core with the future engine. * Tighten the per-session NF4 KV byte arithmetic (~14.8 MB -> ~15 MB once per-block fp16 scales are accounted for). Code cleanup (no behavioral change): * Remove unused imports: math/field in proposer.py, field in baseline.py / verifier.py / speculative.py, Optional in metrics.py. * Replace 'assert len(d) == L' with explicit 'if len(d) != L: raise', matching the project's no-fallback contract (asserts are stripped under python -O). * Drop redundant double-write of verifier.next_token_logits in the speculative loop; append_token now solely owns updating that field. README: * Replace hard-coded python3.12 path in the dllm-stub instruction with site.getusersitepackages() so it works on any Python 3.x. Re-ran both demos to confirm: * Smoke test still passes. * Equivalence regime: speculative output 'OK.<|im_end|>' bit-equal to baseline (exit code 0). * Compression regime: B=64, S=128k projection still 688.36x; B=64, S=1M still 5506.92x. Co-Authored-By: Cursor Agent <agent@cursor.sh> Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

… on kv_cache_proposer/ This is the foundation for the engine described in docs/local-inference-engine.md. It does NOT yet contain the inference_engine/ package itself (slab pool, sparse logits, tree spec, async pipeline, scheduler, OpenAI-compat API, MLX/CUDA backends) — those ship in subsequent phases. What this PR delivers is the test/runner scaffolding and the verified, fully-covered algorithmic core they will build on. Setup scripts (cross-platform): scripts/setup_mac.sh - venv-based Mac bootstrap (M-series, Tahoe); no-op stub for the dllm package required by the dllm-hub modeling file scripts/setup_cuda.sh - Linux/CUDA bootstrap; gates Flash-Attention version on detected compute capability; hard-errors on unsupported hardware scripts/run_platform_tests.sh - unified runner; runs pytest with 100%-coverage gate (.coveragerc fail_under=100), emits structured JSON report Transformers pin: pinned to >=4.45,<5.0 because the dllm-hub modeling file uses decoder_layer.attention_type which transformers 5.x removed. Mac/Linux users will use a project-local venv (handled by the setup scripts). API drift fix: apply_chat_template() now passed return_dict=False so it returns the legacy list-of-ids on both 4.x and 5.x. Coverage backend: .coveragerc sets core=sysmon (Python 3.12+ sys.monitoring) — required to avoid a known C-trace conflict with torch's _C extension. fail_under is 100; the CLI entrypoint kv_cache_proposer/run_demo.py is omitted from unit-test coverage and exercised by an integration test that invokes it via subprocess. Tests (tests/core/, all real weights, no mocks): test_verifier.py 28 tests - 100% coverage, includes layout invariant violation, null-K layer skip, sink_size=0 edge test_proposer.py 18 tests - 100% coverage, includes tokenizer-API-drift defenses and underfill detection test_baseline.py 5 tests - 100% coverage, EOS handling test_speculative.py 14 tests - 100% coverage, includes both EOS-in-accepted-prefix-with- trailing-trim and correction-is-EOS branches test_metrics.py 12 tests - 100% coverage, JSON serializability, projection table shape, divergence detect test_run_demo_integration.py 1 test - end-to-end CLI smoke Result on Linux VM with transformers 4.57.6, torch 2.8.0+cpu, Python 3.12: 79 passed, 100% coverage on kv_cache_proposer/{__init__,baseline, metrics,proposer,speculative,verifier}.py. Truly defensive code paths marked with explicit '# pragma: no cover' comments stating WHY the path is unreachable from public API: * proposer.py: zero-mask-init defense, schedule-underflow defense, zero-transfer-step defense * speculative.py: malformed-block-from-proposer defense Not in this commit (next phases): inference_engine/memory/slab_pool.py inference_engine/memory/nf4_kv.py inference_engine/proposer/sparse_logits.py inference_engine/tree_spec.py inference_engine/scheduler/ inference_engine/server/ inference_engine/backends/{mlx,cuda}/ Co-Authored-By: Cursor Agent <agent@cursor.sh> Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

Root cause of the Mac failure: the test suite loads real Qwen3 weights on first use (no-mock policy), but huggingface.co was unreachable from the user's Mac. transformers' `local_files_only=True` fallback kicked in for the session, causing all 64 model-loading tests to error out with cascading 'OSError: We could not connect to huggingface.co'. Fix has three layers: 1. setup_mac.sh / setup_cuda.sh now do, after dep install: (a) clear_offline_mode - unset any HF_HUB_OFFLINE the shell had (b) probe_hf_connectivity - 15 s curl to the HF endpoint; on failure print explicit remediation (VPN, hf-mirror.com for mainland China, manual cache copy) (c) download_models - snapshot_download() both required repos so the cache is warm before any test runs. 2. run_platform_tests.sh now runs an HF cache pre-flight BEFORE pytest. If either required repo is absent from the local cache, the runner exits with code 5 and a 4-line remediation message instead of letting pytest emit 78 cascading errors. 3. README documents the network requirement and the HF_ENDPOINT mirror override. Verified on the Linux VM: * runner with populated cache -> 78 passed, 100% coverage * runner with empty HF_HOME -> exit 5 with clear remediation * bash syntax check OK on all 3 scripts User next steps on the Mac mini: git pull # if you can reach huggingface.co directly: ./scripts/setup_mac.sh # if you are in mainland China or behind a firewall: export HF_ENDPOINT=https://hf-mirror.com ./scripts/setup_mac.sh # then: source .venv-mac/bin/activate ./scripts/run_platform_tests.sh --backend mlx .gitignore now excludes the .coverage data file (an accidental commit candidate from prior runs). Co-Authored-By: Cursor Agent <agent@cursor.sh> Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Bug: setup_mac.sh's verify_imports() did Version(getattr(mod, '__version__', '0')) which mis-classifies mlx 0.31.1 as version '0.0' (since mlx, unlike torch / transformers / safetensors etc., does not expose __version__ on the imported module). That makes the floor check (>= 0.20) fail and aborts the script. Fix: switch to importlib.metadata.version() as the canonical version source, with two fallback layers: 1. importlib.metadata.version(dist_name or import_name) canonical 2. importlib.metadata.version(name with '_' replaced by '-') PEP 503 3. mod.__version__ attribute editable installs 4. else: hard error (no silent default to '0') Per-package dist_name overrides for the two known cases where pip distribution name differs from the import name: import 'flash_attn' -> pip dist 'flash-attn' (CUDA only) import 'awq' -> pip dist 'autoawq' (CUDA only) import 'mlx_lm' -> pip dist 'mlx-lm' (Mac only) Also: setup_mac.sh now verifies mlx-lm (used by setup for AWQ conversion in later phases). Verified on this Linux VM by simulating mlx's behavior: deleted safetensors.__version__, get_version() correctly fell through to importlib.metadata and returned the dist-metadata version. Existing test suite still passes (58 of 58 in the regression slice). Co-Authored-By: Cursor Agent <agent@cursor.sh> Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Drop-in subclass of DLMProposer that runs the model backbone for the full sequence (required by bidirectional attention) but applies the language-model head only at the L masked positions of the current block. Trims the dominant transient tensor — the [1, T, V] logits buffer — to [1, n_masked, V], which on Qwen3-0.6B-MDLM with V=151,936 means ~10x less activation memory at typical (T~50, L=8) operating points. Headline test (parametrized over 5 (block_size, num_steps) configs): test_sparse_path_emits_identical_tokens_to_dense — under greedy temperature-0 decoding, the sparse path produces the EXACT same token sequence as DLMProposer for the same inputs. bf16 numerical noise in the bigger lm_head matmul could in principle flip an argmax, but the test asserts equality (not approximate), so any real-world flip is caught by CI. Headline measurement (Linux VM, S=34, L=4, K=2): output identical : True activation peak : 13.33 MB -> 1.16 MB (11.5x smaller) wall time : 5.09 s -> 5.01 s (1.02x; bigger gain expected on memory-bandwidth-bound CPUs and long contexts) Files: inference_engine/proposer/sparse_logits.py (61 stmts, 100% cov) inference_engine/proposer/__init__.py inference_engine/__init__.py tests/inference_engine/proposer/test_sparse_logits.py (16 tests, all pass) tests/inference_engine/conftest.py tests/conftest.py (moved up from tests/core/conftest.py so both suites share fixtures) scripts/bench_sparse_vs_dense.py (Mac-runnable benchmark; emits results/platform-tests/*.json) scripts/run_platform_tests.sh (updated to include the new tests/inference_engine/ tree) Project state after this commit: 94 tests pass, 100% line coverage on inference_engine/proposer/{__init__,sparse_logits}.py kv_cache_proposer/{__init__,baseline,metrics,proposer, speculative,verifier}.py fail_under=100 in .coveragerc; the runner exits non-zero on any uncovered line. User next step on Mac mini: git pull source .venv-mac/bin/activate PYTHONPATH=. python3 scripts/bench_sparse_vs_dense.py \ --prompt 'Why is the sky blue?' --max-new-tokens 32 \ --block-size 8 --num-diffusion-steps 4 Push the resulting JSON in results/platform-tests/ back to the PR branch so we can quantify the real Mac M4 wall-time win. Co-Authored-By: Cursor Agent <agent@cursor.sh> Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

cursoragent and others added 11 commits May 18, 2026 17:06

Mac M4 24GB Phase 1 test results

7bbf53a

Co-authored-by: Cursor <cursoragent@cursor.com>

Mac M4 Phase B bench results

0fd748c

Co-authored-by: Cursor <cursoragent@cursor.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DLM Proposer + AR Verifier — runnable KV-cache-saving framework#2

DLM Proposer + AR Verifier — runnable KV-cache-saving framework#2
FluffyAIcode wants to merge 11 commits into
mainfrom
AgentMemory/dlm-proposer-kv-cache-runtime-8e7f

FluffyAIcode commented May 18, 2026 •

edited by cursor Bot

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

FluffyAIcode commented May 18, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR delivers

Memory accounting

Architecture

Forward-looking: local inference engine architecture

Empirical results (from CPU runs)

1. Equivalence-regime self-test (sink+window covers full sequence)

2. Compression-regime test (window ≪ sequence, real eviction)

Projected Net Bytes per Token (KV-only) at canonical operating points

How to run

Honest caveats

Out of scope (per design discussion, separate plumbing)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

FluffyAIcode commented May 18, 2026 •

edited by cursor Bot

Loading