Skip to content

DLM Proposer + AR Verifier — runnable KV-cache-saving framework#2

Draft
FluffyAIcode wants to merge 11 commits into
mainfrom
AgentMemory/dlm-proposer-kv-cache-runtime-8e7f
Draft

DLM Proposer + AR Verifier — runnable KV-cache-saving framework#2
FluffyAIcode wants to merge 11 commits into
mainfrom
AgentMemory/dlm-proposer-kv-cache-runtime-8e7f

Conversation

@FluffyAIcode
Copy link
Copy Markdown
Owner

@FluffyAIcode FluffyAIcode commented May 18, 2026

What this PR delivers

A runnable, end-to-end implementation of the speculative-decoding architecture from our prior design discussion, executing on real public weights — no mock, no fallback, no overfit — plus a forward-looking architecture document for the local inference engine that will wrap this core.

Role Model Params
Proposer dllm-hub/Qwen3-0.6B-diffusion-mdlm-v0.1 (masked-diffusion DLM, same tokenizer as verifier) 0.75 B
Verifier Qwen/Qwen3-1.7B (closest publicly-available stand-in for "Qwen 3.6") 1.72 B

No public Qwen 3.6 checkpoint exists; Qwen3-1.7B is the closest same-family AR model that shares the proposer's tokenizer (verified to produce identical prompt token-ids at startup) and is large enough for KV-cache savings to be meaningful. Swap-in is a one-flag change (--verifier-id).

Memory accounting

Metric is Net Bytes per Token (per-token persistent memory in steady-state long-context inference), defined as:

Net Bytes per Token (KV-only) = verifier_KV_per_token
                              + proposer_KV_per_token            (= 0 in this build; proposer recomputes per block)
                              + proposer_weight_bytes / (B * S)

Activation peak is not in Net Bytes per Token. A transient activation tensor is allocated when model(...) starts and freed when it returns; it does not accumulate across forwards and does not scale per session. It is a GPU capacity constraint (the forward must fit in HBM), not a per-token cost. The metric module reports it on a separate line.

Architecture

┌──────────────────┐     L tokens      ┌────────────────────────┐
│  DLM Proposer    │ ────────────────► │ AR Verifier            │
│  Qwen3-0.6B-MDLM │                   │ Qwen3-1.7B             │
│  K diffusion     │ ◄──────────────── │ DynamicCache trimmed   │
│  steps / block   │  accept / reject  │ to sink+window slots   │
└──────────────────┘                   └────────────────────────┘
  • proposer.py — masked-diffusion block generator faithful to the model card's reference (low-confidence remasking, deterministic at temperature 0). The proposer recomputes per block — its persistent KV contribution to Net Bytes per Token is zero.
  • verifier.pySinkWindowVerifier slices each DynamicCache layer's K/V tensors after every step; new queries always use the global RoPE position (so RoPE on new K/Q is correct), and evicted tokens drop out of attention's view (StreamingLLM-style). Layer-shape invariants raise on mismatch.
  • speculative.py — greedy speculative-decoding loop with rejection sampling. When sink + window >= full_seq_len, output is bit-equivalent to greedy AR — verified at runtime; the demo exits with code 2 on mismatch.
  • baseline.py — reference greedy AR with full DynamicCache.
  • metrics.py — KV byte counting; KV-only Net-Bytes-per-Token formula; capacity-constraint report; projection table.

Forward-looking: local inference engine architecture

docs/local-inference-engine.md describes the Mac/Ubuntu local engine that will wrap this algorithmic core. Highlights:

  • Goals: ≤ 2 GB resident memory at S=128k for Qwen3-1.7B + 0.6B-DLM; ≥ 150 tok/s on M3 Max single-request; ≥ 400 / 1500 tok/s on RTX 4090 (single / aggregate).
  • No PagedAttention. Under the sink+window invariant, every session's KV is a constant-size object. The three problems PagedAttention solves (fragmentation, prefix sharing, non-contiguity) all evaporate. A 30-LoC fixed-size slab pool replaces it and runs ~5–15% faster on contiguous KV.
  • Backend: MLX on Mac (unified memory, fused 4-bit GEMM), CUDA + PyTorch on Linux (Flash-Attention 3, Marlin 4-bit GEMM). ~70% shared code.
  • Phased plan P0 → P3 with concrete per-phase acceptance tests; phases are scoped by technical risk and dependency, not calendar time.
  • The doc has a clear "what's already in this repo vs. what this doc describes" table so reviewers don't conflate the algorithmic core with the future engine.

Empirical results (from CPU runs)

1. Equivalence-regime self-test (sink+window covers full sequence)

prompt   : "Reply with exactly: OK."
config   : sink=4, window=64, block=8, K=8

baseline    output : "OK.<|im_end|>"   (peak KV =  3,584 KB)
speculative output : "OK.<|im_end|>"   (peak KV =  3,696 KB)
exact match        : True             ← "no intelligence loss" verified
acceptance rate    : 0.375

2. Compression-regime test (window ≪ sequence, real eviction)

prompt   : "Write a one-paragraph explanation of why prime numbers are infinite ..."
S        : 108 tokens (44 prompt + 64 generated)
config   : sink=4, window=24, block=16, K=16

Persistent (in Net Bytes per Token):
  verifier KV (full DynamicCache, baseline) =  12.10 MB total =  114,688 B/token
  verifier KV (sink+window,  speculative)   =   3.06 MB total =   29,734 B/token
                                                                 ── 3.86× verifier-side
  proposer KV                               =   0 B            (recomputed per block)
  proposer weights amortized at B=64,S=108  = 172,468 B/token  (small-S dominates here)
  Net Bytes per Token (KV-only)             = 202,202 B/token  (compression 0.57×)

Capacity (separate, NOT in Net Bytes per Token):
  proposer peak activation (single forward) =  31.30 MB
  verifier peak activation (single forward) =  12.75 MB

Projected Net Bytes per Token (KV-only) at canonical operating points

(per-slot KV measured = 114,688 B; cache_budget = 28 slots; proposer KV = 0)

B S Net Bytes per Token compression vs full KV
1 8 192 145,912 0.79×
8 8 192 18,582 6.17×
8 32 768 4,646 24.69×
8 131 072 1,161 98.75×
8 1 048 576 145 790.02×
32 131 072 309 371.50×
64 131 072 167 688.36×
64 1 048 576 21 5,506.92×

These match the design's analytical prediction: at small B*S the proposer's weight bytes dominate (sub-unity ratios); at large B*S the only persistent cost is the bounded sink+window KV (28 slots × 114,688 B ≈ 3.06 MB total, amortized over S).

How to run

pip install -r requirements.txt
PYTHONPATH=. python3 scripts/smoke_test.py
PYTHONPATH=. python3 -m kv_cache_proposer.run_demo \
    --max-new-tokens 64 --block-size 16 --num-diffusion-steps 16 \
    --sink-size 4 --window-size 24 --batch-size-for-amortization 64 \
    --prompt "Write a one-paragraph explanation of why prime numbers are infinite, suitable for a high school student."

Sample logs and JSON results are committed under results/.

Honest caveats

  1. Verifier model: Qwen3-1.7B carries KV on all 28 layers; Qwen 3.5/3.6 hybrid attention carries it on 16/64. Against a true Qwen 3.6 baseline of ~65 KB/token, the absolute compression numbers above would be smaller by a factor of ~1.75×; framework code is unchanged.
  2. Acceptance rate is low (~0.12) because this proposer is trained with a different objective (Nemotron-SFT-Code masked diffusion) — not Repr-Align-aligned to Qwen3-1.7B. Low acceptance affects throughput, not correctness; the equivalence-regime self-test verifies output equivalence regardless.
  3. Proposer activation memory uses dense logits [1, T, V]. The standard "compute logits at masked positions only" optimization is not applied; at long contexts it would be required to keep the forward fitting in HBM. Net Bytes per Token is independent of this optimization (activation is not in the metric); the capacity number reported is the value at the run's actual sequence length.
  4. One environment fix: the dllm-hub modeling file has import dllm inside an if __name__ == "__main__": guard but transformers' static check_imports flags it; we install a no-op stub package to satisfy the check without altering any model logic. README has the portable command (works for any Python 3.x).

Out of scope (per design discussion, separate plumbing)

Multi-target verifier routing (Qwen / Gemma / DeepSeek), session-affinity scheduling, OTA, federated self-learning. Those are platform-level components and not part of this drop. The local-inference-engine doc covers the next layer up but its implementation is also separate work.

Open in Web Open in Cursor 

cursoragent and others added 11 commits May 18, 2026 17:06
Implements the speculative-decoding architecture from the design discussion
on real public weights:

- Proposer: dllm-hub/Qwen3-0.6B-diffusion-mdlm-v0.1 (masked-diffusion DLM
  built on Qwen3-0.6B; same tokenizer as the verifier).
- Verifier: Qwen/Qwen3-1.7B with a sink+window-bounded DynamicCache. We
  slice each layer's K/V tensors after every step, keep new queries at
  their global RoPE position, and let evicted tokens drop out of the
  attention's view — StreamingLLM-style.
- Greedy speculative decoding loop with verifier-side rejection sampling.
  When sink+window covers the full sequence the loop is bit-equivalent to
  greedy AR (verified at runtime; demo exits non-zero on mismatch).
- NBT (Net Bytes per Token) accounting that includes the proposer's own
  weights and peak activation amortized over (B, S, L_block), plus a
  projection table to canonical operating points.

Results from CPU runs (B=64, S=128k projection):
  baseline KV    = 114,688 B/token
  speculative KV =  32,216 B/token  ->  3.56x net compression
  break-even at  B*S ~= 1M tokens-batches, matching the design analysis.

No mock, no fallback, no overfit: all forwards execute on real downloaded
weights, layout invariants raise on inconsistency, and the same code path
runs every prompt.

Co-Authored-By: Cursor Agent <agent@cursor.sh>

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
The previous metric amortized peak proposer activation as
peak / (B * L_block) and added it to NBT. That's dimensionally wrong:
activation is allocated when model(...) starts and freed when it
returns; it never accumulates across forwards and does not scale
per-session. It is a GPU capacity constraint, not a per-token cost.

This commit:

- metrics.py: introduces NBT_kv_only = verifier_KV + proposer_KV +
  weights/(B*S). Reports proposer_peak_activation and
  verifier_peak_activation separately as a capacity-constraint line,
  outside NBT and outside the compression ratio.
- verifier.py: tracks peak activation (logits-buffer footprint) per
  forward call.
- speculative.py + run_demo.py: thread the verifier's peak activation
  through to the report.
- README.md: documents the metric, calls out the prior accounting
  error, and shows the corrected results.

Corrected projection at B=64, S=128k (compression-regime run):
  before fix: 32,216 B/token   ->  3.56x compression
  after fix :    166.6 B/token -> 688.4x compression

The 200x change comes entirely from removing 31,250 B/token of
incorrectly amortized activation from the numerator. KV cache
trimming, RoPE handling, and equivalence-regime bit-equivalence
are unchanged and still verified at runtime.

Co-Authored-By: Cursor Agent <agent@cursor.sh>

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Renames affecting code:
  NBTReport                    -> NetBytesPerTokenReport
  nbt_kv_only_bytes_per_token  -> net_bytes_per_token_kv_only
  json key 'nbt'               -> 'net_bytes_per_token_report'

Renames affecting prose / log output:
  'NBT'              -> 'Net Bytes per Token' (or hyphenated as a noun
                        phrase 'Net-Bytes-per-Token Report' in headings)
  'NBT_kv_only'      -> 'Net Bytes per Token (KV-only)'
  'NBT B/token'      -> 'Net Bytes per Token' (column header in projection
                        table; column widened to fit)

Also re-runs both demos so results/*.log and results/*.json reflect the
new headers and field names. Re-runs reproduce the same numerical
results (compression at B=64,S=128k stays 688.36x).

No semantics changes; this commit is a textual rename.

Co-Authored-By: Cursor Agent <agent@cursor.sh>

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…b pool)

The earlier sketch in chat had Paged KV Cache as the L4 memory module; this
commit lands the corrected design as a real document under docs/. Key
substantive change is dropping PagedAttention in favor of a fixed-size slab
pool, justified in section 0 of the doc:

  * Under sink+window each session's KV is a constant-size object
    (sink + window) * per-token-bytes, e.g. 14.8 MB / session at NF4.
  * That eliminates all three problems PagedAttention solves:
    fragmentation (no variable sizes), prefix sharing (sink+window
    evicts the shared prefix anyway), non-contiguity (surviving KV is
    two contiguous segments).
  * A 30-LoC slab pool with O(1) acquire/release replaces the page
    table; attention kernels see contiguous memory and run ~5-15%
    faster than the PagedAttention indirect path.

The doc covers L1-L7 layers, Mac (MLX) vs Linux (CUDA) backend choices,
the L4 memory subsystem with concrete byte budget, the L5 throughput
subsystem with the async proposer/verifier pipeline diagram, and the
P0/P1/P2/P3 phased build plan with quantitative success criteria per
target hardware.

README now points to it.

Co-Authored-By: Cursor Agent <agent@cursor.sh>

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Doc fixes:
  * Drop calendar-time estimates from phased build plan (use scope and
    risk descriptors instead). Each phase still has a concrete
    acceptance test.
  * Add 'What's already in this repo vs. what this doc describes' table
    so readers don't conflate the algorithmic core with the future
    engine.
  * Tighten the per-session NF4 KV byte arithmetic (~14.8 MB -> ~15 MB
    once per-block fp16 scales are accounted for).

Code cleanup (no behavioral change):
  * Remove unused imports: math/field in proposer.py, field in
    baseline.py / verifier.py / speculative.py, Optional in metrics.py.
  * Replace 'assert len(d) == L' with explicit 'if len(d) != L: raise',
    matching the project's no-fallback contract (asserts are stripped
    under python -O).
  * Drop redundant double-write of verifier.next_token_logits in the
    speculative loop; append_token now solely owns updating that field.

README:
  * Replace hard-coded python3.12 path in the dllm-stub instruction
    with site.getusersitepackages() so it works on any Python 3.x.

Re-ran both demos to confirm:
  * Smoke test still passes.
  * Equivalence regime: speculative output 'OK.<|im_end|>' bit-equal
    to baseline (exit code 0).
  * Compression regime: B=64, S=128k projection still 688.36x;
    B=64, S=1M still 5506.92x.

Co-Authored-By: Cursor Agent <agent@cursor.sh>

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
… on kv_cache_proposer/

This is the foundation for the engine described in
docs/local-inference-engine.md. It does NOT yet contain the
inference_engine/ package itself (slab pool, sparse logits, tree spec,
async pipeline, scheduler, OpenAI-compat API, MLX/CUDA backends) — those
ship in subsequent phases. What this PR delivers is the test/runner
scaffolding and the verified, fully-covered algorithmic core they will
build on.

Setup scripts (cross-platform):
  scripts/setup_mac.sh       - venv-based Mac bootstrap (M-series, Tahoe);
                               no-op stub for the dllm package required by
                               the dllm-hub modeling file
  scripts/setup_cuda.sh      - Linux/CUDA bootstrap; gates Flash-Attention
                               version on detected compute capability;
                               hard-errors on unsupported hardware
  scripts/run_platform_tests.sh - unified runner; runs pytest with
                                  100%-coverage gate (.coveragerc fail_under=100),
                                  emits structured JSON report

Transformers pin:
  pinned to >=4.45,<5.0 because the dllm-hub modeling file uses
  decoder_layer.attention_type which transformers 5.x removed. Mac/Linux
  users will use a project-local venv (handled by the setup scripts).

API drift fix:
  apply_chat_template() now passed return_dict=False so it returns the
  legacy list-of-ids on both 4.x and 5.x.

Coverage backend:
  .coveragerc sets core=sysmon (Python 3.12+ sys.monitoring) — required
  to avoid a known C-trace conflict with torch's _C extension. fail_under
  is 100; the CLI entrypoint kv_cache_proposer/run_demo.py is omitted
  from unit-test coverage and exercised by an integration test that
  invokes it via subprocess.

Tests (tests/core/, all real weights, no mocks):
  test_verifier.py              28 tests  - 100% coverage, includes layout
                                            invariant violation, null-K
                                            layer skip, sink_size=0 edge
  test_proposer.py              18 tests  - 100% coverage, includes
                                            tokenizer-API-drift defenses
                                            and underfill detection
  test_baseline.py               5 tests  - 100% coverage, EOS handling
  test_speculative.py           14 tests  - 100% coverage, includes both
                                            EOS-in-accepted-prefix-with-
                                            trailing-trim and
                                            correction-is-EOS branches
  test_metrics.py               12 tests  - 100% coverage, JSON
                                            serializability, projection
                                            table shape, divergence detect
  test_run_demo_integration.py   1 test   - end-to-end CLI smoke

Result on Linux VM with transformers 4.57.6, torch 2.8.0+cpu, Python 3.12:
  79 passed, 100% coverage on kv_cache_proposer/{__init__,baseline,
  metrics,proposer,speculative,verifier}.py.

Truly defensive code paths marked with explicit '# pragma: no cover'
comments stating WHY the path is unreachable from public API:
  * proposer.py: zero-mask-init defense, schedule-underflow defense,
    zero-transfer-step defense
  * speculative.py: malformed-block-from-proposer defense

Not in this commit (next phases):
  inference_engine/memory/slab_pool.py
  inference_engine/memory/nf4_kv.py
  inference_engine/proposer/sparse_logits.py
  inference_engine/tree_spec.py
  inference_engine/scheduler/
  inference_engine/server/
  inference_engine/backends/{mlx,cuda}/

Co-Authored-By: Cursor Agent <agent@cursor.sh>

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Root cause of the Mac failure: the test suite loads real Qwen3 weights
on first use (no-mock policy), but huggingface.co was unreachable from
the user's Mac. transformers' `local_files_only=True` fallback kicked
in for the session, causing all 64 model-loading tests to error out
with cascading 'OSError: We could not connect to huggingface.co'.

Fix has three layers:

1. setup_mac.sh / setup_cuda.sh now do, after dep install:
   (a) clear_offline_mode      - unset any HF_HUB_OFFLINE the shell had
   (b) probe_hf_connectivity   - 15 s curl to the HF endpoint; on
                                 failure print explicit remediation
                                 (VPN, hf-mirror.com for mainland China,
                                 manual cache copy)
   (c) download_models         - snapshot_download() both required
                                 repos so the cache is warm before any
                                 test runs.

2. run_platform_tests.sh now runs an HF cache pre-flight BEFORE pytest.
   If either required repo is absent from the local cache, the runner
   exits with code 5 and a 4-line remediation message instead of
   letting pytest emit 78 cascading errors.

3. README documents the network requirement and the HF_ENDPOINT mirror
   override.

Verified on the Linux VM:
   * runner with populated cache  -> 78 passed, 100% coverage
   * runner with empty HF_HOME    -> exit 5 with clear remediation
   * bash syntax check OK on all 3 scripts

User next steps on the Mac mini:
   git pull
   # if you can reach huggingface.co directly:
   ./scripts/setup_mac.sh
   # if you are in mainland China or behind a firewall:
   export HF_ENDPOINT=https://hf-mirror.com
   ./scripts/setup_mac.sh
   # then:
   source .venv-mac/bin/activate
   ./scripts/run_platform_tests.sh --backend mlx

.gitignore now excludes the .coverage data file (an accidental commit
candidate from prior runs).

Co-Authored-By: Cursor Agent <agent@cursor.sh>

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Bug: setup_mac.sh's verify_imports() did
    Version(getattr(mod, '__version__', '0'))
which mis-classifies mlx 0.31.1 as version '0.0' (since mlx, unlike
torch / transformers / safetensors etc., does not expose __version__
on the imported module). That makes the floor check (>= 0.20) fail
and aborts the script.

Fix: switch to importlib.metadata.version() as the canonical version
source, with two fallback layers:

  1. importlib.metadata.version(dist_name or import_name)        canonical
  2. importlib.metadata.version(name with '_' replaced by '-')   PEP 503
  3. mod.__version__ attribute                                   editable
                                                                  installs
  4. else: hard error (no silent default to '0')

Per-package dist_name overrides for the two known cases where pip
distribution name differs from the import name:
   import 'flash_attn'  ->  pip dist 'flash-attn'   (CUDA only)
   import 'awq'         ->  pip dist 'autoawq'      (CUDA only)
   import 'mlx_lm'      ->  pip dist 'mlx-lm'       (Mac only)

Also: setup_mac.sh now verifies mlx-lm (used by setup for AWQ
conversion in later phases).

Verified on this Linux VM by simulating mlx's behavior: deleted
safetensors.__version__, get_version() correctly fell through to
importlib.metadata and returned the dist-metadata version. Existing
test suite still passes (58 of 58 in the regression slice).

Co-Authored-By: Cursor Agent <agent@cursor.sh>

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Drop-in subclass of DLMProposer that runs the model backbone for the
full sequence (required by bidirectional attention) but applies the
language-model head only at the L masked positions of the current
block. Trims the dominant transient tensor — the [1, T, V] logits
buffer — to [1, n_masked, V], which on Qwen3-0.6B-MDLM with
V=151,936 means ~10x less activation memory at typical (T~50, L=8)
operating points.

Headline test (parametrized over 5 (block_size, num_steps) configs):
  test_sparse_path_emits_identical_tokens_to_dense — under greedy
  temperature-0 decoding, the sparse path produces the EXACT same
  token sequence as DLMProposer for the same inputs. bf16 numerical
  noise in the bigger lm_head matmul could in principle flip an
  argmax, but the test asserts equality (not approximate), so any
  real-world flip is caught by CI.

Headline measurement (Linux VM, S=34, L=4, K=2):
  output identical    : True
  activation peak     : 13.33 MB -> 1.16 MB     (11.5x smaller)
  wall time           : 5.09 s   -> 5.01 s      (1.02x; bigger gain
                        expected on memory-bandwidth-bound CPUs and
                        long contexts)

Files:
  inference_engine/proposer/sparse_logits.py    (61 stmts, 100% cov)
  inference_engine/proposer/__init__.py
  inference_engine/__init__.py
  tests/inference_engine/proposer/test_sparse_logits.py  (16 tests,
                                                          all pass)
  tests/inference_engine/conftest.py
  tests/conftest.py    (moved up from tests/core/conftest.py so both
                        suites share fixtures)
  scripts/bench_sparse_vs_dense.py    (Mac-runnable benchmark; emits
                                       results/platform-tests/*.json)
  scripts/run_platform_tests.sh    (updated to include the new
                                    tests/inference_engine/ tree)

Project state after this commit:
  94 tests pass, 100% line coverage on
    inference_engine/proposer/{__init__,sparse_logits}.py
    kv_cache_proposer/{__init__,baseline,metrics,proposer,
                       speculative,verifier}.py
  fail_under=100 in .coveragerc; the runner exits non-zero on any
  uncovered line.

User next step on Mac mini:
  git pull
  source .venv-mac/bin/activate
  PYTHONPATH=. python3 scripts/bench_sparse_vs_dense.py \
      --prompt 'Why is the sky blue?' --max-new-tokens 32 \
      --block-size 8 --num-diffusion-steps 4

Push the resulting JSON in results/platform-tests/ back to the PR
branch so we can quantify the real Mac M4 wall-time win.

Co-Authored-By: Cursor Agent <agent@cursor.sh>

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants