Add AGENTS.md with cloud-specific development instructions #3
Draft
FluffyAIcode wants to merge 4 commits into
Implements the speculative-decoding architecture from the design discussion on real public weights:

- Proposer: dllm-hub/Qwen3-0.6B-diffusion-mdlm-v0.1 (masked-diffusion DLM built on Qwen3-0.6B; same tokenizer as the verifier).
- Verifier: Qwen/Qwen3-1.7B with a sink+window-bounded DynamicCache. We slice each layer's K/V tensors after every step, keep new queries at their global RoPE position, and let evicted tokens drop out of the attention's view, StreamingLLM-style.
- Greedy speculative decoding loop with verifier-side rejection sampling. When sink+window covers the full sequence the loop is bit-equivalent to greedy AR (verified at runtime; demo exits non-zero on mismatch).
- NBT (Net Bytes per Token) accounting that includes the proposer's own weights and peak activation amortized over (B, S, L_block), plus a projection table to canonical operating points.

Results from CPU runs (B=64, S=128k projection):

  baseline KV    = 114,688 B/token
  speculative KV =  32,216 B/token -> 3.56x net compression

break-even at B*S ~= 1M tokens-batches, matching the design analysis.

No mock, no fallback, no overfit: all forwards execute on real downloaded weights, layout invariants raise on inconsistency, and the same code path runs every prompt.

Co-Authored-By: Cursor Agent <agent@cursor.sh>
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
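The sink+window trimming and the greedy acceptance step described in this commit message can be illustrated with a minimal sketch. The function names `kept_positions` and `accept_greedy` are hypothetical and are not taken from the repository. The sketch works on plain token and position lists rather than on a real DynamicCache, and it assumes the common convention that the verifier contributes one extra token after the accepted prefix.

```python
from typing import List, Sequence


def kept_positions(seq_len: int, sink: int, window: int) -> List[int]:
    """Global positions whose K/V entries survive sink+window trimming.

    Hypothetical helper: keeps the first `sink` positions and the last
    `window` positions. Kept entries retain their original (global)
    position indices, so RoPE phases are not renumbered.
    """
    if sink < 0 or window < 0:
        raise ValueError("sink and window must be non-negative")
    if seq_len <= sink + window:
        # Nothing is evicted: this is the equivalence regime.
        return list(range(seq_len))
    return list(range(sink)) + list(range(seq_len - window, seq_len))


def accept_greedy(proposed: Sequence[int], verifier_argmax: Sequence[int]) -> List[int]:
    """Greedy verification of one proposed block.

    `verifier_argmax[i]` is the verifier's greedy token given the prefix
    plus `proposed[:i]`, so it has len(proposed) + 1 entries. Returns the
    longest matching prefix followed by the verifier's own next token
    (an assumption about the loop, labeled as such in the lead-in).
    """
    if len(verifier_argmax) != len(proposed) + 1:
        raise ValueError("verifier_argmax must have len(proposed) + 1 entries")
    accepted: List[int] = []
    for i, token in enumerate(proposed):
        if token != verifier_argmax[i]:
            break
        accepted.append(token)
    # Correction token on mismatch, bonus token on full acceptance.
    accepted.append(verifier_argmax[len(accepted)])
    return accepted
```

With `sink=2` and `window=3`, a 10-token sequence keeps positions 0, 1, 7, 8, 9. When `sink + window` is at least the sequence length nothing is evicted, which is the regime in which the commit message reports bit-equivalence with greedy AR.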
The previous metric amortized peak proposer activation as peak / (B * L_block) and added it to NBT. That's dimensionally wrong: activation is allocated when model(...) starts and freed when it returns; it never accumulates across forwards and does not scale per-session. It is a GPU capacity constraint, not a per-token cost.

This commit:

- metrics.py: introduces NBT_kv_only = verifier_KV + proposer_KV + weights/(B*S). Reports proposer_peak_activation and verifier_peak_activation separately as a capacity-constraint line, outside NBT and outside the compression ratio.
- verifier.py: tracks peak activation (logits-buffer footprint) per forward call.
- speculative.py + run_demo.py: thread the verifier's peak activation through to the report.
- README.md: documents the metric, calls out the prior accounting error, and shows the corrected results.

Corrected projection at B=64, S=128k (compression-regime run):

  before fix: 32,216 B/token -> 3.56x compression
  after fix :  166.6 B/token -> 688.4x compression

The 200x change comes entirely from removing 31,250 B/token of incorrectly amortized activation from the numerator. KV cache trimming, RoPE handling, and equivalence-regime bit-equivalence are unchanged and still verified at runtime.

Co-Authored-By: Cursor Agent <agent@cursor.sh>
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
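The corrected metric can be sketched as follows. This is not the code in metrics.py; the helper names are hypothetical. The verifier shape used in the usage note (28 layers, 8 KV heads, head dimension 128, 2-byte elements) is an assumption about Qwen/Qwen3-1.7B, chosen because it reproduces the 114,688 B/token baseline quoted in the commit messages.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    """Full (untrimmed) KV-cache bytes stored per token.

    The leading factor 2 accounts for one K and one V tensor per layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def net_bytes_per_token_kv_only(verifier_kv: float, proposer_kv: float,
                                weight_bytes: float,
                                batch: int, seq_len: int) -> float:
    """NBT_kv_only = verifier_KV + proposer_KV + weights / (B * S).

    Peak activation is deliberately absent: it is freed after every
    forward pass, so it is reported as a capacity constraint instead.
    Which weights count toward `weight_bytes` is left to the caller
    (assumption: the proposer's weights).
    """
    if batch <= 0 or seq_len <= 0:
        raise ValueError("batch and seq_len must be positive")
    return verifier_kv + proposer_kv + weight_bytes / (batch * seq_len)


def compression_ratio(baseline_bytes: float, net_bytes: float) -> float:
    """Baseline KV bytes per token divided by net bytes per token."""
    return baseline_bytes / net_bytes
```

Under the assumed shape, `kv_bytes_per_token(28, 8, 128, 2)` gives 114,688, and `compression_ratio(114688, 166.6)` is about 688.4, in line with the corrected projection.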
Renames affecting code:
  NBTReport -> NetBytesPerTokenReport
  nbt_kv_only_bytes_per_token -> net_bytes_per_token_kv_only
  json key 'nbt' -> 'net_bytes_per_token_report'

Renames affecting prose / log output:
  'NBT' -> 'Net Bytes per Token' (or hyphenated as a noun phrase 'Net-Bytes-per-Token Report' in headings)
  'NBT_kv_only' -> 'Net Bytes per Token (KV-only)'
  'NBT B/token' -> 'Net Bytes per Token' (column header in projection table; column widened to fit)
Also re-runs both demos so results/*.log and results/*.json reflect the
new headers and field names. Re-runs reproduce the same numerical
results (compression at B=64,S=128k stays 688.36x).
No semantic changes; this commit is a textual rename.
Co-Authored-By: Cursor Agent <agent@cursor.sh>
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
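Result files written before the rename would still carry the old key names. A minimal migration sketch follows, assuming (this is not confirmed by the commit) that `nbt_kv_only_bytes_per_token` appears as a field inside the object stored under the `nbt` key. The helper `migrate_report` is hypothetical.

```python
from typing import Any, Dict

# Old -> new names, taken from the rename list in the commit message.
OLD_KEY = "nbt"
NEW_KEY = "net_bytes_per_token_report"
FIELD_RENAMES = {"nbt_kv_only_bytes_per_token": "net_bytes_per_token_kv_only"}


def migrate_report(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `doc` with pre-rename keys mapped to the new names.

    Assumption: the per-token field lives inside the report object.
    Values are never modified, mirroring the "textual rename" claim.
    """
    out = dict(doc)
    if OLD_KEY in out and NEW_KEY not in out:
        out[NEW_KEY] = out.pop(OLD_KEY)
    report = out.get(NEW_KEY)
    if isinstance(report, dict):
        out[NEW_KEY] = {FIELD_RENAMES.get(k, k): v for k, v in report.items()}
    return out
```

Documents that already use the new names pass through unchanged, so the helper is safe to apply to a mixed set of result files.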
Adds AGENTS.md with Cursor Cloud specific development instructions for the DLM Proposer + AR Verifier KV-cache inference engine.

What's included

dllm stub requirement, PYTHONPATH=. requirement, model download size, memory constraints

Environment verification

All commands were tested end-to-end on a CPU environment:

- Smoke tests: all 4 component tests passed (tokenizer agreement, verifier prefill, proposer block generation, cache layout invariants): smoke_test_output.log
- Equivalence-regime demo: baseline and speculative outputs are bit-identical (exact match: True), PASS status: demo_run_output.log