Add AGENTS.md with cloud-specific development instructions #3

Draft
FluffyAIcode wants to merge 4 commits into main from AgentMemory/setup-dev-env-68ce

@FluffyAIcode
Owner

Adds AGENTS.md with Cursor Cloud-specific development instructions for the DLM Proposer + AR Verifier KV-cache inference engine.

What's included

  • Quick-reference table for running smoke tests and demo commands
  • Key gotchas: dllm stub requirement, PYTHONPATH=. requirement, model download size, memory constraints
  • Notes on lack of linter/formatter and test framework configuration

Environment verification

All commands were tested end-to-end on a CPU environment:

Smoke tests — all 4 component tests passed (tokenizer agreement, verifier prefill, proposer block generation, cache layout invariants):
smoke_test_output.log

Equivalence-regime demo — baseline and speculative outputs are bit-identical (exact match: True), PASS status:
demo_run_output.log

cursoragent and others added 4 commits May 18, 2026 17:06
Implements the speculative-decoding architecture from the design discussion
on real public weights:

- Proposer: dllm-hub/Qwen3-0.6B-diffusion-mdlm-v0.1 (masked-diffusion DLM
  built on Qwen3-0.6B; same tokenizer as the verifier).
- Verifier: Qwen/Qwen3-1.7B with a sink+window-bounded DynamicCache. We
  slice each layer's K/V tensors after every step, keep new queries at
  their global RoPE position, and let evicted tokens drop out of the
  attention's view — StreamingLLM-style.
- Greedy speculative decoding loop with verifier-side rejection sampling.
  When sink+window covers the full sequence the loop is bit-equivalent to
  greedy AR (verified at runtime; demo exits non-zero on mismatch).
- NBT (Net Bytes per Token) accounting that includes the proposer's own
  weights and peak activation amortized over (B, S, L_block), plus a
  projection table to canonical operating points.
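
The cache trimming and greedy acceptance steps listed above can be illustrated with a small sketch. This is not the code from this PR: the function names `sink_window_keep_indices` and `greedy_accept` are hypothetical, cache positions are plain integers rather than per-layer K/V tensors, and the acceptance rule is the standard greedy one (keep the longest prefix the verifier agrees with, then emit the verifier's own token).

```python
def sink_window_keep_indices(seq_len: int, sink: int, window: int) -> list[int]:
    """Positions kept after trimming: the first `sink` tokens plus the
    most recent `window` tokens. Hypothetical helper for illustration."""
    if seq_len <= sink + window:
        # Sink + window covers everything, so nothing is evicted. This is
        # the regime in which the loop should match greedy AR exactly.
        return list(range(seq_len))
    return list(range(sink)) + list(range(seq_len - window, seq_len))


def greedy_accept(proposed: list[int], verifier_argmax: list[int]) -> list[int]:
    """Greedy verification of one proposed block.

    `verifier_argmax` is assumed to hold len(proposed) + 1 entries: the
    verifier's greedy token at each proposed position, plus one extra
    token that follows a fully accepted block.
    """
    out = []
    for i, token in enumerate(proposed):
        if token != verifier_argmax[i]:
            # First disagreement: take the verifier's token and stop.
            out.append(verifier_argmax[i])
            return out
        out.append(token)
    # Every proposal matched, so the verifier's extra token comes for free.
    out.append(verifier_argmax[len(proposed)])
    return out


print(sink_window_keep_indices(10, sink=2, window=3))  # [0, 1, 7, 8, 9]
print(greedy_accept([5, 6, 7], [5, 6, 9, 1]))          # [5, 6, 9]
```

Because every emitted token is one the verifier would have chosen greedily, the output can only diverge from plain greedy AR when trimming has actually evicted positions.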

Results from CPU runs (B=64, S=128k projection):
  baseline KV    = 114,688 B/token
  speculative KV =  32,216 B/token  ->  3.56x net compression
  break-even at  B*S ~= 1M tokens-batches, matching the design analysis.

No mock, no fallback, no overfit: all forwards execute on real downloaded
weights, layout invariants raise on inconsistency, and the same code path
runs every prompt.

Co-Authored-By: Cursor Agent <agent@cursor.sh>

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

The previous metric amortized peak proposer activation as
peak / (B * L_block) and added it to NBT. That's dimensionally wrong:
activation is allocated when model(...) starts and freed when it
returns; it never accumulates across forwards and does not scale
per-session. It is a GPU capacity constraint, not a per-token cost.

This commit:

- metrics.py: introduces NBT_kv_only = verifier_KV + proposer_KV +
  weights/(B*S). Reports proposer_peak_activation and
  verifier_peak_activation separately as a capacity-constraint line,
  outside NBT and outside the compression ratio.
- verifier.py: tracks peak activation (logits-buffer footprint) per
  forward call.
- speculative.py + run_demo.py: thread the verifier's peak activation
  through to the report.
- README.md: documents the metric, calls out the prior accounting
  error, and shows the corrected results.
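
The KV-only metric described above can be written out as a short sketch. The function name and the input numbers are hypothetical and chosen only to make the arithmetic easy to follow; the sketch also assumes both KV terms are already per-token byte counts and that the weight term counts whatever extra weights are being amortized. It is not the implementation in metrics.py.

```python
def kv_only_net_bytes_per_token(
    verifier_kv: float,
    proposer_kv: float,
    weight_bytes: float,
    batch: int,
    seq_len: int,
) -> float:
    """verifier_KV + proposer_KV + weights / (B * S).

    Peak activation is deliberately absent: it is freed after every
    forward call, so it limits capacity but is not a per-token cost.
    """
    if batch <= 0 or seq_len <= 0:
        raise ValueError("batch and seq_len must be positive")
    return verifier_kv + proposer_kv + weight_bytes / (batch * seq_len)


# Illustrative inputs only, not measurements from this PR.
print(kv_only_net_bytes_per_token(100.0, 50.0, 1_000_000, batch=10, seq_len=1000))
# 250.0
```

The weight term shrinks as B*S grows, which is why the metric favors large batches and long sequences.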

Corrected projection at B=64, S=128k (compression-regime run):
  before fix: 32,216 B/token   ->  3.56x compression
  after fix :    166.6 B/token -> 688.4x compression

The roughly 200x change comes entirely from removing 31,250 B/token of
incorrectly amortized activation from the numerator. KV cache
trimming, RoPE handling, and equivalence-regime bit-equivalence
are unchanged and still verified at runtime.
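
As a quick check, the compression ratios quoted above follow directly from the reported bytes-per-token figures. The variable names below are illustrative; the three input numbers are the ones stated in these commit messages.

```python
baseline_kv = 114_688.0  # baseline KV, B/token
before_fix = 32_216.0    # speculative figure before the fix, B/token
after_fix = 166.6        # speculative figure after the fix, B/token

ratio_before = baseline_kv / before_fix
ratio_after = baseline_kv / after_fix
change = ratio_after / ratio_before

print(f"{ratio_before:.2f}x")  # 3.56x
print(f"{ratio_after:.1f}x")   # 688.4x
print(f"{change:.0f}x")        # 193x
```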

Co-Authored-By: Cursor Agent <agent@cursor.sh>

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Renames affecting code:
  NBTReport                    -> NetBytesPerTokenReport
  nbt_kv_only_bytes_per_token  -> net_bytes_per_token_kv_only
  json key 'nbt'               -> 'net_bytes_per_token_report'

Renames affecting prose / log output:
  'NBT'              -> 'Net Bytes per Token' (or hyphenated as a noun
                        phrase 'Net-Bytes-per-Token Report' in headings)
  'NBT_kv_only'      -> 'Net Bytes per Token (KV-only)'
  'NBT B/token'      -> 'Net Bytes per Token' (column header in projection
                        table; column widened to fit)

Also re-runs both demos so results/*.log and results/*.json reflect the
new headers and field names. Re-runs reproduce the same numerical
results (compression at B=64,S=128k stays 688.36x).

No semantic changes; this commit is a textual rename.
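
For anyone holding result files written before this rename, a migration could look like the sketch below. It assumes the old `'nbt'` key sits at the top level of the JSON object and that its value is an object whose field names follow the code renames; neither is confirmed by this commit. `migrate_report` and the `exact_match` key in the example are hypothetical.

```python
# Old -> new names, taken from the rename list in this commit message.
TOP_LEVEL_RENAMES = {"nbt": "net_bytes_per_token_report"}
FIELD_RENAMES = {"nbt_kv_only_bytes_per_token": "net_bytes_per_token_kv_only"}


def migrate_report(old: dict) -> dict:
    """Return a copy of `old` with the renamed keys applied.

    Assumed layout: renamed report objects live at the top level and
    their fields are one level down. Other keys pass through untouched.
    """
    new = {}
    for key, value in old.items():
        if key in TOP_LEVEL_RENAMES and isinstance(value, dict):
            value = {FIELD_RENAMES.get(k, k): v for k, v in value.items()}
        new[TOP_LEVEL_RENAMES.get(key, key)] = value
    return new


old = {"nbt": {"nbt_kv_only_bytes_per_token": 166.6}, "exact_match": True}
print(migrate_report(old))
# {'net_bytes_per_token_report': {'net_bytes_per_token_kv_only': 166.6}, 'exact_match': True}
```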

Co-Authored-By: Cursor Agent <agent@cursor.sh>

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>