Add AGENTS.md with cloud-specific development instructions by FluffyAIcode · Pull Request #3 · FluffyAIcode/Kakeya-LLM-Inference-engine

FluffyAIcode · 2026-05-22T05:20:17Z

Adds AGENTS.md with Cursor Cloud specific development instructions for the DLM Proposer + AR Verifier KV-cache inference engine.

What's included

Quick-reference table for running smoke tests and demo commands
Key gotchas: dllm stub requirement, PYTHONPATH=. requirement, model download size, memory constraints
Notes on lack of linter/formatter and test framework configuration

Environment verification

All commands were tested end-to-end on a CPU environment:

Smoke tests — all 4 component tests passed (tokenizer agreement, verifier prefill, proposer block generation, cache layout invariants):
smoke_test_output.log

Equivalence-regime demo — baseline and speculative outputs are bit-identical (exact match: True), PASS status:
demo_run_output.log


Implements the speculative-decoding architecture from the design discussion on real public weights:

- Proposer: dllm-hub/Qwen3-0.6B-diffusion-mdlm-v0.1 (masked-diffusion DLM built on Qwen3-0.6B; same tokenizer as the verifier).
- Verifier: Qwen/Qwen3-1.7B with a sink+window-bounded DynamicCache. We slice each layer's K/V tensors after every step, keep new queries at their global RoPE position, and let evicted tokens drop out of the attention's view, StreamingLLM-style.
- Greedy speculative decoding loop with verifier-side rejection sampling. When sink+window covers the full sequence the loop is bit-equivalent to greedy AR (verified at runtime; demo exits non-zero on mismatch).
- NBT (Net Bytes per Token) accounting that includes the proposer's own weights and peak activation amortized over (B, S, L_block), plus a projection table to canonical operating points.

Results from CPU runs (B=64, S=128k projection):

baseline KV = 114,688 B/token
speculative KV = 32,216 B/token -> 3.56x net compression
break-even at B*S ~= 1M tokens-batches, matching the design analysis.

No mock, no fallback, no overfit: all forwards execute on real downloaded weights, layout invariants raise on inconsistency, and the same code path runs every prompt.

Co-Authored-By: Cursor Agent <agent@cursor.sh>
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
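The bit-equivalence claim in the commit message above can be illustrated with a toy sketch. This is not the repository's code: `greedy_speculative`, `greedy_ar`, `toy_verifier`, and `toy_proposer` are hypothetical names, the "models" are plain functions over integer tokens, and "rejection" is assumed to mean exact-match acceptance under greedy decoding. A real engine would score the whole draft in one verifier forward pass; the sketch calls the verifier once per position for clarity.

```python
from typing import Callable, List

# Hypothetical stand-ins for the real models (not the repository's API):
# the verifier maps a token prefix to its greedy next token, and the
# proposer drafts a whole block of tokens for a prefix.
Verifier = Callable[[List[int]], int]
Proposer = Callable[[List[int], int], List[int]]


def greedy_ar(verifier: Verifier, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: plain greedy autoregressive decoding with the verifier alone."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(verifier(seq))
    return seq[len(prompt):]


def greedy_speculative(proposer: Proposer, verifier: Verifier,
                       prompt: List[int], n_new: int, block: int = 4) -> List[int]:
    """Draft a block, then commit the verifier's token at each position and
    stop consuming the draft at the first position where it disagrees."""
    seq = list(prompt)
    target = len(prompt) + n_new
    while len(seq) < target:
        draft = proposer(seq, block)
        if not draft:
            raise ValueError("proposer returned an empty draft")
        for drafted in draft:
            expected = verifier(seq)
            seq.append(expected)  # the committed token is always the verifier's
            if expected != drafted or len(seq) >= target:
                break  # mismatch: the rest of the draft is discarded
    return seq[len(prompt):]


def toy_verifier(seq: List[int]) -> int:
    """Deterministic toy 'model' over tokens 0..10."""
    return (3 * seq[-1] + len(seq)) % 11


def toy_proposer(seq: List[int], block: int) -> List[int]:
    """Mostly imitates the verifier but deliberately errs on some tokens."""
    out, ctx = [], list(seq)
    for _ in range(block):
        tok = toy_verifier(ctx)
        if tok % 4 == 0:
            tok = (tok + 1) % 11  # injected disagreement
        out.append(tok)
        ctx.append(tok)
    return out


baseline = greedy_ar(toy_verifier, [1, 2, 3], 20)
speculative = greedy_speculative(toy_proposer, toy_verifier, [1, 2, 3], 20)
print("exact match:", baseline == speculative)
```

Because every committed token comes from the verifier evaluated on the already-committed prefix, the speculative output matches the baseline whatever the proposer drafts; the proposer only changes how many positions are settled per round.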

The previous metric amortized peak proposer activation as peak / (B * L_block) and added it to NBT. That's dimensionally wrong: activation is allocated when model(...) starts and freed when it returns; it never accumulates across forwards and does not scale per-session. It is a GPU capacity constraint, not a per-token cost.

This commit:

- metrics.py: introduces NBT_kv_only = verifier_KV + proposer_KV + weights/(B*S). Reports proposer_peak_activation and verifier_peak_activation separately as a capacity-constraint line, outside NBT and outside the compression ratio.
- verifier.py: tracks peak activation (logits-buffer footprint) per forward call.
- speculative.py + run_demo.py: thread the verifier's peak activation through to the report.
- README.md: documents the metric, calls out the prior accounting error, and shows the corrected results.

Corrected projection at B=64, S=128k (compression-regime run):

before fix: 32,216 B/token -> 3.56x compression
after fix : 166.6 B/token -> 688.4x compression

The 200x change comes entirely from removing 31,250 B/token of incorrectly amortized activation from the numerator. KV cache trimming, RoPE handling, and equivalence-regime bit-equivalence are unchanged and still verified at runtime.

Co-Authored-By: Cursor Agent <agent@cursor.sh>
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
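The corrected metric and the reported ratios can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical (not taken from metrics.py), and the baseline derivation assumes the public Qwen3-1.7B configuration (28 layers, 8 KV heads, head dimension 128) with 2-byte cache elements, which is an assumption rather than something stated in the commit.

```python
def net_bytes_per_token_kv_only(verifier_kv: float, proposer_kv: float,
                                weight_bytes: float, batch: int, seq_len: int) -> float:
    """NBT_kv_only = verifier_KV + proposer_KV + weights / (B * S).

    Peak activation is deliberately absent: it is reported separately as a
    capacity constraint, not amortized into the per-token cost.
    """
    return verifier_kv + proposer_kv + weight_bytes / (batch * seq_len)


def compression_ratio(baseline_bytes: float, speculative_bytes: float) -> float:
    """Baseline bytes per token divided by speculative bytes per token."""
    return baseline_bytes / speculative_bytes


# Assumed Qwen3-1.7B cache shape: K and V, 28 layers, 8 KV heads,
# head dimension 128, 2 bytes per element.
baseline_kv = 2 * 28 * 8 * 128 * 2
print(baseline_kv)  # 114688, consistent with the reported baseline KV

# Ratios recomputed from the figures quoted in the commit messages.
print(round(compression_ratio(114_688, 32_216), 2))  # 3.56 (before fix)
print(round(compression_ratio(114_688, 166.6), 1))   # 688.4 (after fix)

# Placeholder inputs (not measured values) to show the shape of the formula.
print(net_bytes_per_token_kv_only(100.0, 50.0, 1_000_000, batch=10, seq_len=100))  # 1150.0
```

The ratio between the two compression figures, 688.4 / 3.56, is roughly 193, which is the "200x change" the commit message refers to.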

Renames affecting code:

- NBTReport -> NetBytesPerTokenReport
- nbt_kv_only_bytes_per_token -> net_bytes_per_token_kv_only
- json key 'nbt' -> 'net_bytes_per_token_report'

Renames affecting prose / log output:

- 'NBT' -> 'Net Bytes per Token' (or hyphenated as a noun phrase 'Net-Bytes-per-Token Report' in headings)
- 'NBT_kv_only' -> 'Net Bytes per Token (KV-only)'
- 'NBT B/token' -> 'Net Bytes per Token' (column header in projection table; column widened to fit)

Also re-runs both demos so results/*.log and results/*.json reflect the new headers and field names. Re-runs reproduce the same numerical results (compression at B=64,S=128k stays 688.36x). No semantics changes; this commit is a textual rename.

Co-Authored-By: Cursor Agent <agent@cursor.sh>
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
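Anything that parses older result files will see the previous JSON key. A minimal sketch of a tolerant reader follows, assuming the report sits under a top-level key of the JSON document; `extract_report` is a hypothetical helper, not part of the repository, and the payloads are placeholders rather than real results.

```python
import json

NEW_KEY = "net_bytes_per_token_report"  # key name after the rename
OLD_KEY = "nbt"                         # key name before the rename


def extract_report(payload: dict) -> dict:
    """Return the report object, preferring the new key over the old one."""
    if NEW_KEY in payload:
        return payload[NEW_KEY]
    if OLD_KEY in payload:
        return payload[OLD_KEY]
    raise KeyError(f"neither {NEW_KEY!r} nor {OLD_KEY!r} found in results JSON")


# Placeholder documents; the inner field and its value are illustrative only.
new_style = json.loads('{"net_bytes_per_token_report": {"example_field": 1.0}}')
old_style = json.loads('{"nbt": {"example_field": 1.0}}')

print(extract_report(new_style) == extract_report(old_style))
```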

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

cursoragent and others added 4 commits May 18, 2026 17:06

Add AGENTS.md with cloud-specific development instructions

642d44c

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

FluffyAIcode wants to merge 4 commits into main from AgentMemory/setup-dev-env-68ce


2 participants
