
[Builtin] Sliding window and sink support for PagedKVCache #16729

Merged
tqchen merged 1 commit into apache:main from MasterJH5574:tvm-dev/2024-03-15-kv-cache-sliding-window
Mar 16, 2024

@MasterJH5574 MasterJH5574 commented Mar 15, 2024

This PR adds sliding window attention and attention sink support to PagedKVCache, so that PagedKVCache can back models such as Mistral.

It also removes the "Attention" function (without fused-qkv) from the AttentionKVCache interface, since its usage is now completely covered by the "AttentionWithFusedQKV" function. Given the maintenance cost, we decided to remove it for now. If the need for this function arises in the future, we will add it back.

Finally, this PR unifies the global function names of PagedKVCache with those of the KVState introduced earlier, and introduces a new KV cache raw info query function that returns the current total sequence length in the KV cache.
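
For readers unfamiliar with the two features, here is a minimal sketch of which token positions stay attendable when a sliding window is combined with an attention sink. It is not the PagedKVCache implementation. The function name `retained_positions` and its parameters are hypothetical, and it assumes the common convention that the first `sink_size` tokens are always kept in addition to the most recent `window_size` tokens; the actual cache may count the sink as part of the window.

```python
from typing import List


def retained_positions(seq_len: int, window_size: int, sink_size: int = 0) -> List[int]:
    """Return the token positions that remain attendable (illustrative only).

    Assumption: the sink tokens are kept *in addition to* the sliding window.
    This is a hypothetical helper, not a TVM API.
    """
    if seq_len < 0 or window_size <= 0 or sink_size < 0:
        raise ValueError("seq_len and sink_size must be >= 0, window_size must be > 0")

    # Attention sink: the first `sink_size` tokens are never evicted.
    sink = list(range(min(sink_size, seq_len)))

    # Sliding window: the most recent `window_size` tokens, excluding
    # any positions already covered by the sink.
    window_start = max(len(sink), seq_len - window_size)
    window = list(range(window_start, seq_len))

    return sink + window


if __name__ == "__main__":
    # 10 tokens so far, window of 4, sink of 2.
    print(retained_positions(10, window_size=4, sink_size=2))
```

With a sink size of 0 this reduces to plain sliding window attention, where only the last `window_size` positions are kept.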

@MasterJH5574 MasterJH5574 force-pushed the tvm-dev/2024-03-15-kv-cache-sliding-window branch 4 times, most recently from b0f5506 to 8ea3b5f Compare March 16, 2024 03:39
@MasterJH5574 MasterJH5574 force-pushed the tvm-dev/2024-03-15-kv-cache-sliding-window branch from 8ea3b5f to 52fbbd7 Compare March 16, 2024 04:21
@tqchen tqchen merged commit b8f64c2 into apache:main Mar 16, 2024
CharlieFRuan added a commit to mlc-ai/web-llm that referenced this pull request Apr 1, 2024
)

This PR supports sliding window attention (SWA) under PagedKVCache.
With this, all models running on WebLLM are now compiled with
PagedKVCache rather than the old KVCache, which is no longer maintained.

Hence, we removed the code in `llm_chat.ts` that existed for backward
compatibility. However, old wasms will still work with npm <= 0.2.29,
since wasm versioning will be introduced with 0.2.30.

Note that the API for `forwardTokensAndSample()` has changed, since we
no longer need `curPos`.

Relevant PRs:
- mlc-ai/mlc-llm#1967
- apache/tvm#16729
thaisacs pushed a commit to thaisacs/tvm that referenced this pull request Apr 3, 2024
)

atebites-hub pushed a commit to atebites-hub/web-llm that referenced this pull request Oct 4, 2025
…lc-ai#351)
