[Builtin] Sliding window and sink support for PagedKVCache #16729
Merged
tqchen merged 1 commit into apache:main on Mar 16, 2024
Conversation
tqchen approved these changes on Mar 15, 2024
Force-pushed from b0f5506 to 8ea3b5f
This PR supports sliding window attention and attention sinks for PagedKVCache, so that PagedKVCache can back models such as Mistral.

Meanwhile, this PR removes the "Attention" function (the one without fused QKV) from the AttentionKVCache interface, since its usage is now fully covered by the "AttentionWithFusedQKV" function. Given the maintenance cost, we decided to remove it for now; if the need arises in the future, we will add it back.

This PR also unifies the global function names of PagedKVCache with the KVState introduced earlier, and introduces a new KV cache raw-info query function that returns the current total sequence length in the KV cache.
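For intuition, the position-selection rule implied by a sliding window plus attention sinks can be sketched as below. This is a minimal illustration only: the helper and its parameters (`attended_positions`, `window_size`, `num_sink`) are hypothetical and not part of the actual PagedKVCache API.

```python
# Sketch of which KV positions a new token attends to under
# sliding-window attention with attention sinks. Illustrative only;
# this helper is not part of the PagedKVCache interface.

def attended_positions(seq_len: int, window_size: int, num_sink: int) -> list[int]:
    """Positions visible to the token at index seq_len - 1.

    The first `num_sink` tokens stay pinned in the cache (attention
    sinks); beyond that, only the most recent `window_size` tokens
    remain visible, and older entries can be evicted or overwritten.
    """
    if seq_len <= num_sink + window_size:
        return list(range(seq_len))  # nothing falls out of the window yet
    sinks = list(range(num_sink))                         # pinned prefix
    recent = list(range(seq_len - window_size, seq_len))  # sliding window
    return sinks + recent

# Tiny example: window of 4 with 2 sink tokens over a 10-token sequence.
print(attended_positions(seq_len=10, window_size=4, num_sink=2))
# -> [0, 1, 6, 7, 8, 9]
```

Because positions inside the window recycle cache slots, the cache footprint stays bounded by `num_sink + window_size` regardless of sequence length, which is what allows a paged cache to serve long generations for sliding-window models.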
Force-pushed from 8ea3b5f to 52fbbd7
CharlieFRuan added a commit to mlc-ai/web-llm that referenced this pull request on Apr 1, 2024
This PR supports SWA under PagedKVCache; with this, all models running on WebLLM will now be compiled with PagedKVCache (rather than the old KVCache, which is no longer maintained). Hence, we removed code in `llm_chat.ts` that was kept for backward compatibility. However, old wasms will still work with npm <= 0.2.29, since wasm versioning will be introduced in 0.2.30. Note that the API for `forwardTokensAndSample()` has changed, since we no longer need `curPos`. Relevant PRs: - mlc-ai/mlc-llm#1967 - apache/tvm#16729
thaisacs pushed a commit to thaisacs/tvm that referenced this pull request on Apr 3, 2024
atebites-hub pushed a commit to atebites-hub/web-llm that referenced this pull request on Oct 4, 2025