[V1][Core] Autotune encoder cache budget #11895
```python
from typing import TYPE_CHECKING, Dict, List, Set, Tuple

from vllm.logger import init_logger
from vllm.multimodal import MULTIMODAL_REGISTRY
from vllm.utils import cdiv
from vllm.v1.request import Request

if TYPE_CHECKING:
    from vllm.config import ModelConfig, SchedulerConfig

logger = init_logger(__name__)


class EncoderCacheManager:

    ...

    def get_freed_ids(self) -> List[Tuple[str, int]]:
        freed = self.freed
        self.freed = []
        return freed


def compute_encoder_cache_budget(
    model_config: "ModelConfig",
    scheduler_config: "SchedulerConfig",
) -> int:
    """Compute the encoder cache budget based on the model and scheduler
    configurations.
    """

    encoder_cache_budget = 0
    if not model_config.is_multimodal_model:
        return encoder_cache_budget

    max_tokens_by_modality_dict = MULTIMODAL_REGISTRY.get_max_tokens_per_item_by_modality(  # noqa: E501
        model_config)

    modality, max_tokens_per_mm_item = max(max_tokens_by_modality_dict.items(),
                                           key=lambda item: item[1])

    max_num_batched_tokens = scheduler_config.max_num_batched_tokens
    max_num_reqs = scheduler_config.max_num_seqs

    # In case the biggest possible multimodal item takes up more space than
    # the batch size, it needs to be cached and chunk-prefilled.
    if max_tokens_per_mm_item > max_num_batched_tokens:
        num_items = 1

    # In case the biggest possible multimodal item takes up less space than
    # the batch size, all items will be fully prefilled except one.
    else:
        num_items = cdiv(max_num_batched_tokens, max_tokens_per_mm_item)

    # NOTE: We need the encoder cache to be able to compute & hold ONE
    # ADDITIONAL multimodal item, which is required only when:
    # - Two requests in the current batch share the same prefix with such an
    #   item as part of the prefix.
    # - AND the prefix length is divisible by the block size, triggering
    #   recomputation of the last block.
    # - AND part of the item's embeddings falls in this last block.
    # This can be improved when we have a global encoder cache that does
    # not associate items to request ids only.
    num_items += 1
```
```python
if num_encoder_tokens > encoder_budget:
    # The encoder budget is exhausted. We can only schedule the
    # decoder tokens up until the encoder input.
    # NOTE(woosuk): We assume that the encoder tokens should be
    # processed altogether, as the encoder usually uses
    # bidirectional attention.
    num_new_tokens = start_pos - num_computed_tokens
    break
```

This sets `num_new_tokens` to 7333 (`start_pos`) - 16016 (`num_computed_tokens`) = -8683, and then crashes the server, as we cannot have a non-positive `num_new_tokens`.
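Plugging those numbers into the scheduler snippet makes the failure concrete (the values come from the comment above; the variable setup is a minimal reconstruction, not a real run):

```python
start_pos = 7333             # offset where the encoder input begins
num_computed_tokens = 16016  # tokens already computed for the request

# The scheduler's fallback when the encoder budget is exhausted:
num_new_tokens = start_pos - num_computed_tokens
print(num_new_tokens)  # -8683: a non-positive token count, which crashes the server
```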
Both cases would need this.

Also, for this comment:

```python
# This can be improved when we have a global encoder cache that does
# not associate items to request ids only.
```

This cannot address the issue fundamentally, because we also need to guarantee that the item is always available in the encoder cache when we schedule the request. For example, suppose an item is used by both request A and request B. Request A has finished, so its prefix and mm items are cached. However, due to the encoder cache budget, one of request A's items is evicted before request B arrives. This would result in the same problem.

I guess this can somehow be avoided if we could guarantee that all prefix-cached mm items are always available in the encoder cache as well, but fundamentally this has to be solved by supporting `num_tokens=0` in the model runner.
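The eviction race described here can be sketched with a toy capacity-bounded cache (`ToyEncoderCache` is a hypothetical illustration, not vLLM's actual implementation):

```python
from collections import OrderedDict

class ToyEncoderCache:
    def __init__(self, budget: int):
        self.budget = budget
        self.cache = OrderedDict()  # item_id -> num_tokens

    def put(self, item_id: str, num_tokens: int) -> None:
        self.cache[item_id] = num_tokens
        # Evict oldest entries until we fit within the token budget.
        while sum(self.cache.values()) > self.budget:
            self.cache.popitem(last=False)

cache = ToyEncoderCache(budget=100)
cache.put("shared_item", 60)  # used by request A; A then finishes
cache.put("other_item", 80)   # budget pressure evicts shared_item
print("shared_item" in cache.cache)  # False: request B now misses the cached item
```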
> but fundamentally this has to be solved by supporting `num_tokens=0` in the model runner.

That's a good callout! I've adjusted the comment accordingly.
It may be better to log a warning if `num_items < max_num_reqs * max_mm_items_per_req`, because it means we are overriding user configurations.
Here we're actually not overriding user configurations, because the user doesn't get to specify the encoder cache budget (nor could they before this PR, since it was hardcoded).

What we are doing here is simply having the encoder budget calculation respect `max_num_reqs`. (Consider `max_num_reqs=1`: the encoder cache then only needs to be able to compute & hold items for one request every step.)
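A minimal sketch of that idea, assuming a hypothetical `max_mm_items_per_req` value (this helper is illustrative and not part of the diff shown above):

```python
def capped_num_items(num_items: int, max_num_reqs: int,
                     max_mm_items_per_req: int) -> int:
    # No point budgeting for more items than the running requests
    # can ever reference in a single step.
    return min(num_items, max_num_reqs * max_mm_items_per_req)

print(capped_num_items(5, 1, 2))  # 2: with a single request, the cap applies
print(capped_num_items(5, 8, 2))  # 5: plenty of requests, original value kept
```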
Oh, I thought `MULTIMODAL_REGISTRY.get_mm_limits_per_prompt(model_config)` can be configured by users via `mm_limit`? Is that a different config?
Yes - that's configured by the user, but we're not overwriting that value.

Also keep in mind that, technically speaking, this limit is only still needed today because in V0 we don't support chunked prefill for multimodal models, so the sequence (and thus all multimodal items in it) needs to be prefilled as a whole, and profiling needs to be done accordingly.

In V1, chunked prefill happens by nature, so this limit doesn't affect how we schedule requests at all; it only affects how engine profiling is done at this specific check. Technically we don't need it anymore, but we still want to keep this argument so that users have a way to set a cap themselves.
I see. Thanks for the clarification!