[V1][Model] Add V1 support for Qwen2-VL #11668
imkero wants to merge 4 commits into vllm-project:main
Conversation
Signed-off-by: imkero <kerorek@outlook.com>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
```python
class Qwen2VLMultiModalProcessor(BaseMultiModalProcessor):
    _placeholder_map: Optional[dict[str, list[int]]] = None
```
I think we should initialize this in the `__init__` method to avoid confusing it with a static class variable.
Apart from this, the processor-related changes in the model file LGTM.
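A minimal sketch of the suggested change, assuming a simplified constructor (the real class subclasses `BaseMultiModalProcessor` and takes more arguments):

```python
from typing import Optional


class Qwen2VLMultiModalProcessor:
    """Sketch only: the real class derives from BaseMultiModalProcessor."""

    def __init__(self) -> None:
        # Defined per instance in __init__, so it cannot be mistaken for
        # (or accidentally shared as) a class-level attribute.
        self._placeholder_map: Optional[dict[str, list[int]]] = None
```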
Hello @imkero! Much appreciated that you made this PR! The reason why I haven't spent too much time on Qwen2-VL is that I want to see if there's a way to move M-RoPE inside the model file for Qwen2-VL, since it is so specific to this model. You would also need to change the implementation of … Feel free to take changes from here into this PR.
```python
if not self._placeholder_map:
    # NOTE: Only Qwen2VLProcessor in transformers 4.47.0 has
    # image_token and video_token registered
    encode_fn = hf_processor.tokenizer.encode
    self._placeholder_map = {
        "image": encode_fn(hf_processor.image_token),
        "video": encode_fn(hf_processor.video_token),
    }
placeholder = self._placeholder_map
```
Also, we can set this at initialization time.
```python
encoder_outputs.append((
    encoder_output[0][start_idx:end_idx],  # embedding tensor
    encoder_output[1],  # modality
```
My thought is we don't necessarily need to have the modality key here.
We can leverage the fact that any two mm_positions from any modalities cannot possibly overlap, and now that
vllm/vllm/model_executor/models/utils.py
Lines 408 to 423 in 11d8a09
can apply the embedding replacement based on a list of token ids (so we can simply have
`[self.config.image_token_id, self.config.video_token_id]` here).
Therefore, all we need to do should be just sorting the mm_positions and their corresponding mm_inputs in the following code (which also needs to be modified to support the video modality for Qwen2-VL in this PR):
Lines 51 to 59 in 11d8a09
WDYT?
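A pure-Python sketch of the idea (not vLLM's actual `merge_multimodal_embeddings`): because placeholder ranges from different modalities never overlap, sorting the multimodal items by position lets a single pass replace every placeholder token without tracking the modality key:

```python
def merge_mm_embeddings_sketch(input_ids, inputs_embeds, mm_items,
                               placeholder_token_ids):
    """mm_items: list of (offset, embedding) pairs, possibly unsorted
    across modalities. placeholder_token_ids: e.g.
    [image_token_id, video_token_id]."""
    # Sort by position; non-overlap across modalities makes this safe.
    queue = iter(emb for _, emb in sorted(mm_items, key=lambda it: it[0]))
    merged = list(inputs_embeds)
    for i, tok in enumerate(input_ids):
        if tok in placeholder_token_ids:
            # Replace the placeholder's embedding with the next
            # multimodal embedding, regardless of modality.
            merged[i] = next(queue)
    return merged
```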
On second thought - let me actually work on this design for llava-onevision too.
Hello @imkero! Please feel free to take a look at the updated code in https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava_onevision.py for dealing with multiple modalities. In particular, I think you can pretty much adopt the same code below for Qwen2-VL without changing the interface for the model runner and encoder cache. Let me know if you need any help, and I'm happy to work on this PR as well if you don't have the bandwidth!
vllm/vllm/model_executor/models/llava_onevision.py Lines 547 to 560 in cf5f000
vllm/vllm/model_executor/models/llava_onevision.py Lines 812 to 834 in cf5f000
vllm/vllm/model_executor/models/llava_onevision.py Lines 844 to 846 in cf5f000
@ywang96 Sorry for the late response. I'll continue working on this PR soon.
This pull request has merge conflicts that must be resolved before it can be merged.
Hi, it seems the dummy data in the profile run is not correct, so I printed some values. Could you help me? I would appreciate it, and I hope that Qwen2-VL will be supported by V1 soon. Thank you. Best regards
@baifanxxx I'll start taking a look at this PR tomorrow. @imkero has already done a great job of adding M-RoPE in V1 with torch.compile support, so it shouldn't take us too long to get this PR into a functional stage!
```python
dynamic_arg_dims={
    "input_ids": 0,
    # dim 1 for mrope in shape (3, seq_len), else dim 0 in shape (seq_len, )
    "positions": lambda tensor: tensor.ndim - 1,
```
The value here will be passed through to PyTorch's impl `torch._dynamo.mark_dynamic(tensor, dim)`, and it seems to assume that `dim` is a non-negative integer.
You can do the conversion here:
vllm/vllm/compilation/decorators.py
Line 177 in d06e824
Iterate over the dims, and convert -1 to tensor.ndim - 1.
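A sketch of the suggested conversion (the helper name is hypothetical), assuming `dims` may contain negative indices like `-1`:

```python
def normalize_dynamic_dims(ndim: int, dims: list[int]) -> list[int]:
    # torch._dynamo.mark_dynamic(tensor, dim) appears to expect a
    # non-negative dim, so map e.g. -1 -> ndim - 1 before passing it on.
    return [dim if dim >= 0 else ndim + dim for dim in dims]
```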
Hi, when I run Qwen2-VL (2B) using this PR, I get an error. Is there a problem with profile_run?
@baifanxxx @Zhiy-Zhang Actually I commented out this assertion (… Also I modified the value of encoder_budget, and Qwen2-VL's image processor's …
Thank you very much for your reply. Has this change (“modified the value of encoder_budget, and Qwen2-VL's image processor's …
What's changed:
- `torch.compile` support (M-RoPE uses a 2d position tensor which differs from common RoPE, and they share the same impl in Qwen2 LM's `forward` fn)
- `profile_run` for Qwen2-VL launch in `gpu_model_runner`
- `(embeddings: torch.Tensor, modality: str)` in `gpu_model_runner` for Qwen2-VL
- `image_token` and `video_token` in Qwen2-VL's preprocessing for better performance

This PR should make Qwen2-VL work in V1 with chunked prefill and prefix caching enabled.
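As a toy illustration of the shape difference mentioned above (using a pure-Python stand-in for a tensor, not vLLM code), the `tensor.ndim - 1` rule from the compile decorator picks the seq_len axis in both layouts:

```python
class FakeTensor:
    # Hypothetical stand-in so the example runs without torch.
    def __init__(self, shape):
        self.shape = shape

    @property
    def ndim(self):
        return len(self.shape)


dynamic_dim = lambda tensor: tensor.ndim - 1

# M-RoPE positions have shape (3, seq_len): the dynamic axis is dim 1.
# Common RoPE positions have shape (seq_len,): the dynamic axis is dim 0.
```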