[V1] Initial support of multimodal models for V1 re-arch#10699
[V1] Initial support of multimodal models for V1 re-arch#10699ywang96 merged 31 commits intovllm-project:mainfrom ywang96:v1-initial
Conversation
|
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
🚀 |
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
DarkLight1337
left a comment
There was a problem hiding this comment.
Model changes look good to me overall. We can consider passing the placeholder indices to get_input_embeddings so that the relationship between input_ids and multimodal_embeddings becomes more explicit.
We should add tests to ensure that these models actually work in V1.
Hmm I don't think that alone will help actually because V1 assumes chunked prefill by default. Unless we're passing additional information such as computed tokens per request, the model runner doesn't know the original indices of |
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
WoosukKwon
left a comment
There was a problem hiding this comment.
LGTM! It'd be nice if @DarkLight1337 can also take another look.
|
This pull request has merge conflicts that must be resolved before it can be |
Signed-off-by: Roger Wang <ywang@roblox.com>
|
It seems that we have tests related to using image embeddings as input, indicating usage of this feature from the community (which will be a breaking change in V1 since we cannot take batched image embedding directly as input), thus I'm separating Qwen2VL out to unblock this PR. |
Signed-off-by: Roger Wang <ywang@roblox.com>
Sorry for the late reply, it looks good to me. |
…t#10699) Signed-off-by: Roger Wang <ywang@roblox.com>
This PR adds V1 support for selected image language models
Qwen2-VL (missing MRope implementation, will address in a separate PR)separated to [V1] Add V1 support of Qwen2-VL #12128