fix inference crashed on v100 with qwen3.5-0.8b by lvhan028 · Pull Request #4420 · InternLM/lmdeploy

lvhan028 · 2026-03-17T10:56:30Z

No description provided.

Copilot

Pull request overview

This PR addresses a V100 (SM70) decoding-time crash observed with Qwen3.5-0.8B by reducing resource usage in the SM70 HeadDim=256 decoding kernel and by improving Qwen3.5 weight export handling when embeddings are tied.

Changes:

- Tune SM70 HeadDim=256 decoding kernel parameters (CTA_S and staging) to reduce shared-memory usage and avoid runtime launch failures.
- For Qwen3.5 export, honor `tie_word_embeddings` by mapping the output head weight to the token embedding weight.
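
The tied-embedding mapping can be sketched as follows. This is a minimal illustration, not lmdeploy's actual reader code: the helper `resolve_output_weight_key` is hypothetical, and the untied default `lm_head.weight` is an assumption. Only the `tie_word_embeddings` config lookup and the embedding key string appear in this PR's diff.

```python
# Hypothetical helper for illustration; not part of lmdeploy.
TOK_EMBEDDINGS_KEY = 'model.language_model.embed_tokens.weight'  # key shown in the diff
DEFAULT_OUTPUT_KEY = 'lm_head.weight'  # assumed name of the untied output head


def resolve_output_weight_key(model_cfg: dict) -> str:
    """Return the checkpoint key the output head weight should be read from."""
    tie_word_embeddings = model_cfg.get('tie_word_embeddings', False)
    if tie_word_embeddings:
        # Tied checkpoints may not store a separate head, so reuse the embedding.
        return TOK_EMBEDDINGS_KEY
    return DEFAULT_OUTPUT_KEY
```

With a tied config the helper returns the embedding key, so an exporter built this way would never look up a head tensor that the checkpoint does not contain.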

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
src/turbomind/kernels/attention/kernel/decoding_sm70_256.cu	Lowers SM70 decoding kernel tile size / stages to reduce shared-memory footprint and prevent V100 launch aborts.
lmdeploy/turbomind/deploy/source_model/qwen.py	Sets `output_weight_key` to embeddings when `tie_word_embeddings` is enabled for Qwen3.5 export.
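
A rough shared-memory estimate shows why a smaller tile and fewer stages can avoid launch failures. The sketch assumes a simplified model in which each pipeline stage holds one CTA_S x HeadDim tile of K and one of V in fp16. The real kernel's layout and padding, and the actual parameter values changed in this PR, are not shown on this page, so the numbers below are purely illustrative. The 96 KB limit is the opt-in maximum shared memory per thread block on compute capability 7.0.

```python
V100_MAX_SMEM_PER_BLOCK = 96 * 1024  # bytes; opt-in maximum per block on SM70


def kv_smem_bytes(cta_s: int, head_dim: int, stages: int, elem_bytes: int = 2) -> int:
    """Bytes for K and V tiles under the simplified model (no padding, no other buffers)."""
    return 2 * cta_s * head_dim * elem_bytes * stages


def fits_on_v100(cta_s: int, head_dim: int, stages: int) -> bool:
    """True if the estimated K/V footprint is within the SM70 per-block limit."""
    return kv_smem_bytes(cta_s, head_dim, stages) <= V100_MAX_SMEM_PER_BLOCK


if __name__ == '__main__':
    # Illustrative (CTA_S, stages) pairs only; not the values from the PR.
    for cta_s, stages in [(64, 3), (32, 2)]:
        used = kv_smem_bytes(cta_s, 256, stages)
        print(f'CTA_S={cta_s} stages={stages}: {used} bytes, fits={fits_on_v100(cta_s, 256, stages)}')
```

Under this model the footprint scales linearly with both CTA_S and the stage count, so halving either one halves the estimate.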

src/turbomind/kernels/attention/kernel/decoding_sm70_256.cu

@@ -12,9 +12,9 @@
 namespace turbomind::attention {

 constexpr int kHeadDim = 256;


lmdeploy/turbomind/deploy/source_model/qwen.py

            self.tok_embeddings_key = 'model.language_model.embed_tokens.weight'
            self.norm_weight_key = 'model.language_model.norm.weight'
-
+        tie_word_embeddings = self.model_cfg.get('tie_word_embeddings', False)


fix inference crashed on v100 with qwen3.5-0.8b

f62990b

Copilot AI review requested due to automatic review settings March 17, 2026 10:56

Copilot started reviewing on behalf of lvhan028 March 17, 2026 10:56

lvhan028 requested a review from lzhangzz March 17, 2026 10:57

Copilot AI reviewed Mar 17, 2026

lvhan028 mentioned this pull request Mar 18, 2026

[Bug] Qwen3.5 Turbomind missing V100 support #4408

Closed

3 tasks

lzhangzz approved these changes Mar 18, 2026

lvhan028 merged commit a30b976 into InternLM:main Mar 18, 2026
10 of 13 checks passed

lvhan028 added the Bug:P1 label Mar 18, 2026

lvhan028 merged 1 commit into InternLM:main from lvhan028:fix-v100
