[quantization] Fix attention_mask computation in QuantQwen3VLTextModel #630
Open

dvsav wants to merge 1 commit into Samsung:main from
This PR fixes divergence between the original Qwen3VL model and the wrapped one by correcting attention_mask computation in QuantQwen3VLTextModel.

TICO-DCO-1.0-Signed-off-by: d.savchenkov <d.savchenkov@partner.samsung.com>
This PR fixes the divergence between the original Qwen3VL model and the wrapped one (after `tico.prepare`).

Symptoms

Divergence between the original (unquantized) model and the wrapped model (after `tico.prepare`). The divergence was detected by the `tico/quantization/wrapq/examples/qwen/trace_qwen.py` script. In the trace below, most submodules are skipped for brevity. As you can see, the difference between the submodules' outputs takes a large leap at the `model.language_model.layers.0.self_attn.o_proj` submodule (PEIR grows from 7.13e-07 to 0.57).
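For reference, here is a minimal plain-Python sketch of a PEIR-style metric, assuming PEIR here means a peak-error-to-interval ratio (maximum absolute element-wise difference between two outputs, normalized by the reference output's value range); the helper name `peir` is hypothetical, not part of TICO:

```python
def peir(reference, candidate):
    # Peak error: largest element-wise absolute difference.
    peak_error = max(abs(r - c) for r, c in zip(reference, candidate))
    # Interval: value range (max - min) of the reference output.
    interval = max(reference) - min(reference)
    return peak_error / interval

reference = [0.0, 0.5, 1.0, 2.0]
healthy   = [0.0, 0.5000004, 1.0, 2.0]  # float32-level rounding noise
diverged  = [0.0, 1.64, 1.0, 2.0]       # one element far off

print(peir(reference, healthy))   # ~2e-07
print(peir(reference, diverged))  # ~0.57
```

A metric of this shape explains why a single badly masked attention output shows up as a jump of many orders of magnitude in the per-module trace.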
Root Cause Analysis

Debugging was aided by the `trace_qwen.py` script:

```
python tico/quantization/wrapq/examples/qwen/trace_qwen.py \
    --model "Qwen/Qwen3-VL-2B-Instruct" \
    --no-trace-unquantized \
    --no-trace-quantized \
    --interesting-modules model.language_model.layers.0.self_attn.o_proj \
    --breakpoint-on-interesting-modules
```

Debugging localized the root cause of the divergence: the `attention_mask` used in `Qwen3VLTextAttention.forward` differed from the one used in `QuantQwen3VLTextAttention.forward`.

In the original (unquantized) model, the attention mask obtains its value in `Qwen3VLTextModel`:

In the current implementation of `QuantQwen3VLTextModel`, a similar mask computation is conditional:

What `create_causal_mask` returns

`create_causal_mask` calls several more functions:

The Fix
Since we are also aiming to create a causal mask (not a sliding-window mask and not a chunked attention mask, whose computation wouldn't be convertible to Circle), we can do that unconditionally:
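For intuition, a standalone plain-Python sketch (an illustration, not TICO's or Hugging Face's actual implementation) of the additive causal mask such a computation produces: 0.0 where query position q may attend to key position k (k ≤ q), and -inf where it may not, so that adding the mask to the attention scores before softmax suppresses future positions:

```python
NEG_INF = float("-inf")

def causal_mask(seq_len):
    # Row q holds the mask for query position q: keys up to and including
    # q are allowed (0.0); later keys are blocked (-inf).
    return [
        [0.0 if key <= query else NEG_INF for key in range(seq_len)]
        for query in range(seq_len)
    ]

for row in causal_mask(4):
    print(row)
```

Because this lower-triangular pattern depends only on the sequence length, it can be built unconditionally, unlike sliding-window or chunked variants that add data-dependent structure.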
Tracing Submodules' Divergence After The Fix

After the fix, the divergence at `model.language_model.layers.0.self_attn.o_proj` drops to normal values (PEIR 4.72e-07). As you can see, the PEIR now stays around the order of 1e-07.

Unit Tests

Conversion to Circle
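As one possible shape for such a test (a hypothetical sketch; `assert_outputs_close`, `reference_output`, and `wrapped_output` are stand-in names, not the actual test code), a parity check can run the same input through both models and assert that their outputs agree within a small peak-error tolerance:

```python
def assert_outputs_close(reference, candidate, tol=1e-5):
    # Fail if the peak error, normalized by the reference's value range,
    # exceeds the tolerance (the same idea as the PEIR values in the trace).
    peak = max(abs(r - c) for r, c in zip(reference, candidate))
    interval = max(reference) - min(reference)
    ratio = peak / interval
    assert ratio < tol, f"PEIR {ratio:.2e} exceeds {tol:.0e}"

# Stand-in outputs: after the fix, the wrapped model should match the
# reference up to float32 rounding noise.
reference_output = [0.11, -0.42, 0.93, 0.07]
wrapped_output   = [0.11000003, -0.42000001, 0.93000002, 0.07]
assert_outputs_close(reference_output, wrapped_output)
```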