
Trace data flow through Qwen3VL submodules #615

Merged
mhs4670go merged 1 commit into Samsung:main from dvsav:validate_side_by_side
Apr 14, 2026

Conversation

@dvsav
Contributor

@dvsav dvsav commented Apr 9, 2026

What

This PR introduces a script for tracing, debugging, and validating a quantized Qwen3VL model.
File: tico/quantization/wrapq/examples/qwen/trace_qwen.py.
The script supports printing a model's inputs and outputs in a structured form, as well as side-by-side comparison of similarly named submodules in the original and quantized models.

Why

We've finished developing quant wrappers for Qwen3VL, and now we need a way to validate the quantized model that allows localizing and debugging any cases of significant divergence between the original (unquantized) model and the quantized one.

Design and Usage

Below are two basic examples of script usage:

Usage Scenario 1

# Basic scenario: print all modules' inputs and outputs and compare them.
python tico/quantization/wrapq/examples/qwen/trace_qwen.py \
    --model ~/models/qwen3-vl-2b

# The same as above, but downloading the model from the Hugging Face Hub.
python tico/quantization/wrapq/examples/qwen/trace_qwen.py \
    --model Qwen/Qwen3-VL-2B-Instruct

Usage Scenario 2

# Don't print outputs; only compare them. Quantization is disabled for now.
# Validation criterion: the difference between submodules' outputs must be close to zero.
python tico/quantization/wrapq/examples/qwen/trace_qwen.py \
    --model ~/models/qwen3-vl-2b \
    --no-trace-unquantized \
    --no-trace-quantized

# Same as above, but with quantization enabled. Shows gradual divergence between
# submodules' outputs due to quantization error accumulation.
python tico/quantization/wrapq/examples/qwen/trace_qwen.py \
    --model ~/models/qwen3-vl-2b \
    --no-trace-unquantized \
    --no-trace-quantized \
    --enable-quantization

The --model command-line argument is required and specifies the model (only Qwen3VL variants are supported) as either a model repository name (e.g. Qwen/Qwen3-VL-2B-Instruct) or a path to a local directory containing the model data (e.g. ~/models/qwen3-vl-2b). In the former case the model is downloaded from the Hugging Face Hub (unless already cached in the default cache directory, e.g. ~/.cache/huggingface/hub/); in the latter case the model is read from the specified local directory.
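The local-directory-vs-Hub resolution described above can be sketched roughly as follows. This is a hypothetical helper, not the script's actual code; the function name and structure are assumptions for illustration:

```python
import os


def resolve_model_source(model_arg: str) -> str:
    """Map a --model argument to a from_pretrained() source (hypothetical helper).

    If the argument expands to an existing local directory, load from there;
    otherwise treat it as a Hub repo id and let transformers download/cache it
    (by default under ~/.cache/huggingface/hub/).
    """
    local_path = os.path.expanduser(model_arg)
    if os.path.isdir(local_path):
        return local_path  # read model data from the local directory
    return model_arg  # fall back to the Hugging Face Hub repo id


print(resolve_model_source("Qwen/Qwen3-VL-2B-Instruct"))
```

The returned string can then be passed to a `from_pretrained()`-style loader either way.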

Output

The script always prints the generated input data for the model (and that input is hard-coded):

********************************************************************************
*                                 MODEL INPUTS                                 *
********************************************************************************

input_ids:
    tensor([[644, 872, 198, 652, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998,
             998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998,
             998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998,
             998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998,
             998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998,
             998, 998, 998, 998, 653, 785, 279, 168,  13, 645, 198, 644,  91, 198]])
    shape: torch.Size([1, 84])
    dtype: torch.int64

attention_mask:
    tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
             1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
             1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
             1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
    shape: torch.Size([1, 84])
    dtype: torch.int64

pixel_values:
    tensor([[-1., -1., -1.,  ..., -1., -1., -1.],
            [-1., -1., -1.,  ..., -1., -1., -1.],
            [-1., -1., -1.,  ..., -1., -1., -1.],
            ...,
            [-1., -1., -1.,  ..., -1., -1., -1.],
            [-1., -1., -1.,  ..., -1., -1., -1.],
            [-1., -1., -1.,  ..., -1., -1., -1.]])
    shape: torch.Size([280, 1536])
    dtype: torch.float32

image_grid_thw:
    tensor([[ 1, 20, 14]])
    shape: torch.Size([1, 3])
    dtype: torch.int64

By default the script prints a detailed description (in JSON format) of the input and output of each submodule of both the original (unquantized) model and the quantized model. You can turn off this output via the --no-trace-unquantized and --no-trace-quantized flags. The submodule's input and output description includes the following:

  • Submodule's name.
  • Submodule's type.
  • Input data (usually a tensor or a tuple of tensors).
  • kwargs (named arguments to the submodule's forward method).
  • Output data (usually a tensor or a class containing tensors).

Here's an example of the output for a single submodule:

{
    "module_name": "model.language_model.embed_tokens",
    "module_type": "Embedding",
    "inputs": {
        "0": {
            "type": "Tensor",
            "dtype": "torch.int64",
            "shape": "torch.Size([1, 84])",
            "statistics": {
                "mean": 903.5714111328125,
                "min": 13.0,
                "max": 998.0,
                "stddev": 241.6049346923828
            }
        },
        "type": "tuple"
    },
    "kwargs": {},
    "output": {
        "type": "Tensor",
        "dtype": "torch.float32",
        "shape": "torch.Size([1, 84, 64])",
        "statistics": {
            "mean": 0.0001286495680687949,
            "min": -0.07198204100131989,
            "max": 0.07216963917016983,
            "stddev": 0.02147550694644451
        }
    }
}

For each tensor in the above output we print the following information:

  • The tensor type.
  • The tensor's element data type.
  • The tensor's shape.
  • Statistics over all of the tensor's elements (min, max, mean, standard deviation).

For brevity, we don't print the actual tensor elements (unless the tensor contains 0 or 1 element).
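A per-tensor description like the JSON sample above could be produced along these lines. This is an illustrative sketch; the helper name and exact field set are assumptions, not the script's actual implementation:

```python
import json

import torch


def describe_tensor(t: torch.Tensor) -> dict:
    """Summarize a tensor without printing its elements (illustrative helper)."""
    # Cast to float so integer tensors (e.g. input_ids) also get float statistics.
    f = t.detach().to(torch.float)
    return {
        "type": "Tensor",
        "dtype": str(t.dtype),
        "shape": str(t.shape),
        "statistics": {
            "mean": f.mean().item(),
            "min": f.min().item(),
            "max": f.max().item(),
            "stddev": f.std().item(),
        },
    }


print(json.dumps(describe_tensor(torch.ones(2, 3)), indent=4))
```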

Unless --no-side-by-side is specified, the script compares (computes the difference between) the outputs of similarly named submodules in the unquantized and quantized models. The difference is usually a tensor, and we only print statistics over its elements (as mentioned above). Here's an example of this "side-by-side" comparison (only the first few and last submodules are shown for brevity):

--------------------------------------------------------------------------------
MODULE NAME                         DIFFERENCE
--------------------------------------------------------------------------------
model.language_model.embed_tokens   {'mean': '0.0', 'min': '0.0', 'max': '0.0', 'stddev': '0.0', 'PEIR': '0.0', 'type': 'dict'}
model.visual.patch_embed.proj       {'mean': '-1.0244548320770264e-08', 'min': '-2.205371856689453e-06', 'max': '1.2069940567016602e-06', 'stddev': '6.208474019331334e-07', 'PEIR': '3.024483349539585e-07', 'type': 'dict'}
...                                 ...
lm_head                             {'mean': '-6.053048728915655e-09', 'min': '-7.580965757369995e-07', 'max': '8.491333574056625e-07', 'stddev': '1.7772102012258983e-07', 'PEIR': '7.257339589703714e-07', 'type': 'dict'}

Large numbers in the difference can indicate issues in the quantized model.
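The PEIR values in the table above (peak error to interval ratio) can be read as the peak absolute difference normalized by the reference tensor's value range. The project's own implementation is compute_peir in tico.quantization.evaluation.metric; the helper below is only a minimal sketch of the idea:

```python
import torch


def peir(lhs: torch.Tensor, rhs: torch.Tensor) -> float:
    """Peak error vs. reference interval: max(|a - b|) / (max(a) - min(a))."""
    abs_delta = (lhs - rhs).abs().to(torch.float)
    interval = (lhs.max() - lhs.min()).item()
    if interval == 0.0:
        return float("nan")  # degenerate reference range; the ratio is undefined
    return abs_delta.max().item() / interval


a = torch.tensor([0.0, 1.0, 2.0])
b = torch.tensor([0.0, 1.5, 2.0])
print(peir(a, b))  # max|a - b| = 0.5, interval = 2.0, so the ratio is 0.25
```

Being normalized, the ratio is comparable across submodules whose outputs live on very different scales.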

Implementation Note

The script is implemented by registering a hook callback function that is called on each submodule during an inference run of the model. The hook can print the inputs and outputs of each submodule and also store the outputs in a dictionary (keyed by submodule name).
Two models are probed this way: the original (unquantized) model and the quantized model.
After that, the outputs of similarly named submodules are compared.
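The global-hook-plus-context-manager approach described above could look roughly like this. This is a simplified sketch; the function name and the class-name keying are assumptions (the real script records richer information and keys the store by qualified submodule names):

```python
import contextlib

import torch
import torch.nn as nn


@contextlib.contextmanager
def trace_all_modules(store: dict):
    """Install one global forward hook covering all submodules; remove it on exit."""

    def hook(module, inputs, output):
        # Store each submodule's output; keyed by class name here for simplicity.
        store[type(module).__name__] = output

    # A single call registers the hook for every nn.Module forward pass.
    handle = torch.nn.modules.module.register_module_forward_hook(hook)
    try:
        yield
    finally:
        handle.remove()  # the single handle is removed automatically for the caller


outputs = {}
model = nn.Sequential(nn.Linear(4, 2))
with trace_all_modules(outputs):
    model(torch.zeros(1, 4))
# outputs now holds the most recent output of every traversed submodule
```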

Detailed Examination of Specific Submodules

You can specify the names of submodules subject to more detailed examination via the --interesting-modules command-line flag. The names are space-separated. Here's an example:

python tico/quantization/wrapq/examples/qwen/trace_qwen.py \
    --model "~/models/qwen3-vl-2b" \
    --interesting-modules model.language_model model.visual

The descriptions of the "interesting" submodules are then printed in more detail. Specifically, not only the tensors' statistics are printed, but also the actual tensor elements (note that the output can become quite verbose).

You can also specify the --breakpoint-on-interesting-modules command-line flag. This makes the script enter debug mode once it encounters any of the specified "interesting" submodules. You can then examine the stack trace and the program state (e.g. examine specific elements of the submodule's input tensor). The breakpoint occurs in the hook callback function, so you'll need to go a few frames up the stack (use the bt, up, and down commands in PDB to navigate) to reach the model source code that called the submodule of interest.

Differences from tico/quantization/wrapq/utils/introspection.py

  • introspection.extract_tensor extracts only the first tensor encountered in the passed output argument; trace_qwen saves and analyzes the full data from a submodule's input and output.
  • introspection has two separate functions for saving (save_fp_outputs) and comparing (compare_layer_outputs) submodules' outputs, while trace_qwen uses a single function, trace_model_input_output, for both of these and any other purposes that involve iterating over submodules' inputs and outputs.
  • introspection only checks the outputs of each submodule and ignores the inputs and kwargs; trace_qwen prints/saves the full information available to a module hook.
  • introspection adds a hook to each submodule individually via m.register_forward_hook, while trace_qwen leverages torch.nn.modules.module.register_module_forward_hook to add a hook to all submodules with a single call.
  • introspection functions return a list of RemovableHandle objects that the caller must remove, while trace_qwen uses a context manager that automatically removes the single handle.

Differences from tico/quantization/wrapq/examples/debug_quant_outputs.py

  • debug_quant_outputs loads a full-fledged model, while trace_qwen loads a much lighter model with a reduced number of layers and attention heads.
  • debug_quant_outputs loads a dataset while trace_qwen uses a single crafted input sample (image + text).

@dvsav dvsav changed the title Trace Qwen3VL tensors flow via forward_hook Trace data flow through Qwen3VL submodules Apr 9, 2026
@dvsav dvsav force-pushed the validate_side_by_side branch 11 times, most recently from bb703b1 to fb817f5 on April 13, 2026 at 11:50
@dvsav dvsav marked this pull request as ready for review April 13, 2026 12:32

# List, Tuple: compare element-wise
if isinstance(lhs, Sequence):
    for i, (lhs_val, rhs_val) in enumerate(zip(lhs, rhs)):
Contributor

Using zip() here will silently drop extra elements if the lengths differ, so mismatches might go unnoticed.

Contributor

if len(lhs) != len(rhs):
    raise ValueError(f"Length mismatch: {len(lhs)} != {len(rhs)}")

for i, (lhs_val, rhs_val) in enumerate(zip(lhs, rhs)):
    ...

Contributor Author
@dvsav dvsav Apr 14, 2026

👍 done

delta_stats: TensorStatistics = get_tensor_statistics(delta)
interval = (lhs.max() - lhs.min()).item()
if interval != 0.0:
    peir = delta_stats.max / interval
Contributor

PEIR = max(|a - b|) / (max(a) - min(a))

delta_stats has absolute value of max?

Contributor Author

👍 Thanks for catching that. I've corrected the code:

abs_delta: torch.Tensor = (lhs - rhs).abs().to(torch.float)

import tico.quantization
import tico.quantization.config.ptq
from tico.quantization.evaluation.metric import compute_peir
from tico.quantization.evaluation.utils import plot_two_outputs
Contributor

Suggested change
from tico.quantization.evaluation.utils import plot_two_outputs

Contributor Author

👍 done

)
else:
    print(
        "[WARNING] Model name {args.model} does not refer to an existing directory. So, we'll try to download the model from huggingface."
Contributor

Suggested change
"[WARNING] Model name {args.model} does not refer to an existing directory. So, we'll try to download the model from huggingface."
f"[WARNING] Model name {args.model} does not refer to an existing directory. So, we'll try to download the model from huggingface."

Contributor Author

👍 done

@dvsav dvsav force-pushed the validate_side_by_side branch 3 times, most recently from f64dabd to 46d167d on April 13, 2026 at 14:08
@mhs4670go
Contributor

Overall, I would not treat this as a completely separate debugging direction, but I also would not try to squeeze all of this logic into the current introspection API as-is.

The existing introspection.py is a generic quant-debug utility focused on wrapper outputs and metric-based FP-vs-QUANT comparison, and debug_quant_outputs.py is a dataset-driven example built on top of that. In contrast, trace_qwen.py is closer to a full tracing/debugging tool: it captures inputs / kwargs / outputs, supports structured serialization, side-by-side module comparison, and interactive debugging for interesting modules.

So my preference would be:

  • keep trace_qwen.py as a Qwen3VL-specific example/debug script,
  • but move the reusable tracing/comparison primitives into introspection.py (or a nearby shared utility module) (this will be a dedicated issue),
  • so that we avoid duplicating debugging infrastructure across scripts.

In short: share the core API, keep the model-specific script separate.

@dvsav
Contributor Author

dvsav commented Apr 14, 2026

In short: share the core API, keep the model-specific script separate.

As you've mentioned in #627,

This refactoring should be done after trace_qwen.py is merged, to avoid blocking the current PR.

So, I'm ready to get to the refactoring right after this PR is merged.

This PR introduces a script for tracing, debugging and validating quantized Qwen3VL model.

TICO-DCO-1.0-Signed-off-by: d.savchenkov <d.savchenkov@partner.samsung.com>
Contributor

@mhs4670go mhs4670go left a comment


LGTM

@mhs4670go mhs4670go merged commit 260eb65 into Samsung:main Apr 14, 2026
7 checks passed
@dvsav dvsav deleted the validate_side_by_side branch April 15, 2026 06:48