
Trace data flow through Qwen3VL submodules #615

Merged
mhs4670go merged 1 commit into Samsung:main from dvsav:validate_side_by_side
Apr 14, 2026

Conversation

@dvsav
Contributor

@dvsav dvsav commented Apr 9, 2026

What

This PR introduces a script for tracing, debugging, and validating a quantized Qwen3VL model.
File: tico/quantization/wrapq/examples/qwen/trace_qwen.py.
The script supports printing a model's inputs and outputs in a structured form, as well as side-by-side comparison of similarly named submodules in the original and quantized models.

Why

We've finished developing quant wrappers for Qwen3VL, and now we need a way to validate the quantized model that allows localizing and debugging any cases of significant divergence between the original (unquantized) model and the quantized one.

Design and Usage

Below are two basic examples of script usage:

Usage Scenario 1

# Basic scenario: print all modules' inputs and outputs and compare them.
python tico/quantization/wrapq/examples/qwen/trace_qwen.py \
    --model ~/models/qwen3-vl-2b

# The same as above, but downloading the model from the Hugging Face Hub.
python tico/quantization/wrapq/examples/qwen/trace_qwen.py \
    --model Qwen/Qwen3-VL-2B-Instruct

Usage Scenario 2

# Don't print outputs; only compare them. Quantization is disabled for now.
# Validation criterion: the difference between submodules' outputs must be close to zero.
python tico/quantization/wrapq/examples/qwen/trace_qwen.py \
    --model ~/models/qwen3-vl-2b \
    --no-trace-unquantized \
    --no-trace-quantized

# Same as above, but with quantization enabled. Shows gradual divergence between
# submodules' outputs due to quantization error accumulation.
python tico/quantization/wrapq/examples/qwen/trace_qwen.py \
    --model ~/models/qwen3-vl-2b \
    --no-trace-unquantized \
    --no-trace-quantized \
    --enable-quantization

The --model command-line argument is required and specifies the model (only Qwen3VL variants are supported) as either a model repository name (e.g. Qwen/Qwen3-VL-2B-Instruct) or a path to a local directory containing the model data (e.g. ~/models/qwen3-vl-2b). In the former case the model is downloaded from the Hugging Face Hub (unless already cached in the default cache directory, e.g. ~/.cache/huggingface/hub/); in the latter case the model is read from the specified local directory.
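The local-directory-vs-Hub resolution described above can be sketched roughly as follows. This is a hypothetical helper, not the script's actual code; the function name and structure are assumptions for illustration:

```python
import os


def resolve_model_source(model_arg: str) -> str:
    """Map a --model argument to a from_pretrained() source (hypothetical helper).

    If the argument expands to an existing local directory, load from there;
    otherwise treat it as a Hub repo id and let transformers download/cache it
    (by default under ~/.cache/huggingface/hub/).
    """
    local_path = os.path.expanduser(model_arg)
    if os.path.isdir(local_path):
        return local_path  # read model data from the local directory
    return model_arg  # fall back to the Hugging Face Hub repo id


print(resolve_model_source("Qwen/Qwen3-VL-2B-Instruct"))
```

The returned string can then be passed to a `from_pretrained()`-style loader either way.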

Output

The script always prints the generated input data for the model (and that input is hard-coded):

********************************************************************************
*                                 MODEL INPUTS                                 *
********************************************************************************

input_ids:
    tensor([[644, 872, 198, 652, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998,
             998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998,
             998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998,
             998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998,
             998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998, 998,
             998, 998, 998, 998, 653, 785, 279, 168,  13, 645, 198, 644,  91, 198]])
    shape: torch.Size([1, 84])
    dtype: torch.int64

attention_mask:
    tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
             1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
             1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
             1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
    shape: torch.Size([1, 84])
    dtype: torch.int64

pixel_values:
    tensor([[-1., -1., -1.,  ..., -1., -1., -1.],
            [-1., -1., -1.,  ..., -1., -1., -1.],
            [-1., -1., -1.,  ..., -1., -1., -1.],
            ...,
            [-1., -1., -1.,  ..., -1., -1., -1.],
            [-1., -1., -1.,  ..., -1., -1., -1.],
            [-1., -1., -1.,  ..., -1., -1., -1.]])
    shape: torch.Size([280, 1536])
    dtype: torch.float32

image_grid_thw:
    tensor([[ 1, 20, 14]])
    shape: torch.Size([1, 3])
    dtype: torch.int64

By default the script prints a detailed description (in JSON format) of the input and output of each submodule of both the original (unquantized) model and the quantized model. You can turn off this output via the --no-trace-unquantized and --no-trace-quantized flags. The submodule's input and output description includes the following:

  • Submodule's name.
  • Submodule's type.
  • Input data (usually a tensor or a tuple of tensors).
  • kwargs (named arguments to the submodule's forward method).
  • Output data (usually a tensor or a class containing tensors).

Here's an example of the output for a single submodule:

{
    "module_name": "model.language_model.embed_tokens",
    "module_type": "Embedding",
    "inputs": {
        "0": {
            "type": "Tensor",
            "dtype": "torch.int64",
            "shape": "torch.Size([1, 84])",
            "statistics": {
                "mean": 903.5714111328125,
                "min": 13.0,
                "max": 998.0,
                "stddev": 241.6049346923828
            }
        },
        "type": "tuple"
    },
    "kwargs": {},
    "output": {
        "type": "Tensor",
        "dtype": "torch.float32",
        "shape": "torch.Size([1, 84, 64])",
        "statistics": {
            "mean": 0.0001286495680687949,
            "min": -0.07198204100131989,
            "max": 0.07216963917016983,
            "stddev": 0.02147550694644451
        }
    }
}

For each tensor in the above output we print the following information:

  • The tensor type.
  • The tensor's element data type.
  • The tensor's shape.
  • Statistics over all of the tensor's elements (min, max, mean, standard deviation).

For brevity, we don't print the actual tensor elements (unless the tensor contains 0 or 1 element).
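A per-tensor description like the JSON sample above could be produced along these lines. This is an illustrative sketch; the helper name and exact field set are assumptions, not the script's actual implementation:

```python
import json

import torch


def describe_tensor(t: torch.Tensor) -> dict:
    """Summarize a tensor without printing its elements (illustrative helper)."""
    # Cast to float so integer tensors (e.g. input_ids) also get float statistics.
    f = t.detach().to(torch.float)
    return {
        "type": "Tensor",
        "dtype": str(t.dtype),
        "shape": str(t.shape),
        "statistics": {
            "mean": f.mean().item(),
            "min": f.min().item(),
            "max": f.max().item(),
            "stddev": f.std().item(),
        },
    }


print(json.dumps(describe_tensor(torch.ones(2, 3)), indent=4))
```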

Unless --no-side-by-side is specified, the script compares (computes the difference between) the outputs of similarly named submodules in the unquantized and quantized models. The difference is usually a tensor, and we only print statistics over its elements (as mentioned above). Here's an example of this "side-by-side" comparison (only the first few and last submodules are shown for brevity):

--------------------------------------------------------------------------------
MODULE NAME                         DIFFERENCE
--------------------------------------------------------------------------------
model.language_model.embed_tokens   {'mean': '0.0', 'min': '0.0', 'max': '0.0', 'stddev': '0.0', 'PEIR': '0.0', 'type': 'dict'}
model.visual.patch_embed.proj       {'mean': '-1.0244548320770264e-08', 'min': '-2.205371856689453e-06', 'max': '1.2069940567016602e-06', 'stddev': '6.208474019331334e-07', 'PEIR': '3.024483349539585e-07', 'type': 'dict'}
...                                 ...
lm_head                             {'mean': '-6.053048728915655e-09', 'min': '-7.580965757369995e-07', 'max': '8.491333574056625e-07', 'stddev': '1.7772102012258983e-07', 'PEIR': '7.257339589703714e-07', 'type': 'dict'}

Large numbers in the difference can indicate issues in the quantized model.
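The PEIR values in the table above (peak error to interval ratio) can be read as the peak absolute difference normalized by the reference tensor's value range. The project's own implementation is compute_peir in tico.quantization.evaluation.metric; the helper below is only a minimal sketch of the idea:

```python
import torch


def peir(lhs: torch.Tensor, rhs: torch.Tensor) -> float:
    """Peak error vs. reference interval: max(|a - b|) / (max(a) - min(a))."""
    abs_delta = (lhs - rhs).abs().to(torch.float)
    interval = (lhs.max() - lhs.min()).item()
    if interval == 0.0:
        return float("nan")  # degenerate reference range; the ratio is undefined
    return abs_delta.max().item() / interval


a = torch.tensor([0.0, 1.0, 2.0])
b = torch.tensor([0.0, 1.5, 2.0])
print(peir(a, b))  # max|a - b| = 0.5, interval = 2.0, so the ratio is 0.25
```

Being normalized, the ratio is comparable across submodules whose outputs live on very different scales.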

Implementation Note

The script is implemented by registering a hook callback function that is called on each submodule during an inference run of the model. The hook can print the inputs and outputs of each submodule and also store the outputs in a dictionary (keyed by submodule name).
Two models are probed this way: the original (unquantized) model and the quantized model.
After that, the outputs of similarly named submodules are compared.
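The global-hook-plus-context-manager approach described above could look roughly like this. This is a simplified sketch; the function name and the class-name keying are assumptions (the real script records richer information and keys the store by qualified submodule names):

```python
import contextlib

import torch
import torch.nn as nn


@contextlib.contextmanager
def trace_all_modules(store: dict):
    """Install one global forward hook covering all submodules; remove it on exit."""

    def hook(module, inputs, output):
        # Store each submodule's output; keyed by class name here for simplicity.
        store[type(module).__name__] = output

    # A single call registers the hook for every nn.Module forward pass.
    handle = torch.nn.modules.module.register_module_forward_hook(hook)
    try:
        yield
    finally:
        handle.remove()  # the single handle is removed automatically for the caller


outputs = {}
model = nn.Sequential(nn.Linear(4, 2))
with trace_all_modules(outputs):
    model(torch.zeros(1, 4))
# outputs now holds the most recent output of every traversed submodule
```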

Detailed Examination of Specific Submodules

You can specify the names of submodules subject to more detailed examination via the --interesting-modules command-line flag. The names are space-separated. Here's an example:

python tico/quantization/wrapq/examples/qwen/trace_qwen.py \
    --model "~/models/qwen3-vl-2b" \
    --interesting-modules model.language_model model.visual

The descriptions of the "interesting" submodules are then printed in more detail. Specifically, not only the tensors' statistics are printed, but also the actual tensor elements (note that the output can become quite verbose).

You can also specify the --breakpoint-on-interesting-modules command-line flag. This makes the script enter debug mode once it encounters any of the specified "interesting" submodules. You can then examine the stack trace and the program state (e.g. examine specific elements of the submodule's input tensor). The breakpoint occurs in the hook callback function, so you'll need to go a few frames up the stack (use the bt, up, and down commands in PDB to navigate) to reach the model source code that called the submodule of interest.

Differences from tico/quantization/wrapq/utils/introspection.py

  • introspection.extract_tensor extracts only the first tensor encountered in the passed output argument; trace_qwen saves and analyzes the full data from a submodule's input and output.
  • introspection has two separate functions for saving (save_fp_outputs) and comparing (compare_layer_outputs) submodules' outputs, while trace_qwen uses a single function, trace_model_input_output, for both of these and any other purposes that involve iterating over submodules' inputs and outputs.
  • introspection only checks the outputs of each submodule and ignores the inputs and kwargs; trace_qwen prints/saves the full information available to a module hook.
  • introspection adds a hook to each submodule individually via m.register_forward_hook, while trace_qwen leverages torch.nn.modules.module.register_module_forward_hook to add a hook to all submodules with a single call.
  • introspection functions return a list of RemovableHandle objects that the caller must remove, while trace_qwen uses a context manager that automatically removes the single handle.

Differences from tico/quantization/wrapq/examples/debug_quant_outputs.py

  • debug_quant_outputs loads a full-fledged model, while trace_qwen loads a much lighter model with a reduced number of layers and attention heads.
  • debug_quant_outputs loads a dataset while trace_qwen uses a single crafted input sample (image + text).

@dvsav dvsav changed the title Trace Qwen3VL tensors flow via forward_hook Trace data flow through Qwen3VL submodules Apr 9, 2026
@dvsav dvsav force-pushed the validate_side_by_side branch 11 times, most recently from bb703b1 to fb817f5 on April 13, 2026 at 11:50
@dvsav dvsav marked this pull request as ready for review April 13, 2026 12:32

# List, Tuple: compare element-wise
if isinstance(lhs, Sequence):
    for i, (lhs_val, rhs_val) in enumerate(zip(lhs, rhs)):
Contributor

Using zip() here will silently drop extra elements if the lengths differ, so mismatches might go unnoticed.

Contributor

if len(lhs) != len(rhs):
    raise ValueError(f"Length mismatch: {len(lhs)} != {len(rhs)}")

for i, (lhs_val, rhs_val) in enumerate(zip(lhs, rhs)):
    ...

Contributor Author
@dvsav dvsav Apr 14, 2026

👍 done

delta_stats: TensorStatistics = get_tensor_statistics(delta)
interval = (lhs.max() - lhs.min()).item()
if interval != 0.0:
    peir = delta_stats.max / interval
Contributor

PEIR = max(|a - b|) / (max(a) - min(a))

delta_stats has absolute value of max?

Contributor Author

👍 Thanks for catching that. I've corrected the code:

abs_delta: torch.Tensor = (lhs - rhs).abs().to(torch.float)

import tico.quantization
import tico.quantization.config.ptq
from tico.quantization.evaluation.metric import compute_peir
from tico.quantization.evaluation.utils import plot_two_outputs
Contributor

Suggested change
from tico.quantization.evaluation.utils import plot_two_outputs

Contributor Author

👍 done

)
else:
    print(
        "[WARNING] Model name {args.model} does not refer to an existing directory. So, we'll try to download the model from huggingface."
Contributor

Suggested change
"[WARNING] Model name {args.model} does not refer to an existing directory. So, we'll try to download the model from huggingface."
f"[WARNING] Model name {args.model} does not refer to an existing directory. So, we'll try to download the model from huggingface."

Contributor Author

👍 done

@dvsav dvsav force-pushed the validate_side_by_side branch 3 times, most recently from f64dabd to 46d167d on April 13, 2026 at 14:08
@mhs4670go
Contributor

Overall, I would not treat this as a completely separate debugging direction, but I also would not try to squeeze all of this logic into the current introspection API as-is.

The existing introspection.py is a generic quant-debug utility focused on wrapper outputs and metric-based FP-vs-QUANT comparison, and debug_quant_outputs.py is a dataset-driven example built on top of that. In contrast, trace_qwen.py is closer to a full tracing/debugging tool: it captures inputs / kwargs / outputs, supports structured serialization, side-by-side module comparison, and interactive debugging for interesting modules.

So my preference would be:

  • keep trace_qwen.py as a Qwen3VL-specific example/debug script,
  • but move the reusable tracing/comparison primitives into introspection.py (or a nearby shared utility module) (this will be a dedicated issue),
  • so that we avoid duplicating debugging infrastructure across scripts.

In short: share the core API, keep the model-specific script separate.

@dvsav
Contributor Author

dvsav commented Apr 14, 2026

In short: share the core API, keep the model-specific script separate.

As you've mentioned in #627,

This refactoring should be done after trace_qwen.py is merged, to avoid blocking the current PR.

So, I'm ready to get to the refactoring right after this PR is merged.

This PR introduces a script for tracing, debugging and validating quantized Qwen3VL model.

TICO-DCO-1.0-Signed-off-by: d.savchenkov <d.savchenkov@partner.samsung.com>
Contributor

@mhs4670go mhs4670go left a comment


LGTM

@mhs4670go mhs4670go merged commit 260eb65 into Samsung:main Apr 14, 2026
7 checks passed
@dvsav dvsav deleted the validate_side_by_side branch April 15, 2026 06:48