Trace data flow through Qwen3VL submodules#615
Conversation
bb703b1 to fb817f5 (Compare)
```python
# List, Tuple: compare element-wise
if isinstance(lhs, Sequence):
    for i, (lhs_val, rhs_val) in enumerate(zip(lhs, rhs)):
```
Using `zip()` here will silently drop extra elements if the lengths differ, so mismatches might go unnoticed.
```python
if len(lhs) != len(rhs):
    raise ValueError(f"Length mismatch: {len(lhs)} != {len(rhs)}")
for i, (lhs_val, rhs_val) in enumerate(zip(lhs, rhs)):
    ...
```
```python
delta_stats: TensorStatistics = get_tensor_statistics(delta)
interval = (lhs.max() - lhs.min()).item()
if interval != 0.0:
    peir = delta_stats.max / interval
```
PEIR = max(|a - b|) / (max(a) - min(a)). Does `delta_stats.max` hold the maximum of the absolute delta?
👍 Thanks for catching that. I've corrected the code:
```python
abs_delta: torch.Tensor = (lhs - rhs).abs().to(torch.float)
```

```python
import tico.quantization
import tico.quantization.config.ptq
from tico.quantization.evaluation.metric import compute_peir
from tico.quantization.evaluation.utils import plot_two_outputs
```
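Putting the corrected pieces together, the PEIR metric discussed above can be sketched as a standalone pure-Python helper. The real implementation (`compute_peir` in `tico.quantization.evaluation.metric`) operates on tensors; this toy version only illustrates the formula max(|a - b|) / (max(a) - min(a)):

```python
def peir(lhs: list, rhs: list) -> float:
    """Peak-Error-to-Interval Ratio: max(|a - b|) / (max(a) - min(a))."""
    abs_delta = [abs(a - b) for a, b in zip(lhs, rhs)]
    interval = max(lhs) - min(lhs)
    if interval == 0.0:
        return 0.0  # convention assumed here for a constant reference tensor
    return max(abs_delta) / interval
```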
```python
from tico.quantization.evaluation.utils import plot_two_outputs
```
```python
)
else:
    print(
        "[WARNING] Model name {args.model} does not refer to an existing directory. So, we'll try to download the model from huggingface."
```
```diff
-        "[WARNING] Model name {args.model} does not refer to an existing directory. So, we'll try to download the model from huggingface."
+        f"[WARNING] Model name {args.model} does not refer to an existing directory. So, we'll try to download the model from huggingface."
```
f64dabd to 46d167d (Compare)
Overall, I would not treat this as a completely separate debugging direction, but I also would not try to squeeze all of this logic into the current introspection API as-is. So my preference would be:
In short: share the core API, keep the model-specific script separate.
dd31dee to 4a022bd (Compare)
As you've mentioned in #627, I'm ready to get to the refactoring right after this PR is merged.
This PR introduces a script for tracing, debugging and validating a quantized Qwen3VL model.
TICO-DCO-1.0-Signed-off-by: d.savchenkov <d.savchenkov@partner.samsung.com>
4a022bd to f9333fc (Compare)
What

This PR introduces a script for tracing, debugging and validating a quantized Qwen3VL model.

File: `tico/quantization/wrapq/examples/qwen/trace_qwen.py`

The script supports printing out inputs and outputs of a model in a structured form, as well as side-by-side comparison of similarly named submodules in the original and quantized models.
Why

We've finished developing quant wrappers for Qwen3VL, and now we need a way to validate the quantized model in a way that allows localizing and debugging any cases of significant divergence between the original (unquantized) model and the quantized one.
Design and Usage
Below are 2 basic examples of script usage:
Usage Scenario 1
Usage Scenario 2
The `--model` command-line argument is required and specifies the model (only Qwen3VL model variants are supported) as either a model repository name (e.g. `Qwen/Qwen3-VL-2B-Instruct`) or a path to the cache directory containing model data (e.g. `~/models/qwen3-vl-2b`). In the former case the model is downloaded from Hugging Face (unless already cached in the default cache directory like `~/.cache/huggingface/hub/`); in the latter case the model is read from the specified local directory.

Output
The script always prints the generated input data for the model (and that input is hard-coded):
By default the script prints out a detailed description (in JSON format) of the input and output of each submodule of the original (unquantized) model and the quantized model. You can turn off printing that information via the `--no-trace-unquantized` and `--no-trace-quantized` flags. The submodule's input and output description includes, among other things, the `kwargs` (named arguments to the submodule's `forward` method).

Here's an example of the output for a single submodule:
For each tensor in the above output we print the following information:
We don't print the actual tensor elements for brevity (unless the tensor contains 0 or 1 element).
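The exact list of printed fields didn't survive the extraction above, but per-tensor summaries of this kind could plausibly look like the following (field names here are illustrative, not the script's actual `TensorStatistics`):

```python
import math

def tensor_statistics(values: list) -> dict:
    """Illustrative per-tensor summary: stats over the flattened elements."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # population variance
    return {
        "numel": n,
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "std": math.sqrt(var),
    }
```

Printing such a summary instead of the raw elements keeps the trace readable even for large activation tensors.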
Unless `--no-side-by-side` is specified, the script compares (computes the difference between) the outputs of similarly named submodules in the unquantized and quantized models. The difference is usually a tensor, and we only print statistics over its elements (as mentioned above). Here's an example of this "side-by-side" comparison (just the first and last few submodules are shown for brevity):

Large numbers in the difference can be an indicator of issues in the quantized model.
Implementation Note
The script implementation is based on registering a hook callback function that is called on each submodule during the inference run of a model. The hook can print inputs and outputs of each submodule and also store the outputs in a dictionary (where keys are submodule names).
Two models are probed this way: the original (unquantized) model and the quantized model.
After that the outputs of similarly named submodules are compared.
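The mechanism above can be sketched as follows, assuming PyTorch's global forward-hook API (a minimal sketch: the real script keys the store by submodule name and records inputs and kwargs as well):

```python
import contextlib

import torch

@contextlib.contextmanager
def capture_all_outputs(store: dict):
    """Install one global forward hook covering every submodule; auto-remove on exit."""
    def hook(module, args, output):
        # The real script uses the submodule's qualified name as the key;
        # the class name is used here only to keep the sketch self-contained.
        store[module.__class__.__name__] = output
    handle = torch.nn.modules.module.register_module_forward_hook(hook)
    try:
        yield
    finally:
        handle.remove()  # single handle, removed automatically
```

Usage: `with capture_all_outputs(outputs): model(inputs)` fills `outputs` during the forward pass and leaves no hooks behind afterwards.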
Detailed Examination of Specific Submodules
You can specify the names of submodules that are subject to a more detailed examination via the `--interesting-modules` command-line flag. Submodules' names are space-separated. Here's an example:

```shell
python tico/quantization/wrapq/examples/qwen/trace_qwen.py \
    --model "~/models/qwen3-vl-2b" \
    --interesting-modules model.language_model model.visual
```

The descriptions of the "interesting submodules" are then printed with more details. Specifically, not only the tensors' statistics are printed, but also the actual tensors' elements (note that the output can become quite verbose then).
You can also specify the `--breakpoint-on-interesting-modules` command-line flag. This makes the script drop into debug mode once it encounters any of the specified "interesting" submodules. You can then examine the stack trace and the state of the program (e.g. examine specific elements of the submodule's input tensor). The breakpoint occurs in the hook callback function, so you'll need to go a few frames up the stack (use the `bt`, `up` and `down` commands in PDB to navigate) to get to the model's source code that called the submodule of interest.

Differences from `tico/quantization/wrapq/utils/introspection.py`

- `introspection.extract_tensor` extracts just the first tensor encountered in the passed `output` argument; `trace_qwen` saves/analyzes the full data from the submodule's input/output.
- `introspection` has two separate functions for saving (`save_fp_outputs`) and comparing (`compare_layer_outputs`) submodules' outputs, while `trace_qwen` uses a single function `trace_model_input_output` for both of these and any other purpose that implies iterating over submodules' inputs and outputs.
- `introspection` only checks the outputs of each submodule and ignores the input and kwargs; `trace_qwen` prints/saves the full info available to the module hook.
- `introspection` adds a hook to each submodule individually via `m.register_forward_hook`, while `trace_qwen` leverages `torch.nn.modules.module.register_module_forward_hook` to add a hook to all submodules with a single call.
- `introspection` functions return a list of `RemovableHandle`s that need to be removed by the caller, while `trace_qwen` uses a context manager that automatically removes the single handle.

Differences from `tico/quantization/wrapq/examples/debug_quant_outputs.py`

- `debug_quant_outputs` loads a full-fledged model, while `trace_qwen` loads a much lighter model with a reduced number of layers and attention heads.
- `debug_quant_outputs` loads a dataset, while `trace_qwen` uses a single crafted input sample (image + text).
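The "much lighter model" trick mentioned above can be sketched by capping the relevant config fields before instantiating the model. The key names below follow Hugging Face config conventions and are an assumption; the actual script may shrink the config differently:

```python
def shrink_config(cfg: dict, max_layers: int = 2, max_heads: int = 2) -> dict:
    """Cap layer/attention-head counts so a trace run stays cheap (hypothetical keys)."""
    small = dict(cfg)
    for key, cap in (("num_hidden_layers", max_layers),
                     ("num_attention_heads", max_heads)):
        if key in small:
            small[key] = min(small[key], cap)
    return small
```

A shrunk model produces numerically different outputs than the full one, but the submodule structure (and hence the trace) stays representative.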