Support llm-compressor symmetric quantized model inference in TurboMind #4305
lvhan028 merged 5 commits into InternLM:main from
Conversation
Pull request overview
Adds support for symmetric AWQ/GPTQ models quantized by llm-compressor when running inference with TurboMind by ensuring a zeros (zero-point) tensor exists for compressed weights.
Changes:
- Add a fallback path to generate `weight_zero_point` tensors when missing (intended for symmetric quantized compressed-tensors models).
- Adjust `Parameter.take()` to return the matched key list, enabling `CompressedWeight` to detect whether `weight_zero_point` exists.
- Update `get_params()` wiring to pass matched keys into `CompressedWeight`.
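For context on why a constant zero point is valid here, a quick arithmetic sketch: in symmetric quantization the zero point sits at the midpoint of the unsigned integer range, so for 4-bit weights it is fixed at 8. The names and values below are illustrative, not taken from the PR:

```python
# Symmetric 4-bit dequantization: real = (q - zero_point) * scale.
# In a symmetric scheme the zero point is the midpoint of the unsigned
# range [0, 2**bits - 1], i.e. 2**(bits - 1) == 8 for 4-bit weights.
bits = 4
zero_point = 2 ** (bits - 1)   # 8
scale = 0.5                    # illustrative per-group scale
quantized = [3, 8, 13]         # illustrative stored 4-bit values
dequantized = [(q - zero_point) * scale for q in quantized]
print(dequantized)  # [-2.5, 0.0, 2.5]
```

This is why the fallback can fill the missing `weight_zero_point` tensor with a constant 8 rather than reading it from the checkpoint.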
```python
import torch
weight_shapes = g('weight_shape')
result = []
for weight_shape in weight_shapes:
    row, col = weight_shape
    tensor = torch.full((row, col // 128), 8, dtype=torch.uint8)
```
`generate_zero_point()` relies on `g('weight_shape')`, but `weight_shape` is not a supported kind in the input policies (e.g. `process_compressed_tensor`) and many readers (e.g. `LlamaReader._attn`) use `dict.get`, so this can return `None` or a tensor of shape values. In either case, `row, col = weight_shape` will produce tensors/`None` and `torch.full((row, col // 128), ...)` will raise at runtime. Prefer deriving the zeros tensor shapes from `g('weight_scale')` (or `g('weight_packed')`), which are guaranteed present for `CompressedWeight`, and allocate on the same device as the other tensors.
Suggested change:
```diff
 import torch
-weight_shapes = g('weight_shape')
-result = []
-for weight_shape in weight_shapes:
-    row, col = weight_shape
-    tensor = torch.full((row, col // 128), 8, dtype=torch.uint8)
+weight_scales = g('weight_scale')
+# Normalize to a tuple of tensors for uniform handling
+if isinstance(weight_scales, torch.Tensor):
+    weight_scales = (weight_scales, )
+result = []
+for scale in weight_scales:
+    # Match the shape and device of the corresponding scale tensor
+    tensor = torch.full(scale.shape, 8, dtype=torch.uint8, device=scale.device)
```
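Assembled as a standalone function, the suggested approach might look like the sketch below. The `g` accessor is simulated here with a plain `dict.get`, and the function body is an illustration based on the suggestion above, not the merged code:

```python
import torch

def generate_zero_point(g):
    """Sketch of the suggested fallback: build constant zero-point tensors
    (value 8, the symmetric midpoint for 4-bit weights) shaped and placed
    like the weight_scale tensors."""
    weight_scales = g('weight_scale')
    # Normalize to a tuple of tensors for uniform handling
    if isinstance(weight_scales, torch.Tensor):
        weight_scales = (weight_scales, )
    result = []
    for scale in weight_scales:
        # Match the shape and device of the corresponding scale tensor
        result.append(
            torch.full(scale.shape, 8, dtype=torch.uint8, device=scale.device))
    return result

# Minimal usage with a dict-backed accessor standing in for the real reader
params = {'weight_scale': torch.ones(128, 4)}
zeros = generate_zero_point(params.get)
```

Deriving the shape from `weight_scale` sidesteps the unsupported `weight_shape` kind entirely and keeps the zeros tensor on the same device as the rest of the quantized weights.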
```python
def generate_zero_point(g):
    import torch
```
Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help it receive feedback more easily. If you do not understand some items, don't worry; just open the pull request and seek help from maintainers.
Motivation
Enable symmetric AWQ/GPTQ models quantized by llm-compressor to run inference in TurboMind.
Modification
`lmdeploy/lmdeploy/turbomind/deploy/parameter.py`: add handling for modules that need `weight_zero_point` initialized when a model quantized by llm-compressor is symmetric.
Use cases (Optional)
Checklist