
Support llm-compressor symmetric quantized model inference in TurboMind#4305

Merged
lvhan028 merged 5 commits into InternLM:main from 43758726:add/awq_symmetric
Feb 2, 2026

Conversation

@43758726
Collaborator

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and easier to review. If you do not understand some items, don't worry: just open the pull request and ask the maintainers for help.

Motivation

Enable TurboMind inference for symmetric AWQ/GPTQ models quantized by llm-compressor.
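For background, a minimal sketch of why a constant zero-point works here (illustrative code, not part of lmdeploy): in symmetric quantization llm-compressor stores only scales, while kernels built for asymmetric layouts dequantize as `(q - zp) * scale`. Shifting symmetric int4 values into unsigned storage makes the implicit zero-point exactly `2**(bits-1)`, i.e. 8 for 4-bit.

```python
import torch

def symmetric_quant(w: torch.Tensor, bits: int = 4):
    """Symmetric quantization stored as unsigned ints with an implicit zero-point."""
    qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    zp = 2 ** (bits - 1)                  # 8: shift into [0, 15] for storage
    return (q + zp).to(torch.uint8), scale, zp

def dequant(q: torch.Tensor, scale: torch.Tensor, zp: int):
    # Same formula an asymmetric kernel uses; zp is just a constant here.
    return (q.to(torch.float32) - zp) * scale

w = torch.tensor([-0.7, -0.1, 0.0, 0.35, 0.7])
q, scale, zp = symmetric_quant(w)
w_hat = dequant(q, scale, zp)
# Round-trip error is bounded by half a quantization step
assert torch.allclose(w, w_hat, atol=scale.item() / 2 + 1e-6)
```

This is why the fix below can synthesize a tensor filled with 8 when `weight_zero_point` is absent from a symmetric checkpoint.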

Modification

lmdeploy/lmdeploy/turbomind/deploy/parameter.py: add logic to initialize weight_zero_point for modules when the model quantized by llm-compressor is symmetric.

Use cases (Optional)

from lmdeploy import pipeline, TurbomindEngineConfig
engine_config = TurbomindEngineConfig()
pipe = pipeline("{awq/gptq model path quantized by llm-compressor}",
                backend_config=engine_config)
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

Copilot AI review requested due to automatic review settings January 28, 2026 14:09
Contributor

Copilot AI left a comment


Pull request overview

Adds support for symmetric AWQ/GPTQ models quantized by llm-compressor when running inference with TurboMind by ensuring a zeros (zero-point) tensor exists for compressed weights.

Changes:

  • Add a fallback path to generate weight_zero_point tensors when missing (intended for symmetric quantized compressed-tensors models).
  • Adjust Parameter.take() to return the matched key list, enabling CompressedWeight to detect whether weight_zero_point exists.
  • Update get_params() wiring to pass matched keys into CompressedWeight.
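The wiring described above can be sketched as follows (a hypothetical illustration with simplified names, not the actual lmdeploy `Parameter`/`CompressedWeight` classes): `take()` reports which keys it actually matched, so the caller can detect a missing `weight_zero_point` and fall back to a generated one for symmetric models.

```python
import torch

def take(params: dict, keys: list):
    """Return (tensors, matched_keys) so callers can see what was found."""
    matched = [k for k in keys if k in params]
    return [params[k] for k in matched], matched

def load_compressed_weight(params: dict):
    wanted = ['weight_packed', 'weight_scale', 'weight_zero_point']
    tensors, matched = take(params, wanted)
    out = dict(zip(matched, tensors))
    if 'weight_zero_point' not in matched:
        # Symmetric checkpoint: synthesize a constant zero-point (8 for
        # 4-bit), shaped like the per-group scale tensor.
        scale = out['weight_scale']
        out['weight_zero_point'] = torch.full(scale.shape, 8, dtype=torch.uint8)
    return out
```

The design choice is that the fallback lives in the loader rather than requiring every symmetric checkpoint to ship an all-8 tensor on disk.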


Comment on lines +35 to +40

import torch
weight_shapes = g('weight_shape')
result = []
for weight_shape in weight_shapes:
    row, col = weight_shape
    tensor = torch.full((row, col // 128), 8, dtype=torch.uint8)

Copilot AI Jan 28, 2026


generate_zero_point() relies on g('weight_shape'), but weight_shape is not a supported kind in the input policies (e.g. process_compressed_tensor) and many readers (e.g. LlamaReader._attn) use dict.get, so this can return None or a tensor of shape values. In either case, row, col = weight_shape will produce tensors/None and torch.full((row, col // 128), ...) will raise at runtime. Prefer deriving the zeros tensor shapes from g('weight_scale') (or g('weight_packed')) which are guaranteed present for CompressedWeight, and allocate on the same device as the other tensors.

Suggested change

- import torch
- weight_shapes = g('weight_shape')
- result = []
- for weight_shape in weight_shapes:
-     row, col = weight_shape
-     tensor = torch.full((row, col // 128), 8, dtype=torch.uint8)
+ weight_scales = g('weight_scale')
+ # Normalize to a tuple of tensors for uniform handling
+ if isinstance(weight_scales, torch.Tensor):
+     weight_scales = (weight_scales, )
+ result = []
+ for scale in weight_scales:
+     # Match the shape and device of the corresponding scale tensor
+     tensor = torch.full(scale.shape, 8, dtype=torch.uint8, device=scale.device)
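A runnable check of the suggested fallback (illustrative, standalone code): the zeros tensor mirrors the scale tensor's shape, dtype, and device, instead of depending on a `weight_shape` key that may be absent from the reader's outputs.

```python
import torch

def generate_zero_point_from_scales(weight_scales):
    """Build constant zero-point tensors (value 8 for 4-bit symmetric)
    matching each scale tensor's shape and device."""
    if isinstance(weight_scales, torch.Tensor):
        weight_scales = (weight_scales, )
    return [torch.full(s.shape, 8, dtype=torch.uint8, device=s.device)
            for s in weight_scales]

# e.g. a 4096-wide dimension with group size 128 yields 32 scale groups
scales = torch.rand(4096, 32)
(zp, ) = generate_zero_point_from_scales(scales)
assert zp.shape == scales.shape and zp.dtype == torch.uint8
```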



def generate_zero_point(g):
    import torch

Copilot AI Jan 28, 2026


This import of module torch is redundant, as it was previously imported on line 5.
This import of module lmdeploy.pytorch.check_env.torch is redundant, as it was previously imported on line 5.

Suggested change
import torch

@lvhan028 lvhan028 changed the title from "[Add] make llm-compressor symmetric model inference in TurboMind" to "Support llm-compressor symmetric quantized model inference in TurboMind" on Feb 2, 2026
@lvhan028 lvhan028 merged commit 809d114 into InternLM:main Feb 2, 2026
4 of 5 checks passed
@lvhan028 lvhan028 added the enhancement New feature or request label Feb 2, 2026
3 participants