[Feat] Support llm-compressor AWQ models in TurboMind#4290
lvhan028 merged 6 commits into InternLM:main from
Conversation
…be inferenced in TurboMind
Pull request overview
This PR adds support for AWQ models quantized by llm-compressor in TurboMind by treating them as a variant of the AWQ format. The implementation adds handlers for the "compressed-tensors" quantization format throughout the deployment pipeline.
Changes:
- Added support for parsing llm-compressor's "compressed-tensors" quantization config format
- Implemented weight unpacking logic specific to llm-compressor's packed tensor format
- Added parameter export handlers to convert llm-compressor weights to TurboMind format
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| lmdeploy/turbomind/deploy/converter.py | Adds parsing logic for compressed-tensors config and converts format to 'awq' internally |
| lmdeploy/turbomind/deploy/policy.py | Implements process_compressed_tensor function to unpack llm-compressor quantized weights |
| lmdeploy/turbomind/deploy/parameter.py | Adds QuantWeightCompressorOnly class to handle llm-compressor weight parameter export |
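To make the converter change concrete, here is a minimal sketch of how the "compressed-tensors" quant config can be parsed and mapped onto the AWQ path. This is illustrative only, not the exact `converter.py` code; the function name `parse_compressed_tensors_config` and the returned dict shape are assumptions, while the key names and assertions mirror the diffs reviewed below.

```python
# Hypothetical sketch of the compressed-tensors config parsing described in
# this PR. Key names (config_groups, group_0, format, weights, num_bits,
# type, group_size) follow llm-compressor's config.json layout as shown in
# the review diffs; the helper itself is illustrative.

def parse_compressed_tensors_config(quant_config: dict) -> dict:
    """Extract AWQ-compatible settings from a compressed-tensors config."""
    config_groups = quant_config.get('config_groups') or {}
    group_0 = config_groups.get('group_0') or {}
    fmt = group_0.get('format')
    assert fmt == 'pack-quantized', (
        f'format is {fmt}; only pack-quantized format is supported '
        'when quant_method is compressed-tensors')
    weights = group_0.get('weights') or {}
    assert weights.get('num_bits') == 4 and weights.get('type') == 'int', \
        'only 4 bit integer weight type is supported'
    # downstream, TurboMind treats the model as plain AWQ
    return dict(model_format='awq', group_size=weights.get('group_size'))


cfg = {
    'quant_method': 'compressed-tensors',
    'config_groups': {
        'group_0': {
            'format': 'pack-quantized',
            'weights': {'num_bits': 4, 'type': 'int', 'group_size': 128},
        }
    },
}
print(parse_compressed_tensors_config(cfg))
# {'model_format': 'awq', 'group_size': 128}
```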
```python
_config_groups = quant_config.get('config_groups')
_group_0 = _config_groups.get('group_0')
_format = _group_0.get('format')
assert _format == 'pack-quantized', \
    f'format is {_format}' \
    f'only pack-quantized format is supported' \
    f'when quant_method is compressed-tensors'
_weights = _group_0.get('weights')
```
The code does not handle the case where group_0 might be None or missing. If _config_groups.get('group_0') returns None, the subsequent _group_0.get('format') will raise an AttributeError. Consider adding a check to ensure this nested key exists before accessing it.
Suggested change:

```diff
-_config_groups = quant_config.get('config_groups')
-_group_0 = _config_groups.get('group_0')
+_config_groups = quant_config.get('config_groups') or {}
+_group_0 = _config_groups.get('group_0') or {}
 _format = _group_0.get('format')
 assert _format == 'pack-quantized', \
     f'format is {_format}' \
     f'only pack-quantized format is supported' \
     f'when quant_method is compressed-tensors'
-_weights = _group_0.get('weights')
+_weights = _group_0.get('weights') or {}
```
```python
_config_groups = quant_config.get('config_groups')
_group_0 = _config_groups.get('group_0')
_format = _group_0.get('format')
assert _format == 'pack-quantized', \
    f'format is {_format}' \
    f'only pack-quantized format is supported' \
    f'when quant_method is compressed-tensors'
_weights = _group_0.get('weights')
_group_size = _weights.get('group_size')
_num_bits = _weights.get('num_bits')
_type = _weights.get('type')
assert _num_bits == 4 and _type == 'int', \
    'only 4 bit integer weighttype is supported' \
```
The code does not handle the case where weights might be None or missing. If _group_0.get('weights') returns None, the subsequent calls to _weights.get('group_size'), _weights.get('num_bits'), and _weights.get('type') will raise AttributeError. Consider adding a check to ensure this nested key exists before accessing it.
Suggested change:

```diff
-_config_groups = quant_config.get('config_groups')
-_group_0 = _config_groups.get('group_0')
+_config_groups = quant_config.get('config_groups') or {}
+_group_0 = _config_groups.get('group_0') or {}
 _format = _group_0.get('format')
 assert _format == 'pack-quantized', \
     f'format is {_format}' \
     f'only pack-quantized format is supported' \
     f'when quant_method is compressed-tensors'
-_weights = _group_0.get('weights')
+_weights = _group_0.get('weights') or {}
 _group_size = _weights.get('group_size')
 _num_bits = _weights.get('num_bits')
 _type = _weights.get('type')
-assert _num_bits == 4 and _type == 'int', \
-    'only 4 bit integer weighttype is supported' \
+assert _group_size is not None and _num_bits == 4 \
+    and _type == 'int', \
+    'only 4 bit integer weighttype is supported ' \
```
```python
_num_bits = _weights.get('num_bits')
_type = _weights.get('type')
assert _num_bits == 4 and _type == 'int', \
    'only 4 bit integer weighttype is supported' \
```
The word "weighttype" should be two separate words: "weight type". This improves readability and follows standard English conventions.
Suggested change:

```diff
-    'only 4 bit integer weighttype is supported' \
+    'only 4 bit integer weight type is supported' \
```
```diff
-if model_format in ['awq', 'gptq']:
+if model_format in ['awq', 'gptq', 'compressed-tensors']:
     weight_type = 'int4'
     dtype = 'float16'  # force float16 for GPTQ/AWQ weights
```
The comment refers to "GPTQ/AWQ weights" but now compressed-tensors format is also included in this conditional branch. The comment should be updated to "GPTQ/AWQ/compressed-tensors weights" or more generically "int4 quantized weights" to accurately reflect the code behavior.
Suggested change:

```diff
-dtype = 'float16'  # force float16 for GPTQ/AWQ weights
+dtype = 'float16'  # force float16 for int4 quantized weights
```
```python
    f'format is {_format}' \
    f'only pack-quantized format is supported' \
    f'when quant_method is compressed-tensors'
_weights = _group_0.get('weights')
_group_size = _weights.get('group_size')
_num_bits = _weights.get('num_bits')
_type = _weights.get('type')
assert _num_bits == 4 and _type == 'int', \
    'only 4 bit integer weighttype is supported' \
```
The error message strings are not properly joined and are missing spaces between the concatenated f-strings. This will produce a message like "format is pack-quantizedonly pack-quantized format is supportedwhen quant_method is compressed-tensors" instead of the intended message with proper spacing.
Suggested change:

```diff
-    f'format is {_format}' \
-    f'only pack-quantized format is supported' \
-    f'when quant_method is compressed-tensors'
+    f'format is {_format}. ' \
+    'only pack-quantized format is supported ' \
+    'when quant_method is compressed-tensors'
 _weights = _group_0.get('weights')
 _group_size = _weights.get('group_size')
 _num_bits = _weights.get('num_bits')
 _type = _weights.get('type')
 assert _num_bits == 4 and _type == 'int', \
-    'only 4 bit integer weighttype is supported' \
+    'only 4 bit integer weight type is supported ' \
```
```python
    f'format is {_format}' \
    f'only pack-quantized format is supported' \
    f'when quant_method is compressed-tensors'
_weights = _group_0.get('weights')
_group_size = _weights.get('group_size')
_num_bits = _weights.get('num_bits')
_type = _weights.get('type')
assert _num_bits == 4 and _type == 'int', \
    'only 4 bit integer weighttype is supported' \
```
The error message strings are not properly joined and are missing spaces between the concatenated strings. This will produce a message like "only 4 bit integer weighttype is supportedwhen format is pack-quantized" instead of the intended message with proper spacing.
Suggested change:

```diff
-    f'format is {_format}' \
-    f'only pack-quantized format is supported' \
+    f'format is {_format} ' \
+    f'only pack-quantized format is supported ' \
     f'when quant_method is compressed-tensors'
 _weights = _group_0.get('weights')
 _group_size = _weights.get('group_size')
 _num_bits = _weights.get('num_bits')
 _type = _weights.get('type')
 assert _num_bits == 4 and _type == 'int', \
-    'only 4 bit integer weighttype is supported' \
+    'only 4 bit integer weighttype is supported ' \
```
```python
_config_groups = quant_config.get('config_groups')
_group_0 = _config_groups.get('group_0')
_format = _group_0.get('format')
assert _format == 'pack-quantized', \
    f'format is {_format}' \
    f'only pack-quantized format is supported' \
    f'when quant_method is compressed-tensors'
_weights = _group_0.get('weights')
```
The code does not handle the case where config_groups might be None or missing from the quant_config. If quant_config.get('config_groups') returns None, the subsequent _config_groups.get('group_0') will raise an AttributeError. Consider adding a check to ensure the nested structure exists before accessing it.
Suggested change:

```diff
 _config_groups = quant_config.get('config_groups')
+assert isinstance(_config_groups, dict), \
+    f'invalid quant_config: expected "config_groups" dict, ' \
+    f'got {_config_groups!r}'
 _group_0 = _config_groups.get('group_0')
+assert isinstance(_group_0, dict), \
+    f'invalid quant_config: expected "group_0" dict, ' \
+    f'got {_group_0!r}'
 _format = _group_0.get('format')
 assert _format == 'pack-quantized', \
     f'format is {_format}' \
     f'only pack-quantized format is supported' \
     f'when quant_method is compressed-tensors'
 _weights = _group_0.get('weights')
+assert isinstance(_weights, dict), \
+    f'invalid quant_config: expected "weights" dict, ' \
+    f'got {_weights!r}'
```
cc @zhulinJulia24 we need to add an llm-compressor quantized model to our test cases.
Support llm-compressor AWQ models in TurboMind. (Fix #3917)
Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help it receive feedback more easily. If you do not understand some items, don't worry, just make the pull request and seek help from the maintainers.
Motivation
Enable AWQ models quantized by llm-compressor to be inferenced in TurboMind.
Modification
lmdeploy/lmdeploy/turbomind/deploy/converter.py: add logic that reads llm-compressor's quantization config from config.json.
lmdeploy/lmdeploy/turbomind/deploy/policy.py: add logic that unpacks AWQ weights produced by llm-compressor.
lmdeploy/lmdeploy/turbomind/deploy/parameter.py: add logic that exports llm-compressor weights in TurboMind format.
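To illustrate the unpacking step in policy.py: llm-compressor's "pack-quantized" format stores eight 4-bit integers per int32. The sketch below shows the general idea only; the function name `unpack_int4`, the nibble ordering, and the packed axis are assumptions and may differ from the actual `process_compressed_tensor` implementation.

```python
# Illustrative sketch (not the PR's actual code) of unpacking pack-quantized
# int4 weights: eight signed 4-bit values per int32, low nibble first.
import numpy as np


def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Unpack an int32 array of packed 4-bit values into int8 in [-8, 7]."""
    shifts = np.arange(0, 32, 4, dtype=np.int32)   # 8 nibbles per int32
    vals = (packed[..., None] >> shifts) & 0xF     # extract low-to-high nibbles
    vals = vals.astype(np.int8)
    vals[vals >= 8] -= 16                          # sign-extend 4-bit values
    return vals.reshape(*packed.shape[:-1], -1)    # expand the packed axis


packed = np.array([[0x76543210]], dtype=np.int32)
print(unpack_int4(packed))  # [[0 1 2 3 4 5 6 7]]
```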
Use cases (Optional)
Checklist