
[quantization] Quantize lm_head#631

Open
stamalakhov wants to merge 1 commit into Samsung:main from stamalakhov:GPTQ_lm_head

Conversation

@stamalakhov
Contributor

@stamalakhov stamalakhov commented Apr 14, 2026

This PR quantizes `lm_head` in GPTQ to improve accuracy.

./ccex test --include-internal -k quantization.algorithm.test_gptq

RUN unit tests with -k quantization.algorithm.test_gptq ...
test_gptq_config_validate_rejects_non_positive_weight_bits_override (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_gptq_config_validate_weight_bits_overrides (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_groupwise_conv1d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_groupwise_conv2d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_model (quantization.algorithm.test_gptq.GPTQTest) ... <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute
<frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute
ok
test_net (quantization.algorithm.test_gptq.GPTQTest) ... No specialized wrapper found for ModuleList; applying recursive wrapping.
ok
test_net_on_zero_inputs (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv1d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv1d_with_logits (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv2d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv2d_on_zero_inputs (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv2d_with_logits (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv3d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv3d_on_zero_inputs (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv3d_with_logits (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_paddednormconv2d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_paddednormconv3d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_resolve_weight_bits_priority (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_transposed_conv2d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_transposed_conv2d_with_logits (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_weight_bits_overrides_are_applied_per_module (quantization.algorithm.test_gptq.GPTQTest) ... ok

----------------------------------------------------------------------
Ran 21 tests in 119.973s

OK

Value tests:

HuggingFaceTB/SmolLM2-135M-Instruct:

| Config ID | PPL |
| --- | --- |
| FP32 | 17.40 |
| GPTQ_MSE_w4A16_head4 | 27.74 |
| GPTQ_MSE_w4A16_head_GPTQ_4 | 25.01 |
| GPTQ_SMSE_w4A16_head4 | 27.19 |
| GPTQ_SMSE_w4A16_head_GPTQ_4 | 24.14 |

TinyLlama/TinyLlama-1.1B-Chat-v1.0:

| Config ID | PPL |
| --- | --- |
| FP32 | 7.97 |
| GPTQ_MSE_w4A16_head4 | 8.66 |
| GPTQ_MSE_w4A16_head_GPTQ_4 | 8.54 |
| GPTQ_SMSE_w4A16_head4 | 8.52 |
| GPTQ_SMSE_w4A16_head_GPTQ_4 | 8.42 |

unsloth/Llama-3.2-1B-Instruct:

| Config ID | PPL |
| --- | --- |
| FP32 | 13.17 |
| GPTQ_MSE_w4A16_head4 | 18.59 |
| GPTQ_MSE_w4A16_head_GPTQ_4 | 18.30 |
| GPTQ_SMSE_w4A16_head4 | 15.26 |
| GPTQ_SMSE_w4A16_head_GPTQ_4 | 15.00 |

unsloth/Llama-3.2-3B-Instruct:

| Config ID | PPL |
| --- | --- |
| FP32 | 11.05 |
| GPTQ_MSE_w4A16_head4 | 12.96 |
| GPTQ_MSE_w4A16_head_GPTQ_4 | 12.67 |
| GPTQ_SMSE_w4A16_head4 | 12.31 |
| GPTQ_SMSE_w4A16_head_GPTQ_4 | 12.17 |
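The `head4` configurations keep `lm_head` in plain 4-bit round-to-nearest, while `head_GPTQ_4` runs the head through GPTQ as well. As a rough illustration of the round-to-nearest baseline only (a minimal sketch with hypothetical helper names, not this repo's API or the GPTQ algorithm itself), per-output-channel symmetric 4-bit fake quantization of a head projection looks like:

```python
import torch

def fake_quantize_4bit(w: torch.Tensor) -> torch.Tensor:
    """Per-output-channel symmetric 4-bit round-to-nearest fake quantization."""
    qmax = 7  # int4 symmetric grid: integers in [-8, 7]
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return q * scale  # dequantized ("fake-quantized") weights

# Apply to a stand-in lm_head projection (out_features = vocab size)
lm_head = torch.nn.Linear(64, 128, bias=False)
with torch.no_grad():
    lm_head.weight.copy_(fake_quantize_4bit(lm_head.weight))
```

GPTQ improves on this baseline by adjusting the remaining unquantized columns to compensate for the rounding error, which is what the `head_GPTQ_4` rows measure.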

Related: this serves as a fallback to #624.
TICO-DCO-1.0-Signed-off-by: s.malakhov s.malakhov@partner.samsung.com

@stamalakhov stamalakhov self-assigned this Apr 14, 2026
@stamalakhov stamalakhov force-pushed the GPTQ_lm_head branch 6 times, most recently from d18e64e to 23a40c6 Compare April 15, 2026 08:58
@stamalakhov stamalakhov marked this pull request as ready for review April 15, 2026 08:59
@stamalakhov stamalakhov requested a review from mhs4670go April 15, 2026 09:01
static_groups=gptq_conf.static_groups,
verbose=gptq_conf.verbose,
)
quantizers[f"model.lm_head"] = gptq.quantizer
Contributor

Isn't it like this?

Suggested change
quantizers[f"model.lm_head"] = gptq.quantizer
quantizers[f"lm_head"] = gptq.quantizer

Contributor Author

Ahh, no. Bare `lm_head` is not injected into the PTQ observers; `model.lm_head` is.
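The name the observers see is the full attribute path reported by `named_modules()`, which depends on how the model is wrapped. A quick illustration (hypothetical classes, not this repo's wrappers):

```python
import torch.nn as nn

# Stand-in for a wrapped model where the head lives under a `model` attribute
class Inner(nn.Module):
    def __init__(self):
        super().__init__()
        self.lm_head = nn.Linear(4, 10, bias=False)

class Wrapped(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = Inner()

names = [name for name, _ in Wrapped().named_modules()]
# named_modules() yields full attribute paths: "model.lm_head", not "lm_head"
```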

Contributor

Hmm.. it turns out the injection doesn't work properly. I'll fix this soon. First of all, there is a mismatch in the wrapped fp_name.

Contributor

#638

Sorry for bothering you. Could you review this PR and rebase it later?

Contributor

And, we should re-evaluate the result :(

Contributor Author

> And, we should re-evaluate the result :(

OK.


2 participants