
[quantization] Quantize lm_head#631

Open
stamalakhov wants to merge 1 commit into Samsung:main from stamalakhov:GPTQ_lm_head

Conversation

@stamalakhov
Contributor

@stamalakhov stamalakhov commented Apr 14, 2026

This PR quantizes `lm_head` in GPTQ to improve accuracy.

./ccex test --include-internal -k quantization.algorithm.test_gptq

RUN unit tests with -k quantization.algorithm.test_gptq ...
test_gptq_config_validate_rejects_non_positive_weight_bits_override (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_gptq_config_validate_weight_bits_overrides (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_groupwise_conv1d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_groupwise_conv2d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_model (quantization.algorithm.test_gptq.GPTQTest) ... <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute
<frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute
ok
test_net (quantization.algorithm.test_gptq.GPTQTest) ... No specialized wrapper found for ModuleList; applying recursive wrapping.
ok
test_net_on_zero_inputs (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv1d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv1d_with_logits (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv2d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv2d_on_zero_inputs (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv2d_with_logits (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv3d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv3d_on_zero_inputs (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv3d_with_logits (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_paddednormconv2d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_paddednormconv3d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_resolve_weight_bits_priority (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_transposed_conv2d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_transposed_conv2d_with_logits (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_weight_bits_overrides_are_applied_per_module (quantization.algorithm.test_gptq.GPTQTest) ... ok

----------------------------------------------------------------------
Ran 21 tests in 119.973s

OK

Value tests:

HuggingFaceTB/SmolLM2-135M-Instruct:

| Config ID | PPL |
| --- | --- |
| FP32 | 17.40 |
| GPTQ_MSE_w4A16_head4 | 27.74 |
| GPTQ_MSE_w4A16_head_GPTQ_4 | 25.01 |
| GPTQ_SMSE_w4A16_head4 | 27.19 |
| GPTQ_SMSE_w4A16_head_GPTQ_4 | 24.14 |

TinyLlama/TinyLlama-1.1B-Chat-v1.0:

| Config ID | PPL |
| --- | --- |
| FP32 | 7.97 |
| GPTQ_MSE_w4A16_head4 | 8.66 |
| GPTQ_MSE_w4A16_head_GPTQ_4 | 8.54 |
| GPTQ_SMSE_w4A16_head4 | 8.52 |
| GPTQ_SMSE_w4A16_head_GPTQ_4 | 8.42 |

unsloth/Llama-3.2-1B-Instruct:

| Config ID | PPL |
| --- | --- |
| FP32 | 13.17 |
| GPTQ_MSE_w4A16_head4 | 18.59 |
| GPTQ_MSE_w4A16_head_GPTQ_4 | 18.30 |
| GPTQ_SMSE_w4A16_head4 | 15.26 |
| GPTQ_SMSE_w4A16_head_GPTQ_4 | 15.00 |

unsloth/Llama-3.2-3B-Instruct:

| Config ID | PPL |
| --- | --- |
| FP32 | 11.05 |
| GPTQ_MSE_w4A16_head4 | 12.96 |
| GPTQ_MSE_w4A16_head_GPTQ_4 | 12.67 |
| GPTQ_SMSE_w4A16_head4 | 12.31 |
| GPTQ_SMSE_w4A16_head_GPTQ_4 | 12.17 |
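The `head4` configurations keep `lm_head` in plain 4-bit round-to-nearest, while `head_GPTQ_4` runs the head through GPTQ as well. As a rough illustration of the round-to-nearest baseline only (a minimal sketch with hypothetical helper names, not this repo's API or the GPTQ algorithm itself), per-output-channel symmetric 4-bit fake quantization of a head projection looks like:

```python
import torch

def fake_quantize_4bit(w: torch.Tensor) -> torch.Tensor:
    """Per-output-channel symmetric 4-bit round-to-nearest fake quantization."""
    qmax = 7  # int4 symmetric grid: integers in [-8, 7]
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return q * scale  # dequantized ("fake-quantized") weights

# Apply to a stand-in lm_head projection (out_features = vocab size)
lm_head = torch.nn.Linear(64, 128, bias=False)
with torch.no_grad():
    lm_head.weight.copy_(fake_quantize_4bit(lm_head.weight))
```

GPTQ improves on this baseline by adjusting the remaining unquantized columns to compensate for the rounding error, which is what the `head_GPTQ_4` rows measure.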

Related: this serves as a fallback to #624.
TICO-DCO-1.0-Signed-off-by: s.malakhov s.malakhov@partner.samsung.com

@stamalakhov stamalakhov self-assigned this Apr 14, 2026
@stamalakhov stamalakhov force-pushed the GPTQ_lm_head branch 6 times, most recently from d18e64e to 23a40c6 Compare April 15, 2026 08:58
@stamalakhov stamalakhov marked this pull request as ready for review April 15, 2026 08:59
@stamalakhov stamalakhov requested a review from mhs4670go April 15, 2026 09:01
static_groups=gptq_conf.static_groups,
verbose=gptq_conf.verbose,
)
quantizers[f"model.lm_head"] = gptq.quantizer
Contributor

Isn't it like this?

Suggested change
quantizers[f"model.lm_head"] = gptq.quantizer
quantizers[f"lm_head"] = gptq.quantizer

Contributor Author

Ahh, no. Bare `lm_head` is not injected into the PTQ observers; `model.lm_head` is.
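The name the observers see is the full attribute path reported by `named_modules()`, which depends on how the model is wrapped. A quick illustration (hypothetical classes, not this repo's wrappers):

```python
import torch.nn as nn

# Stand-in for a wrapped model where the head lives under a `model` attribute
class Inner(nn.Module):
    def __init__(self):
        super().__init__()
        self.lm_head = nn.Linear(4, 10, bias=False)

class Wrapped(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = Inner()

names = [name for name, _ in Wrapped().named_modules()]
# named_modules() yields full attribute paths: "model.lm_head", not "lm_head"
```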

Contributor

Hmm.. it turns out the injection doesn't work properly. I'll fix this soon. First of all, there is a mismatch in the wrapped fp_name.

Contributor

#638

Sorry for bothering you. Could you review this PR and rebase it later?

Contributor

And, we should re-evaluate the result :(

Contributor Author

> And, we should re-evaluate the result :(

OK.


2 participants