GPTQ is currently the state-of-the-art one-shot weight quantization method for LLMs. It supports impressively low 3-bit and 4-bit weight quantization, and it can be applied to LLaMA.
I've confirmed that this works well on LLaMA-7B.
I haven't tested the memory usage with the n-bit CUDA kernel yet, but I expect it to work.
Perplexity results (lower is better):

| Model (LLaMA-7B) | Bits | group-size | Wikitext2 | PTB | C4 |
|---|---|---|---|---|---|
| FP16 | 16 | - | 5.67 | 8.79 | 7.05 |
| RTN | 4 | - | 6.28 | 9.68 | 7.70 |
| GPTQ | 4 | 64 | 6.16 | 9.66 | 7.52 |
| RTN | 3 | - | 25.66 | 61.25 | 28.19 |
| GPTQ | 3 | 64 | 12.24 | 16.77 | 9.55 |
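For intuition on the RTN baseline in the table, here is a minimal sketch of round-to-nearest group-wise quantization (the function name and exact scheme are my own illustration; GPTQ improves on this by correcting quantization error with second-order weight statistics rather than rounding each weight independently):

```python
import numpy as np

def rtn_quantize(weights, bits=4, group_size=64):
    """Round-to-nearest (RTN) baseline: quantize each group of weights
    to an asymmetric n-bit grid, then dequantize back to float.
    Illustrative sketch only, not the repo's implementation."""
    qmax = 2 ** bits - 1
    w = weights.reshape(-1, group_size)
    # Per-group min/max define the quantization grid for that group.
    wmin = w.min(axis=1, keepdims=True)
    wmax = w.max(axis=1, keepdims=True)
    scale = (wmax - wmin) / qmax
    scale[scale == 0] = 1.0  # avoid division by zero for constant groups
    q = np.clip(np.round((w - wmin) / scale), 0, qmax)
    # Dequantize to see the reconstruction error the table measures indirectly.
    return (q * scale + wmin).reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)
w4 = rtn_quantize(w, bits=4, group_size=64)
w3 = rtn_quantize(w, bits=3, group_size=64)
```

Smaller group sizes and higher bit-widths shrink the reconstruction error, which is why the 3-bit RTN row degrades so sharply while GPTQ's error-compensating updates hold up better.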
code: https://github.com/qwopqwop200/GPTQ-for-LLaMa