GPTQ is currently the state-of-the-art one-shot weight quantization method for LLMs. It supports impressively low 3-bit and 4-bit weight quantization, and it can be applied to LLaMA.
I've confirmed that this works well on LLaMA-7B.
I haven't tested the memory usage with the n-bit CUDA kernel yet, but I expect it to work.
Perplexity results (lower is better):

| Model (LLaMA-7B) | Bits | group-size | Wikitext2 | PTB | C4 |
|---|---|---|---|---|---|
| FP16 | 16 | - | 5.67 | 8.79 | 7.05 |
| RTN | 4 | - | 6.28 | 9.68 | 7.70 |
| GPTQ | 4 | 64 | 6.16 | 9.66 | 7.52 |
| RTN | 3 | - | 25.66 | 61.25 | 28.19 |
| GPTQ | 3 | 64 | 12.24 | 16.77 | 9.55 |
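For intuition on the RTN baseline in the table, here is a minimal sketch of round-to-nearest group-wise quantization (the function name and exact scheme are my own illustration; GPTQ improves on this by correcting quantization error with second-order weight statistics rather than rounding each weight independently):

```python
import numpy as np

def rtn_quantize(weights, bits=4, group_size=64):
    """Round-to-nearest (RTN) baseline: quantize each group of weights
    to an asymmetric n-bit grid, then dequantize back to float.
    Illustrative sketch only, not the repo's implementation."""
    qmax = 2 ** bits - 1
    w = weights.reshape(-1, group_size)
    # Per-group min/max define the quantization grid for that group.
    wmin = w.min(axis=1, keepdims=True)
    wmax = w.max(axis=1, keepdims=True)
    scale = (wmax - wmin) / qmax
    scale[scale == 0] = 1.0  # avoid division by zero for constant groups
    q = np.clip(np.round((w - wmin) / scale), 0, qmax)
    # Dequantize to see the reconstruction error the table measures indirectly.
    return (q * scale + wmin).reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)
w4 = rtn_quantize(w, bits=4, group_size=64)
w3 = rtn_quantize(w, bits=3, group_size=64)
```

Smaller group sizes and higher bit-widths shrink the reconstruction error, which is why the 3-bit RTN row degrades so sharply while GPTQ's error-compensating updates hold up better.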
code: https://github.com/qwopqwop200/GPTQ-for-LLaMa