
GPTQ quantization (3- or 4-bit) support for LLaMa #177

@qwopqwop200

GPTQ is currently the SOTA one-shot quantization method for LLMs.
It supports remarkably low 3-bit and 4-bit weight quantization, and it can be applied to LLaMa.
I've confirmed that this works well with LLaMa-7B.
I haven't tested memory usage with the n-bit CUDA kernel yet, but I expect it to work.
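As a rough sanity check on memory (a back-of-envelope estimate, not a measurement; it assumes ~6.7B parameters and one FP16 scale plus zero point per group of 64 weights):

```python
# Back-of-envelope weight-storage estimate for LLaMa-7B.
# Assumptions (mine, not measured): ~6.7e9 parameters, group size 64,
# 2-byte scale + 2-byte zero point per group for the quantized formats.
# Activations, KV cache, and kernel overhead are not counted.
PARAMS = 6.7e9
GROUP_SIZE = 64
META_BYTES_PER_GROUP = 4  # FP16 scale + FP16 zero point

for bits in (16, 4, 3):
    weight_bytes = PARAMS * bits / 8
    meta_bytes = 0 if bits == 16 else PARAMS / GROUP_SIZE * META_BYTES_PER_GROUP
    print(f"{bits:2d}-bit: ~{(weight_bytes + meta_bytes) / 2**30:.1f} GiB")
```

By this estimate the weights go from ~12.5 GiB at FP16 to ~3.5 GiB at 4-bit and ~2.7 GiB at 3-bit, which is what makes single-GPU inference of the larger models plausible.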

Results for LLaMa-7B (perplexity; lower is better):

| Method | Bits | Group size | Wikitext2 | PTB   | C4    |
|--------|------|------------|-----------|-------|-------|
| FP16   | 16   | -          | 5.67      | 8.79  | 7.05  |
| RTN    | 4    | -          | 6.28      | 9.68  | 7.70  |
| GPTQ   | 4    | 64         | 6.16      | 9.66  | 7.52  |
| RTN    | 3    | -          | 25.66     | 61.25 | 28.19 |
| GPTQ   | 3    | 64         | 12.24     | 16.77 | 9.55  |
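To make the Bits and Group size columns concrete, here is a minimal sketch of the RTN (round-to-nearest) baseline from the table with group-wise asymmetric quantization. GPTQ uses the same storage format but additionally applies second-order (Hessian-based) error compensation while rounding; this code is illustrative only, not taken from the linked repo:

```python
import torch

def quantize_rtn(w: torch.Tensor, bits: int = 4, group_size: int = 64):
    """Group-wise asymmetric round-to-nearest quantization of a weight matrix.

    Each contiguous run of `group_size` weights along the input dimension
    shares one scale and one zero point; that is what the "Group size"
    column refers to. Assumes in_features is a multiple of group_size.
    """
    out_features, in_features = w.shape
    g = w.reshape(out_features, in_features // group_size, group_size)
    qmax = 2**bits - 1  # 15 for 4-bit, 7 for 3-bit
    wmin = g.min(dim=-1, keepdim=True).values
    wmax = g.max(dim=-1, keepdim=True).values
    scale = (wmax - wmin).clamp(min=1e-8) / qmax
    zero = torch.round(-wmin / scale)
    q = torch.clamp(torch.round(g / scale) + zero, 0, qmax)
    return q.to(torch.uint8), scale, zero

def dequantize(q, scale, zero):
    """Recover approximate FP weights from the integer codes."""
    return ((q.float() - zero) * scale).reshape(q.shape[0], -1)

# Quick check: per-weight error is bounded by about half a quantization step.
w = torch.randn(8, 128)
q, scale, zero = quantize_rtn(w, bits=4, group_size=64)
w_hat = dequantize(q, scale, zero)
print((w - w_hat).abs().max())
```

The table shows why GPTQ's error compensation matters: at 4 bits even RTN stays close to FP16 (6.28 vs. 5.67 on Wikitext2), but at 3 bits RTN collapses (25.66) while GPTQ holds at 12.24.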

code: https://github.com/qwopqwop200/GPTQ-for-LLaMa
