Conversation
Together with an importance matrix, this brings perplexity for LLaMA-v2-70B below the perplexity of the former Q2_K with an 800 MB smaller quantized model size.
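The importance matrix enters the quantization as a per-weight weighting of the quantization error, so that weights which are strongly activated on the calibration data (e.g. wiki.train.raw) are reproduced more faithfully. Below is a minimal sketch of that weighted-least-squares idea; the `best_scale` helper and the brute-force scale search are illustrative assumptions, not the actual llama.cpp code.

```cpp
// Sketch: pick a block scale d that minimizes the importance-weighted
// quantization error  sum_j w_j * (x_j - d * round(x_j / d))^2.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

double best_scale(const std::vector<float> & x, const std::vector<float> & w) {
    double best_d = 1.0, best_err = 1e30;
    for (int step = 1; step <= 256; ++step) {
        const double d = 0.01 * step;       // candidate scale
        double err = 0.0;
        for (size_t j = 0; j < x.size(); ++j) {
            int q = (int) std::lround(x[j] / d);
            q = std::max(-4, std::min(3, q));   // clamp to a 3-bit signed range
            const double diff = x[j] - d * q;
            err += w[j] * diff * diff;          // importance-weighted squared error
        }
        if (err < best_err) { best_err = err; best_d = d; }
    }
    return best_d;
}

int main() {
    std::vector<float> x = { 0.8f, -1.2f, 0.1f, 2.0f };  // weights of one block
    std::vector<float> w = { 1.0f,  5.0f, 0.2f, 3.0f };  // importance of each weight
    std::printf("chosen scale: %.3f\n", best_scale(x, w));
}
```

Without the weights `w`, every entry would count equally; with them, the search trades accuracy on unimportant weights for accuracy on the ones that matter most to the model's outputs.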
Just a reminder of the table obtained after the optimizations you made on Q2_K and Q3_K_S in late August 2023 (#2807). That Q2_K is the one I suggested renaming to Q3_K_XS, because it already exists and has been proven for a long time, its perplexity increase (<1%) is less than half of its size reduction (>2%), and there is roughly 1k of extra context to be gained with an f16 KV cache just from that change. But it would of course be great to also have an intermediate quant below it, the Q3_K_XS that you PRed, which looks more like a Q3_K_XXS to me!
I was taking the values from my notes, and I guess I forgot to update the notes when I made PR #2807. So, what we see in the above tables/graph is what we had before PR #2807. Here is an updated graph with the values post #2807 (i.e., current master)
Q3_K_XS seems to give broken results for mixtral-type models. Generation just ends immediately or prints a few symbols then stops. I've tested and hit the bug with Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss, TenyxChat-8x7B and bagel-dpo-8x7b-v0.2.
Thank you for noticing. It should be fixed via PR #5113 |
I've tested the patch, it works. Thanks! |
* Add Q3_K_XS - intermediate size between Q2_K and Q3_K_S
* Q3_K_XS: quantize first 1/8 of ffn_down layers with Q4_K

Together with an importance matrix, this brings perplexity for LLaMA-v2-70B below the perplexity of the former Q2_K with an 800 MB smaller quantized model size.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
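The second bullet is the core of the recipe: with the Q3_K_XS file type, the first 1/8 of the ffn_down tensors are bumped up to Q4_K while the rest stays at a lower-bit K-quant. A minimal sketch of that per-tensor selection logic follows; the `pick_type_q3_k_xs` helper and the Q3_K fallback for all other tensors are illustrative assumptions, not the exact mix used in the PR.

```cpp
#include <cstdio>
#include <string>

enum class qtype { Q2_K, Q3_K, Q4_K };

// name: tensor name, i_layer: layer index, n_layer: total number of layers
qtype pick_type_q3_k_xs(const std::string & name, int i_layer, int n_layer) {
    if (name.find("ffn_down") != std::string::npos) {
        // the first 1/8 of the ffn_down tensors are quantized with Q4_K
        if (i_layer < n_layer / 8) {
            return qtype::Q4_K;
        }
        return qtype::Q3_K;
    }
    // everything else stays at a low-bit K-quant in this sketch
    return qtype::Q3_K;
}

int main() {
    const int n_layer = 80; // LLaMA-v2-70B
    for (int il = 0; il < n_layer; ++il) {
        const std::string name = "blk." + std::to_string(il) + ".ffn_down.weight";
        if (pick_type_q3_k_xs(name, il, n_layer) == qtype::Q4_K) {
            std::printf("%s -> Q4_K\n", name.c_str());
        }
    }
}
```

For an 80-layer model this upgrades 10 ffn_down tensors, which is what accounts for Q3_K_XS landing between Q2_K and Q3_K_S in size.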


TL;DR See #5055
Before the recent two-bit quantization and importance matrix related changes, there were two low-bit quantization types available in `llama.cpp`: `Q2_K` and `Q3_K_S`. `Q2_K` was basically a 3-bit quantization with just the `attn_k` and `attn_q` tensors quantized with 2 bits. The table shows their model sizes and perplexities (`wiki.test.raw`, `n_ctx = 512`) for LLaMA-v2-70B:

After the recent changes, `Q2_K` has become an actual 2-bit quantization (less than 3 bits per weight), has a LLaMA-v2-70B model size of 23.71 GiB, and a perplexity of 4.0039 (using an importance matrix derived from `wiki.train.raw`). `Q3_K_S` has increased very slightly to 27.86 GiB, but has a better perplexity of 3.6603. Based on #5005 there is a need for an intermediate step in terms of model size between the new `Q2_K` and `Q3_K_S`. This PR adds such a quantization type as `Q3_K_XS`. The following table summarizes the new situation for LLaMA-v2-70B.

The table on a graph:

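For reference, the perplexities quoted throughout are the usual exponential of the mean negative log-likelihood, computed over `wiki.test.raw` in chunks of `n_ctx = 512` tokens. Below is a minimal sketch of that final reduction step, assuming the per-token log-probabilities have already been collected from the model; it is not the llama.cpp perplexity tool itself.

```cpp
// ppl = exp( -(1/N) * sum_i log p(token_i | context_i) )
#include <cmath>
#include <cstdio>
#include <vector>

double perplexity(const std::vector<double> & log_probs) {
    double nll = 0.0;
    for (const double lp : log_probs) {
        nll -= lp; // accumulate negative log-likelihood
    }
    return std::exp(nll / (double) log_probs.size());
}

int main() {
    // toy per-token probabilities; a real run averages over the whole test set
    std::vector<double> log_probs = { std::log(0.25), std::log(0.5), std::log(0.125) };
    std::printf("ppl = %.4f\n", perplexity(log_probs));
}
```

Lower perplexity means the model assigns higher probability to the held-out text, which is why the 4.0039 vs 3.6603 comparison above favors `Q3_K_S` despite its larger size.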