[Falcon] Use stated vocab size#2914
Conversation
Force-pushed from ebb4746 to 3a7e9eb
I'm looking at https://huggingface.co/tiiuae/falcon-rw-7b and it's very confusing:
It seems that for this model the change in this PR would make the converter think the vocab size is 50304, while it actually appears to be 65024.
Yeah, it is confusing, but I'm pretty sure the inference will fail either way (model architecture differences aside). See e.g. #2887 (comment). But maybe there is some yet-undiscovered way to make these two numbers agree.
I just noticed that [...]. So in this case, using the stated vocab size would be fine, but we would also have to load the tokens from this file instead of [...].
Any updates on this one?


See e.g. #2894 and #2868.

In all of these cases, there is a `vocab_size` in `config.json` with the correct size, but `tokenizer.json` contains a different number of tokens than that vocab size. Later on, the inference expects a tensor with `vocab_size` as one of its dimensions but gets `<actual count of tokens>` instead.

At least in the case of #2894, there is some configuration for an extra 'pad' token which makes up the difference (we are only missing a single token). However, for #2868 the difference is much larger, and I wasn't able to figure out where those tokens were supposed to come from.
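To make the mismatch concrete, here is a minimal sketch (the function name and dict layout are my assumptions, not code from this repo; the `model.vocab` nesting follows the usual HF BPE `tokenizer.json` layout) that compares the stated `vocab_size` against the token count:

```python
# Hedged sketch: quantify the vocab_size vs. token-count mismatch described above.
def missing_tokens(config: dict, tokenizer: dict) -> int:
    vocab_size = config["vocab_size"]               # stated size from config.json
    token_count = len(tokenizer["model"]["vocab"])  # actual entries in tokenizer.json
    return vocab_size - token_count

# An off-by-one case like #2894: a single 'pad' token makes up the difference.
config = {"vocab_size": 65024}
tokenizer = {"model": {"vocab": {f"tok{i}": i for i in range(65023)}}}
print(missing_tokens(config, tokenizer))  # 1
```

A positive return value means the converter would have to synthesize that many tokens to fill the gap; a zero means the two files already agree.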
In both cases, this fix was able to produce a GGUF that doesn't run into that mismatch issue later on. That's because we already have some logic to introduce pad tokens when a token ID is not found: https://github.com/ggerganov/llama.cpp/blob/71d6975559acfd6c8407a4ef8275a9979c737765/convert-falcon-hf-to-gguf.py#L155-L157
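The pad-token logic linked above can be sketched roughly like this (a simplified illustration, not the actual `convert-falcon-hf-to-gguf.py` code; the `<pad_N>` placeholder format is my invention):

```python
def pad_vocab(tokens: list, vocab_size: int) -> list:
    # Fill the gap between the tokens actually loaded from tokenizer.json and
    # the stated vocab_size with synthetic pad tokens, so that tensor shapes
    # line up at inference time.
    padded = list(tokens)
    for token_id in range(len(tokens), vocab_size):
        # Placeholder naming is illustrative only.
        padded.append(f"<pad_{token_id}>".encode("utf-8"))
    return padded

vocab = pad_vocab([b"hello", b"world"], 5)
print(len(vocab))  # 5
print(vocab[-1])   # b'<pad_4>'
```

The key property is that the padded list always has exactly `vocab_size` entries, which is what the later tensor-dimension check expects.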