Improve tokenization to work with other tokenizers (databrickslabs#40)
This addresses databrickslabs#4. The tokenizer used by BLOOM appears to combine the newline after `### Response:` with the following character, which does not happen with GPT-J 6B. As a result, the tokens for `### Response:\n` differ when it appears inside the text versus when it is tokenized in isolation. My solution here is to change the key to `### Response:\n` so that this becomes a single token.
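To illustrate the failure mode, here is a toy sketch (not the real BLOOM tokenizer): a greedy tokenizer that, like BLOOM's, merges a newline with the following character. `RESPONSE_KEY` and `toy_tokenize` are hypothetical names for illustration only.

```python
RESPONSE_KEY = "### Response:"
RESPONSE_KEY_NL = "### Response:\n"  # key includes the trailing newline

def toy_tokenize(text, special_tokens):
    """Toy greedy tokenizer: match special tokens first; otherwise,
    mimic the BLOOM-like behavior of merging '\n' with the next character."""
    tokens = []
    i = 0
    while i < len(text):
        for special in special_tokens:
            if text.startswith(special, i):
                tokens.append(special)
                i += len(special)
                break
        else:
            if text[i] == "\n" and i + 1 < len(text):
                tokens.append(text[i:i + 2])  # '\n' merges with next char
                i += 2
            else:
                tokens.append(text[i])
                i += 1
    return tokens

# In isolation, the key is followed by a lone '\n' token...
isolated = toy_tokenize("### Response:\n", [RESPONSE_KEY])
# ...but in context, that '\n' merges with the response's first character,
# so searching for the isolated token sequence fails.
in_context = toy_tokenize("### Response:\nHello", [RESPONSE_KEY])
# Folding the newline into the key makes it a single token in both cases.
in_context_nl = toy_tokenize("### Response:\nHello", [RESPONSE_KEY_NL])
```

With the newline included in the key, the key tokenizes identically whether or not text follows it.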
The other fix is to try a few different config settings for the max length, falling back to 1024 if none can be found.
I've tested this on [bloomz-7b1-mt](https://huggingface.co/bigscience/bloomz-7b1-mt) and it produces similar generation quality. It also still trains successfully using GPT-J 6B as the base model.