Description
System Info
- `transformers` version: 4.29.0.dev0
- Platform: Linux-4.18.0-305.19.1.el8_4.x86_64-x86_64-with-glibc2.28
- Python version: 3.9.7
- Huggingface_hub version: 0.13.3
- Safetensors version: 0.3.0
- PyTorch version (GPU?): 2.1.0.dev20230411+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
As mentioned in the title, the LLaMA tokenizer does not add the `eos_token` at the end of the inputs. This only happens with the fast version (`use_fast=True`).
Steps to reproduce the behaviour:
1. Load the LLaMA tokenizer:

```python
tokenizer = AutoTokenizer.from_pretrained(LLAMA_PATH, add_eos_token=True, use_fast=True)
```

2. Tokenize something:

```python
simple_sentence = "This is a sentence to test if the tokenizer adds eos token."
simple_sentence_ids = tokenizer(
    simple_sentence, add_special_tokens=True
).input_ids
```

3. Print the `input_ids` to check if the `eos_token_id` (2) is added at the end:

```python
print(simple_sentence_ids)
```

4. Output:

```python
[1, 910, 338, 263, 10541, 304, 1243, 565, 278, 5993, 3950, 12778, 321, 359, 5993, 29889]
```

Expected behavior

Expected output:

```python
[1, 910, 338, 263, 10541, 304, 1243, 565, 278, 5993, 3950, 12778, 321, 359, 5993, 29889, 2]
```
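Until the fast tokenizer respects `add_eos_token`, one possible workaround is to append the EOS id manually after tokenization. This is a minimal sketch, assuming the standard LLaMA `eos_token_id` of 2; the helper name `ensure_eos` is hypothetical, not part of the library:

```python
def ensure_eos(input_ids, eos_token_id=2):
    """Append eos_token_id if the tokenizer did not already add it."""
    if not input_ids or input_ids[-1] != eos_token_id:
        return input_ids + [eos_token_id]
    return input_ids

# Applied to the ids produced above, this yields the expected output:
ids = [1, 910, 338, 263, 10541, 304, 1243, 565, 278, 5993, 3950, 12778, 321, 359, 5993, 29889]
fixed = ensure_eos(ids)
```

Note this only patches the symptom on single sequences; batched inputs would need the same check per sequence, and padding should be applied after appending the EOS id.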