When the tokenizer sees a unicode apostrophe (U+2019, right single quotation mark), it doesn't split the contraction correctly. For example:
```python
import spacy

nlp = spacy.load('en', parser=False)
print(list(nlp.tokenizer("I'm hungry")))
print(list(nlp.tokenizer("I\u2019m hungry")))
```
outputs

```
[I, 'm, hungry]
[I’m, hungry]
```
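As a workaround until the tokenizer handles this, the text can be normalized before tokenization by mapping typographic apostrophe variants to the ASCII `'` that the contraction rules expect. This is a minimal sketch (the `normalize_apostrophes` helper and the set of code points covered are my own choices, not part of spaCy):

```python
# Map common unicode apostrophe variants to ASCII ' so the
# tokenizer's contraction rules (e.g. splitting "I'm") apply.
# U+2019 RIGHT SINGLE QUOTATION MARK is the usual culprit;
# U+02BC MODIFIER LETTER APOSTROPHE also appears in the wild.
APOSTROPHES = {0x2019: "'", 0x02BC: "'"}

def normalize_apostrophes(text):
    """Replace unicode apostrophe variants with ASCII '."""
    return text.translate(APOSTROPHES)

print(normalize_apostrophes("I\u2019m hungry"))  # → I'm hungry
```

Passing the normalized string to `nlp.tokenizer` then produces the same tokens as the ASCII input. A longer-term fix would presumably add U+2019 to the tokenizer's exception patterns instead.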
My Environment
- OS X 10.11.6
- Python 3.5.2
- spacy 1.2.0