
Unicode apostrophe confuses tokenizer #685

@pokey


When the tokenizer sees the Unicode apostrophe (U+2019, RIGHT SINGLE QUOTATION MARK), it doesn't split the contraction the way it does for the ASCII apostrophe. For example:

import spacy

nlp = spacy.load('en', parser=False)

print(list(nlp.tokenizer("I'm hungry")))
print(list(nlp.tokenizer("I\u2019m hungry")))

outputs

[I, 'm, hungry]
[I’m, hungry]
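
As a stopgap, normalizing the character before tokenizing might sidestep this. Below is a minimal sketch, assuming a hypothetical helper `normalize_apostrophes` (not part of spaCy). Note that U+2019 also serves as a closing single quote, so this rewrites those as well:

```python
def normalize_apostrophes(text):
    """Replace the Unicode apostrophe (U+2019) with the ASCII apostrophe.

    Hypothetical helper for illustration; not a spaCy API.
    """
    return text.replace("\u2019", "'")


normalized = normalize_apostrophes("I\u2019m hungry")
print(normalized)
```

Passing the normalized string to `nlp.tokenizer` should then follow the same path as the ASCII case above. The replacement is one character for one character, so character offsets into the original text stay the same.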

My Environment

  • OS X 10.11.6
  • Python 3.5.2
  • spacy 1.2.0
