Unicode apostrophe confuses tokenizer

When the tokenizer sees the [Unicode apostrophe](http://www.fileformat.info/info/unicode/char/2019/index.htm) (U+2019, right single quotation mark), it doesn't split contractions the way it does for the ASCII apostrophe. For example:

```
import spacy

nlp = spacy.load('en', parser=False)

# ASCII apostrophe (U+0027)
print(list(nlp.tokenizer("I'm hungry")))
# Unicode apostrophe (U+2019)
print(list(nlp.tokenizer("I\u2019m hungry")))
```

outputs

```
[I, 'm, hungry]
[I’m, hungry]
```
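
A possible workaround until the tokenizer handles U+2019 is to normalize the text before tokenizing. This is only a minimal sketch: `normalize_apostrophes` is a hypothetical helper of my own, not part of spaCy, and it assumes that replacing every U+2019 with the ASCII apostrophe is acceptable for your text (it will also rewrite closing single quotes).

```python
# Hypothetical helper, not part of spaCy: maps the Unicode apostrophe
# (U+2019) to the ASCII apostrophe (U+0027) before tokenization.
APOSTROPHE_MAP = str.maketrans({"\u2019": "'"})


def normalize_apostrophes(text):
    """Return text with every U+2019 replaced by an ASCII apostrophe."""
    return text.translate(APOSTROPHE_MAP)


print(normalize_apostrophes("I\u2019m hungry"))  # prints: I'm hungry
```

The replacement is one character for one character, so character offsets into the original string stay valid. The normalized string can then be passed to `nlp.tokenizer` as in the example above.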



## My Environment
* OS X 10.11.6
* Python 3.5.2
* spacy 1.2.0
