Possible confusing wording upon introducing `tokenizers`?

Please delete if this is overly pedantic or a non-issue. I found this passage a bit confusing:

https://github.com/EmilHvitfeldt/smltar/blob/d120e8b92f19e7d5464843f1f5b8f74d088a3220/02_tokenization.Rmd#L53-L66

It notes that `"fir-tree"` was split in half and that punctuation was lost, and then introduces the `tokenizers` package as a possible improvement over the simple approach of splitting with `strsplit()`. However, after using `tokenize_words()`, both issues remain.

To me this was a little confusing; it might not be to other people.
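A minimal sketch of what I mean, in case it helps. The sentence below is a made-up stand-in, not the book's `the_fir_tree` data, and the variable names are only illustrative. It compares the base R split with the defaults of `tokenize_words()`, and then with `strip_punct = FALSE`, which as far as I can tell keeps punctuation as separate tokens but still does not keep the hyphenated word together.

```r
library(tokenizers)

# Stand-in sentence (an assumption; not taken from the book's data)
x <- "The little fir-tree, however, was not happy."

# Base R: split on runs of non-alphanumeric characters, as in the passage
base_tokens <- strsplit(x, "[^a-zA-Z0-9]+")[[1]]

# tokenizers with default arguments (lowercases and strips punctuation)
default_tokens <- tokenize_words(x)[[1]]

# Same function, but asking it to keep punctuation as tokens
punct_tokens <- tokenize_words(x, strip_punct = FALSE)[[1]]

print(base_tokens)
print(default_tokens)
print(punct_tokens)

# Is the hyphenated word preserved as a single token in any of these?
print("fir-tree" %in% c(base_tokens, default_tokens, punct_tokens))
```

So the wording could perhaps say that `tokenizers` is faster and more consistent, without implying that it fixes the two problems just mentioned.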

Thanks for writing an excellent book!


	```{r}
	strsplit(the_fir_tree[1:2], "[^a-zA-Z0-9]+")
	```

	At first sight, this result looks pretty decent. However, we have lost all punctuation, which may or may not be helpful for our modeling goal, and the hero of this story (`"fir-tree"`) was split in half. Already it is clear that tokenization is going to be quite complicated. Luckily for us, a lot of work has been invested in this process, and typically it is best to use these existing tools. For example, tokenizers [@Mullen18] and spaCy [@spacy2] implement fast, consistent tokenizers we can use. Let's demonstrate with the tokenizers package.

	```{r}
	library(tokenizers)
	tokenize_words(the_fir_tree[1:2])
	```

	We see sensible single-word results here; the `tokenize_words()` function uses the stringi package [@Gagolewski19] and C++ under the hood, making it very fast. Word-level tokenization is done by finding word boundaries according to the specification from the International Components for Unicode (ICU).\index{Unicode} How does this [word boundary algorithm](https://www.unicode.org/reports/tr29/tr29-35.html#Default_Word_Boundaries) work? It can be outlined as follows:

Possible confusing wording upon introducing `tokenizers`? #171
