Please delete if this is overly pedantic or a non-issue. I found it a bit confusing in smltar/02_tokenization.Rmd, lines 53 to 66 in d120e8b:

```{r}
strsplit(the_fir_tree[1:2], "[^a-zA-Z0-9]+")
```

At first sight, this result looks pretty decent. However, we have lost all punctuation, which may or may not be helpful for our modeling goal, and the hero of this story (`"fir-tree"`) was split in half. Already it is clear that tokenization is going to be quite complicated. Luckily for us, a lot of work has been invested in this process, and typically it is best to use these existing tools. For example, **tokenizers** [@Mullen18] and **spaCy** [@spacy2] implement fast, consistent tokenizers we can use. Let's demonstrate with the **tokenizers** package. |

```{r}
library(tokenizers)

tokenize_words(the_fir_tree[1:2])
```

We see sensible single-word results here; the `tokenize_words()` function uses the **stringi** package [@Gagolewski19] and C++ under the hood, making it very fast. Word-level tokenization is done by finding word boundaries according to the specification from the International Components for Unicode (ICU).\index{Unicode} How does this [word boundary algorithm](https://www.unicode.org/reports/tr29/tr29-35.html#Default_Word_Boundaries) work? It can be outlined as follows: |

In that passage, it is noted that 'fir-tree' was split in half and that punctuation was lost. The text then introduces tokenizer packages such as **tokenizers** as an improvement over the simple `strsplit()` approach. However, after using `tokenize_words()`, we still see that these issues remain.
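For instance, here is a minimal sketch of what I mean (the example sentence is my own, not taken from the book's data, and the direct **stringi** call is only my guess at the ICU word-boundary segmentation the chapter describes):

```{r}
library(tokenizers)
library(stringi)

# Default word tokenization: punctuation is dropped and the hyphen acts as a
# word boundary, so "fir-tree" still comes back as "fir" and "tree"
tokenize_words("The fir-tree stood in the forest; it was not tall.")

# Roughly the same ICU word-boundary segmentation, called directly; here the
# tokens keep their case, and pieces with no word characters (spaces,
# punctuation) are skipped
stri_split_boundaries("The fir-tree stood in the forest.",
                      type = "word", skip_word_none = TRUE)
```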
To me this was a little confusing - it might not be to other people.
Thanks for writing an excellent book!