Commit bd2af3c — added notes on tokenization
## Understanding Transformers

- **motivation**
    - when we look at any piece of text, it is easy for the human eye to understand what is going on, but three observations complicate things for a model:
        1. the encoded input can be surprisingly large — if we use an embedding vector of length 1024, the encoded input is the number of words × 1024
        2. it is not obvious how to use a fully connected network in this situation, as sentences come in different lengths
        3. language is ambiguous — in text, we often use the word “it” to refer to something in the sentence, but we need the model to understand what exactly that word is referring to
- **model input**
    - The process begins with tokenizing the input text into individual tokens (words or subwords). Each token is then converted into a vector representation:
        - A lookup table (embedding matrix) is used to convert each token to a dense vector.
        - The dimensionality of this vector is typically 512 or 768 in models like BERT.
        - These embeddings are learned during the training process.
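The lookup step above can be sketched with plain NumPy (toy numbers for illustration only: a 6-token vocabulary and embedding dimension 4; real models use vocabularies of roughly 30k tokens and dimensions like 512 or 768):

```python
import numpy as np

# toy sizes for illustration only: 6-token vocabulary, embedding dimension 4
vocab = {'[CLS]': 0, '[SEP]': 1, 'token': 2, '##izing': 3, 'input': 4, '.': 5}
embedding_matrix = np.random.randn(len(vocab), 4)  # learned during training

# the "lookup" is just row indexing into the embedding matrix
token_ids = [vocab['token'], vocab['##izing'], vocab['input']]
embedded = embedding_matrix[token_ids]
print(embedded.shape)  # (3, 4): one 4-dim vector per token
```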
- aside: **understanding tokenization**
    - there is some preprocessing done on the input; more specifically, we tokenize the input text into subwords or words (depending on the tokenizer)
    - for example, BERT uses WordPiece, which breaks down words into subwords to handle rare or unknown words efficiently
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "Tokenizing input is an important step."
tokens = tokenizer.tokenize(text)
# Output: ['token', '##izing', 'input', 'is', 'an', 'important', 'step', '.']
```
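To make the WordPiece idea concrete, here is a toy sketch of the greedy longest-match-first splitting it is based on (heavily simplified, not the real algorithm; the actual tokenizer also handles vocabulary training, casing, and more):

```python
def toy_wordpiece(word, vocab):
    """Greedy longest-match-first subword split (simplified WordPiece sketch)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation pieces get the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ['[UNK]']  # no piece matches: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

vocab = {'token', '##izing', '##iz', '##ing'}
print(toy_wordpiece('tokenizing', vocab))  # ['token', '##izing']
```

A word absent from the vocabulary in every split (say `toy_wordpiece('xyz', vocab)`) collapses to `['[UNK]']`, which is exactly the rare-word problem subword vocabularies are designed to minimize.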
- sometimes models require special tokens — for example, BERT requires `[CLS]` at the beginning and `[SEP]` at the end of the sequence
- then we convert each token into its corresponding integer ID using the tokenizer's vocabulary
```python
# [CLS] and [SEP] must be added for their IDs (101 and 102) to appear in the output
tokens_with_special = ['[CLS]'] + tokens + ['[SEP]']
token_ids = tokenizer.convert_tokens_to_ids(tokens_with_special)
# Output: [101, 19204, 2135, 1567, 2003, 2019, 2590, 3350, 1012, 102]
```
- then we pad or truncate the sequences to a fixed length if necessary
- then we create an attention mask, which allows the model to differentiate between actual tokens and padding tokens — 1 for real tokens, 0 for padding tokens
```python
# pad token_ids up to a fixed length; BERT's [PAD] token has ID 0
max_length = 12
padded_token_ids = token_ids + [0] * (max_length - len(token_ids))
attention_mask = [1 if id != 0 else 0 for id in padded_token_ids]
# Output: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```
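As a sketch of why the mask matters downstream (illustrative mechanics, not BERT's exact implementation): inside attention, positions with mask 0 are given a large negative score before the softmax, so padding ends up with essentially zero attention weight:

```python
import numpy as np

def masked_softmax(scores, attention_mask):
    # padded positions get a large negative score so the softmax ~ignores them
    scores = np.where(np.array(attention_mask) == 1, scores, -1e9)
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

scores = np.array([2.0, 1.0, 0.5, 3.0])
mask = [1, 1, 1, 0]  # last position is padding
weights = masked_softmax(scores, mask)
# the padded position's weight is ~0 and the remaining weights sum to ~1
```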
