## Understanding Transformers

- **motivation**
    - to the human eye, any piece of text is easy to understand, but from a modeling perspective there are 3 observations to be made:
        1. the encoded input can be surprisingly large: if we use an embedding vector of length 1024, the encoded input has size (number of words) x 1024
        2. it is not obvious how to use a fully connected network in this situation, as sentences come in different lengths
        3. language is ambiguous: in text, we often use the word “it” to refer to something in the sentence, but we need the model to understand what exactly that word is referring to
- **model input**
    - The process begins with tokenizing the input text into individual tokens (words or subwords). Each token is then converted into a vector representation:
        - A lookup table (embedding matrix) is used to convert each token to a dense vector.
        - The dimensionality of this vector is typically 512 or 768 in models like BERT.
        - These embeddings are learned during the training process.
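    - as a rough sketch, this lookup is just row indexing into the embedding matrix; the matrix here is random for illustration (real models learn these values during training), and the token IDs are hypothetical:

      ```python
      import numpy as np

      vocab_size, d_model = 30522, 768  # BERT-base vocabulary size and hidden size
      rng = np.random.default_rng(0)
      # stand-in for the learned embedding matrix
      embedding_matrix = rng.normal(size=(vocab_size, d_model))

      token_ids = [101, 19204, 2135, 102]  # hypothetical token IDs
      embedded = embedding_matrix[token_ids]  # one row looked up per token
      print(embedded.shape)  # (4, 768)
      ```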
    - aside: **understanding tokenization**
        - there is some preprocessing done on the input; more specifically, we tokenize the input text into subwords or words (depending on the tokenizer)
        - for example, BERT uses WordPiece, which breaks down words into subwords to handle rare or unknown words efficiently

          ```python
          from transformers import BertTokenizer

          tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
          text = "Tokenizing input is an important step."
          tokens = tokenizer.tokenize(text)
          # Output: ['token', '##izing', 'input', 'is', 'an', 'important', 'step', '.']
          ```

        - sometimes models require special tokens; for example, BERT requires `[CLS]` at the beginning and `[SEP]` at the end of the sequence
        - then we convert each token into its corresponding integer ID using the tokenizer's vocabulary

          ```python
          token_ids = tokenizer.convert_tokens_to_ids(tokens)
          # encode() does the same conversion but also adds the [CLS]/[SEP] special tokens:
          token_ids = tokenizer.encode(text)
          # Output: [101, 19204, 2135, 1567, 2003, 2019, 2590, 3350, 1012, 102]
          ```

        - then we pad or truncate the sequences to a fixed length if necessary
        - then we create an attention mask, which lets the model differentiate between actual tokens and padding tokens: 1 for real tokens, 0 for padding tokens

          ```python
          # pad the token IDs to a fixed length with the [PAD] token ID (0 for BERT)
          max_length = 12
          padded_token_ids = token_ids + [0] * (max_length - len(token_ids))
          attention_mask = [1 if id != 0 else 0 for id in padded_token_ids]
          # Output: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
          ```

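        - the steps above can be sketched end to end with a toy vocabulary (the vocabulary and IDs here are illustrative, not BERT's real WordPiece vocabulary):

          ```python
          # toy end-to-end preprocessing: tokens -> IDs -> pad/truncate -> attention mask
          vocab = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
                   "token": 5, "##izing": 6, "input": 7}

          def preprocess(tokens, max_length=8):
              # add the special tokens required by BERT-style models
              tokens = ["[CLS]"] + tokens + ["[SEP]"]
              ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
              ids = ids[:max_length]                             # truncate
              ids += [vocab["[PAD]"]] * (max_length - len(ids))  # pad
              mask = [1 if i != vocab["[PAD]"] else 0 for i in ids]
              return ids, mask

          ids, mask = preprocess(["token", "##izing", "input"])
          print(ids)   # [101, 5, 6, 7, 102, 0, 0, 0]
          print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
          ```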