- in the image above, we optimize all the parameters by walking down the gradient (minimizing the loss by computing the gradients of all the vectors)

## lecture 2.

- **bag of words model (word2vec)**
  - this model makes the same predictions at every position in the window
  - to develop the intuition here: we want a model that gives a reasonably high probability estimate to all of the words that actually occur in the context, and models have gotten better at this over time
  - parameters
    - the parameters of this model are the vectors of the outside words and the vectors of the center words
    - recall: the model dots the outside-word vectors (the rows of U) with the center word vector to get one score per vocabulary word; this score vector is then passed through a softmax to produce probabilities
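the prediction step above can be sketched in numpy (the vocabulary size, embedding dimension, and vector values here are made-up toy numbers, not from the lecture):

```python
import numpy as np

def softmax(scores):
    # subtract the max before exponentiating for numerical stability
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

# toy setup: vocabulary of 5 words, embedding dimension 3 (arbitrary values)
rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))   # outside-word vectors, one row per vocab word
v_c = rng.normal(size=3)      # center-word vector

scores = U @ v_c              # one dot-product score per vocabulary word
probs = softmax(scores)       # probability of each word appearing in the context
```

note that the softmax turns the raw dot-product scores into a distribution over the whole vocabulary, which is why the probabilities sum to 1.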
- **optimization (gradient descent)**
  - to learn good word vectors, we define a loss function that we want to minimize
  - gradient descent is an algorithm that minimizes the loss function by iteratively changing the parameters
  - the idea is to compute the gradient at the current value of the parameters and then take a small step in the direction of the negative gradient
    - we step against the gradient because we want to minimize the loss, and the negative gradient is the direction of steepest descent
  - the update rule is: $\theta^{new} = \theta^{old} - \alpha \nabla_{\theta}L(\theta)$
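the update rule can be illustrated on a toy quadratic loss (this loss and learning rate are made up for illustration, not the word2vec objective):

```python
import numpy as np

# toy loss L(theta) = ||theta||^2, whose gradient is 2 * theta
def grad(theta):
    return 2 * theta

theta = np.array([3.0, -2.0])  # arbitrary starting parameters
alpha = 0.1                    # learning rate (step size)

for _ in range(100):
    # theta_new = theta_old - alpha * gradient
    theta = theta - alpha * grad(theta)

# theta is now very close to the minimizer at the origin
```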
- **stochastic gradient descent (SGD)**
  - the motivation for SGD is that computing the gradient over the entire corpus before every single parameter update is very expensive
  - instead, we repeatedly sample windows and update the parameters after each one (or after a small batch)
  - in other words, instead of using the whole dataset for each update, we use small batches and update the parameters after each batch
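the sample-a-batch-then-update loop can be sketched on a simple least-squares problem (the dataset, batch size, and loss here are hypothetical placeholders standing in for windows and the word2vec loss):

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical dataset: fit theta so that X @ theta ~= y
X = rng.normal(size=(1000, 2))
true_theta = np.array([1.5, -0.5])
y = X @ true_theta

theta = np.zeros(2)
alpha = 0.1

for step in range(500):
    batch = rng.choice(len(X), size=32, replace=False)  # sample a small batch
    Xb, yb = X[batch], y[batch]
    g = 2 * Xb.T @ (Xb @ theta - yb) / len(batch)       # gradient on the batch only
    theta -= alpha * g                                  # update after each batch
```

each step only touches 32 examples instead of all 1000, which is the whole point: many cheap, noisy updates instead of a few exact, expensive ones.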
- **word2vec algorithm family**
  - there are 2 model variants
    - skip-gram
      - predicts the context words given the center word
    - continuous bag of words (CBOW)
      - predicts the center word from a bag of context words
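the two variants can be contrasted by the training pairs they build from each window (the sentence and window size below are made-up toy values):

```python
# made-up example sentence and window size
tokens = ["the", "quick", "brown", "fox", "jumps"]
window = 2

skipgram_pairs = []  # (center, context word): predict each context word from the center
cbow_pairs = []      # (context bag, center): predict the center from its context

for i, center in enumerate(tokens):
    # all words within `window` positions of the center, excluding the center itself
    context = [tokens[j]
               for j in range(max(0, i - window), min(len(tokens), i + window + 1))
               if j != i]
    skipgram_pairs += [(center, c) for c in context]
    cbow_pairs.append((context, center))
```

so from one window, skip-gram produces one training pair per context word, while CBOW produces a single pair whose input is the whole bag of context words.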