
Commit d77b55e

added notes from lec 2 cs224n
1 parent 51775b2

1 file changed

Lines changed: 26 additions & 1 deletion

stanford_lectures/cs224n/cs224n.md
![Screenshot 2024-08-01 at 5.53.47 PM.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/c9aa599c-2115-4330-846e-652102e8621e/40de4c60-feb7-4e52-9eda-3fdfcb8abbaa/Screenshot_2024-08-01_at_5.53.47_PM.png)
- in the image above, we optimize all the parameters by walking down the gradient (minimizing the loss by computing all the vector gradients)
## lecture 2.
- **bag of words model (word2vec)**
    - this model makes the same predictions at each position in the context window
    - to build intuition: we want a model that gives a reasonably high probability estimate to all words that actually occur in the context, and the models have gotten better at this over time
    - parameters
        - the parameters of this model are the vectors of the outside (context) words and the vectors of the center words
        - recall: the model dots each outside-word vector (the rows of U) with the center word vector, producing one score per vocabulary word; this score vector is then passed through a softmax to get a probability distribution
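The prediction step above can be sketched in plain Python. This is a minimal sketch with toy hand-picked vectors, not trained embeddings; the function names are my own:

```python
import math

def softmax(scores):
    # subtract the max before exponentiating for numerical stability
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_context_probs(U, v_center):
    # one score per vocabulary word: the dot product u_w . v_c
    scores = [sum(u_i * v_i for u_i, v_i in zip(u, v_center)) for u in U]
    return softmax(scores)

# toy example: vocabulary of 3 words, 2-dimensional vectors
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # outside-word vectors
v_c = [0.5, 0.5]                           # center word vector
probs = predict_context_probs(U, v_c)
print(probs)  # a probability distribution over the 3 words; sums to 1
```

Word 2 gets the highest probability because its vector has the largest dot product with the center vector.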
- **optimization (gradient descent)**
    - to learn good word vectors, we have a loss function that we want to minimize
    - gradient descent is an algorithm that minimizes the loss function by iteratively changing the parameters
    - the idea is to compute the gradient of the loss at the current parameter values and then take a small step in the direction of the negative gradient
        - we do this because the negative gradient is the direction of steepest descent, so each step decreases the loss
    - the update rule looks like this: $\theta^{new} = \theta^{old} - \alpha \nabla_{\theta}L(\theta)$
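A minimal sketch of the update rule above, using a 1-D stand-in loss $L(\theta) = \theta^2$ (whose gradient is $2\theta$) rather than the actual word2vec loss:

```python
def gradient_descent(theta, grad_fn, alpha=0.1, steps=100):
    # repeatedly step in the direction of the negative gradient:
    # theta_new = theta_old - alpha * grad L(theta)
    for _ in range(steps):
        theta = theta - alpha * grad_fn(theta)
    return theta

# toy loss L(theta) = theta**2, so grad L(theta) = 2 * theta
theta = gradient_descent(theta=5.0, grad_fn=lambda t: 2 * t)
print(theta)  # close to 0, the minimizer of theta**2
```

In word2vec, `theta` would be the full stack of center and outside word vectors rather than a single number.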
- **stochastic gradient descent**
    - we use SGD because computing the gradient over the entire corpus before every single parameter update is very expensive
    - instead, we repeatedly sample windows and update the parameters after each one (or after a small batch)
    - in other words, instead of computing the gradient on the entire dataset, we compute it on small batches and update the parameters after each batch
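The sampling-and-updating loop above can be sketched on a toy problem: estimating the mean of a dataset by minimizing the squared error $(\theta - x)^2$ with SGD. Each step uses the gradient from one small sampled batch, never the full dataset (the function name and hyperparameters here are illustrative assumptions):

```python
import random

def sgd_mean(data, alpha=0.05, epochs=200, batch_size=4, seed=0):
    # SGD on the loss (theta - x)**2: sample a small batch, compute its
    # gradient, take a small step; repeat
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(epochs):
        batch = rng.sample(data, batch_size)
        grad = sum(2 * (theta - x) for x in batch) / batch_size
        theta -= alpha * grad
    return theta

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(sgd_mean(data))  # close to 3.5, the true mean
```

The estimate is noisy because each batch is only a sample, but it is far cheaper per update than using all the data, which is exactly the trade-off SGD makes for word2vec.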
- **word2vec algorithm family**
    - there are 2 model variants
        - skip-grams
            - predict the context (outside) words given the center word
        - continuous bag of words (CBOW)
            - predict the center word from a bag of context words
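The difference between the two variants shows up in how training examples are built from a window. A sketch, assuming a toy tokenized sentence (the helper names are my own):

```python
def skipgram_pairs(tokens, window=2):
    # skip-gram: for each center word, emit one (center, context) pair
    # per context word in the window
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    # CBOW: emit one (bag of context words, center) example per position
    examples = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        examples.append((context, center))
    return examples

tokens = ["the", "quick", "brown", "fox"]
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

Skip-gram turns each window into several small prediction problems, while CBOW pools the whole context into one prediction per position.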
