
Commit d77b55e

added notes from lec 2 cs224n
1 parent 51775b2

1 file changed

Lines changed: 26 additions & 1 deletion

stanford_lectures/cs224n/cs224n.md
![Screenshot 2024-08-01 at 5.53.47 PM.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/c9aa599c-2115-4330-846e-652102e8621e/40de4c60-feb7-4e52-9eda-3fdfcb8abbaa/Screenshot_2024-08-01_at_5.53.47_PM.png)
- in the image above, we optimize all the parameters by walking down the gradient (minimizing the loss by computing all the vector gradients)
## lecture 2.
- **bag of words model (word2vec)**
    - this model makes the same predictions at each position in the context window
    - to build intuition: we want a model that gives a reasonably high probability estimate to all words that actually occur in the context, and the models have gotten better at this over time
    - parameters
        - the parameters of this model are the vectors of the outside (context) words and the vectors of the center words
        - recall: the model dots each outside-word vector (the rows of U) with the center word vector, producing one score per vocabulary word; this score vector is then passed through a softmax to get a probability distribution
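The prediction step above can be sketched in plain Python. This is a minimal sketch with toy hand-picked vectors, not trained embeddings; the function names are my own:

```python
import math

def softmax(scores):
    # subtract the max before exponentiating for numerical stability
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_context_probs(U, v_center):
    # one score per vocabulary word: the dot product u_w . v_c
    scores = [sum(u_i * v_i for u_i, v_i in zip(u, v_center)) for u in U]
    return softmax(scores)

# toy example: vocabulary of 3 words, 2-dimensional vectors
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # outside-word vectors
v_c = [0.5, 0.5]                           # center word vector
probs = predict_context_probs(U, v_c)
print(probs)  # a probability distribution over the 3 words; sums to 1
```

Word 2 gets the highest probability because its vector has the largest dot product with the center vector.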
- **optimization (gradient descent)**
    - to learn good word vectors, we have a loss function that we want to minimize
    - gradient descent is an algorithm that minimizes the loss function by iteratively changing the parameters
    - the idea is to compute the gradient of the loss at the current parameter values and then take a small step in the direction of the negative gradient
        - we do this because the negative gradient is the direction of steepest descent, so each step decreases the loss
    - the update rule looks like this: $\theta^{new} = \theta^{old} - \alpha \nabla_{\theta}L(\theta)$
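A minimal sketch of the update rule above, using a 1-D stand-in loss $L(\theta) = \theta^2$ (whose gradient is $2\theta$) rather than the actual word2vec loss:

```python
def gradient_descent(theta, grad_fn, alpha=0.1, steps=100):
    # repeatedly step in the direction of the negative gradient:
    # theta_new = theta_old - alpha * grad L(theta)
    for _ in range(steps):
        theta = theta - alpha * grad_fn(theta)
    return theta

# toy loss L(theta) = theta**2, so grad L(theta) = 2 * theta
theta = gradient_descent(theta=5.0, grad_fn=lambda t: 2 * t)
print(theta)  # close to 0, the minimizer of theta**2
```

In word2vec, `theta` would be the full stack of center and outside word vectors rather than a single number.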
- **stochastic gradient descent**
    - we use SGD because computing the gradient over the entire corpus before every single parameter update is very expensive
    - instead, we repeatedly sample windows and update the parameters after each one (or after a small batch)
    - in other words, instead of computing the gradient on the entire dataset, we compute it on small batches and update the parameters after each batch
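The sampling-and-updating loop above can be sketched on a toy problem: estimating the mean of a dataset by minimizing the squared error $(\theta - x)^2$ with SGD. Each step uses the gradient from one small sampled batch, never the full dataset (the function name and hyperparameters here are illustrative assumptions):

```python
import random

def sgd_mean(data, alpha=0.05, epochs=200, batch_size=4, seed=0):
    # SGD on the loss (theta - x)**2: sample a small batch, compute its
    # gradient, take a small step; repeat
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(epochs):
        batch = rng.sample(data, batch_size)
        grad = sum(2 * (theta - x) for x in batch) / batch_size
        theta -= alpha * grad
    return theta

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(sgd_mean(data))  # close to 3.5, the true mean
```

The estimate is noisy because each batch is only a sample, but it is far cheaper per update than using all the data, which is exactly the trade-off SGD makes for word2vec.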
- **word2vec algorithm family**
    - there are 2 model variants
        - skip-grams
            - predict the context (outside) words given the center word
        - continuous bag of words (CBOW)
            - predict the center word from a bag of context words
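The difference between the two variants shows up in how training examples are built from a window. A sketch, assuming a toy tokenized sentence (the helper names are my own):

```python
def skipgram_pairs(tokens, window=2):
    # skip-gram: for each center word, emit one (center, context) pair
    # per context word in the window
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    # CBOW: emit one (bag of context words, center) example per position
    examples = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        examples.append((context, center))
    return examples

tokens = ["the", "quick", "brown", "fox"]
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

Skip-gram turns each window into several small prediction problems, while CBOW pools the whole context into one prediction per position.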
