Stanford: NLP with Deep Learning (2)
Lecture 2: Word Vectors 2 and Word Senses
- 2019 Winter | video | slides | notes
Lecture Plan
Lecture 2: Word Vectors and Word Senses
1. Main idea of word2vec
- Iterate through each word of the whole corpus
- Predict surrounding words using word vectors
- The probability distribution over surrounding words is obtained by taking the dot products of the word vectors and applying the softmax function (see the sketch below).
- The model makes the same predictions at every position.
We want a model that gives a reasonably high probability estimate to all words that occur in the context.
- Word2vec maximizes objective function by putting similar words nearby in space.
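A rough numpy sketch of this prediction step, with a toy vocabulary and random `U` (outside) and `V` (center) matrices; this is an illustration, not code from the lecture:

```python
import numpy as np

np.random.seed(0)
vocab_size, dim = 5, 4
U = np.random.randn(vocab_size, dim)  # one "outside" vector per word
V = np.random.randn(vocab_size, dim)  # one "center" vector per word

def p_outside_given_center(center_idx):
    """P(o | c): softmax over the dot products u_o . v_c for every word o."""
    scores = U @ V[center_idx]                  # dot product with each outside vector
    exp_scores = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exp_scores / exp_scores.sum()

print(p_outside_given_center(2))  # a probability distribution over the 5-word vocabulary
```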
2. Optimization: Gradient Descent
- We have a cost function \(J(\theta)\) to minimize
- Gradient Descent is an algorithm to minimize \(J(\theta)\)
- Idea: for current value of \(\theta\), calculate gradient of \(J(\theta)\), then take a small step in the direction of negative gradient & repeat.
Gradient Descent - update equation
(in matrix notation)
\(\theta^{new} = \theta^{old} - \alpha{\nabla}_\theta J(\theta)\)
\(\alpha\) = step size or learning rate
(for a single parameter) \[\theta_j^{new} = \theta_j^{old} - \alpha\frac{\partial}{\partial\theta_j^{old}} J(\theta)\]
- Algorithm
```python
while True:
    theta_grad = evaluate_gradient(J, corpus, theta)
    theta = theta - alpha * theta_grad
```
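A runnable toy version of that loop, with a made-up quadratic \(J(\theta)\) standing in for the word2vec objective, just to show the mechanics:

```python
import numpy as np

alpha = 0.1                    # step size / learning rate
theta = np.array([3.0, -2.0])  # parameters to optimize

def J(theta):
    return np.sum(theta ** 2)  # toy objective, minimized at theta = 0

def evaluate_gradient(J, theta):
    return 2 * theta           # gradient of the toy objective

for _ in range(100):           # a fixed number of steps instead of `while True`
    theta_grad = evaluate_gradient(J, theta)
    theta = theta - alpha * theta_grad

print(J(theta), theta)  # both are close to zero
```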
Problem: \(J(\theta)\) is a function of all windows in the corpus (potentially billions!), so \(\nabla_{\theta} J(\theta)\) is very expensive to compute.
- Solution: Stochastic gradient descent (SGD)
- Repeatedly sample windows and update after each one
- Algorithm:
```python
while True:
    window = sample_window(corpus)
    theta_grad = evaluate_gradient(J, window, theta)
    theta = theta - alpha * theta_grad
```
- Iteratively take gradients at each such window for SGD
- But in each window, we only have at most 2m + 1 words so \(\nabla_{\theta} J_t(\theta)\) is very sparse!
We might update only the word vectors that actually appear!
- Solution: either you need sparse matrix update operations to update only certain rows of the full embedding matrices U and V, or you need to keep around a hash for word vectors (sketched below).
If you have millions of word vectors and do distributed computing, it is important to not have to send gigantic updates around!
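A sketch of that sparse update for one window, with hypothetical indices and gradients; only the rows of `U` and `V` for words in the window are written to:

```python
import numpy as np

vocab_size, dim, alpha = 10000, 100, 0.025
U = np.random.randn(vocab_size, dim) * 0.01  # outside ("context") vectors
V = np.random.randn(vocab_size, dim) * 0.01  # center vectors

# Hypothetical window: one center word and its 2m outside words
center_idx = 42
outside_idxs = [7, 981, 1503, 2088]

# Pretend these came out of the gradient computation for this window
grad_center = np.random.randn(dim)
grad_outside = np.random.randn(len(outside_idxs), dim)

# Sparse update: write only the rows that appear in this window,
# rather than updating the full vocab_size x dim matrices
V[center_idx] -= alpha * grad_center
U[outside_idxs] -= alpha * grad_outside
```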
- Why two vectors? -> Easier optimization. Average both at the end
- But can do algorithm with just one vector per word
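For instance, continuing the toy `U` and `V` from the sketch above, the final embeddings could be formed by averaging the two matrices:

```python
# Final word vectors: average the center and outside representations
word_vectors = (U + V) / 2  # both are vocab_size x dim
```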
- Two models
- Skip-grams: Predict context (outside) words (position independent) given a center word
- Continuous Bag of Words (CBOW): Predict center word from (bag of) context words
- For additional efficiency: negative sampling. Train binary logistic regressions for a true pair (the center word and a word in its context window) versus several noise pairs (the center word paired with a random word)
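A minimal sketch of one negative-sampling step for a single true (center, outside) pair, with hypothetical indices and uniform noise sampling (the paper samples noise words from the unigram distribution raised to the 3/4 power):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

vocab_size, dim, alpha, k = 1000, 50, 0.025, 5
rng = np.random.default_rng(0)
U = rng.normal(scale=0.01, size=(vocab_size, dim))  # outside vectors
V = rng.normal(scale=0.01, size=(vocab_size, dim))  # center vectors

center, outside = 3, 17                     # hypothetical true (center, outside) pair
negatives = rng.integers(0, vocab_size, k)  # k noise words

v_c = V[center]

# Binary logistic regression: push the true pair's score up ...
score_true = sigmoid(U[outside] @ v_c)
grad_out_true = (score_true - 1.0) * v_c
grad_center = (score_true - 1.0) * U[outside]

# ... and push each noise pair's score down
score_neg = sigmoid(U[negatives] @ v_c)           # shape (k,)
grad_out_neg = score_neg[:, None] * v_c[None, :]  # shape (k, dim)
grad_center = grad_center + U[negatives].T @ score_neg

# Update only the vectors involved in this sample
U[outside] -= alpha * grad_out_true
U[negatives] -= alpha * grad_out_neg
V[center] -= alpha * grad_center
```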
Reference
- Stanford NLP with Deep Learning by Chris Manning
- videos
- New online certificate course in 2021
- Chris Manning’s github Text Analysis for Humanities Research
- Distributed Representations of Words and Phrases and their Compositionality (Mikolov et al., 2013), NeurIPS