Natural Language Processing with Deep Learning

Published: January 21, 2019

Notes on CS224n: Natural Language Processing with Deep Learning, taught by Christopher Manning at Stanford University.

Why is NLP hard?

Complexity in representing, learning, and using linguistic/situational/world/visual knowledge
Human languages are ambiguous (unlike programming and other formal languages)
Human language interpretation depends on real-world, common-sense, and contextual knowledge

Main idea of word2vec

Two algorithms

Skip-gram (SG): predict the context words given the target word (position independent)
Continuous Bag of Words (CBOW): predict the target word from a bag-of-words context

Two (moderately efficient) training methods

Hierarchical softmax
Negative sampling

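$J(\theta) = -\frac{1}{T}\sum_{t=1}^T \sum_{-m \le j \le m, j \neq 0} \log p(w_{t+j} \mid w_{t})$ : minimize the average negative log-likelihood of the context words within a window of radius $m$ around each center word $w_t$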
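
The double sum in this objective ranges over every (center, context) pair inside a window of radius $m$. A minimal sketch of that enumeration, assuming an already tokenized sentence; the helper name `skipgram_pairs` is hypothetical and not taken from the course material:

```python
def skipgram_pairs(tokens, m):
    """Return (center, context) pairs for a window of radius m.

    Mirrors the index set -m <= j <= m, j != 0 of the objective;
    positions that fall outside the sentence are skipped.
    """
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j == 0:
                continue  # the center word is not its own context
            k = t + j
            if 0 <= k < len(tokens):
                pairs.append((center, tokens[k]))
    return pairs


tokens = ["the", "quick", "brown", "fox"]
pairs = skipgram_pairs(tokens, m=1)
for center, context in pairs:
    print(center, "->", context)
```

Each pair contributes one $\log p(w_{t+j} \mid w_t)$ term to the sum.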

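$p(o \mid c) = \frac{\exp (u_o^T v_c)}{\sum_{w=1}^V \exp (u_w^T v_c)}$ : softmax over the vocabulary of size $V$, where $v_c$ is the vector of the center word $c$ and $u_o$ is the vector of the outside (context) word $o$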
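
The softmax can be sketched in a few lines of plain Python. The toy vectors in `U` and `v_c` are made up for illustration, and subtracting the maximum score before exponentiating is a standard numerical-stability trick that does not change the result:

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax_prob(o, v_c, U):
    """p(o | c): probability of outside word index o given center vector v_c.

    U holds one outside-word vector u_w per vocabulary entry.
    """
    scores = [dot(u_w, v_c) for u_w in U]
    shift = max(scores)  # stability shift; cancels in the ratio
    exps = [math.exp(s - shift) for s in scores]
    return exps[o] / sum(exps)


# Toy vocabulary of 3 words with 2-dimensional vectors (illustrative values).
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v_c = [0.5, -0.5]
probs = [softmax_prob(o, v_c, U) for o in range(len(U))]
print(probs)
```

The denominator touches every word in the vocabulary, which is why the full softmax is expensive and why hierarchical softmax and negative sampling are used instead.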

Negative sampling: train binary logistic regressions to distinguish a true pair (a center word and a word in its context window) from a few noise pairs (the center word paired with a random word).

The skip-gram model with negative sampling

$J(\theta) = \frac{1}{T} \sum_{t=1}^T J_t(\theta)$

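$J_t(\theta) = \log \sigma(u_o^Tv_c) + \sum_{i=1}^{k} \mathbb{E}_{j \sim P(w)} [\log \sigma(-u_j^Tv_c)]$ : this objective is maximized; it raises the probability that the real outside word appears and lowers the probability that the $k$ random words drawn from the noise distribution $P(w)$ appear around the center word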
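
A minimal sketch of this per-position objective, assuming the noise words have already been drawn (sampling from $P(w)$ is omitted) and using made-up toy vectors; the function name `neg_sampling_objective` is hypothetical:

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def neg_sampling_objective(u_o, v_c, noise_vectors):
    """J_t for one (center, outside) pair; larger is better.

    noise_vectors are the outside vectors u_j of the sampled noise words.
    """
    positive = math.log(sigmoid(dot(u_o, v_c)))
    negative = sum(math.log(sigmoid(-dot(u_j, v_c))) for u_j in noise_vectors)
    return positive + negative


v_c = [1.0, 0.0]
noise = [[-2.0, 0.0], [0.0, 1.0]]

# A true outside word aligned with the center word scores well ...
obj_good = neg_sampling_objective([2.0, 0.0], v_c, noise)
# ... while one pointing the opposite way is penalized.
obj_bad = neg_sampling_objective([-2.0, 0.0], v_c, noise)
print(obj_good, obj_bad)
```

Only the true pair and the sampled noise pairs are touched, so the cost per position no longer depends on the vocabulary size.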

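GloVe: $J(\theta) = \frac{1}{2} \sum_{i,j=1}^W f(P_{ij})(u_i^Tv_j - \log P_{ij})^2$ : combines count-based and direct-prediction methods. Here $P_{ij}$ is the co-occurrence count of words $i$ and $j$, $W$ is the vocabulary size, and $f$ is a weighting function that limits the influence of very frequent pairs.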
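
A minimal sketch of the GloVe loss on a toy co-occurrence matrix. The notes do not define $f$; the form below, with `x_max = 100` and `alpha = 0.75`, is the weighting function from the GloVe paper and is an assumption here, as are the toy values and the name `glove_loss`. Zero counts are skipped because $\log 0$ is undefined and $f(0) = 0$ makes the term vanish anyway:

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): grows with the count, capped at 1 (assumed form)."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_loss(P, U, V):
    """Weighted squared error between u_i . v_j and log P_ij."""
    total = 0.0
    for i, row in enumerate(P):
        for j, p_ij in enumerate(row):
            if p_ij <= 0:
                continue  # zero counts contribute nothing
            err = dot(U[i], V[j]) - math.log(p_ij)
            total += weight(p_ij) * err ** 2
    return 0.5 * total


# Toy 2x2 co-occurrence counts with 1-dimensional vectors.
P = [[4.0, 9.0], [4.0, 9.0]]
U = [[1.0], [1.0]]
V_fit = [[math.log(4.0)], [math.log(9.0)]]  # reproduces log P_ij exactly
V_off = [[0.0], [0.0]]                      # ignores the counts

loss_fit = glove_loss(P, U, V_fit)
loss_off = glove_loss(P, U, V_off)
print(loss_fit, loss_off)
```

The loss is zero when every dot product matches the log count, which is the sense in which the model fits the co-occurrence statistics directly.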