# Natural Language Processing with Deep Learning


CS224n: Natural Language Processing with Deep Learning by Christopher Manning at Stanford University.

Why is NLP hard?

- Complexity in representing, learning, and using linguistic/situational/world/visual knowledge
- Human languages are ambiguous (unlike programming languages and other formal languages)
- Human language interpretation depends on real-world, common-sense, and contextual knowledge

Main idea of word2vec

Two algorithms:

- Skip-gram (SG): predict context words given the target (center) word (position independent)
- Continuous Bag of Words (CBOW): predict the target word from a bag-of-words context

Two (moderately efficient) training methods:

- Hierarchical softmax
- Negative sampling
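A minimal sketch (not CS224n's reference code) of how skip-gram builds its training data: every word is paired with each word inside a window of size `m`, and the position within the window is ignored. The function name and toy sentence are illustrative assumptions.

```python
# Sketch: generate skip-gram (center, context) training pairs.
# Position within the window is ignored ("position independent").
def skipgram_pairs(tokens, m=2):
    """Return (center, context) pairs for a window of size m."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

print(skipgram_pairs(["the", "quick", "brown", "fox"], m=1))
# → [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#    ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```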

The skip-gram objective, averaged over a corpus of $T$ words with window size $m$:

$J(\theta) = -\frac{1}{T}\sum_{t=1}^T \sum_{-m \le j \le m, j \neq 0} \log p(w_{t+j} \mid w_{t})$

The softmax probability of outside word $o$ given center word $c$:

$p(o \mid c) = \frac{\exp (u_o^T v_c)}{\sum_{w=1}^V \exp (u_w^T v_c)}$
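The softmax above can be sketched in a few lines of numpy. This is an illustrative toy (random vectors, vocabulary size 5), not the course's code; `U` holds the outside vectors $u_w$ as rows and `v_c` is the center vector.

```python
import numpy as np

# Sketch: p(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)
def softmax_prob(U, v_c):
    scores = U @ v_c          # u_w^T v_c for every word w in the vocabulary
    scores -= scores.max()    # subtract max for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()    # a probability distribution over the vocabulary

rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))   # toy vocabulary of 5 words, d = 3
v_c = rng.normal(size=3)
p = softmax_prob(U, v_c)
print(p.sum())                # probabilities sum to 1
```

Note the denominator sums over the whole vocabulary, which is exactly the cost that hierarchical softmax and negative sampling avoid.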

Negative sampling: train binary logistic regressions to distinguish a true pair (the center word and a word in its context window) from $k$ noise pairs (the center word paired with a random word).

The skip-gram model with negative sampling

$J(\theta) = \frac{1}{T} \sum_{t=1}^T J_t(\theta)$

$J_t(\theta) = \log \sigma(u_o^T v_c) + \sum_{i=1}^{k} \mathbb{E}_{j \sim P(w)} [\log \sigma(-u_j^T v_c)]$ : maximize the probability that the real outside word appears; minimize the probability that $k$ random words appear around the center word
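The per-position objective $J_t$ can be sketched directly. This is a toy illustration (not the course's code): `u_o` is the real outside word's vector, and `U_neg` stacks the vectors of the $k$ sampled noise words as rows.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Sketch of the SGNS objective for one position (to be maximized):
# log σ(u_o^T v_c) + Σ_j log σ(-u_j^T v_c) over k sampled noise words.
def sgns_objective(u_o, v_c, U_neg):
    pos = np.log(sigmoid(u_o @ v_c))             # real outside word: push score up
    neg = np.log(sigmoid(-(U_neg @ v_c))).sum()  # noise words: push scores down
    return pos + neg
```

Since each term is a log-probability, the objective is at most 0, and it grows as the true pair's dot product increases.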

GloVe combines the count-based and direct-prediction approaches: $J(\theta) = \frac{1}{2} \sum_{i,j=1}^W f(P_{ij})(u_i^Tv_j - \log P_{ij})^2$
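A minimal sketch of the GloVe objective under assumed defaults: the weighting function $f(x) = \min(1, (x/x_{\max})^\alpha)$ with $x_{\max}=100$, $\alpha=0.75$ is the standard choice from the GloVe paper, and the toy matrices below are illustrative (in practice the sum runs only over nonzero co-occurrence counts, since $\log P_{ij}$ is undefined at zero).

```python
import numpy as np

# Sketch of the GloVe loss: (1/2) Σ f(P_ij) (u_i^T v_j - log P_ij)^2
def glove_loss(U, V, P, x_max=100.0, alpha=0.75):
    f = np.minimum(1.0, (P / x_max) ** alpha)  # down-weight rare co-occurrences
    err = U @ V.T - np.log(P)                  # u_i^T v_j - log P_ij
    return 0.5 * (f * err ** 2).sum()

# Toy 2-word "vocabulary" with made-up co-occurrence counts
P = np.array([[1.0, 2.0], [3.0, 4.0]])
U = np.array([[0.1, 0.2], [0.3, 0.4]])
V = np.array([[0.5, 0.6], [0.7, 0.8]])
print(glove_loss(U, V, P))
```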

If you only have a small training dataset, don't train (fine-tune) the pretrained word vectors.

If you have a very large dataset, it may work better to train the word vectors to the task.

The max-margin loss: $J = \max (0, 1-s+s_c)$, where $s$ is the score of the true window and $s_c$ is the score of the corrupt window.

Idea for the training objective: make the score of the true window larger and the corrupt window's score lower (until they are separated by the margin)
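The loss above is a one-liner; the toy scores here are made up to show the two regimes. Once the true score beats the corrupt score by at least the margin of 1, the loss (and hence the gradient) is zero, so training stops pushing that pair apart.

```python
# Sketch: max-margin loss J = max(0, 1 - s + s_c),
# s = score of the true window, s_c = score of the corrupt window.
def max_margin_loss(s, s_c):
    return max(0.0, 1.0 - s + s_c)

print(max_margin_loss(3.0, 0.5))  # → 0.0  (margin already satisfied)
print(max_margin_loss(1.0, 0.5))  # → 0.5  (still inside the margin)
```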

Backpropagation is just the chain rule, nothing fancy!
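A tiny worked example of that claim, with assumed toy values: for $f(w) = \sigma(wx)$, the chain rule gives $\partial f/\partial w = \sigma(wx)\,(1-\sigma(wx))\,x$, which we can verify against a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Chain rule: d/dw sigmoid(w*x) = sigmoid'(w*x) * x,
# where sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)).
def grad_w(w, x):
    s = sigmoid(w * x)
    return s * (1.0 - s) * x

# Numerical check with a central finite difference (toy values)
w, x, eps = 0.7, 1.3, 1e-6
numeric = (sigmoid((w + eps) * x) - sigmoid((w - eps) * x)) / (2 * eps)
print(grad_w(w, x), numeric)  # the two should agree to many decimal places
```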

TensorFlow = Tensor + Flow: computations are expressed as a graph through which tensors flow.