Deep Learning for Natural Language Processing (Richard Socher, Salesforce)

The guest

Richard Socher: deep learning and NLP researcher, Chief Scientist at Salesforce at the time of the talk, with a PhD from Stanford; co-creator of the GloVe word vectors and of the dynamic memory network.

The gist

Richard Socher delivers a lecture on deep learning for natural language processing, structured from basics to cutting-edge research. He covers the two core building blocks: word vectors (word2vec and GloVe, which capture co-occurrence statistics) and recurrent neural networks, including GRUs, for sequence modeling. He then introduces dynamic memory networks (DMNs), an architecture that reframes many NLP tasks as question answering and uses an episodic memory module with attention to reason over the input across multiple passes. He shows the same architecture achieving state-of-the-art results on logical reasoning, sentiment analysis, part-of-speech tagging, and even visual question answering, the last by swapping the text input module for one built on CNN image features.
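To make the GRU building block concrete, here is a minimal sketch of a single GRU update, assuming scalar inputs and scalar weights for readability (real models use vectors and weight matrices). The function name gru_step, the weight names in params, and the numeric values are all hypothetical, and the sign convention for the update gate varies between papers.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h, p):
    """One GRU update for a scalar input x and scalar hidden state h.

    p is a dict of hypothetical scalar weights (biases omitted for brevity).
    """
    z = sigmoid(p["wz"] * x + p["uz"] * h)                 # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h)                 # reset gate
    h_tilde = math.tanh(p["wh"] * x + p["uh"] * (r * h))   # candidate state
    # Blend the previous state with the candidate. Some papers swap the
    # roles of z and (1 - z); this is one common convention.
    return z * h + (1.0 - z) * h_tilde


# Illustrative weights only; a trained model would learn these.
params = {"wz": 0.5, "uz": 0.5, "wr": 0.5, "ur": 0.5, "wh": 1.0, "uh": 1.0}

h = 0.0
states = []
for x in [1.0, -1.0, 0.5]:   # a toy input sequence
    h = gru_step(x, h, params)
    states.append(h)
```

When the update gate saturates near 1, the previous state is copied forward almost unchanged, which is how a GRU can carry information across many time steps.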

Big reveals

Socher points listeners to his Stanford course CS224d for the full details on optimizing word2vec rather than covering it all in the talk.
00:17:37
He previews an unpublished 'pointer sentinel mixture model', due for release the following week, that lets language models predict words never seen at training time.
00:39:34
The pointer sentinel model pushes language-model perplexity down to 70, a 10+ point improvement in roughly two years driven by deep learning.
00:43:13
Socher proposes reducing nearly all NLP tasks to question answering, motivating the dynamic memory network architecture.
00:49:05
The dynamic memory network achieves state of the art on Facebook's bAbI logical-reasoning dataset, reaching 100% accuracy on some tasks.
01:02:09
A computer-vision researcher (Caiming Xiong) adapted the DMN to visual question answering just by changing the input module to CNN region features, achieving state of the art.
01:08:24
Live demo: the visual QA model correctly answers a journalist's improvised question and a string of Socher's own unscripted questions about an image.
01:15:13
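
The idea behind the pointer sentinel mixture model mentioned above can be sketched as a weighted mix of two distributions: a standard softmax over a fixed vocabulary and a pointer distribution over words in the recent context. This is a simplified illustration, not the model from the talk: the gate g is a fixed number here, whereas the real model computes it from the network state, and the function name mixture_probs and all probabilities are made up for the example.

```python
def mixture_probs(p_vocab, p_ptr, g):
    """Mix a vocabulary distribution with a pointer distribution.

    g is the weight on the vocabulary model; (1 - g) goes to the pointer.
    Words missing from one distribution get probability 0 from it.
    """
    words = set(p_vocab) | set(p_ptr)
    return {
        w: g * p_vocab.get(w, 0.0) + (1.0 - g) * p_ptr.get(w, 0.0)
        for w in words
    }


# Hypothetical distributions: "Socher" is not in the fixed vocabulary,
# but it occurred in the recent context, so the pointer can select it.
p_vocab = {"the": 0.6, "model": 0.3, "<unk>": 0.1}
p_ptr = {"Socher": 0.8, "model": 0.2}

mixed = mixture_probs(p_vocab, p_ptr, g=0.7)
```

Because both inputs sum to 1 and the weights sum to 1, the mixture is still a valid probability distribution, and the out-of-vocabulary word receives nonzero probability purely through the pointer.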

Things worth remembering

The four-word sentence 'I made her duck' has at least four distinct meanings, illustrating language ambiguity.
00:04:38
Word analogies fall out of simple vector arithmetic: vector(woman) - vector(man) + vector(king) yields a vector whose nearest word is 'queen'.
00:21:48
GloVe vectors were trained on Common Crawl, a dataset covering most of the internet with many billions of tokens.
00:19:41
Language modeling is described as nearly 'AI complete' since perfect next-word prediction would disambiguate almost everything.
00:30:13
The DMN's episodic memory parallels human episodic memory and the hippocampus's role in transitive inference (connecting A to B to C).
00:58:32
More passes over the input help on reasoning and counting tasks, but for sentiment analysis accuracy drops with more than two passes.
01:04:13
Visual QA models can exploit language priors: a model is right ~95% of the time answering 'yellow' for banana color without seeing the image.
01:12:37
The QA systems are not robust to false-premise questions because they were never trained with adversarial examples, raising security concerns.
01:21:59
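
The word-analogy arithmetic noted above can be tried on toy data. The sketch below uses hand-made 2-dimensional vectors (entirely hypothetical; real GloVe or word2vec vectors have hundreds of dimensions and are learned from text) and picks the nearest word by cosine similarity, excluding the three query words as is common practice.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical): axis 0 is roughly "royalty", axis 1 "gender".
vectors = {
    "king":  [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.1],
}


def analogy(a, b, c):
    """Return the word nearest to vector(b) - vector(a) + vector(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the query words so the answer is not trivially one of them.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


result = analogy("man", "woman", "king")   # woman - man + king
```

With these toy vectors the target lands exactly on the 'queen' vector; with real embeddings the match is only approximate, which is why a nearest-neighbor search is needed.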

Topics

deep learning, natural language processing, word vectors, recurrent neural networks, question answering, dynamic memory networks, visual question answering, sentiment analysis