Nuts and Bolts of Applying Deep Learning (Andrew Ng)

The guest

Andrew Ng — Co-founder of Coursera, former leader of Baidu's AI team and Google Brain, and a leading deep learning researcher and educator.

The gist

In this whiteboard-style talk at a deep learning workshop, Andrew Ng distills common patterns he observed leading a large AI team across vision, speech, and NLP applications. He argues that scale of data and compute is the number one driver of deep learning progress, and that end-to-end deep learning is powerful but only works when you have enough labeled data. Much of the talk is a practical workflow for diagnosing models using human-level performance, bias, and variance, including how to handle train/test sets drawn from different distributions. He closes with career advice: read 20-50 papers, replicate results, embrace the dirty work, and study consistently weekend after weekend.
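
The diagnosis workflow mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not Ng's exact procedure. It assumes you have measured error rates for human-level performance, the training set, and the dev set. It also assumes a held-out "train-dev" slice drawn from the training distribution; that slice is not mentioned in these notes and is included here to separate variance from data mismatch. The function name `diagnose` and all numbers are hypothetical.

```python
def diagnose(human, train, train_dev, dev):
    """Return the error gaps and the name of the largest one.

    All arguments are error rates as fractions (0.05 means 5%).
    human     : human-level error, used as a proxy for the optimal error
    train     : error on the training set
    train_dev : error on held-out data from the training distribution
    dev       : error on the dev set (the target distribution)
    """
    gaps = {
        "avoidable bias": train - human,   # model underfits the training data
        "variance": train_dev - train,     # model fails to generalize
        "data mismatch": dev - train_dev,  # train and dev distributions differ
    }
    largest = max(gaps, key=gaps.get)
    return gaps, largest


# Hypothetical numbers, for illustration only.
gaps, largest = diagnose(human=0.01, train=0.02, train_dev=0.03, dev=0.10)
print(largest)
```

The largest gap suggests where to spend effort next: a bigger model for bias, more data or regularization for variance, and data closer to the target distribution for mismatch.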

Big reveals

Ng argues the single biggest reason deep learning works now is scale: large neural networks trained on the huge amounts of data we finally have access to.
00:02:37
He frames the rise of end-to-end deep learning (learning algorithms that output complex things like sentences, captions, or audio) as the second major trend.
00:09:58
Ng recounts publicly arguing that phonemes are a fantasy of linguists and being yelled at by a linguist at Stanford, but says his side turned out to be right.
00:15:41
He uses Baidu's speech-enabled rearview mirror product in China to show why train and test sets often come from different distributions.
00:39:09
Best practice revealed: your dev set and test set must come from the same distribution, or months of tuning can be wasted.
00:41:44
Rule of thumb: if a typical person can do a task in less than one second of thinking, deep learning can probably automate it.
01:06:36
His reliable formula for becoming a machine learning researcher: read 20 to 50 papers and replicate results, and you will start having your own ideas.
01:13:29
The Saturday story: real career growth comes from studying and replicating results weekend after weekend for a year, despite no short-term rewards.
01:16:34
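
The dev/test rule above can be made concrete with a small split routine. This is a sketch under stated assumptions, not the procedure used at Baidu: it supposes a large pool of general data plus a small amount of data from the target product, puts all general data in training, and draws both dev and test from the target data only. The function name `split_for_target`, the 50/50 split, and the toy data are all hypothetical.

```python
import random


def split_for_target(source_examples, target_examples, seed=0):
    """Train on the large source pool; take dev AND test from the target data.

    Because dev and test are two halves of the same shuffled target data,
    they come from the same distribution.
    """
    rng = random.Random(seed)
    target = list(target_examples)
    rng.shuffle(target)
    half = len(target) // 2
    dev, test = target[:half], target[half:]
    train = list(source_examples)
    return train, dev, test


# Toy stand-ins: 1000 general utterances, 20 from the target product.
source = [("general", i) for i in range(1000)]
target = [("mirror", i) for i in range(20)]
train, dev, test = split_for_target(source, target)
print(len(train), len(dev), len(test))
```

The design choice is that the dev set defines the target you tune toward, so it should match the test set even when the training data does not.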

Things worth remembering

Ng recommends building a separate computer systems team alongside the AI team because HPC expertise is too specialized for one person to also master ML.
00:05:47
Doctors read X-rays of a child's hand to predict the child's age, a task where non-end-to-end pipelines work better due to limited data.
00:16:43
Synthetic OCR data can be generated by pasting random English words in random fonts onto random internet images, but requires heavy tuning to match the real distribution.
00:32:20
Speech recognition training data can be synthesized by mathematically adding clean speech to recorded background noise like car interior sounds.
00:33:55
Using Grand Theft Auto cars as training data fails because a game may show only about 20 distinct cars, an impoverished dataset for a learning algorithm.
00:36:32
Ng mandated a single unified company-wide data warehouse at Baidu, treating data as company data with separate discussions only about access rights.
00:37:04
Ng argues a team of expert doctors debating an image, around 0.5% error, is the most useful definition of human-level performance because it best estimates the optimal error rate.
01:00:26
Predicting whether a user clicks the next ad is described as probably the most lucrative application of deep learning today.
01:09:18
A phoneme is the basic unit of sound, for example the shared 'c' sound in 'cat' and 'kick' that linguists hypothesized as fundamental.
00:14:09
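
The speech synthesis idea above (adding clean speech to recorded background noise) can be illustrated with plain Python lists standing in for audio samples. This is a minimal sketch: the scaling to a chosen signal-to-noise ratio is an assumption added for illustration, not something stated in the talk, and the names `rms` and `mix_at_snr` are hypothetical. The sine tone and Gaussian noise are toy stand-ins for real recordings.

```python
import math
import random


def rms(signal):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaling the noise to the requested SNR in dB."""
    if len(noise) < len(speech):
        raise ValueError("noise clip must be at least as long as the speech clip")
    noise = noise[:len(speech)]
    # SNR_dB = 20 * log10(rms(speech) / rms(scaled noise))
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    scale = target_noise_rms / rms(noise)
    return [s + scale * n for s, n in zip(speech, noise)]


# Toy stand-ins: a 440 Hz tone as "clean speech", Gaussian noise as "car noise".
rng = random.Random(0)
sample_rate = 16000
speech = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
noise = [rng.gauss(0.0, 1.0) for _ in range(sample_rate)]
mixed = mix_at_snr(speech, noise, snr_db=10)
```

As the notes say for synthetic OCR data, a mix like this only helps if the result resembles the real target audio, so the noise clips and SNR range would need tuning against real recordings.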

Topics

deep learning, machine learning workflow, bias and variance, end-to-end learning, data synthesis, human-level performance, speech recognition, ML career advice