Deep Learning for Speech Recognition (Adam Coates, Baidu)

The guest

Adam Coates, a researcher at Baidu's Silicon Valley AI Lab working on the Deep Speech recognition engine, presents this technical tutorial on deep learning for speech.

The gist

This is a technical tutorial on building speech recognition systems with deep learning. Coates first walks through the traditional pipeline (features, acoustic model, language model, decoder, phonemes, lexicon) and its brittleness, then shows how deep learning first replaced single components and now powers end-to-end systems. He explains CTC (Connectionist Temporal Classification) in detail as the method for mapping variable-length audio to transcriptions, plus training tricks like SortaGrad and batch normalization. The final sections cover scaling up with data augmentation, GPU computation, beam-search decoding with n-gram language models, and production concerns like latency and batching. He notes Baidu's Deep Speech engine reached human-competitive accuracy in Mandarin.
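
As a rough illustration of the CTC output convention mentioned above, the sketch below turns a frame-by-frame label sequence into a transcription by merging repeated symbols and then dropping blanks. The function name `ctc_collapse` and the use of `_` as the blank symbol are assumptions made for this example, not details taken from the talk.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol; chosen here only for readability


def ctc_collapse(frames: str, blank: str = BLANK) -> str:
    """Collapse a per-frame label string into a transcription.

    Step 1: merge runs of identical consecutive symbols.
    Step 2: remove blank symbols.
    """
    merged = [symbol for symbol, _ in groupby(frames)]
    return "".join(s for s in merged if s != blank)


if __name__ == "__main__":
    # A doubled letter survives only when a blank separates the two runs.
    print(ctc_collapse("hh_e_ll_lll_oo"))  # hello
    # An all-blank output, as seen early in training, decodes to nothing.
    print(repr(ctc_collapse("______")))  # ''
```

Applying this collapse to the most likely symbol at each frame gives the max-decoding output that serves as a quick training diagnostic.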

Big reveals

A study with Stanford and UW showed texting by voice is three times faster than typing, even with recognition errors.
00:00:31
Swapping a Gaussian mixture model for a deep belief network as the acoustic model gave a 10-20% relative accuracy jump in a single 2011 paper.
00:11:32
Baidu's Deep Speech Mandarin engine reached a character error rate below 6%, beating a single human transcriber (~10%) and matching committees of native speakers.
01:17:06
Once you have a basic deep learning pipeline, getting to state-of-the-art is fundamentally a problem of scale: more data and more compute.
01:01:23
Robustness to noise is engineered cheaply by synthesizing data, overlaying free Creative Commons noise tracks onto clean read speech rather than collecting noisy audio.
01:06:42
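
The noise-synthesis idea in the last item can be sketched in a few lines. This is a minimal example under stated assumptions: the talk describes overlaying noise tracks on clean speech, while scaling the noise to a target signal-to-noise ratio, the function names `rms` and `overlay_noise`, and the plain-list sample format are choices made here for illustration.

```python
import math
import random


def rms(signal):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def overlay_noise(speech, noise, snr_db):
    """Mix `noise` into `speech` at the requested signal-to-noise ratio (dB).

    The noise track is looped or truncated to the speech length, scaled so
    that 20*log10(rms(speech) / rms(scaled_noise)) equals snr_db, and then
    added sample by sample.
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    looped = [noise[i % len(noise)] for i in range(len(speech))]
    noise_rms = rms(looped)
    if noise_rms == 0.0:
        return list(speech)  # silent noise track: nothing to add
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, looped)]


if __name__ == "__main__":
    # Stand-ins for real audio: a sine wave as "speech", uniform noise as "noise".
    clean = [math.sin(0.1 * i) for i in range(1000)]
    rng = random.Random(0)
    background = [rng.uniform(-1, 1) for _ in range(300)]
    noisy = overlay_noise(clean, background, snr_db=10.0)
    print(len(noisy))
```

Drawing a different noise track and SNR for each utterance on every pass would yield many noisy variants from one clean recording.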

Things worth remembering

Phonemes are approximate perceptual units of sound and it is unclear how fundamental they really are; the standardized TIMIT dataset provides labeled examples.
00:07:15
A sample training utterance used is a person reading the Wall Street Journal: 'a tanker is a ship designed to carry large volumes of oil.'
00:43:30
Watching the max-decoding output of softmax neurons is a handy diagnostic; after 300 iterations the network just outputs blanks and spaces.
00:45:36
Transcribing speech data costs roughly 50 cents to a dollar a minute depending on quality and difficulty.
00:55:30
Training one Deep Speech model takes about 1.2 x 10^19 floating-point operations, roughly a month on a single Titan X card.
01:10:51
The Lombard effect makes people involuntarily raise their voice in noisy environments; researchers play loud noise in headphones to elicit it.
01:03:30
To get more conversational, expressive speech data, workers are given movie scripts and poetry so they voice-act while reading.
01:04:34
The 'Tchaikovsky problem': proper names you've never heard can only be spelled correctly with a language model trained on text.
00:51:52
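
Two of the figures above lend themselves to a quick back-of-envelope check. The sketch below works out the sustained throughput implied by the training-cost item and the transcription bill implied by the per-minute rate. Treating "a month" as 30 days and using a 1,000-hour dataset are assumptions made here for illustration; the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def implied_throughput(total_flops: float, days: float) -> float:
    """Sustained operations per second needed to finish `total_flops` in `days`."""
    return total_flops / (days * SECONDS_PER_DAY)


def transcription_cost(hours: float, dollars_per_minute: float) -> float:
    """Cost in dollars to transcribe `hours` of audio at a per-minute rate."""
    return hours * 60 * dollars_per_minute


if __name__ == "__main__":
    # 1.2e19 operations is the figure from the notes; 30 days is an assumption.
    tflops = implied_throughput(1.2e19, days=30) / 1e12
    print(f"implied sustained throughput: {tflops:.2f} TFLOP/s")

    # 1,000 hours is a hypothetical dataset size, not a number from the talk.
    low = transcription_cost(1000, 0.50)
    high = transcription_cost(1000, 1.00)
    print(f"1,000 hours of audio: ${low:,.0f} to ${high:,.0f}")
```

Under these assumptions the training figure works out to about 4.6 TFLOP/s sustained, and 1,000 hours of transcription to between $30,000 and $60,000.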

Topics

speech recognition, deep learning, CTC, recurrent neural networks, Baidu Deep Speech, language models, GPU training, data augmentation