Foundations of Unsupervised Deep Learning (Ruslan Salakhutdinov, CMU)

The guest

Ruslan Salakhutdinov — Machine learning professor at Carnegie Mellon University, a leading researcher in deep learning and unsupervised generative models.

The gist

This is a technical lecture on the foundations of unsupervised deep learning, motivated by the fact that most real-world data is unlabeled. Salakhutdinov walks through the building blocks: sparse coding, autoencoders, and their connection to PCA, then moves into probabilistic generative models including restricted Boltzmann machines and deep Boltzmann machines. He explains the math of maximum likelihood learning, contrastive divergence, and the difficulty of intractable partition functions, before covering variational autoencoders and the reparameterization trick. The talk concludes with generative adversarial networks, framing learning as a game between a generator and a discriminator, with examples in image generation, multimodal image-text models, and one-shot character generation.
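
To make the contrastive divergence step concrete, here is a minimal NumPy sketch of a CD-1 update for a small binary restricted Boltzmann machine. It is an illustration under stated assumptions, not code from the lecture: the toy data, layer sizes, learning rate, and the names cd1_update and reconstruction_error are all hypothetical. The idea it shows is that the intractable model expectation in the maximum likelihood gradient is replaced by statistics from a single Gibbs step started at the data.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, rng, lr=0.1):
    """One CD-1 step for a binary RBM (illustrative sketch).

    v0: (batch, n_visible) binary data, W: (n_visible, n_hidden),
    b: visible biases, c: hidden biases. Returns updated (W, b, c).
    """
    # Positive phase: hidden probabilities and a hidden sample given the data.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)

    # A single Gibbs step instead of running the chain to convergence.
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)

    # Data statistics minus one-step reconstruction statistics.
    n = v0.shape[0]
    W = W + lr * (v0.T @ ph0 - v1.T @ ph1) / n
    b = b + lr * (v0 - v1).mean(axis=0)
    c = c + lr * (ph0 - ph1).mean(axis=0)
    return W, b, c

def reconstruction_error(v, W, b, c):
    """Mean squared error of a deterministic up-down pass (a rough proxy only)."""
    h = sigmoid(v @ W + c)
    v_hat = sigmoid(h @ W.T + b)
    return float(np.mean((v - v_hat) ** 2))

# Hypothetical toy data: two 6-pixel binary patterns, sampled 200 times.
rng = np.random.default_rng(0)
patterns = np.array([[1, 1, 1, 0, 0, 0],
                     [0, 0, 0, 1, 1, 1]], dtype=float)
data = patterns[rng.integers(0, 2, size=200)]

W = 0.01 * rng.standard_normal((6, 2))
b = np.zeros(6)
c = np.zeros(2)

err_before = reconstruction_error(data, W, b, c)
for _ in range(1000):
    W, b, c = cd1_update(data, W, b, c, rng)
err_after = reconstruction_error(data, W, b, c)
```

Reconstruction error is used here only as a cheap monitoring signal. It is not the likelihood, which would require the partition function that the lecture describes as intractable.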

Big reveals

A linear autoencoder with shared encoder/decoder weights learns the same latent subspace as PCA, so nonlinear autoencoders can be viewed as nonlinear extensions of PCA.
00:15:38
Hinton's 2002 contrastive divergence algorithm made Boltzmann machines trainable by running the Markov chain for just one step instead of running it to convergence.
00:32:45
The Helmholtz machine (1995) 'never worked' for a decade until researchers figured out the trick two years prior to the talk.
00:53:31
The 2014 reparameterization trick (Kingma, Welling and others) collapses variational models into autoencoders, enabling backpropagation through stochastic systems.
00:57:08
GANs learn generative models with no maximum likelihood, no MCMC, and no explicit density, just by playing a game between two networks.
01:08:39
Stochasticity (injected noise) lets generative models produce a whole distribution of plausible images from one caption rather than a single deterministic output.
01:01:51
GANs produce sharper images than VAEs because the adversarial objective does not care exactly where an edge is placed as long as the result fools the discriminator, whereas the Gaussian (L2) reconstruction loss of a VAE penalizes every misplaced pixel.
01:18:27
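
As a rough illustration of the reparameterization trick mentioned above, the sketch below writes a Gaussian sample as z = mu + sigma * eps with eps drawn from a standard normal, so that z is a deterministic, differentiable function of mu and sigma. The test function f(z) = z^2, the sample count, and the names reparameterize and grad_expected_square are assumptions chosen for illustration; they do not come from the talk. The closed form E[z^2] = mu^2 + sigma^2 gives exact gradients 2*mu and 2*sigma to compare against.

```python
import numpy as np

def reparameterize(mu, sigma, eps):
    """Express a Gaussian sample as a deterministic function of (mu, sigma)."""
    return mu + sigma * eps

def grad_expected_square(mu, sigma, n_samples=200_000, seed=0):
    """Monte Carlo gradient of E[z^2] with respect to mu and sigma.

    The noise eps is sampled once, outside the differentiable path, so the
    chain rule applies directly: dz/dmu = 1 and dz/dsigma = eps.
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_samples)
    z = reparameterize(mu, sigma, eps)
    df_dz = 2.0 * z                      # derivative of f(z) = z^2
    grad_mu = np.mean(df_dz * 1.0)       # chain rule through dz/dmu
    grad_sigma = np.mean(df_dz * eps)    # chain rule through dz/dsigma
    return float(grad_mu), float(grad_sigma)

grad_mu, grad_sigma = grad_expected_square(mu=1.5, sigma=0.5)
exact_mu, exact_sigma = 2 * 1.5, 2 * 0.5  # from E[z^2] = mu^2 + sigma^2
```

In a variational autoencoder, mu and sigma would be outputs of an encoder network and f would be the decoder's reconstruction term; the same chain-rule structure is what lets ordinary backpropagation pass through the sampling step.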

Things worth remembering

Sparse coding has roots in 1996 and was originally developed to explain early visual processing in the brain, where the learned features act as edge detectors.
00:06:46
Compressing data to a 20-dimensional binary code gives 2^20 (about one million) possible addresses, which the talk puts at about 4 gigabytes, so everything can be stored in memory and retrieved via direct memory lookups.
00:21:19
A deep autoencoder trained on faces 'regularized' the data by removing the glasses from the only person wearing them, treating the rare feature as noise.
00:20:16
On the Netflix dataset, one hidden unit of an RBM became dedicated specifically to Michael Moore's movies, which people tend to either love or hate.
00:36:54
A 28x28 binary image has 2^784 possible configurations, an exponential space in which real images occupy a tiny subspace.
00:24:30
After a caption-to-image paper went on arXiv, the team's generated image ranked above the real Google result for 'a toilet seat sits open in the bathroom' due to click traffic.
01:04:28
A recurrent generative model was trained on about 7,000 romance novels to generate romantic-style captions for images.
01:05:00
In one-shot character generation, machine-drawn versus human-drawn characters are nearly indistinguishable to people, roughly a 50/50 guess rate.
01:07:06
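
The counting claims above can be checked with a few lines of arithmetic, and the memory-lookup idea can be sketched with an ordinary dictionary. This is a hedged illustration only: the item names, the helper functions code_to_address, build_index and lookup, and the choice to also probe addresses one bit flip away are assumptions, not details from the lecture.

```python
from collections import defaultdict

# Counting: a 28x28 binary image has 784 pixels, hence 2^784 configurations.
n_pixels = 28 * 28
n_images = 2 ** n_pixels
n_digits = len(str(n_images))   # number of decimal digits in 2^784

# A 20-bit binary code addresses 2^20 distinct memory locations.
n_addresses = 2 ** 20

def code_to_address(bits):
    """Pack a sequence of 0/1 values into a single integer address."""
    address = 0
    for bit in bits:
        address = (address << 1) | int(bit)
    return address

def build_index(codes):
    """Map each integer address to the list of item ids stored there."""
    index = defaultdict(list)
    for item_id, bits in codes.items():
        index[code_to_address(bits)].append(item_id)
    return index

def lookup(index, bits, radius_one=True):
    """Return items at the query address, optionally plus one-bit-flip neighbors."""
    address = code_to_address(bits)
    hits = list(index.get(address, []))
    if radius_one:
        for i in range(len(bits)):
            hits.extend(index.get(address ^ (1 << i), []))
    return hits

# Hypothetical items with hand-written 20-bit codes.
codes = {
    "doc_a": [0] * 20,
    "doc_b": [0] * 19 + [1],   # differs from doc_a in one bit
    "doc_c": [1] * 20,
}
index = build_index(codes)
near_matches = lookup(index, [0] * 20)
exact_matches = lookup(index, [0] * 20, radius_one=False)
```

Retrieval cost here depends on the code length rather than on the number of stored items, which is the appeal of turning similarity search into direct address lookups.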

Topics

unsupervised learning, deep learning, generative models, autoencoders, Boltzmann machines, variational autoencoders, GANs, representation learning