MIT 6.S094: Deep Reinforcement Learning

The guest

Lex Fridman — MIT researcher and lecturer teaching the 6.S094 Deep Learning for Self-Driving Cars course

The gist

This is a solo MIT lecture in which Lex Fridman explains deep reinforcement learning as the attempt to teach systems to perceive and act in the world end-to-end from raw sensory data. He walks through the full AI stack, the structure of reinforcement learning (states, actions, rewards, policies, value functions), Q-learning and the Bellman equation, and how neural networks scale these methods to huge state spaces via Deep Q-Networks. He details the tricks that made DQN work (experience replay, fixed target networks, reward clipping) and celebrates AlphaGo and AlphaGo Zero as landmark achievements. He then introduces the class's DeepTraffic competition, a browser-based multi-agent deep RL challenge, and closes by questioning whether RL is yet applicable to real-world robotics and driving.
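The Q-learning update built on the Bellman equation can be sketched in a few lines. This is a minimal tabular illustration, not code from the lecture: the function name `q_update`, the state labels `s0` and `s1`, the action names, and the values of `alpha` (learning rate) and `gamma` (discount factor) are all assumptions chosen for the example.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning step and return the new Q(state, action)."""
    # Bellman target: immediate reward plus the discounted value of the
    # best action available in the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    # Move the old estimate a fraction alpha of the way toward the target.
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


# Hypothetical two-state example: Q-values start at 0 unless set explicitly.
q = defaultdict(float)
actions = ["left", "right"]
q[("s1", "right")] = 1.0
new_value = q_update(q, "s0", "right", reward=0.0, next_state="s1", actions=actions)
```

A Deep Q-Network keeps the same target but replaces the lookup table `q` with a neural network, which is what lets the method cope with state spaces too large to enumerate.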

Big reveals

DeepTraffic is now a multi-agent deep RL problem where students can control up to ten cars in a network, trained in JavaScript in the browser.
00:10:31
DQN has exceeded human-level performance on many Atari games, but Lex stresses that these games are still trivial.
00:35:42
AlphaGo Zero learns with no human expert game data and beats the best in the world, which Lex calls the AI accomplishment of the decade.
00:37:52
AlphaGo Zero achieved a rating better than AlphaGo and the best human players in just 21 days of self-play.
00:38:24
DeepTraffic evaluation uses the median speed across 500 server-side runs to remove randomness and make cheating extremely difficult.
00:51:41
In the real world, the most successful robots (Boston Dynamics, Waymo) use mostly non-learning, optimization-based control, not deep reinforcement learning.
00:55:15
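The median-based scoring mentioned above for DeepTraffic can be illustrated with a short sketch. This is not the competition's server code: `simulate_run` is a hypothetical stand-in that returns a made-up speed for a given seed, and only the idea of taking the median over 500 runs comes from the lecture.

```python
import random
import statistics


def simulate_run(seed):
    """Hypothetical stand-in for one evaluation run; returns an average speed."""
    rng = random.Random(seed)
    return 60.0 + rng.uniform(-5.0, 5.0)


def median_score(num_runs=500):
    """Score a submission by the median speed over many independent runs."""
    speeds = [simulate_run(seed) for seed in range(num_runs)]
    return statistics.median(speeds)


score = median_score()

# The median barely moves when one run is a wild outlier, unlike the mean.
with_outlier = [60.0, 61.0, 62.0, 200.0]
robust = statistics.median(with_outlier)   # 61.5
fragile = statistics.mean(with_outlier)    # 95.75
```

Because a single lucky or manipulated run cannot shift the median much, the reported score reflects typical behavior rather than a best case.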

Things worth remembering

Reinforcement learning sits between supervised and unsupervised learning in terms of how much human-labeled input it requires.
00:08:23
The lecture frames RL with Pavlov's cats ringing a bell for food to illustrate learning from sparse reward signals.
00:09:28
A simple 3x4 grid world is used to show how changing the per-step reward dramatically changes the optimal policy.
00:16:24
Experience replay trains the network on randomly sampled past experiences so it doesn't overfit a single continuous evolution of the game.
00:29:22
Fixing the target network and only updating it every several thousand steps stabilizes the loss function during training.
00:30:27
Reward clipping normalizes positive points to +1 and negative points to -1 so DQN generalizes across different games.
00:32:01
A 19x19 Go board has roughly 2 × 10^170 legal game positions, far more than chess.
00:36:50
AlphaGo Zero used residual networks (ResNet, the ImageNet-winning architecture), which made a major difference.
00:41:37
Americans collectively spend about eight billion hours stuck in traffic each year, the motivation behind DeepTraffic.
00:42:08
In DeepTraffic the agents are not aware of each other and act greedily, yet are trained under a joint average-speed objective.
00:49:05
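The three DQN tricks listed above (experience replay, a fixed target network, and reward clipping) can be sketched without any deep learning library. This is a simplified illustration under stated assumptions, not the DeepMind or DeepTraffic implementation: the names `ReplayBuffer`, `clip_reward`, `td_target`, and `maybe_sync`, the buffer capacity, and the `sync_every` interval are all hypothetical, and network weights are represented as a plain dictionary.

```python
import random
from collections import deque


def clip_reward(reward):
    """Reward clipping: positive scores become +1, negative become -1, zero stays 0."""
    return (reward > 0) - (reward < 0)


class ReplayBuffer:
    """Experience replay: store past transitions and sample them at random."""

    def __init__(self, capacity):
        # A bounded deque silently drops the oldest transition when full.
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, clip_reward(reward), next_state, done))

    def sample(self, batch_size, rng=random):
        # Random sampling breaks the correlation between consecutive frames.
        return rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def td_target(reward, next_q_values, done, gamma=0.99):
    """Bellman target; next_q_values are assumed to come from the frozen target network."""
    return reward if done else reward + gamma * max(next_q_values)


def maybe_sync(step, online_weights, target_weights, sync_every=1000):
    """Fixed target network: copy the online weights only every sync_every steps."""
    if step % sync_every == 0:
        return dict(online_weights)
    return target_weights


# Hypothetical usage with integer states and raw game scores as rewards.
buffer = ReplayBuffer(capacity=100)
for t in range(10):
    buffer.push(state=t, action=0, reward=5.0 * (t - 4), next_state=t + 1, done=False)
batch = buffer.sample(4, rng=random.Random(0))
```

Keeping the target weights frozen between syncs means the value the network is regressing toward does not shift on every gradient step, which is the stabilizing effect the lecture describes.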

Topics

deep reinforcement learning, Q-learning, Deep Q-Networks, AlphaGo, self-driving cars, DeepTraffic, neural networks, MIT lecture