Lex Fridman · 2018-01-25 · 57m

MIT 6.S094: Deep Reinforcement Learning

Lex Fridman's MIT lecture on deep reinforcement learning, from Q-learning and DQN to AlphaGo Zero and the DeepTraffic competition.

The guest

Lex Fridman, MIT researcher and lecturer who teaches the 6.S094 Deep Learning for Self-Driving Cars course

The gist

This is a solo MIT lecture in which Lex Fridman explains deep reinforcement learning as the attempt to teach systems to perceive and act in the world end-to-end from raw sensory data. He walks through the full AI stack, the structure of reinforcement learning (states, actions, rewards, policies, value functions), Q-learning and the Bellman equation, and how neural networks scale these methods to huge state spaces via Deep Q-Networks. He details the tricks that made DQN work (experience replay, fixed target networks, reward clipping) and celebrates AlphaGo and AlphaGo Zero as landmark achievements. He then introduces the class's DeepTraffic competition, a browser-based multi-agent deep RL challenge, and closes by questioning whether RL is yet applicable to real-world robotics and driving.
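
As a rough illustration of the Q-learning update the lecture builds on, the sketch below applies one tabular Bellman-style step: it nudges Q(s, a) toward reward + gamma * max over a' of Q(s', a'). The function name q_update, the learning rate alpha, the discount gamma, and the toy states and actions are assumptions chosen for illustration, not values taken from the lecture.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning step and return the updated Q(state, action)."""
    # Best value the agent currently expects from the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    # Bellman target: immediate reward plus discounted future value.
    td_target = reward + gamma * best_next
    # Move the old estimate a fraction alpha of the way toward the target.
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Toy usage: all Q-values start at 0, and one rewarded transition is observed.
q = defaultdict(float)
actions = ["left", "right"]
new_value = q_update(q, state=0, action="right", reward=1.0, next_state=1, actions=actions)
```

A Deep Q-Network keeps the same target but replaces the lookup table q with a neural network, which is what lets the method scale to state spaces too large to enumerate.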

Big reveals

  • DeepTraffic is now a multi-agent deep RL problem in which students can control up to ten cars with a neural network trained in JavaScript in the browser.
  • DQN has exceeded human-level performance on many Atari games, but Lex stresses that these games are still trivial.
  • AlphaGo Zero learns with no human expert game data and beats the best in the world, which Lex calls the AI accomplishment of the decade.
  • AlphaGo Zero achieved a rating higher than those of AlphaGo and the best human players after just 21 days of self-play.
  • DeepTraffic evaluation uses the median speed across 500 server-side runs to remove randomness and make cheating extremely difficult.
  • In the real world, the most successful robots (Boston Dynamics, Waymo) use mostly non-learning, optimization-based control, not deep reinforcement learning.
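
The median-based scoring mentioned in the list above can be sketched as follows. This is a minimal illustration, not the competition's actual server code: simulate_run is a hypothetical stand-in that returns a noisy average speed, and the baseline of 70 with a spread of 2 is invented for the example. Only the idea of taking the median over 500 runs comes from the lecture.

```python
import random
import statistics


def simulate_run(seed):
    """Hypothetical stand-in for one simulation run; returns a noisy average speed."""
    rng = random.Random(seed)
    return 70.0 + rng.gauss(0.0, 2.0)


def evaluate(num_runs=500):
    """Score a submission as the median speed over num_runs independent runs."""
    speeds = [simulate_run(seed) for seed in range(num_runs)]
    return statistics.median(speeds)


score = evaluate()

# The median ignores a single freak run, which the mean would not.
robust = statistics.median([60, 70, 71, 72, 200])  # 71
```

Using the median rather than the mean means a handful of lucky or unlucky runs barely moves the final score.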

Things worth remembering

  • Reinforcement learning sits between supervised and unsupervised learning in terms of how much human-labeled input it requires.
  • The lecture frames RL with Pavlov's cats ringing a bell for food to illustrate learning from sparse reward signals.
  • A simple 3x4 grid world is used to show how changing the per-step reward dramatically changes the optimal policy.
  • Experience replay trains the network on randomly sampled past experiences so it doesn't overfit a single continuous evolution of the game.
  • Fixing the target network and only updating it every several thousand steps stabilizes the loss function during training.
  • Reward clipping normalizes positive points to +1 and negative points to -1 so DQN generalizes across different games.
  • A 19x19 Go board has roughly 2 × 10^170 legal positions, far more than chess.
  • AlphaGo Zero used residual networks (ResNets, the ImageNet-winning architecture), which made a major difference.
  • Americans collectively spend about eight billion hours a year stuck in traffic, which is the motivation behind DeepTraffic.
  • In DeepTraffic the agents are not aware of each other and act greedily, yet are trained under a joint average-speed objective.
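
The three DQN tricks listed above (experience replay, a periodically synced target network, and reward clipping) can be sketched without any deep learning library. Everything named here is an assumption made for illustration: ReplayBuffer, clip_reward, maybe_sync_target, the buffer capacity, the batch size, and the sync interval of 1000 steps. Network weights are represented by a plain dict rather than a real model.

```python
import random
from collections import deque


def clip_reward(reward):
    """Map any positive reward to +1, any negative reward to -1, and zero to 0."""
    return (reward > 0) - (reward < 0)


class ReplayBuffer:
    """Fixed-size memory of past transitions, sampled uniformly at random."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest transition when full.
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, clip_reward(reward), next_state, done))

    def sample(self, batch_size):
        # Random sampling breaks the correlation between consecutive frames.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def maybe_sync_target(step, online_weights, target_weights, sync_every=1000):
    """Copy online weights into the target every sync_every steps; return True if copied."""
    if step % sync_every == 0:
        target_weights.clear()
        target_weights.update(online_weights)
        return True
    return False


# Toy usage: store 150 transitions with raw rewards of +10 or -10, then sample a batch.
buffer = ReplayBuffer(capacity=100)
for t in range(150):
    buffer.push(state=t, action=0, reward=10 * (-1) ** t, next_state=t + 1, done=False)
batch = buffer.sample(batch_size=32)

online = {"w": 1.0}
target = {"w": 0.0}
synced_early = maybe_sync_target(500, online, target)   # no copy at step 500
synced_late = maybe_sync_target(1000, online, target)   # copy at step 1000
```

In a full DQN the sampled batch would feed a gradient step on the online network, with the Bellman targets computed from the frozen target weights.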