Lex Fridman · 2016-09-27 · 1h 27m

Deep Reinforcement Learning (John Schulman, OpenAI)

John Schulman gives a foundational lecture on deep reinforcement learning, covering policy gradients, Q-learning, and when to use each.

The guest

John Schulman — Research scientist at OpenAI, known for core deep reinforcement learning methods including Trust Region Policy Optimization (TRPO).

The gist

This is a technical lecture by John Schulman of OpenAI introducing the core methods of deep reinforcement learning. He explains how RL differs from supervised learning, frames problems as Markov decision processes, and walks through the two main algorithm families: policy gradient methods and Q-function methods like Q-learning and SARSA. He covers the score function gradient estimator, variance reduction techniques (temporal structure, baselines, discounts), Bellman equations and backups, and practical issues like step sizes that motivated his TRPO algorithm. He concludes by comparing the tradeoffs between policy gradient and Q-function approaches and shows video demos of simulated robots learning locomotion.
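The score function gradient estimator mentioned above can be illustrated on a toy problem. The following is a minimal sketch, not code from the lecture: it assumes a one-dimensional Gaussian with unit variance and a hypothetical function name `score_function_gradient`, and it estimates the gradient of E[f(x)] with respect to the mean by averaging f(x) times the gradient of the log-density. Subtracting a constant baseline leaves the expected estimate unchanged and can lower its variance.

```python
import numpy as np

def score_function_gradient(mu, f, n_samples=100_000, baseline=0.0, seed=0):
    """Estimate d/dmu E[f(x)] for x ~ Normal(mu, 1) with the score function estimator.

    Returns (gradient_estimate, per_sample_variance).
    All names here are illustrative, not taken from the lecture.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=mu, scale=1.0, size=n_samples)
    # For a unit-variance Gaussian, d/dmu log p(x; mu) = x - mu.
    score = x - mu
    # Subtracting a constant baseline keeps the estimator unbiased.
    per_sample = (f(x) - baseline) * score
    return per_sample.mean(), per_sample.var()

if __name__ == "__main__":
    square = lambda x: x ** 2  # E[x^2] = mu^2 + 1, so the true gradient is 2 * mu
    print(score_function_gradient(1.5, square))
    print(score_function_gradient(1.5, square, baseline=1.5 ** 2 + 1))
```

Both calls should return a gradient estimate near 2 * mu = 3; the second call uses E[f(x)] as the baseline and should report a smaller per-sample variance.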

Big reveals

  • Schulman warns deep RL may be overkill for many problems and recommends trying derivative-free optimization or contextual bandit methods first.
  • All three variance reduction techniques (temporal structure, baselines, discounts) are essentially required to make anything beyond small-scale problems work.
  • In RL, taking too large a step can wreck your policy permanently because future data is collected by the broken policy, which motivated Trust Region Policy Optimization.
  • Q-function methods are more sample-efficient when they work, but policy gradient methods work more generally and are easier to debug.
  • The hardest part of RL optimization comes from the behavior space and local minima (robot standing still vs diving forward), not the neural network architecture.
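The three variance reduction techniques in the list above can be combined in a few lines. This is a sketch under assumptions, not the lecture's code: the function names and the per-step `baseline_values` input (for example, predictions from a learned value function) are hypothetical. Temporal structure means each action is credited only with rewards that come after it, the discount shrinks the weight of distant rewards, and the baseline is subtracted to form an advantage estimate.

```python
import numpy as np

def discounted_rewards_to_go(rewards, gamma):
    """Return, for each step t, the sum over k >= t of gamma**(k - t) * rewards[k]."""
    out = np.zeros(len(rewards))
    running = 0.0
    # Walk backward so each step reuses the discounted sum of the later steps.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

def advantages(rewards, baseline_values, gamma):
    """Discounted reward-to-go minus a per-step baseline (hypothetical input)."""
    return discounted_rewards_to_go(rewards, gamma) - np.asarray(baseline_values, dtype=float)

if __name__ == "__main__":
    print(discounted_rewards_to_go([1.0, 1.0, 1.0], gamma=0.5))
    print(advantages([1.0, 1.0, 1.0], [1.0, 1.0, 1.0], gamma=0.5))
```

In a policy gradient update, each advantage would multiply the gradient of the log-probability of the action taken at that step.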

Things worth remembering

  • DeepMind's deep Q-learning agent learned to play many different Atari games from raw screen images using a single algorithm.
  • Beating a champion Go player combined supervised learning, policy gradients, Monte Carlo tree search, and value functions.
  • A discount factor of gamma=0.99 means a reward is discounted to about 1/e of its original value after about 100 time steps.
  • The locomotion policies were trained only with raw robot state (joint angles, velocities, link positions) and no clever feature engineering.
  • None of these methods are guaranteed to work well when more than 1/(1-gamma) time steps separate an action from its reward.
  • Stochastic training noise makes the learned policies surprisingly robust to changes in dynamics parameters.
  • The locomotion results required about two weeks of real time, which Schulman compares favorably to how long toddlers take to learn to walk.
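The discount arithmetic in the list above is easy to check numerically. The sketch below uses hypothetical helper names and only standard Python: `effective_horizon` computes 1/(1-gamma), and `discount_weight` computes the weight gamma**steps applied to a reward that arrives `steps` time steps later.

```python
import math

def effective_horizon(gamma):
    """Rough number of time steps over which rewards still matter: 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)

def discount_weight(gamma, steps):
    """Weight applied to a reward received `steps` time steps in the future."""
    return gamma ** steps

if __name__ == "__main__":
    gamma = 0.99
    print(effective_horizon(gamma))     # close to 100
    print(discount_weight(gamma, 100))  # close to 1/e
    print(1.0 / math.e)
```

With gamma=0.99 the horizon comes out at about 100 steps, and the weight after 100 steps is within about 0.002 of 1/e.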