John Schulman gives a foundational lecture on deep reinforcement learning, covering policy gradients, Q-learning, and when to use them.

John Schulman — Research scientist at OpenAI, known for core deep reinforcement learning methods including Trust Region Policy Optimization (TRPO).
This is a technical lecture by John Schulman of OpenAI introducing the core methods of deep reinforcement learning. He explains how RL differs from supervised learning, frames problems as Markov decision processes, and walks through the two main algorithm families: policy gradient methods and Q-function methods like Q-learning and SARSA. He covers the score function gradient estimator, variance reduction techniques (temporal structure, baselines, discounts), Bellman equations and backups, and practical issues like step sizes that motivated his TRPO algorithm. He concludes by comparing the tradeoffs between policy gradient and Q-function approaches and shows video demos of simulated robots learning locomotion.
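The Bellman backups behind the Q-function methods mentioned above can be illustrated with a minimal tabular sketch. This is not code from the lecture, which concerns deep (function-approximation) versions of these methods. The function names, the step size alpha, and the discount gamma are illustrative assumptions. The sketch contrasts Q-learning, which bootstraps from the best action at the next state (off-policy), with SARSA, which bootstraps from the action the policy actually took (on-policy):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """Sampled Bellman optimality backup: bootstrap from max_a' Q[s_next, a']."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])  # move estimate toward the target
    return Q

def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=0.99):
    """Sampled Bellman backup for the current policy: bootstrap from Q[s_next, a_next]."""
    target = r if done else r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

if __name__ == "__main__":
    # Hypothetical 2-state, 2-action table; state 1 already has values [1.0, 3.0].
    Q = np.zeros((2, 2))
    Q[1] = [1.0, 3.0]
    # Same transition (s=0, a=0, r=1, s_next=1); SARSA's next action is a_next=0.
    q_l = q_learning_update(Q.copy(), 0, 0, 1.0, 1, False, alpha=0.5, gamma=0.9)
    sarsa = sarsa_update(Q.copy(), 0, 0, 1.0, 1, 0, False, alpha=0.5, gamma=0.9)
    print(round(q_l[0, 0], 4))    # 1.85: target uses the greedy value 3.0
    print(round(sarsa[0, 0], 4))  # 0.95: target uses the taken action's value 1.0
```

The two updates differ only in the bootstrap term, which is why SARSA learns the value of the policy it follows while Q-learning estimates the value of the greedy policy.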