David Silver: AlphaGo, AlphaZero, and Deep Reinforcement Learning

The guest

David Silver: leader of DeepMind's reinforcement learning research group, lead researcher on AlphaGo and AlphaZero, and co-lead of AlphaStar and MuZero. He is one of the central figures behind modern deep reinforcement learning.

The gist

David Silver traces his path from programming a BBC Micro at age seven and building games to a PhD applying reinforcement learning to the game of Go. He explains the core of reinforcement learning, why Go was considered impossible for AI, and how deep learning plus Monte Carlo tree search produced AlphaGo's historic 2016 win over Lee Sedol. He details the leap to AlphaGo Zero and AlphaZero, which learned entirely from self-play with no human data, and MuZero, which learns even without being told the rules. The conversation closes on creativity, intrinsic reward, and a layered view of the meaning of life and intelligence.

Big reveals

A pure deep learning AlphaGo system with no search at all reached master-level Go, a definitive break from decades of search-dominated AI.
00:50:05
Silver predicted a 4-1 AlphaGo win over Lee Sedol, based on data showing that AlphaGo developed a 'delusion' in roughly 1 in 5 games.
01:00:00
He admits AlphaGo had inner 'holes' in its knowledge that persisted for tens of moves, and Lee Sedol exploited exactly that in the one game he won.
01:00:30
AlphaGo's 'move 37' broke every convention Go players are taught, proving machines could exhibit genuine creativity.
01:02:33
The full AlphaZero algorithm came to Silver while on his honeymoon, in his most relaxed state.
01:17:44
Silver offers a falsifiable prediction: if AlphaZero keeps being scaled up, each new version will beat its predecessor 100-0, and that pattern will hold for the rest of his lifetime.
01:24:06
MuZero learns a model of Atari, Go, chess and shogi without ever being told the rules, then plans with that model to reach superhuman play.
01:29:23
World champion Magnus Carlsen credits studying AlphaZero's games for a new peak in his rating.
01:10:56
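A back-of-the-envelope reading of the 4-1 prediction, sketched in Python under a simplifying assumption of ours (not a description of Silver's actual analysis): games are independent, a delusion occurs with probability 0.2 and always costs AlphaGo that game, and every other game is a win. Under that binomial model the expected result is 4-1, and 4-1 is also the single most likely scoreline.

```python
from math import comb


def series_distribution(n_games=5, p_loss=0.2):
    """Probability that AlphaGo loses exactly k of n_games, for k = 0..n_games.

    Assumption (ours): each game is lost independently with probability p_loss,
    i.e. a 'delusion' always loses the game and nothing else does.
    """
    return [comb(n_games, k) * p_loss**k * (1 - p_loss) ** (n_games - k)
            for k in range(n_games + 1)]


dist = series_distribution()
expected_losses = sum(k * p for k, p in enumerate(dist))

for k, p in enumerate(dist):
    print(f"AlphaGo {5 - k}-{k}: {p:.4f}")
print(f"expected losses: {expected_losses:.2f}")
```

Here 4-1 has probability 5 × 0.2 × 0.8^4 = 0.4096, ahead of a 5-0 sweep at 0.8^5 = 0.32768.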
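A toy sketch of planning with a learned model, loosely following MuZero's split into a representation function (h), a dynamics function (g), and a prediction function (f). Big caveats: the three functions below are hand-written stand-ins for what MuZero would learn with neural networks, the one-dimensional corridor "environment" and the names `represent`, `dynamics`, `predict_value`, `plan`, and `choose_action` are hypothetical, and a plain depth-limited lookahead replaces MuZero's Monte Carlo tree search. What it illustrates is that the planner only ever queries the model, never the real rules.

```python
from typing import Tuple

# Hypothetical stand-ins for MuZero's learned functions. In MuZero these would be
# neural networks trained from experience; here they are hand-written so the
# planning loop can run. They imitate a 1-D corridor with a goal at position 3.
GOAL = 3
ACTIONS = (-1, 1)  # step left or right


def represent(observation: int) -> int:
    """h: map a raw observation to a hidden state (identity in this toy)."""
    return observation


def dynamics(state: int, action: int) -> Tuple[float, int]:
    """g: predict (reward, next hidden state) for taking `action` in `state`."""
    next_state = state + action
    reward = 1.0 if next_state == GOAL else 0.0
    return reward, next_state


def predict_value(state: int) -> float:
    """f: value estimate for `state` (MuZero's f also outputs a policy, omitted here)."""
    return 0.0


def plan(state: int, depth: int, discount: float = 0.9) -> float:
    """Best discounted return reachable in `depth` steps, searching only inside the model."""
    if depth == 0:
        return predict_value(state)
    best = float("-inf")
    for a in ACTIONS:
        r, nxt = dynamics(state, a)
        best = max(best, r + discount * plan(nxt, depth - 1, discount))
    return best


def choose_action(observation: int, depth: int = 3, discount: float = 0.9) -> int:
    """Encode the observation once, then pick the action with the best planned return."""
    s = represent(observation)
    scores = {}
    for a in ACTIONS:
        r, nxt = dynamics(s, a)
        scores[a] = r + discount * plan(nxt, depth - 1, discount)
    return max(scores, key=scores.get)


print(choose_action(0), choose_action(4))
```

From position 0 the planner steps right toward the goal, and from position 4 it steps left, without ever consulting anything except the three model functions.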

Things worth remembering

In the 1990s, heuristic search programs beat human world champions at chess, checkers, backgammon and Othello, but Go resisted.
00:14:40
When a $1M Go prize expired in 2000, the strongest Go program lost to a nine-year-old child despite receiving a nine-stone handicap.
00:14:40
In Go every stone is identical and both players have essentially the same material, so judging a position rests almost entirely on intuition; chess, by contrast, offers a point system for its pieces.
00:24:31
Go has around 10^170 legal board positions, far more than the roughly 10^80 atoms in the observable universe.
00:25:33
Monte Carlo search evaluates a position by playing random games to the end and averaging who wins.
00:44:52
After the Lee Sedol match, the next version of AlphaGo beat other top human players 60 games to nil.
01:08:50
AlphaZero's algorithm was independently applied by other researchers, in Nature papers, to chemical synthesis and quantum computation, beating the state of the art.
01:36:37
Silver borrows from Max Tegmark a view that the universe's 'goal' may be to maximize entropy, framing evolution and intelligence as nested sub-goals.
01:42:20
AlphaGo Zero rediscovered human joseki opening patterns then invented new variations now studied in top human competitions.
01:33:30
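A quick sanity check on the size of Go's state space, as a Python sketch: treating each of the 361 points as empty, black, or white gives a simple upper bound of 3^361 board configurations. Only about 1 percent of those are legal positions (every group keeps at least one liberty), which is where the roughly 10^170 figure comes from. The variable names are our own.

```python
import math

BOARD_POINTS = 19 * 19   # 361 intersections on a standard board
STATES_PER_POINT = 3     # each point is empty, black, or white

# Exact integer upper bound; it still counts illegal boards (groups with no liberties).
upper_bound = STATES_PER_POINT ** BOARD_POINTS
upper_exp = math.log10(upper_bound)   # order of magnitude of the bound
ATOMS_EXP = 80                        # the rough 10^80 atom estimate quoted above

print(f"3^361 is about 10^{upper_exp:.2f}")
print(f"that bound exceeds the atom count by about 10^{upper_exp - ATOMS_EXP:.0f}")
```

Python's integers are arbitrary precision, so `upper_bound` is the exact 173-digit number rather than a floating-point approximation.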
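The Monte Carlo idea above, evaluating a position by playing random games to the end and averaging who wins, can be sketched in a few lines of Python. Tic-tac-toe stands in for Go purely to keep the code short, and the function names and the convention of scoring a draw as 0.5 are our own assumptions; this is not AlphaGo's code, which guided its search with neural networks rather than relying on uniformly random playouts alone.

```python
import random

# All eight winning lines on a 3x3 board, squares indexed 0..8 row by row.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return "X" or "O" if that player has completed a line, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None


def random_playout(board, to_move, rng):
    """Play uniformly random legal moves to the end; return the winner or None for a draw."""
    cells = list(board)
    while True:
        w = winner(cells)
        if w is not None:
            return w
        empty = [i for i, s in enumerate(cells) if s == "."]
        if not empty:
            return None
        cells[rng.choice(empty)] = to_move
        to_move = "O" if to_move == "X" else "X"


def monte_carlo_value(board, to_move, player, n_playouts=2000, seed=0):
    """Average result for `player` over random playouts: win 1, draw 0.5, loss 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_playouts):
        result = random_playout(board, to_move, rng)
        if result == player:
            total += 1.0
        elif result is None:
            total += 0.5
    return total / n_playouts


def best_move(board, to_move, n_playouts=500):
    """Pick the empty square whose resulting position scores highest for `to_move`."""
    opponent = "O" if to_move == "X" else "X"
    scores = {}
    for i, s in enumerate(board):
        if s == ".":
            child = board[:i] + to_move + board[i + 1:]
            scores[i] = monte_carlo_value(child, opponent, to_move, n_playouts)
    return max(scores, key=scores.get)


print(best_move("XX.OO....", "X"))
```

On the board "XX.OO...." with X to move, square 2 completes X's top row, so every playout from that child is a win, its value is exactly 1.0, and `best_move` returns 2.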

Recommended in this episode

Books, products and media the guest or host genuinely endorsed here — with the buy link.

Affiliate link — we may earn a commission at no extra cost to you.

Recommended book

The Ascent of Money: A Financial History of the World

Niall Ferguson (inferred)

“let me mention that cryptocurrency in the context of the history of money it's fascinating I recommend Ascent of Money as a great book on this history” (Lex Fridman, 00:01:33)

Find it on Amazon

Topics

reinforcement learning, AlphaGo, AlphaZero, self-play, deep learning, game of Go, AI creativity, meaning of life