Lex Fridman · 2020-01-19 · 1h 13m

Privacy Preserving AI (Andrew Trask) | MIT Deep Learning Series

Andrew Trask explains how privacy-preserving AI lets us answer questions using data we are never allowed to actually see.

The guest

Andrew Trask — Researcher, author of 'Grokking Deep Learning,' and leader of OpenMined, an open-source community building privacy-preserving machine learning tools. He works out of Oxford on tools like PySyft and PyGrid.

The gist

In this MIT Deep Learning Series talk, Andrew Trask walks through the toolkit of privacy-preserving AI from a data scientist's point of view: remote execution, private search, differential privacy, and secure multi-party computation. He shows how these techniques combine so researchers can train models on sensitive data (like medical records) without ever seeing it. In the second half he zooms out to the societal implications, sketching four big use-case categories: open data for science, single-use accountability systems, end-to-end encrypted services, and better recommendation systems. He argues the theory exists and what remains is adoption, engineering, and infrastructure.

Big reveals

  • Trask points out almost everyone trains classifiers on MNIST/CIFAR while virtually no one works on dementia, diabetes, or Alzheimer's because private data is so hard to access.
  • He warns that simple data anonymization 'by and large does not work' and is very dangerous to rely on.
  • The Netflix Prize anonymized dataset was de-anonymized by UT Austin researchers who scraped IMDB and matched users.
  • Trask says he knows of companies whose business model is buying anonymized datasets, de-anonymizing them, and selling intelligence to insurance companies.
  • He argues individuals, not hospitals or data scientists, should ultimately set their own personal epsilon (privacy) budgets.
  • Secure multi-party computation lets multiple parties share ownership of a number and compute on it while it stays encrypted.
  • He describes 'structured transparency': combining MPC for input privacy and differential privacy for output privacy to enable end-to-end encrypted services.
  • The 2020 US Census will protect its data using differential privacy, one of the leading real-world deployments of the technique.
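
The idea of several parties sharing ownership of a number and computing on it without anyone seeing it can be illustrated with additive secret sharing, one common building block of secure multi-party computation. The notes do not say which protocol the talk uses, so treat this as a toy sketch under that assumption. The modulus `Q` and the helper names `share`, `reconstruct`, and `add_shared` are hypothetical and are not PySyft's API.

```python
import secrets

# Assumed modulus for illustration: all arithmetic is done mod Q.
Q = 2**61 - 1


def share(secret, n_parties=2):
    """Split a non-negative integer (< Q) into n additive shares mod Q."""
    # All but one share are uniformly random, so each looks like noise.
    shares = [secrets.randbelow(Q) for _ in range(n_parties - 1)]
    # The last share is chosen so that all shares sum to the secret mod Q.
    last = (secret - sum(shares)) % Q
    return shares + [last]


def reconstruct(shares):
    """Recover the secret; this needs every party's share."""
    return sum(shares) % Q


def add_shared(a_shares, b_shares):
    """Each party adds its own two shares locally, never seeing a or b."""
    return [(a + b) % Q for a, b in zip(a_shares, b_shares)]


x_shares = share(5)   # e.g. one party's private input
y_shares = share(7)   # another party's private input
z_shares = add_shared(x_shares, y_shares)
total = reconstruct(z_shares)  # only revealed if all parties cooperate
```

A single share is a uniformly random value, so it carries no information about the underlying number on its own. Multiplication and the fixed-point encoding needed for deep learning require additional protocol steps that this sketch leaves out.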

Things worth remembering

  • Randomized response (a coin-flip survey technique) gives each respondent plausible deniability while still letting the analyst estimate the true population mean.
  • Adding noise for plausible deniability is the 'secret weapon' of differential privacy.
  • Local differential privacy adds noise before data is sent; global adds noise to query outputs but requires trusting the database owner.
  • In secure MPC, neither shareholder can tell the encrypted number from their own share alone; decryption requires all shareholders to agree.
  • State-of-the-art encrypted deep learning runs at about a 13x slowdown versus plaintext.
  • The next-word prediction on your phone keyboard is trained on-device using federated learning, pioneered by Google.
  • Federated learning alone is not secure; a model can memorize and later spit out training data unless combined with differential privacy.
  • A sniffing dog at the airport is a great privacy-preserving system: it reveals only one bit of information without searching the whole bag.
  • The PyTorch team sponsored $250,000 in open-source grants to fund work on the PySyft library.
  • Trask argues recommendation systems' biggest flaw is optimizing for engagement rather than holistic goals like good sleep or meaningful friendships.
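
The randomized response item above can be made concrete with a small simulation. This is a minimal sketch assuming the classic two-coin variant: a first coin decides whether the respondent answers honestly, and a second coin supplies the answer otherwise. The function names, the 30% true rate, and the population size are illustrative assumptions, not figures from the talk.

```python
import random


def randomized_response(truth, rng):
    """First coin heads: answer honestly. Tails: answer with a second coin."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(responses):
    """Invert the noise, using: observed = 0.5 * true + 0.25."""
    observed = sum(responses) / len(responses)
    return 2 * observed - 0.5


rng = random.Random(0)  # fixed seed so the sketch is repeatable
true_rate = 0.3         # assumed fraction of honest "yes" answers
population = [rng.random() < true_rate for _ in range(100_000)]
responses = [randomized_response(t, rng) for t in population]
estimate = estimate_true_rate(responses)
```

Any single "yes" could have come from the second coin, which is where the plausible deniability comes from. The estimate is only approximate: the added noise widens the error bars, so larger samples are needed for the same accuracy.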

Recommended in this episode

Books, products and media the guest or host genuinely endorsed here.


Guest’s own · Book

Grokking Deep Learning

Andrew Trask

“He is the author of Grokking Deep Learning, the book that I highly recommended in the lecture on Monday” — Lex Fridman 00:00:00
Guest’s own · Product

PySyft

OpenMined

“One of the tools they're working on, which we're talking about today, is called PySyft. PySyft extends the major deep learning frameworks” — guest 00:05:07
Guest’s own · Product

PyGrid

OpenMined

“Let's say we have what's called a grid. So PyGrid: if PySyft is a library, PyGrid is sort of the platform version” — guest 00:08:41