MIT 6.S094: Computer Vision

The guest

Lex Fridman — MIT researcher and lecturer teaching the 6.S094 deep learning course; presents computer vision concepts and the SegFuse dynamic scene segmentation competition.

The gist

This is a solo MIT lecture on computer vision as it stands today, framed almost entirely around deep learning. Fridman builds intuition from a trivial pixel-difference classifier and K-nearest neighbors up through convolutional neural networks, walking through the convolution operation, filters, stride, padding, and pooling. He surveys the lineage of ImageNet-winning architectures (AlexNet, VGGNet, GoogLeNet/Inception, ResNet, and SENet), highlighting the key insight each introduced. The second half moves to fully convolutional networks for pixel-level semantic scene segmentation, encoder-decoder frameworks, dilated convolutions, and conditional random fields, then to optical flow via FlowNet. It culminates in the SegFuse competition, which challenges students to use temporal information and optical flow to bring frame-by-frame driving-scene segmentation closer to the ground truth.
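To make the convolution, stride, padding, and pooling vocabulary concrete, here is a minimal pure-Python sketch. It is not code from the lecture: the function names (`conv_output_size`, `conv2d`, `max_pool2d`) are hypothetical, it handles a single channel only, and it assumes zero padding and non-overlapping pooling windows.

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Output length along one axis for input size n and filter size k."""
    return (n + 2 * padding - k) // stride + 1


def conv2d(image, kernel, stride=1, padding=0):
    """Single-channel 2D cross-correlation (what deep learning calls convolution)."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    # Zero-pad the border so the filter can also cover edge pixels.
    padded = [[0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for r in range(h):
        for c in range(w):
            padded[r + padding][c + padding] = image[r][c]
    out_h = conv_output_size(h, kh, stride, padding)
    out_w = conv_output_size(w, kw, stride, padding)
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # The filter's top-left corner moves `stride` pixels per step.
            total = 0
            for a in range(kh):
                for b in range(kw):
                    total += padded[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(total)
        out.append(row)
    return out


def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling; partial windows at the edges are dropped."""
    return [
        [max(fmap[i + a][j + b] for a in range(size) for b in range(size))
         for j in range(0, len(fmap[0]) - size + 1, size)]
        for i in range(0, len(fmap) - size + 1, size)
    ]
```

For example, `conv_output_size(32, 5, stride=1, padding=2)` keeps a 32-pixel axis at 32, while a stride of 2 roughly halves it. Real frameworks add channels, batching, and learned filter weights on top of this same sliding-window idea.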

Big reveals

A trivial image-difference classifier hits ~35-38% on CIFAR-10, far above the 10% random-chance baseline.
00:10:32
On CIFAR-10, K-nearest neighbors reaches ~30%, human level is 95%, and CNNs get close to 100%.
00:11:38
ResNet surpassed the human-level ImageNet benchmark (5.1% top-5 error) in 2015, reaching roughly 4% top-5 error.
00:24:27
Squeeze-and-Excitation Networks (SENet) won the 2017 ImageNet challenge by adding a learnable per-channel weighting, cutting error by roughly 25% in relative terms (from about 4% to about 3%).
00:29:51
Capsule networks highlight that convolutional nets discard spatial-relationship and pose information, so a normal face and one with its features scrambled can look identical to a CNN.
00:33:00
Manually segmenting a single Cityscapes-style image (coloring every pixel) takes about 90 minutes, which is why no dense video-segmentation dataset exists yet.
00:46:34
FlowNet 2.0's accuracy depended heavily on the order in which the training datasets were presented.
00:49:46
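The pixel-difference and K-nearest-neighbors baselines mentioned above can be sketched in a few lines. This is an illustration under stated assumptions, not the lecture's code: images are flat lists of pixel intensities, the distance is the sum of absolute differences (L1, chosen here for simplicity; a squared-difference L2 distance is a common alternative), and `l1_distance` and `knn_predict` are hypothetical names.

```python
from collections import Counter


def l1_distance(a, b):
    """Sum of absolute per-pixel differences between two flattened images."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_images, train_labels, query, k=1):
    """Label `query` by majority vote among its k closest training images."""
    # Rank training indices from the most similar image to the least similar.
    ranked = sorted(
        range(len(train_images)),
        key=lambda i: l1_distance(train_images[i], query),
    )
    # Count labels of the k nearest; on a tie, the label seen first
    # (the one belonging to the closer neighbor) wins.
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]
```

With `k=1` this is the trivial image-difference classifier; raising `k` gives K-nearest neighbors. Nothing is learned here: every prediction compares the query against the whole training set, which is part of why these baselines sit so far below CNNs.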

Things worth remembering

In supervised vision, humans provide annotated ground-truth labels and the network learns to map raw sensory input to those labels, then generalize.
00:01:33
Deep neural networks for vision are inspired by the layered visual cortex, where higher-order representations form as information passes from the eyes deeper into the brain.
00:03:42
Illumination variability and occlusion are cited as two of the biggest challenges in driving perception with visible-light cameras.
00:04:43
CNNs exploit spatial invariance: a cat in the top-left corner is the same as a cat in the bottom-right, so features are shared across the image to save parameters.
00:13:48
ImageNet contains 14 million images across 21,000 categories, with considerable depth per category: for example, about 1,200 images of Granny Smith apples.
00:22:20
VGGNet (2014) used a uniform conv-pool architecture with 138 million parameters and few optimizations.
00:25:35
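GoogLeNet's Inception module runs 1x1, 3x3, and 5x5 convolutions in parallel and concatenates their outputs rather than committing to one filter size, using 1x1 convolutions to reduce channel depth and keep the parameter count low.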
00:26:07
Optical flow estimates the direction and magnitude of each pixel's motion between two consecutive frames, which at 30 fps are about 33.3 milliseconds apart.
00:45:32
The SegFuse dataset includes original 1080p HD and 8K 360-degree driving video shot around Cambridge, with frame-by-frame ground-truth segmentation annotated via Mechanical Turk.
00:50:16
SegFuse provides 10,000 images with Python starter code on GitHub for the competition.
00:52:26

Topics

computer vision, deep learning, convolutional neural networks, semantic segmentation, ImageNet architectures, optical flow, autonomous driving, MIT 6.S094