Lex Fridman's MIT lecture on how machines see: convolutional neural networks, semantic scene segmentation, and the SegFuse driving competition.

Lex Fridman — MIT researcher and lecturer teaching the 6.S094 deep learning course; presents computer vision concepts and the SegFuse dynamic scene segmentation competition.
This is a solo MIT lecture on the current state of computer vision, framed almost entirely around deep learning. Fridman builds intuition from a trivial pixel-difference classifier and K-nearest neighbors up to convolutional neural networks, walking through the convolution operation, filters, stride, padding, and pooling. He surveys the lineage of ImageNet-winning architectures (AlexNet, VGGNet, GoogLeNet/Inception, ResNet, and SENet), highlighting the key insight each introduced. The second half moves to fully convolutional networks for pixel-level semantic scene segmentation, covering encoder-decoder frameworks, dilated convolutions, and conditional random fields, and then to optical flow via FlowNet. It culminates in the SegFuse competition, which challenges students to use temporal information and optical flow to bring frame-by-frame driving-scene segmentation closer to ground truth.
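To make the convolution terms above concrete, here is a minimal pure-Python sketch of a single-channel convolution with stride and zero padding, followed by max pooling. It is an illustration under stated assumptions, not code from the lecture: the function names `conv2d` and `max_pool`, the square kernel, and the toy 4x4 image are all hypothetical choices made for this example.

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Naive single-channel 2D convolution (strictly, cross-correlation,
    which is what deep learning libraries compute under that name).
    Assumes a square kernel and a rectangular image given as nested lists."""
    k = len(kernel)
    h, w = len(image), len(image[0])

    # Zero padding: surround the image with a border of zeros.
    padded = [[0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for r in range(h):
        for c in range(w):
            padded[r + padding][c + padding] = image[r][c]

    # Output size along each axis: floor((n + 2p - k) / s) + 1.
    out_h = (h + 2 * padding - k) // stride + 1
    out_w = (w + 2 * padding - k) // stride + 1

    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Slide the filter: multiply element-wise and sum.
            total = 0
            for di in range(k):
                for dj in range(k):
                    total += padded[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(total)
        out.append(row)
    return out


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: keep the largest value in each size x size window."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            row.append(max(feature_map[i + di][j + dj]
                           for di in range(size) for dj in range(size)))
        pooled.append(row)
    return pooled


# Toy 4x4 "image" and a 2x2 filter that responds to diagonal differences.
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[1, 0],
          [0, -1]]

valid = conv2d(image, kernel)                          # 3x3 output, no padding
strided = conv2d(image, kernel, stride=2, padding=1)   # 3x3 output from a 6x6 padded input
pooled = max_pool(image)                               # 2x2 output
```

Stride controls how far the filter moves between positions and padding controls how much zero border is added, so together with the kernel size they determine the size of the output. Pooling then shrinks the feature map by keeping only the strongest response in each window.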