| Timestamp | Topic |
| --- | --- |
| 00:00:00 | Transformers from scratch |
| 00:01:05 | Subword tokenization |
| 00:04:27 | Subword tokenization with byte-pair encoding (BPE) |
| 00:06:53 | The shortcomings of recurrence-based attention |
| 00:07:55 | How self-attention works |
| 00:14:49 | How multi-head self-attention works |
| 00:17:52 | The advantages of multi-head self-attention |
| 00:18:20 | Adding positional information |
| 00:20:30 | Adding a non-linear layer |
| 00:22:02 | Stacking encoder blocks |
| 00:22:30 | Dealing with side effects using layer normalization and skip connections |
| 00:26:46 | Input to the decoder block |
| 00:27:11 | Masked multi-head self-attention |
| 00:29:38 | The rest of the decoder block |
| 00:30:39 | [DEMO] Coding a Transformer from scratch |
| 00:56:29 | Transformer drawbacks |
| 00:57:14 | Pre-training and transfer learning |
| 00:59:36 | The Transformer families |
| 01:01:05 | How BERT works |
| 01:09:38 | GPT: Language modelling at scale |
| 01:15:13 | [DEMO] Pre-training and transfer learning with Hugging Face and OpenAI |
| 01:51:48 | The Transformer is a "general-purpose differentiable computer" |