| Timestamp | Topic |
| --- | --- |
| 00:00:00 | Transformers from scratch |
| 00:01:05 | Subword tokenization |
| 00:04:27 | Subword tokenization with byte-pair encoding (BPE) |
| 00:06:53 | The shortcomings of recurrence-based attention |
| 00:07:55 | How self-attention works |
| 00:14:49 | How multi-head self-attention works |
| 00:17:52 | The advantages of multi-head self-attention |
| 00:18:20 | Adding positional information |
| 00:20:30 | Adding a non-linear layer |
| 00:22:02 | Stacking encoder blocks |
| 00:22:30 | Dealing with side effects using layer normalization and skip connections |
| 00:26:46 | Input to the decoder block |
| 00:27:11 | Masked multi-head self-attention |
| 00:29:38 | The rest of the decoder block |
| 00:30:39 | [DEMO] Coding a Transformer from scratch |
| 00:56:29 | Transformer drawbacks |
| 00:57:14 | Pre-training and transfer learning |
| 00:59:36 | The Transformer families |
| 01:01:05 | How BERT works |
| 01:09:38 | GPT: Language modelling at scale |
| 01:15:13 | [DEMO] Pre-training and transfer learning with Hugging Face and OpenAI |
| 01:51:48 | The Transformer is a "general-purpose differentiable computer" |