Readability and code quality improvements (#959)

* Consistent dataset naming

* consistent section headers
Sebastian Raschka
2026-02-17 19:44:56 -05:00
committed by GitHub
parent 7b1f740f74
commit be5e2a3331
48 changed files with 419 additions and 297 deletions


@@ -117,7 +117,7 @@
"<br>\n",
"&nbsp;\n",
"\n",
"## 1) CausalAttention MHA wrapper class from chapter 3"
"## 1. CausalAttention MHA wrapper class from chapter 3"
]
},
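For context, the section renamed above wraps the single-head `CausalAttention` class from chapter 3 in a module that runs several heads side by side and concatenates their outputs. A condensed sketch of that pattern (not the notebook's exact code; names follow the book's conventions):

```python
import torch
import torch.nn as nn

class CausalAttention(nn.Module):
    # Single-head causal self-attention (condensed sketch)
    def __init__(self, d_in, d_out, context_length, dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length), diagonal=1).bool(),
        )

    def forward(self, x):
        b, num_tokens, _ = x.shape
        q, k, v = self.W_query(x), self.W_key(x), self.W_value(x)
        scores = q @ k.transpose(1, 2) / self.d_out**0.5
        scores = scores.masked_fill(self.mask[:num_tokens, :num_tokens], float("-inf"))
        weights = self.dropout(torch.softmax(scores, dim=-1))
        return weights @ v

class MultiHeadAttentionWrapper(nn.Module):
    # Runs independent single-head modules and concatenates their outputs
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        self.heads = nn.ModuleList(
            CausalAttention(d_in, d_out, context_length, dropout, qkv_bias)
            for _ in range(num_heads)
        )

    def forward(self, x):
        return torch.cat([head(x) for head in self.heads], dim=-1)
```

Note that each head here does its own full projection, which is what the later, more efficient variants in this notebook avoid.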
{
@@ -208,7 +208,7 @@
"<br>\n",
"&nbsp;\n",
"\n",
"## 2) The multi-head attention class from chapter 3"
"## 2. The multi-head attention class from chapter 3"
]
},
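The chapter-3 class renamed above fuses the heads into single projection matrices per Q, K, and V and splits them via reshaping. A condensed sketch along those lines (not the notebook's exact code):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    # Condensed sketch of a chapter-3-style multi-head attention module
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length), diagonal=1).bool(),
        )

    def forward(self, x):
        b, num_tokens, _ = x.shape
        # Project, then split into heads: (b, num_heads, num_tokens, head_dim)
        q = self.W_query(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.W_key(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.W_value(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        scores = q @ k.transpose(2, 3) / self.head_dim**0.5
        scores = scores.masked_fill(self.mask[:num_tokens, :num_tokens], float("-inf"))
        weights = self.dropout(torch.softmax(scores, dim=-1))
        # Merge heads back: (b, num_tokens, d_out)
        context = (weights @ v).transpose(1, 2).reshape(b, num_tokens, -1)
        return self.out_proj(context)
```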
{
@@ -311,7 +311,7 @@
"<br>\n",
"&nbsp;\n",
"\n",
"## 3) An alternative multi-head attention with combined weights"
"## 3. An alternative multi-head attention with combined weights"
]
},
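The "combined weights" variant renamed above replaces the three separate Q/K/V projections with one fused linear layer. A minimal sketch of the idea (the class name `MHACombinedQKV` is illustrative, not necessarily the notebook's):

```python
import torch
import torch.nn as nn

class MHACombinedQKV(nn.Module):
    # One fused projection produces Q, K, and V in a single matmul
    def __init__(self, d_in, d_out, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads
        self.qkv = nn.Linear(d_in, 3 * d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)

    def forward(self, x):
        b, num_tokens, _ = x.shape
        # (b, t, 3*d_out) -> (b, t, 3, heads, head_dim) -> (3, b, heads, t, head_dim)
        qkv = self.qkv(x).view(b, num_tokens, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4).unbind(0)
        scores = q @ k.transpose(-2, -1) / self.head_dim**0.5
        causal = torch.triu(
            torch.ones(num_tokens, num_tokens, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        weights = torch.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)
        context = (weights @ v).transpose(1, 2).reshape(b, num_tokens, -1)
        return self.out_proj(context)
```

The fused projection trades one larger matmul for three smaller ones, which can be friendlier to the hardware.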
{
@@ -435,7 +435,7 @@
"<br>\n",
"&nbsp;\n",
"\n",
"## 4) Multi-head attention with Einsum\n",
"## 4. Multi-head attention with Einsum\n",
"\n",
"- Implementing multi-head attention using Einstein summation via [`torch.einsum`](https://pytorch.org/docs/stable/generated/torch.einsum.html)"
]
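The einsum formulation referenced above expresses the two batched matmuls of attention as Einstein summations. A self-contained sketch of the core computation (function name is illustrative):

```python
import torch

def einsum_causal_attention(q, k, v):
    """Scaled causal dot-product attention written with torch.einsum.

    q, k, v have shape (batch, num_heads, num_tokens, head_dim).
    Subscripts: b=batch, h=heads, q/k index tokens, d=head_dim.
    """
    head_dim, num_tokens = q.shape[-1], q.shape[2]
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) / head_dim**0.5
    causal = torch.triu(
        torch.ones(num_tokens, num_tokens, dtype=torch.bool, device=q.device),
        diagonal=1,
    )
    weights = torch.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)
    return torch.einsum("bhqk,bhkd->bhqd", weights, v)
```

The einsum strings make the contraction dimensions explicit, at the cost of relying on the backend to pick an efficient contraction order.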
@@ -567,7 +567,7 @@
"<br>\n",
"&nbsp;\n",
"\n",
"## 5) Multi-head attention with PyTorch's scaled dot product attention and FlashAttention"
"## 5. Multi-head attention with PyTorch's scaled dot product attention and FlashAttention"
]
},
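The core call behind the section renamed above is `torch.nn.functional.scaled_dot_product_attention`. A minimal usage sketch with toy tensors:

```python
import torch
import torch.nn.functional as F

# Toy tensors in (batch, num_heads, num_tokens, head_dim) layout
q, k, v = (torch.randn(2, 4, 8, 16) for _ in range(3))

# With is_causal=True and no explicit mask, PyTorch is free to dispatch to a
# fused backend such as FlashAttention on supported GPUs and dtypes
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

Which backend actually runs depends on the device, dtype, and tensor shapes; on CPU it falls back to the math implementation.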
{
@@ -676,7 +676,7 @@
"<br>\n",
"&nbsp;\n",
"\n",
"## 6) PyTorch's scaled dot product attention without FlashAttention\n",
"## 6. PyTorch's scaled dot product attention without FlashAttention\n",
"\n",
"- This is similar to above, except that we disable FlashAttention by passing an explicit causal mask"
]
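As the bullet above says, passing an explicit causal mask instead of `is_causal=True` steers PyTorch away from the FlashAttention kernel. A minimal sketch of that call:

```python
import torch
import torch.nn.functional as F

b, h, t, d = 2, 4, 8, 16
q, k, v = (torch.randn(b, h, t, d) for _ in range(3))

# For F.scaled_dot_product_attention, a boolean attn_mask uses True to mean
# "this position may be attended to"; supplying it (rather than is_causal=True)
# generally rules out the FlashAttention backend
causal_mask = torch.tril(torch.ones(t, t, dtype=torch.bool))
out = F.scaled_dot_product_attention(q, k, v, attn_mask=causal_mask)
```

Numerically, the result matches the `is_causal=True` variant; only the backend selection differs.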
@@ -785,7 +785,7 @@
"<br>\n",
"&nbsp;\n",
"\n",
"## 7) Using PyTorch's torch.nn.MultiheadAttention"
"## 7. Using PyTorch's torch.nn.MultiheadAttention"
]
},
{
@@ -883,7 +883,7 @@
"<br>\n",
"&nbsp;\n",
"\n",
"## 8) Using PyTorch's torch.nn.MultiheadAttention with `scaled_dot_product_attention`"
"## 8. Using PyTorch's torch.nn.MultiheadAttention with `scaled_dot_product_attention`"
]
},
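For the section renamed above, the relevant detail is that `torch.nn.MultiheadAttention` can internally dispatch to `scaled_dot_product_attention` when its fast-path conditions are met. A hedged sketch of one such configuration:

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
mha.eval()  # the fused fast path is only considered in inference mode
x = torch.randn(2, 8, 64)

# Note the flipped mask semantics: for nn.MultiheadAttention, True in a
# boolean attn_mask means "NOT allowed to attend" (the opposite of the
# functional scaled_dot_product_attention convention)
causal_mask = torch.triu(torch.ones(8, 8, dtype=torch.bool), diagonal=1)

with torch.no_grad():
    # need_weights=False skips materializing the attention-weight matrix,
    # one of the conditions for dispatching to scaled_dot_product_attention
    out, attn_weights = mha(x, x, x, attn_mask=causal_mask, need_weights=False)
```

With `need_weights=False` the second return value is `None`; whether the fused path is actually taken also depends on dtype, device, and other conditions documented by PyTorch.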
{
@@ -948,7 +948,7 @@
"<br>\n",
"&nbsp;\n",
"\n",
"## 9) Using PyTorch's FlexAttention\n",
"## 9. Using PyTorch's FlexAttention\n",
"\n",
"- See [FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention](https://pytorch.org/blog/flexattention/) to learn more about FlexAttention\n",
"- FlexAttention caveat: It currently doesn't support dropout\n",
@@ -1108,7 +1108,18 @@
"<br>\n",
"&nbsp;\n",
"\n",
"## Quick speed comparison (M3 Macbook Air CPU)"
"## 10. Quick speed comparisons"
]
},
{
"cell_type": "markdown",
"id": "992e28f4-a6b9-4dd3-9705-30d0b9f4b5f0",
"metadata": {},
"source": [
"<br>\n",
"&nbsp;\n",
"\n",
"### 10.1 Speed comparisons on M3 MacBook Air CPU"
]
},
{
@@ -1361,7 +1372,7 @@
"<br>\n",
"&nbsp;\n",
"\n",
"## Quick speed comparison (Nvidia A100 GPU)"
"### 10.2 Speed comparison on Nvidia A100 GPU"
]
},
{
@@ -1643,7 +1654,18 @@
"&nbsp;\n",
"\n",
"\n",
"# Visualizations"
"## 11. Visualizations"
]
},
{
"cell_type": "markdown",
"id": "e6baf5ce-45ac-4e26-9523-5c32b82dc784",
"metadata": {},
"source": [
"<br>\n",
"&nbsp;\n",
"\n",
"### 11.1 Visualization utility functions"
]
},
{
@@ -1752,7 +1774,8 @@
"id": "4df834dc"
},
"source": [
"## Speed comparison (Nvidia A100 GPU) with warmup (forward pass only)"
"&nbsp;\n",
"### 11.2 Speed comparison (Nvidia A100 GPU) with warmup (forward pass only)"
]
},
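The warmup-based benchmark added above guards against two common GPU-timing pitfalls: cold-start overhead (kernel compilation, caches) and CUDA's asynchronous kernel launches. A hypothetical `time_forward` helper illustrating the pattern (not the notebook's exact benchmarking code):

```python
import time
import torch


def time_forward(model, x, num_warmup=5, num_runs=20):
    """Median forward-pass time in seconds, after warmup iterations.

    torch.cuda.synchronize() is required around each measurement because
    CUDA kernels launch asynchronously; without it we would only time the
    kernel *launch*, not its execution.
    """
    on_cuda = x.is_cuda
    for _ in range(num_warmup):  # warmup: let kernels and caches settle
        model(x)
    if on_cuda:
        torch.cuda.synchronize()
    times = []
    for _ in range(num_runs):
        start = time.perf_counter()
        model(x)
        if on_cuda:
            torch.cuda.synchronize()
        times.append(time.perf_counter() - start)
    return sorted(times)[len(times) // 2]  # median is robust to outliers
```

For forward-and-backward timings, the same structure applies with a `loss.backward()` call inside the timed region.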
{
@@ -1834,7 +1857,7 @@
"&nbsp;\n",
"\n",
"\n",
"## Speed comparison (Nvidia A100 GPU) with warmup (forward and backward pass)"
"### 11.3 Speed comparison (Nvidia A100 GPU) with warmup (forward and backward pass)"
]
},
{
@@ -1920,7 +1943,7 @@
"&nbsp;\n",
"\n",
"\n",
"## Speed comparison (Nvidia A100 GPU) with warmup and compilation (forward and backward pass)"
"### 11.4 Speed comparison (Nvidia A100 GPU) with warmup and compilation (forward and backward pass)"
]
},
{