Olmo 3 from scratch (#914)

* Olmo 3 from scratch

* update

* update

* update
Author: Sebastian Raschka
Date: 2025-11-22 22:42:18 -06:00 (committed by GitHub)
Parent: 398b079efa
Commit: bc6f335526
14 changed files with 3163 additions and 58 deletions


@@ -57,6 +57,8 @@ jobs:
         pytest ch05/11_qwen3/tests/test_qwen3_nb.py
         pytest ch05/12_gemma3/tests/test_gemma3_nb.py
         pytest ch05/12_gemma3/tests/test_gemma3_kv_nb.py
+        pytest ch05/13_olmo3/tests/test_olmo3_nb.py
+        pytest ch05/13_olmo3/tests/test_olmo3_kvcache_nb.py
         pytest ch06/01_main-chapter-code/tests.py
     - name: Validate Selected Jupyter Notebooks (uv)

.gitignore (vendored)

@@ -70,6 +70,16 @@ ch05/11_qwen3/Qwen3-8B
 ch05/11_qwen3/Qwen3-8B-Base
 ch05/11_qwen3/Qwen3-32B
 ch05/11_qwen3/Qwen3-32B-Base
+ch05/12_gemma3/gemma-3-270M-it
+ch05/12_gemma3/gemma-3-270M
+ch05/13_olmo3/Olmo-3-1025-7B
+ch05/13_olmo3/Olmo-3-1125-32B
+ch05/13_olmo3/Olmo-3-7B-Instruct
+ch05/13_olmo3/Olmo-3-32B-Instruct
+ch05/13_olmo3/Olmo-3-7B-Think
+ch05/13_olmo3/Olmo-3-32B-Think
+ch05/13_olmo3/Olmo-3-7B-RLZero-IF
+ch05/13_olmo3/Olmo-3-32B-RLZero-IF
 ch06/01_main-chapter-code/gpt2
 ch06/02_bonus_additional-experiments/gpt2


@@ -179,19 +179,19 @@ Several folders contain optional materials as a bonus for interested readers:
   - [Optimizing Hyperparameters for Pretraining](ch05/05_bonus_hparam_tuning)
   - [Building a User Interface to Interact With the Pretrained LLM](ch05/06_user_interface)
   - [Converting GPT to Llama](ch05/07_gpt_to_llama)
-  - [Llama 3.2 From Scratch](ch05/07_gpt_to_llama/standalone-llama32.ipynb)
-  - [Qwen3 Dense and Mixture-of-Experts (MoE) From Scratch](ch05/11_qwen3/)
-  - [Gemma 3 From Scratch](ch05/12_gemma3/)
-  - [Memory-Efficient Model Weight Loading](ch05/08_memory_efficient_weight_loading/memory-efficient-state-dict.ipynb)
-  - [Extending the Tiktoken BPE Tokenizer With New Tokens](ch05/09_extending-tokenizers/extend-tiktoken.ipynb)
+  - [Memory-efficient Model Weight Loading](ch05/08_memory_efficient_weight_loading/memory-efficient-state-dict.ipynb)
+  - [Extending the Tiktoken BPE Tokenizer with New Tokens](ch05/09_extending-tokenizers/extend-tiktoken.ipynb)
   - [PyTorch Performance Tips for Faster LLM Training](ch05/10_llm-training-speed)
+  - [LLM Architectures](ch05/#llm-architectures-from-scratch)
+    - [Llama 3.2 From Scratch](ch05/07_gpt_to_llama/standalone-llama32.ipynb)
+    - [Qwen3 Dense and Mixture-of-Experts (MoE) From Scratch](ch05/11_qwen3/)
+    - [Gemma 3 From Scratch](ch05/12_gemma3/)
+    - [Olmo 3 From Scratch](ch05/13_olmo3/)
-- **Chapter 6: Finetuning for Classification**
-  - [Additional Experiments Finetuning Different Layers and Using Larger Models](ch06/02_bonus_additional-experiments)
-  - [Finetuning Different Models on 50k IMDb Movie Review Dataset](ch06/03_bonus_imdb-classification)
-  - [Building a User Interface to Interact With the GPT-Based Spam Classifier](ch06/04_user_interface)
-- **Chapter 7: Finetuning to Follow Instructions**
+- **Chapter 6: Finetuning for classification**
+  - [Additional experiments finetuning different layers and using larger models](ch06/02_bonus_additional-experiments)
+  - [Finetuning different models on 50k IMDb movie review dataset](ch06/03_bonus_imdb-classification)
+  - [Building a User Interface to Interact With the GPT-based Spam Classifier](ch06/04_user_interface)
+- **Chapter 7: Finetuning to follow instructions**
   - [Dataset Utilities for Finding Near Duplicates and Creating Passive Voice Entries](ch07/02_dataset-utilities)
   - [Evaluating Instruction Responses Using the OpenAI API and Ollama](ch07/03_model-evaluation)
   - [Generating a Dataset for Instruction Finetuning](ch07/05_dataset-generation/llama3-ollama.ipynb)


@@ -1223,7 +1223,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.13.5"
+   "version": "3.12.3"
   }
  },
  "nbformat": 4,


@@ -1253,7 +1253,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.13.5"
+   "version": "3.12.3"
   }
  },
  "nbformat": 4,


@@ -1179,7 +1179,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.13.5"
+   "version": "3.12.3"
   }
  },
  "nbformat": 4,


@@ -78,9 +78,9 @@
    "name": "stdout",
    "output_type": "stream",
    "text": [
-    "huggingface_hub version: 0.34.4\n",
-    "tokenizers version: 0.21.4\n",
-    "torch version: 2.8.0\n"
+    "huggingface_hub version: 0.35.0\n",
+    "tokenizers version: 0.22.1\n",
+    "torch version: 2.9.0+cu130\n"
    ]
   }
  ],
@@ -700,9 +700,9 @@
  {
   "data": {
    "text/plain": [
-    "tensor([[[ 0.7500,  0.1060,  0.4844,  ...,  0.9414,  0.3984, -0.2324],\n",
-    "         [-0.3438, -0.0549,  0.8984,  ..., -0.2402,  0.4570,  0.8242],\n",
-    "         [-0.2676, -0.3281,  0.4121,  ...,  0.8711, -0.9648,  0.9844]]],\n",
+    "tensor([[[ 0.7500,  0.1011,  0.4863,  ...,  0.9414,  0.3984, -0.2285],\n",
+    "         [-0.3398, -0.0564,  0.9023,  ..., -0.2480,  0.4551,  0.8203],\n",
+    "         [-0.2695, -0.3242,  0.4121,  ...,  0.8672, -0.9688,  0.9844]]],\n",
     "  dtype=torch.bfloat16, grad_fn=<UnsafeViewBackward0>)"
    ]
   },
@@ -806,7 +806,20 @@
   "metadata": {
    "id": "31f12baf-f79b-499f-85c0-51328a6a20f5"
   },
-  "outputs": [],
+  "outputs": [
+   {
+    "name": "stderr",
+    "output_type": "stream",
+    "text": [
+     "/home/rasbt/jupyterlab/reasoning/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:283: UserWarning: \n",
+     "Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.\n",
+     "Minimum and Maximum cuda capability supported by this version of PyTorch is\n",
+     "(8.0) - (12.0)\n",
+     "\n",
+     "  warnings.warn(\n"
+    ]
+   }
+  ],
   "source": [
    "if torch.cuda.is_available():\n",
    "    device = torch.device(\"cuda\")\n",
@@ -1038,6 +1051,20 @@
   "outputId": "55b2f28c-142f-4698-9d23-d27456d3ed6d"
  },
  "outputs": [
+  {
+   "data": {
+    "application/vnd.jupyter.widget-view+json": {
+     "model_id": "3396c08eab3f4cf980023483b969a337",
+     "version_major": 2,
+     "version_minor": 0
+    },
+    "text/plain": [
+     "model.safetensors: 0%| | 0.00/536M [00:00<?, ?B/s]"
+    ]
+   },
+   "metadata": {},
+   "output_type": "display_data"
+  },
   {
    "name": "stdout",
    "output_type": "stream",
@@ -1131,7 +1158,22 @@
   "execution_count": 22,
   "id": "7b6df8bc-7308-468e-93ce-2d5529ea7866",
   "metadata": {},
-  "outputs": [],
+  "outputs": [
+   {
+    "data": {
+     "application/vnd.jupyter.widget-view+json": {
+      "model_id": "39b7b77c5c3448cdbd48fcde4e1b1a57",
+      "version_major": 2,
+      "version_minor": 0
+     },
+     "text/plain": [
+      "tokenizer.json: 0%| | 0.00/33.4M [00:00<?, ?B/s]"
+     ]
+    },
+    "metadata": {},
+    "output_type": "display_data"
+   }
+  ],
   "source": [
    "tokenizer_file_path = os.path.join(local_dir, \"tokenizer.json\")\n",
    "if not os.path.exists(tokenizer_file_path):\n",
@@ -1195,34 +1237,40 @@
  },
  {
   "cell_type": "code",
-  "execution_count": 25,
+  "execution_count": 27,
   "id": "7b8401c6-e244-4cb7-9849-2ba71ce758d5",
   "metadata": {
    "id": "7b8401c6-e244-4cb7-9849-2ba71ce758d5"
   },
   "outputs": [],
   "source": [
-   "def generate_text_basic_stream(model, token_ids, max_new_tokens, \n",
-   "                               eos_token_id=None):\n",
-   "\n",
-   "    model.eval()\n",
-   "    with torch.no_grad():\n",
-   "        for _ in range(max_new_tokens):\n",
-   "            out = model(token_ids)[:, -1]\n",
-   "            next_token = torch.argmax(out, dim=-1, keepdim=True)\n",
-   "\n",
-   "            if (eos_token_id is not None\n",
-   "                and torch.all(next_token == eos_token_id)):\n",
-   "                break\n",
-   "\n",
-   "            yield next_token  # New: Yield each token as it's generated\n",
-   "\n",
-   "            token_ids = torch.cat([token_ids, next_token], dim=1)"
+   "def generate_text_basic_stream(model, token_ids, max_new_tokens, eos_token_id=None, context_size=None):\n",
+   "    model.eval()\n",
+   "\n",
+   "    with torch.no_grad():\n",
+   "        cache = KVCache(n_layers=model.cfg[\"n_layers\"])\n",
+   "        model.reset_kv_cache()\n",
+   "\n",
+   "        # Prime the cache with the initial context\n",
+   "        logits = model(token_ids, cache=cache)\n",
+   "\n",
+   "        for _ in range(max_new_tokens):\n",
+   "            next_token = torch.argmax(logits[:, -1], dim=-1, keepdim=True)\n",
+   "\n",
+   "            if eos_token_id is not None and torch.all(next_token == eos_token_id):\n",
+   "                break\n",
+   "\n",
+   "            yield next_token\n",
+   "\n",
+   "            token_ids = torch.cat([token_ids, next_token], dim=1)\n",
+   "\n",
+   "            # Feed only the new token to the model; cache handles history\n",
+   "            logits = model(next_token, cache=cache)"
   ]
  },
  {
   "cell_type": "code",
-  "execution_count": 26,
+  "execution_count": 28,
   "id": "56c9d0cf-25e9-4375-8d5c-368fa6911fdf",
   "metadata": {},
   "outputs": [
@@ -1230,17 +1278,25 @@
    "name": "stdout",
    "output_type": "stream",
    "text": [
-    "Large language models (LLMs) are sophisticated artificial intelligence systems that can understand, generate, and manipulate human language. They are trained on massive amounts of text data to learn patterns and relationships within language, enabling them to perform a wide range of tasks, from writing articles and answering questions to translating languages and summarizing information.\n"
+    "Large language models (LLMs) are sophisticated artificial intelligence systems that can understand, generate, and manipulate human language. They are trained on massive amounts of text data to learn patterns and relationships within that data, enabling them to perform a wide range of tasks, from writing articles and answering questions to translating languages and summarizing information.\n",
+    "\n",
+    "\n",
+    "GPU memory used: 0.96 GB\n"
    ]
   }
  ],
  "source": [
   "input_token_ids_tensor = torch.tensor(input_token_ids, device=device).unsqueeze(0)\n",
   "\n",
+  "\n",
+  "if torch.cuda.is_available():\n",
+  "    torch.cuda.reset_peak_memory_stats()\n",
+  "\n",
+  "\n",
   "for token in generate_text_basic_stream(\n",
   "    model=model,\n",
   "    token_ids=input_token_ids_tensor,\n",
-  "    max_new_tokens=150,\n",
+  "    max_new_tokens=500,\n",
   "    eos_token_id=tokenizer.encode(\"<end_of_turn>\")[-1]\n",
   "):\n",
   "    token_id = token.squeeze(0).tolist()\n",
@@ -1248,7 +1304,13 @@
   "    tokenizer.decode(token_id),\n",
   "    end=\"\",\n",
   "    flush=True\n",
-  "    )"
+  "    )\n",
+  "\n",
+  "if torch.cuda.is_available():\n",
+  "    def gpu_gb(x):\n",
+  "        return f\"{x / 1024 / 1024 / 1024:.2f} GB\"\n",
+  "\n",
+  "    print(f\"\\n\\nGPU memory used: {gpu_gb(torch.cuda.max_memory_allocated())}\")"
   ]
  },
{ {
@@ -1269,7 +1331,6 @@
   "id": "e6edaaae-2de1-406c-8ffa-897cdfa3808c"
  },
  "source": [
-  "- Check out the [README.md](./README.md), to use this model via the `llms_from_scratch` package\n",
   "- For those interested in a comprehensive guide on building a large language model from scratch and gaining a deeper understanding of its mechanics, you might like my [Build a Large Language Model (From Scratch)](http://mng.bz/orYv)\n",
   "\n",
   "<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>"
@@ -1297,7 +1358,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.16"
+   "version": "3.12.3"
   }
  },
  "nbformat": 4,
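The notebook diff above rewrites `generate_text_basic_stream` around a `KVCache`: the prompt primes the cache once, and every later step feeds only the newest token while the cache carries the history. A minimal sketch of such a per-layer cache (illustrative only; the notebook's actual `KVCache` may differ in detail):

```python
import torch

class KVCache:
    """Per-layer key/value cache; a hypothetical sketch of the container
    used by the KV-cache notebooks, not the exact implementation."""

    def __init__(self, n_layers):
        self.cache = [None] * n_layers

    def get(self, layer_idx):
        return self.cache[layer_idx]

    def update(self, layer_idx, keys, values):
        # Append along the sequence dimension (dim=2 for tensors shaped
        # [batch, n_kv_heads, seq_len, head_dim])
        if self.cache[layer_idx] is None:
            self.cache[layer_idx] = (keys, values)
        else:
            k_prev, v_prev = self.cache[layer_idx]
            self.cache[layer_idx] = (
                torch.cat([k_prev, keys], dim=2),
                torch.cat([v_prev, values], dim=2),
            )
        return self.cache[layer_idx]

# Prime with a 5-token prompt, then append one decoded token
cache = KVCache(n_layers=1)
cache.update(0, torch.zeros(1, 2, 5, 8), torch.zeros(1, 2, 5, 8))
k, v = cache.update(0, torch.zeros(1, 2, 1, 8), torch.zeros(1, 2, 1, 8))
print(k.shape)  # torch.Size([1, 2, 6, 8])
```

Priming appends the full prompt once; each decode step then appends a single position, which is why the loop can call `model(next_token, cache=cache)` instead of re-encoding the whole sequence.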


@@ -41,7 +41,6 @@
  "source": [
   "- This notebook is purposefully minimal and focuses on the code to re-implement Gemma 3 270M in pure PyTorch without relying on other external LLM libraries\n",
   "- For more information, see the official [Gemma 3 270M model card](https://huggingface.co/google/gemma-3-270m)\n",
-  "\n",
   "- Below is a side-by-side comparison with Qwen3 0.6B as a reference model; if you are interested in the Qwen3 0.6B standalone notebook, you can find it [here](../11_qwen3)\n",
   "<br>\n",
   "\n",
@@ -78,9 +77,9 @@
    "name": "stdout",
    "output_type": "stream",
    "text": [
-    "huggingface_hub version: 0.34.4\n",
-    "tokenizers version: 0.21.4\n",
-    "torch version: 2.8.0\n"
+    "huggingface_hub version: 0.35.0\n",
+    "tokenizers version: 0.22.1\n",
+    "torch version: 2.9.0+cu130\n"
    ]
   }
  ],
@@ -628,9 +627,9 @@
  {
   "data": {
    "text/plain": [
-    "tensor([[[ 0.7500,  0.1060,  0.4844,  ...,  0.9414,  0.3984, -0.2324],\n",
-    "         [-0.3438, -0.0549,  0.8984,  ..., -0.2402,  0.4570,  0.8242],\n",
-    "         [-0.2676, -0.3281,  0.4121,  ...,  0.8711, -0.9648,  0.9844]]],\n",
+    "tensor([[[ 0.7500,  0.1011,  0.4863,  ...,  0.9414,  0.3984, -0.2285],\n",
+    "         [-0.3398, -0.0564,  0.9023,  ..., -0.2480,  0.4551,  0.8203],\n",
+    "         [-0.2695, -0.3242,  0.4121,  ...,  0.8672, -0.9688,  0.9844]]],\n",
     "  dtype=torch.bfloat16, grad_fn=<UnsafeViewBackward0>)"
    ]
   },
@@ -731,7 +730,20 @@
   "metadata": {
    "id": "31f12baf-f79b-499f-85c0-51328a6a20f5"
   },
-  "outputs": [],
+  "outputs": [
+   {
+    "name": "stderr",
+    "output_type": "stream",
+    "text": [
+     "/home/rasbt/jupyterlab/reasoning/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:283: UserWarning: \n",
+     "Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.\n",
+     "Minimum and Maximum cuda capability supported by this version of PyTorch is\n",
+     "(8.0) - (12.0)\n",
+     "\n",
+     "  warnings.warn(\n"
+    ]
+   }
+  ],
   "source": [
    "if torch.cuda.is_available():\n",
    "    device = torch.device(\"cuda\")\n",
@@ -1095,7 +1107,7 @@
  },
  {
   "cell_type": "code",
-  "execution_count": 24,
+  "execution_count": 25,
   "id": "7b8401c6-e244-4cb7-9849-2ba71ce758d5",
   "metadata": {
    "id": "7b8401c6-e244-4cb7-9849-2ba71ce758d5"
@@ -1121,7 +1133,7 @@
  },
  {
   "cell_type": "code",
-  "execution_count": 25,
+  "execution_count": 28,
   "id": "1c7a04fa-6aac-416b-8f63-f1e19227633d",
   "metadata": {
    "id": "1c7a04fa-6aac-416b-8f63-f1e19227633d"
@@ -1131,7 +1143,10 @@
    "name": "stdout",
    "output_type": "stream",
    "text": [
-    "Large language models (LLMs) are sophisticated artificial intelligence systems that can understand, generate, and manipulate human language. They are trained on massive amounts of text data to learn patterns and relationships within language, enabling them to perform a wide range of tasks, from writing articles and answering questions to translating languages and summarizing information.\n"
+    "Large language models (LLMs) are sophisticated artificial intelligence systems that can understand, generate, and manipulate human language. They are trained on massive amounts of text data to learn patterns and relationships within that data, enabling them to perform a wide range of tasks, from writing articles and answering questions to translating languages and summarizing information.\n",
+    "\n",
+    "\n",
+    "GPU memory used: 1.04 GB\n"
    ]
   }
  ],
@@ -1139,6 +1154,10 @@
   "input_token_ids_tensor = torch.tensor(input_token_ids, device=device).unsqueeze(0)\n",
   "\n",
   "\n",
+  "if torch.cuda.is_available():\n",
+  "    torch.cuda.reset_peak_memory_stats()\n",
+  "\n",
+  "\n",
   "for token in generate_text_basic_stream(\n",
   "    model=model,\n",
   "    token_ids=input_token_ids_tensor,\n",
@@ -1150,7 +1169,13 @@
   "    tokenizer.decode(token_id),\n",
   "    end=\"\",\n",
   "    flush=True\n",
-  "    )"
+  "    )\n",
+  "\n",
+  "if torch.cuda.is_available():\n",
+  "    def gpu_gb(x):\n",
+  "        return f\"{x / 1024 / 1024 / 1024:.2f} GB\"\n",
+  "\n",
+  "    print(f\"\\n\\nGPU memory used: {gpu_gb(torch.cuda.max_memory_allocated())}\")"
   ]
}, },
{ {
@@ -1171,7 +1196,6 @@
   "id": "e6edaaae-2de1-406c-8ffa-897cdfa3808c"
  },
  "source": [
-  "- Check out the [README.md](./README.md), to use this model via the `llms_from_scratch` package\n",
   "- For those interested in a comprehensive guide on building a large language model from scratch and gaining a deeper understanding of its mechanics, you might like my [Build a Large Language Model (From Scratch)](http://mng.bz/orYv)\n",
   "\n",
   "<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>"
@@ -1199,7 +1223,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.16"
+   "version": "3.12.3"
   }
  },
  "nbformat": 4,
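Both notebook diffs add the same peak-memory report around generation: reset CUDA's peak counter before the loop, then read the high-water mark afterwards. A CPU-safe sketch of the pattern (the `gpu_gb` helper mirrors the one in the diff):

```python
import torch

def gpu_gb(num_bytes):
    # Format a byte count as gibibytes, like the notebook's helper
    return f"{num_bytes / 1024 / 1024 / 1024:.2f} GB"

if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()        # zero the high-water mark
    _ = torch.randn(1024, 1024, device="cuda")  # measured workload goes here
    peak = torch.cuda.max_memory_allocated()    # peak bytes since the reset
else:
    peak = 0  # no GPU available; nothing was allocated on-device

print(f"GPU memory used: {gpu_gb(peak)}")
```

Resetting matters: `max_memory_allocated()` otherwise reports the peak since process start (or the last reset), not the cost of the generation loop alone.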

File diff suppressed because it is too large

File diff suppressed because it is too large


@@ -0,0 +1,240 @@
# Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
# Source for "Build a Large Language Model From Scratch"
# - https://www.manning.com/books/build-a-large-language-model-from-scratch
# Code: https://github.com/rasbt/LLMs-from-scratch
import importlib.util
from pathlib import Path
import torch
from llms_from_scratch.utils import import_definitions_from_notebook
try:
from transformers import Olmo3Config, Olmo3ForCausalLM
except ImportError:
Olmo3Config = None
Olmo3ForCausalLM = None
def tiny_debug_config():
return {
"vocab_size": 257,
"context_length": 8,
"emb_dim": 32,
"n_heads": 4,
"n_layers": 2,
"hidden_dim": 64,
"head_dim": 8,
"qk_norm": True,
"n_kv_heads": 2,
"sliding_window": 4,
"layer_types": ["full_attention", "full_attention"],
"dtype": torch.float32,
"query_pre_attn_scalar": 256,
"attention_bias": False,
"rms_norm_eps": 1e-6,
"rope_base": 1_000_000.0,
"rope_attention_factor": 1.0,
"rope_type": "default",
"rope_factor": 1.0,
"rope_orig_max": 8,
"rope_local_base": 10_000.0,
}
def _hf_config_from_dict(cfg):
if Olmo3Config is None:
raise ImportError("transformers is required for the Olmo-3 debugger.")
return Olmo3Config(
vocab_size=cfg["vocab_size"],
max_position_embeddings=cfg["context_length"],
hidden_size=cfg["emb_dim"],
num_attention_heads=cfg["n_heads"],
num_hidden_layers=cfg["n_layers"],
intermediate_size=cfg["hidden_dim"],
head_dim=cfg["head_dim"],
num_key_value_heads=cfg["n_kv_heads"],
rope_theta=cfg["rope_base"],
rope_local_base_freq=cfg.get("rope_local_base", 10_000.0),
layer_types=cfg["layer_types"],
sliding_window=cfg["sliding_window"],
tie_word_embeddings=False,
attn_implementation="eager",
torch_dtype=cfg.get("dtype", torch.float32),
query_pre_attn_scalar=cfg.get("query_pre_attn_scalar", 256),
rope_scaling={"rope_type": cfg.get("rope_type", "default")},
qk_norm=cfg.get("qk_norm", False),
rms_norm_eps=cfg.get("rms_norm_eps", 1e-5),
)
def load_notebook_defs(nb_name="standalone-olmo3.ipynb"):
nb_dir = Path(__file__).resolve().parents[1]
return import_definitions_from_notebook(nb_dir, nb_name)
def build_olmo3_pair(nb_imports, cfg, hf_checkpoint=None):
if Olmo3ForCausalLM is None:
raise ImportError("transformers is required for the Olmo-3 debugger.")
ours = nb_imports.Olmo3Model(cfg)
hf_cfg = _hf_config_from_dict(cfg)
if hf_checkpoint:
hf_model = Olmo3ForCausalLM.from_pretrained(
hf_checkpoint,
torch_dtype=cfg.get("dtype", torch.float32),
attn_implementation="eager",
)
else:
hf_model = Olmo3ForCausalLM(hf_cfg)
param_config = {"n_layers": cfg["n_layers"], "hidden_dim": cfg["hidden_dim"]}
nb_imports.load_weights_into_olmo(ours, param_config, hf_model.state_dict())
ours.eval()
hf_model.eval()
return ours, hf_model
def _attach_debug_hooks(model, is_hf):
traces = {}
handles = []
def hook(name):
def _record(_, __, output):
traces[name] = output.detach().to(torch.float32).cpu()
return _record
if is_hf:
core = model.model
handles.append(core.embed_tokens.register_forward_hook(hook("embedding")))
for idx, layer in enumerate(core.layers):
handles.append(layer.register_forward_hook(hook(f"block_{idx}")))
handles.append(core.norm.register_forward_hook(hook("final_norm")))
handles.append(model.lm_head.register_forward_hook(hook("logits")))
else:
handles.append(model.tok_emb.register_forward_hook(hook("embedding")))
for idx, block in enumerate(model.blocks):
handles.append(block.register_forward_hook(hook(f"block_{idx}")))
handles.append(model.final_norm.register_forward_hook(hook("final_norm")))
handles.append(model.out_head.register_forward_hook(hook("logits")))
return traces, handles
def _layer_sort_key(name):
if name == "embedding":
return (0, 0)
if name.startswith("block_"):
idx = int(name.split("_")[1])
return (1, idx)
if name == "final_norm":
return (2, 0)
if name == "logits":
return (3, 0)
return (4, name)
def layerwise_differences(ours, hf_model, input_ids, rtol=1e-5, atol=1e-5):
ours_traces, ours_handles = _attach_debug_hooks(ours, is_hf=False)
hf_traces, hf_handles = _attach_debug_hooks(hf_model, is_hf=True)
try:
with torch.inference_mode():
ours(input_ids)
hf_model(input_ids)
finally:
for h in ours_handles + hf_handles:
h.remove()
layer_names = sorted(set(ours_traces) | set(hf_traces), key=_layer_sort_key)
results = []
for name in layer_names:
ours_tensor = ours_traces.get(name)
hf_tensor = hf_traces.get(name)
if ours_tensor is None or hf_tensor is None:
results.append(
{
"name": name,
"status": "missing",
"ours_shape": None if ours_tensor is None else tuple(ours_tensor.shape),
"hf_shape": None if hf_tensor is None else tuple(hf_tensor.shape),
"max_diff": None,
"mean_abs_diff": None,
}
)
continue
shapes_match = ours_tensor.shape == hf_tensor.shape
if not shapes_match:
results.append(
{
"name": name,
"status": "shape_mismatch",
"ours_shape": tuple(ours_tensor.shape),
"hf_shape": tuple(hf_tensor.shape),
"max_diff": None,
"mean_abs_diff": None,
}
)
continue
diff = (ours_tensor - hf_tensor).abs()
max_diff = float(diff.max().item())
mean_diff = float(diff.mean().item())
allclose = torch.allclose(ours_tensor, hf_tensor, rtol=rtol, atol=atol)
results.append(
{
"name": name,
"status": "ok" if allclose else "mismatch",
"ours_shape": tuple(ours_tensor.shape),
"hf_shape": tuple(hf_tensor.shape),
"max_diff": max_diff,
"mean_abs_diff": mean_diff,
}
)
return results
def first_mismatch(differences):
for diff in differences:
if diff["status"] != "ok":
return diff
return None
def format_report(differences):
lines = []
for diff in sorted(differences, key=lambda d: _layer_sort_key(d["name"])):
if diff["status"] == "ok":
lines.append(f"[OK] {diff['name']}: max={diff['max_diff']:.2e}, mean={diff['mean_abs_diff']:.2e}")
elif diff["status"] == "mismatch":
lines.append(
f"[DIFF] {diff['name']}: max={diff['max_diff']:.2e}, mean={diff['mean_abs_diff']:.2e}"
)
elif diff["status"] == "shape_mismatch":
lines.append(
f"[SHAPE] {diff['name']}: ours={diff['ours_shape']}, hf={diff['hf_shape']}"
)
else:
lines.append(f"[MISSING] {diff['name']}: ours={diff['ours_shape']}, hf={diff['hf_shape']}")
return "\n".join(lines)
if __name__ == "__main__":
transformers_available = importlib.util.find_spec("transformers") is not None
if not transformers_available:
raise SystemExit("transformers is not installed; install it to run the debugger.")
nb_imports = load_notebook_defs()
cfg = tiny_debug_config()
ours_model, hf_model = build_olmo3_pair(nb_imports, cfg)
torch.manual_seed(0)
input_ids = torch.randint(0, cfg["vocab_size"], (1, cfg["context_length"]), dtype=torch.long)
diffs = layerwise_differences(ours_model, hf_model, input_ids)
print(format_report(diffs))
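`_attach_debug_hooks` above is built on PyTorch forward hooks, which record each module's output during a single forward pass; the core pattern, stripped of the Olmo-specific wiring, looks like this:

```python
import torch
import torch.nn as nn

traces = {}

def make_hook(name):
    # A forward hook receives (module, inputs, output) after each forward()
    def _record(module, inputs, output):
        traces[name] = output.detach()
    return _record

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
handles = [layer.register_forward_hook(make_hook(f"layer_{i}"))
           for i, layer in enumerate(model)]
try:
    model(torch.randn(3, 4))
finally:
    for h in handles:
        h.remove()  # always detach hooks, as layerwise_differences does

print(sorted(traces))  # ['layer_0', 'layer_1', 'layer_2']
```

Running the same hooks on two models and diffing the recorded tensors layer by layer is exactly how `layerwise_differences` localizes the first point of divergence.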


@@ -0,0 +1,142 @@
# Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
# Source for "Build a Large Language Model From Scratch"
# - https://www.manning.com/books/build-a-large-language-model-from-scratch
# Code: https://github.com/rasbt/LLMs-from-scratch
import importlib.util
from pathlib import Path
import pytest
import torch
from llms_from_scratch.utils import import_definitions_from_notebook
transformers_installed = importlib.util.find_spec("transformers") is not None
@pytest.fixture
def nb_imports():
nb_dir = Path(__file__).resolve().parents[1]
mod = import_definitions_from_notebook(nb_dir, "standalone-olmo3-plus-kv-cache.ipynb")
return mod
@pytest.fixture
def dummy_input():
torch.manual_seed(123)
return torch.randint(0, 100, (1, 8)) # batch size 1, seq length 8
@pytest.fixture
def dummy_cfg_base():
return {
"vocab_size": 100,
"context_length": 64,
"emb_dim": 32,
"n_heads": 4,
"n_layers": 2,
"hidden_dim": 64,
"head_dim": 8,
"n_kv_heads": 1,  # 4 query heads, 1 KV group -> group_size = 4
"attention_bias": False,
"attention_dropout": 0.0,
"sliding_window": 4,
"layer_types": ["full_attention"] * 2,
# RoPE config
"rope_base": 10_000.0,
"rope_attention_factor": 1.0,
"rope_type": "default",
"rope_factor": 1.0,
"rope_orig_max": 64,
"rms_norm_eps": 1e-6,
"dtype": torch.float32,
}
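The `n_kv_heads` comment in the fixture describes grouped-query attention: with 4 query heads and 1 key/value head, all four query heads share the same K/V projections, typically by repeating K/V along the head dimension before the attention product. A small sketch under the test config's shapes (not the notebook's exact code):

```python
import torch

n_heads, n_kv_heads = 4, 1
group_size = n_heads // n_kv_heads  # each KV head serves 4 query heads

batch, seq_len, head_dim = 1, 8, 8
q = torch.randn(batch, n_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Expand K/V so every query head has a matching K/V head
k = k.repeat_interleave(group_size, dim=1)
v = v.repeat_interleave(group_size, dim=1)

attn_scores = q @ k.transpose(2, 3)  # shape (1, 4, 8, 8)
```

The memory savings come from storing and caching only `n_kv_heads` K/V heads; the expansion is a cheap view-level repeat at compute time.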
@torch.inference_mode()
def test_dummy_olmo3_forward(dummy_cfg_base, dummy_input, nb_imports):
torch.manual_seed(123)
model = nb_imports.Olmo3Model(dummy_cfg_base)
out = model(dummy_input)
assert out.shape == (1, dummy_input.size(1), dummy_cfg_base["vocab_size"]), \
f"Expected shape (1, seq_len, vocab_size), got {out.shape}"
@torch.inference_mode()
@pytest.mark.skipif(not transformers_installed, reason="transformers not installed")
def test_olmo3_base_equivalence_with_transformers(nb_imports):
from transformers import Olmo3Config, Olmo3ForCausalLM
# Tiny config so the test is fast
cfg = {
"vocab_size": 257,
"context_length": 8,
"emb_dim": 32,
"n_heads": 4,
"n_layers": 2,
"hidden_dim": 64,
"head_dim": 8,
"qk_norm": True,
"n_kv_heads": 2,
"sliding_window": 4,
"layer_types": ["full_attention", "full_attention"],
"dtype": torch.float32,
"query_pre_attn_scalar": 256,
# required by TransformerBlock
"attention_bias": False,
# required by RMSNorm and RoPE setup in Olmo3Model
"rms_norm_eps": 1e-6,
"rope_base": 1_000_000.0,
"rope_attention_factor": 1.0,
"rope_type": "default",
"rope_factor": 1.0,
"rope_orig_max": 8,
# extra HF-only stuff
"rope_local_base": 10_000.0,
}
model = nb_imports.Olmo3Model(cfg)
hf_cfg = Olmo3Config(
vocab_size=cfg["vocab_size"],
max_position_embeddings=cfg["context_length"],
hidden_size=cfg["emb_dim"],
num_attention_heads=cfg["n_heads"],
num_hidden_layers=cfg["n_layers"],
intermediate_size=cfg["hidden_dim"],
head_dim=cfg["head_dim"],
num_key_value_heads=cfg["n_kv_heads"],
rope_theta=cfg["rope_base"],
rope_local_base_freq=cfg["rope_local_base"],
layer_types=cfg["layer_types"],
sliding_window=cfg["sliding_window"],
tie_word_embeddings=False,
attn_implementation="eager",
torch_dtype=torch.float32,
query_pre_attn_scalar=cfg["query_pre_attn_scalar"],
rope_scaling={"rope_type": "default"},
qk_norm=cfg["qk_norm"],
rms_norm_eps=cfg["rms_norm_eps"],
)
hf_model = Olmo3ForCausalLM(hf_cfg)
hf_state = hf_model.state_dict()
param_config = {
"n_layers": cfg["n_layers"],
"hidden_dim": cfg["hidden_dim"],
}
nb_imports.load_weights_into_olmo(model, param_config, hf_state)
x = torch.randint(
0,
cfg["vocab_size"],
(2, cfg["context_length"]),
dtype=torch.long,
)
ours_logits = model(x)
theirs_logits = hf_model(x).logits
torch.testing.assert_close(ours_logits, theirs_logits, rtol=1e-5, atol=1e-5)
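The equivalence test's final check relies on `torch.testing.assert_close`, which passes elementwise when `|actual - expected| <= atol + rtol * |expected|`. With `rtol=atol=1e-5`, tiny float noise is tolerated while real mismatches fail:

```python
import torch

expected = torch.tensor([1.0, 2.0, 3.0])

# 5e-6 perturbation: within atol=1e-5, so the check passes silently
torch.testing.assert_close(expected + 5e-6, expected, rtol=1e-5, atol=1e-5)

# 1e-3 perturbation: exceeds atol + rtol * |expected|, so it raises
try:
    torch.testing.assert_close(expected + 1e-3, expected, rtol=1e-5, atol=1e-5)
    raised = False
except AssertionError:
    raised = True
print("large perturbation rejected:", raised)
```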


@@ -0,0 +1,142 @@
# Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
# Source for "Build a Large Language Model From Scratch"
# - https://www.manning.com/books/build-a-large-language-model-from-scratch
# Code: https://github.com/rasbt/LLMs-from-scratch
import importlib.util
from pathlib import Path
import pytest
import torch
from llms_from_scratch.utils import import_definitions_from_notebook
transformers_installed = importlib.util.find_spec("transformers") is not None
@pytest.fixture
def nb_imports():
nb_dir = Path(__file__).resolve().parents[1]
mod = import_definitions_from_notebook(nb_dir, "standalone-olmo3.ipynb")
return mod
@pytest.fixture
def dummy_input():
torch.manual_seed(123)
return torch.randint(0, 100, (1, 8)) # batch size 1, seq length 8
@pytest.fixture
def dummy_cfg_base():
return {
"vocab_size": 100,
"context_length": 64,
"emb_dim": 32,
"n_heads": 4,
"n_layers": 2,
"hidden_dim": 64,
"head_dim": 8,
"n_kv_heads": 1,  # 4 query heads, 1 KV group -> group_size = 4
"attention_bias": False,
"attention_dropout": 0.0,
"sliding_window": 4,
"layer_types": ["full_attention"] * 2,
# RoPE config
"rope_base": 10_000.0,
"rope_attention_factor": 1.0,
"rope_type": "default",
"rope_factor": 1.0,
"rope_orig_max": 64,
"rms_norm_eps": 1e-6,
"dtype": torch.float32,
}
@torch.inference_mode()
def test_dummy_olmo3_forward(dummy_cfg_base, dummy_input, nb_imports):
    torch.manual_seed(123)
    model = nb_imports.Olmo3Model(dummy_cfg_base)
    out = model(dummy_input)
    assert out.shape == (1, dummy_input.size(1), dummy_cfg_base["vocab_size"]), \
        f"Expected shape (1, seq_len, vocab_size), got {out.shape}"

@torch.inference_mode()
@pytest.mark.skipif(not transformers_installed, reason="transformers not installed")
def test_olmo3_base_equivalence_with_transformers(nb_imports):
    from transformers import Olmo3Config, Olmo3ForCausalLM

    # Tiny config so the test is fast
    cfg = {
        "vocab_size": 257,
        "context_length": 8,
        "emb_dim": 32,
        "n_heads": 4,
        "n_layers": 2,
        "hidden_dim": 64,
        "head_dim": 8,
        "qk_norm": True,
        "n_kv_heads": 2,
        "sliding_window": 4,
        "layer_types": ["full_attention", "full_attention"],
        "dtype": torch.float32,
        "query_pre_attn_scalar": 256,
        # required by TransformerBlock
        "attention_bias": False,
        # required by RMSNorm and RoPE setup in Olmo3Model
        "rms_norm_eps": 1e-6,
        "rope_base": 1_000_000.0,
        "rope_attention_factor": 1.0,
        "rope_type": "default",
        "rope_factor": 1.0,
        "rope_orig_max": 8,
        # extra HF-only stuff
        "rope_local_base": 10_000.0,
    }

    model = nb_imports.Olmo3Model(cfg)

    hf_cfg = Olmo3Config(
        vocab_size=cfg["vocab_size"],
        max_position_embeddings=cfg["context_length"],
        hidden_size=cfg["emb_dim"],
        num_attention_heads=cfg["n_heads"],
        num_hidden_layers=cfg["n_layers"],
        intermediate_size=cfg["hidden_dim"],
        head_dim=cfg["head_dim"],
        num_key_value_heads=cfg["n_kv_heads"],
        rope_theta=cfg["rope_base"],
        rope_local_base_freq=cfg["rope_local_base"],
        layer_types=cfg["layer_types"],
        sliding_window=cfg["sliding_window"],
        tie_word_embeddings=False,
        attn_implementation="eager",
        torch_dtype=torch.float32,
        query_pre_attn_scalar=cfg["query_pre_attn_scalar"],
        rope_scaling={"rope_type": "default"},
        qk_norm=cfg["qk_norm"],
        rms_norm_eps=cfg["rms_norm_eps"],
    )
    hf_model = Olmo3ForCausalLM(hf_cfg)
    hf_state = hf_model.state_dict()

    param_config = {
        "n_layers": cfg["n_layers"],
        "hidden_dim": cfg["hidden_dim"],
    }
    nb_imports.load_weights_into_olmo(model, param_config, hf_state)

    x = torch.randint(
        0,
        cfg["vocab_size"],
        (2, cfg["context_length"]),
        dtype=torch.long,
    )
    ours_logits = model(x)
    theirs_logits = hf_model(x).logits

    torch.testing.assert_close(ours_logits, theirs_logits, rtol=1e-5, atol=1e-5)
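The equivalence check above follows a common pattern: copy weights from one implementation into another with a different parameter layout, run the same input through both, and compare outputs with `torch.testing.assert_close`. A minimal standalone sketch of that pattern, using two hypothetical toy models rather than the Olmo3 classes:

```python
import torch
import torch.nn as nn


class ModelA(nn.Module):
    """Reference implementation: a plain linear projection."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 3, bias=False)

    def forward(self, x):
        return self.proj(x)


class ModelB(nn.Module):
    """Re-implementation with a differently named parameter slot."""
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(3, 4))

    def forward(self, x):
        return x @ self.weight.T


torch.manual_seed(0)
a, b = ModelA(), ModelB()

# "Weight loading": map the reference parameters onto the re-implementation.
with torch.no_grad():
    b.weight.copy_(a.proj.weight)

x = torch.randn(2, 4)
# Raises an informative AssertionError if the outputs diverge beyond tolerance.
torch.testing.assert_close(a(x), b(x), rtol=1e-5, atol=1e-5)
```

Scaling the same idea up to a full transformer is exactly what `load_weights_into_olmo` plus the final `assert_close` on logits does in the test above.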


@@ -13,14 +13,25 @@
- [04_learning_rate_schedulers](04_learning_rate_schedulers) contains code implementing a more sophisticated training function including learning rate schedulers and gradient clipping
- [05_bonus_hparam_tuning](05_bonus_hparam_tuning) contains an optional hyperparameter tuning script
- [06_user_interface](06_user_interface) implements an interactive user interface to interact with the pretrained LLM
- [07_gpt_to_llama](07_gpt_to_llama) contains a step-by-step guide for converting a GPT architecture implementation to Llama 3.2 and loads pretrained weights from Meta AI
- [08_memory_efficient_weight_loading](08_memory_efficient_weight_loading) contains a bonus notebook showing how to load model weights via PyTorch's `load_state_dict` method more efficiently
- [09_extending-tokenizers](09_extending-tokenizers) contains a from-scratch implementation of the GPT-2 BPE tokenizer
- [10_llm-training-speed](10_llm-training-speed) shows PyTorch performance tips to improve the LLM training speed
&nbsp;
## LLM Architectures From Scratch
<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/qwen/qwen-overview.webp">
&nbsp;
- [07_gpt_to_llama](07_gpt_to_llama) contains a step-by-step guide for converting a GPT architecture implementation to Llama 3.2 and loads pretrained weights from Meta AI
- [11_qwen3](11_qwen3) A from-scratch implementation of Qwen3 0.6B and Qwen3 30B-A3B (Mixture-of-Experts), including code to load the pretrained weights of the base, reasoning, and coding model variants
- [12_gemma3](12_gemma3) A from-scratch implementation of Gemma 3 270M and an alternative with KV cache, including code to load the pretrained weights
- [13_olmo3](13_olmo3) A from-scratch implementation of Olmo 3 7B and 32B (Base, Instruct, and Think variants) and an alternative with KV cache, including code to load the pretrained weights
&nbsp;
## Code-Along Video for This Chapter
<br>
<br>
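The "alternative with KV cache" variants listed above share one core idea: during autoregressive decoding, the keys and values of past tokens are cached so each new token attends over stored tensors instead of re-encoding the whole prefix. A minimal single-head sketch of that mechanism (illustrative only, not the bonus notebooks' implementation):

```python
import torch


def attention_step(q_new, k_new, v_new, cache):
    """One decoding step: append the new key/value to the cache, then
    attend over the full cached history.

    q_new, k_new, v_new have shape (batch, 1, head_dim); the cache dict
    accumulates tensors of shape (batch, t, head_dim).
    """
    if cache["k"] is None:
        cache["k"], cache["v"] = k_new, v_new
    else:
        cache["k"] = torch.cat([cache["k"], k_new], dim=1)
        cache["v"] = torch.cat([cache["v"], v_new], dim=1)
    d = q_new.shape[-1]
    scores = q_new @ cache["k"].transpose(1, 2) / d ** 0.5  # (batch, 1, t)
    return torch.softmax(scores, dim=-1) @ cache["v"]       # (batch, 1, head_dim)
```

Decoding token by token with this cache reproduces the rows of full causal attention over the same sequence while avoiding recomputing keys and values for the prefix at every step.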