Olmo 3 from scratch (#914)

* Olmo 3 from scratch

* update

* update

* update
Author: Sebastian Raschka
Date: 2025-11-22 22:42:18 -06:00 (committed by GitHub)
Parent: 398b079efa
Commit: bc6f335526
14 changed files with 3163 additions and 58 deletions


@@ -57,6 +57,8 @@ jobs:
         pytest ch05/11_qwen3/tests/test_qwen3_nb.py
         pytest ch05/12_gemma3/tests/test_gemma3_nb.py
         pytest ch05/12_gemma3/tests/test_gemma3_kv_nb.py
+        pytest ch05/13_olmo3/tests/test_olmo3_nb.py
+        pytest ch05/13_olmo3/tests/test_olmo3_kvcache_nb.py
         pytest ch06/01_main-chapter-code/tests.py
     - name: Validate Selected Jupyter Notebooks (uv)

.gitignore (vendored)

@@ -70,6 +70,16 @@ ch05/11_qwen3/Qwen3-8B
 ch05/11_qwen3/Qwen3-8B-Base
 ch05/11_qwen3/Qwen3-32B
 ch05/11_qwen3/Qwen3-32B-Base
+ch05/12_gemma3/gemma-3-270M-it
+ch05/12_gemma3/gemma-3-270M
+ch05/13_olmo3/Olmo-3-1025-7B
+ch05/13_olmo3/Olmo-3-1125-32B
+ch05/13_olmo3/Olmo-3-7B-Instruct
+ch05/13_olmo3/Olmo-3-32B-Instruct
+ch05/13_olmo3/Olmo-3-7B-Think
+ch05/13_olmo3/Olmo-3-32B-Think
+ch05/13_olmo3/Olmo-3-7B-RLZero-IF
+ch05/13_olmo3/Olmo-3-32B-RLZero-IF
 ch06/01_main-chapter-code/gpt2
 ch06/02_bonus_additional-experiments/gpt2


@@ -179,19 +179,19 @@ Several folders contain optional materials as a bonus for interested readers:
   - [Optimizing Hyperparameters for Pretraining](ch05/05_bonus_hparam_tuning)
   - [Building a User Interface to Interact With the Pretrained LLM](ch05/06_user_interface)
   - [Converting GPT to Llama](ch05/07_gpt_to_llama)
-  - [Llama 3.2 From Scratch](ch05/07_gpt_to_llama/standalone-llama32.ipynb)
-  - [Qwen3 Dense and Mixture-of-Experts (MoE) From Scratch](ch05/11_qwen3/)
-  - [Gemma 3 From Scratch](ch05/12_gemma3/)
-  - [Memory-Efficient Model Weight Loading](ch05/08_memory_efficient_weight_loading/memory-efficient-state-dict.ipynb)
-  - [Extending the Tiktoken BPE Tokenizer With New Tokens](ch05/09_extending-tokenizers/extend-tiktoken.ipynb)
+  - [Memory-efficient Model Weight Loading](ch05/08_memory_efficient_weight_loading/memory-efficient-state-dict.ipynb)
+  - [Extending the Tiktoken BPE Tokenizer with New Tokens](ch05/09_extending-tokenizers/extend-tiktoken.ipynb)
   - [PyTorch Performance Tips for Faster LLM Training](ch05/10_llm-training-speed)
+  - [LLM Architectures](ch05/#llm-architectures-from-scratch)
+    - [Llama 3.2 From Scratch](ch05/07_gpt_to_llama/standalone-llama32.ipynb)
+    - [Qwen3 Dense and Mixture-of-Experts (MoE) From Scratch](ch05/11_qwen3/)
+    - [Gemma 3 From Scratch](ch05/12_gemma3/)
+    - [Olmo 3 From Scratch](ch05/13_olmo3/)
-- **Chapter 6: Finetuning for Classification**
-  - [Additional Experiments Finetuning Different Layers and Using Larger Models](ch06/02_bonus_additional-experiments)
-  - [Finetuning Different Models on 50k IMDb Movie Review Dataset](ch06/03_bonus_imdb-classification)
-  - [Building a User Interface to Interact With the GPT-Based Spam Classifier](ch06/04_user_interface)
-- **Chapter 7: Finetuning to Follow Instructions**
+- **Chapter 6: Finetuning for classification**
+  - [Additional experiments finetuning different layers and using larger models](ch06/02_bonus_additional-experiments)
+  - [Finetuning different models on 50k IMDb movie review dataset](ch06/03_bonus_imdb-classification)
+  - [Building a User Interface to Interact With the GPT-based Spam Classifier](ch06/04_user_interface)
+- **Chapter 7: Finetuning to follow instructions**
   - [Dataset Utilities for Finding Near Duplicates and Creating Passive Voice Entries](ch07/02_dataset-utilities)
   - [Evaluating Instruction Responses Using the OpenAI API and Ollama](ch07/03_model-evaluation)
   - [Generating a Dataset for Instruction Finetuning](ch07/05_dataset-generation/llama3-ollama.ipynb)


@@ -1223,7 +1223,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.13.5"
+   "version": "3.12.3"
   }
  },
  "nbformat": 4,


@@ -1253,7 +1253,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.13.5"
+   "version": "3.12.3"
   }
  },
  "nbformat": 4,


@@ -1179,7 +1179,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.13.5"
+   "version": "3.12.3"
   }
  },
  "nbformat": 4,


@@ -78,9 +78,9 @@
    "name": "stdout",
    "output_type": "stream",
    "text": [
-    "huggingface_hub version: 0.34.4\n",
-    "tokenizers version: 0.21.4\n",
-    "torch version: 2.8.0\n"
+    "huggingface_hub version: 0.35.0\n",
+    "tokenizers version: 0.22.1\n",
+    "torch version: 2.9.0+cu130\n"
    ]
   }
  ],
@@ -700,9 +700,9 @@
  {
   "data": {
    "text/plain": [
-    "tensor([[[ 0.7500,  0.1060,  0.4844,  ...,  0.9414,  0.3984, -0.2324],\n",
-    "         [-0.3438, -0.0549,  0.8984,  ..., -0.2402,  0.4570,  0.8242],\n",
-    "         [-0.2676, -0.3281,  0.4121,  ...,  0.8711, -0.9648,  0.9844]]],\n",
+    "tensor([[[ 0.7500,  0.1011,  0.4863,  ...,  0.9414,  0.3984, -0.2285],\n",
+    "         [-0.3398, -0.0564,  0.9023,  ..., -0.2480,  0.4551,  0.8203],\n",
+    "         [-0.2695, -0.3242,  0.4121,  ...,  0.8672, -0.9688,  0.9844]]],\n",
     "  dtype=torch.bfloat16, grad_fn=<UnsafeViewBackward0>)"
    ]
   },
@@ -806,7 +806,20 @@
   "metadata": {
    "id": "31f12baf-f79b-499f-85c0-51328a6a20f5"
   },
-  "outputs": [],
+  "outputs": [
+   {
+    "name": "stderr",
+    "output_type": "stream",
+    "text": [
+     "/home/rasbt/jupyterlab/reasoning/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:283: UserWarning: \n",
+     "Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.\n",
+     "Minimum and Maximum cuda capability supported by this version of PyTorch is\n",
+     "(8.0) - (12.0)\n",
+     "\n",
+     "  warnings.warn(\n"
+    ]
+   }
+  ],
   "source": [
    "if torch.cuda.is_available():\n",
    "    device = torch.device(\"cuda\")\n",
@@ -1038,6 +1051,20 @@
   "outputId": "55b2f28c-142f-4698-9d23-d27456d3ed6d"
  },
  "outputs": [
+  {
+   "data": {
+    "application/vnd.jupyter.widget-view+json": {
+     "model_id": "3396c08eab3f4cf980023483b969a337",
+     "version_major": 2,
+     "version_minor": 0
+    },
+    "text/plain": [
+     "model.safetensors: 0%| | 0.00/536M [00:00<?, ?B/s]"
+    ]
+   },
+   "metadata": {},
+   "output_type": "display_data"
+  },
   {
    "name": "stdout",
    "output_type": "stream",
@@ -1131,7 +1158,22 @@
   "execution_count": 22,
   "id": "7b6df8bc-7308-468e-93ce-2d5529ea7866",
   "metadata": {},
-  "outputs": [],
+  "outputs": [
+   {
+    "data": {
+     "application/vnd.jupyter.widget-view+json": {
+      "model_id": "39b7b77c5c3448cdbd48fcde4e1b1a57",
+      "version_major": 2,
+      "version_minor": 0
+     },
+     "text/plain": [
+      "tokenizer.json: 0%| | 0.00/33.4M [00:00<?, ?B/s]"
+     ]
+    },
+    "metadata": {},
+    "output_type": "display_data"
+   }
+  ],
   "source": [
    "tokenizer_file_path = os.path.join(local_dir, \"tokenizer.json\")\n",
    "if not os.path.exists(tokenizer_file_path):\n",
@@ -1195,34 +1237,40 @@
  },
  {
   "cell_type": "code",
-  "execution_count": 25,
+  "execution_count": 27,
   "id": "7b8401c6-e244-4cb7-9849-2ba71ce758d5",
   "metadata": {
    "id": "7b8401c6-e244-4cb7-9849-2ba71ce758d5"
   },
   "outputs": [],
   "source": [
-   "def generate_text_basic_stream(model, token_ids, max_new_tokens, \n",
-   "                               eos_token_id=None):\n",
-   "\n",
-   "    model.eval()\n",
-   "    with torch.no_grad():\n",
-   "        for _ in range(max_new_tokens):\n",
-   "            out = model(token_ids)[:, -1]\n",
-   "            next_token = torch.argmax(out, dim=-1, keepdim=True)\n",
-   "\n",
-   "            if (eos_token_id is not None\n",
-   "                and torch.all(next_token == eos_token_id)):\n",
-   "                break\n",
-   "\n",
-   "            yield next_token  # New: Yield each token as it's generated\n",
-   "\n",
-   "            token_ids = torch.cat([token_ids, next_token], dim=1)"
+   "def generate_text_basic_stream(model, token_ids, max_new_tokens, eos_token_id=None, context_size=None):\n",
+   "    model.eval()\n",
+   "\n",
+   "    with torch.no_grad():\n",
+   "        cache = KVCache(n_layers=model.cfg[\"n_layers\"])\n",
+   "        model.reset_kv_cache()\n",
+   "\n",
+   "        # Prime the cache with the initial context\n",
+   "        logits = model(token_ids, cache=cache)\n",
+   "\n",
+   "        for _ in range(max_new_tokens):\n",
+   "            next_token = torch.argmax(logits[:, -1], dim=-1, keepdim=True)\n",
+   "\n",
+   "            if eos_token_id is not None and torch.all(next_token == eos_token_id):\n",
+   "                break\n",
+   "\n",
+   "            yield next_token\n",
+   "\n",
+   "            token_ids = torch.cat([token_ids, next_token], dim=1)\n",
+   "\n",
+   "            # Feed only the new token to the model; cache handles history\n",
+   "            logits = model(next_token, cache=cache)"
   ]
  },
  {
   "cell_type": "code",
-  "execution_count": 26,
+  "execution_count": 28,
   "id": "56c9d0cf-25e9-4375-8d5c-368fa6911fdf",
   "metadata": {},
   "outputs": [
@@ -1230,17 +1278,25 @@
    "name": "stdout",
    "output_type": "stream",
    "text": [
-    "Large language models (LLMs) are sophisticated artificial intelligence systems that can understand, generate, and manipulate human language. They are trained on massive amounts of text data to learn patterns and relationships within language, enabling them to perform a wide range of tasks, from writing articles and answering questions to translating languages and summarizing information.\n"
+    "Large language models (LLMs) are sophisticated artificial intelligence systems that can understand, generate, and manipulate human language. They are trained on massive amounts of text data to learn patterns and relationships within that data, enabling them to perform a wide range of tasks, from writing articles and answering questions to translating languages and summarizing information.\n",
+    "\n",
+    "\n",
+    "GPU memory used: 0.96 GB\n"
    ]
   }
  ],
  "source": [
   "input_token_ids_tensor = torch.tensor(input_token_ids, device=device).unsqueeze(0)\n",
   "\n",
+  "\n",
+  "if torch.cuda.is_available():\n",
+  "    torch.cuda.reset_peak_memory_stats()\n",
+  "\n",
+  "\n",
   "for token in generate_text_basic_stream(\n",
   "    model=model,\n",
   "    token_ids=input_token_ids_tensor,\n",
-  "    max_new_tokens=150,\n",
+  "    max_new_tokens=500,\n",
   "    eos_token_id=tokenizer.encode(\"<end_of_turn>\")[-1]\n",
   "):\n",
   "    token_id = token.squeeze(0).tolist()\n",
@@ -1248,7 +1304,13 @@
   "    tokenizer.decode(token_id),\n",
   "    end=\"\",\n",
   "    flush=True\n",
-  "    )"
+  "    )\n",
+  "\n",
+  "if torch.cuda.is_available():\n",
+  "    def gpu_gb(x):\n",
+  "        return f\"{x / 1024 / 1024 / 1024:.2f} GB\"\n",
+  "\n",
+  "    print(f\"\\n\\nGPU memory used: {gpu_gb(torch.cuda.max_memory_allocated())}\")"
   ]
  },
{ {
@@ -1269,7 +1331,6 @@
   "id": "e6edaaae-2de1-406c-8ffa-897cdfa3808c"
  },
  "source": [
-  "- Check out the [README.md](./README.md), to use this model via the `llms_from_scratch` package\n",
   "- For those interested in a comprehensive guide on building a large language model from scratch and gaining a deeper understanding of its mechanics, you might like my [Build a Large Language Model (From Scratch)](http://mng.bz/orYv)\n",
   "\n",
   "<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>"
@@ -1297,7 +1358,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.16"
+   "version": "3.12.3"
   }
  },
  "nbformat": 4,
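The notebook diff above rewrites `generate_text_basic_stream` around a `KVCache`: the prompt primes the cache once, and every later step feeds only the newest token while the cache carries the history. A minimal sketch of such a per-layer cache (illustrative only; the notebook's actual `KVCache` may differ in detail):

```python
import torch

class KVCache:
    """Per-layer key/value cache; a hypothetical sketch of the container
    used by the KV-cache notebooks, not the exact implementation."""

    def __init__(self, n_layers):
        self.cache = [None] * n_layers

    def get(self, layer_idx):
        return self.cache[layer_idx]

    def update(self, layer_idx, keys, values):
        # Append along the sequence dimension (dim=2 for tensors shaped
        # [batch, n_kv_heads, seq_len, head_dim])
        if self.cache[layer_idx] is None:
            self.cache[layer_idx] = (keys, values)
        else:
            k_prev, v_prev = self.cache[layer_idx]
            self.cache[layer_idx] = (
                torch.cat([k_prev, keys], dim=2),
                torch.cat([v_prev, values], dim=2),
            )
        return self.cache[layer_idx]

# Prime with a 5-token prompt, then append one decoded token
cache = KVCache(n_layers=1)
cache.update(0, torch.zeros(1, 2, 5, 8), torch.zeros(1, 2, 5, 8))
k, v = cache.update(0, torch.zeros(1, 2, 1, 8), torch.zeros(1, 2, 1, 8))
print(k.shape)  # torch.Size([1, 2, 6, 8])
```

Priming appends the full prompt once; each decode step then appends a single position, which is why the loop can call `model(next_token, cache=cache)` instead of re-encoding the whole sequence.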


@@ -41,7 +41,6 @@
  "source": [
   "- This notebook is purposefully minimal and focuses on the code to re-implement Gemma 3 270M in pure PyTorch without relying on other external LLM libraries\n",
   "- For more information, see the official [Gemma 3 270M model card](https://huggingface.co/google/gemma-3-270m)\n",
-  "\n",
   "- Below is a side-by-side comparison with Qwen3 0.6B as a reference model; if you are interested in the Qwen3 0.6B standalone notebook, you can find it [here](../11_qwen3)\n",
   "<br>\n",
   "\n",
@@ -78,9 +77,9 @@
    "name": "stdout",
    "output_type": "stream",
    "text": [
-    "huggingface_hub version: 0.34.4\n",
-    "tokenizers version: 0.21.4\n",
-    "torch version: 2.8.0\n"
+    "huggingface_hub version: 0.35.0\n",
+    "tokenizers version: 0.22.1\n",
+    "torch version: 2.9.0+cu130\n"
    ]
   }
  ],
@@ -628,9 +627,9 @@
  {
   "data": {
    "text/plain": [
-    "tensor([[[ 0.7500,  0.1060,  0.4844,  ...,  0.9414,  0.3984, -0.2324],\n",
-    "         [-0.3438, -0.0549,  0.8984,  ..., -0.2402,  0.4570,  0.8242],\n",
-    "         [-0.2676, -0.3281,  0.4121,  ...,  0.8711, -0.9648,  0.9844]]],\n",
+    "tensor([[[ 0.7500,  0.1011,  0.4863,  ...,  0.9414,  0.3984, -0.2285],\n",
+    "         [-0.3398, -0.0564,  0.9023,  ..., -0.2480,  0.4551,  0.8203],\n",
+    "         [-0.2695, -0.3242,  0.4121,  ...,  0.8672, -0.9688,  0.9844]]],\n",
     "  dtype=torch.bfloat16, grad_fn=<UnsafeViewBackward0>)"
    ]
   },
@@ -731,7 +730,20 @@
   "metadata": {
    "id": "31f12baf-f79b-499f-85c0-51328a6a20f5"
   },
-  "outputs": [],
+  "outputs": [
+   {
+    "name": "stderr",
+    "output_type": "stream",
+    "text": [
+     "/home/rasbt/jupyterlab/reasoning/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:283: UserWarning: \n",
+     "Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.\n",
+     "Minimum and Maximum cuda capability supported by this version of PyTorch is\n",
+     "(8.0) - (12.0)\n",
+     "\n",
+     "  warnings.warn(\n"
+    ]
+   }
+  ],
   "source": [
    "if torch.cuda.is_available():\n",
    "    device = torch.device(\"cuda\")\n",
@@ -1095,7 +1107,7 @@
  },
  {
   "cell_type": "code",
-  "execution_count": 24,
+  "execution_count": 25,
   "id": "7b8401c6-e244-4cb7-9849-2ba71ce758d5",
   "metadata": {
    "id": "7b8401c6-e244-4cb7-9849-2ba71ce758d5"
@@ -1121,7 +1133,7 @@
  },
  {
   "cell_type": "code",
-  "execution_count": 25,
+  "execution_count": 28,
   "id": "1c7a04fa-6aac-416b-8f63-f1e19227633d",
   "metadata": {
    "id": "1c7a04fa-6aac-416b-8f63-f1e19227633d"
@@ -1131,7 +1143,10 @@
    "name": "stdout",
    "output_type": "stream",
    "text": [
-    "Large language models (LLMs) are sophisticated artificial intelligence systems that can understand, generate, and manipulate human language. They are trained on massive amounts of text data to learn patterns and relationships within language, enabling them to perform a wide range of tasks, from writing articles and answering questions to translating languages and summarizing information.\n"
+    "Large language models (LLMs) are sophisticated artificial intelligence systems that can understand, generate, and manipulate human language. They are trained on massive amounts of text data to learn patterns and relationships within that data, enabling them to perform a wide range of tasks, from writing articles and answering questions to translating languages and summarizing information.\n",
+    "\n",
+    "\n",
+    "GPU memory used: 1.04 GB\n"
    ]
   }
  ],
@@ -1139,6 +1154,10 @@
   "input_token_ids_tensor = torch.tensor(input_token_ids, device=device).unsqueeze(0)\n",
   "\n",
   "\n",
+  "if torch.cuda.is_available():\n",
+  "    torch.cuda.reset_peak_memory_stats()\n",
+  "\n",
+  "\n",
   "for token in generate_text_basic_stream(\n",
   "    model=model,\n",
   "    token_ids=input_token_ids_tensor,\n",
@@ -1150,7 +1169,13 @@
   "    tokenizer.decode(token_id),\n",
   "    end=\"\",\n",
   "    flush=True\n",
-  "    )"
+  "    )\n",
+  "\n",
+  "if torch.cuda.is_available():\n",
+  "    def gpu_gb(x):\n",
+  "        return f\"{x / 1024 / 1024 / 1024:.2f} GB\"\n",
+  "\n",
+  "    print(f\"\\n\\nGPU memory used: {gpu_gb(torch.cuda.max_memory_allocated())}\")"
   ]
}, },
{ {
@@ -1171,7 +1196,6 @@
   "id": "e6edaaae-2de1-406c-8ffa-897cdfa3808c"
  },
  "source": [
-  "- Check out the [README.md](./README.md), to use this model via the `llms_from_scratch` package\n",
   "- For those interested in a comprehensive guide on building a large language model from scratch and gaining a deeper understanding of its mechanics, you might like my [Build a Large Language Model (From Scratch)](http://mng.bz/orYv)\n",
   "\n",
   "<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>"
@@ -1199,7 +1223,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.16"
+   "version": "3.12.3"
   }
  },
  "nbformat": 4,
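Both notebook diffs add the same peak-memory report around generation: reset CUDA's peak counter before the loop, then read the high-water mark afterwards. A CPU-safe sketch of the pattern (the `gpu_gb` helper mirrors the one in the diff):

```python
import torch

def gpu_gb(num_bytes):
    # Format a byte count as gibibytes, like the notebook's helper
    return f"{num_bytes / 1024 / 1024 / 1024:.2f} GB"

if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()        # zero the high-water mark
    _ = torch.randn(1024, 1024, device="cuda")  # measured workload goes here
    peak = torch.cuda.max_memory_allocated()    # peak bytes since the reset
else:
    peak = 0  # no GPU available; nothing was allocated on-device

print(f"GPU memory used: {gpu_gb(peak)}")
```

Resetting matters: `max_memory_allocated()` otherwise reports the peak since process start (or the last reset), not the cost of the generation loop alone.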

File diff suppressed because it is too large

File diff suppressed because it is too large


@@ -0,0 +1,240 @@
# Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
# Source for "Build a Large Language Model From Scratch"
# - https://www.manning.com/books/build-a-large-language-model-from-scratch
# Code: https://github.com/rasbt/LLMs-from-scratch
import importlib.util
from pathlib import Path
import torch
from llms_from_scratch.utils import import_definitions_from_notebook
try:
from transformers import Olmo3Config, Olmo3ForCausalLM
except ImportError:
Olmo3Config = None
Olmo3ForCausalLM = None
def tiny_debug_config():
return {
"vocab_size": 257,
"context_length": 8,
"emb_dim": 32,
"n_heads": 4,
"n_layers": 2,
"hidden_dim": 64,
"head_dim": 8,
"qk_norm": True,
"n_kv_heads": 2,
"sliding_window": 4,
"layer_types": ["full_attention", "full_attention"],
"dtype": torch.float32,
"query_pre_attn_scalar": 256,
"attention_bias": False,
"rms_norm_eps": 1e-6,
"rope_base": 1_000_000.0,
"rope_attention_factor": 1.0,
"rope_type": "default",
"rope_factor": 1.0,
"rope_orig_max": 8,
"rope_local_base": 10_000.0,
}
def _hf_config_from_dict(cfg):
if Olmo3Config is None:
raise ImportError("transformers is required for the Olmo-3 debugger.")
return Olmo3Config(
vocab_size=cfg["vocab_size"],
max_position_embeddings=cfg["context_length"],
hidden_size=cfg["emb_dim"],
num_attention_heads=cfg["n_heads"],
num_hidden_layers=cfg["n_layers"],
intermediate_size=cfg["hidden_dim"],
head_dim=cfg["head_dim"],
num_key_value_heads=cfg["n_kv_heads"],
rope_theta=cfg["rope_base"],
rope_local_base_freq=cfg.get("rope_local_base", 10_000.0),
layer_types=cfg["layer_types"],
sliding_window=cfg["sliding_window"],
tie_word_embeddings=False,
attn_implementation="eager",
torch_dtype=cfg.get("dtype", torch.float32),
query_pre_attn_scalar=cfg.get("query_pre_attn_scalar", 256),
rope_scaling={"rope_type": cfg.get("rope_type", "default")},
qk_norm=cfg.get("qk_norm", False),
rms_norm_eps=cfg.get("rms_norm_eps", 1e-5),
)
def load_notebook_defs(nb_name="standalone-olmo3.ipynb"):
nb_dir = Path(__file__).resolve().parents[1]
return import_definitions_from_notebook(nb_dir, nb_name)
def build_olmo3_pair(nb_imports, cfg, hf_checkpoint=None):
if Olmo3ForCausalLM is None:
raise ImportError("transformers is required for the Olmo-3 debugger.")
ours = nb_imports.Olmo3Model(cfg)
hf_cfg = _hf_config_from_dict(cfg)
if hf_checkpoint:
hf_model = Olmo3ForCausalLM.from_pretrained(
hf_checkpoint,
torch_dtype=cfg.get("dtype", torch.float32),
attn_implementation="eager",
)
else:
hf_model = Olmo3ForCausalLM(hf_cfg)
param_config = {"n_layers": cfg["n_layers"], "hidden_dim": cfg["hidden_dim"]}
nb_imports.load_weights_into_olmo(ours, param_config, hf_model.state_dict())
ours.eval()
hf_model.eval()
return ours, hf_model
def _attach_debug_hooks(model, is_hf):
traces = {}
handles = []
def hook(name):
def _record(_, __, output):
traces[name] = output.detach().to(torch.float32).cpu()
return _record
if is_hf:
core = model.model
handles.append(core.embed_tokens.register_forward_hook(hook("embedding")))
for idx, layer in enumerate(core.layers):
handles.append(layer.register_forward_hook(hook(f"block_{idx}")))
handles.append(core.norm.register_forward_hook(hook("final_norm")))
handles.append(model.lm_head.register_forward_hook(hook("logits")))
else:
handles.append(model.tok_emb.register_forward_hook(hook("embedding")))
for idx, block in enumerate(model.blocks):
handles.append(block.register_forward_hook(hook(f"block_{idx}")))
handles.append(model.final_norm.register_forward_hook(hook("final_norm")))
handles.append(model.out_head.register_forward_hook(hook("logits")))
return traces, handles
def _layer_sort_key(name):
if name == "embedding":
return (0, 0)
if name.startswith("block_"):
idx = int(name.split("_")[1])
return (1, idx)
if name == "final_norm":
return (2, 0)
if name == "logits":
return (3, 0)
return (4, name)
def layerwise_differences(ours, hf_model, input_ids, rtol=1e-5, atol=1e-5):
ours_traces, ours_handles = _attach_debug_hooks(ours, is_hf=False)
hf_traces, hf_handles = _attach_debug_hooks(hf_model, is_hf=True)
try:
with torch.inference_mode():
ours(input_ids)
hf_model(input_ids)
finally:
for h in ours_handles + hf_handles:
h.remove()
layer_names = sorted(set(ours_traces) | set(hf_traces), key=_layer_sort_key)
results = []
for name in layer_names:
ours_tensor = ours_traces.get(name)
hf_tensor = hf_traces.get(name)
if ours_tensor is None or hf_tensor is None:
results.append(
{
"name": name,
"status": "missing",
"ours_shape": None if ours_tensor is None else tuple(ours_tensor.shape),
"hf_shape": None if hf_tensor is None else tuple(hf_tensor.shape),
"max_diff": None,
"mean_abs_diff": None,
}
)
continue
shapes_match = ours_tensor.shape == hf_tensor.shape
if not shapes_match:
results.append(
{
"name": name,
"status": "shape_mismatch",
"ours_shape": tuple(ours_tensor.shape),
"hf_shape": tuple(hf_tensor.shape),
"max_diff": None,
"mean_abs_diff": None,
}
)
continue
diff = (ours_tensor - hf_tensor).abs()
max_diff = float(diff.max().item())
mean_diff = float(diff.mean().item())
allclose = torch.allclose(ours_tensor, hf_tensor, rtol=rtol, atol=atol)
results.append(
{
"name": name,
"status": "ok" if allclose else "mismatch",
"ours_shape": tuple(ours_tensor.shape),
"hf_shape": tuple(hf_tensor.shape),
"max_diff": max_diff,
"mean_abs_diff": mean_diff,
}
)
return results
def first_mismatch(differences):
for diff in differences:
if diff["status"] != "ok":
return diff
return None
def format_report(differences):
lines = []
for diff in sorted(differences, key=lambda d: _layer_sort_key(d["name"])):
if diff["status"] == "ok":
lines.append(f"[OK] {diff['name']}: max={diff['max_diff']:.2e}, mean={diff['mean_abs_diff']:.2e}")
elif diff["status"] == "mismatch":
lines.append(
f"[DIFF] {diff['name']}: max={diff['max_diff']:.2e}, mean={diff['mean_abs_diff']:.2e}"
)
elif diff["status"] == "shape_mismatch":
lines.append(
f"[SHAPE] {diff['name']}: ours={diff['ours_shape']}, hf={diff['hf_shape']}"
)
else:
lines.append(f"[MISSING] {diff['name']}: ours={diff['ours_shape']}, hf={diff['hf_shape']}")
return "\n".join(lines)
if __name__ == "__main__":
transformers_available = importlib.util.find_spec("transformers") is not None
if not transformers_available:
raise SystemExit("transformers is not installed; install it to run the debugger.")
nb_imports = load_notebook_defs()
cfg = tiny_debug_config()
ours_model, hf_model = build_olmo3_pair(nb_imports, cfg)
torch.manual_seed(0)
input_ids = torch.randint(0, cfg["vocab_size"], (1, cfg["context_length"]), dtype=torch.long)
diffs = layerwise_differences(ours_model, hf_model, input_ids)
print(format_report(diffs))
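`_attach_debug_hooks` above is built on PyTorch forward hooks, which record each module's output during a single forward pass; the core pattern, stripped of the Olmo-specific wiring, looks like this:

```python
import torch
import torch.nn as nn

traces = {}

def make_hook(name):
    # A forward hook receives (module, inputs, output) after each forward()
    def _record(module, inputs, output):
        traces[name] = output.detach()
    return _record

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
handles = [layer.register_forward_hook(make_hook(f"layer_{i}"))
           for i, layer in enumerate(model)]
try:
    model(torch.randn(3, 4))
finally:
    for h in handles:
        h.remove()  # always detach hooks, as layerwise_differences does

print(sorted(traces))  # ['layer_0', 'layer_1', 'layer_2']
```

Running the same hooks on two models and diffing the recorded tensors layer by layer is exactly how `layerwise_differences` localizes the first point of divergence.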


@@ -0,0 +1,142 @@
# Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
# Source for "Build a Large Language Model From Scratch"
# - https://www.manning.com/books/build-a-large-language-model-from-scratch
# Code: https://github.com/rasbt/LLMs-from-scratch
import importlib.util
from pathlib import Path
import pytest
import torch
from llms_from_scratch.utils import import_definitions_from_notebook
transformers_installed = importlib.util.find_spec("transformers") is not None
@pytest.fixture
def nb_imports():
nb_dir = Path(__file__).resolve().parents[1]
mod = import_definitions_from_notebook(nb_dir, "standalone-olmo3-plus-kv-cache.ipynb")
return mod
@pytest.fixture
def dummy_input():
torch.manual_seed(123)
return torch.randint(0, 100, (1, 8)) # batch size 1, seq length 8
@pytest.fixture
def dummy_cfg_base():
return {
"vocab_size": 100,
"context_length": 64,
"emb_dim": 32,
"n_heads": 4,
"n_layers": 2,
"hidden_dim": 64,
"head_dim": 8,
"n_kv_heads": 1,  # 4 query heads, 1 KV group -> group_size = 4
"attention_bias": False,
"attention_dropout": 0.0,
"sliding_window": 4,
"layer_types": ["full_attention"] * 2,
# RoPE config
"rope_base": 10_000.0,
"rope_attention_factor": 1.0,
"rope_type": "default",
"rope_factor": 1.0,
"rope_orig_max": 64,
"rms_norm_eps": 1e-6,
"dtype": torch.float32,
}
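The `n_kv_heads` comment in the fixture describes grouped-query attention: with 4 query heads and 1 key/value head, all four query heads share the same K/V projections, typically by repeating K/V along the head dimension before the attention product. A small sketch under the test config's shapes (not the notebook's exact code):

```python
import torch

n_heads, n_kv_heads = 4, 1
group_size = n_heads // n_kv_heads  # each KV head serves 4 query heads

batch, seq_len, head_dim = 1, 8, 8
q = torch.randn(batch, n_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Expand K/V so every query head has a matching K/V head
k = k.repeat_interleave(group_size, dim=1)
v = v.repeat_interleave(group_size, dim=1)

attn_scores = q @ k.transpose(2, 3)  # shape (1, 4, 8, 8)
```

The memory savings come from storing and caching only `n_kv_heads` K/V heads; the expansion is a cheap view-level repeat at compute time.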
@torch.inference_mode()
def test_dummy_olmo3_forward(dummy_cfg_base, dummy_input, nb_imports):
torch.manual_seed(123)
model = nb_imports.Olmo3Model(dummy_cfg_base)
out = model(dummy_input)
assert out.shape == (1, dummy_input.size(1), dummy_cfg_base["vocab_size"]), \
f"Expected shape (1, seq_len, vocab_size), got {out.shape}"
@torch.inference_mode()
@pytest.mark.skipif(not transformers_installed, reason="transformers not installed")
def test_olmo3_base_equivalence_with_transformers(nb_imports):
from transformers import Olmo3Config, Olmo3ForCausalLM
# Tiny config so the test is fast
cfg = {
"vocab_size": 257,
"context_length": 8,
"emb_dim": 32,
"n_heads": 4,
"n_layers": 2,
"hidden_dim": 64,
"head_dim": 8,
"qk_norm": True,
"n_kv_heads": 2,
"sliding_window": 4,
"layer_types": ["full_attention", "full_attention"],
"dtype": torch.float32,
"query_pre_attn_scalar": 256,
# required by TransformerBlock
"attention_bias": False,
# required by RMSNorm and RoPE setup in Olmo3Model
"rms_norm_eps": 1e-6,
"rope_base": 1_000_000.0,
"rope_attention_factor": 1.0,
"rope_type": "default",
"rope_factor": 1.0,
"rope_orig_max": 8,
# extra HF-only stuff
"rope_local_base": 10_000.0,
}
model = nb_imports.Olmo3Model(cfg)
hf_cfg = Olmo3Config(
vocab_size=cfg["vocab_size"],
max_position_embeddings=cfg["context_length"],
hidden_size=cfg["emb_dim"],
num_attention_heads=cfg["n_heads"],
num_hidden_layers=cfg["n_layers"],
intermediate_size=cfg["hidden_dim"],
head_dim=cfg["head_dim"],
num_key_value_heads=cfg["n_kv_heads"],
rope_theta=cfg["rope_base"],
rope_local_base_freq=cfg["rope_local_base"],
layer_types=cfg["layer_types"],
sliding_window=cfg["sliding_window"],
tie_word_embeddings=False,
attn_implementation="eager",
torch_dtype=torch.float32,
query_pre_attn_scalar=cfg["query_pre_attn_scalar"],
rope_scaling={"rope_type": "default"},
qk_norm=cfg["qk_norm"],
rms_norm_eps=cfg["rms_norm_eps"],
)
hf_model = Olmo3ForCausalLM(hf_cfg)
hf_state = hf_model.state_dict()
param_config = {
"n_layers": cfg["n_layers"],
"hidden_dim": cfg["hidden_dim"],
}
nb_imports.load_weights_into_olmo(model, param_config, hf_state)
x = torch.randint(
0,
cfg["vocab_size"],
(2, cfg["context_length"]),
dtype=torch.long,
)
ours_logits = model(x)
theirs_logits = hf_model(x).logits
torch.testing.assert_close(ours_logits, theirs_logits, rtol=1e-5, atol=1e-5)
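The equivalence test's final check relies on `torch.testing.assert_close`, which passes elementwise when `|actual - expected| <= atol + rtol * |expected|`. With `rtol=atol=1e-5`, tiny float noise is tolerated while real mismatches fail:

```python
import torch

expected = torch.tensor([1.0, 2.0, 3.0])

# 5e-6 perturbation: within atol=1e-5, so the check passes silently
torch.testing.assert_close(expected + 5e-6, expected, rtol=1e-5, atol=1e-5)

# 1e-3 perturbation: exceeds atol + rtol * |expected|, so it raises
try:
    torch.testing.assert_close(expected + 1e-3, expected, rtol=1e-5, atol=1e-5)
    raised = False
except AssertionError:
    raised = True
print("large perturbation rejected:", raised)
```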


@@ -0,0 +1,142 @@
# Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
# Source for "Build a Large Language Model From Scratch"
# - https://www.manning.com/books/build-a-large-language-model-from-scratch
# Code: https://github.com/rasbt/LLMs-from-scratch
import importlib.util
from pathlib import Path
import pytest
import torch
from llms_from_scratch.utils import import_definitions_from_notebook
transformers_installed = importlib.util.find_spec("transformers") is not None
@pytest.fixture
def nb_imports():
nb_dir = Path(__file__).resolve().parents[1]
mod = import_definitions_from_notebook(nb_dir, "standalone-olmo3.ipynb")
return mod
@pytest.fixture
def dummy_input():
torch.manual_seed(123)
return torch.randint(0, 100, (1, 8)) # batch size 1, seq length 8
@pytest.fixture
def dummy_cfg_base():
return {
"vocab_size": 100,
"context_length": 64,
"emb_dim": 32,
"n_heads": 4,
"n_layers": 2,
"hidden_dim": 64,
"head_dim": 8,
"n_kv_heads": 1,  # 4 query heads, 1 KV group -> group_size = 4
"attention_bias": False,
"attention_dropout": 0.0,
"sliding_window": 4,
"layer_types": ["full_attention"] * 2,
# RoPE config
"rope_base": 10_000.0,
"rope_attention_factor": 1.0,
"rope_type": "default",
"rope_factor": 1.0,
"rope_orig_max": 64,
"rms_norm_eps": 1e-6,
"dtype": torch.float32,
}
@torch.inference_mode()
def test_dummy_olmo3_forward(dummy_cfg_base, dummy_input, nb_imports):
    torch.manual_seed(123)
    model = nb_imports.Olmo3Model(dummy_cfg_base)
    out = model(dummy_input)
    assert out.shape == (1, dummy_input.size(1), dummy_cfg_base["vocab_size"]), \
        f"Expected shape (1, seq_len, vocab_size), got {out.shape}"

@torch.inference_mode()
@pytest.mark.skipif(not transformers_installed, reason="transformers not installed")
def test_olmo3_base_equivalence_with_transformers(nb_imports):
    from transformers import Olmo3Config, Olmo3ForCausalLM

    # Tiny config so the test is fast
    cfg = {
        "vocab_size": 257,
        "context_length": 8,
        "emb_dim": 32,
        "n_heads": 4,
        "n_layers": 2,
        "hidden_dim": 64,
        "head_dim": 8,
        "qk_norm": True,
        "n_kv_heads": 2,
        "sliding_window": 4,
        "layer_types": ["full_attention", "full_attention"],
        "dtype": torch.float32,
        "query_pre_attn_scalar": 256,
        # required by TransformerBlock
        "attention_bias": False,
        # required by RMSNorm and RoPE setup in Olmo3Model
        "rms_norm_eps": 1e-6,
        "rope_base": 1_000_000.0,
        "rope_attention_factor": 1.0,
        "rope_type": "default",
        "rope_factor": 1.0,
        "rope_orig_max": 8,
        # extra HF-only stuff
        "rope_local_base": 10_000.0,
    }

    model = nb_imports.Olmo3Model(cfg)

    hf_cfg = Olmo3Config(
        vocab_size=cfg["vocab_size"],
        max_position_embeddings=cfg["context_length"],
        hidden_size=cfg["emb_dim"],
        num_attention_heads=cfg["n_heads"],
        num_hidden_layers=cfg["n_layers"],
        intermediate_size=cfg["hidden_dim"],
        head_dim=cfg["head_dim"],
        num_key_value_heads=cfg["n_kv_heads"],
        rope_theta=cfg["rope_base"],
        rope_local_base_freq=cfg["rope_local_base"],
        layer_types=cfg["layer_types"],
        sliding_window=cfg["sliding_window"],
        tie_word_embeddings=False,
        attn_implementation="eager",
        torch_dtype=torch.float32,
        query_pre_attn_scalar=cfg["query_pre_attn_scalar"],
        rope_scaling={"rope_type": "default"},
        qk_norm=cfg["qk_norm"],
        rms_norm_eps=cfg["rms_norm_eps"],
    )
    hf_model = Olmo3ForCausalLM(hf_cfg)
    hf_state = hf_model.state_dict()

    param_config = {
        "n_layers": cfg["n_layers"],
        "hidden_dim": cfg["hidden_dim"],
    }
    nb_imports.load_weights_into_olmo(model, param_config, hf_state)

    x = torch.randint(
        0,
        cfg["vocab_size"],
        (2, cfg["context_length"]),
        dtype=torch.long,
    )
    ours_logits = model(x)
    theirs_logits = hf_model(x).logits

    torch.testing.assert_close(ours_logits, theirs_logits, rtol=1e-5, atol=1e-5)
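The equivalence check above follows a common pattern: copy weights from one implementation into another with a different parameter layout, run the same input through both, and compare outputs with `torch.testing.assert_close`. A minimal standalone sketch of that pattern, using two hypothetical toy models rather than the Olmo3 classes:

```python
import torch
import torch.nn as nn


class ModelA(nn.Module):
    """Reference implementation: a plain linear projection."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 3, bias=False)

    def forward(self, x):
        return self.proj(x)


class ModelB(nn.Module):
    """Re-implementation with a differently named parameter slot."""
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(3, 4))

    def forward(self, x):
        return x @ self.weight.T


torch.manual_seed(0)
a, b = ModelA(), ModelB()

# "Weight loading": map the reference parameters onto the re-implementation.
with torch.no_grad():
    b.weight.copy_(a.proj.weight)

x = torch.randn(2, 4)
# Raises an informative AssertionError if the outputs diverge beyond tolerance.
torch.testing.assert_close(a(x), b(x), rtol=1e-5, atol=1e-5)
```

Scaling the same idea up to a full transformer is exactly what `load_weights_into_olmo` plus the final `assert_close` on logits does in the test above.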


@@ -13,14 +13,25 @@
- [04_learning_rate_schedulers](04_learning_rate_schedulers) contains code implementing a more sophisticated training function including learning rate schedulers and gradient clipping
- [05_bonus_hparam_tuning](05_bonus_hparam_tuning) contains an optional hyperparameter tuning script
- [06_user_interface](06_user_interface) implements an interactive user interface to interact with the pretrained LLM
- [07_gpt_to_llama](07_gpt_to_llama) contains a step-by-step guide for converting a GPT architecture implementation to Llama 3.2 and loads pretrained weights from Meta AI
- [08_memory_efficient_weight_loading](08_memory_efficient_weight_loading) contains a bonus notebook showing how to load model weights via PyTorch's `load_state_dict` method more efficiently
- [09_extending-tokenizers](09_extending-tokenizers) contains a from-scratch implementation of the GPT-2 BPE tokenizer
- [10_llm-training-speed](10_llm-training-speed) shows PyTorch performance tips to improve the LLM training speed
&nbsp;
## LLM Architectures From Scratch
<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/qwen/qwen-overview.webp">
&nbsp;
- [07_gpt_to_llama](07_gpt_to_llama) contains a step-by-step guide for converting a GPT architecture implementation to Llama 3.2 and loads pretrained weights from Meta AI
- [11_qwen3](11_qwen3) A from-scratch implementation of Qwen3 0.6B and Qwen3 30B-A3B (Mixture-of-Experts), including code to load the pretrained weights of the base, reasoning, and coding model variants
- [12_gemma3](12_gemma3) A from-scratch implementation of Gemma 3 270M and an alternative with KV cache, including code to load the pretrained weights
- [13_olmo3](13_olmo3) A from-scratch implementation of Olmo 3 7B and 32B (Base, Instruct, and Think variants) and an alternative with KV cache, including code to load the pretrained weights
&nbsp;
## Code-Along Video for This Chapter
<br>
<br>
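The "alternative with KV cache" variants listed above share one core idea: during autoregressive decoding, the keys and values of past tokens are cached so each new token attends over stored tensors instead of re-encoding the whole prefix. A minimal single-head sketch of that mechanism (illustrative only, not the bonus notebooks' implementation):

```python
import torch


def attention_step(q_new, k_new, v_new, cache):
    """One decoding step: append the new key/value to the cache, then
    attend over the full cached history.

    q_new, k_new, v_new have shape (batch, 1, head_dim); the cache dict
    accumulates tensors of shape (batch, t, head_dim).
    """
    if cache["k"] is None:
        cache["k"], cache["v"] = k_new, v_new
    else:
        cache["k"] = torch.cat([cache["k"], k_new], dim=1)
        cache["v"] = torch.cat([cache["v"], v_new], dim=1)
    d = q_new.shape[-1]
    scores = q_new @ cache["k"].transpose(1, 2) / d ** 0.5  # (batch, 1, t)
    return torch.softmax(scores, dim=-1) @ cache["v"]       # (batch, 1, head_dim)
```

Decoding token by token with this cache reproduces the rows of full causal attention over the same sequence while avoiding recomputing keys and values for the prefix at every step.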