Mirror of https://github.com/rasbt/LLMs-from-scratch.git
Synced 2026-04-10 12:33:42 +00:00

Commit bc6f335526 (parent 398b079efa), committed via GitHub

Olmo 3 from scratch (#914)

* Olmo 3 from scratch
* update
* update
* update
.github/workflows/basic-tests-linux-uv.yml (vendored): 2 changed lines

@@ -57,6 +57,8 @@ jobs:
           pytest ch05/11_qwen3/tests/test_qwen3_nb.py
           pytest ch05/12_gemma3/tests/test_gemma3_nb.py
           pytest ch05/12_gemma3/tests/test_gemma3_kv_nb.py
+          pytest ch05/13_olmo3/tests/test_olmo3_nb.py
+          pytest ch05/13_olmo3/tests/test_olmo3_kvcache_nb.py
           pytest ch06/01_main-chapter-code/tests.py
       - name: Validate Selected Jupyter Notebooks (uv)
.gitignore (vendored): 10 changed lines

@@ -70,6 +70,16 @@ ch05/11_qwen3/Qwen3-8B
 ch05/11_qwen3/Qwen3-8B-Base
 ch05/11_qwen3/Qwen3-32B
 ch05/11_qwen3/Qwen3-32B-Base
+ch05/12_gemma3/gemma-3-270M-it
+ch05/12_gemma3/gemma-3-270M
+ch05/13_olmo3/Olmo-3-1025-7B
+ch05/13_olmo3/Olmo-3-1125-32B
+ch05/13_olmo3/Olmo-3-7B-Instruct
+ch05/13_olmo3/Olmo-3-32B-Instruct
+ch05/13_olmo3/Olmo-3-7B-Think
+ch05/13_olmo3/Olmo-3-32B-Think
+ch05/13_olmo3/Olmo-3-7B-RLZero-IF
+ch05/13_olmo3/Olmo-3-32B-RLZero-IF

 ch06/01_main-chapter-code/gpt2
 ch06/02_bonus_additional-experiments/gpt2
README.md: 24 changed lines

@@ -179,19 +179,19 @@ Several folders contain optional materials as a bonus for interested readers:
   - [Optimizing Hyperparameters for Pretraining](ch05/05_bonus_hparam_tuning)
   - [Building a User Interface to Interact With the Pretrained LLM](ch05/06_user_interface)
   - [Converting GPT to Llama](ch05/07_gpt_to_llama)
-  - [Llama 3.2 From Scratch](ch05/07_gpt_to_llama/standalone-llama32.ipynb)
-  - [Qwen3 Dense and Mixture-of-Experts (MoE) From Scratch](ch05/11_qwen3/)
-  - [Gemma 3 From Scratch](ch05/12_gemma3/)
-  - [Memory-Efficient Model Weight Loading](ch05/08_memory_efficient_weight_loading/memory-efficient-state-dict.ipynb)
-  - [Extending the Tiktoken BPE Tokenizer With New Tokens](ch05/09_extending-tokenizers/extend-tiktoken.ipynb)
+  - [Memory-efficient Model Weight Loading](ch05/08_memory_efficient_weight_loading/memory-efficient-state-dict.ipynb)
+  - [Extending the Tiktoken BPE Tokenizer with New Tokens](ch05/09_extending-tokenizers/extend-tiktoken.ipynb)
   - [PyTorch Performance Tips for Faster LLM Training](ch05/10_llm-training-speed)
+  - [LLM Architectures](ch05/#llm-architectures-from-scratch)
+    - [Llama 3.2 From Scratch](ch05/07_gpt_to_llama/standalone-llama32.ipynb)
+    - [Qwen3 Dense and Mixture-of-Experts (MoE) From Scratch](ch05/11_qwen3/)
+    - [Gemma 3 From Scratch](ch05/12_gemma3/)
+    - [Olmo 3 From Scratch](ch05/13_olmo3/)
-- **Chapter 6: Finetuning for Classification**
+- **Chapter 6: Finetuning for classification**
-  - [Additional Experiments Finetuning Different Layers and Using Larger Models](ch06/02_bonus_additional-experiments)
+  - [Additional experiments finetuning different layers and using larger models](ch06/02_bonus_additional-experiments)
-  - [Finetuning Different Models on 50k IMDb Movie Review Dataset](ch06/03_bonus_imdb-classification)
+  - [Finetuning different models on 50k IMDb movie review dataset](ch06/03_bonus_imdb-classification)
-  - [Building a User Interface to Interact With the GPT-Based Spam Classifier](ch06/04_user_interface)
+  - [Building a User Interface to Interact With the GPT-based Spam Classifier](ch06/04_user_interface)
-- **Chapter 7: Finetuning to Follow Instructions**
+- **Chapter 7: Finetuning to follow instructions**
   - [Dataset Utilities for Finding Near Duplicates and Creating Passive Voice Entries](ch07/02_dataset-utilities)
   - [Evaluating Instruction Responses Using the OpenAI API and Ollama](ch07/03_model-evaluation)
   - [Generating a Dataset for Instruction Finetuning](ch07/05_dataset-generation/llama3-ollama.ipynb)
@@ -1223,7 +1223,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.13.5"
+   "version": "3.12.3"
   }
  },
  "nbformat": 4,

@@ -1253,7 +1253,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.13.5"
+   "version": "3.12.3"
   }
  },
  "nbformat": 4,

@@ -1179,7 +1179,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.13.5"
+   "version": "3.12.3"
   }
  },
  "nbformat": 4,
@@ -78,9 +78,9 @@
    "name": "stdout",
    "output_type": "stream",
    "text": [
-    "huggingface_hub version: 0.34.4\n",
-    "tokenizers version: 0.21.4\n",
-    "torch version: 2.8.0\n"
+    "huggingface_hub version: 0.35.0\n",
+    "tokenizers version: 0.22.1\n",
+    "torch version: 2.9.0+cu130\n"
    ]
   }
  ],
@@ -700,9 +700,9 @@
   {
    "data": {
     "text/plain": [
-     "tensor([[[ 0.7500,  0.1060,  0.4844,  ...,  0.9414,  0.3984, -0.2324],\n",
-     "         [-0.3438, -0.0549,  0.8984,  ..., -0.2402,  0.4570,  0.8242],\n",
-     "         [-0.2676, -0.3281,  0.4121,  ...,  0.8711, -0.9648,  0.9844]]],\n",
+     "tensor([[[ 0.7500,  0.1011,  0.4863,  ...,  0.9414,  0.3984, -0.2285],\n",
+     "         [-0.3398, -0.0564,  0.9023,  ..., -0.2480,  0.4551,  0.8203],\n",
+     "         [-0.2695, -0.3242,  0.4121,  ...,  0.8672, -0.9688,  0.9844]]],\n",
      "       dtype=torch.bfloat16, grad_fn=<UnsafeViewBackward0>)"
     ]
    },
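The tensor values in the hunk above shift only slightly between the two runs (e.g. 0.1060 vs. 0.1011) because the logits are stored in bfloat16, whose 8-bit mantissa quantizes values near 0.1 to steps of roughly 0.00024. A minimal sketch of that rounding, done by hand in pure Python (keep the top 16 bits of the float32 encoding, round-to-nearest-even); this is an illustrative stand-in, not PyTorch's internal code, and it assumes finite inputs:

```python
import struct

def to_bfloat16(x):
    # Reinterpret the float as its 32-bit IEEE-754 pattern,
    # then round away the low 16 mantissa bits (round-to-nearest-even).
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    rounded = (bits + 0x7FFF + ((bits >> 16) & 1)) >> 16
    # Re-expand the surviving top 16 bits back into a float32 value.
    return struct.unpack(">f", struct.pack(">I", rounded << 16))[0]

# Values exactly representable in bfloat16 survive the round trip;
# others snap to the nearest representable neighbor.
print(to_bfloat16(1.5), to_bfloat16(0.106))
```

Small version-to-version changes in kernel ordering can therefore move a printed value by one or two of these coarse steps without indicating a real numerical problem.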
@@ -806,7 +806,20 @@
   "metadata": {
    "id": "31f12baf-f79b-499f-85c0-51328a6a20f5"
   },
-  "outputs": [],
+  "outputs": [
+   {
+    "name": "stderr",
+    "output_type": "stream",
+    "text": [
+     "/home/rasbt/jupyterlab/reasoning/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:283: UserWarning: \n",
+     "    Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.\n",
+     "    Minimum and Maximum cuda capability supported by this version of PyTorch is\n",
+     "    (8.0) - (12.0)\n",
+     "    \n",
+     "  warnings.warn(\n"
+    ]
+   }
+  ],
   "source": [
    "if torch.cuda.is_available():\n",
    "    device = torch.device(\"cuda\")\n",
@@ -1038,6 +1051,20 @@
    "outputId": "55b2f28c-142f-4698-9d23-d27456d3ed6d"
   },
   "outputs": [
+   {
+    "data": {
+     "application/vnd.jupyter.widget-view+json": {
+      "model_id": "3396c08eab3f4cf980023483b969a337",
+      "version_major": 2,
+      "version_minor": 0
+     },
+     "text/plain": [
+      "model.safetensors:   0%|          | 0.00/536M [00:00<?, ?B/s]"
+     ]
+    },
+    "metadata": {},
+    "output_type": "display_data"
+   },
    {
     "name": "stdout",
     "output_type": "stream",
@@ -1131,7 +1158,22 @@
   "execution_count": 22,
   "id": "7b6df8bc-7308-468e-93ce-2d5529ea7866",
   "metadata": {},
-  "outputs": [],
+  "outputs": [
+   {
+    "data": {
+     "application/vnd.jupyter.widget-view+json": {
+      "model_id": "39b7b77c5c3448cdbd48fcde4e1b1a57",
+      "version_major": 2,
+      "version_minor": 0
+     },
+     "text/plain": [
+      "tokenizer.json:   0%|          | 0.00/33.4M [00:00<?, ?B/s]"
+     ]
+    },
+    "metadata": {},
+    "output_type": "display_data"
+   }
+  ],
   "source": [
    "tokenizer_file_path = os.path.join(local_dir, \"tokenizer.json\")\n",
    "if not os.path.exists(tokenizer_file_path):\n",
@@ -1195,34 +1237,40 @@
  },
  {
   "cell_type": "code",
-  "execution_count": 25,
+  "execution_count": 27,
   "id": "7b8401c6-e244-4cb7-9849-2ba71ce758d5",
   "metadata": {
    "id": "7b8401c6-e244-4cb7-9849-2ba71ce758d5"
   },
   "outputs": [],
   "source": [
-   "def generate_text_basic_stream(model, token_ids, max_new_tokens, \n",
-   "                               eos_token_id=None):\n",
-   "\n",
+   "def generate_text_basic_stream(model, token_ids, max_new_tokens, eos_token_id=None, context_size=None):\n",
    "\n",
    "    model.eval()\n",
-   "    with torch.no_grad():\n",
-   "        for _ in range(max_new_tokens):\n",
-   "            out = model(token_ids)[:, -1]\n",
-   "            next_token = torch.argmax(out, dim=-1, keepdim=True)\n",
    "\n",
-   "            if (eos_token_id is not None\n",
-   "                    and torch.all(next_token == eos_token_id)):\n",
+   "    with torch.no_grad():\n",
+   "        cache = KVCache(n_layers=model.cfg[\"n_layers\"])\n",
+   "        model.reset_kv_cache()\n",
+   "\n",
+   "        # Prime the cache with the initial context\n",
+   "        logits = model(token_ids, cache=cache)\n",
+   "\n",
+   "        for _ in range(max_new_tokens):\n",
+   "            next_token = torch.argmax(logits[:, -1], dim=-1, keepdim=True)\n",
+   "\n",
+   "            if eos_token_id is not None and torch.all(next_token == eos_token_id):\n",
    "                break\n",
    "\n",
-   "            yield next_token  # New: Yield each token as it's generated\n",
-   "            \n",
-   "            token_ids = torch.cat([token_ids, next_token], dim=1)"
+   "            yield next_token\n",
+   "\n",
+   "            token_ids = torch.cat([token_ids, next_token], dim=1)\n",
+   "\n",
+   "            # Feed only the new token to the model; cache handles history\n",
+   "            logits = model(next_token, cache=cache)"
   ]
  },
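The change above rewrites the streaming generator around a KV cache: the full prompt is fed once to prime the cache, and every subsequent step feeds only the single newest token while the cache carries the history. The control flow can be sketched without PyTorch; the `ToyCache` and `toy_model` below are hypothetical stand-ins (a counter pretending to be a language model), not the notebook's `KVCache` or `Olmo3Model`:

```python
class ToyCache:
    """Accumulates every token the model has processed, like a KV cache."""
    def __init__(self):
        self.tokens = []

def toy_model(new_tokens, cache):
    # Only the newly fed tokens are appended; history lives in the cache.
    cache.tokens.extend(new_tokens)
    n = len(cache.tokens)
    vocab = 5
    # Fake deterministic "logits": peak at index (n % vocab).
    return [1.0 if i == (n % vocab) else 0.0 for i in range(vocab)]

def generate_stream(prompt, max_new_tokens, eos_token_id=None):
    cache = ToyCache()
    logits = toy_model(prompt, cache)  # prime the cache with the full prompt
    for _ in range(max_new_tokens):
        next_token = max(range(len(logits)), key=logits.__getitem__)  # argmax
        if eos_token_id is not None and next_token == eos_token_id:
            break
        yield next_token
        # Feed only the new token; the cache supplies the history.
        logits = toy_model([next_token], cache)

tokens = list(generate_stream([1, 2, 3], max_new_tokens=4))
print(tokens)
```

The key structural point matches the diff: `token_ids` grows for bookkeeping, but the model call in the loop receives only `next_token`, which is what makes per-step cost independent of the sequence length already generated.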
  {
   "cell_type": "code",
-  "execution_count": 26,
+  "execution_count": 28,
   "id": "56c9d0cf-25e9-4375-8d5c-368fa6911fdf",
   "metadata": {},
   "outputs": [

@@ -1230,17 +1278,25 @@
    "name": "stdout",
    "output_type": "stream",
    "text": [
-    "Large language models (LLMs) are sophisticated artificial intelligence systems that can understand, generate, and manipulate human language. They are trained on massive amounts of text data to learn patterns and relationships within language, enabling them to perform a wide range of tasks, from writing articles and answering questions to translating languages and summarizing information.\n"
+    "Large language models (LLMs) are sophisticated artificial intelligence systems that can understand, generate, and manipulate human language. They are trained on massive amounts of text data to learn patterns and relationships within that data, enabling them to perform a wide range of tasks, from writing articles and answering questions to translating languages and summarizing information.\n",
+    "\n",
+    "\n",
+    "GPU memory used: 0.96 GB\n"
    ]
   }
  ],
  "source": [
   "input_token_ids_tensor = torch.tensor(input_token_ids, device=device).unsqueeze(0)\n",
   "\n",
+  "\n",
+  "if torch.cuda.is_available():\n",
+  "    torch.cuda.reset_peak_memory_stats()\n",
+  "\n",
+  "\n",
   "for token in generate_text_basic_stream(\n",
   "    model=model,\n",
   "    token_ids=input_token_ids_tensor,\n",
-  "    max_new_tokens=150,\n",
+  "    max_new_tokens=500,\n",
   "    eos_token_id=tokenizer.encode(\"<end_of_turn>\")[-1]\n",
   "):\n",
   "    token_id = token.squeeze(0).tolist()\n",

@@ -1248,7 +1304,13 @@
   "    tokenizer.decode(token_id),\n",
   "    end=\"\",\n",
   "    flush=True\n",
-  "    )"
+  "    )\n",
+  "\n",
+  "if torch.cuda.is_available():\n",
+  "    def gpu_gb(x):\n",
+  "        return f\"{x / 1024 / 1024 / 1024:.2f} GB\"\n",
+  "    \n",
+  "    print(f\"\\n\\nGPU memory used: {gpu_gb(torch.cuda.max_memory_allocated())}\")"
  ]
 },
 {
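The generation cells above gain a peak-memory report: `torch.cuda.reset_peak_memory_stats()` is called before decoding and `torch.cuda.max_memory_allocated()` (a byte count) is read afterwards. The formatting helper is plain arithmetic and can be checked standalone; this is a copy of the `gpu_gb` added in the diff, exercised with a hypothetical byte count:

```python
def gpu_gb(num_bytes):
    # Bytes -> GiB (three divisions by 1024), formatted to two decimals.
    return f"{num_bytes / 1024 / 1024 / 1024:.2f} GB"

print(gpu_gb(1 << 30))           # exactly one GiB
print(gpu_gb(1_030_792_151))     # a byte count in the ballpark reported above
```

Note the helper reports GiB (powers of 1024) while labeling them "GB"; the 0.96 GB and 1.04 GB figures in the outputs should be read accordingly.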
@@ -1269,7 +1331,6 @@
   "id": "e6edaaae-2de1-406c-8ffa-897cdfa3808c"
  },
  "source": [
-  "- Check out the [README.md](./README.md), to use this model via the `llms_from_scratch` package\n",
   "- For those interested in a comprehensive guide on building a large language model from scratch and gaining a deeper understanding of its mechanics, you might like my [Build a Large Language Model (From Scratch)](http://mng.bz/orYv)\n",
   "\n",
   "<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>"

@@ -1297,7 +1358,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-  "version": "3.10.16"
+  "version": "3.12.3"
  }
 },
 "nbformat": 4,
@@ -41,7 +41,6 @@
  "source": [
   "- This notebook is purposefully minimal and focuses on the code to re-implement Gemma 3 270M in pure PyTorch without relying on other external LLM libraries\n",
   "- For more information, see the official [Gemma 3 270M model card](https://huggingface.co/google/gemma-3-270m)\n",
-  "\n",
   "- Below is a side-by-side comparison with Qwen3 0.6B as a reference model; if you are interested in the Qwen3 0.6B standalone notebook, you can find it [here](../11_qwen3)\n",
   "<br>\n",
   "\n",

@@ -78,9 +77,9 @@
   "name": "stdout",
   "output_type": "stream",
   "text": [
-   "huggingface_hub version: 0.34.4\n",
-   "tokenizers version: 0.21.4\n",
-   "torch version: 2.8.0\n"
+   "huggingface_hub version: 0.35.0\n",
+   "tokenizers version: 0.22.1\n",
+   "torch version: 2.9.0+cu130\n"
   ]
  }
 ],
@@ -628,9 +627,9 @@
  {
   "data": {
    "text/plain": [
-    "tensor([[[ 0.7500,  0.1060,  0.4844,  ...,  0.9414,  0.3984, -0.2324],\n",
-    "         [-0.3438, -0.0549,  0.8984,  ..., -0.2402,  0.4570,  0.8242],\n",
-    "         [-0.2676, -0.3281,  0.4121,  ...,  0.8711, -0.9648,  0.9844]]],\n",
+    "tensor([[[ 0.7500,  0.1011,  0.4863,  ...,  0.9414,  0.3984, -0.2285],\n",
+    "         [-0.3398, -0.0564,  0.9023,  ..., -0.2480,  0.4551,  0.8203],\n",
+    "         [-0.2695, -0.3242,  0.4121,  ...,  0.8672, -0.9688,  0.9844]]],\n",
     "       dtype=torch.bfloat16, grad_fn=<UnsafeViewBackward0>)"
    ]
   },
@@ -731,7 +730,20 @@
  "metadata": {
   "id": "31f12baf-f79b-499f-85c0-51328a6a20f5"
  },
- "outputs": [],
+ "outputs": [
+  {
+   "name": "stderr",
+   "output_type": "stream",
+   "text": [
+    "/home/rasbt/jupyterlab/reasoning/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:283: UserWarning: \n",
+    "    Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.\n",
+    "    Minimum and Maximum cuda capability supported by this version of PyTorch is\n",
+    "    (8.0) - (12.0)\n",
+    "    \n",
+    "  warnings.warn(\n"
+   ]
+  }
+ ],
  "source": [
   "if torch.cuda.is_available():\n",
   "    device = torch.device(\"cuda\")\n",
@@ -1095,7 +1107,7 @@
  },
  {
   "cell_type": "code",
-  "execution_count": 24,
+  "execution_count": 25,
   "id": "7b8401c6-e244-4cb7-9849-2ba71ce758d5",
   "metadata": {
    "id": "7b8401c6-e244-4cb7-9849-2ba71ce758d5"

@@ -1121,7 +1133,7 @@
  },
  {
   "cell_type": "code",
-  "execution_count": 25,
+  "execution_count": 28,
   "id": "1c7a04fa-6aac-416b-8f63-f1e19227633d",
   "metadata": {
    "id": "1c7a04fa-6aac-416b-8f63-f1e19227633d"
@@ -1131,7 +1143,10 @@
   "name": "stdout",
   "output_type": "stream",
   "text": [
-   "Large language models (LLMs) are sophisticated artificial intelligence systems that can understand, generate, and manipulate human language. They are trained on massive amounts of text data to learn patterns and relationships within language, enabling them to perform a wide range of tasks, from writing articles and answering questions to translating languages and summarizing information.\n"
+   "Large language models (LLMs) are sophisticated artificial intelligence systems that can understand, generate, and manipulate human language. They are trained on massive amounts of text data to learn patterns and relationships within that data, enabling them to perform a wide range of tasks, from writing articles and answering questions to translating languages and summarizing information.\n",
+   "\n",
+   "\n",
+   "GPU memory used: 1.04 GB\n"
   ]
  }
 ],

@@ -1139,6 +1154,10 @@
  "input_token_ids_tensor = torch.tensor(input_token_ids, device=device).unsqueeze(0)\n",
  "\n",
  "\n",
+ "if torch.cuda.is_available():\n",
+ "    torch.cuda.reset_peak_memory_stats()\n",
+ "\n",
+ "\n",
  "for token in generate_text_basic_stream(\n",
  "    model=model,\n",
  "    token_ids=input_token_ids_tensor,\n",

@@ -1150,7 +1169,13 @@
  "    tokenizer.decode(token_id),\n",
  "    end=\"\",\n",
  "    flush=True\n",
- "    )"
+ "    )\n",
+ "\n",
+ "if torch.cuda.is_available():\n",
+ "    def gpu_gb(x):\n",
+ "        return f\"{x / 1024 / 1024 / 1024:.2f} GB\"\n",
+ "    \n",
+ "    print(f\"\\n\\nGPU memory used: {gpu_gb(torch.cuda.max_memory_allocated())}\")"
 ]
},
{
@@ -1171,7 +1196,6 @@
   "id": "e6edaaae-2de1-406c-8ffa-897cdfa3808c"
  },
  "source": [
-  "- Check out the [README.md](./README.md), to use this model via the `llms_from_scratch` package\n",
   "- For those interested in a comprehensive guide on building a large language model from scratch and gaining a deeper understanding of its mechanics, you might like my [Build a Large Language Model (From Scratch)](http://mng.bz/orYv)\n",
   "\n",
   "<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>"

@@ -1199,7 +1223,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-  "version": "3.10.16"
+  "version": "3.12.3"
  }
 },
 "nbformat": 4,
ch05/13_olmo3/standalone-olmo3-plus-kv-cache.ipynb (new file, 1290 lines)
  File diff suppressed because it is too large

ch05/13_olmo3/standalone-olmo3.ipynb (new file, 1183 lines)
  File diff suppressed because it is too large

ch05/13_olmo3/tests/olmo3_layer_debugger.py (new file, 240 lines)
@@ -0,0 +1,240 @@
# Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
# Source for "Build a Large Language Model From Scratch"
#   - https://www.manning.com/books/build-a-large-language-model-from-scratch
# Code: https://github.com/rasbt/LLMs-from-scratch

import importlib
from pathlib import Path

import torch

from llms_from_scratch.utils import import_definitions_from_notebook

try:
    from transformers import Olmo3Config, Olmo3ForCausalLM
except ImportError:
    Olmo3Config = None
    Olmo3ForCausalLM = None


def tiny_debug_config():
    return {
        "vocab_size": 257,
        "context_length": 8,
        "emb_dim": 32,
        "n_heads": 4,
        "n_layers": 2,
        "hidden_dim": 64,
        "head_dim": 8,
        "qk_norm": True,
        "n_kv_heads": 2,
        "sliding_window": 4,
        "layer_types": ["full_attention", "full_attention"],
        "dtype": torch.float32,
        "query_pre_attn_scalar": 256,
        "attention_bias": False,
        "rms_norm_eps": 1e-6,
        "rope_base": 1_000_000.0,
        "rope_attention_factor": 1.0,
        "rope_type": "default",
        "rope_factor": 1.0,
        "rope_orig_max": 8,
        "rope_local_base": 10_000.0,
    }


def _hf_config_from_dict(cfg):
    if Olmo3Config is None:
        raise ImportError("transformers is required for the Olmo-3 debugger.")

    return Olmo3Config(
        vocab_size=cfg["vocab_size"],
        max_position_embeddings=cfg["context_length"],
        hidden_size=cfg["emb_dim"],
        num_attention_heads=cfg["n_heads"],
        num_hidden_layers=cfg["n_layers"],
        intermediate_size=cfg["hidden_dim"],
        head_dim=cfg["head_dim"],
        num_key_value_heads=cfg["n_kv_heads"],
        rope_theta=cfg["rope_base"],
        rope_local_base_freq=cfg.get("rope_local_base", 10_000.0),
        layer_types=cfg["layer_types"],
        sliding_window=cfg["sliding_window"],
        tie_word_embeddings=False,
        attn_implementation="eager",
        torch_dtype=cfg.get("dtype", torch.float32),
        query_pre_attn_scalar=cfg.get("query_pre_attn_scalar", 256),
        rope_scaling={"rope_type": cfg.get("rope_type", "default")},
        qk_norm=cfg.get("qk_norm", False),
        rms_norm_eps=cfg.get("rms_norm_eps", 1e-5),
    )


def load_notebook_defs(nb_name="standalone-olmo3.ipynb"):
    nb_dir = Path(__file__).resolve().parents[1]
    return import_definitions_from_notebook(nb_dir, nb_name)


def build_olmo3_pair(nb_imports, cfg, hf_checkpoint=None):
    if Olmo3ForCausalLM is None:
        raise ImportError("transformers is required for the Olmo-3 debugger.")

    ours = nb_imports.Olmo3Model(cfg)
    hf_cfg = _hf_config_from_dict(cfg)

    if hf_checkpoint:
        hf_model = Olmo3ForCausalLM.from_pretrained(
            hf_checkpoint,
            torch_dtype=cfg.get("dtype", torch.float32),
            attn_implementation="eager",
        )
    else:
        hf_model = Olmo3ForCausalLM(hf_cfg)

    param_config = {"n_layers": cfg["n_layers"], "hidden_dim": cfg["hidden_dim"]}
    nb_imports.load_weights_into_olmo(ours, param_config, hf_model.state_dict())

    ours.eval()
    hf_model.eval()
    return ours, hf_model


def _attach_debug_hooks(model, is_hf):
    traces = {}
    handles = []

    def hook(name):
        def _record(_, __, output):
            traces[name] = output.detach().to(torch.float32).cpu()
        return _record

    if is_hf:
        core = model.model
        handles.append(core.embed_tokens.register_forward_hook(hook("embedding")))
        for idx, layer in enumerate(core.layers):
            handles.append(layer.register_forward_hook(hook(f"block_{idx}")))
        handles.append(core.norm.register_forward_hook(hook("final_norm")))
        handles.append(model.lm_head.register_forward_hook(hook("logits")))
    else:
        handles.append(model.tok_emb.register_forward_hook(hook("embedding")))
        for idx, block in enumerate(model.blocks):
            handles.append(block.register_forward_hook(hook(f"block_{idx}")))
        handles.append(model.final_norm.register_forward_hook(hook("final_norm")))
        handles.append(model.out_head.register_forward_hook(hook("logits")))

    return traces, handles


def _layer_sort_key(name):
    if name == "embedding":
        return (0, 0)
    if name.startswith("block_"):
        idx = int(name.split("_")[1])
        return (1, idx)
    if name == "final_norm":
        return (2, 0)
    if name == "logits":
        return (3, 0)
    return (4, name)


def layerwise_differences(ours, hf_model, input_ids, rtol=1e-5, atol=1e-5):
    ours_traces, ours_handles = _attach_debug_hooks(ours, is_hf=False)
    hf_traces, hf_handles = _attach_debug_hooks(hf_model, is_hf=True)

    try:
        with torch.inference_mode():
            ours(input_ids)
            hf_model(input_ids)
    finally:
        for h in ours_handles + hf_handles:
            h.remove()

    layer_names = sorted(set(ours_traces) | set(hf_traces), key=_layer_sort_key)
    results = []
    for name in layer_names:
        ours_tensor = ours_traces.get(name)
        hf_tensor = hf_traces.get(name)

        if ours_tensor is None or hf_tensor is None:
            results.append(
                {
                    "name": name,
                    "status": "missing",
                    "ours_shape": None if ours_tensor is None else tuple(ours_tensor.shape),
                    "hf_shape": None if hf_tensor is None else tuple(hf_tensor.shape),
                    "max_diff": None,
                    "mean_abs_diff": None,
                }
            )
            continue

        shapes_match = ours_tensor.shape == hf_tensor.shape
        if not shapes_match:
            results.append(
                {
                    "name": name,
                    "status": "shape_mismatch",
                    "ours_shape": tuple(ours_tensor.shape),
                    "hf_shape": tuple(hf_tensor.shape),
                    "max_diff": None,
                    "mean_abs_diff": None,
                }
            )
            continue

        diff = (ours_tensor - hf_tensor).abs()
        max_diff = float(diff.max().item())
        mean_diff = float(diff.mean().item())
        allclose = torch.allclose(ours_tensor, hf_tensor, rtol=rtol, atol=atol)
        results.append(
            {
                "name": name,
                "status": "ok" if allclose else "mismatch",
                "ours_shape": tuple(ours_tensor.shape),
                "hf_shape": tuple(hf_tensor.shape),
                "max_diff": max_diff,
                "mean_abs_diff": mean_diff,
            }
        )
    return results


def first_mismatch(differences):
    for diff in differences:
        if diff["status"] != "ok":
            return diff
    return None


def format_report(differences):
    lines = []
    for diff in sorted(differences, key=lambda d: _layer_sort_key(d["name"])):
        if diff["status"] == "ok":
            lines.append(f"[OK] {diff['name']}: max={diff['max_diff']:.2e}, mean={diff['mean_abs_diff']:.2e}")
        elif diff["status"] == "mismatch":
            lines.append(
                f"[DIFF] {diff['name']}: max={diff['max_diff']:.2e}, mean={diff['mean_abs_diff']:.2e}"
            )
        elif diff["status"] == "shape_mismatch":
            lines.append(
                f"[SHAPE] {diff['name']}: ours={diff['ours_shape']}, hf={diff['hf_shape']}"
            )
        else:
            lines.append(f"[MISSING] {diff['name']}: ours={diff['ours_shape']}, hf={diff['hf_shape']}")
    return "\n".join(lines)


if __name__ == "__main__":
    transformers_available = importlib.util.find_spec("transformers") is not None
    if not transformers_available:
        raise SystemExit("transformers is not installed; install it to run the debugger.")

    nb_imports = load_notebook_defs()
    cfg = tiny_debug_config()

    ours_model, hf_model = build_olmo3_pair(nb_imports, cfg)
    torch.manual_seed(0)
    input_ids = torch.randint(0, cfg["vocab_size"], (1, cfg["context_length"]), dtype=torch.long)
|
||||||
|
diffs = layerwise_differences(ours_model, hf_model, input_ids)
|
||||||
|
print(format_report(diffs))
|
||||||
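The typical workflow with this debugger is to scan the report for the first entry whose status is not `"ok"`: since each layer consumes the previous layer's output, the first diverging layer is the likely culprit and everything after it is noise. A minimal sketch of that triage, using hypothetical report entries in the same dict schema that `layerwise_differences` produces (the names and values below are illustrative, not real output):

```python
# Hypothetical report entries: earlier layers match, block_1 is the
# first to diverge, and the divergence propagates to the logits.
differences = [
    {"name": "embedding", "status": "ok", "max_diff": 0.0, "mean_abs_diff": 0.0},
    {"name": "block_0", "status": "ok", "max_diff": 2.1e-7, "mean_abs_diff": 3.4e-8},
    {"name": "block_1", "status": "mismatch", "max_diff": 0.13, "mean_abs_diff": 0.02},
    {"name": "logits", "status": "mismatch", "max_diff": 0.52, "mean_abs_diff": 0.07},
]


def first_mismatch(differences):
    # Return the first entry whose status is not "ok" (same logic as above).
    for diff in differences:
        if diff["status"] != "ok":
            return diff
    return None


culprit = first_mismatch(differences)
print(culprit["name"])  # -> block_1
```

Here the logits mismatch can be ignored; `block_1` is where to start looking (e.g. at its attention weights or RoPE parameters).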
142 ch05/13_olmo3/tests/test_olmo3_kvcache_nb.py Normal file
@@ -0,0 +1,142 @@
# Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
# Source for "Build a Large Language Model From Scratch"
#   - https://www.manning.com/books/build-a-large-language-model-from-scratch
# Code: https://github.com/rasbt/LLMs-from-scratch

import importlib
from pathlib import Path

import pytest
import torch

from llms_from_scratch.utils import import_definitions_from_notebook


transformers_installed = importlib.util.find_spec("transformers") is not None


@pytest.fixture
def nb_imports():
    nb_dir = Path(__file__).resolve().parents[1]
    mod = import_definitions_from_notebook(nb_dir, "standalone-olmo3-plus-kv-cache.ipynb")
    return mod


@pytest.fixture
def dummy_input():
    torch.manual_seed(123)
    return torch.randint(0, 100, (1, 8))  # batch size 1, seq length 8


@pytest.fixture
def dummy_cfg_base():
    return {
        "vocab_size": 100,
        "context_length": 64,
        "emb_dim": 32,
        "n_heads": 4,
        "n_layers": 2,
        "hidden_dim": 64,
        "head_dim": 8,
        "n_kv_heads": 1,  # 4 query heads, 1 KV group -> group_size = 4
        "attention_bias": False,
        "attention_dropout": 0.0,
        "sliding_window": 4,
        "layer_types": ["full_attention"] * 2,

        # RoPE config
        "rope_base": 10_000.0,
        "rope_attention_factor": 1.0,
        "rope_type": "default",
        "rope_factor": 1.0,
        "rope_orig_max": 64,
        "rms_norm_eps": 1e-6,
        "dtype": torch.float32,
    }


@torch.inference_mode()
def test_dummy_olmo3_forward(dummy_cfg_base, dummy_input, nb_imports):
    torch.manual_seed(123)
    model = nb_imports.Olmo3Model(dummy_cfg_base)
    out = model(dummy_input)
    assert out.shape == (1, dummy_input.size(1), dummy_cfg_base["vocab_size"]), \
        f"Expected shape (1, seq_len, vocab_size), got {out.shape}"


@torch.inference_mode()
@pytest.mark.skipif(not transformers_installed, reason="transformers not installed")
def test_olmo3_base_equivalence_with_transformers(nb_imports):
    from transformers import Olmo3Config, Olmo3ForCausalLM

    # Tiny config so the test is fast
    cfg = {
        "vocab_size": 257,
        "context_length": 8,
        "emb_dim": 32,
        "n_heads": 4,
        "n_layers": 2,
        "hidden_dim": 64,
        "head_dim": 8,
        "qk_norm": True,
        "n_kv_heads": 2,
        "sliding_window": 4,
        "layer_types": ["full_attention", "full_attention"],
        "dtype": torch.float32,
        "query_pre_attn_scalar": 256,

        # required by TransformerBlock
        "attention_bias": False,

        # required by RMSNorm and RoPE setup in Olmo3Model
        "rms_norm_eps": 1e-6,
        "rope_base": 1_000_000.0,
        "rope_attention_factor": 1.0,
        "rope_type": "default",
        "rope_factor": 1.0,
        "rope_orig_max": 8,

        # extra HF-only settings
        "rope_local_base": 10_000.0,
    }

    model = nb_imports.Olmo3Model(cfg)

    hf_cfg = Olmo3Config(
        vocab_size=cfg["vocab_size"],
        max_position_embeddings=cfg["context_length"],
        hidden_size=cfg["emb_dim"],
        num_attention_heads=cfg["n_heads"],
        num_hidden_layers=cfg["n_layers"],
        intermediate_size=cfg["hidden_dim"],
        head_dim=cfg["head_dim"],
        num_key_value_heads=cfg["n_kv_heads"],
        rope_theta=cfg["rope_base"],
        rope_local_base_freq=cfg["rope_local_base"],
        layer_types=cfg["layer_types"],
        sliding_window=cfg["sliding_window"],
        tie_word_embeddings=False,
        attn_implementation="eager",
        torch_dtype=torch.float32,
        query_pre_attn_scalar=cfg["query_pre_attn_scalar"],
        rope_scaling={"rope_type": "default"},
        qk_norm=cfg["qk_norm"],
        rms_norm_eps=cfg["rms_norm_eps"],
    )
    hf_model = Olmo3ForCausalLM(hf_cfg)

    hf_state = hf_model.state_dict()
    param_config = {
        "n_layers": cfg["n_layers"],
        "hidden_dim": cfg["hidden_dim"],
    }
    nb_imports.load_weights_into_olmo(model, param_config, hf_state)

    x = torch.randint(
        0,
        cfg["vocab_size"],
        (2, cfg["context_length"]),
        dtype=torch.long,
    )
    ours_logits = model(x)
    theirs_logits = hf_model(x).logits
    torch.testing.assert_close(ours_logits, theirs_logits, rtol=1e-5, atol=1e-5)
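The `"n_kv_heads": 1` comment above (4 query heads, 1 KV group, so group_size = 4) refers to grouped-query attention: keys and values are computed with fewer heads than the queries, and each KV head is shared by a group of query heads. A minimal sketch of the expansion step that makes the shapes line up (this illustrates the general idea, not the notebook's exact implementation; all tensor names below are made up):

```python
import torch

b, seq_len, head_dim = 1, 8, 8
n_heads, n_kv_heads = 4, 1
group_size = n_heads // n_kv_heads  # 4 query heads per KV head

# Keys (and likewise values) are produced with only n_kv_heads heads...
k = torch.randn(b, n_kv_heads, seq_len, head_dim)

# ...then each KV head is repeated group_size times along the head
# dimension so it matches the n_heads query heads.
k_expanded = k.repeat_interleave(group_size, dim=1)

print(k_expanded.shape)  # torch.Size([1, 4, 8, 8])
```

The memory saving comes from the KV cache: only the `n_kv_heads` tensors need to be stored, while the expansion happens on the fly at attention time.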
142 ch05/13_olmo3/tests/test_olmo3_nb.py Normal file
@@ -0,0 +1,142 @@
# Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
# Source for "Build a Large Language Model From Scratch"
#   - https://www.manning.com/books/build-a-large-language-model-from-scratch
# Code: https://github.com/rasbt/LLMs-from-scratch

import importlib
from pathlib import Path

import pytest
import torch

from llms_from_scratch.utils import import_definitions_from_notebook


transformers_installed = importlib.util.find_spec("transformers") is not None


@pytest.fixture
def nb_imports():
    nb_dir = Path(__file__).resolve().parents[1]
    mod = import_definitions_from_notebook(nb_dir, "standalone-olmo3.ipynb")
    return mod


@pytest.fixture
def dummy_input():
    torch.manual_seed(123)
    return torch.randint(0, 100, (1, 8))  # batch size 1, seq length 8


@pytest.fixture
def dummy_cfg_base():
    return {
        "vocab_size": 100,
        "context_length": 64,
        "emb_dim": 32,
        "n_heads": 4,
        "n_layers": 2,
        "hidden_dim": 64,
        "head_dim": 8,
        "n_kv_heads": 1,  # 4 query heads, 1 KV group -> group_size = 4
        "attention_bias": False,
        "attention_dropout": 0.0,
        "sliding_window": 4,
        "layer_types": ["full_attention"] * 2,

        # RoPE config
        "rope_base": 10_000.0,
        "rope_attention_factor": 1.0,
        "rope_type": "default",
        "rope_factor": 1.0,
        "rope_orig_max": 64,
        "rms_norm_eps": 1e-6,
        "dtype": torch.float32,
    }


@torch.inference_mode()
def test_dummy_olmo3_forward(dummy_cfg_base, dummy_input, nb_imports):
    torch.manual_seed(123)
    model = nb_imports.Olmo3Model(dummy_cfg_base)
    out = model(dummy_input)
    assert out.shape == (1, dummy_input.size(1), dummy_cfg_base["vocab_size"]), \
        f"Expected shape (1, seq_len, vocab_size), got {out.shape}"


@torch.inference_mode()
@pytest.mark.skipif(not transformers_installed, reason="transformers not installed")
def test_olmo3_base_equivalence_with_transformers(nb_imports):
    from transformers import Olmo3Config, Olmo3ForCausalLM

    # Tiny config so the test is fast
    cfg = {
        "vocab_size": 257,
        "context_length": 8,
        "emb_dim": 32,
        "n_heads": 4,
        "n_layers": 2,
        "hidden_dim": 64,
        "head_dim": 8,
        "qk_norm": True,
        "n_kv_heads": 2,
        "sliding_window": 4,
        "layer_types": ["full_attention", "full_attention"],
        "dtype": torch.float32,
        "query_pre_attn_scalar": 256,

        # required by TransformerBlock
        "attention_bias": False,

        # required by RMSNorm and RoPE setup in Olmo3Model
        "rms_norm_eps": 1e-6,
        "rope_base": 1_000_000.0,
        "rope_attention_factor": 1.0,
        "rope_type": "default",
        "rope_factor": 1.0,
        "rope_orig_max": 8,

        # extra HF-only settings
        "rope_local_base": 10_000.0,
    }

    model = nb_imports.Olmo3Model(cfg)

    hf_cfg = Olmo3Config(
        vocab_size=cfg["vocab_size"],
        max_position_embeddings=cfg["context_length"],
        hidden_size=cfg["emb_dim"],
        num_attention_heads=cfg["n_heads"],
        num_hidden_layers=cfg["n_layers"],
        intermediate_size=cfg["hidden_dim"],
        head_dim=cfg["head_dim"],
        num_key_value_heads=cfg["n_kv_heads"],
        rope_theta=cfg["rope_base"],
        rope_local_base_freq=cfg["rope_local_base"],
        layer_types=cfg["layer_types"],
        sliding_window=cfg["sliding_window"],
        tie_word_embeddings=False,
        attn_implementation="eager",
        torch_dtype=torch.float32,
        query_pre_attn_scalar=cfg["query_pre_attn_scalar"],
        rope_scaling={"rope_type": "default"},
        qk_norm=cfg["qk_norm"],
        rms_norm_eps=cfg["rms_norm_eps"],
    )
    hf_model = Olmo3ForCausalLM(hf_cfg)

    hf_state = hf_model.state_dict()
    param_config = {
        "n_layers": cfg["n_layers"],
        "hidden_dim": cfg["hidden_dim"],
    }
    nb_imports.load_weights_into_olmo(model, param_config, hf_state)

    x = torch.randint(
        0,
        cfg["vocab_size"],
        (2, cfg["context_length"]),
        dtype=torch.long,
    )
    ours_logits = model(x)
    theirs_logits = hf_model(x).logits
    torch.testing.assert_close(ours_logits, theirs_logits, rtol=1e-5, atol=1e-5)
@@ -13,14 +13,25 @@
 - [04_learning_rate_schedulers](04_learning_rate_schedulers) contains code implementing a more sophisticated training function including learning rate schedulers and gradient clipping
 - [05_bonus_hparam_tuning](05_bonus_hparam_tuning) contains an optional hyperparameter tuning script
 - [06_user_interface](06_user_interface) implements an interactive user interface to interact with the pretrained LLM
-- [07_gpt_to_llama](07_gpt_to_llama) contains a step-by-step guide for converting a GPT architecture implementation to Llama 3.2 and loads pretrained weights from Meta AI
 - [08_memory_efficient_weight_loading](08_memory_efficient_weight_loading) contains a bonus notebook showing how to load model weights via PyTorch's `load_state_dict` method more efficiently
 - [09_extending-tokenizers](09_extending-tokenizers) contains a from-scratch implementation of the GPT-2 BPE tokenizer
 - [10_llm-training-speed](10_llm-training-speed) shows PyTorch performance tips to improve the LLM training speed
+
+
+## LLM Architectures From Scratch
+
+<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/qwen/qwen-overview.webp">
+
+- [07_gpt_to_llama](07_gpt_to_llama) contains a step-by-step guide for converting a GPT architecture implementation to Llama 3.2 and loads pretrained weights from Meta AI
 - [11_qwen3](11_qwen3) A from-scratch implementation of Qwen3 0.6B and Qwen3 30B-A3B (Mixture-of-Experts) including code to load the pretrained weights of the base, reasoning, and coding model variants
 - [12_gemma3](12_gemma3) A from-scratch implementation of Gemma 3 270M and alternative with KV cache, including code to load the pretrained weights
+- [13_olmo3](13_olmo3) A from-scratch implementation of Olmo 3 7B and 32B (Base, Instruct, and Think variants) and alternative with KV cache, including code to load the pretrained weights


 ## Code-Along Video for This Chapter

 <br>
 <br>