LLMs-from-scratch/ch05/15_tiny-aya/standalone-tiny-aya.ipynb

{
"cells": [
{
"cell_type": "markdown",
"id": "e1b280ab-b61f-4d1a-bf7e-44e5f9ed3a5c",
"metadata": {
"id": "e1b280ab-b61f-4d1a-bf7e-44e5f9ed3a5c"
},
"source": [
"<table style=\"width:100%\">\n",
"<tr>\n",
"<td style=\"vertical-align:middle; text-align:left;\">\n",
"<font size=\"2\">\n",
"Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
"<br>Code repository: <a href=\"https://github.com/rasbt/LLMs-from-scratch\">https://github.com/rasbt/LLMs-from-scratch</a>\n",
"</font>\n",
"</td>\n",
"<td style=\"vertical-align:middle; text-align:left;\">\n",
"<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>\n",
"</td>\n",
"</tr>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"id": "efde77f2-6af3-4781-8597-89ecd3f41a52",
"metadata": {
"id": "efde77f2-6af3-4781-8597-89ecd3f41a52"
},
"source": [
"# Tiny Aya From Scratch (A Standalone Notebook)"
]
},
{
"cell_type": "markdown",
"id": "55cdef4d-de59-4a65-89f9-fa2a8ef3471d",
"metadata": {
"id": "55cdef4d-de59-4a65-89f9-fa2a8ef3471d"
},
"source": [
"- This notebook is purposefully minimal and focuses on the code to re-implement Tiny Aya (3.35B) models from Cohere in pure PyTorch without relying on other external LLM libraries; Tiny Aya is interesting because it is a small but strong model with good multi-lingual support\n",
"- For more information, see the official [Tiny Aya announcement](https://cohere.com/blog/cohere-labs-tiny-aya) and model cards:\n",
" - [tiny-aya-base](https://huggingface.co/CohereLabs/tiny-aya-base) (base model)\n",
" - [tiny-aya-global](https://huggingface.co/CohereLabs/tiny-aya-global) (best balance across languages and regions; notebook default)\n",
" - [tiny-aya-fire](https://huggingface.co/CohereLabs/tiny-aya-fire) (optimized for South Asian languages)\n",
" - [tiny-aya-water](https://huggingface.co/CohereLabs/tiny-aya-water) (optimized for European and Asia Pacific languages)\n",
" - [tiny-aya-earth](https://huggingface.co/CohereLabs/tiny-aya-earth) (optimized for West Asian and African languages)\n"
]
},
{
"cell_type": "markdown",
"id": "4e2a716d-31e6-4d28-be32-94585dcae082",
"metadata": {},
"source": [
"- Below is a table with more details regarding the language specialization (taken from their announcement blog post linked above)\n",
"\n",
"| Region | Languages | Optimized Model |\n",
"|---------------|-----------|----------------|\n",
"| **Asia Pacific** | Traditional Chinese, Cantonese, Vietnamese, Tagalog, Javanese, Khmer, Thai, Burmese, Malay, Korean, Lao, Indonesian, Simplified Chinese, Japanese | tiny-aya-water |\n",
"| **Africa** | Zulu, Amharic, Hausa, Igbo, Swahili, Xhosa, Wolof, Shona, Yoruba, Nigerian Pidgin, Malagasy | tiny-aya-earth |\n",
"| **South Asia** | Telugu, Marathi, Bengali, Tamil, Hindi, Punjabi, Gujarati, Urdu, Nepali | tiny-aya-fire |\n",
"| **Europe** | Catalan, Galician, Dutch, Danish, Finnish, Czech, Portuguese, French, Lithuanian, Slovak, Basque, English, Swedish, Polish, Spanish, Slovenian, Ukrainian, Greek, Bokmål, Romanian, Serbian, German, Italian, Russian, Irish, Hungarian, Bulgarian, Croatian, Estonian, Latvian, Welsh | tiny-aya-water |\n",
"| **West Asia** | Arabic, Maltese, Turkish, Hebrew, Persian | tiny-aya-earth |\n"
]
},
{
"cell_type": "markdown",
"id": "66b43549-585f-43ab-be19-addcc2dfc669",
"metadata": {},
"source": [
"- Below is a side-by-side comparison with Qwen3 4B as a reference model; if you are interested in the Qwen3 standalone notebook, you can find it [here](../11_qwen3)\n",
"<br>\n",
"\n",
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/tiny-aya/01.webp\" width=\"900px\">\n",
"\n",
" \n",
"- About the code:\n",
" - all code is my own code, mapping the Tiny Aya architecture onto the model code implemented in my [Build A Large Language Model (From Scratch)](http://mng.bz/orYv) book; the code is released under a permissive open-source Apache 2.0 license (see [LICENSE.txt](https://github.com/rasbt/LLMs-from-scratch/blob/main/LICENSE.txt))"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "7c201adb-747e-437b-9a62-442802941e01",
"metadata": {},
"outputs": [],
"source": [
"# pip install -r https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/refs/heads/main/ch05/07_gpt_to_llama/requirements-extra.txt"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "dd1b65a8-4301-444a-bd7c-a6f2bd1df9df",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "dd1b65a8-4301-444a-bd7c-a6f2bd1df9df",
"outputId": "4f762354-e0a3-4cc2-e5d4-e61a227a202c"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"huggingface_hub version: 1.4.1\n",
"tiktoken version: 0.12.0\n",
"torch version: 2.10.0\n"
]
}
],
"source": [
"from importlib.metadata import version\n",
"\n",
"pkgs = [\n",
" #\"blobfile\", # to download pretrained weights\n",
" \"huggingface_hub\", # to download pretrained weights\n",
" \"tiktoken\", # to implement the tokenizer\n",
" \"torch\", # to implement the model\n",
"]\n",
"for p in pkgs:\n",
" print(f\"{p} version: {version(p)}\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "574bc51e-876e-46c3-bcf7-ef4675582ad2",
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"\n",
"REPO_ID = \"CohereLabs/tiny-aya-global\"\n",
"#REPO_ID = \"CohereLabs/tiny-aya-fire\" \n",
"#REPO_ID = \"CohereLabs/tiny-aya-water\"\n",
"#REPO_ID = \"CohereLabs/tiny-aya-earth\"\n",
"\n",
"LOCAL_DIR = Path(REPO_ID).parts[-1]"
]
},
{
"cell_type": "markdown",
"id": "653410a6-dd2b-4eb2-a722-23d9782e726d",
"metadata": {
"id": "653410a6-dd2b-4eb2-a722-23d9782e726d"
},
"source": [
"&nbsp;\n",
"# 1. Architecture code"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "82076c21-9331-4dcd-b017-42b046cf1a60",
"metadata": {
"id": "82076c21-9331-4dcd-b017-42b046cf1a60"
},
"outputs": [],
"source": [
"import torch\n",
"import torch.nn as nn\n",
"\n",
"\n",
"class FeedForward(nn.Module):\n",
" def __init__(self, cfg):\n",
" super().__init__()\n",
" self.fc1 = nn.Linear(cfg[\"emb_dim\"], cfg[\"hidden_dim\"], dtype=cfg[\"dtype\"], bias=False)\n",
" self.fc2 = nn.Linear(cfg[\"emb_dim\"], cfg[\"hidden_dim\"], dtype=cfg[\"dtype\"], bias=False)\n",
" self.fc3 = nn.Linear(cfg[\"hidden_dim\"], cfg[\"emb_dim\"], dtype=cfg[\"dtype\"], bias=False)\n",
"\n",
" def forward(self, x):\n",
" x_fc1 = self.fc1(x)\n",
" x_fc2 = self.fc2(x)\n",
" x = nn.functional.silu(x_fc1) * x_fc2\n",
" return self.fc3(x)"
]
},
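{
"cell_type": "markdown",
"id": "ffn-swiglu-note",
"metadata": {},
"source": [
"- As a side note (added for clarity), the module above implements the SwiGLU feed-forward variant, where `fc1` acts as the gate projection and `fc2` as the up projection:\n",
"\n",
"$$\\mathrm{FFN}(x) = W_{3}\\left(\\mathrm{SiLU}(W_{1}x) \\odot W_{2}x\\right), \\qquad \\mathrm{SiLU}(z) = z \\cdot \\sigma(z)$$"
]
},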
{
"cell_type": "code",
"execution_count": 5,
"id": "1a36d4a0-ee44-4727-ab7e-c73dd5e1ddba",
"metadata": {},
"outputs": [],
"source": [
"# Aya uses a bias-less LayerNorm variant. \n",
"# The difference to classic LayerNorm is that it only \n",
"# has a scale parameter (weight), no shift parameter (bias).\n",
"\n",
"class CohereLayerNorm(nn.Module):\n",
" def __init__(self, emb_dim, eps=1e-5):\n",
" super().__init__()\n",
" self.eps = eps\n",
" self.weight = nn.Parameter(torch.ones(emb_dim))\n",
"\n",
" def forward(self, x):\n",
" input_dtype = x.dtype\n",
" x = x.to(torch.float32)\n",
" mean = x.mean(dim=-1, keepdim=True)\n",
" variance = (x - mean).pow(2).mean(dim=-1, keepdim=True)\n",
" x = (x - mean) * torch.rsqrt(variance + self.eps)\n",
" return (self.weight.to(torch.float32) * x).to(input_dtype)"
]
},
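{
"cell_type": "markdown",
"id": "layernorm-math-note",
"metadata": {},
"source": [
"- As a quick reference (added for clarity), the bias-less LayerNorm above computes, for each token vector $x \\in \\mathbb{R}^{d}$:\n",
"\n",
"$$\\mu = \\frac{1}{d}\\sum_{i=1}^{d} x_i, \\qquad \\sigma^2 = \\frac{1}{d}\\sum_{i=1}^{d} (x_i - \\mu)^2, \\qquad y_i = w_i \\cdot \\frac{x_i - \\mu}{\\sqrt{\\sigma^2 + \\epsilon}}$$\n",
"\n",
"- Compared to classic LayerNorm, the shift term that would normally be added to $y_i$ is omitted; only the scale $w_i$ is learned"
]
},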
{
"cell_type": "code",
"execution_count": 6,
"id": "4b9a346f-5826-4083-9162-abd56afc03f0",
"metadata": {
"id": "4b9a346f-5826-4083-9162-abd56afc03f0"
},
"outputs": [],
"source": [
"def compute_rope_params(head_dim, theta_base=10_000, context_length=4096, dtype=torch.float32):\n",
" assert head_dim % 2 == 0, \"head_dim must be even\"\n",
"\n",
" # Compute the inverse frequencies\n",
" inv_freq = 1.0 / (\n",
" theta_base ** (torch.arange(0, head_dim, 2, dtype=dtype)[: (head_dim // 2)].float() / head_dim)\n",
" )\n",
" positions = torch.arange(context_length, dtype=dtype)\n",
"\n",
" # Compute the angles\n",
" angles = positions.unsqueeze(1) * inv_freq.unsqueeze(0) # Shape: (context_length, head_dim // 2)\n",
"\n",
" # Cohere uses interleaved even/odd angle layout per head-dim pair.\n",
" # Llama2 notebook examples often use a split-halves layout via cat([angles, angles]).\n",
" # Both are equivalent only when paired with the matching rotate logic:\n",
" # - interleaved layout -> even/odd rotation implementation (below)\n",
" # - split-halves layout -> half/half rotate implementation\n",
" angles = torch.repeat_interleave(angles, 2, dim=1) # Shape: (context_length, head_dim)\n",
"\n",
" # Precompute sine and cosine\n",
" return torch.cos(angles), torch.sin(angles)\n",
"\n",
"def apply_rope(x, cos, sin):\n",
" # x: (batch_size, num_heads, seq_len, head_dim)\n",
" batch_size, num_heads, seq_len, head_dim = x.shape\n",
" assert head_dim % 2 == 0, \"head_dim must be even\"\n",
"\n",
" # Split x into even and odd components (interleaved layout)\n",
" x_even = x[..., ::2]\n",
" x_odd = x[..., 1::2]\n",
"\n",
" # Adjust sin and cos shapes\n",
" cos = cos[:seq_len, :].unsqueeze(0).unsqueeze(0)\n",
" sin = sin[:seq_len, :].unsqueeze(0).unsqueeze(0)\n",
"\n",
" # Apply the rotary transformation\n",
" x_float = x.float()\n",
" rotated = torch.stack((-x_odd.float(), x_even.float()), dim=-1).flatten(-2)\n",
" x_rotated = (x_float * cos) + (rotated * sin)\n",
"\n",
" return x_rotated.to(dtype=x.dtype)"
]
},
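{
"cell_type": "markdown",
"id": "rope-math-note",
"metadata": {},
"source": [
"- To make the interleaved rotation above concrete: each even/odd pair $(x_{2i}, x_{2i+1})$ of a head vector at position $p$ is rotated by the angle $\\theta_{p,i} = p \\cdot b^{-2i/d}$ (with RoPE base $b$ and head dimension $d$):\n",
"\n",
"$$\\begin{pmatrix} x'_{2i} \\\\ x'_{2i+1} \\end{pmatrix} = \\begin{pmatrix} \\cos\\theta_{p,i} & -\\sin\\theta_{p,i} \\\\ \\sin\\theta_{p,i} & \\cos\\theta_{p,i} \\end{pmatrix} \\begin{pmatrix} x_{2i} \\\\ x_{2i+1} \\end{pmatrix}$$\n",
"\n",
"- This is exactly what `apply_rope` computes: `x_float * cos` contributes the $\\cos$ terms, and the interleaved `(-x_odd, x_even)` tensor multiplied by `sin` contributes the $\\sin$ terms"
]
},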
{
"cell_type": "code",
"execution_count": 7,
"id": "e8169ab5-f976-4222-a2e1-eb1cabf267cb",
"metadata": {
"id": "e8169ab5-f976-4222-a2e1-eb1cabf267cb"
},
"outputs": [],
"source": [
"class GroupedQueryAttention(nn.Module):\n",
" def __init__(\n",
" self,\n",
" d_in,\n",
" num_heads,\n",
" num_kv_groups,\n",
" head_dim=None,\n",
" qk_norm=False,\n",
" attention_bias=False,\n",
" dtype=None,\n",
" attn_type=\"full_attention\",\n",
" ):\n",
" super().__init__()\n",
" assert num_heads % num_kv_groups == 0, \"num_heads must be divisible by num_kv_groups\"\n",
"\n",
" self.num_heads = num_heads\n",
" self.num_kv_groups = num_kv_groups\n",
" self.group_size = num_heads // num_kv_groups\n",
"\n",
" if head_dim is None:\n",
" assert d_in % num_heads == 0, \"`d_in` must be divisible by `num_heads` if `head_dim` is not set\"\n",
" head_dim = d_in // num_heads\n",
"\n",
" self.head_dim = head_dim\n",
" self.d_out = num_heads * head_dim\n",
" self.attn_type = attn_type\n",
"\n",
" self.W_query = nn.Linear(\n",
" d_in,\n",
" self.d_out,\n",
" bias=attention_bias,\n",
" dtype=dtype,\n",
" )\n",
" self.W_key = nn.Linear(\n",
" d_in,\n",
" num_kv_groups * head_dim,\n",
" bias=attention_bias,\n",
" dtype=dtype,\n",
" )\n",
" self.W_value = nn.Linear(\n",
" d_in,\n",
" num_kv_groups * head_dim,\n",
" bias=attention_bias,\n",
" dtype=dtype,\n",
" )\n",
" self.out_proj = nn.Linear(\n",
" self.d_out,\n",
" d_in,\n",
" bias=attention_bias,\n",
" dtype=dtype,\n",
" )\n",
"\n",
" if qk_norm:\n",
" self.q_norm = CohereLayerNorm(head_dim, eps=1e-6)\n",
" self.k_norm = CohereLayerNorm(head_dim, eps=1e-6)\n",
" else:\n",
" self.q_norm = self.k_norm = None\n",
"\n",
" def forward(self, x, mask, cos, sin):\n",
" b, num_tokens, _ = x.shape\n",
"\n",
" # Apply projections\n",
" queries = self.W_query(x) # (b, num_tokens, num_heads * head_dim)\n",
" keys = self.W_key(x) # (b, num_tokens, num_kv_groups * head_dim)\n",
" values = self.W_value(x) # (b, num_tokens, num_kv_groups * head_dim)\n",
"\n",
" # Reshape\n",
" queries = queries.view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)\n",
" keys = keys.view(b, num_tokens, self.num_kv_groups, self.head_dim).transpose(1, 2)\n",
" values = values.view(b, num_tokens, self.num_kv_groups, self.head_dim).transpose(1, 2)\n",
"\n",
" # Optional normalization\n",
" if self.q_norm:\n",
" queries = self.q_norm(queries)\n",
" if self.k_norm:\n",
" keys = self.k_norm(keys)\n",
"\n",
" # Cohere applies RoPE only on sliding-attention layers.\n",
" if self.attn_type == \"sliding_attention\":\n",
" queries = apply_rope(queries, cos, sin)\n",
" keys = apply_rope(keys, cos, sin)\n",
"\n",
" # Expand K and V to match number of heads\n",
" keys = keys.repeat_interleave(self.group_size, dim=1)\n",
" values = values.repeat_interleave(self.group_size, dim=1)\n",
"\n",
" # Attention\n",
" attn_scores = queries @ keys.transpose(2, 3)\n",
" attn_scores = attn_scores.masked_fill(mask, -torch.inf)\n",
"\n",
" attn_weights = torch.softmax(attn_scores / self.head_dim**0.5, dim=-1)\n",
" context = (attn_weights @ values).transpose(1, 2).reshape(b, num_tokens, self.d_out)\n",
"\n",
" return self.out_proj(context)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "457cb2f8-50c1-4045-8a74-f181bfb5fea9",
"metadata": {
"id": "457cb2f8-50c1-4045-8a74-f181bfb5fea9"
},
"outputs": [],
"source": [
"class TransformerBlock(nn.Module):\n",
" def __init__(self, cfg, attn_type):\n",
" super().__init__()\n",
" self.attn_type = attn_type\n",
"\n",
" self.att = GroupedQueryAttention(\n",
" d_in=cfg[\"emb_dim\"],\n",
" num_heads=cfg[\"n_heads\"],\n",
" num_kv_groups=cfg[\"n_kv_heads\"],\n",
" head_dim=cfg[\"head_dim\"],\n",
" qk_norm=False,\n",
" attention_bias=cfg[\"attention_bias\"],\n",
" dtype=cfg[\"dtype\"],\n",
" attn_type=attn_type,\n",
" )\n",
" self.ff = FeedForward(cfg)\n",
" self.input_layernorm = CohereLayerNorm(cfg[\"emb_dim\"], eps=cfg[\"layer_norm_eps\"])\n",
"\n",
" def forward(self, x, mask_global, mask_local, cos, sin):\n",
" attn_mask = mask_local if self.attn_type == \"sliding_attention\" else mask_global\n",
"\n",
" shortcut = x\n",
" x = self.input_layernorm(x)\n",
" x_attn = self.att(x, attn_mask, cos, sin) # Shape [batch_size, num_tokens, emb_dim]\n",
" x_ff = self.ff(x)\n",
"\n",
" # Cohere parallel residual block\n",
" x = shortcut + x_attn + x_ff\n",
" return x"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "e88de3e3-9f07-42cc-816b-28dbd46e96c4",
"metadata": {
"id": "e88de3e3-9f07-42cc-816b-28dbd46e96c4"
},
"outputs": [],
"source": [
"class TinyAyaModel(nn.Module):\n",
" def __init__(self, cfg):\n",
" super().__init__()\n",
" assert len(cfg[\"layer_types\"]) == cfg[\"n_layers\"], \"layer_types must match n_layers\"\n",
"\n",
" self.cfg = cfg\n",
"\n",
" self.tok_emb = nn.Embedding(cfg[\"vocab_size\"], cfg[\"emb_dim\"], dtype=cfg[\"dtype\"])\n",
" self.trf_blocks = nn.ModuleList([TransformerBlock(cfg, t) for t in cfg[\"layer_types\"]])\n",
"\n",
" self.final_norm = CohereLayerNorm(cfg[\"emb_dim\"], eps=cfg[\"layer_norm_eps\"])\n",
" self.out_head = nn.Linear(cfg[\"emb_dim\"], cfg[\"vocab_size\"], bias=False, dtype=cfg[\"dtype\"])\n",
"\n",
" self.logit_scale = cfg[\"logit_scale\"]\n",
"\n",
" cos, sin = compute_rope_params(\n",
" head_dim=cfg[\"head_dim\"],\n",
" theta_base=cfg[\"rope_base\"],\n",
" context_length=cfg[\"context_length\"],\n",
" )\n",
" self.register_buffer(\"cos\", cos, persistent=False)\n",
" self.register_buffer(\"sin\", sin, persistent=False)\n",
"\n",
" if cfg[\"tie_word_embeddings\"]:\n",
" self.out_head.weight = self.tok_emb.weight\n",
"\n",
" def create_masks(self, num_tokens, device):\n",
" ones = torch.ones((num_tokens, num_tokens), dtype=torch.bool, device=device)\n",
"\n",
" # Future mask\n",
" mask_global = torch.triu(ones, diagonal=1)\n",
"\n",
" # Sliding-window mask\n",
" far_past = torch.triu(ones, diagonal=self.cfg[\"sliding_window\"]).T\n",
" mask_local = mask_global | far_past\n",
"\n",
" # Expand to [batch, heads, seq, seq]-broadcastable shape\n",
" return mask_global.unsqueeze(0).unsqueeze(0), mask_local.unsqueeze(0).unsqueeze(0)\n",
"\n",
" def forward(self, input_ids, attention_mask=None):\n",
" tok_embeds = self.tok_emb(input_ids)\n",
" x = tok_embeds\n",
" num_tokens = input_ids.shape[1]\n",
"\n",
" mask_global, mask_local = self.create_masks(num_tokens, x.device)\n",
"\n",
" if attention_mask is not None:\n",
" # True means mask in this implementation.\n",
" pad_mask = attention_mask[:, None, None, :].to(dtype=torch.bool).logical_not()\n",
" mask_global = mask_global | pad_mask\n",
" mask_local = mask_local | pad_mask\n",
"\n",
" cos = self.cos[:num_tokens, :].to(x.device, dtype=x.dtype)\n",
" sin = self.sin[:num_tokens, :].to(x.device, dtype=x.dtype)\n",
"\n",
" for block in self.trf_blocks:\n",
" x = block(x, mask_global, mask_local, cos, sin)\n",
"\n",
" x = self.final_norm(x)\n",
" logits = self.out_head(x.to(self.cfg[\"dtype\"]))\n",
" return logits * self.logit_scale"
]
},
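{
"cell_type": "markdown",
"id": "mask-toy-example",
"metadata": {},
"source": [
"- To illustrate `create_masks` with toy values (sequence length 5 and sliding window 3, instead of the actual 500,000 and 4,096), the two boolean masks look as follows, where 1 means the position is masked out:\n",
"\n",
"$$M_{\\text{global}} = \\begin{pmatrix} 0&1&1&1&1 \\\\ 0&0&1&1&1 \\\\ 0&0&0&1&1 \\\\ 0&0&0&0&1 \\\\ 0&0&0&0&0 \\end{pmatrix}, \\qquad M_{\\text{local}} = \\begin{pmatrix} 0&1&1&1&1 \\\\ 0&0&1&1&1 \\\\ 0&0&0&1&1 \\\\ 1&0&0&0&1 \\\\ 1&1&0&0&0 \\end{pmatrix}$$\n",
"\n",
"- In the local mask, each query position can attend to at most the last 3 positions (including itself); this is what the `far_past` term adds on top of the causal mask"
]
},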
{
"cell_type": "markdown",
"id": "be2d201f-74ad-4d63-ab9c-601b00674a48",
"metadata": {
"id": "be2d201f-74ad-4d63-ab9c-601b00674a48"
},
"source": [
"&nbsp;\n",
"# 2. Initialize model"
]
},
{
"cell_type": "markdown",
"id": "23dea40c-fe20-4a75-be25-d6fce5863c01",
"metadata": {
"id": "23dea40c-fe20-4a75-be25-d6fce5863c01"
},
"source": [
"- The remainder of this notebook uses the Llama 3.2 1B model; to use the 3B model variant, just uncomment the second configuration file in the following code cell"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "caa142fa-b375-4e78-b392-2072ced666f3",
"metadata": {
"id": "caa142fa-b375-4e78-b392-2072ced666f3"
},
"outputs": [],
"source": [
"TINY_AYA_CONFIG = {\n",
" \"vocab_size\": 262_144, # Vocabulary size\n",
" \"context_length\": 500_000, # Context length in the HF config\n",
" \"emb_dim\": 2048, # Embedding dimension\n",
" \"n_heads\": 16, # Number of attention heads\n",
" \"n_layers\": 36, # Number of layers\n",
" \"hidden_dim\": 11_008, # Size of the intermediate dimension in FeedForward\n",
" \"head_dim\": 128, # Size of the heads in GQA\n",
" \"n_kv_heads\": 4, # Number of KV heads for grouped-query attention\n",
" \"attention_bias\": False, # Whether attention projections use bias terms\n",
" \"attention_dropout\": 0.0, # Attention dropout\n",
" \"sliding_window\": 4096, # Sliding-window attention context\n",
" \"layer_types\": [\n",
" \"sliding_attention\",\n",
" \"sliding_attention\",\n",
" \"sliding_attention\",\n",
" \"full_attention\",\n",
" \"sliding_attention\",\n",
" \"sliding_attention\",\n",
" \"sliding_attention\",\n",
" \"full_attention\",\n",
" \"sliding_attention\",\n",
" \"sliding_attention\",\n",
" \"sliding_attention\",\n",
" \"full_attention\",\n",
" \"sliding_attention\",\n",
" \"sliding_attention\",\n",
" \"sliding_attention\",\n",
" \"full_attention\",\n",
" \"sliding_attention\",\n",
" \"sliding_attention\",\n",
" \"sliding_attention\",\n",
" \"full_attention\",\n",
" \"sliding_attention\",\n",
" \"sliding_attention\",\n",
" \"sliding_attention\",\n",
" \"full_attention\",\n",
" \"sliding_attention\",\n",
" \"sliding_attention\",\n",
" \"sliding_attention\",\n",
" \"full_attention\",\n",
" \"sliding_attention\",\n",
" \"sliding_attention\",\n",
" \"sliding_attention\",\n",
" \"full_attention\",\n",
" \"sliding_attention\",\n",
" \"sliding_attention\",\n",
" \"sliding_attention\",\n",
" \"full_attention\",\n",
" ],\n",
" \"rope_base\": 50_000.0, # The base in RoPE's \"theta\"\n",
" \"layer_norm_eps\": 1e-5, # Epsilon used by layer normalization\n",
" \"logit_scale\": 1.0, # Final logits scaling factor\n",
" \"tie_word_embeddings\": True, # Whether input embedding and output head are tied\n",
" \"bos_token_id\": 2,\n",
" \"eos_token_id\": 3,\n",
" \"pad_token_id\": 0,\n",
" \"dtype\": torch.bfloat16, # Lower-precision dtype to reduce memory usage\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "156253fe-aacd-4da2-8f13-705f05c4b11e",
"metadata": {
"id": "156253fe-aacd-4da2-8f13-705f05c4b11e"
},
"outputs": [],
"source": [
"model = TinyAyaModel(TINY_AYA_CONFIG)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "fd5efb03-5a07-46e8-8607-93ed47549d2b",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "fd5efb03-5a07-46e8-8607-93ed47549d2b",
"outputId": "65c1a95e-b502-4150-9e2e-da619d9053d5"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"float32 (PyTorch default): 25.43 GB\n",
"bfloat16: 12.72 GB\n"
]
}
],
"source": [
"def calc_model_memory_size(model, input_dtype=torch.float32):\n",
" total_params = 0\n",
" total_grads = 0\n",
" for param in model.parameters():\n",
" # Calculate total number of elements per parameter\n",
" param_size = param.numel()\n",
" total_params += param_size\n",
" # Check if gradients are stored for this parameter\n",
" if param.requires_grad:\n",
" total_grads += param_size\n",
"\n",
" # Calculate buffer size (non-parameters that require memory)\n",
" total_buffers = sum(buf.numel() for buf in model.buffers())\n",
"\n",
" # Size in bytes = (Number of elements) * (Size of each element in bytes)\n",
" # We assume parameters and gradients are stored in the same type as input dtype\n",
" element_size = torch.tensor(0, dtype=input_dtype).element_size()\n",
" total_memory_bytes = (total_params + total_grads + total_buffers) * element_size\n",
"\n",
" # Convert bytes to gigabytes\n",
" total_memory_gb = total_memory_bytes / (1024**3)\n",
"\n",
" return total_memory_gb\n",
"\n",
"print(f\"float32 (PyTorch default): {calc_model_memory_size(model, input_dtype=torch.float32):.2f} GB\")\n",
"print(f\"bfloat16: {calc_model_memory_size(model, input_dtype=torch.bfloat16):.2f} GB\")"
]
},
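{
"cell_type": "markdown",
"id": "memory-arithmetic-note",
"metadata": {},
"source": [
"- As a rough cross-check of the float32 number above: the model has 3,349,227,520 (deduplicated) parameters, the function assumes one gradient value per parameter, and the precomputed RoPE buffers `cos` and `sin` contribute $2 \\times 500{,}000 \\times 128 = 128{,}000{,}000$ elements, so\n",
"\n",
"$$(3{,}349{,}227{,}520 \\times 2 + 128{,}000{,}000) \\times 4\\ \\text{bytes} \\approx 25.43\\ \\text{GB}$$"
]
},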
{
"cell_type": "code",
"execution_count": 13,
"id": "41176fb0-d58a-443a-912f-4f436564b5f8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total number of parameters: 3,349,227,520\n",
"\n",
"Total number of unique parameters: 2,812,356,608\n"
]
}
],
"source": [
"total_params = sum(p.numel() for p in model.parameters())\n",
"print(f\"Total number of parameters: {total_params:,}\")\n",
"\n",
"# Account for weight tying\n",
"total_params_normalized = total_params - model.tok_emb.weight.numel()\n",
"print(f\"\\nTotal number of unique parameters: {total_params_normalized:,}\")"
]
},
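{
"cell_type": "markdown",
"id": "param-arithmetic-note",
"metadata": {},
"source": [
"- Cross-checking the arithmetic: the embedding matrix has $262{,}144 \\times 2{,}048 = 536{,}870{,}912$ entries, and because the output head shares this matrix via weight tying, subtracting it once gives the number of unique parameters:\n",
"\n",
"$$3{,}349{,}227{,}520 - 536{,}870{,}912 = 2{,}812{,}356{,}608$$"
]
},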
{
"cell_type": "code",
"execution_count": 14,
"id": "31f12baf-f79b-499f-85c0-51328a6a20f5",
"metadata": {
"id": "31f12baf-f79b-499f-85c0-51328a6a20f5"
},
"outputs": [],
"source": [
"if torch.cuda.is_available():\n",
" device = torch.device(\"cuda\")\n",
"elif torch.backends.mps.is_available():\n",
" device = torch.device(\"mps\")\n",
"else:\n",
" device = torch.device(\"cpu\")\n",
"\n",
"model.to(device);"
]
},
{
"cell_type": "markdown",
"id": "78e091e1-afa8-4d23-9aea-cced86181bfd",
"metadata": {
"id": "78e091e1-afa8-4d23-9aea-cced86181bfd"
},
"source": [
"&nbsp;\n",
"# 3. Load tokenizer"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "9482b01c-49f9-48e4-ab2c-4a4c75240e77",
"metadata": {
"id": "9482b01c-49f9-48e4-ab2c-4a4c75240e77"
},
"outputs": [],
"source": [
"from tokenizers import Tokenizer\n",
"\n",
"\n",
"class TinyAyaTokenizer:\n",
" def __init__(self, tokenizer_file_path, eos_token_id=3, pad_token_id=0, bos_token_id=2):\n",
" tok_file = Path(tokenizer_file_path)\n",
" self._tok = Tokenizer.from_file(str(tok_file))\n",
"\n",
" eos_from_tok = self._tok.token_to_id(\"<EOS_TOKEN>\")\n",
" pad_from_tok = self._tok.token_to_id(\"<PAD>\")\n",
" bos_from_tok = self._tok.token_to_id(\"<BOS_TOKEN>\")\n",
"\n",
" self.eos_token_id = eos_from_tok if eos_from_tok is not None else eos_token_id\n",
" self.pad_token_id = pad_from_tok if pad_from_tok is not None else pad_token_id\n",
" self.bos_token_id = bos_from_tok if bos_from_tok is not None else bos_token_id\n",
"\n",
" def encode(self, text):\n",
" return self._tok.encode(text).ids\n",
"\n",
" def decode(self, ids):\n",
" return self._tok.decode(ids, skip_special_tokens=False)\n",
"\n",
"\n",
"def apply_chat_template(user_text):\n",
" return (\n",
" \"<BOS_TOKEN>\"\n",
" \"<|START_OF_TURN_TOKEN|><|USER_TOKEN|>\"\n",
" f\"{user_text}\"\n",
" \"<|END_OF_TURN_TOKEN|>\"\n",
" \"<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|><|START_RESPONSE|>\"\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "b771b60c-c198-4b30-bf10-42031197ae86",
"metadata": {
"id": "b771b60c-c198-4b30-bf10-42031197ae86"
},
"source": [
"- Please note that Cohere requires that you accept the Tiny Aya licensing terms before you can download the files; to do this, you have to create a Hugging Face Hub account and visit the [CohereLabs/tiny-aya-global](https://huggingface.co/CohereLabs/tiny-aya-global) repository to accept the terms\n",
"- Next, you will need to create an access token; to generate an access token with READ permissions, click on the profile picture in the upper right and click on \"Settings\"\n",
"\n",
"\n",
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gpt-to-llama/settings.webp?1\" width=\"300px\">\n",
"\n",
"- Then, create and copy the access token so you can copy & paste it into the next code cell\n",
"\n",
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gpt-to-llama/access-token.webp?1\" width=\"600px\">"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "05104b25-71fb-462f-8f2d-336184833eda",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CohereLabs/tiny-aya-global\n"
]
}
],
"source": [
"print(REPO_ID)"
]
},
{
"cell_type": "markdown",
"id": "7e327c26-ae3e-4f07-845f-eeb4a6b31283",
"metadata": {},
"source": [
"- Note that if you use the fire, water, base, or earth model, you'd have to accept the licensing terms separately:\n",
" - [CohereLabs/tiny-aya-fire](https://huggingface.co/CohereLabs/tiny-aya-fire)\n",
" - [CohereLabs/tiny-aya-water](https://huggingface.co/CohereLabs/tiny-aya-water)\n",
" - [CohereLabs/tiny-aya-earth](https://huggingface.co/CohereLabs/tiny-aya-earth)\n",
" - [CohereLabs/tiny-aya-base](https://huggingface.co/CohereLabs/tiny-aya-base)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "e9d96dc8-603a-4cb5-8c3e-4d2ca56862ed",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "e9d96dc8-603a-4cb5-8c3e-4d2ca56862ed",
"outputId": "e6e6dc05-7330-45bc-a9a7-331919155bdd"
},
"outputs": [],
"source": [
"# Uncomment and run the following code if you are executing the notebook for the first time\n",
"\n",
"from huggingface_hub import login\n",
"login()"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "986bc1a0-804f-4154-80f8-44cefbee1368",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 141,
"referenced_widgets": [
"a1608feac06d4687967a3e398f01c489",
"518fb202e4b44aaba47f07d1a61b6762",
"672cdc5aea954de3af851c001a667ad3",
"eebf8874618746b39cf4a21a2728dc7f",
"5176834aa8784bba9ec21234b87a8948",
"e2dc407afcd945c798e30597fddfcb3c",
"0dccd57dcc5c43a588157cef957c07e8",
"33ca0cdf2c7f41598a381c4ebe6a4ee1",
"ee44487f58454dacb522b1e084ffb733",
"d2c41e71a3f441deaed091b620ac5603",
"3326b6141a1a4eba9f316df528a9b99a"
]
},
"id": "986bc1a0-804f-4154-80f8-44cefbee1368",
"outputId": "5dd7334b-4c71-465a-94d2-c3e95b9ddc58"
},
"outputs": [],
"source": [
"from huggingface_hub import hf_hub_download\n",
"\n",
"tokenizer_file_path = Path(LOCAL_DIR) / \"tokenizer.json\"\n",
"if not tokenizer_file_path.exists():\n",
" try:\n",
" tokenizer_file_path = hf_hub_download(repo_id=REPO_ID, filename=\"tokenizer.json\", local_dir=LOCAL_DIR)\n",
" except Exception as e:\n",
" print(f\"Warning: failed to download tokenizer.json: {e}\")\n",
" tokenizer_file_path = \"tokenizer.json\""
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "_gBhxDtU_nxo",
"metadata": {
"id": "_gBhxDtU_nxo"
},
"outputs": [
{
"data": {
"text/plain": [
"'<BOS_TOKEN><BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Give me a short introduction to large language models in 3 sentences.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|><|START_RESPONSE|>'"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tokenizer = TinyAyaTokenizer(\n",
" tokenizer_file_path=Path(LOCAL_DIR) / \"tokenizer.json\",\n",
" eos_token_id=TINY_AYA_CONFIG[\"eos_token_id\"],\n",
" pad_token_id=TINY_AYA_CONFIG[\"pad_token_id\"],\n",
" bos_token_id=TINY_AYA_CONFIG[\"bos_token_id\"],\n",
")\n",
"\n",
"prompt = apply_chat_template(\"Give me a short introduction to large language models in 3 sentences.\")\n",
"input_token_ids = tokenizer.encode(prompt)\n",
"text = tokenizer.decode(input_token_ids)\n",
"text"
]
},
{
"cell_type": "markdown",
"id": "c172f89f-d301-439f-b809-46169e5f5945",
"metadata": {
"id": "c172f89f-d301-439f-b809-46169e5f5945"
},
"source": [
"&nbsp;\n",
"# 4. Load pretrained weights"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "75166128-5899-4995-9b88-9672e135650e",
"metadata": {
"id": "75166128-5899-4995-9b88-9672e135650e"
},
"outputs": [],
"source": [
"def load_weights_into_tiny_aya(model, param_config, params):\n",
" def assign(left, right, tensor_name=\"unknown\"):\n",
" if left.shape != right.shape:\n",
" raise ValueError(\n",
" f\"Shape mismatch in tensor '{tensor_name}'. Left: {left.shape}, Right: {right.shape}\"\n",
" )\n",
"\n",
" with torch.no_grad():\n",
" if isinstance(right, torch.Tensor):\n",
" left.copy_(right.to(dtype=left.dtype, device=left.device))\n",
" else:\n",
" left.copy_(torch.as_tensor(right, dtype=left.dtype, device=left.device))\n",
"\n",
" return left\n",
"\n",
" model.tok_emb.weight = assign(\n",
" model.tok_emb.weight,\n",
" params[\"model.embed_tokens.weight\"],\n",
" \"model.embed_tokens.weight\",\n",
" )\n",
"\n",
" for l in range(param_config[\"n_layers\"]):\n",
" block = model.trf_blocks[l]\n",
" att = block.att\n",
"\n",
" # Q, K, V projections\n",
" att.W_query.weight = assign(\n",
" att.W_query.weight,\n",
" params[f\"model.layers.{l}.self_attn.q_proj.weight\"],\n",
" f\"model.layers.{l}.self_attn.q_proj.weight\",\n",
" )\n",
" att.W_key.weight = assign(\n",
" att.W_key.weight,\n",
" params[f\"model.layers.{l}.self_attn.k_proj.weight\"],\n",
" f\"model.layers.{l}.self_attn.k_proj.weight\",\n",
" )\n",
" att.W_value.weight = assign(\n",
" att.W_value.weight,\n",
" params[f\"model.layers.{l}.self_attn.v_proj.weight\"],\n",
" f\"model.layers.{l}.self_attn.v_proj.weight\",\n",
" )\n",
"\n",
" # Output projection\n",
" att.out_proj.weight = assign(\n",
" att.out_proj.weight,\n",
" params[f\"model.layers.{l}.self_attn.o_proj.weight\"],\n",
" f\"model.layers.{l}.self_attn.o_proj.weight\",\n",
" )\n",
"\n",
" # Feedforward weights\n",
" block.ff.fc1.weight = assign(\n",
" block.ff.fc1.weight,\n",
" params[f\"model.layers.{l}.mlp.gate_proj.weight\"],\n",
" f\"model.layers.{l}.mlp.gate_proj.weight\",\n",
" )\n",
" block.ff.fc2.weight = assign(\n",
" block.ff.fc2.weight,\n",
" params[f\"model.layers.{l}.mlp.up_proj.weight\"],\n",
" f\"model.layers.{l}.mlp.up_proj.weight\",\n",
" )\n",
" block.ff.fc3.weight = assign(\n",
" block.ff.fc3.weight,\n",
" params[f\"model.layers.{l}.mlp.down_proj.weight\"],\n",
" f\"model.layers.{l}.mlp.down_proj.weight\",\n",
" )\n",
"\n",
" # Layernorm\n",
" block.input_layernorm.weight = assign(\n",
" block.input_layernorm.weight,\n",
" params[f\"model.layers.{l}.input_layernorm.weight\"],\n",
" f\"model.layers.{l}.input_layernorm.weight\",\n",
" )\n",
"\n",
" # Final normalization and output head\n",
" model.final_norm.weight = assign(\n",
" model.final_norm.weight,\n",
" params[\"model.norm.weight\"],\n",
" \"model.norm.weight\",\n",
" )\n",
"\n",
" if \"lm_head.weight\" in params:\n",
" model.out_head.weight = assign(model.out_head.weight, params[\"lm_head.weight\"], \"lm_head.weight\")\n",
" else:\n",
" if param_config[\"tie_word_embeddings\"]:\n",
" model.out_head.weight = model.tok_emb.weight\n",
" print(\"Model uses weight tying.\")"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "d1ad9fe4-1330-46b6-9d73-d0203065753f",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "29541041b2b14206a5ac72a6f04ebc61",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading (incomplete total...): 0.00B [00:00, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "ec974980488342e5b16f12d9e4a76400",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Fetching 15 files: 0%| | 0/15 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model uses weight tying.\n"
]
}
],
"source": [
"import json\n",
"from safetensors.torch import load_file\n",
"from huggingface_hub import snapshot_download\n",
"\n",
"\n",
"repo_dir = snapshot_download(repo_id=REPO_ID, local_dir=LOCAL_DIR)\n",
"index_path = Path(repo_dir) / \"model.safetensors.index.json\"\n",
"with open(index_path, \"r\") as f:\n",
" index = json.load(f)\n",
"\n",
"weights_dict = {}\n",
"for filename in sorted(set(index[\"weight_map\"].values())):\n",
" shard_path = Path(repo_dir) / filename\n",
" shard = load_file(shard_path)\n",
" weights_dict.update(shard)\n",
"\n",
"load_weights_into_tiny_aya(model, TINY_AYA_CONFIG, weights_dict)\n",
"model.to(device)\n",
"del weights_dict"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "364e76ca-52f8-4fa5-af37-c4069f9694bc",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "364e76ca-52f8-4fa5-af37-c4069f9694bc",
"outputId": "00d7e983-262e-4c65-f322-f4d999311988"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total number of unique parameters: 3,349,227,520\n"
]
}
],
"source": [
"def count_unique_parameters(model):\n",
" unique_params = set()\n",
" total_unique_params = 0\n",
    "\n",
" for param in model.parameters():\n",
" if param.data_ptr() not in unique_params:\n",
" total_unique_params += param.numel()\n",
" unique_params.add(param.data_ptr())\n",
    "\n",
" return total_unique_params\n",
"\n",
"total_params_uniq = count_unique_parameters(model)\n",
"print(f\"Total number of unique parameters: {total_params_uniq:,}\")"
]
},
{
"cell_type": "markdown",
"id": "57d07df1-4401-4792-b549-7c4cc5632323",
"metadata": {
"id": "57d07df1-4401-4792-b549-7c4cc5632323"
},
"source": [
"&nbsp;\n",
"# 5. Generate text"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "7b8401c6-e244-4cb7-9849-2ba71ce758d5",
"metadata": {
"id": "7b8401c6-e244-4cb7-9849-2ba71ce758d5"
},
"outputs": [],
"source": [
"stop_ids = {\n",
" tokenizer.eos_token_id,\n",
" tokenizer._tok.token_to_id(\"<|END_RESPONSE|>\"),\n",
" tokenizer._tok.token_to_id(\"<|END_OF_TURN_TOKEN|>\"),\n",
"}\n",
"stop_ids = {x for x in stop_ids if x is not None}\n",
"\n",
"\n",
"def generate_text_basic_stream(model, token_ids, max_new_tokens, stop_token_ids=None):\n",
" stop_token_ids = set(stop_token_ids or [])\n",
"\n",
" model.eval()\n",
" with torch.no_grad():\n",
" for _ in range(max_new_tokens):\n",
" out = model(token_ids)[:, -1]\n",
" next_token = torch.argmax(out, dim=-1, keepdim=True)\n",
"\n",
    "            # .item() assumes a batch size of 1\n",
" if next_token.item() in stop_token_ids:\n",
" break\n",
"\n",
" yield next_token\n",
" token_ids = torch.cat([token_ids, next_token], dim=1)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "1c7a04fa-6aac-416b-8f63-f1e19227633d",
"metadata": {
"id": "1c7a04fa-6aac-416b-8f63-f1e19227633d"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Large language models are advanced AI systems trained on vast amounts of text data to understand and generate human-like language. They use deep learning techniques, particularly transformer architectures, to process and predict text patterns, enabling tasks like translation, summarization, and conversational dialogue. These models have revolutionized natural language processing, powering applications from chatbots to content creation."
]
}
],
"source": [
"prompt = apply_chat_template(\"Give me a short introduction to large language models in 3 sentences.\")\n",
"input_token_ids = tokenizer.encode(prompt)\n",
"input_token_ids_tensor = torch.tensor(input_token_ids, device=device).unsqueeze(0)\n",
"\n",
"\n",
"if torch.cuda.is_available():\n",
" torch.cuda.reset_peak_memory_stats()\n",
"\n",
"\n",
"for token in generate_text_basic_stream(\n",
" model=model,\n",
" token_ids=input_token_ids_tensor,\n",
" max_new_tokens=500,\n",
" stop_token_ids=stop_ids\n",
"):\n",
" token_id = token.squeeze(0).tolist()\n",
" print(\n",
" tokenizer.decode(token_id),\n",
" end=\"\",\n",
" flush=True\n",
" )\n",
"\n",
"if torch.cuda.is_available():\n",
" def calc_gpu_gb(x):\n",
" return f\"{x / 1024 / 1024 / 1024:.2f} GB\"\n",
    "\n",
" print(f\"\\n\\nGPU memory used: {calc_gpu_gb(torch.cuda.max_memory_allocated())}\")"
]
},
{
"cell_type": "markdown",
"id": "549324d6-5c71-4147-ae21-2e67675faa3d",
"metadata": {
"id": "549324d6-5c71-4147-ae21-2e67675faa3d"
},
"source": [
"&nbsp;\n",
"# What's next?"
]
},
{
"cell_type": "markdown",
"id": "e6edaaae-2de1-406c-8ffa-897cdfa3808c",
"metadata": {
"id": "e6edaaae-2de1-406c-8ffa-897cdfa3808c"
},
"source": [
"- For those interested in a comprehensive guide on building a large language model from scratch and gaining a deeper understanding of its mechanics, you might like my [Build a Large Language Model (From Scratch)](http://mng.bz/orYv)\n",
"\n",
"<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>"
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"gpuType": "A100",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.5"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {
"0dccd57dcc5c43a588157cef957c07e8": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "HTMLStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "HTMLStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "StyleView",
"background": null,
"description_width": "",
"font_size": null,
"text_color": null
}
},
"17a3174e65c54476b2e0d1faf8f011ca": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "HTMLModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "2.0.0",
"_view_name": "HTMLView",
"description": "",
"description_allow_html": false,
"layout": "IPY_MODEL_90a79523187446dfa692723b2e5833a7",
"placeholder": "",
"style": "IPY_MODEL_431ffb83b8c14bf182f0430e07ea6154",
"tabbable": null,
"tooltip": null,
"value": "model.safetensors:35%"
}
},
"1bbf2e62c0754d1593beb4105a7f1ac1": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "FloatProgressModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "FloatProgressModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "2.0.0",
"_view_name": "ProgressView",
"bar_style": "",
"description": "",
"description_allow_html": false,
"layout": "IPY_MODEL_a8f1b72a33dd4b548de23fbd95e0da18",
"max": 2471645608,
"min": 0,
"orientation": "horizontal",
"style": "IPY_MODEL_25cc36132d384189acfbecc59483134b",
"tabbable": null,
"tooltip": null,
"value": 880803840
}
},
"25cc36132d384189acfbecc59483134b": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "ProgressStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "ProgressStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "StyleView",
"bar_color": null,
"description_width": ""
}
},
"271e2bd6a35e4a8b92de8697f7c0be5f": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "2.0.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "2.0.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border_bottom": null,
"border_left": null,
"border_right": null,
"border_top": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"3326b6141a1a4eba9f316df528a9b99a": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "HTMLStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "HTMLStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "StyleView",
"background": null,
"description_width": "",
"font_size": null,
"text_color": null
}
},
"33ca0cdf2c7f41598a381c4ebe6a4ee1": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "2.0.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "2.0.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border_bottom": null,
"border_left": null,
"border_right": null,
"border_top": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"431ffb83b8c14bf182f0430e07ea6154": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "HTMLStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "HTMLStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "StyleView",
"background": null,
"description_width": "",
"font_size": null,
"text_color": null
}
},
"5176834aa8784bba9ec21234b87a8948": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "2.0.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "2.0.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border_bottom": null,
"border_left": null,
"border_right": null,
"border_top": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"518fb202e4b44aaba47f07d1a61b6762": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "HTMLModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "2.0.0",
"_view_name": "HTMLView",
"description": "",
"description_allow_html": false,
"layout": "IPY_MODEL_e2dc407afcd945c798e30597fddfcb3c",
"placeholder": "",
"style": "IPY_MODEL_0dccd57dcc5c43a588157cef957c07e8",
"tabbable": null,
"tooltip": null,
"value": "tokenizer.model:100%"
}
},
"672cdc5aea954de3af851c001a667ad3": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "FloatProgressModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "FloatProgressModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "2.0.0",
"_view_name": "ProgressView",
"bar_style": "success",
"description": "",
"description_allow_html": false,
"layout": "IPY_MODEL_33ca0cdf2c7f41598a381c4ebe6a4ee1",
"max": 2183982,
"min": 0,
"orientation": "horizontal",
"style": "IPY_MODEL_ee44487f58454dacb522b1e084ffb733",
"tabbable": null,
"tooltip": null,
"value": 2183982
}
},
"90a79523187446dfa692723b2e5833a7": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "2.0.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "2.0.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border_bottom": null,
"border_left": null,
"border_right": null,
"border_top": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"9881b6995c3f49dc89e6992fd9ab660b": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "HBoxModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "HBoxModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "2.0.0",
"_view_name": "HBoxView",
"box_style": "",
"children": [
"IPY_MODEL_17a3174e65c54476b2e0d1faf8f011ca",
"IPY_MODEL_1bbf2e62c0754d1593beb4105a7f1ac1",
"IPY_MODEL_b82112e1dec645d98aa1c1ba64abcb61"
],
"layout": "IPY_MODEL_271e2bd6a35e4a8b92de8697f7c0be5f",
"tabbable": null,
"tooltip": null
}
},
"a1608feac06d4687967a3e398f01c489": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "HBoxModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "HBoxModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "2.0.0",
"_view_name": "HBoxView",
"box_style": "",
"children": [
"IPY_MODEL_518fb202e4b44aaba47f07d1a61b6762",
"IPY_MODEL_672cdc5aea954de3af851c001a667ad3",
"IPY_MODEL_eebf8874618746b39cf4a21a2728dc7f"
],
"layout": "IPY_MODEL_5176834aa8784bba9ec21234b87a8948",
"tabbable": null,
"tooltip": null
}
},
"a8f1b72a33dd4b548de23fbd95e0da18": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "2.0.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "2.0.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border_bottom": null,
"border_left": null,
"border_right": null,
"border_top": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"b82112e1dec645d98aa1c1ba64abcb61": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "HTMLModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "2.0.0",
"_view_name": "HTMLView",
"description": "",
"description_allow_html": false,
"layout": "IPY_MODEL_bfd06423ad544218968648016e731a46",
"placeholder": "",
"style": "IPY_MODEL_d029630b63ff44cf807ade428d2eb421",
"tabbable": null,
"tooltip": null,
"value": "870M/2.47G[00:20&lt;00:37,42.8MB/s]"
}
},
"bfd06423ad544218968648016e731a46": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "2.0.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "2.0.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border_bottom": null,
"border_left": null,
"border_right": null,
"border_top": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"d029630b63ff44cf807ade428d2eb421": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "HTMLStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "HTMLStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "StyleView",
"background": null,
"description_width": "",
"font_size": null,
"text_color": null
}
},
"d2c41e71a3f441deaed091b620ac5603": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "2.0.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "2.0.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border_bottom": null,
"border_left": null,
"border_right": null,
"border_top": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"e2dc407afcd945c798e30597fddfcb3c": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "2.0.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "2.0.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border_bottom": null,
"border_left": null,
"border_right": null,
"border_top": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"ee44487f58454dacb522b1e084ffb733": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "ProgressStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "ProgressStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "StyleView",
"bar_color": null,
"description_width": ""
}
},
"eebf8874618746b39cf4a21a2728dc7f": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "HTMLModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "2.0.0",
"_view_name": "HTMLView",
"description": "",
"description_allow_html": false,
"layout": "IPY_MODEL_d2c41e71a3f441deaed091b620ac5603",
"placeholder": "",
"style": "IPY_MODEL_3326b6141a1a4eba9f316df528a9b99a",
"tabbable": null,
"tooltip": null,
"value": "2.18M/2.18M[00:00&lt;00:00,9.47MB/s]"
}
}
},
"version_major": 2,
"version_minor": 0
2026-02-19 17:33:22 -05:00
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}