{ "cells": [ { "cell_type": "markdown", "id": "e1b280ab-b61f-4d1a-bf7e-44e5f9ed3a5c", "metadata": { "id": "e1b280ab-b61f-4d1a-bf7e-44e5f9ed3a5c" }, "source": [ "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "Supplementary code for the Build a Large Language Model From Scratch book by Sebastian Raschka
\n", "
Code repository: https://github.com/rasbt/LLMs-from-scratch\n", "
\n", "
\n", "\n", "
" ] }, { "cell_type": "markdown", "id": "efde77f2-6af3-4781-8597-89ecd3f41a52", "metadata": { "id": "efde77f2-6af3-4781-8597-89ecd3f41a52" }, "source": [ "# Tiny Aya From Scratch (A Standalone Notebook)" ] }, { "cell_type": "markdown", "id": "55cdef4d-de59-4a65-89f9-fa2a8ef3471d", "metadata": { "id": "55cdef4d-de59-4a65-89f9-fa2a8ef3471d" }, "source": [ "- This notebook is purposefully minimal and focuses on the code to re-implement Tiny Aya (3.35B) models from Cohere in pure PyTorch without relying on other external LLM libraries; Tiny Aya is interesting because it is a small but strong model with good multi-lingual support\n", "- For more information, see the official [Tiny Aya announcement](https://cohere.com/blog/cohere-labs-tiny-aya) and model cards:\n", " - [tiny-aya-base](https://huggingface.co/CohereLabs/tiny-aya-base) (base model)\n", " - [tiny-aya-global](https://huggingface.co/CohereLabs/tiny-aya-global) (best balance across languages and regions; notebook default)\n", " - [tiny-aya-fire](https://huggingface.co/CohereLabs/tiny-aya-fire) (optimized for South Asian languages)\n", " - [tiny-aya-water](https://huggingface.co/CohereLabs/tiny-aya-water) (optimized for European and Asia Pacific languages)\n", " - [tiny-aya-earth](https://huggingface.co/CohereLabs/tiny-aya-earth) (optimized for West Asian and African languages)\n" ] }, { "cell_type": "markdown", "id": "4e2a716d-31e6-4d28-be32-94585dcae082", "metadata": {}, "source": [ "- Below is a table with more details regarding the language specialization (taken from their announcement blog post linked above)\n", "\n", "| Region | Languages | Optimized Model |\n", "|---------------|-----------|----------------|\n", "| **Asia Pacific** | Traditional Chinese, Cantonese, Vietnamese, Tagalog, Javanese, Khmer, Thai, Burmese, Malay, Korean, Lao, Indonesian, Simplified Chinese, Japanese | tiny-aya-water |\n", "| **Africa** | Zulu, Amharic, Hausa, Igbo, Swahili, Xhosa, Wolof, Shona, Yoruba, Nigerian Pidgin, Malagasy 
| tiny-aya-earth |\n", "| **South Asia** | Telugu, Marathi, Bengali, Tamil, Hindi, Punjabi, Gujarati, Urdu, Nepali | tiny-aya-fire |\n", "| **Europe** | Catalan, Galician, Dutch, Danish, Finnish, Czech, Portuguese, French, Lithuanian, Slovak, Basque, English, Swedish, Polish, Spanish, Slovenian, Ukrainian, Greek, Bokmål, Romanian, Serbian, German, Italian, Russian, Irish, Hungarian, Bulgarian, Croatian, Estonian, Latvian, Welsh | tiny-aya-water |\n", "| **West Asia** | Arabic, Maltese, Turkish, Hebrew, Persian | tiny-aya-earth |\n" ] }, { "cell_type": "markdown", "id": "66b43549-585f-43ab-be19-addcc2dfc669", "metadata": {}, "source": [ "- Below is a side-by-side comparison with Qwen3 4B as a reference model; if you are interested in the Qwen3 standalone notebook, you can find it [here](../11_qwen3)\n", "
\n", "\n", "\n", "\n", " \n", "- About the code:\n", " - all code is my own code, mapping the Tiny Aya architecture onto the model code implemented in my [Build A Large Language Model (From Scratch)](http://mng.bz/orYv) book; the code is released under a permissive open-source Apache 2.0 license (see [LICENSE.txt](https://github.com/rasbt/LLMs-from-scratch/blob/main/LICENSE.txt))" ] }, { "cell_type": "code", "execution_count": 1, "id": "7c201adb-747e-437b-9a62-442802941e01", "metadata": {}, "outputs": [], "source": [ "# pip install -r https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/refs/heads/main/ch05/07_gpt_to_llama/requirements-extra.txt" ] }, { "cell_type": "code", "execution_count": 2, "id": "dd1b65a8-4301-444a-bd7c-a6f2bd1df9df", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "dd1b65a8-4301-444a-bd7c-a6f2bd1df9df", "outputId": "4f762354-e0a3-4cc2-e5d4-e61a227a202c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "huggingface_hub version: 1.4.1\n", "tiktoken version: 0.12.0\n", "torch version: 2.10.0\n" ] } ], "source": [ "from importlib.metadata import version\n", "\n", "pkgs = [\n", " #\"blobfile\", # to download pretrained weights\n", " \"huggingface_hub\", # to download pretrained weights\n", " \"tiktoken\", # to implement the tokenizer\n", " \"torch\", # to implement the model\n", "]\n", "for p in pkgs:\n", " print(f\"{p} version: {version(p)}\")" ] }, { "cell_type": "code", "execution_count": 3, "id": "574bc51e-876e-46c3-bcf7-ef4675582ad2", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "REPO_ID = \"CohereLabs/tiny-aya-global\"\n", "#REPO_ID = \"CohereLabs/tiny-aya-fire\" \n", "#REPO_ID = \"CohereLabs/tiny-aya-water\"\n", "#REPO_ID = \"CohereLabs/tiny-aya-earth\"\n", "\n", "LOCAL_DIR = Path(REPO_ID).parts[-1]" ] }, { "cell_type": "markdown", "id": "653410a6-dd2b-4eb2-a722-23d9782e726d", "metadata": { "id": "653410a6-dd2b-4eb2-a722-23d9782e726d" }, "source": [ " \n", 
"# 1. Architecture code" ] }, { "cell_type": "code", "execution_count": 4, "id": "82076c21-9331-4dcd-b017-42b046cf1a60", "metadata": { "id": "82076c21-9331-4dcd-b017-42b046cf1a60" }, "outputs": [], "source": [ "import torch\n", "import torch.nn as nn\n", "\n", "\n", "class FeedForward(nn.Module):\n", " def __init__(self, cfg):\n", " super().__init__()\n", " self.fc1 = nn.Linear(cfg[\"emb_dim\"], cfg[\"hidden_dim\"], dtype=cfg[\"dtype\"], bias=False)\n", " self.fc2 = nn.Linear(cfg[\"emb_dim\"], cfg[\"hidden_dim\"], dtype=cfg[\"dtype\"], bias=False)\n", " self.fc3 = nn.Linear(cfg[\"hidden_dim\"], cfg[\"emb_dim\"], dtype=cfg[\"dtype\"], bias=False)\n", "\n", " def forward(self, x):\n", " x_fc1 = self.fc1(x)\n", " x_fc2 = self.fc2(x)\n", " x = nn.functional.silu(x_fc1) * x_fc2\n", " return self.fc3(x)" ] }, { "cell_type": "code", "execution_count": 5, "id": "1a36d4a0-ee44-4727-ab7e-c73dd5e1ddba", "metadata": {}, "outputs": [], "source": [ "# Aya uses a bias-less LayerNorm variant. \n", "# The difference to classic LayerNorm is that it only \n", "# has a scale parameter (weight), no shift parameter (bias).\n", "\n", "class CohereLayerNorm(nn.Module):\n", " def __init__(self, emb_dim, eps=1e-5):\n", " super().__init__()\n", " self.eps = eps\n", " self.weight = nn.Parameter(torch.ones(emb_dim))\n", "\n", " def forward(self, x):\n", " input_dtype = x.dtype\n", " x = x.to(torch.float32)\n", " mean = x.mean(dim=-1, keepdim=True)\n", " variance = (x - mean).pow(2).mean(dim=-1, keepdim=True)\n", " x = (x - mean) * torch.rsqrt(variance + self.eps)\n", " return (self.weight.to(torch.float32) * x).to(input_dtype)" ] }, { "cell_type": "code", "execution_count": 6, "id": "4b9a346f-5826-4083-9162-abd56afc03f0", "metadata": { "id": "4b9a346f-5826-4083-9162-abd56afc03f0" }, "outputs": [], "source": [ "def compute_rope_params(head_dim, theta_base=10_000, context_length=4096, dtype=torch.float32):\n", " assert head_dim % 2 == 0, \"head_dim must be even\"\n", "\n", " # Compute the 
inverse frequencies\n", " inv_freq = 1.0 / (\n", " theta_base ** (torch.arange(0, head_dim, 2, dtype=dtype)[: (head_dim // 2)].float() / head_dim)\n", " )\n", " positions = torch.arange(context_length, dtype=dtype)\n", "\n", " # Compute the angles\n", " angles = positions.unsqueeze(1) * inv_freq.unsqueeze(0) # Shape: (context_length, head_dim // 2)\n", "\n", " # Cohere uses interleaved even/odd angle layout per head-dim pair.\n", " # Llama2 notebook examples often use a split-halves layout via cat([angles, angles]).\n", " # Both are equivalent only when paired with the matching rotate logic:\n", " # - interleaved layout -> even/odd rotation implementation (below)\n", " # - split-halves layout -> half/half rotate implementation\n", " angles = torch.repeat_interleave(angles, 2, dim=1) # Shape: (context_length, head_dim)\n", "\n", " # Precompute sine and cosine\n", " return torch.cos(angles), torch.sin(angles)\n", "\n", "def apply_rope(x, cos, sin, offset=0):\n", " # x: (batch_size, num_heads, seq_len, head_dim)\n", " batch_size, num_heads, seq_len, head_dim = x.shape\n", " assert head_dim % 2 == 0, \"head_dim must be even\"\n", "\n", " # Split x into even and odd components (interleaved layout)\n", " x_even = x[..., ::2]\n", " x_odd = x[..., 1::2]\n", "\n", " # Adjust sin and cos shapes\n", " cos = cos[offset:offset + seq_len, :].unsqueeze(0).unsqueeze(0)\n", " sin = sin[offset:offset + seq_len, :].unsqueeze(0).unsqueeze(0)\n", "\n", " # Apply the rotary transformation\n", " x_float = x.float()\n", " rotated = torch.stack((-x_odd.float(), x_even.float()), dim=-1).flatten(-2)\n", " x_rotated = (x_float * cos) + (rotated * sin)\n", "\n", " return x_rotated.to(dtype=x.dtype)" ] }, { "cell_type": "code", "execution_count": 7, "id": "e8169ab5-f976-4222-a2e1-eb1cabf267cb", "metadata": { "id": "e8169ab5-f976-4222-a2e1-eb1cabf267cb" }, "outputs": [], "source": [ "class GroupedQueryAttention(nn.Module):\n", " def __init__(\n", " self,\n", " d_in,\n", " num_heads,\n", " 
num_kv_groups,\n", " head_dim=None,\n", " qk_norm=False,\n", " attention_bias=False,\n", " dtype=None,\n", " attn_type=\"full_attention\",\n", " ):\n", " super().__init__()\n", " assert num_heads % num_kv_groups == 0, \"num_heads must be divisible by num_kv_groups\"\n", "\n", " self.num_heads = num_heads\n", " self.num_kv_groups = num_kv_groups\n", " self.group_size = num_heads // num_kv_groups\n", "\n", " if head_dim is None:\n", " assert d_in % num_heads == 0, \"`d_in` must be divisible by `num_heads` if `head_dim` is not set\"\n", " head_dim = d_in // num_heads\n", "\n", " self.head_dim = head_dim\n", " self.d_out = num_heads * head_dim\n", " self.attn_type = attn_type\n", "\n", " self.W_query = nn.Linear(\n", " d_in,\n", " self.d_out,\n", " bias=attention_bias,\n", " dtype=dtype,\n", " )\n", " self.W_key = nn.Linear(\n", " d_in,\n", " num_kv_groups * head_dim,\n", " bias=attention_bias,\n", " dtype=dtype,\n", " )\n", " self.W_value = nn.Linear(\n", " d_in,\n", " num_kv_groups * head_dim,\n", " bias=attention_bias,\n", " dtype=dtype,\n", " )\n", " self.out_proj = nn.Linear(\n", " self.d_out,\n", " d_in,\n", " bias=attention_bias,\n", " dtype=dtype,\n", " )\n", "\n", " if qk_norm:\n", " self.q_norm = CohereLayerNorm(head_dim, eps=1e-6)\n", " self.k_norm = CohereLayerNorm(head_dim, eps=1e-6)\n", " else:\n", " self.q_norm = self.k_norm = None\n", "\n", " def forward(self, x, mask, cos, sin, start_pos=0, cache=None):\n", " b, num_tokens, _ = x.shape\n", "\n", " # Apply projections\n", " queries = self.W_query(x) # (b, num_tokens, num_heads * head_dim)\n", " keys = self.W_key(x) # (b, num_tokens, num_kv_groups * head_dim)\n", " values = self.W_value(x) # (b, num_tokens, num_kv_groups * head_dim)\n", "\n", " # Reshape\n", " queries = queries.view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)\n", " keys_new = keys.view(b, num_tokens, self.num_kv_groups, self.head_dim).transpose(1, 2)\n", " values_new = values.view(b, num_tokens, self.num_kv_groups, 
self.head_dim).transpose(1, 2)\n", "\n", " # Optional normalization\n", " if self.q_norm:\n", " queries = self.q_norm(queries)\n", " if self.k_norm:\n", " keys_new = self.k_norm(keys_new)\n", "\n", " # Cohere2 applies RoPE only on sliding-attention layers.\n", " if self.attn_type == \"sliding_attention\":\n", " queries = apply_rope(queries, cos, sin, offset=start_pos)\n", " keys_new = apply_rope(keys_new, cos, sin, offset=start_pos)\n", "\n", " if cache is not None:\n", " prev_k, prev_v = cache\n", " keys = torch.cat([prev_k, keys_new], dim=2)\n", " values = torch.cat([prev_v, values_new], dim=2)\n", " next_cache = (keys, values)\n", " else:\n", " keys, values = keys_new, values_new\n", " next_cache = (keys, values)\n", "\n", " # Expand K and V to match number of heads\n", " keys = keys.repeat_interleave(self.group_size, dim=1)\n", " values = values.repeat_interleave(self.group_size, dim=1)\n", "\n", " # Attention\n", " attn_scores = queries @ keys.transpose(2, 3)\n", " attn_scores = attn_scores.masked_fill(mask, -torch.inf)\n", "\n", " attn_weights = torch.softmax(attn_scores / self.head_dim**0.5, dim=-1, dtype=torch.float32).to(queries.dtype)\n", " context = (attn_weights @ values).transpose(1, 2).reshape(b, num_tokens, self.d_out)\n", "\n", " return self.out_proj(context), next_cache" ] }, { "cell_type": "code", "execution_count": 8, "id": "457cb2f8-50c1-4045-8a74-f181bfb5fea9", "metadata": { "id": "457cb2f8-50c1-4045-8a74-f181bfb5fea9" }, "outputs": [], "source": [ "class TransformerBlock(nn.Module):\n", " def __init__(self, cfg, attn_type):\n", " super().__init__()\n", " self.attn_type = attn_type\n", "\n", " self.att = GroupedQueryAttention(\n", " d_in=cfg[\"emb_dim\"],\n", " num_heads=cfg[\"n_heads\"],\n", " num_kv_groups=cfg[\"n_kv_heads\"],\n", " head_dim=cfg[\"head_dim\"],\n", " qk_norm=False,\n", " attention_bias=cfg[\"attention_bias\"],\n", " dtype=cfg[\"dtype\"],\n", " attn_type=attn_type,\n", " )\n", " self.ff = FeedForward(cfg)\n", " 
self.input_layernorm = CohereLayerNorm(cfg[\"emb_dim\"], eps=cfg[\"layer_norm_eps\"])\n", "\n", " def forward(self, x, mask_global, mask_local, cos, sin, start_pos=0, cache=None):\n", " attn_mask = mask_local if self.attn_type == \"sliding_attention\" else mask_global\n", "\n", " shortcut = x\n", " x = self.input_layernorm(x)\n", " x_attn, next_cache = self.att(\n", " x,\n", " attn_mask,\n", " cos,\n", " sin,\n", " start_pos=start_pos,\n", " cache=cache,\n", " ) # Shape [batch_size, num_tokens, emb_dim]\n", " x_ff = self.ff(x)\n", "\n", " # Cohere2 parallel residual block\n", " x = shortcut + x_attn + x_ff\n", " return x, next_cache" ] }, { "cell_type": "code", "execution_count": 9, "id": "e88de3e3-9f07-42cc-816b-28dbd46e96c4", "metadata": { "id": "e88de3e3-9f07-42cc-816b-28dbd46e96c4" }, "outputs": [], "source": [ "class TinyAyaModel(nn.Module):\n", " def __init__(self, cfg):\n", " super().__init__()\n", " assert len(cfg[\"layer_types\"]) == cfg[\"n_layers\"], \"layer_types must match n_layers\"\n", "\n", " self.cfg = cfg\n", "\n", " self.tok_emb = nn.Embedding(cfg[\"vocab_size\"], cfg[\"emb_dim\"], dtype=cfg[\"dtype\"])\n", " self.trf_blocks = nn.ModuleList([TransformerBlock(cfg, t) for t in cfg[\"layer_types\"]])\n", "\n", " self.final_norm = CohereLayerNorm(cfg[\"emb_dim\"], eps=cfg[\"layer_norm_eps\"])\n", " self.out_head = nn.Linear(cfg[\"emb_dim\"], cfg[\"vocab_size\"], bias=False, dtype=cfg[\"dtype\"])\n", "\n", " self.logit_scale = cfg[\"logit_scale\"]\n", "\n", " cos, sin = compute_rope_params(\n", " head_dim=cfg[\"head_dim\"],\n", " theta_base=cfg[\"rope_base\"],\n", " context_length=cfg[\"context_length\"],\n", " )\n", " self.register_buffer(\"cos\", cos, persistent=False)\n", " self.register_buffer(\"sin\", sin, persistent=False)\n", "\n", " if cfg[\"tie_word_embeddings\"]:\n", " self.out_head.weight = self.tok_emb.weight\n", "\n", " self.current_pos = 0 # Track current position in KV cache\n", "\n", " def create_masks(self, num_tokens, device, 
pos_start=0, total_kv_tokens=None):\n", " if total_kv_tokens is None:\n", " total_kv_tokens = pos_start + num_tokens\n", "\n", " query_positions = torch.arange(pos_start, pos_start + num_tokens, device=device).unsqueeze(1)\n", " key_positions = torch.arange(total_kv_tokens, device=device).unsqueeze(0)\n", "\n", " # Future mask\n", " mask_global = key_positions > query_positions\n", "\n", " # Sliding-window mask\n", " far_past = key_positions + self.cfg[\"sliding_window\"] <= query_positions\n", " mask_local = mask_global | far_past\n", "\n", " # Expand to [batch, heads, seq, seq]-broadcastable shape\n", " return mask_global.unsqueeze(0).unsqueeze(0), mask_local.unsqueeze(0).unsqueeze(0)\n", "\n", " def forward(self, input_ids, attention_mask=None, cache=None):\n", " tok_embeds = self.tok_emb(input_ids)\n", " x = tok_embeds\n", " num_tokens = x.shape[1]\n", "\n", " if cache is not None:\n", " pos_start = self.current_pos\n", " pos_end = pos_start + num_tokens\n", " self.current_pos = pos_end\n", " total_kv_tokens = pos_end\n", " else:\n", " pos_start = 0\n", " total_kv_tokens = num_tokens\n", "\n", " mask_global, mask_local = self.create_masks(\n", " num_tokens,\n", " x.device,\n", " pos_start=pos_start,\n", " total_kv_tokens=total_kv_tokens,\n", " )\n", "\n", " if attention_mask is not None:\n", " # True means mask in this implementation.\n", " pad_mask = attention_mask[:, None, None, :total_kv_tokens].to(dtype=torch.bool).logical_not()\n", " mask_global = mask_global | pad_mask\n", " mask_local = mask_local | pad_mask\n", "\n", " cos = self.cos.to(x.device, dtype=x.dtype)\n", " sin = self.sin.to(x.device, dtype=x.dtype)\n", "\n", " for i, block in enumerate(self.trf_blocks):\n", " blk_cache = cache.get(i) if cache else None\n", " x, new_blk_cache = block(\n", " x,\n", " mask_global,\n", " mask_local,\n", " cos,\n", " sin,\n", " start_pos=pos_start,\n", " cache=blk_cache,\n", " )\n", " if cache is not None:\n", " cache.update(i, new_blk_cache)\n", "\n", " x = 
self.final_norm(x)\n", " logits = self.out_head(x.to(self.cfg[\"dtype\"]))\n", " return logits * self.logit_scale\n", "\n", " def reset_kv_cache(self):\n", " self.current_pos = 0\n", "\n", "\n", "class KVCache:\n", " def __init__(self, n_layers):\n", " self.cache = [None] * n_layers\n", "\n", " def get(self, layer_idx):\n", " return self.cache[layer_idx]\n", "\n", " def update(self, layer_idx, value):\n", " self.cache[layer_idx] = value\n", "\n", " def get_all(self):\n", " return self.cache\n", "\n", " def reset(self):\n", " for i in range(len(self.cache)):\n", " self.cache[i] = None" ] }, { "cell_type": "markdown", "id": "be2d201f-74ad-4d63-ab9c-601b00674a48", "metadata": { "id": "be2d201f-74ad-4d63-ab9c-601b00674a48" }, "source": [ " \n", "# 2. Initialize model" ] }, { "cell_type": "markdown", "id": "23dea40c-fe20-4a75-be25-d6fce5863c01", "metadata": { "id": "23dea40c-fe20-4a75-be25-d6fce5863c01" }, "source": [ "- The remainder of this notebook uses the Tiny Aya model selected via `REPO_ID` at the top (tiny-aya-global by default); all Tiny Aya variants share the same 3.35B architecture, so the configuration in the following code cell applies to each of them" ] }, { "cell_type": "code", "execution_count": 10, "id": "caa142fa-b375-4e78-b392-2072ced666f3", "metadata": { "id": "caa142fa-b375-4e78-b392-2072ced666f3" }, "outputs": [], "source": [ "TINY_AYA_CONFIG = {\n", " \"vocab_size\": 262_144, # Vocabulary size\n", " \"context_length\": 500_000, # Context length in the HF config\n", " \"emb_dim\": 2048, # Embedding dimension\n", " \"n_heads\": 16, # Number of attention heads\n", " \"n_layers\": 36, # Number of layers\n", " \"hidden_dim\": 11_008, # Size of the intermediate dimension in FeedForward\n", " \"head_dim\": 128, # Size of the heads in GQA\n", " \"n_kv_heads\": 4, # Number of KV heads for grouped-query attention\n", " \"attention_bias\": False, # Whether attention projections use bias terms\n", " \"attention_dropout\": 0.0, # Attention dropout\n", " \"sliding_window\": 4096, # Sliding-window attention context\n", " \"layer_types\": [\n", " 
\"sliding_attention\",\n", " \"sliding_attention\",\n", " \"sliding_attention\",\n", " \"full_attention\",\n", " \"sliding_attention\",\n", " \"sliding_attention\",\n", " \"sliding_attention\",\n", " \"full_attention\",\n", " \"sliding_attention\",\n", " \"sliding_attention\",\n", " \"sliding_attention\",\n", " \"full_attention\",\n", " \"sliding_attention\",\n", " \"sliding_attention\",\n", " \"sliding_attention\",\n", " \"full_attention\",\n", " \"sliding_attention\",\n", " \"sliding_attention\",\n", " \"sliding_attention\",\n", " \"full_attention\",\n", " \"sliding_attention\",\n", " \"sliding_attention\",\n", " \"sliding_attention\",\n", " \"full_attention\",\n", " \"sliding_attention\",\n", " \"sliding_attention\",\n", " \"sliding_attention\",\n", " \"full_attention\",\n", " \"sliding_attention\",\n", " \"sliding_attention\",\n", " \"sliding_attention\",\n", " \"full_attention\",\n", " \"sliding_attention\",\n", " \"sliding_attention\",\n", " \"sliding_attention\",\n", " \"full_attention\",\n", " ],\n", " \"rope_base\": 50_000.0, # The base in RoPE's \"theta\"\n", " \"layer_norm_eps\": 1e-5, # Epsilon used by layer normalization\n", " \"logit_scale\": 1.0, # Final logits scaling factor\n", " \"tie_word_embeddings\": True, # Whether input embedding and output head are tied\n", " \"bos_token_id\": 2,\n", " \"eos_token_id\": 3,\n", " \"pad_token_id\": 0,\n", " \"dtype\": torch.bfloat16, # Lower-precision dtype to reduce memory usage\n", "}" ] }, { "cell_type": "code", "execution_count": 11, "id": "156253fe-aacd-4da2-8f13-705f05c4b11e", "metadata": { "id": "156253fe-aacd-4da2-8f13-705f05c4b11e" }, "outputs": [], "source": [ "model = TinyAyaModel(TINY_AYA_CONFIG)" ] }, { "cell_type": "code", "execution_count": 12, "id": "fd5efb03-5a07-46e8-8607-93ed47549d2b", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "fd5efb03-5a07-46e8-8607-93ed47549d2b", "outputId": "65c1a95e-b502-4150-9e2e-da619d9053d5" }, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "float32 (PyTorch default): 25.43 GB\n", "bfloat16: 12.72 GB\n" ] } ], "source": [ "def calc_model_memory_size(model, input_dtype=torch.float32):\n", " total_params = 0\n", " total_grads = 0\n", " for param in model.parameters():\n", " # Calculate total number of elements per parameter\n", " param_size = param.numel()\n", " total_params += param_size\n", " # Check if gradients are stored for this parameter\n", " if param.requires_grad:\n", " total_grads += param_size\n", "\n", " # Calculate buffer size (non-parameters that require memory)\n", " total_buffers = sum(buf.numel() for buf in model.buffers())\n", "\n", " # Size in bytes = (Number of elements) * (Size of each element in bytes)\n", " # We assume parameters and gradients are stored in the same type as input dtype\n", " element_size = torch.tensor(0, dtype=input_dtype).element_size()\n", " total_memory_bytes = (total_params + total_grads + total_buffers) * element_size\n", "\n", " # Convert bytes to gigabytes\n", " total_memory_gb = total_memory_bytes / (1024**3)\n", "\n", " return total_memory_gb\n", "\n", "print(f\"float32 (PyTorch default): {calc_model_memory_size(model, input_dtype=torch.float32):.2f} GB\")\n", "print(f\"bfloat16: {calc_model_memory_size(model, input_dtype=torch.bfloat16):.2f} GB\")" ] }, { "cell_type": "code", "execution_count": 13, "id": "41176fb0-d58a-443a-912f-4f436564b5f8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total number of parameters: 3,349,227,520\n", "\n", "Total number of unique parameters: 2,812,356,608\n" ] } ], "source": [ "total_params = sum(p.numel() for p in model.parameters())\n", "print(f\"Total number of parameters: {total_params:,}\")\n", "\n", "# Account for weight tying\n", "total_params_normalized = total_params - model.tok_emb.weight.numel()\n", "print(f\"\\nTotal number of unique parameters: {total_params_normalized:,}\")" ] }, { "cell_type": "code", "execution_count": 14, "id": 
"31f12baf-f79b-499f-85c0-51328a6a20f5", "metadata": { "id": "31f12baf-f79b-499f-85c0-51328a6a20f5" }, "outputs": [], "source": [ "if torch.cuda.is_available():\n", " device = torch.device(\"cuda\")\n", "elif torch.backends.mps.is_available():\n", " device = torch.device(\"mps\")\n", "else:\n", " device = torch.device(\"cpu\")\n", "\n", "model.to(device);" ] }, { "cell_type": "markdown", "id": "78e091e1-afa8-4d23-9aea-cced86181bfd", "metadata": { "id": "78e091e1-afa8-4d23-9aea-cced86181bfd" }, "source": [ " \n", "# 3. Load tokenizer" ] }, { "cell_type": "code", "execution_count": 15, "id": "9482b01c-49f9-48e4-ab2c-4a4c75240e77", "metadata": { "id": "9482b01c-49f9-48e4-ab2c-4a4c75240e77" }, "outputs": [], "source": [ "from tokenizers import Tokenizer\n", "\n", "\n", "class TinyAyaTokenizer:\n", " def __init__(self, tokenizer_file_path, eos_token_id=3, pad_token_id=0, bos_token_id=2):\n", " tok_file = Path(tokenizer_file_path)\n", " self._tok = Tokenizer.from_file(str(tok_file))\n", "\n", " eos_from_tok = self._tok.token_to_id(\"\")\n", " pad_from_tok = self._tok.token_to_id(\"\")\n", " bos_from_tok = self._tok.token_to_id(\"\")\n", "\n", " self.eos_token_id = eos_from_tok if eos_from_tok is not None else eos_token_id\n", " self.pad_token_id = pad_from_tok if pad_from_tok is not None else pad_token_id\n", " self.bos_token_id = bos_from_tok if bos_from_tok is not None else bos_token_id\n", "\n", " def encode(self, text):\n", " return self._tok.encode(text).ids\n", "\n", " def decode(self, ids):\n", " return self._tok.decode(ids, skip_special_tokens=False)\n", "\n", "\n", "def apply_chat_template(user_text):\n", " return (\n", " \"\"\n", " \"<|START_OF_TURN_TOKEN|><|USER_TOKEN|>\"\n", " f\"{user_text}\"\n", " \"<|END_OF_TURN_TOKEN|>\"\n", " \"<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|><|START_RESPONSE|>\"\n", " )" ] }, { "cell_type": "markdown", "id": "b771b60c-c198-4b30-bf10-42031197ae86", "metadata": { "id": "b771b60c-c198-4b30-bf10-42031197ae86" }, "source": [ "- Please 
note that Cohere requires that you accept the Tiny Aya licensing terms before you can download the files; to do this, you have to create a Hugging Face Hub account and visit the [CohereLabs/tiny-aya-global](https://huggingface.co/CohereLabs/tiny-aya-global) repository to accept the terms\n", "- Next, you will need to create an access token; to generate an access token with READ permissions, click on the profile picture in the upper right and select \"Settings\"\n", "\n", "\n", "\n", "- Then, create and copy the access token so you can copy & paste it into the next code cell\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 16, "id": "05104b25-71fb-462f-8f2d-336184833eda", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CohereLabs/tiny-aya-global\n" ] } ], "source": [ "print(REPO_ID)" ] }, { "cell_type": "markdown", "id": "7e327c26-ae3e-4f07-845f-eeb4a6b31283", "metadata": {}, "source": [ "- Note that if you use the fire, water, base, or earth model, you have to accept the licensing terms separately:\n", " - [CohereLabs/tiny-aya-fire](https://huggingface.co/CohereLabs/tiny-aya-fire)\n", " - [CohereLabs/tiny-aya-water](https://huggingface.co/CohereLabs/tiny-aya-water)\n", " - [CohereLabs/tiny-aya-earth](https://huggingface.co/CohereLabs/tiny-aya-earth)\n", " - [CohereLabs/tiny-aya-base](https://huggingface.co/CohereLabs/tiny-aya-base)" ] }, { "cell_type": "code", "execution_count": 17, "id": "e9d96dc8-603a-4cb5-8c3e-4d2ca56862ed", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "e9d96dc8-603a-4cb5-8c3e-4d2ca56862ed", "outputId": "e6e6dc05-7330-45bc-a9a7-331919155bdd" }, "outputs": [], "source": [ "# Run the following code to log in if you are executing the notebook for the first time\n", "\n", "from huggingface_hub import login\n", "login()" ] }, { "cell_type": "code", "execution_count": 18, "id": "986bc1a0-804f-4154-80f8-44cefbee1368", "metadata": { "colab": { "base_uri": 
"https://localhost:8080/", "height": 141, "referenced_widgets": [ "a1608feac06d4687967a3e398f01c489", "518fb202e4b44aaba47f07d1a61b6762", "672cdc5aea954de3af851c001a667ad3", "eebf8874618746b39cf4a21a2728dc7f", "5176834aa8784bba9ec21234b87a8948", "e2dc407afcd945c798e30597fddfcb3c", "0dccd57dcc5c43a588157cef957c07e8", "33ca0cdf2c7f41598a381c4ebe6a4ee1", "ee44487f58454dacb522b1e084ffb733", "d2c41e71a3f441deaed091b620ac5603", "3326b6141a1a4eba9f316df528a9b99a" ] }, "id": "986bc1a0-804f-4154-80f8-44cefbee1368", "outputId": "5dd7334b-4c71-465a-94d2-c3e95b9ddc58" }, "outputs": [], "source": [ "from huggingface_hub import hf_hub_download\n", "\n", "tokenizer_file_path = Path(LOCAL_DIR) / \"tokenizer.json\"\n", "if not tokenizer_file_path.exists():\n", " try:\n", " tokenizer_file_path = hf_hub_download(repo_id=REPO_ID, filename=\"tokenizer.json\", local_dir=LOCAL_DIR)\n", " except Exception as e:\n", " print(f\"Warning: failed to download tokenizer.json: {e}\")\n", " tokenizer_file_path = \"tokenizer.json\"" ] }, { "cell_type": "code", "execution_count": 19, "id": "_gBhxDtU_nxo", "metadata": { "id": "_gBhxDtU_nxo" }, "outputs": [ { "data": { "text/plain": [ "'<|START_OF_TURN_TOKEN|><|USER_TOKEN|>Give me a short introduction to large language models in 3 sentences.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|><|START_RESPONSE|>'" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenizer = TinyAyaTokenizer(\n", " tokenizer_file_path=Path(LOCAL_DIR) / \"tokenizer.json\",\n", " eos_token_id=TINY_AYA_CONFIG[\"eos_token_id\"],\n", " pad_token_id=TINY_AYA_CONFIG[\"pad_token_id\"],\n", " bos_token_id=TINY_AYA_CONFIG[\"bos_token_id\"],\n", ")\n", "\n", "prompt = apply_chat_template(\"Give me a short introduction to large language models in 3 sentences.\")\n", "input_token_ids = tokenizer.encode(prompt)\n", "text = tokenizer.decode(input_token_ids)\n", "text" ] }, { "cell_type": "markdown", "id": 
"c172f89f-d301-439f-b809-46169e5f5945", "metadata": { "id": "c172f89f-d301-439f-b809-46169e5f5945" }, "source": [ " \n", "# 4. Load pretrained weights" ] }, { "cell_type": "code", "execution_count": 20, "id": "75166128-5899-4995-9b88-9672e135650e", "metadata": { "id": "75166128-5899-4995-9b88-9672e135650e" }, "outputs": [], "source": [ "def load_weights_into_tiny_aya(model, param_config, params):\n", " def assign(left, right, tensor_name=\"unknown\"):\n", " if left.shape != right.shape:\n", " raise ValueError(\n", " f\"Shape mismatch in tensor '{tensor_name}'. Left: {left.shape}, Right: {right.shape}\"\n", " )\n", "\n", " with torch.no_grad():\n", " if isinstance(right, torch.Tensor):\n", " left.copy_(right.to(dtype=left.dtype, device=left.device))\n", " else:\n", " left.copy_(torch.as_tensor(right, dtype=left.dtype, device=left.device))\n", "\n", " return left\n", "\n", " model.tok_emb.weight = assign(\n", " model.tok_emb.weight,\n", " params[\"model.embed_tokens.weight\"],\n", " \"model.embed_tokens.weight\",\n", " )\n", "\n", " for l in range(param_config[\"n_layers\"]):\n", " block = model.trf_blocks[l]\n", " att = block.att\n", "\n", " # Q, K, V projections\n", " att.W_query.weight = assign(\n", " att.W_query.weight,\n", " params[f\"model.layers.{l}.self_attn.q_proj.weight\"],\n", " f\"model.layers.{l}.self_attn.q_proj.weight\",\n", " )\n", " att.W_key.weight = assign(\n", " att.W_key.weight,\n", " params[f\"model.layers.{l}.self_attn.k_proj.weight\"],\n", " f\"model.layers.{l}.self_attn.k_proj.weight\",\n", " )\n", " att.W_value.weight = assign(\n", " att.W_value.weight,\n", " params[f\"model.layers.{l}.self_attn.v_proj.weight\"],\n", " f\"model.layers.{l}.self_attn.v_proj.weight\",\n", " )\n", "\n", " # Output projection\n", " att.out_proj.weight = assign(\n", " att.out_proj.weight,\n", " params[f\"model.layers.{l}.self_attn.o_proj.weight\"],\n", " f\"model.layers.{l}.self_attn.o_proj.weight\",\n", " )\n", "\n", " # Feedforward weights\n", " 
block.ff.fc1.weight = assign(\n", " block.ff.fc1.weight,\n", " params[f\"model.layers.{l}.mlp.gate_proj.weight\"],\n", " f\"model.layers.{l}.mlp.gate_proj.weight\",\n", " )\n", " block.ff.fc2.weight = assign(\n", " block.ff.fc2.weight,\n", " params[f\"model.layers.{l}.mlp.up_proj.weight\"],\n", " f\"model.layers.{l}.mlp.up_proj.weight\",\n", " )\n", " block.ff.fc3.weight = assign(\n", " block.ff.fc3.weight,\n", " params[f\"model.layers.{l}.mlp.down_proj.weight\"],\n", " f\"model.layers.{l}.mlp.down_proj.weight\",\n", " )\n", "\n", " # Layernorm\n", " block.input_layernorm.weight = assign(\n", " block.input_layernorm.weight,\n", " params[f\"model.layers.{l}.input_layernorm.weight\"],\n", " f\"model.layers.{l}.input_layernorm.weight\",\n", " )\n", "\n", " # Final normalization and output head\n", " model.final_norm.weight = assign(\n", " model.final_norm.weight,\n", " params[\"model.norm.weight\"],\n", " \"model.norm.weight\",\n", " )\n", "\n", " if \"lm_head.weight\" in params:\n", " model.out_head.weight = assign(model.out_head.weight, params[\"lm_head.weight\"], \"lm_head.weight\")\n", " else:\n", " if param_config[\"tie_word_embeddings\"]:\n", " model.out_head.weight = model.tok_emb.weight\n", " print(\"Model uses weight tying.\")" ] }, { "cell_type": "code", "execution_count": 21, "id": "d1ad9fe4-1330-46b6-9d73-d0203065753f", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "b05100cfca06481b95c73d6878515f0e", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading (incomplete total...): 0.00B [00:00, ?B/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "b848b264fa8444ae93fda94c2bfe7f65", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Fetching 15 files: 0%| | 0/15 [00:00<?, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from huggingface_hub import snapshot_download\n", "\n", "# Download the pretrained weight files into the local directory\n", "repo_dir = snapshot_download(repo_id=REPO_ID, local_dir=LOCAL_DIR)" ] }, { "cell_type": "code", "execution_count": 22, "id": "load-weights-cell", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model uses weight tying.\n" ] } ], "source": [ "from safetensors.torch import load_file\n", "\n", "# Collect the weights from all .safetensors shards into a single dict\n", "weights_dict = {}\n", "for shard_path in sorted(Path(repo_dir).glob(\"*.safetensors\")):\n", " weights_dict.update(load_file(shard_path))\n", "\n", "load_weights_into_tiny_aya(model, TINY_AYA_CONFIG, weights_dict)\n", "del weights_dict" ] }, { "cell_type": "markdown", "id": "generate-text-heading", "metadata": {}, "source": [ " \n", "# 5. Generate text" ] }, { "cell_type": "code", "execution_count": 23, "id": "stop-ids-cell", "metadata": {}, "outputs": [], "source": [ "stop_ids = {\n", " tokenizer.eos_token_id,\n", " tokenizer._tok.token_to_id(\"<|END_OF_TURN_TOKEN|>\"),\n", "}\n", "stop_ids = {x for x in stop_ids if x is not None}\n", "\n", "\n", 
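"# Note: the generator below primes the KV cache with a single forward pass over\n",
"# the full prompt, then feeds only the newest token back in at each step and\n",
"# decodes greedily (argmax, no sampling); usage sketch:\n",
"#\n",
"#   for tok in generate_text_basic_stream(model, ids, max_new_tokens=8, stop_token_ids=stop_ids):\n",
"#       print(tokenizer.decode(tok.squeeze(0).tolist()), end=\"\", flush=True)\n",
"\n",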
"def generate_text_basic_stream(\n", " model,\n", " token_ids,\n", " max_new_tokens,\n", " stop_token_ids=None,\n", " context_size=None,\n", "):\n", " stop_token_ids = set(stop_token_ids or [])\n", "\n", " model.eval()\n", " with torch.no_grad():\n", " cache = KVCache(n_layers=model.cfg[\"n_layers\"])\n", " model.reset_kv_cache()\n", "\n", " # Prime the cache with the initial context\n", " logits = model(token_ids, cache=cache)\n", "\n", " for _ in range(max_new_tokens):\n", " next_token = torch.argmax(logits[:, -1], dim=-1, keepdim=True)\n", "\n", " if stop_token_ids and next_token.item() in stop_token_ids:\n", " break\n", "\n", " yield next_token\n", "\n", " token_ids = torch.cat([token_ids, next_token], dim=1)\n", " # Feed only the new token to the model; cache handles history\n", " logits = model(next_token, cache=cache)" ] }, { "cell_type": "code", "execution_count": 24, "id": "1c7a04fa-6aac-416b-8f63-f1e19227633d", "metadata": { "id": "1c7a04fa-6aac-416b-8f63-f1e19227633d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Large language models are advanced AI systems trained on vast amounts of text data to understand and generate human-like language. They use deep learning techniques, particularly transformer architectures, to process and predict text patterns, enabling tasks like translation, summarization, and conversational dialogue. These models have revolutionized natural language processing, powering applications from chatbots to content creation." 
] } ], "source": [ "prompt = apply_chat_template(\"Give me a short introduction to large language models in 3 sentences.\")\n", "input_token_ids = tokenizer.encode(prompt)\n", "input_token_ids_tensor = torch.tensor(input_token_ids, device=device).unsqueeze(0)\n", "\n", "\n", "if torch.cuda.is_available():\n", " torch.cuda.reset_peak_memory_stats()\n", "\n", "\n", "for token in generate_text_basic_stream(\n", " model=model,\n", " token_ids=input_token_ids_tensor,\n", " max_new_tokens=500,\n", " stop_token_ids=stop_ids\n", "):\n", " token_id = token.squeeze(0).tolist()\n", " print(\n", " tokenizer.decode(token_id),\n", " end=\"\",\n", " flush=True\n", " )\n", "\n", "if torch.cuda.is_available():\n", " def calc_gpu_gb(x):\n", " return f\"{x / 1024 / 1024 / 1024:.2f} GB\"\n", " \n", " print(f\"\\n\\nGPU memory used: {calc_gpu_gb(torch.cuda.max_memory_allocated())}\")" ] }, { "cell_type": "markdown", "id": "549324d6-5c71-4147-ae21-2e67675faa3d", "metadata": { "id": "549324d6-5c71-4147-ae21-2e67675faa3d" }, "source": [ " \n", "# What's next?" 
] }, { "cell_type": "markdown", "id": "e6edaaae-2de1-406c-8ffa-897cdfa3808c", "metadata": { "id": "e6edaaae-2de1-406c-8ffa-897cdfa3808c" }, "source": [ "- For those interested in a comprehensive guide on building a large language model from scratch and gaining a deeper understanding of its mechanics, you might like my [Build a Large Language Model (From Scratch)](http://mng.bz/orYv)\n", "\n", "" ] } ], "metadata": { "accelerator": "GPU", "colab": { "gpuType": "A100", "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.5" } }, "nbformat": 4, "nbformat_minor": 5 }