Add Tiny Aya from scratch (#962)

Sebastian Raschka
2026-02-19 17:33:22 -05:00
committed by GitHub
parent 1ed48c2450
commit 62f0356e0d
8 changed files with 4500 additions and 2 deletions

View File

@@ -187,6 +187,7 @@ Several folders contain optional materials as a bonus for interested readers:
- [Qwen3 Dense and Mixture-of-Experts (MoE) From Scratch](ch05/11_qwen3/)
- [Gemma 3 From Scratch](ch05/12_gemma3/)
- [Olmo 3 From Scratch](ch05/13_olmo3/)
- [Tiny Aya From Scratch](ch05/15_tiny-aya/)
- [Chapter 5 with other LLMs as Drop-In Replacement (e.g., Llama 3, Qwen 3)](ch05/14_ch05_with_other_llms/)
- **Chapter 6: Finetuning for classification**
- [Additional Experiments Finetuning Different Layers and Using Larger Models](ch06/02_bonus_additional-experiments)

View File

@@ -0,0 +1,55 @@
# Tiny Aya 3.35B From Scratch

Tiny Aya is a new, "small" LLM by Cohere that is said to be the "most capable multi-lingual open-weight model" in the 3B-parameter size class. (Tiny Aya outperforms Qwen3-4B, Gemma 3 4B, and Ministral 3 3B according to the [announcement post](https://cohere.com/blog/cohere-labs-tiny-aya)).

This is a great model to run and experiment with locally. The only caveat is that while it's an open-weight model, its licensing terms are relatively restrictive and only allow non-commercial use.

That aside, Tiny Aya is a 3.35B-parameter model that comes in several flavors useful for personal and (non-commercial) research use:

- [tiny-aya-base](https://huggingface.co/CohereLabs/tiny-aya-base) (base model)
- [tiny-aya-global](https://huggingface.co/CohereLabs/tiny-aya-global) (best balance across languages and regions; notebook default)
- [tiny-aya-fire](https://huggingface.co/CohereLabs/tiny-aya-fire) (optimized for South Asian languages)
- [tiny-aya-water](https://huggingface.co/CohereLabs/tiny-aya-water) (optimized for European and Asia Pacific languages)
- [tiny-aya-earth](https://huggingface.co/CohereLabs/tiny-aya-earth) (optimized for West Asian and African languages)

More specifically, here's a list of languages the models are optimized for:

| Region | Languages | Optimized Model |
| ---------------- | ------------------------------------------------------------ | --------------- |
| **Asia Pacific** | Traditional Chinese, Cantonese, Vietnamese, Tagalog, Javanese, Khmer, Thai, Burmese, Malay, Korean, Lao, Indonesian, Simplified Chinese, Japanese | tiny-aya-water |
| **Africa** | Zulu, Amharic, Hausa, Igbo, Swahili, Xhosa, Wolof, Shona, Yoruba, Nigerian Pidgin, Malagasy | tiny-aya-earth |
| **South Asia** | Telugu, Marathi, Bengali, Tamil, Hindi, Punjabi, Gujarati, Urdu, Nepali | tiny-aya-fire |
| **Europe** | Catalan, Galician, Dutch, Danish, Finnish, Czech, Portuguese, French, Lithuanian, Slovak, Basque, English, Swedish, Polish, Spanish, Slovenian, Ukrainian, Greek, Bokmål, Romanian, Serbian, German, Italian, Russian, Irish, Hungarian, Bulgarian, Croatian, Estonian, Latvian, Welsh | tiny-aya-water |
| **West Asia** | Arabic, Maltese, Turkish, Hebrew, Persian | tiny-aya-earth |

Architecture-wise, Tiny Aya is a classic decoder-style transformer with a few noteworthy modifications (besides the obvious ones like SwiGLU and Grouped Query Attention):

1. **Parallel transformer blocks.** A parallel transformer block computes attention and the MLP from the same normalized input and then adds both outputs to the residual in a single step (see the sketches after this list). I assume this is done to reduce serial dependencies inside a layer and thus improve computational throughput.
2. **Sliding window attention.** Specifically, it uses a 3:1 local:global layer ratio, similar to Arcee Trinity and Olmo 3, with a window size of 4096 tokens. And, similar to Arcee, the sliding-window layers use RoPE, whereas the full-attention layers use NoPE.
3. **LayerNorm.** Most architectures have moved to RMSNorm, as it's computationally a bit cheaper and performs just as well. Tiny Aya keeps things more classic with a modified version of LayerNorm: the implementation here is like standard LayerNorm but without the shift (i.e., bias) parameter, as sketched below.
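
To make point 3 concrete, here is a minimal sketch of such a bias-free LayerNorm (for illustration only; the notebook's actual implementation may differ in details):

```python
import torch
import torch.nn as nn


class LayerNormNoShift(nn.Module):
    """LayerNorm with a learnable scale but no shift (bias) parameter."""

    def __init__(self, emb_dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(emb_dim))

    def forward(self, x):
        # Normalize over the last (embedding) dimension
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return self.weight * (x - mean) / torch.sqrt(var + self.eps)
```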
 
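
And to make point 1 concrete, here is the gist of a parallel transformer block as a minimal sketch (it omits attention masks, RoPE, and other details handled in the notebook):

```python
# Sequential (GPT-2-style) block:
#   x = x + attn(norm1(x))
#   x = x + mlp(norm2(x))
#
# Parallel block: attention and MLP share one normalized input,
# and both outputs are added to the residual in a single step.
def parallel_block(x, norm, attn, mlp):
    h = norm(x)
    return x + attn(h) + mlp(h)
```
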
## Files
The [standalone-tiny-aya.ipynb](standalone-tiny-aya.ipynb) is a standalone Jupyter notebook that implements the Tiny Aya architecture and loads the pre-trained weights.
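
For reference, loading the released checkpoint weights into the from-scratch model follows roughly the pattern below. This is only a sketch: `TinyAyaModel` and `load_weights_into_tiny_aya` are defined in the notebook, `TINY_AYA_CONFIG` is a placeholder name for the notebook's config dict, and the notebook itself may download the weights differently.

```python
from transformers import Cohere2ForCausalLM

# Tiny Aya uses the Cohere2 architecture in Hugging Face transformers
hf_model = Cohere2ForCausalLM.from_pretrained("CohereLabs/tiny-aya-global")

# Instantiate the from-scratch model and copy the pretrained weights over
model = TinyAyaModel(TINY_AYA_CONFIG)
load_weights_into_tiny_aya(model, TINY_AYA_CONFIG, hf_model.state_dict())
model.eval()
```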
The alternative [standalone-tiny-aya-plus-kv-cache.ipynb](standalone-tiny-aya-plus-kv-cache.ipynb) notebook adds a KV cache for better runtime performance (at the cost of some added code complexity). To learn more about KV caching, see my [Understanding and Coding the KV Cache in LLMs from Scratch](https://magazine.sebastianraschka.com/p/coding-the-kv-cache-in-llms) article.
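
In a nutshell, the KV cache stores the keys and values of all previously processed tokens so that each decoding step only has to compute the query, key, and value for the newest token. A conceptual single-head sketch (not the notebook's exact implementation):

```python
import torch


def attention_step_with_cache(q_new, k_new, v_new, cache):
    # cache["k"] / cache["v"] hold keys and values of all previous tokens
    cache["k"] = k_new if cache["k"] is None else torch.cat([cache["k"], k_new], dim=1)
    cache["v"] = v_new if cache["v"] is None else torch.cat([cache["v"], v_new], dim=1)
    scores = q_new @ cache["k"].transpose(-2, -1) / cache["k"].shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ cache["v"]
```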
<br>
To learn more about the architecture differences and read about comparisons with other architectures, see my [The Big LLM Architecture Comparison: From DeepSeek-V3 to Kimi K2: A Look At Modern LLM Architecture Design](https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison) article.

File diff suppressed because it is too large.

File diff suppressed because it is too large.

View File

@@ -0,0 +1,118 @@
# Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
# Source for "Build a Large Language Model From Scratch"
# - https://www.manning.com/books/build-a-large-language-model-from-scratch
# Code: https://github.com/rasbt/LLMs-from-scratch

import importlib
from pathlib import Path

import pytest
import torch

from llms_from_scratch.utils import import_definitions_from_notebook

transformers_installed = importlib.util.find_spec("transformers") is not None


@pytest.fixture
def import_notebook_defs():
    nb_dir = Path(__file__).resolve().parents[1]
    mod = import_definitions_from_notebook(nb_dir, "standalone-tiny-aya-plus-kv-cache.ipynb")
    return mod


@pytest.fixture
def dummy_input():
    torch.manual_seed(123)
    return torch.randint(0, 100, (1, 8))  # batch size 1, seq length 8


@pytest.fixture
def dummy_cfg_base():
    return {
        "vocab_size": 100,
        "context_length": 64,
        "emb_dim": 32,
        "n_heads": 4,
        "n_layers": 2,
        "hidden_dim": 64,
        "head_dim": 8,
        "n_kv_heads": 1,
        "attention_bias": False,
        "attention_dropout": 0.0,
        "sliding_window": 4,
        "layer_types": ["sliding_attention", "full_attention"],
        "rope_base": 10_000.0,
        "layer_norm_eps": 1e-5,
        "logit_scale": 1.0,
        "tie_word_embeddings": False,
        "dtype": torch.float32,
    }


@torch.inference_mode()
def test_dummy_tiny_aya_forward(dummy_cfg_base, dummy_input, import_notebook_defs):
    torch.manual_seed(123)
    model = import_notebook_defs.TinyAyaModel(dummy_cfg_base)
    out = model(dummy_input)
    assert out.shape == (1, dummy_input.size(1), dummy_cfg_base["vocab_size"]), \
        f"Expected shape (1, seq_len, vocab_size), got {out.shape}"


@torch.inference_mode()
@pytest.mark.skipif(not transformers_installed, reason="transformers not installed")
def test_tiny_aya_base_equivalence_with_transformers(import_notebook_defs):
    from transformers import Cohere2Config, Cohere2ForCausalLM

    # Tiny config so the test is fast
    cfg = {
        "vocab_size": 257,
        "context_length": 8,
        "emb_dim": 32,
        "n_heads": 4,
        "n_layers": 2,
        "hidden_dim": 64,
        "head_dim": 8,
        "n_kv_heads": 2,
        "sliding_window": 4,
        "layer_types": ["sliding_attention", "full_attention"],
        "dtype": torch.float32,
        "attention_bias": False,
        "attention_dropout": 0.0,
        "layer_norm_eps": 1e-5,
        "rope_base": 10_000.0,
        "logit_scale": 1.0,
        "tie_word_embeddings": False,
    }
    model = import_notebook_defs.TinyAyaModel(cfg)

    hf_cfg = Cohere2Config(
        vocab_size=cfg["vocab_size"],
        max_position_embeddings=cfg["context_length"],
        hidden_size=cfg["emb_dim"],
        num_attention_heads=cfg["n_heads"],
        num_hidden_layers=cfg["n_layers"],
        intermediate_size=cfg["hidden_dim"],
        num_key_value_heads=cfg["n_kv_heads"],
        attention_bias=cfg["attention_bias"],
        attention_dropout=cfg["attention_dropout"],
        layer_norm_eps=cfg["layer_norm_eps"],
        layer_types=cfg["layer_types"],
        sliding_window=cfg["sliding_window"],
        logit_scale=cfg["logit_scale"],
        tie_word_embeddings=cfg["tie_word_embeddings"],
        rope_parameters={"rope_type": "default", "rope_theta": cfg["rope_base"]},
        attn_implementation="eager",
        torch_dtype=torch.float32,
    )
    hf_model = Cohere2ForCausalLM(hf_cfg)
    hf_state = hf_model.state_dict()

    import_notebook_defs.load_weights_into_tiny_aya(model, cfg, hf_state)

    x = torch.randint(0, cfg["vocab_size"], (2, cfg["context_length"]), dtype=torch.long)
    ours_logits = model(x)
    theirs_logits = hf_model(x).logits

    torch.testing.assert_close(ours_logits, theirs_logits, rtol=1e-5, atol=1e-5)

View File

@@ -0,0 +1,117 @@
# Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
# Source for "Build a Large Language Model From Scratch"
# - https://www.manning.com/books/build-a-large-language-model-from-scratch
# Code: https://github.com/rasbt/LLMs-from-scratch

import importlib
from pathlib import Path

import pytest
import torch

from llms_from_scratch.utils import import_definitions_from_notebook

transformers_installed = importlib.util.find_spec("transformers") is not None


@pytest.fixture
def import_notebook_defs():
    nb_dir = Path(__file__).resolve().parents[1]
    mod = import_definitions_from_notebook(nb_dir, "standalone-tiny-aya.ipynb")
    return mod


@pytest.fixture
def dummy_input():
    torch.manual_seed(123)
    return torch.randint(0, 100, (1, 8))  # batch size 1, seq length 8


@pytest.fixture
def dummy_cfg_base():
    return {
        "vocab_size": 100,
        "context_length": 64,
        "emb_dim": 32,
        "n_heads": 4,
        "n_layers": 2,
        "hidden_dim": 64,
        "head_dim": 8,
        "n_kv_heads": 1,
        "attention_bias": False,
        "attention_dropout": 0.0,
        "sliding_window": 4,
        "layer_types": ["sliding_attention", "full_attention"],
        "rope_base": 10_000.0,
        "layer_norm_eps": 1e-5,
        "logit_scale": 1.0,
        "tie_word_embeddings": False,
        "dtype": torch.float32,
    }


@torch.inference_mode()
def test_dummy_tiny_aya_forward(dummy_cfg_base, dummy_input, import_notebook_defs):
    torch.manual_seed(123)
    model = import_notebook_defs.TinyAyaModel(dummy_cfg_base)
    out = model(dummy_input)
    assert out.shape == (1, dummy_input.size(1), dummy_cfg_base["vocab_size"]), \
        f"Expected shape (1, seq_len, vocab_size), got {out.shape}"


@torch.inference_mode()
@pytest.mark.skipif(not transformers_installed, reason="transformers not installed")
def test_tiny_aya_base_equivalence_with_transformers(import_notebook_defs):
    from transformers import Cohere2Config, Cohere2ForCausalLM

    # Tiny config so the test is fast
    cfg = {
        "vocab_size": 257,
        "context_length": 8,
        "emb_dim": 32,
        "n_heads": 4,
        "n_layers": 2,
        "hidden_dim": 64,
        "head_dim": 8,
        "n_kv_heads": 2,
        "sliding_window": 4,
        "layer_types": ["sliding_attention", "full_attention"],
        "dtype": torch.float32,
        "attention_bias": False,
        "attention_dropout": 0.0,
        "layer_norm_eps": 1e-5,
        "rope_base": 10_000.0,
        "logit_scale": 1.0,
        "tie_word_embeddings": False,
    }
    model = import_notebook_defs.TinyAyaModel(cfg)

    hf_cfg = Cohere2Config(
        vocab_size=cfg["vocab_size"],
        max_position_embeddings=cfg["context_length"],
        hidden_size=cfg["emb_dim"],
        num_attention_heads=cfg["n_heads"],
        num_hidden_layers=cfg["n_layers"],
        intermediate_size=cfg["hidden_dim"],
        num_key_value_heads=cfg["n_kv_heads"],
        attention_bias=cfg["attention_bias"],
        attention_dropout=cfg["attention_dropout"],
        layer_norm_eps=cfg["layer_norm_eps"],
        layer_types=cfg["layer_types"],
        sliding_window=cfg["sliding_window"],
        logit_scale=cfg["logit_scale"],
        tie_word_embeddings=cfg["tie_word_embeddings"],
        rope_parameters={"rope_type": "default", "rope_theta": cfg["rope_base"]},
        attn_implementation="eager",
        torch_dtype=torch.float32,
    )
    hf_model = Cohere2ForCausalLM(hf_cfg)
    hf_state = hf_model.state_dict()

    import_notebook_defs.load_weights_into_tiny_aya(model, cfg, hf_state)

    x = torch.randint(0, cfg["vocab_size"], (2, cfg["context_length"]), dtype=torch.long)
    ours_logits = model(x)
    theirs_logits = hf_model(x).logits

    torch.testing.assert_close(ours_logits, theirs_logits, rtol=1e-5, atol=1e-5)

View File

@@ -0,0 +1,225 @@
# Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
# Source for "Build a Large Language Model From Scratch"
# - https://www.manning.com/books/build-a-large-language-model-from-scratch
# Code: https://github.com/rasbt/LLMs-from-scratch

import importlib
from pathlib import Path

import torch

from llms_from_scratch.utils import import_definitions_from_notebook

try:
    from transformers import Cohere2Config, Cohere2ForCausalLM
except ImportError:
    Cohere2Config = None
    Cohere2ForCausalLM = None


def tiny_debug_config():
    return {
        "vocab_size": 257,
        "context_length": 8,
        "emb_dim": 32,
        "n_heads": 4,
        "n_layers": 2,
        "hidden_dim": 64,
        "head_dim": 8,
        "n_kv_heads": 2,
        "sliding_window": 4,
        "layer_types": ["sliding_attention", "full_attention"],
        "dtype": torch.float32,
        "attention_bias": False,
        "attention_dropout": 0.0,
        "layer_norm_eps": 1e-5,
        "rope_base": 10_000.0,
        "logit_scale": 1.0,
        "tie_word_embeddings": False,
    }


def _hf_config_from_dict(cfg):
    if Cohere2Config is None:
        raise ImportError("transformers is required for the Tiny Aya debugger.")
    return Cohere2Config(
        vocab_size=cfg["vocab_size"],
        max_position_embeddings=cfg["context_length"],
        hidden_size=cfg["emb_dim"],
        num_attention_heads=cfg["n_heads"],
        num_hidden_layers=cfg["n_layers"],
        intermediate_size=cfg["hidden_dim"],
        num_key_value_heads=cfg["n_kv_heads"],
        attention_bias=cfg["attention_bias"],
        attention_dropout=cfg["attention_dropout"],
        layer_norm_eps=cfg["layer_norm_eps"],
        sliding_window=cfg["sliding_window"],
        layer_types=cfg["layer_types"],
        logit_scale=cfg["logit_scale"],
        tie_word_embeddings=cfg.get("tie_word_embeddings", False),
        rope_parameters={"rope_type": "default", "rope_theta": cfg["rope_base"]},
        torch_dtype=cfg.get("dtype", torch.float32),
    )


def load_notebook_defs(nb_name="standalone-tiny-aya.ipynb"):
    nb_dir = Path(__file__).resolve().parents[1]
    return import_definitions_from_notebook(nb_dir, nb_name)


def build_tiny_aya_pair(import_notebook_defs, cfg, hf_checkpoint=None):
    if Cohere2ForCausalLM is None:
        raise ImportError("transformers is required for the Tiny Aya debugger.")
    ours = import_notebook_defs.TinyAyaModel(cfg)
    hf_cfg = _hf_config_from_dict(cfg)
    if hf_checkpoint:
        hf_model = Cohere2ForCausalLM.from_pretrained(
            hf_checkpoint,
            torch_dtype=cfg.get("dtype", torch.float32),
            attn_implementation="eager",
        )
    else:
        hf_model = Cohere2ForCausalLM(hf_cfg)
    import_notebook_defs.load_weights_into_tiny_aya(ours, cfg, hf_model.state_dict())
    ours.eval()
    hf_model.eval()
    return ours, hf_model


def _attach_debug_hooks(model, is_hf):
    traces = {}
    handles = []
    def hook(name):
        def _record(_, __, output):
            # Some HF submodules return tuples; keep only the hidden-states tensor
            if isinstance(output, tuple):
                output = output[0]
            traces[name] = output.detach().to(torch.float32).cpu()
        return _record

    if is_hf:
        core = model.model
        handles.append(core.embed_tokens.register_forward_hook(hook("embedding")))
        for idx, layer in enumerate(core.layers):
            handles.append(layer.register_forward_hook(hook(f"block_{idx}")))
        handles.append(core.norm.register_forward_hook(hook("final_norm")))
        handles.append(model.lm_head.register_forward_hook(hook("logits")))
    else:
        handles.append(model.tok_emb.register_forward_hook(hook("embedding")))
        blocks = getattr(model, "trf_blocks", None)
        if blocks is None:
            blocks = getattr(model, "blocks", None)
        if blocks is None:
            raise AttributeError("Could not locate Tiny Aya blocks on the local model.")
        for idx, block in enumerate(blocks):
            handles.append(block.register_forward_hook(hook(f"block_{idx}")))
        handles.append(model.final_norm.register_forward_hook(hook("final_norm")))
        handles.append(model.out_head.register_forward_hook(hook("logits")))
    return traces, handles


def _layer_sort_key(name):
    if name == "embedding":
        return (0, 0)
    if name.startswith("block_"):
        idx = int(name.split("_")[1])
        return (1, idx)
    if name == "final_norm":
        return (2, 0)
    if name == "logits":
        return (3, 0)
    return (4, name)


def layerwise_differences(ours, hf_model, input_ids, rtol=1e-5, atol=1e-5):
    ours_traces, ours_handles = _attach_debug_hooks(ours, is_hf=False)
    hf_traces, hf_handles = _attach_debug_hooks(hf_model, is_hf=True)
    try:
        with torch.inference_mode():
            ours(input_ids)
            hf_model(input_ids)
    finally:
        for h in ours_handles + hf_handles:
            h.remove()

    layer_names = sorted(set(ours_traces) | set(hf_traces), key=_layer_sort_key)
    results = []
    for name in layer_names:
        ours_tensor = ours_traces.get(name)
        hf_tensor = hf_traces.get(name)
        if ours_tensor is None or hf_tensor is None:
            results.append(
                {
                    "name": name,
                    "status": "missing",
                    "ours_shape": None if ours_tensor is None else tuple(ours_tensor.shape),
                    "hf_shape": None if hf_tensor is None else tuple(hf_tensor.shape),
                    "max_diff": None,
                    "mean_abs_diff": None,
                }
            )
            continue
        if ours_tensor.shape != hf_tensor.shape:
            results.append(
                {
                    "name": name,
                    "status": "shape_mismatch",
                    "ours_shape": tuple(ours_tensor.shape),
                    "hf_shape": tuple(hf_tensor.shape),
                    "max_diff": None,
                    "mean_abs_diff": None,
                }
            )
            continue
        diff = (ours_tensor - hf_tensor).abs()
        max_diff = float(diff.max().item())
        mean_diff = float(diff.mean().item())
        allclose = torch.allclose(ours_tensor, hf_tensor, rtol=rtol, atol=atol)
        results.append(
            {
                "name": name,
                "status": "ok" if allclose else "mismatch",
                "ours_shape": tuple(ours_tensor.shape),
                "hf_shape": tuple(hf_tensor.shape),
                "max_diff": max_diff,
                "mean_abs_diff": mean_diff,
            }
        )
    return results


def format_report(differences):
    lines = []
    for diff in sorted(differences, key=lambda d: _layer_sort_key(d["name"])):
        if diff["status"] == "ok":
            lines.append(f"[OK] {diff['name']}: max={diff['max_diff']:.2e}, mean={diff['mean_abs_diff']:.2e}")
        elif diff["status"] == "mismatch":
            lines.append(f"[DIFF] {diff['name']}: max={diff['max_diff']:.2e}, mean={diff['mean_abs_diff']:.2e}")
        elif diff["status"] == "shape_mismatch":
            lines.append(f"[SHAPE] {diff['name']}: ours={diff['ours_shape']}, hf={diff['hf_shape']}")
        else:
            lines.append(f"[MISSING] {diff['name']}: ours={diff['ours_shape']}, hf={diff['hf_shape']}")
    return "\n".join(lines)


if __name__ == "__main__":
    transformers_available = importlib.util.find_spec("transformers") is not None
    if not transformers_available:
        raise SystemExit("transformers is not installed; install it to run the debugger.")

    import_notebook_defs = load_notebook_defs()
    cfg = tiny_debug_config()
    ours_model, hf_model = build_tiny_aya_pair(import_notebook_defs, cfg)

    torch.manual_seed(0)
    input_ids = torch.randint(0, cfg["vocab_size"], (1, cfg["context_length"]), dtype=torch.long)

    diffs = layerwise_differences(ours_model, hf_model, input_ids)
    print(format_report(diffs))

View File

@@ -37,6 +37,23 @@ def _extract_imports(src: str):
def _extract_defs_and_classes_from_code(src):
    def _is_header_complete(header_lines):
        header = "\n".join(header_lines).rstrip()
        if not header.endswith(":"):
            return False
        # Track bracket balance for multiline signatures
        # like:
        # def fn(
        #     arg,
        # ):
        balance = (
            header.count("(") - header.count(")")
            + header.count("[") - header.count("]")
            + header.count("{") - header.count("}")
        )
        return balance <= 0

    lines = src.splitlines()
    kept = []
    i = 0
@@ -47,14 +64,22 @@ def _extract_defs_and_classes_from_code(src):
            j = i + 1
            while j < len(lines) and not lines[j].strip():
                j += 1
            if j < len(lines) and lines[j].lstrip().startswith(("def ", "class ", "async def ")):
                kept.append(line)
                i += 1
                continue
        if stripped.startswith(("def ", "class ", "async def ")):
            kept.append(line)
            base_indent = len(line) - len(stripped)
            i += 1
            # Handle multiline signatures before consuming the function/class body.
            header_lines = [line]
            while i < len(lines) and not _is_header_complete(header_lines):
                header_lines.append(lines[i])
                kept.append(lines[i])
                i += 1
            while i < len(lines):
                nxt = lines[i]
                if nxt.strip() == "":
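
For illustration, this change lets the notebook-definition extractor also pick up `async def` definitions and keep multiline signatures intact, e.g., a hypothetical notebook cell like the following:

```python
async def generate_tokens(
    model,
    input_ids,
    max_new_tokens=32,
):
    ...
```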