
Chapter 2: Working with Text Data

Main Chapter Code

  • 01_main-chapter-code contains the main chapter code and exercise solutions

Bonus Materials

  • 02_bonus_bytepair-encoder contains optional code to benchmark different byte pair encoder implementations.

  • 03_bonus_embedding-vs-matmul contains optional code showing that embedding layers and fully connected layers applied to one-hot encoded vectors are equivalent.

  • 04_bonus_dataloader-intuition contains optional code explaining the data loader more intuitively, using simple numbers rather than text.

  • 05_bpe-from-scratch contains code that implements and trains a GPT-2 BPE tokenizer from scratch.
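As a quick taste of the embedding-vs-matmul equivalence covered in the bonus material above, the following minimal PyTorch sketch (toy sizes, not taken from the notebook itself) shows that looking up token IDs in an `nn.Embedding` produces the same result as pushing one-hot vectors through an `nn.Linear` layer that shares the (transposed) weights:

```python
import torch

torch.manual_seed(123)
vocab_size, embed_dim = 5, 3  # toy sizes for illustration

# An embedding layer is a lookup table of shape (vocab_size, embed_dim)
embedding = torch.nn.Embedding(vocab_size, embed_dim)

# A bias-free linear layer given the same weights, transposed to match
# nn.Linear's (out_features, in_features) weight convention
linear = torch.nn.Linear(vocab_size, embed_dim, bias=False)
linear.weight = torch.nn.Parameter(embedding.weight.T)

token_ids = torch.tensor([2, 3, 1])
onehot = torch.nn.functional.one_hot(token_ids, num_classes=vocab_size).float()

# Row lookup and one-hot matrix multiplication give the same result
print(torch.allclose(embedding(token_ids), linear(onehot)))  # True
```

The practical upshot is that an embedding layer is just a more efficient implementation of this matrix multiplication: it selects rows directly instead of materializing one-hot vectors.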

In the video below, I provide a code-along session that covers some of the chapter contents as supplementary material.



Link to the video