LLMs-from-scratch/ch02/05_bpe-from-scratch
Sebastian Raschka 14c7afaa58 Fix GitHub CI timeout issue for link checker (#937)
* Fix GitHub CI timeout issue for link checker

* update problematic links
2026-01-02 14:34:31 -06:00

Byte Pair Encoding (BPE) Tokenizer From Scratch

  • bpe-from-scratch-simple.ipynb contains optional (bonus) code that explains and demonstrates how the BPE tokenizer works under the hood; it is geared toward simplicity and readability.

  • bpe-from-scratch.ipynb implements a more sophisticated (and much more complicated) BPE tokenizer that behaves similarly to tiktoken with respect to all the edge cases; it also has additional functionality for loading the official GPT-2 vocab.
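
For orientation, the core idea that both notebooks build on can be sketched in a few lines. The code below is a minimal illustration and not the code from either notebook: the function names (`train_bpe`, `merge_pair`, `encode`, `decode`) are hypothetical, it operates on raw UTF-8 bytes, and it skips the pre-tokenization, special tokens, and GPT-2 vocab loading that a tiktoken-compatible tokenizer has to handle.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules from `text` (illustrative only)."""
    ids = list(text.encode("utf-8"))  # start from raw bytes, token IDs 0-255
    merges = {}                       # (left_id, right_id) -> new_id, in learned order
    next_id = 256
    for _ in range(num_merges):
        pair_counts = Counter(zip(ids, ids[1:]))  # count adjacent ID pairs
        if not pair_counts:
            break
        best_pair, count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another merge would not compress anything
        ids = merge_pair(ids, best_pair, next_id)
        merges[best_pair] = next_id
        next_id += 1
    return merges


def encode(text, merges):
    """Apply the learned merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts preserve insertion order
        ids = merge_pair(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Expand token IDs back into bytes, then decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


text = "the cat in the hat"
merges = train_bpe(text, num_merges=5)
token_ids = encode(text, merges)
print(len(text.encode("utf-8")), "bytes ->", len(token_ids), "tokens")
print(decode(token_ids, merges) == text)
```

Training repeatedly finds the most frequent adjacent pair and assigns it a new token ID; encoding replays those merges in the same order, and decoding expands each ID back into its bytes. Production tokenizers add a lot on top of this, which is where the second notebook's extra complexity comes from.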