minor fixes (#246)

* removed duplicated whitespace

* Update ch07/01_main-chapter-code/ch07.ipynb

* Update ch07/05_dataset-generation/llama3-ollama.ipynb

* removed duplicated whitespace

* fixed title again

---------

Co-authored-by: Sebastian Raschka <mail@sebastianraschka.com>
Authored by Daniel Kleine on 2024-06-26 00:30:30 +02:00
Committed by GitHub
parent 9a9b3530c9
commit 81c843bdc0
10 changed files with 19 additions and 19 deletions


@@ -710,7 +710,7 @@
"- `[UNK]` to represent works that are not included in the vocabulary\n",
"\n",
"- Note that GPT-2 does not need any of these tokens mentioned above but only uses an `<|endoftext|>` token to reduce complexity\n",
"- The `<|endoftext|>` is analogous to the `[EOS]` token mentioned above\n",
"- The `<|endoftext|>` is analogous to the `[EOS]` token mentioned above\n",
"- GPT also uses the `<|endoftext|>` for padding (since we typically use a mask when training on batched inputs, we would not attend padded tokens anyways, so it does not matter what these tokens are)\n",
"- GPT-2 does not use an `<UNK>` token for out-of-vocabulary words; instead, GPT-2 uses a byte-pair encoding (BPE) tokenizer, which breaks down words into subword units which we will discuss in a later section\n",
"\n"