add more notes and embed figures externally to save space

This commit is contained in:
rasbt
2024-03-17 09:08:38 -05:00
parent b655e628a2
commit d60da19fd0
51 changed files with 357 additions and 78 deletions


@@ -41,12 +41,20 @@
"print(\"tiktoken version:\", version(\"tiktoken\"))"
]
},
{
"cell_type": "markdown",
"id": "5a42fbfd-e3c2-43c2-bc12-f5f870a0b10a",
"metadata": {},
"source": [
"- This chapter covers data preparation and sampling to get input data \"ready\" for the LLM"
]
},
{
"cell_type": "markdown",
"id": "628b2922-594d-4ff9-bd82-04f1ebdf41f5",
"metadata": {},
"source": [
"<img src=\"figures/1.webp\" width=\"500px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/01.webp\" width=\"500px\">"
]
},
{
@@ -57,14 +65,6 @@
"## 2.1 Understanding word embeddings"
]
},
{
"cell_type": "markdown",
"id": "ba08d16f-f237-4166-bf89-0e9fe703e7b4",
"metadata": {},
"source": [
"<img src=\"figures/2.webp\" width=\"500px\">"
]
},
{
"cell_type": "markdown",
"id": "0b6816ae-e927-43a9-b4dd-e47a9b0e1cf6",
@@ -73,12 +73,37 @@
"- No code in this section"
]
},
{
"cell_type": "markdown",
"id": "4f69dab7-a433-427a-9e5b-b981062d6296",
"metadata": {},
"source": [
"- There are many forms of embeddings; in this book, we focus on text embeddings"
]
},
{
"cell_type": "markdown",
"id": "ba08d16f-f237-4166-bf89-0e9fe703e7b4",
"metadata": {},
"source": [
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/02.webp\" width=\"500px\">"
]
},
{
"cell_type": "markdown",
"id": "288c4faf-b93a-4616-9276-7a4aa4b5e9ba",
"metadata": {},
"source": [
"- LLMs work with embeddings in high-dimensional spaces (i.e., thousands of dimensions)\n",
"- Since we can't visualize such high-dimensional spaces (we humans think in 1, 2, or 3 dimensions), the figure below illustrates a 2-dimensional embedding space"
]
},
{
"cell_type": "markdown",
"id": "d6b80160-1f10-4aad-a85e-9c79444de9e6",
"metadata": {},
"source": [
"<img src=\"figures/3.webp\" width=\"300px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/03.webp\" width=\"300px\">"
]
},
{
@@ -89,12 +114,20 @@
"## 2.2 Tokenizing text"
]
},
{
"cell_type": "markdown",
"id": "f9c90731-7dc9-4cd3-8c4a-488e33b48e80",
"metadata": {},
"source": [
"- In this section, we tokenize text, which means breaking text into smaller units, such as individual words and punctuation characters"
]
},
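The whitespace-and-punctuation splitting described above can be sketched with Python's `re` module. This is a minimal illustration of the idea, not a complete tokenizer; the sample sentence is arbitrary:

```python
import re

text = "Hello, world. Is this-- a test?"

# Split on punctuation, double dashes, and whitespace,
# keeping the delimiters as their own tokens
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)

# Drop empty strings and pure-whitespace entries
tokens = [t.strip() for t in tokens if t.strip()]
print(tokens)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```

Keeping punctuation as separate tokens (rather than discarding it) lets the model learn sentence boundaries and phrasing.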
{
"cell_type": "markdown",
"id": "09872fdb-9d4e-40c4-949d-52a01a43ec4b",
"metadata": {},
"source": [
"<img src=\"figures/4.webp\" width=\"300px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/04.webp\" width=\"300px\">"
]
},
{
@@ -261,7 +294,7 @@
"id": "6cbe9330-b587-4262-be9f-497a84ec0e8a",
"metadata": {},
"source": [
"<img src=\"figures/5.webp\" width=\"350px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/05.webp\" width=\"350px\">"
]
},
{
@@ -318,12 +351,20 @@
"## 2.3 Converting tokens into token IDs"
]
},
{
"cell_type": "markdown",
"id": "a5204973-f414-4c0d-87b0-cfec1f06e6ff",
"metadata": {},
"source": [
"- Next, we convert the text tokens into token IDs that we can process via embedding layers later"
]
},
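The token-to-ID mapping can be sketched by building a vocabulary from the sorted set of unique tokens; the token list below is a made-up example:

```python
all_tokens = ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']

# Assign each unique token an integer ID in sorted order
vocab = {token: idx for idx, token in enumerate(sorted(set(all_tokens)))}

# Map a token sequence to its IDs
ids = [vocab[t] for t in ['Hello', ',', 'world', '.']]
print(ids)
```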
{
"cell_type": "markdown",
"id": "177b041d-f739-43b8-bd81-0443ae3a7f8d",
"metadata": {},
"source": [
"<img src=\"figures/6.webp\" width=\"500px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/06.webp\" width=\"500px\">"
]
},
{
@@ -444,12 +485,20 @@
" break"
]
},
{
"cell_type": "markdown",
"id": "3b1dc314-351b-476a-9459-0ec9ddc29b19",
"metadata": {},
"source": [
"- Below, we illustrate the tokenization of a short sample text using a small vocabulary:"
]
},
{
"cell_type": "markdown",
"id": "67407a9f-0202-4e7c-9ed7-1b3154191ebc",
"metadata": {},
"source": [
"<img src=\"figures/7.webp\" width=\"500px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/07.webp\" width=\"500px\">"
]
},
{
@@ -485,12 +534,21 @@
" return text"
]
},
{
"cell_type": "markdown",
"id": "dee7a1e5-b54f-4ca1-87ef-3d663c4ee1e7",
"metadata": {},
"source": [
"- The `encode` function turns text into token IDs\n",
"- The `decode` function turns token IDs back into text"
]
},
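A minimal tokenizer class along the lines of the chapter's implementation might look as follows; the regex and vocabulary here are illustrative sketches:

```python
import re

class SimpleTokenizer:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        # Split text into tokens, then look up each token's ID
        tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        tokens = [t.strip() for t in tokens if t.strip()]
        return [self.str_to_int[t] for t in tokens]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        # Remove the extra space inserted before punctuation
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

vocab = {token: idx for idx, token in
         enumerate(sorted({'Hello', ',', 'world', '.'}))}
tokenizer = SimpleTokenizer(vocab)
ids = tokenizer.encode("Hello, world.")
print(ids, tokenizer.decode(ids))
```

Note that `encode` followed by `decode` recovers the original text, which is a useful sanity check for any tokenizer.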
{
"cell_type": "markdown",
"id": "cc21d347-ec03-4823-b3d4-9d686e495617",
"metadata": {},
"source": [
"<img src=\"figures/8.webp\" width=\"500px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/08.webp\" width=\"500px\">"
]
},
{
@@ -582,12 +640,20 @@
"## 2.4 Adding special context tokens"
]
},
{
"cell_type": "markdown",
"id": "863d6d15-a3e2-44e0-b384-bb37f17cf443",
"metadata": {},
"source": [
"- It's useful to add some \"special\" tokens for unknown words and to denote the end of a text"
]
},
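Extending a vocabulary with the two special tokens can be sketched as follows; the base vocabulary is a made-up example:

```python
base_vocab = {"Hello": 0, "world": 1, ",": 2, ".": 3}

vocab = dict(base_vocab)
vocab["<|endoftext|>"] = len(vocab)  # marks boundaries between texts
vocab["<|unk|>"] = len(vocab)        # stands in for out-of-vocabulary words

def to_ids(tokens, vocab):
    # Fall back to the <|unk|> ID for tokens not in the vocabulary
    unk = vocab["<|unk|>"]
    return [vocab.get(t, unk) for t in tokens]

ids = to_ids(["Hello", "there", "<|endoftext|>"], vocab)
print(ids)  # "there" is unknown and maps to <|unk|>
```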
{
"cell_type": "markdown",
"id": "aa7fc96c-e1fd-44fb-b7f5-229d7c7922a4",
"metadata": {},
"source": [
"<img src=\"figures/9.webp\" width=\"500px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/09.webp\" width=\"500px\">"
]
},
{
@@ -609,12 +675,20 @@
"\n"
]
},
{
"cell_type": "markdown",
"id": "a336b43b-7173-49e7-bd80-527ad4efb271",
"metadata": {},
"source": [
"- We use the `<|endoftext|>` token between two independent sources of text:"
]
},
{
"cell_type": "markdown",
"id": "52442951-752c-4855-9752-b121a17fef55",
"metadata": {},
"source": [
"<img src=\"figures/10.webp\" width=\"500px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/10.webp\" width=\"500px\">"
]
},
{
@@ -953,12 +1027,20 @@
"print(strings)"
]
},
{
"cell_type": "markdown",
"id": "e8c2e7b4-6a22-42aa-8e4d-901f06378d4a",
"metadata": {},
"source": [
"- BPE tokenizers break down unknown words into subwords and individual characters:"
]
},
{
"cell_type": "markdown",
"id": "c082d41f-33d7-4827-97d8-993d5a84bb3c",
"metadata": {},
"source": [
"<img src=\"figures/11.webp\" width=\"300px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/11.webp\" width=\"300px\">"
]
},
{
@@ -969,12 +1051,20 @@
"## 2.6 Data sampling with a sliding window"
]
},
{
"cell_type": "markdown",
"id": "509d9826-6384-462e-aa8a-a7c73cd6aad0",
"metadata": {},
"source": [
"- We train LLMs to generate one word at a time, so we prepare the training data accordingly, such that the next word in a sequence represents the target to predict:"
]
},
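The next-word prediction task sketched above pairs each growing context with the token that follows it; the token IDs below are a made-up example:

```python
token_ids = [40, 367, 2885, 1464, 1807]  # illustrative encoded text
context_size = 4

pairs = []
for i in range(1, context_size + 1):
    # Everything up to position i is the context; token i is the target
    context, target = token_ids[:i], token_ids[i]
    pairs.append((context, target))
    print(context, "---->", target)
```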
{
"cell_type": "markdown",
"id": "39fb44f4-0c43-4a6a-9c2f-9cf31452354c",
"metadata": {},
"source": [
"<img src=\"figures/12.webp\" width=\"400px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/12.webp\" width=\"400px\">"
]
},
{
@@ -1101,14 +1191,6 @@
" print(tokenizer.decode(context), \"---->\", tokenizer.decode([desired]))"
]
},
{
"cell_type": "markdown",
"id": "b59f90fe-fa73-4c2d-bd9b-ce7c2ce2ba00",
"metadata": {},
"source": [
"<img src=\"figures/13.webp\" width=\"500px\">"
]
},
{
"cell_type": "markdown",
"id": "210d2dd9-fc20-4927-8d3d-1466cf41aae1",
@@ -1145,6 +1227,16 @@
"print(\"PyTorch version:\", torch.__version__)"
]
},
{
"cell_type": "markdown",
"id": "0c9a3d50-885b-49bc-b791-9f5cc8bc7b7c",
"metadata": {},
"source": [
"- We use a sliding window approach where we slide the window one word at a time (this is also known as `stride=1`):\n",
"\n",
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/13.webp\" width=\"500px\">"
]
},
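The sliding-window extraction with `stride=1` can be sketched in plain Python (the chapter's dataset class wraps the same idea in a PyTorch `Dataset`); helper name and sample IDs are illustrative:

```python
def sliding_windows(ids, max_length, stride):
    # Each input chunk of length max_length is paired with the
    # same chunk shifted right by one token (the targets)
    inputs, targets = [], []
    for i in range(0, len(ids) - max_length, stride):
        inputs.append(ids[i:i + max_length])
        targets.append(ids[i + 1:i + max_length + 1])
    return inputs, targets

x, y = sliding_windows(list(range(8)), max_length=4, stride=1)
print(x[0], "->", y[0])
```

With `stride=1`, consecutive input chunks overlap in all but one position.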
{
"cell_type": "markdown",
"id": "92ac652d-7b38-4843-9fbd-494cdc8ec12c",
@@ -1268,12 +1360,20 @@
"print(second_batch)"
]
},
{
"cell_type": "markdown",
"id": "b006212f-de45-468d-bdee-5806216d1679",
"metadata": {},
"source": [
"- An example using a stride equal to the context length (here: 4) is shown below:"
]
},
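Setting the stride equal to the context length produces non-overlapping chunks, which can be sketched with plain slicing (sample IDs are illustrative):

```python
ids = list(range(12))
max_length = 4

# Advance the window by a full context length so inputs never overlap
inputs = [ids[i:i + max_length]
          for i in range(0, len(ids) - max_length, max_length)]
targets = [ids[i + 1:i + max_length + 1]
           for i in range(0, len(ids) - max_length, max_length)]
print(inputs)
```

Avoiding overlap means no token position appears in more than one input chunk, which can reduce overfitting during training.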
{
"cell_type": "markdown",
"id": "9cb467e0-bdcd-4dda-b9b0-a738c5d33ac3",
"metadata": {},
"source": [
"<img src=\"figures/14.webp\" width=\"500px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/14.webp\" width=\"500px\">"
]
},
{
@@ -1349,7 +1449,7 @@
"id": "e85089aa-8671-4e5f-a2b3-ef252004ee4c",
"metadata": {},
"source": [
"<img src=\"figures/15.webp\" width=\"400px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/15.webp\" width=\"400px\">"
]
},
{
@@ -1489,12 +1589,30 @@
"print(embedding_layer(input_ids))"
]
},
{
"cell_type": "markdown",
"id": "be97ced4-bd13-42b7-866a-4d699a17e155",
"metadata": {},
"source": [
"- An embedding layer is essentially a look-up operation:"
]
},
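The look-up interpretation can be verified directly (assuming PyTorch is installed; layer sizes and IDs below are arbitrary):

```python
import torch

torch.manual_seed(123)
embedding = torch.nn.Embedding(num_embeddings=6, embedding_dim=3)
ids = torch.tensor([2, 3, 1])

# The forward pass simply selects rows of the weight matrix
looked_up = embedding(ids)
rows = embedding.weight[ids]
print(torch.equal(looked_up, rows))  # True
```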
{
"cell_type": "markdown",
"id": "f33c2741-bf1b-4c60-b7fd-61409d556646",
"metadata": {},
"source": [
"<img src=\"figures/16.webp\" width=\"500px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/16.webp\" width=\"500px\">"
]
},
{
"cell_type": "markdown",
"id": "08218d9f-aa1a-4afb-a105-72ff96a54e73",
"metadata": {},
"source": [
"- **You may be interested in the bonus content comparing embedding layers with regular linear layers: [../02_bonus_efficient-multihead-attention](../02_bonus_efficient-multihead-attention)**"
]
},
{
@@ -1505,12 +1623,28 @@
"## 2.8 Encoding word positions"
]
},
{
"cell_type": "markdown",
"id": "24940068-1099-4698-bdc0-e798515e2902",
"metadata": {},
"source": [
"- Embedding layers convert IDs into identical vector representations regardless of where they are located in the input sequence:"
]
},
{
"cell_type": "markdown",
"id": "9e0b14a2-f3f3-490e-b513-f262dbcf94fa",
"metadata": {},
"source": [
"<img src=\"figures/17.webp\" width=\"400px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/17.webp\" width=\"400px\">"
]
},
{
"cell_type": "markdown",
"id": "92a7d7fe-38a5-46e6-8db6-b688887b0430",
"metadata": {},
"source": [
"- Positional embeddings are combined with the token embedding vector to form the input embeddings for a large language model:"
]
},
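Combining token and positional embeddings can be sketched as follows (assuming PyTorch is installed; the dimensions and token IDs are arbitrary illustrations):

```python
import torch

torch.manual_seed(123)
vocab_size, context_length, output_dim = 10, 4, 8

token_embedding = torch.nn.Embedding(vocab_size, output_dim)
pos_embedding = torch.nn.Embedding(context_length, output_dim)

token_ids = torch.tensor([[1, 5, 2, 7]])  # batch of one sequence

tok = token_embedding(token_ids)                   # shape (1, 4, 8)
pos = pos_embedding(torch.arange(context_length))  # shape (4, 8)

# Broadcasting adds the same positional vectors
# to every sequence in the batch
input_embeddings = tok + pos
print(input_embeddings.shape)
```

The result has one embedding vector per token position, now carrying both identity and position information.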
{
@@ -1518,7 +1652,7 @@
"id": "48de37db-d54d-45c4-ab3e-88c0783ad2e4",
"metadata": {},
"source": [
"<img src=\"figures/18.webp\" width=\"500px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/18.webp\" width=\"500px\">"
]
},
{
@@ -1679,12 +1813,21 @@
"print(input_embeddings.shape)"
]
},
{
"cell_type": "markdown",
"id": "1fbda581-6f9b-476f-8ea7-d244e6a4eaec",
"metadata": {},
"source": [
"- In the initial phase of the input processing workflow, the input text is segmented into separate tokens\n",
"- Following this segmentation, these tokens are transformed into token IDs based on a predefined vocabulary:"
]
},
{
"cell_type": "markdown",
"id": "d1bb0f7e-460d-44db-b366-096adcd84fff",
"metadata": {},
"source": [
"<img src=\"figures/19.webp\" width=\"400px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/19.webp\" width=\"400px\">"
]
},
{
@@ -1722,7 +1865,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.2"
"version": "3.10.6"
}
},
"nbformat": 4,

Binary file not shown.

Before

Width:  |  Height:  |  Size: 24 KiB
