add more notes and embed figures externally to save space

This commit is contained in:
rasbt
2024-03-17 09:08:38 -05:00
parent b655e628a2
commit d60da19fd0
51 changed files with 357 additions and 78 deletions


@@ -41,12 +41,20 @@
"print(\"tiktoken version:\", version(\"tiktoken\"))"
]
},
{
"cell_type": "markdown",
"id": "5a42fbfd-e3c2-43c2-bc12-f5f870a0b10a",
"metadata": {},
"source": [
"- This chapter covers data preparation and sampling to get input data \"ready\" for the LLM"
]
},
{
"cell_type": "markdown",
"id": "628b2922-594d-4ff9-bd82-04f1ebdf41f5",
"metadata": {},
"source": [
"<img src=\"figures/1.webp\" width=\"500px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/01.webp\" width=\"500px\">"
]
},
{
@@ -57,14 +65,6 @@
"## 2.1 Understanding word embeddings"
]
},
{
"cell_type": "markdown",
"id": "ba08d16f-f237-4166-bf89-0e9fe703e7b4",
"metadata": {},
"source": [
"<img src=\"figures/2.webp\" width=\"500px\">"
]
},
{
"cell_type": "markdown",
"id": "0b6816ae-e927-43a9-b4dd-e47a9b0e1cf6",
@@ -73,12 +73,37 @@
"- No code in this section"
]
},
{
"cell_type": "markdown",
"id": "4f69dab7-a433-427a-9e5b-b981062d6296",
"metadata": {},
"source": [
"- There are many forms of embeddings; in this book, we focus on text embeddings"
]
},
{
"cell_type": "markdown",
"id": "ba08d16f-f237-4166-bf89-0e9fe703e7b4",
"metadata": {},
"source": [
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/02.webp\" width=\"500px\">"
]
},
{
"cell_type": "markdown",
"id": "288c4faf-b93a-4616-9276-7a4aa4b5e9ba",
"metadata": {},
"source": [
"- LLMs work with embeddings in high-dimensional spaces (i.e., thousands of dimensions)\n",
"- Since we can't visualize such high-dimensional spaces (we humans think in 1, 2, or 3 dimensions), the figure below illustrates a 2-dimensional embedding space"
]
},
{
"cell_type": "markdown",
"id": "d6b80160-1f10-4aad-a85e-9c79444de9e6",
"metadata": {},
"source": [
"<img src=\"figures/3.webp\" width=\"300px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/03.webp\" width=\"300px\">"
]
},
{
@@ -89,12 +114,20 @@
"## 2.2 Tokenizing text"
]
},
{
"cell_type": "markdown",
"id": "f9c90731-7dc9-4cd3-8c4a-488e33b48e80",
"metadata": {},
"source": [
"- In this section, we tokenize text, which means breaking text into smaller units, such as individual words and punctuation characters"
]
},
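The whitespace-and-punctuation splitting described above can be sketched with Python's `re` module. This is a minimal illustration of the idea, not a complete tokenizer; the sample sentence is arbitrary:

```python
import re

text = "Hello, world. Is this-- a test?"

# Split on punctuation, double dashes, and whitespace,
# keeping the delimiters as their own tokens
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)

# Drop empty strings and pure-whitespace entries
tokens = [t.strip() for t in tokens if t.strip()]
print(tokens)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```

Keeping punctuation as separate tokens (rather than discarding it) lets the model learn sentence boundaries and phrasing.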
{
"cell_type": "markdown",
"id": "09872fdb-9d4e-40c4-949d-52a01a43ec4b",
"metadata": {},
"source": [
"<img src=\"figures/4.webp\" width=\"300px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/04.webp\" width=\"300px\">"
]
},
{
@@ -261,7 +294,7 @@
"id": "6cbe9330-b587-4262-be9f-497a84ec0e8a",
"metadata": {},
"source": [
"<img src=\"figures/5.webp\" width=\"350px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/05.webp\" width=\"350px\">"
]
},
{
@@ -318,12 +351,20 @@
"## 2.3 Converting tokens into token IDs"
]
},
{
"cell_type": "markdown",
"id": "a5204973-f414-4c0d-87b0-cfec1f06e6ff",
"metadata": {},
"source": [
"- Next, we convert the text tokens into token IDs that we can process via embedding layers later"
]
},
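The token-to-ID mapping can be sketched by building a vocabulary from the sorted set of unique tokens; the token list below is a made-up example:

```python
all_tokens = ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']

# Assign each unique token an integer ID in sorted order
vocab = {token: idx for idx, token in enumerate(sorted(set(all_tokens)))}

# Map a token sequence to its IDs
ids = [vocab[t] for t in ['Hello', ',', 'world', '.']]
print(ids)
```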
{
"cell_type": "markdown",
"id": "177b041d-f739-43b8-bd81-0443ae3a7f8d",
"metadata": {},
"source": [
"<img src=\"figures/6.webp\" width=\"500px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/06.webp\" width=\"500px\">"
]
},
{
@@ -444,12 +485,20 @@
" break"
]
},
{
"cell_type": "markdown",
"id": "3b1dc314-351b-476a-9459-0ec9ddc29b19",
"metadata": {},
"source": [
"- Below, we illustrate the tokenization of a short sample text using a small vocabulary:"
]
},
{
"cell_type": "markdown",
"id": "67407a9f-0202-4e7c-9ed7-1b3154191ebc",
"metadata": {},
"source": [
"<img src=\"figures/7.webp\" width=\"500px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/07.webp\" width=\"500px\">"
]
},
{
@@ -485,12 +534,21 @@
" return text"
]
},
{
"cell_type": "markdown",
"id": "dee7a1e5-b54f-4ca1-87ef-3d663c4ee1e7",
"metadata": {},
"source": [
"- The `encode` function turns text into token IDs\n",
"- The `decode` function turns token IDs back into text"
]
},
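A minimal tokenizer class along the lines of the chapter's implementation might look as follows; the regex and vocabulary here are illustrative sketches:

```python
import re

class SimpleTokenizer:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        # Split text into tokens, then look up each token's ID
        tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        tokens = [t.strip() for t in tokens if t.strip()]
        return [self.str_to_int[t] for t in tokens]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        # Remove the extra space inserted before punctuation
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

vocab = {token: idx for idx, token in
         enumerate(sorted({'Hello', ',', 'world', '.'}))}
tokenizer = SimpleTokenizer(vocab)
ids = tokenizer.encode("Hello, world.")
print(ids, tokenizer.decode(ids))
```

Note that `encode` followed by `decode` recovers the original text, which is a useful sanity check for any tokenizer.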
{
"cell_type": "markdown",
"id": "cc21d347-ec03-4823-b3d4-9d686e495617",
"metadata": {},
"source": [
"<img src=\"figures/8.webp\" width=\"500px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/08.webp\" width=\"500px\">"
]
},
{
@@ -582,12 +640,20 @@
"## 2.4 Adding special context tokens"
]
},
{
"cell_type": "markdown",
"id": "863d6d15-a3e2-44e0-b384-bb37f17cf443",
"metadata": {},
"source": [
"- It's useful to add some \"special\" tokens for unknown words and to denote the end of a text"
]
},
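Extending a vocabulary with the two special tokens can be sketched as follows; the base vocabulary is a made-up example:

```python
base_vocab = {"Hello": 0, "world": 1, ",": 2, ".": 3}

vocab = dict(base_vocab)
vocab["<|endoftext|>"] = len(vocab)  # marks boundaries between texts
vocab["<|unk|>"] = len(vocab)        # stands in for out-of-vocabulary words

def to_ids(tokens, vocab):
    # Fall back to the <|unk|> ID for tokens not in the vocabulary
    unk = vocab["<|unk|>"]
    return [vocab.get(t, unk) for t in tokens]

ids = to_ids(["Hello", "there", "<|endoftext|>"], vocab)
print(ids)  # "there" is unknown and maps to <|unk|>
```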
{
"cell_type": "markdown",
"id": "aa7fc96c-e1fd-44fb-b7f5-229d7c7922a4",
"metadata": {},
"source": [
"<img src=\"figures/9.webp\" width=\"500px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/09.webp\" width=\"500px\">"
]
},
{
@@ -609,12 +675,20 @@
"\n"
]
},
{
"cell_type": "markdown",
"id": "a336b43b-7173-49e7-bd80-527ad4efb271",
"metadata": {},
"source": [
"- We use the `<|endoftext|>` token between two independent sources of text:"
]
},
{
"cell_type": "markdown",
"id": "52442951-752c-4855-9752-b121a17fef55",
"metadata": {},
"source": [
"<img src=\"figures/10.webp\" width=\"500px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/10.webp\" width=\"500px\">"
]
},
{
@@ -953,12 +1027,20 @@
"print(strings)"
]
},
{
"cell_type": "markdown",
"id": "e8c2e7b4-6a22-42aa-8e4d-901f06378d4a",
"metadata": {},
"source": [
"- BPE tokenizers break down unknown words into subwords and individual characters:"
]
},
{
"cell_type": "markdown",
"id": "c082d41f-33d7-4827-97d8-993d5a84bb3c",
"metadata": {},
"source": [
"<img src=\"figures/11.webp\" width=\"300px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/11.webp\" width=\"300px\">"
]
},
{
@@ -969,12 +1051,20 @@
"## 2.6 Data sampling with a sliding window"
]
},
{
"cell_type": "markdown",
"id": "509d9826-6384-462e-aa8a-a7c73cd6aad0",
"metadata": {},
"source": [
"- We train LLMs to generate one word at a time, so we prepare the training data accordingly, such that the next word in a sequence represents the target to predict:"
]
},
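The next-word prediction task sketched above pairs each growing context with the token that follows it; the token IDs below are a made-up example:

```python
token_ids = [40, 367, 2885, 1464, 1807]  # illustrative encoded text
context_size = 4

pairs = []
for i in range(1, context_size + 1):
    # Everything up to position i is the context; token i is the target
    context, target = token_ids[:i], token_ids[i]
    pairs.append((context, target))
    print(context, "---->", target)
```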
{
"cell_type": "markdown",
"id": "39fb44f4-0c43-4a6a-9c2f-9cf31452354c",
"metadata": {},
"source": [
"<img src=\"figures/12.webp\" width=\"400px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/12.webp\" width=\"400px\">"
]
},
{
@@ -1101,14 +1191,6 @@
" print(tokenizer.decode(context), \"---->\", tokenizer.decode([desired]))"
]
},
{
"cell_type": "markdown",
"id": "b59f90fe-fa73-4c2d-bd9b-ce7c2ce2ba00",
"metadata": {},
"source": [
"<img src=\"figures/13.webp\" width=\"500px\">"
]
},
{
"cell_type": "markdown",
"id": "210d2dd9-fc20-4927-8d3d-1466cf41aae1",
@@ -1145,6 +1227,16 @@
"print(\"PyTorch version:\", torch.__version__)"
]
},
{
"cell_type": "markdown",
"id": "0c9a3d50-885b-49bc-b791-9f5cc8bc7b7c",
"metadata": {},
"source": [
"- We use a sliding window approach where we slide the window one word at a time (this is also known as `stride=1`):\n",
"\n",
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/13.webp\" width=\"500px\">"
]
},
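The sliding-window extraction with `stride=1` can be sketched in plain Python (the chapter's dataset class wraps the same idea in a PyTorch `Dataset`); helper name and sample IDs are illustrative:

```python
def sliding_windows(ids, max_length, stride):
    # Each input chunk of length max_length is paired with the
    # same chunk shifted right by one token (the targets)
    inputs, targets = [], []
    for i in range(0, len(ids) - max_length, stride):
        inputs.append(ids[i:i + max_length])
        targets.append(ids[i + 1:i + max_length + 1])
    return inputs, targets

x, y = sliding_windows(list(range(8)), max_length=4, stride=1)
print(x[0], "->", y[0])
```

With `stride=1`, consecutive input chunks overlap in all but one position.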
{
"cell_type": "markdown",
"id": "92ac652d-7b38-4843-9fbd-494cdc8ec12c",
@@ -1268,12 +1360,20 @@
"print(second_batch)"
]
},
{
"cell_type": "markdown",
"id": "b006212f-de45-468d-bdee-5806216d1679",
"metadata": {},
"source": [
"- An example using a stride equal to the context length (here: 4) is shown below:"
]
},
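Setting the stride equal to the context length produces non-overlapping chunks, which can be sketched with plain slicing (sample IDs are illustrative):

```python
ids = list(range(12))
max_length = 4

# Advance the window by a full context length so inputs never overlap
inputs = [ids[i:i + max_length]
          for i in range(0, len(ids) - max_length, max_length)]
targets = [ids[i + 1:i + max_length + 1]
           for i in range(0, len(ids) - max_length, max_length)]
print(inputs)
```

Avoiding overlap means no token position appears in more than one input chunk, which can reduce overfitting during training.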
{
"cell_type": "markdown",
"id": "9cb467e0-bdcd-4dda-b9b0-a738c5d33ac3",
"metadata": {},
"source": [
"<img src=\"figures/14.webp\" width=\"500px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/14.webp\" width=\"500px\">"
]
},
{
@@ -1349,7 +1449,7 @@
"id": "e85089aa-8671-4e5f-a2b3-ef252004ee4c",
"metadata": {},
"source": [
"<img src=\"figures/15.webp\" width=\"400px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/15.webp\" width=\"400px\">"
]
},
{
@@ -1489,12 +1589,30 @@
"print(embedding_layer(input_ids))"
]
},
{
"cell_type": "markdown",
"id": "be97ced4-bd13-42b7-866a-4d699a17e155",
"metadata": {},
"source": [
"- An embedding layer is essentially a look-up operation:"
]
},
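The look-up interpretation can be verified directly (assuming PyTorch is installed; layer sizes and IDs below are arbitrary):

```python
import torch

torch.manual_seed(123)
embedding = torch.nn.Embedding(num_embeddings=6, embedding_dim=3)
ids = torch.tensor([2, 3, 1])

# The forward pass simply selects rows of the weight matrix
looked_up = embedding(ids)
rows = embedding.weight[ids]
print(torch.equal(looked_up, rows))  # True
```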
{
"cell_type": "markdown",
"id": "f33c2741-bf1b-4c60-b7fd-61409d556646",
"metadata": {},
"source": [
"<img src=\"figures/16.webp\" width=\"500px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/16.webp\" width=\"500px\">"
]
},
{
"cell_type": "markdown",
"id": "08218d9f-aa1a-4afb-a105-72ff96a54e73",
"metadata": {},
"source": [
"- **You may be interested in the bonus content comparing embedding layers with regular linear layers: [../02_bonus_efficient-multihead-attention](../02_bonus_efficient-multihead-attention)**"
]
},
{
@@ -1505,12 +1623,28 @@
"## 2.8 Encoding word positions"
]
},
{
"cell_type": "markdown",
"id": "24940068-1099-4698-bdc0-e798515e2902",
"metadata": {},
"source": [
"- Embedding layers convert IDs into identical vector representations regardless of where they are located in the input sequence:"
]
},
{
"cell_type": "markdown",
"id": "9e0b14a2-f3f3-490e-b513-f262dbcf94fa",
"metadata": {},
"source": [
"<img src=\"figures/17.webp\" width=\"400px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/17.webp\" width=\"400px\">"
]
},
{
"cell_type": "markdown",
"id": "92a7d7fe-38a5-46e6-8db6-b688887b0430",
"metadata": {},
"source": [
"- Positional embeddings are combined with the token embedding vector to form the input embeddings for a large language model:"
]
},
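Combining token and positional embeddings can be sketched as follows (assuming PyTorch is installed; the dimensions and token IDs are arbitrary illustrations):

```python
import torch

torch.manual_seed(123)
vocab_size, context_length, output_dim = 10, 4, 8

token_embedding = torch.nn.Embedding(vocab_size, output_dim)
pos_embedding = torch.nn.Embedding(context_length, output_dim)

token_ids = torch.tensor([[1, 5, 2, 7]])  # batch of one sequence

tok = token_embedding(token_ids)                   # shape (1, 4, 8)
pos = pos_embedding(torch.arange(context_length))  # shape (4, 8)

# Broadcasting adds the same positional vectors
# to every sequence in the batch
input_embeddings = tok + pos
print(input_embeddings.shape)
```

The result has one embedding vector per token position, now carrying both identity and position information.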
{
@@ -1518,7 +1652,7 @@
"id": "48de37db-d54d-45c4-ab3e-88c0783ad2e4",
"metadata": {},
"source": [
"<img src=\"figures/18.webp\" width=\"500px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/18.webp\" width=\"500px\">"
]
},
{
@@ -1679,12 +1813,21 @@
"print(input_embeddings.shape)"
]
},
{
"cell_type": "markdown",
"id": "1fbda581-6f9b-476f-8ea7-d244e6a4eaec",
"metadata": {},
"source": [
"- In the initial phase of the input processing workflow, the input text is segmented into separate tokens\n",
"- Following this segmentation, these tokens are transformed into token IDs based on a predefined vocabulary:"
]
},
{
"cell_type": "markdown",
"id": "d1bb0f7e-460d-44db-b366-096adcd84fff",
"metadata": {},
"source": [
"<img src=\"figures/19.webp\" width=\"400px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/19.webp\" width=\"400px\">"
]
},
{
@@ -1722,7 +1865,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.2"
"version": "3.10.6"
}
},
"nbformat": 4,

Binary file not shown.

Before

Width:  |  Height:  |  Size: 24 KiB
