add more notes and embed figures externally to save space
@@ -37,6 +37,22 @@
"print(\"torch version:\", version(\"torch\"))"
]
},
{
"cell_type": "markdown",
"id": "02a11208-d9d3-44b1-8e0d-0c8414110b93",
"metadata": {},
"source": [
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/01.webp\" width=\"500px\">"
]
},
{
"cell_type": "markdown",
"id": "50e020fd-9690-4343-80df-da96678bef5e",
"metadata": {},
"source": [
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/02.webp\" width=\"600px\">"
]
},
{
"cell_type": "markdown",
"id": "ecc4dcee-34ea-4c05-9085-2f8887f70363",
@@ -53,6 +69,22 @@
"- No code in this section"
]
},
{
"cell_type": "markdown",
"id": "55c0c433-aa4b-491e-848a-54905ebb05ad",
"metadata": {},
"source": [
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/03.webp\" width=\"400px\">"
]
},
{
"cell_type": "markdown",
"id": "03d8df2c-c1c2-4df0-9977-ade9713088b2",
"metadata": {},
"source": [
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/04.webp\" width=\"400px\">"
]
},
{
"cell_type": "markdown",
"id": "3602c585-b87a-41c7-a324-c5e8298849df",
@@ -69,6 +101,22 @@
"- No code in this section"
]
},
{
"cell_type": "markdown",
"id": "bc4f6293-8ab5-4aeb-a04c-50ee158485b1",
"metadata": {},
"source": [
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/05.webp\" width=\"400px\">"
]
},
{
"cell_type": "markdown",
"id": "6565dc9f-b1be-4c78-b503-42ccc743296c",
"metadata": {},
"source": [
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/06.webp\" width=\"200px\">"
]
},
{
"cell_type": "markdown",
"id": "5efe05ff-b441-408e-8d66-cde4eb3397e3",
@@ -103,6 +151,14 @@
" - In short, think of $z^{(2)}$ as a modified version of $x^{(2)}$ that also incorporates information about all other input elements that are relevant to the task at hand."
]
},
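The context-vector idea in the cell above can be sketched outside the notebook: score one query against every input with a dot product, normalize the scores, and take the weighted sum. A minimal sketch with made-up embedding values (not the chapter's `inputs` tensor):

```python
import torch

# Toy embeddings: four tokens, each a 3-dimensional vector (illustrative values)
inputs = torch.tensor([[0.4, 0.1, 0.8],
                       [0.5, 0.8, 0.6],
                       [0.5, 0.8, 0.6],
                       [0.2, 0.6, 0.3]])

x_2 = inputs[1]                          # query: the second input element
scores = inputs @ x_2                    # dot product of x^(2) with every input
weights = torch.softmax(scores, dim=0)   # normalize so the weights sum to 1
z_2 = weights @ inputs                   # z^(2): weighted sum over all inputs

print(z_2.shape)  # torch.Size([3])
```

The resulting $z^{(2)}$ has the same dimensionality as $x^{(2)}$ but mixes in information from every other input in proportion to its attention weight.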
{
"cell_type": "markdown",
"id": "fcc7c7a2-b6ab-478f-ae37-faa8eaa8049a",
"metadata": {},
"source": [
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/07.webp\" width=\"400px\">"
]
},
{
"cell_type": "markdown",
"id": "ff856c58-8382-44c7-827f-798040e6e697",
@@ -141,14 +197,6 @@
" - The subscript \"21\" in $\omega_{21}$ means that input sequence element 2 was used as a query against input sequence element 1."
]
},
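The subscript convention above can be checked numerically: $\omega_{21}$ is just the dot product of $x^{(2)}$ with $x^{(1)}$. The vectors below are made-up illustration values:

```python
import torch

# Hypothetical embeddings for input elements 1 and 2
x_1 = torch.tensor([1.0, 2.0, 3.0])
x_2 = torch.tensor([0.5, 0.5, 0.5])

# omega_21: element 2 used as a query against element 1
omega_21 = torch.dot(x_2, x_1)
print(omega_21.item())  # 3.0
```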
{
"cell_type": "markdown",
"id": "2e29440f-9b77-4966-83aa-d1ff2e653b00",
"metadata": {},
"source": [
"<img src=\"figures/dot-product.png\" width=\"450px\">"
]
},
{
"cell_type": "markdown",
"id": "35e55f7a-f2d0-4f24-858b-228e4fe88fb3",
@@ -176,6 +224,14 @@
")"
]
},
{
"cell_type": "markdown",
"id": "5cb3453a-58fa-42c4-b225-86850bc856f8",
"metadata": {},
"source": [
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/08.webp\" width=\"400px\">"
]
},
{
"cell_type": "markdown",
"id": "77be52fb-82fd-4886-a4c8-f24a9c87af22",
@@ -242,6 +298,14 @@
"print(torch.dot(inputs[0], query))"
]
},
{
"cell_type": "markdown",
"id": "dfd965d6-980c-476a-93d8-9efe603b1b3b",
"metadata": {},
"source": [
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/09.webp\" width=\"400px\">"
]
},
{
"cell_type": "markdown",
"id": "7d444d76-e19e-4e9a-a268-f315d966609b",
@@ -346,6 +410,14 @@
"- **Step 3**: compute the context vector $z^{(2)}$ by multiplying the embedded input tokens, $x^{(i)}$, with the attention weights and summing the resulting vectors:"
]
},
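Step 3 above (weight each input, then sum) can be sketched with made-up numbers; the loop and the one-line matrix form compute the same thing:

```python
import torch

# Made-up values: 4 input embeddings and 4 attention weights for query 2
inputs = torch.tensor([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]])
attn_weights_2 = torch.tensor([0.1, 0.4, 0.3, 0.2])  # already sum to 1

# Multiply each x^(i) by its attention weight and accumulate the sum
context_vec_2 = torch.zeros(inputs.shape[1])
for i, x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2[i] * x_i

# Equivalent one-liner: weights (as a row vector) times the input matrix
assert torch.allclose(context_vec_2, attn_weights_2 @ inputs)
print(context_vec_2)
```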
{
"cell_type": "markdown",
"id": "f1c9f5ac-8d3d-4847-94e3-fd783b7d4d3d",
"metadata": {},
"source": [
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/10.webp\" width=\"400px\">"
]
},
{
"cell_type": "code",
"execution_count": 8,
@@ -394,7 +466,15 @@
"id": "11c0fb55-394f-42f4-ba07-d01ae5c98ab4",
"metadata": {},
"source": [
"<img src=\"figures/attention-matrix.png\" width=\"400px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/11.webp\" width=\"400px\">"
]
},
{
"cell_type": "markdown",
"id": "d9bffe4b-56fe-4c37-9762-24bd924b7d3c",
"metadata": {},
"source": [
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/12.webp\" width=\"400px\">"
]
},
{
@@ -594,6 +674,14 @@
"## 3.4 Implementing self-attention with trainable weights"
]
},
{
"cell_type": "markdown",
"id": "ac9492ba-6f66-4f65-bd1d-87cf16d59928",
"metadata": {},
"source": [
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/13.webp\" width=\"400px\">"
]
},
{
"cell_type": "markdown",
"id": "2b90a77e-d746-4704-9354-1ddad86e6298",
@@ -617,6 +705,14 @@
" - These trainable weight matrices are crucial so that the model (specifically, the attention module inside the model) can learn to produce \"good\" context vectors."
]
},
{
"cell_type": "markdown",
"id": "59db4093-93e8-4bee-be8f-c8fac8a08cdd",
"metadata": {},
"source": [
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/14.webp\" width=\"400px\">"
]
},
{
"cell_type": "markdown",
"id": "4d996671-87aa-45c9-b2e0-07a7bcc9060a",
@@ -630,14 +726,6 @@
" - Value vector: $v^{(i)} = W_v \,x^{(i)}$\n"
]
},
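The three projections listed above amount to three matrix multiplications per input. A minimal sketch with made-up dimensions and randomly initialized weights (written with row vectors, so `x @ W` rather than $W x$; in a real model these matrices would be trainable `nn.Parameter`s):

```python
import torch

torch.manual_seed(123)
d_in, d_out = 3, 2                     # illustrative dimensions

x_2 = torch.tensor([0.5, 0.8, 0.6])    # one input embedding, x^(2)

W_q = torch.rand(d_in, d_out)          # trainable in a real model
W_k = torch.rand(d_in, d_out)
W_v = torch.rand(d_in, d_out)

query_2 = x_2 @ W_q                    # q^(2)
key_2   = x_2 @ W_k                    # k^(2)
value_2 = x_2 @ W_v                    # v^(2)

print(query_2.shape)  # torch.Size([2])
```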
{
"cell_type": "markdown",
"id": "d3b29bc6-4bde-4924-9aff-0af1421803f5",
"metadata": {},
"source": [
"<img src=\"figures/weight-selfattn-1.png\" width=\"600px\">"
]
},
{
"cell_type": "markdown",
"id": "9f334313-5fd0-477b-8728-04080a427049",
@@ -755,7 +843,7 @@
"id": "8ed0a2b7-5c50-4ede-90cf-7ad74412b3aa",
"metadata": {},
"source": [
"<img src=\"figures/weight-selfattn-2.png\" width=\"600px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/15.webp\" width=\"400px\">"
]
},
{
@@ -810,7 +898,7 @@
"id": "8622cf39-155f-4eb5-a0c0-82a03ce9b999",
"metadata": {},
"source": [
"<img src=\"figures/weight-selfattn-3.png\" width=\"600px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/16.webp\" width=\"400px\">"
]
},
{
@@ -847,7 +935,7 @@
"id": "b8f61a28-b103-434a-aee1-ae7cbd821126",
"metadata": {},
"source": [
"<img src=\"figures/weight-selfattn-4.png\" width=\"600px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/17.webp\" width=\"400px\">"
]
},
{
@@ -940,6 +1028,14 @@
"print(sa_v1(inputs))"
]
},
{
"cell_type": "markdown",
"id": "7ee1a024-84a5-425a-9567-54ab4e4ed445",
"metadata": {},
"source": [
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/18.webp\" width=\"400px\">"
]
},
{
"cell_type": "markdown",
"id": "048e0c16-d911-4ec8-b0bc-45ceec75c081",
@@ -1010,6 +1106,14 @@
"## 3.5 Hiding future words with causal attention"
]
},
{
"cell_type": "markdown",
"id": "71e91bb5-5aae-4f05-8a95-973b3f988a35",
"metadata": {},
"source": [
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/19.webp\" width=\"400px\">"
]
},
{
"cell_type": "markdown",
"id": "82f405de-cd86-4e72-8f3c-9ea0354946ba",
@@ -1031,10 +1135,10 @@
},
{
"cell_type": "markdown",
"id": "71e91bb5-5aae-4f05-8a95-973b3f988a35",
"id": "57f99af3-32bc-48f5-8eb4-63504670ca0a",
"metadata": {},
"source": [
"<img src=\"figures/masked.png\" width=\"600px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/20.webp\" width=\"400px\">"
]
},
{
@@ -1193,6 +1297,14 @@
"- So, instead of zeroing out attention weights above the diagonal and renormalizing the results, we can mask the unnormalized attention scores above the diagonal with negative infinity before they enter the softmax function:"
]
},
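The negative-infinity trick described above can be sketched directly: because $e^{-\infty} = 0$, the softmax assigns zero weight to the masked positions and the remaining entries in each row still sum to 1 (the scores below are random stand-ins):

```python
import torch

torch.manual_seed(123)

# Made-up unnormalized attention scores for a 4-token sequence
attn_scores = torch.rand(4, 4)

# Mask everything above the diagonal with -inf before the softmax
mask = torch.triu(torch.ones(4, 4), diagonal=1)
masked = attn_scores.masked_fill(mask.bool(), float("-inf"))

attn_weights = torch.softmax(masked, dim=-1)
print(attn_weights)  # upper triangle is 0; each row sums to 1
```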
{
"cell_type": "markdown",
"id": "eb682900-8df2-4767-946c-a82bee260188",
"metadata": {},
"source": [
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/21.webp\" width=\"400px\">"
]
},
{
"cell_type": "code",
"execution_count": 29,
@@ -1279,7 +1391,7 @@
"id": "ee799cf6-6175-45f2-827e-c174afedb722",
"metadata": {},
"source": [
"<img src=\"figures/dropout.png\" width=\"500px\">"
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/22.webp\" width=\"400px\">"
]
},
{
@@ -1460,6 +1572,14 @@
"- Note that dropout is only applied during training, not during inference."
]
},
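The training-only behavior noted above comes from the module's train/eval mode, which PyTorch handles automatically; a quick check:

```python
import torch

torch.manual_seed(123)
dropout = torch.nn.Dropout(0.5)
weights = torch.ones(4, 4)

dropout.train()            # training mode: roughly half the entries are zeroed,
print(dropout(weights))    # survivors are scaled by 1 / (1 - 0.5) = 2

dropout.eval()             # inference mode: dropout is a no-op
print(dropout(weights))    # the input passes through unchanged
```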
{
"cell_type": "markdown",
"id": "a554cf47-558c-4f45-84cd-bf9b839a8d50",
"metadata": {},
"source": [
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/23.webp\" width=\"400px\">"
]
},
{
"cell_type": "markdown",
"id": "c8bef90f-cfd4-4289-b0e8-6a00dc9be44c",
@@ -1485,11 +1605,11 @@
"\n",
"- This is also called single-head attention:\n",
"\n",
"<img src=\"figures/single-head.png\" width=\"600px\">\n",
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/24.webp\" width=\"400px\">\n",
"\n",
"- We simply stack multiple single-head attention modules to obtain a multi-head attention module:\n",
"\n",
"<img src=\"figures/multi-head.png\" width=\"600px\">\n",
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/25.webp\" width=\"400px\">\n",
"\n",
"- The main idea behind multi-head attention is to run the attention mechanism multiple times (in parallel) with different, learned linear projections. This allows the model to jointly attend to information from different representation subspaces at different positions."
]
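The stacking idea above can be sketched by running several independent heads and concatenating their outputs along the last dimension. The `ToyHead` below is a stand-in (no causal mask, no dropout), not the chapter's `CausalAttention` class:

```python
import torch
import torch.nn as nn

class ToyHead(nn.Module):
    """A stand-in single attention head (scaled dot-product, no masking)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_q = nn.Linear(d_in, d_out, bias=False)
        self.W_k = nn.Linear(d_in, d_out, bias=False)
        self.W_v = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        w = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        return w @ v

class ToyMultiHead(nn.Module):
    """Multi-head = several single heads run in parallel, outputs concatenated."""
    def __init__(self, d_in, d_out, num_heads):
        super().__init__()
        self.heads = nn.ModuleList(ToyHead(d_in, d_out) for _ in range(num_heads))

    def forward(self, x):
        return torch.cat([head(x) for head in self.heads], dim=-1)

torch.manual_seed(123)
x = torch.rand(6, 3)                 # 6 tokens, 3-dim embeddings (made-up sizes)
mha = ToyMultiHead(d_in=3, d_out=2, num_heads=2)
print(mha(x).shape)                  # torch.Size([6, 4]): 2 heads x d_out=2
```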
@@ -1678,6 +1798,14 @@
"- Note that in addition, we added a linear projection layer (`self.out_proj`) to the `MultiHeadAttention` class above. This is simply a linear transformation that doesn't change the dimensions. It's a standard convention to use such a projection layer in LLM implementations, but it's not strictly necessary (recent research has shown that it can be removed without affecting the modeling performance; see the further reading section at the end of this chapter).\n"
]
},
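The `out_proj` convention mentioned above is just a square linear layer applied to the concatenated head outputs, so the tensor shape is unchanged; a minimal sketch with made-up dimensions:

```python
import torch
import torch.nn as nn

d_out = 4                              # concatenated head-output dimension
out_proj = nn.Linear(d_out, d_out)     # same in/out size: shapes are preserved

context = torch.rand(6, d_out)         # made-up concatenated head outputs
print(out_proj(context).shape)         # torch.Size([6, 4])
```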
{
"cell_type": "markdown",
"id": "dbe5d396-c990-45dc-9908-2c621461f851",
"metadata": {},
"source": [
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/26.webp\" width=\"400px\">"
]
},
{
"cell_type": "markdown",
"id": "8b0ed78c-e8ac-4f8f-a479-a98242ae8f65",
@@ -1802,7 +1930,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
"version": "3.10.6"
}
},
"nbformat": 4,