Bonus material: extending tokenizers (#496)

* Bonus material: extending tokenizers

* small wording update
This commit is contained in:
Sebastian Raschka
2025-01-22 09:26:54 -06:00
committed by GitHub
parent 9175590ea4
commit dcaac28b92
7 changed files with 1224 additions and 2 deletions


@@ -309,7 +309,30 @@
"Average score: 48.87\n",
"```\n",
"\n",
-                "The score is close to 50, which is in the same ballpark as the score we previously achieved with the Alpaca-style prompts."
+                "The score is close to 50, which is in the same ballpark as the score we previously achieved with the Alpaca-style prompts.\n",
+                "\n",
+                "There is no inherent reason why the Phi prompt style should perform better, but it can be more concise and efficient, apart from the caveat mentioned in the *Tip* section below."
+            ]
+        },
+        {
+            "cell_type": "markdown",
+            "id": "156bc574-3f3e-4479-8f58-c8c8c472416e",
+            "metadata": {},
+            "source": [
+                "#### Tip: Considering special tokens"
+            ]
+        },
+        {
+            "cell_type": "markdown",
+            "id": "65cacf90-21c2-48f2-8f21-5c0c86749ff2",
+            "metadata": {},
+            "source": [
+                "- Note that the Phi-3 prompt template contains special tokens such as `<|user|>` and `<|assistant|>`, which can be suboptimal for the GPT-2 tokenizer\n",
+                "- While the GPT-2 tokenizer recognizes `<|endoftext|>` as a special token (encoded into token ID 50256), it is inefficient at handling other special tokens, such as the aforementioned ones\n",
+                "- For instance, `<|user|>` is encoded into 5 individual token IDs (27, 91, 7220, 91, 29), which is very inefficient\n",
+                "- We could add `<|user|>` as a new special token in `tiktoken` via the `allowed_special` argument, but please keep in mind that the GPT-2 vocabulary would not be able to handle it without additional modification\n",
+                "- If you are curious about how a tokenizer and LLM can be extended to handle special tokens, please see the [extend-tiktoken.ipynb](../../ch05/09_extending-tokenizers/extend-tiktoken.ipynb) bonus materials (note that this is not required here but is just an interesting/bonus consideration for curious readers)\n",
+                "- Furthermore, we can hypothesize that models that support the special tokens of a prompt template via their vocabulary may perform more efficiently and better overall"
]
},
{
@@ -994,7 +1017,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-        "version": "3.10.11"
+        "version": "3.11.4"
}
},
"nbformat": 4,