Bonus material: extending tokenizers (#496)

* Bonus material: extending tokenizers

* small wording update
This commit is contained in:
Sebastian Raschka
2025-01-22 09:26:54 -06:00
committed by GitHub
parent 9175590ea4
commit dcaac28b92
7 changed files with 1224 additions and 2 deletions


@@ -309,7 +309,30 @@
"Average score: 48.87\n",
"```\n",
"\n",
-                "The score is close to 50, which is in the same ballpark as the score we previously achieved with the Alpaca-style prompts."
+                "The score is close to 50, which is in the same ballpark as the score we previously achieved with the Alpaca-style prompts.\n",
+                "\n",
+                "There is no inherent reason why the Phi prompt style should perform better, but it can be more concise and efficient, apart from the caveat mentioned in the *Tip* section below."
+            ]
+        },
+        {
+            "cell_type": "markdown",
+            "id": "156bc574-3f3e-4479-8f58-c8c8c472416e",
+            "metadata": {},
+            "source": [
+                "#### Tip: Considering special tokens"
+            ]
+        },
+        {
+            "cell_type": "markdown",
+            "id": "65cacf90-21c2-48f2-8f21-5c0c86749ff2",
+            "metadata": {},
+            "source": [
+                "- Note that the Phi-3 prompt template contains special tokens such as `<|user|>` and `<|assistant|>`, which can be suboptimal for the GPT-2 tokenizer\n",
+                "- While the GPT-2 tokenizer recognizes `<|endoftext|>` as a special token (encoded into token ID 50256), it is inefficient at handling other special tokens, such as the aforementioned ones\n",
+                "- For instance, `<|user|>` is encoded into 5 individual token IDs (27, 91, 7220, 91, 29), which is very inefficient\n",
+                "- We could add `<|user|>` as a new special token in `tiktoken` via the `allowed_special` argument, but please keep in mind that the GPT-2 vocabulary would not be able to handle it without additional modification\n",
+                "- If you are curious about how a tokenizer and LLM can be extended to handle special tokens, please see the [extend-tiktoken.ipynb](../../ch05/09_extending-tokenizers/extend-tiktoken.ipynb) bonus materials (note that this is not required here but is just an interesting/bonus consideration for curious readers)\n",
+                "- Furthermore, we can hypothesize that models that support the special tokens of a prompt template via their vocabulary may perform more efficiently and better overall"
]
},
{
@@ -994,7 +1017,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-        "version": "3.10.11"
+        "version": "3.11.4"
}
},
"nbformat": 4,