Qwen3Tokenizer fix for Qwen3 Base models and generation mismatch with HF (#828)

* prevent `self.apply_chat_template` being applied for base Qwen models

* - added no chat template comparison in `test_chat_wrap_and_equivalence`
- removed duplicate comparison

* Revert "- added no chat template comparison in `test_chat_wrap_and_equivalence`"

This reverts commit 3a5ee8cfa1.

* Revert "prevent `self.apply_chat_template` being applied for base Qwen models"

This reverts commit df504397a8.

* copied `download_file` in `utils` from https://github.com/rasbt/reasoning-from-scratch/blob/main/reasoning_from_scratch/utils.py

* added copy of test `def test_tokenizer_equivalence()` from `reasoning-from-scratch` in `test_qwen3.py`

* removed duplicate code fragment in `test_chat_wrap_and_equivalence`

* use apply_chat_template

* add toggle for instruct model

* Update tokenizer usage

---------

Co-authored-by: rasbt <mail@sebastianraschka.com>
casinca
2025-09-17 15:14:11 +02:00
committed by GitHub
parent bfc6389fab
commit 42c130623b
7 changed files with 125 additions and 15 deletions


@@ -9,6 +9,8 @@ import ast
import re
import types
from pathlib import Path
import urllib.request
import urllib.parse
import nbformat
@@ -122,3 +124,22 @@ def import_definitions_from_notebook(nb_dir_or_path, notebook_name=None, *, extr
    exec(src, mod.__dict__)
    return mod


def download_file(url, out_dir="."):
    """Simple file download utility for tests."""
    from pathlib import Path
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    filename = Path(urllib.parse.urlparse(url).path).name
    dest = out_dir / filename
    if dest.exists():
        return dest
    try:
        with urllib.request.urlopen(url) as response:
            with open(dest, 'wb') as f:
                f.write(response.read())
        return dest
    except Exception as e:
        raise RuntimeError(f"Failed to download {url}: {e}")
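
The caching behavior of `download_file` can be checked without network access. The sketch below is illustrative and not part of the commit: it repeats the helper so the block is self-contained, and the `demo` function, the `tokenizer.json` file name, and the use of a local `file://` URL in place of a real download URL are all assumptions made for the example.

```python
import tempfile
import urllib.parse
import urllib.request
from pathlib import Path


def download_file(url, out_dir="."):
    """Copy of the helper added in this diff, repeated so the example runs alone."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # The destination name is the last path component of the URL.
    filename = Path(urllib.parse.urlparse(url).path).name
    dest = out_dir / filename
    if dest.exists():
        return dest  # already present: skip the download
    try:
        with urllib.request.urlopen(url) as response:
            with open(dest, "wb") as f:
                f.write(response.read())
        return dest
    except Exception as e:
        raise RuntimeError(f"Failed to download {url}: {e}")


def demo():
    """Hypothetical check: the second call returns the cached copy."""
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for a remote file; urllib can also open file:// URLs.
        src = Path(tmp) / "src" / "tokenizer.json"
        src.parent.mkdir()
        src.write_text('{"hello": "world"}')

        out_dir = Path(tmp) / "cache"
        first = download_file(src.as_uri(), out_dir)

        # Change the source; the existing destination should be left untouched.
        src.write_text("changed")
        second = download_file(src.as_uri(), out_dir)

        return first == second, second.read_text()


if __name__ == "__main__":
    print(demo())
```

Because an existing destination file is returned as-is, a file left behind by an interrupted download would probably also be treated as cached; deleting it by hand forces a fresh download.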