From e1d094b655e2f514d81b2334ae48ddeffc9fd643 Mon Sep 17 00:00:00 2001
From: Sebastian Raschka
Date: Sat, 27 Apr 2024 07:59:42 -0500
Subject: [PATCH] Update README.md

---
 .../02_bonus_additional-experiments/README.md | 24 +++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/ch06/02_bonus_additional-experiments/README.md b/ch06/02_bonus_additional-experiments/README.md
index 62e85ee..47a215e 100644
--- a/ch06/02_bonus_additional-experiments/README.md
+++ b/ch06/02_bonus_additional-experiments/README.md
@@ -22,7 +22,9 @@ For example,
 
 &nbsp;
 
-### Usage:
+### Usage
+
+You can use the following commands to reproduce the experiments:
 
 - Row 1: `python additional-experiments.py`
 - Row 2: `python additional-experiments.py --trainable_token first`
@@ -31,4 +33,22 @@ For example,
 - Row 5: `python additional-experiments.py --model_size "gpt2-medium (355M)"`
 - Row 6: `python additional-experiments.py --model_size "gpt2-large (774M)"`
 - Row 7: `python additional-experiments.py --weights random --trainable_layers all`
-- Row 8: `python additional-experiments.py --context_length "model_context_length"`
\ No newline at end of file
+- Row 8: `python additional-experiments.py --context_length "model_context_length"`
+
+I've kept the LLM and dataset small on purpose, so in case you don't have access to a GPU, you can run the training in about 15 minutes on a regular laptop like a MacBook Air M3.
+
+&nbsp;
+
+### Interpretation
+
+1. **Training the Last vs. First Output Token (Row 1 vs. 2)**: Training the last output token results in significantly better performance compared to the first. This improvement is expected because, due to the causal self-attention mask, the last token is the only one that attends to all other tokens in the input sequence.
+
+2. **Training the Last Transformer Block vs. Last Layer (Row 1 vs. 3)**: Training the entire last transformer block is much more effective than training only the last layer.
+
+3. **Training All Layers vs. Last Transformer Block (Row 1 vs. 4)**: Training all layers shows a modest improvement of 2% over just training the last transformer block, but it requires almost three times the training duration.
+
+4. **Using Larger Pretrained Models (Row 1 vs. 5, and Row 1 vs. 6)**: Employing a 3x larger pretrained model leads to worse results. However, using a 5x larger model improves performance compared to the initial model, as anticipated.
+
+5. **Using a Model with Random Weights vs. Pretrained Weights (Row 1 vs. 7)**: Utilizing a model with random weights yields results that are only 1.3% worse than using pretrained weights.
+
+6. **Padding Input to Full Context Length vs. Longest Training Example (Row 1 vs. 8)**: Padding the input to the full supported context length results in significantly worse performance.
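
As a side note on interpretation point 1 in the patch above, here is a minimal PyTorch sketch of why the causal mask makes the last token the natural choice for the classification head. This is not code from the repository; the tensor shapes and the binary classification head are illustrative assumptions:

```python
import torch

# A causal mask: True marks positions a token may NOT attend to.
# Row i corresponds to token i; the first token (row 0) sees only itself,
# while the last token (last row) attends to the entire sequence.
seq_len = 6
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(causal_mask)

# Because only the last position aggregates information from all tokens,
# the classification head is applied to the last token's hidden state.
emb_dim, num_classes = 768, 2                  # illustrative, GPT-2-like sizes
hidden = torch.randn(1, seq_len, emb_dim)      # [batch, num_tokens, emb_dim]
head = torch.nn.Linear(emb_dim, num_classes)
logits_last = head(hidden[:, -1, :])           # Row 1: train on the last token
logits_first = head(hidden[:, 0, :])           # Row 2: --trainable_token first
```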
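
Similarly, for interpretation point 6, a small sketch contrasting padding to the longest training example with padding to the model's full supported context length. The token IDs are made up for illustration; the pad token ID and context length below assume GPT-2:

```python
# Padding strategies compared in Row 1 vs. Row 8: padding every example to
# the model's full context length (1,024 for GPT-2) inserts far more
# uninformative pad tokens than padding to the longest example in the dataset.
pad_id = 50256                                       # GPT-2's <|endoftext|>, reused as pad
examples = [[315, 42, 7], [99, 5], [12, 88, 31, 4]]  # made-up token IDs


def pad_to(ids, length):
    """Right-pad a token-ID list with pad_id up to the given length."""
    return ids + [pad_id] * (length - len(ids))


longest = max(len(x) for x in examples)              # Row 1 (default): longest example
full_context = 1024                                  # Row 8: model context length
batch_short = [pad_to(x, longest) for x in examples]       # 4 tokens per row
batch_full = [pad_to(x, full_context) for x in examples]   # 1,024 tokens per row
```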