From 2f53bf5fe585e4483a64e8f6d89fae77bbc018d6 Mon Sep 17 00:00:00 2001
From: Sebastian Raschka
Date: Tue, 24 Jun 2025 16:52:29 -0500
Subject: [PATCH] Link the other KV cache sections (#708)

---
 ch04/03_kv-cache/README.md | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/ch04/03_kv-cache/README.md b/ch04/03_kv-cache/README.md
index 5cb1d6c..8f17156 100644
--- a/ch04/03_kv-cache/README.md
+++ b/ch04/03_kv-cache/README.md
@@ -297,3 +297,11 @@ On a Mac Mini with an M4 chip (CPU), with a 200-token generation and a window si
 | `gpt_with_kv_cache_optimized.py` | 166 |
 
 Unfortunately, the speed advantages disappear on CUDA devices as this is a tiny model, and the device transfer and communication outweigh the benefits of a KV cache for this small model.
+
+
+&nbsp;
+## Additional Resources
+
+1. [Qwen3 from-scratch KV cache benchmarks](../../ch05/11_qwen3#pro-tip-2-speed-up-inference-with-compilation)
+2. [Llama 3 from-scratch KV cache benchmarks](../../ch05/07_gpt_to_llama/README.md#pro-tip-3-speed-up-inference-with-compilation)
+3. [Understanding and Coding the KV Cache in LLMs from Scratch](https://magazine.sebastianraschka.com/p/coding-the-kv-cache-in-llms) -- A more detailed write-up of this README