From 4a617b8343987bc58fb5606a8eba5912e5a896ed Mon Sep 17 00:00:00 2001
From: Sebastian Raschka
Date: Tue, 2 Apr 2024 08:54:24 -0500
Subject: [PATCH] Gutenberg for Windows users (#99)

---
 ch05/03_bonus_pretraining_on_gutenberg/README.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/ch05/03_bonus_pretraining_on_gutenberg/README.md b/ch05/03_bonus_pretraining_on_gutenberg/README.md
index 2ea0f3b..ba82c6f 100644
--- a/ch05/03_bonus_pretraining_on_gutenberg/README.md
+++ b/ch05/03_bonus_pretraining_on_gutenberg/README.md
@@ -13,6 +13,8 @@ Please read the [Project Gutenberg Permissions, Licensing and other Common Reque
 
 ### 1) Download the dataset
 
+In this section, we download books from Project Gutenberg using code from the [`pgcorpus/gutenberg`](https://github.com/pgcorpus/gutenberg) GitHub repository.
+
 As of this writing, this will require approximately 50 GB of disk space, but it may be more depending on how much Project Gutenberg grew since then.
 
 Follow these steps to download the dataset:
@@ -28,6 +30,10 @@ Follow these steps to download the dataset:
 
 5. `cd ..`
 
+&nbsp;
+> [!NOTE]
+> The [`pgcorpus/gutenberg`](https://github.com/pgcorpus/gutenberg) code is compatible with both Linux and macOS. However, Windows users would have to make small adjustments, such as adding `shell=True` to the `subprocess` calls and replacing `rsync`. Alternatively, an easier way to run this code on Windows is by using the "Windows Subsystem for Linux" feature, which allows users to run a Linux environment in Windows. For more information, please read [Microsoft's official installation instruction](https://learn.microsoft.com/en-us/windows/wsl/install) and [tutorial](https://learn.microsoft.com/en-us/training/modules/wsl-introduction/).
+
 &nbsp;
 ### 2) Prepare the dataset
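
Reviewer note: the `shell=True` adjustment mentioned in the added note could look roughly like the sketch below. This is only an illustration under stated assumptions. The helper name `run_command` is hypothetical and is not taken from the `pgcorpus/gutenberg` code, whose actual `subprocess` call sites are not shown in this patch. The sketch also does not address replacing `rsync`.

```python
import platform
import subprocess
import sys


def run_command(args):
    """Run a command given as a list of strings and return its stdout.

    Hypothetical helper for illustration; not part of pgcorpus/gutenberg.
    """
    # Assumption: shell=True is only needed on Windows, as the note suggests.
    # On Linux and macOS the argument list is passed without a shell.
    use_shell = platform.system() == "Windows"
    result = subprocess.run(
        args,
        shell=use_shell,
        check=True,           # raise CalledProcessError on a non-zero exit code
        capture_output=True,  # collect stdout and stderr instead of printing
        text=True,            # decode output to str
    )
    return result.stdout


if __name__ == "__main__":
    # Portable stand-in command: invoke the current Python interpreter.
    print(run_command([sys.executable, "-c", "print('ok')"]))
```

Choosing the `shell` value from `platform.system()` keeps the Linux and macOS behavior unchanged, which matches the note's statement that the code already works on those platforms.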