WGET GUIDE

PART 1: SETTING THE SCENE

Wget is the premier tool for downloading entire websites to your machine. A staple of Linux since 1996, it is heavily documented, has decades of releases behind it, and is an invaluable asset for archiving work.

Installation

Linux
1. Debian-based: sudo apt install wget
2. Arch-based: sudo pacman -S wget
3. RHEL-based: sudo dnf install wget
4. Verify installation: wget --version

macOS
1. Homebrew: brew install wget
2. MacPorts: sudo port install wget
3. Nix: nix-shell -p wget
4. Verify installation: wget --version

Windows
1. Winget: winget install GnuWin32.Wget
2. Scoop: scoop install wget
3. Chocolatey: choco install wget
4. Verify installation: wget --version

PART 2: MIRRORING A WEBSITE

The following command downloads a full local copy of a website, converting all links so they work offline:

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent URL

Replace URL with the address of the site or page you want to mirror. Here is what each flag does:

--mirror
Enables mirroring mode: shorthand for turning on recursive downloading, timestamping, and infinite recursion depth all at once. wget will follow links and pull the full structure of the site.

--convert-links
After downloading, converts links in the HTML to point to the local files instead of the original web addresses. This is what makes the archived site browsable offline: clicking a link opens the local copy rather than trying to fetch from the internet.

--adjust-extension
Adds the correct file extension (e.g. .html) to downloaded files that do not already have one, so your browser knows how to open them.

--page-requisites
Downloads everything needed to display each page correctly: images, stylesheets, scripts, and other assets. Note that by default wget stays on the starting host; if assets are served from a different domain, add --span-hosts (-H) so wget is allowed to fetch them too.

--no-parent
Restricts downloading to the URL you specified and anything below it in the directory tree.
Without this, wget could wander up to the site's root and start pulling the entire domain. This keeps the download focused.

Choosing a Download Location

By default, wget saves files into the current directory, organised into subfolders named after the domain. To run the command from a specific folder, navigate there first:

cd ~/Documents/Archives
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent URL

Or specify an output directory directly with -P:

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent -P ~/Documents/Archives URL

PART 3: TROUBLESHOOTING AND TIPS

The download is taking forever or pulling too much
Some sites are enormous. You can limit recursion depth with --level=N (e.g. --level=2 to go only two links deep) or add a wait between requests with --wait=1 to be polite to the server and avoid being rate-limited.

wget is being blocked by the server
Some servers reject requests that do not look like a browser. Try spoofing a user agent by adding --user-agent="Mozilla/5.0" to your command.

Links still point to the live site after downloading
Make sure you included --convert-links. Note that -c (--continue) only resumes partially downloaded files; it does not convert links. To get offline-friendly links, re-run the command with --convert-links included.

Some assets are missing from the local copy
If images or styles are hosted on a different domain (a CDN, for instance), add --span-hosts (-H) alongside --page-requisites so wget is allowed to fetch from other hosts (optionally with --domains to keep it from straying too far). Some sites also load assets dynamically via JavaScript, which wget cannot execute. For heavily JavaScript-dependent sites, a headless browser tool will produce better results than wget alone.

Windows Notes

curl is already on your machine
Windows 10 and 11 come with curl built in. For simple one-off file downloads, curl will do the job without any installation. Where wget earns its place on Windows is site mirroring and recursive downloads, which curl cannot do.
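The division of labour can be captured in a small helper that prints the right command for each job. This is a sketch for illustration only: the fetch() name and the example.com URLs are made up, and the commands themselves are the ones discussed in this guide.

```shell
#!/bin/sh
# Hypothetical helper: echo the command you would run for each job.
# Single files go to curl (preinstalled on Windows 10 and 11); whole
# sites go to wget, since curl has no recursive mirroring mode.
fetch() {
  case "$1" in
    file) echo "curl -L -O $2" ;;
    site) echo "wget --mirror --convert-links --adjust-extension --page-requisites --no-parent $2" ;;
  esac
}

fetch file https://example.com/report.pdf
fetch site https://example.com/
```

Here -L tells curl to follow redirects and -O saves the file under its remote name; swap in a real URL before running either command.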
Use PowerShell, not Command Prompt
wget runs in both, but PowerShell is the better environment for command-line work on Windows generally. If you installed via Scoop, PowerShell is required. One caveat: in older Windows PowerShell (5.1), wget is a built-in alias for Invoke-WebRequest and will shadow the real program; use PowerShell 7+ or invoke wget.exe explicitly.

File paths use backslashes on Windows
When specifying an output directory with -P, use Windows-style paths if needed, for example -P C:\Users\You\Archives. Forward slashes also work in PowerShell, so -P ~/Documents/Archives should work there too.

wget on Windows is a port, not the native tool
The Windows version is a port of GNU wget rather than a first-class Windows application. It works well for the use cases in this guide, but if you hit unusual issues not covered here, they may be Windows-specific. Searching for the error message alongside "wget Windows" is usually the fastest way to find a fix.

Tips

1. Add --wait=1 --random-wait to your command to space out requests and be less aggressive. This reduces the chance of being blocked and is kinder to smaller sites.
2. To resume an interrupted download, re-run the same command with the -c (--continue) flag added. wget will pick up where it left off rather than starting over.
3. To download a single file rather than mirror a whole site, just pass the URL with no extra flags: wget URL
4. You can use --reject to skip certain file types. For example, --reject="*.mp4,*.zip" will skip video and archive files if you only want the text and images.
5. Pair wget with the Wayback Machine: mirror a site locally with wget for offline access, and submit the URL to https://web.archive.org/save to preserve a public copy too.

CONCLUSION

Congratulations! You can now mirror websites to your local machine. For a full list of wget options, run wget --help or consult the GNU wget manual at https://www.gnu.org/software/wget/manual/
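As a closing sketch, the flags from this guide can be wrapped in one reusable shell function. The mirror() name, the -n dry-run option, and the built-in politeness defaults are assumptions of this example, not features of wget itself:

```shell
#!/bin/sh
# Hypothetical wrapper around the full mirroring command from this guide.
# Usage: mirror [-n] URL [DEST]   (-n prints the command instead of running it)
mirror() {
  run="eval"
  if [ "$1" = "-n" ]; then run="echo"; shift; fi
  url="$1"
  dest="${2:-.}"   # default destination: current directory
  $run wget --mirror --convert-links --adjust-extension \
       --page-requisites --no-parent --wait=1 --random-wait \
       -P "$dest" "$url"
}

# Dry run: show what would be executed without touching the network.
mirror -n https://example.com/docs/ ~/Archives
```

The dry-run mode is handy for checking the assembled command before committing to a long download; drop the -n to run it for real.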