Dev Notes

Various Cheat Sheets and Resources by David Egan/Carawebs.

Download A Website Using wget


Sysadmin
David Egan

Use the wget command line utility to download an entire website.

Be careful with recursive retrieval - you might download the entire internet!

wget \
     --recursive \
     --no-clobber \
     --page-requisites \
     --html-extension \
     --convert-links \
     --domains testsite.com \
     --no-parent \
         http://testsite.com/
  • --recursive: follow links and download pages recursively (wget stops at a depth of 5 levels by default; see --level)
  • --no-clobber: don’t overwrite existing files, useful for resuming interrupted downloads
  • --page-requisites: download all the files required to display each page (CSS, images etc)
  • --html-extension: save HTML files with an .html extension (named --adjust-extension in wget 1.12 and later)
  • --convert-links: rewrite links in the downloaded pages so they work off-line
  • --domains: comma-separated list of domains to be followed
  • --no-parent: don’t ascend to the parent directory when retrieving recursively - guarantees that only the files below a certain hierarchy will be downloaded
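
The options above can be wrapped in a small shell function so the --domains value always matches the start URL. This is only a sketch: the function name mirror_site and the WGET override variable are assumptions made for illustration, not part of wget.

```shell
#!/bin/sh
# Hypothetical helper (not part of wget): mirror one site for off-line viewing.
# Set WGET=echo to print the arguments instead of running wget (a dry run).
mirror_site() {
    url=$1
    # Derive the host from the URL so --domains matches the start URL.
    host=${url#*://}   # drop the scheme, e.g. "http://"
    host=${host%%/*}   # drop the path
    host=${host%%:*}   # drop an explicit port, if any
    "${WGET:-wget}" \
        --recursive \
        --no-clobber \
        --page-requisites \
        --html-extension \
        --convert-links \
        --domains "$host" \
        --no-parent \
        "$url"
}
```

Running it with WGET=echo first is a cheap way to check the arguments before hitting the site for real.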

Alternative Method

This can sometimes work better:

wget --wait=20 --limit-rate=20K -r -p -U Mozilla http://www.testsite.com

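Waiting 20 seconds between requests (--wait) and capping the download rate at 20 KB/s (--limit-rate) is friendlier on the target website, and together with the browser-like user agent (-U) makes it less likely that you get blocked.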
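
The same command can be written with long option names and adjustable throttling, as a rough sketch. The function name polite_fetch and the WAIT, RATE and WGET variables are assumptions for illustration only; the wget options themselves are the long forms of those used above.

```shell
#!/bin/sh
# Hypothetical helper (not part of wget): throttled recursive download.
# WAIT and RATE default to the values used above; WGET=echo gives a dry run.
polite_fetch() {
    url=$1
    "${WGET:-wget}" \
        --wait="${WAIT:-20}" \
        --limit-rate="${RATE:-20K}" \
        --recursive \
        --page-requisites \
        --user-agent=Mozilla \
        "$url"
}
```

For a site you control, something like WAIT=2 RATE=200K before the call speeds things up; the defaults stay on the cautious side.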

Note: We use this primarily for downloading our own or our clients’ CMS-based sites as “flat” HTML - so we’re only hitting our own site resources, or we’re downloading with permission.

If you’re using this method to download other people’s websites, be responsible.

See the wget man page for more details.

