Written by Giles Bennett
On occasion we find ourselves needing to download an entire copy of an existing website - with the recent move of Animalcare to WordPress, for example, it was the easiest way to retrieve the large quantity of product documentation on their site.
Whilst there are a number of utilities for both Windows and Mac that can do it - we are particular fans of SiteSucker, at a modest fee of around $5 - on occasion even that struggles. We recently needed to grab an existing site, managed through Concrete5, before the client's hosting was disabled, but SiteSucker kept coming up blank.
The answer lay in one of the handiest tools in Linux - wget. Whilst this is something that we regularly use for transferring large files between servers, with a few parameters it can also grab a complete copy of a website into a folder. The trick is in using the right parameters, and to save you reading the manual, here's how:
wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains domain-to-grab.com domain-to-grab.com
Let's look at those parameters individually.
--recursive tells wget to follow links downwards through the site's structure. If no depth is specified with --level, it defaults to a maximum of 5 levels.
--no-clobber saves your bandwidth by not re-downloading a file that already exists locally, so repeated runs don't fetch multiple copies of the same file.
--page-requisites tells wget to download all files that are necessary to ensure the correct display of the page, which means it will pick up all the CSS, JS and so on.
--html-extension adds the extension .html to the local filename when saving a file whose MIME type is text/html but whose URL has no extension.
--convert-links rewrites the links in the downloaded files so that they point to the local copies rather than the live site, making the mirror browsable offline.
--restrict-file-names=windows restricts the file names for local files to those which can be used in Windows.
--domains restricts the download to the specified domain(s), so that wget doesn't wander off following external links away from the site and start downloading those too.
domain-to-grab.com - finally, we give it the starting URL for the crawl.
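Putting it all together, here's the same recipe written out in a more readable, annotated form - a sketch assuming GNU wget, with domain-to-grab.com as the placeholder from above; the --level and --wait flags are optional extras we sometimes add, not part of the original one-liner:

```shell
# Mirror a whole site into the current directory (GNU wget).
# --level=5 makes the default recursion depth explicit;
# --wait=1 pauses a second between requests to go easy on the server.
wget --recursive \
     --level=5 \
     --wait=1 \
     --no-clobber \
     --page-requisites \
     --html-extension \
     --convert-links \
     --restrict-file-names=windows \
     --domains domain-to-grab.com \
     domain-to-grab.com
```

The backslashes just split the single command across lines; it behaves identically to the one-line version.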
And that's it - it will run quite happily from the command line and grab the entire site to a directory. Whilst it's not completely foolproof, when all other methods have failed, you can't beat the old ways!