
Paolo Redaelli

A civil engineer with a lifelong fondness for Software Libero

Archiving a (WordPress) website with wget – D’Arcy Norman dot net

Paolo Redaelli, 2020-04-27

Archiving a (WordPress) website with wget


 Posted on December 24, 2011

I needed to archive several WordPress sites as part of the process of gathering the raw data for my thesis research. I found a few recipes online for using wget to grab entire sites, but they all needed some tweaking. So, here’s my recipe for posterity:

I used wget, which is available on any Linux-ish system (I ran it on the same Ubuntu server that hosts the sites).

wget --mirror -p --html-extension --convert-links -e robots=off -P . http://url-to-site

That command doesn’t throttle the requests, so it could cause problems if the server has high load. Here’s what that line does:

  • --mirror: turns on recursion, timestamping, and infinite depth… rather than just downloading the single file at the root of the URL, it'll now suck down the entire site.
  • -p: download all page prerequisites (supporting media, CSS, etc…) rather than just the HTML.
  • --html-extension: adds .html to the downloaded filenames, to make sure the archive plays nicely on whatever system you're going to view it on.
  • --convert-links: rewrites the URLs in the downloaded HTML files to point to the downloaded copies rather than to the live site. This makes the archive nice and portable, with everything living in a self-contained directory.
  • -e robots=off: passes the "robots = off" setting as if it were in .wgetrc, telling wget to ignore any directive to stay away from the site in question. This is strictly Not a Good Thing To Do, but if you own the site, this is OK. If you don't own the site being archived, you should obey all robots.txt files or you'll be a Very Bad Person.
  • -P .: sets the download directory. I left it at the default "." (which means "here"), but this is where you could pass in a directory path to tell wget where to save the archived site. Handy if you're doing this on a regular basis (say, as a cron job or something…).
  • http://url-to-site: this is the full URL of the site to download. You'll likely want to change this.
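To address the throttling caveat above, wget's standard `--wait`, `--random-wait`, and `--limit-rate` options can slow the crawl down so it doesn't hammer the server. A possible gentler variant of the same command (the 2-second delay and 200 KB/s cap are arbitrary examples, not values from the post):

```shell
# Same mirror command, but throttled: pause ~2 seconds between requests
# (randomized by --random-wait) and cap the download rate at 200 KB/s.
# Tune the delay and rate to taste for the server in question.
wget --mirror -p --html-extension --convert-links \
     -e robots=off --wait=2 --random-wait --limit-rate=200k \
     -P . http://url-to-site
```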

You may also need to play around with the -D domain-list and/or --exclude-domains options, if you want to control how it handles content hosted on more than one domain.
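For example, if the site serves its media from a second hostname, `-D` combined with `-H` (span hosts) limits recursion to an explicit list of domains. A sketch, with made-up hostnames:

```shell
# -H lets wget follow links onto other hosts; -D restricts that to the
# listed domains only. Both hostnames here are hypothetical examples.
wget --mirror -p --html-extension --convert-links -e robots=off \
     -H -D example.com,cdn.example.com \
     -P . http://example.com
```

Note that -D only takes effect during recursive retrieval, and only matters once -H allows wget to leave the starting host at all.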

It’s worth noting that this isn’t WordPress-specific. This should work fine for archiving any website.
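Since the -P note above mentions running this from cron, a wrapper along these lines could archive several sites into dated directories. The site list, archive root, and one-second delay are all made-up examples, not part of the original recipe:

```shell
#!/bin/sh
# Hypothetical cron wrapper: mirror each listed site into
# $ARCHIVE_ROOT/<hostname>/<date>. Adjust the list and paths as needed.
ARCHIVE_ROOT="$HOME/site-archives"
SITES="http://url-to-site http://other-site"

for url in $SITES; do
  name=${url#*://}      # strip the scheme, e.g. "url-to-site"
  name=${name%%/*}      # strip any trailing path
  dest="$ARCHIVE_ROOT/$name/$(date +%F)"
  mkdir -p "$dest"
  wget --mirror -p --html-extension --convert-links \
       -e robots=off --wait=1 -P "$dest" "$url"
done
```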

Posted in Web

