{"id":7139,"date":"2020-04-27T01:09:56","date_gmt":"2020-04-26T23:09:56","guid":{"rendered":"https:\/\/monodes.com\/predaelli\/?p=7139"},"modified":"2020-04-27T01:09:56","modified_gmt":"2020-04-26T23:09:56","slug":"archiving-a-wordpress-website-with-wget-darcy-norman-dot-net","status":"publish","type":"post","link":"https:\/\/monodes.com\/predaelli\/2020\/04\/27\/archiving-a-wordpress-website-with-wget-darcy-norman-dot-net\/","title":{"rendered":"Archiving a (WordPress) website with wget &#8211; D&#8217;Arcy Norman dot net"},"content":{"rendered":"<p><em><a href=\"https:\/\/darcynorman.net\/2011\/12\/24\/archiving-a-wordpress-website-with-wget\/\">Archiving a (WordPress) website with wget &#8211; D&#8217;Arcy Norman dot net<\/a><\/em><\/p>\n<header class=\"entry-header\"><a href=\"https:\/\/www.guyrutenberg.com\/2014\/05\/02\/make-offline-mirror-of-a-site-using-wget\/\">Make Offline Mirror of a Site using `wget`<\/a><\/header>\n<p><!--more--><!--nextpage--><\/p>\n<blockquote>\n<header class=\"header-section \">\n<div class=\"intro-header no-img\">\n<div class=\"container\">\n<div class=\"row\">\n<div class=\"col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1\">\n<div class=\"posts-heading\">\n<h1>Archiving a (WordPress) website with wget<\/h1>\n<hr class=\"small\" \/>\n<p><span class=\"post-meta\">\u00a0Posted on December 24, 2011<br \/>\n<\/span><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/header>\n<div class=\"container\" role=\"main\">\n<div class=\"row\">\n<div class=\"col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1\">\n<article class=\"blog-post\" role=\"main\">I needed to archive several WordPress sites as part of the process of gathering the raw data for my thesis research. I found a few recipes online for using <code class=\"\" data-line=\"\">wget<\/code> to grab entire sites, but they all needed some tweaking. 
So, here&#8217;s my recipe for posterity:<\/p>\n<p>I used wget, which is available on any Linux-ish system (I ran it on the same Ubuntu server that hosts the sites).<\/p>\n<p><code class=\"\" data-line=\"\">wget --mirror -p --html-extension --convert-links -e robots=off -P . http:\/\/url-to-site<\/code><\/p>\n<p>That command doesn&#8217;t throttle the requests, so it could cause problems if the server is under high load. Here&#8217;s what that line does:<\/p>\n<ul>\n<li>--mirror: turns on recursion etc&#8230; rather than just downloading the single file at the root of the URL, it&#8217;ll now suck down the entire site.<\/li>\n<li>-p: download all prerequisites (supporting media etc&#8230;) rather than just the html<\/li>\n<li>--html-extension: this adds .html after the downloaded filename, to make sure it plays nicely on whatever system you&#8217;re going to view the archive on<\/li>\n<li>--convert-links: rewrite the URLs in the downloaded html files to point to the downloaded files rather than to the live site. This makes it nice and portable, with everything living in a self-contained directory.<\/li>\n<li>-e robots=off: executes the &#8220;robots off&#8221; command, telling wget to ignore any robots.txt directive to stay away from the site in question. This is strictly Not a Good Thing To Do, but if you own the site, this is OK. If you don&#8217;t own the site being archived, you should obey all robots.txt files or you&#8217;ll be a Very Bad Person.<\/li>\n<li>-P .: sets the download directory. I left it at the default &#8220;.&#8221; (which means &#8220;here&#8221;), but this is where you could pass in a directory path to tell wget where to save the archived site. Handy if you&#8217;re doing this on a regular basis (say, as a cron job or something&#8230;)<\/li>\n<li>http:\/\/url-to-site: this is the full URL of the site to download. 
You&#8217;ll likely want to change this.<\/li>\n<\/ul>\n<p>You may also need to play around with the -D domain-list and\/or --exclude-domains options, if you want to control how it handles content hosted on more than one domain.<\/p>\n<p>It&#8217;s worth noting that this isn&#8217;t WordPress-specific. This should work fine for archiving any website.<\/p>\n<\/article>\n<\/div>\n<\/div>\n<\/div>\n<\/blockquote>\n<p><!--nextpage--><\/p>\n<blockquote>\n<header class=\"entry-header\">\n<h1 class=\"entry-title\">Make Offline Mirror of a Site using `wget`<\/h1>\n<\/header>\n<div class=\"entry-content\">\n<p>Sometimes you want to create an offline copy of a site that you can take and view even without internet access. Using <code class=\"\" data-line=\"\">wget<\/code> you can make such a copy easily:<\/p>\n<pre><code class=\"\" data-line=\"\">wget --mirror --convert-links --adjust-extension --page-requisites \\\n--no-parent http:\/\/example.org\n<\/code><\/pre>\n<p>Explanation of the various flags:<\/p>\n<ul>\n<li><code class=\"\" data-line=\"\">--mirror<\/code> \u2013 Makes (among other things) the download recursive.<\/li>\n<li><code class=\"\" data-line=\"\">--convert-links<\/code> \u2013 Converts all the links (also to stuff like CSS stylesheets) to relative, so the copy is suitable for offline viewing.<\/li>\n<li><code class=\"\" data-line=\"\">--adjust-extension<\/code> \u2013 Adds suitable extensions to filenames (<code class=\"\" data-line=\"\">html<\/code> or <code class=\"\" data-line=\"\">css<\/code>) depending on their content-type.<\/li>\n<li><code class=\"\" data-line=\"\">--page-requisites<\/code> \u2013 Downloads things like CSS style-sheets and images required to properly display the page offline.<\/li>\n<li><code class=\"\" data-line=\"\">--no-parent<\/code> \u2013 When recursing, do not ascend to the parent directory. 
It is useful for restricting the download to only a portion of the site.<\/li>\n<\/ul>\n<p>Alternatively, the command above may be shortened:<\/p>\n<pre><code class=\"\" data-line=\"\">wget -mkEpnp http:\/\/example.org\n<\/code><\/pre>\n<p>Note that the last <code class=\"\" data-line=\"\">p<\/code> is part of <code class=\"\" data-line=\"\">np<\/code> (<code class=\"\" data-line=\"\">--no-parent<\/code>) and hence you see <code class=\"\" data-line=\"\">p<\/code> twice in the flags.<\/p>\n<\/div>\n<\/blockquote>\n","protected":false},"excerpt":{"rendered":"<p class=\"excerpt\">Archiving a (WordPress) website with wget &#8211; D&#8217;Arcy Norman dot net Make Offline Mirror of a Site using `wget`<\/p>\n<p class=\"more-link-p\"><a class=\"more-link\" href=\"https:\/\/monodes.com\/predaelli\/2020\/04\/27\/archiving-a-wordpress-website-with-wget-darcy-norman-dot-net\/\">Read more &rarr;<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"link","meta":{"inline_featured_image":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"activitypub_content_warning":"","activitypub_content_visibility":"","activitypub_max_image_attachments":4,"activitypub_interaction_policy_quote":"anyone","activitypub_status":"","footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"jetpack_post_was_ever_published":false},"categories":[46],"tags":[],"class_list":["post-7139","post","type-post","status-publish","format-link","hentry","category-web","post_format-post-format-link"],"jetpack_publicize_connections":[],"jetpack_featured_media
_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p6daft-1R9","jetpack-related-posts":[{"id":5338,"url":"https:\/\/monodes.com\/predaelli\/2019\/03\/10\/before-curl-there-were-wget\/","url_meta":{"origin":7139,"position":0},"title":"Before cURL there were wget!","author":"Paolo Redaelli","date":"2019-03-10","format":false,"excerpt":": What is cURL and why is it all over API docs? \u2013 Amara Graham \u2013 Medium Well, wget is also there since, well, 1996!! :)","rel":"","context":"In &quot;Fun&quot;","block_context":{"text":"Fun","link":"https:\/\/monodes.com\/predaelli\/category\/fun\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":12892,"url":"https:\/\/monodes.com\/predaelli\/2025\/03\/06\/how-to-download-all-pdf-files-linked-from-a-single-page-using-wget\/","url_meta":{"origin":7139,"position":1},"title":"How to download all PDF files linked from a single page using wget","author":"Paolo Redaelli","date":"2025-03-06","format":false,"excerpt":"You can use wget to download all PDFs from a webpage by using: wget -r -l1 -H -t1 -nd -N -np -A.pdf -erobots=off --wait=2 --random-wait --limit-rate=20k [URL] -r: Recursive download. -l1: Only one level deep (i.e., only files directly linked from this page). -H: Span hosts (follow links to other\u2026","rel":"","context":"In &quot;Tricks&quot;","block_context":{"text":"Tricks","link":"https:\/\/monodes.com\/predaelli\/category\/documentations\/tricks\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":6480,"url":"https:\/\/monodes.com\/predaelli\/2020\/01\/25\/advanced-cli-commands-you-should-know-as-a-developer\/","url_meta":{"origin":7139,"position":2},"title":"Advanced CLI: Commands You Should Know as a Developer","author":"Paolo Redaelli","date":"2020-01-25","format":"link","excerpt":"Advanced CLI: Commands You Should Know as a Developer May I feel a little proud when I tell you I know them all? 
:) Advanced CLI: Commands You Should Know as a Developer Advanced commands; get more done No, in this article we won\u2019t go over the basic commands like\u2026","rel":"","context":"In &quot;Documentations&quot;","block_context":{"text":"Documentations","link":"https:\/\/monodes.com\/predaelli\/category\/documentations\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":3949,"url":"https:\/\/monodes.com\/predaelli\/2018\/03\/31\/piccole-magie\/","url_meta":{"origin":7139,"position":3},"title":"Piccole magie","author":"Paolo Redaelli","date":"2018-03-31","format":false,"excerpt":"Turning a WordPress site into a static HTML site: I had to make a copy of a WordPress site and archive it, but I wanted something that, on an eventual restore, would not force me to install a database server (MySQL-like) and a web server. There are many ways to do it; I did it with Wget, a\u2026","rel":"","context":"In &quot;Documentations&quot;","block_context":{"text":"Documentations","link":"https:\/\/monodes.com\/predaelli\/category\/documentations\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":8913,"url":"https:\/\/monodes.com\/predaelli\/2021\/11\/15\/i-know-them-all\/","url_meta":{"origin":7139,"position":4},"title":"I know them all","author":"Paolo Redaelli","date":"2021-11-15","format":false,"excerpt":"Linux Networking Commands That You Must Know | by Vikram Gupta | Nov, 2021 | Level Up Coding Fine, I know and use them all: ifconfig traceroute tracepath ping netstat hostname curl wget whois scp ssh","rel":"","context":"In &quot;Fun&quot;","block_context":{"text":"Fun","link":"https:\/\/monodes.com\/predaelli\/category\/fun\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":11612,"url":"https:\/\/monodes.com\/predaelli\/2024\/04\/20\/its-worth\/","url_meta":{"origin":7139,"position":5},"title":"It&#8217;s worth","author":"Paolo Redaelli","date":"2024-04-20","format":false,"excerpt":"Xz
may have had a huge trust-related security issue but its performance is still very desirable: paolo@DietPi:~\/Scaricati$ wget --mirror it.aleteia.org paolo@DietPi:~\/Scaricati$ du -sch it.aleteia.org\/; time tar -acf ~\/archivio\/data\/Documenti\/it.aleteia.org.tar.xz it.aleteia.org\/; du -h ~\/archivio\/data\/Documenti\/it.aleteia.org.tar.xz<br>37G it.aleteia.org\/<br>37G totale real 614m8,594s<br>user 469m26,287s<br>sys 15m33,329s<br>1,6G \/home\/paolo\/archivio\/data\/Documenti\/it.aleteia.org.tar.xz This humble Raspberry Pi 3 may be aging and slow but\u2026","rel":"","context":"In &quot;Mood&quot;","block_context":{"text":"Mood","link":"https:\/\/monodes.com\/predaelli\/category\/mood\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/posts\/7139","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/comments?post=7139"}],"version-history":[{"count":0,"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/posts\/7139\/revisions"}],"wp:attachment":[{"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/media?parent=7139"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/categories?post=7139"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/tags?post=7139"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}