One day I found myself in the market for a good web archiver. Specifically, there were some interesting open directories I wanted to mirror. My ideal solution would have been a web front end around wget, but a little research and testing showed that such an architecture would be too simplistic for the level of detail I wanted. I tried out a couple of spider frameworks, like Scrapy, but I wasn't enthusiastic about rolling my own solution when I knew that sites like the Internet Archive already had exactly the kind of thing I had in mind, and they use the Heritrix engine to archive their material. The Heritrix Wikipedia page mentions that it can output in the same directory format as wget (perfect!), but there's no citation for that, and the Heritrix documentation is unorganized, to say the least.
Setting it up
I used Heritrix 3.2.0, the most recent stable version, for this project.
This is not going to be a full tutorial on how to use Heritrix. Be sure to read the documentation, and to read the default job configuration file before starting a large job.
Install Heritrix as per the instructions and get it started. Navigate to the web interface and create a new job with the standard configuration.
Next we want to edit the disposition chain, which starts at line 335 in my default configuration. The first bean defined should be the warcWriter, which, obviously, writes out scraped content to WARC files. WARC files are perfect for preserving websites exactly how they were accessed, but are a little too clumsy to be convenient.
After the WARC bean, add the following code:
<bean id="mirrorWriter" class="org.archive.modules.writer.MirrorWriterProcessor"> </bean>
Then, in the dispositionProcessors bean, remove the line <ref bean="warcWriter"/> from the processors list, and add <ref bean="mirrorWriter"/>.
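For reference, here is a rough sketch of what the edited dispositionProcessors bean might look like. The chain class name and the other two refs (candidates and disposition) are written from memory of the stock job profile, so treat them as assumptions and check them against your own configuration file. Only the swapped ref line comes from the steps above.

```xml
<!-- Sketch only: everything except the mirrorWriter ref is assumed
     to match the default profile and may differ in your file. -->
<bean id="dispositionProcessors" class="org.archive.modules.DispositionChain">
  <property name="processors">
    <list>
      <!-- was: <ref bean="warcWriter"/> -->
      <ref bean="mirrorWriter"/>
      <!-- leave the remaining processors exactly as you found them -->
      <ref bean="candidates"/>
      <ref bean="disposition"/>
    </list>
  </property>
</bean>
```

If you would rather keep the WARC output as well, leaving the warcWriter ref in the list next to mirrorWriter should in principle run both writers, though I did not test that combination here.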
That's pretty much all that's needed to get started. There are more parameters that can be tweaked, which can be found in the 3.2.0 Javadoc, but the defaults hold no surprises.
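To give an idea of what tweaking looks like, here is a hedged sketch of the mirrorWriter bean with two parameters set explicitly. The property names directoryFile and hostDirectory, and the values shown, are my recollection of the MirrorWriterProcessor setters rather than something verified for this post, so confirm them in the Javadoc before relying on them.

```xml
<!-- Sketch only: property names and values are assumptions to verify
     against the MirrorWriterProcessor Javadoc for your version. -->
<bean id="mirrorWriter" class="org.archive.modules.writer.MirrorWriterProcessor">
  <!-- assumed: file name used when a URL maps to a directory -->
  <property name="directoryFile" value="index.html"/>
  <!-- assumed: whether to create one top-level directory per host -->
  <property name="hostDirectory" value="true"/>
</bean>
```

This replaces the empty bean definition added earlier. Any property you leave out simply keeps its default.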