Using Heritrix to archive sites to a directory structure
So I one day I found myself in the market for a good web archiver. Specifically, there were some interesting open directories I wanted to mirror. My ideal solution would be a web front end around wget, but a little bit of research and testing showed that such an architecture would be too simplistic for the level of detail I wanted. There were a couple spider frameworks I tried out, like scrapy, but I wasn't enthusiastic about the prospect of trying to roll my own solution, when I knew sites like the Internet Archive had the exact kind of thing I had in mind, and they use the Heritrix engine archive their material.