News that the Library of Congress have been collecting an archive from Twitter for the last few years (reported by the Library of Congress and also in a story from the Washington Post) caught my attention.  Web archiving now seems to be gaining attention as something that Libraries, particularly National Libraries, should be engaged in.  So, for example, as well as the work the Library of Congress are doing, the BL have the UK Web Archive and Australia have Pandora.

Although National Libraries and the Internet Archive have been Web archiving for a while, coverage of the Web is never going to be comprehensive and in particular is always likely to exclude material locked away in institutional systems.  For Universities that means that material in their Virtual Learning Environment (VLE), for example, isn’t going to be archived by these web-scale systems, so if you want to preserve a record of how your institution offered online learning, someone has to take steps to actively archive those websites.

What was particularly interesting about the Twitter archive was that although processes have been put in place to capture and archive the material, there is still some way to go before access to that material can be provided.  Web archiving is something that we’ve been working on for the last few months as part of our digital library work, and it has quickly become apparent that collecting the material and presenting the material represent two very different challenges.  I’m not sure that the analogy works entirely, but it seems to me that you could think of the collection stage as being akin to ‘rescue archaeology’, in that often what we are having to do is archive a website before it is deleted or the server/application is closed down.

Collecting web archiving material
We’ve been working on web archiving some of our internal websites, such as our Moodle VLE sites, of which there are several thousand going back to 2006.  So we’ve had to establish some selection criteria, eventually choosing the first and last presentations of individual modules, but recognising that we might also have to capture websites that display particularly significant pedagogical features or aspects of learning design.
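As a rough illustration of that selection rule, here is a minimal Python sketch.  It assumes each site can be described by a module code, a presentation label that sorts chronologically, and a URL; those fields, the sample values and the `always_keep` set of hand-picked sites are all hypothetical and not taken from our actual systems.

```python
from collections import defaultdict


def select_presentations(sites, always_keep=frozenset()):
    """Pick the first and last presentation of each module, plus flagged sites.

    `sites` is an iterable of (module_code, presentation, url) tuples, where
    `presentation` is assumed to sort chronologically (for example "2006").
    `always_keep` holds URLs flagged by hand as pedagogically significant.
    """
    by_module = defaultdict(list)
    for module_code, presentation, url in sites:
        by_module[module_code].append((presentation, url))

    selected = set(always_keep)
    for presentations in by_module.values():
        presentations.sort()
        selected.add(presentations[0][1])   # first presentation
        selected.add(presentations[-1][1])  # last presentation
    return selected


# Hypothetical example data: module M101 ran three times, M202 once.
sites = [
    ("M101", "2006", "https://vle.example.ac.uk/m101-2006"),
    ("M101", "2008", "https://vle.example.ac.uk/m101-2008"),
    ("M101", "2012", "https://vle.example.ac.uk/m101-2012"),
    ("M202", "2010", "https://vle.example.ac.uk/m202-2010"),
]
to_archive = select_presentations(sites)
```

A module that only ran once ends up in the set once, because its first and last presentation are the same site.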

To capture the websites our digital library developer initially started with a web archiving tool called Heritrix but discovered that it had problems with our authentication system.  Switching to another tool, Wget, proved more successful and has allowed us to archive several hundred sites.  Both tools essentially work by being given a URL and some parameters and then copying the webpage content, following links to retrieve files/images and continuing across the hierarchy of a site.  It usually takes a bit of trial and error to get the parameters right so that you archive what you want from the site without straying into other sites.  So there is some work to monitor, stop and restart the processes to capture the right content.  What you get at the end of the process is an archive file in WARC format.
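To give a flavour of the parameters involved, here is a minimal Python sketch that assembles a Wget command line for a WARC capture.  The flags are standard Wget options, but the particular combination, the crawl depth, the example URL and the use of a cookies file for authentication are assumptions for illustration, not the settings we actually used.

```python
import shlex
from urllib.parse import urlparse


def build_wget_command(start_url, warc_name, cookies_file=None, depth=5):
    """Return a Wget argument list for a recursive crawl written to WARC."""
    host = urlparse(start_url).hostname
    cmd = [
        "wget",
        "--recursive",               # follow links within the site
        f"--level={depth}",          # limit how deep the crawl goes
        "--no-parent",               # never climb above the start URL
        "--page-requisites",         # also fetch images, CSS and scripts
        f"--domains={host}",         # avoid straying into other sites
        f"--warc-file={warc_name}",  # Wget adds the .warc.gz extension
        "--warc-cdx",                # write a CDX index next to the WARC
    ]
    if cookies_file:
        # Assumption: the session can be reused from an exported cookies file.
        cmd.append(f"--load-cookies={cookies_file}")
    cmd.append(start_url)
    return cmd


# Hypothetical site URL and output name.
command = build_wget_command(
    "https://vle.example.ac.uk/course/m101-2006/", "m101-2006"
)
print(" ".join(shlex.quote(part) for part in command))
```

The printed command can be pasted into a shell, or the list passed to `subprocess.run`, once the parameters have been adjusted for the site in question.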

We have had some challenges to overcome, such as a concern that web archiving shouldn’t take place on live systems because the activity could be seen as similar to a ‘denial of service’ attack, given that it makes a large number of requests in a short space of time.  Given that organisations such as the Internet Archive will be web archiving our public sites all the time anyway, that one surprised us a little.  Tools like Wget and Heritrix allow you to ‘throttle’ them so that they make only a limited number of requests, minimising the impact on systems.
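For Wget, throttling comes down to a few standard options.  The minimal Python sketch below builds those options and gives a rough lower bound on how long a throttled crawl will take; the wait time, bandwidth cap and request count are made-up figures for illustration rather than values we settled on.

```python
def throttle_flags(wait_seconds=2, rate_limit="200k", random_wait=True):
    """Return Wget options that slow a crawl down.

    --wait        pause between retrievals, in seconds
    --limit-rate  cap on download bandwidth (k = kilobytes per second)
    --random-wait vary each pause between 0.5 and 1.5 times --wait
    """
    flags = [f"--wait={wait_seconds}", f"--limit-rate={rate_limit}"]
    if random_wait:
        flags.append("--random-wait")
    return flags


def minimum_crawl_hours(request_count, wait_seconds):
    """Rough lower bound on crawl time: waiting only, ignoring download time."""
    return request_count * wait_seconds / 3600


# Hypothetical site needing 5,000 requests with a 2 second pause between them.
flags = throttle_flags(wait_seconds=2)
hours = minimum_crawl_hours(5000, 2)
```

The trade-off is simply time against load: the longer the pause, the gentler the crawl is on the live system and the longer each site takes to capture.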

Displaying web archived material
Although we have captured several hundred websites we haven’t yet made them all available.  As with the Library of Congress Twitter archive, we’ve found that there is quite a significant piece of work to make the websites available.  We’ve concentrated on working with one test website as a proof of concept.  The approach our digital library developer has taken is to use a local copy of the Wayback Machine software to ‘play back’ a version of the website.  We’ve found that this works pretty well and gives us a reasonable representation of the original website with functioning links to content within that particular website.  As part of the digital library work the website has also been pulled apart into its constituent parts and these have been indexed and ingested into the Fedora digital library to allow the digital library search to find websites alongside other content.
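The general idea behind playback can be illustrated with a small Python sketch: each archived URL has one or more captures identified by a 14-digit timestamp (YYYYMMDDhhmmss), and a request is answered with the capture closest to the date asked for.  This is a simplified illustration and not the Wayback Machine software's own code; the index structure, the `/wayback` prefix and the example URL are hypothetical.

```python
from datetime import datetime

TS_FORMAT = "%Y%m%d%H%M%S"  # 14-digit capture timestamp


def closest_capture(captures, url, wanted_ts):
    """Return the capture timestamp for `url` nearest to `wanted_ts`, or None."""
    candidates = captures.get(url, [])
    if not candidates:
        return None
    wanted = datetime.strptime(wanted_ts, TS_FORMAT)
    return min(
        candidates,
        key=lambda ts: abs(datetime.strptime(ts, TS_FORMAT) - wanted),
    )


def playback_url(prefix, timestamp, url):
    """Build a playback address of the form <prefix>/<timestamp>/<url>."""
    return f"{prefix}/{timestamp}/{url}"


# Hypothetical index: one page captured in 2006 and again in 2010.
page = "https://vle.example.ac.uk/course/m101/"
captures = {page: ["20060901120000", "20100901120000"]}

best = closest_capture(captures, page, "20090101000000")
link = playback_url("/wayback", best, page)
```

Links inside a played-back page only keep working if they also resolve to captures in the archive, which is why links to content within the archived site function while links out to other sites may not.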

Whilst the process seems to work quite well there’s some work to do to get all the sites loaded into the digital library.  So while we’ve a fairly well-established routine now to archive the sites, we’ve still some work to do to put in place routines to publish the material into the digital library.  But it’s been a good piece of work to do and adds to the content that we can make available through the new digital library once it goes live later this year.