
One of the areas we started to explore with our digital archive project was web archiving.  The opportunity arose to start to capture course websites from our Moodle Virtual Learning Environment from 2006 onwards.  We made use of the standard web archive format WARC and eventually settled on Wget as the tool to archive the websites from Moodle (we’d started with Heritrix but discovered that it didn’t cope with our authentication processes).  As a proof of concept we included one website in the staff version of our digital archive (the downside of archiving course materials is that they are full of copyright materials) and made use of a local instance of the Wayback Machine software from the Internet Archive (OpenWayback is the latest development).  So we’ve now archived several hundred module websites and will be starting to think about how we manage access to them and what people might want to do with them (beyond the obvious one of just looking at them to see what was in those old courses).

So I was interested to see a tweet and then a blog post about a tool called warcbase, described as ‘an open-source platform for managing web archives…’, particularly because the blog post from Ian Milligan combined web archiving with something else that I’d remembered Tony Hirst talking and blogging about: IPython and Jupyter.  It also reminded me of a session Tony ran in the library taking us through IPython and his ‘conversations with data’ approach.

The warcbase and Jupyter approach takes the notebook method of keeping track of your explorations and scripting and applies it to web archives, so that you can explore the archive as a researcher might.  So it covers the sort of analytical work that we are starting to see with the UK Web Archive data (often written up on the UK Web Archive blog).  And it got me wondering whether warcbase might be a useful technology to explore as a way of providing access to the VLE websites archive.  But it also made me think about the skills that librarians (or data librarians) might need in order to support researchers who want to run tools like Jupyter across a web archive, about the technology infrastructure that we might need for this type of research, and about the permissions and access that researchers might need to explore the web archive.  A bit of an idle thought about what we might want to think about.

The digital archive site that we’ve been working away on for a while now is finally public.  It is being given a very low-key soft launch to give time for more testing and checking to make sure that the features work OK for users, but as it has now been tweeted about, is linked from our main library website and is findable on Google, I can finally write a short piece about it.

The site has gone live with a mix of images, some videos about the university and a small collection of video clips from the first science module in the 1970s.  Accompanying the images and videos are a couple of sub-sites we’ve called Exhibitions.  To start with there are two, one covering the teaching of Shakespeare and the other giving a potted history of the university.  The exhibitions are designed to give a bit more context around some of the material in the collection.

The small collection of 160 historical images from the history of the university includes people involved in the development of the university and significant events such as the first graduation ceremony, as well as a selection of images about the construction of the campus.  The latter is slightly odd maybe for a distance learning institution, with a campus that most students may never see, but maybe that makes the changes to the physical environment of interest to students and the general viewer nonetheless.

The selection of videos includes a collection of thirty programmes about the university, mostly from the 1970s and 1980s and mainly from a magazine-style series called Open Forum, giving students a bit of an insight into the life of the university.  It includes sections from various University officials, but also student experiences, Summer schools and the like.  Some of the videos cover events such as royal visits and material about the history of the university.

Less obvious to the casual browser is the inclusion of a large collection of metadata about university courses.  This metadata profile forms a skeleton or scaffolding that is used to hang the bits of digitised course materials together and relate them to their parent course/module.  So it gives a way of displaying the different types of material included in a module together, as well as giving information about the module, its subjects and when it ran.  At the moment there are only a few digitised samples hanging on the underlying bare bones.

To find the metadata go to the View All tab, make sure the ‘Available online’ button isn’t selected and choose ‘Module overview’ from Content Type.  It’s then possible to browse through some details of the university’s old modules, seeing some information about each module and when it ran.  You can also follow through to the linked data repository at e.g. Underpinning this aspect of the site is a semantic web RDF triplestore.

Public and staff sites
One of the challenges for the digital archive is that it is essentially two different sites under the skin.  A staff version of the site has been available internally for over a year and lets staff login to see a broader range of material, particularly from old university course materials.  So staff can access some sound recordings as well as a small number of digitised books, and access a larger collection of videos, although at this stage it’s still a fairly small proportion of the overall archive.  But more will be added over time as well as hopefully some of the several hundred module websites that have been archived over the past three years.

Intellectual Property
Unlike many digital archives all of the content is relatively recent, i.e. less than fifty years old.  And that gives a different set of challenges as there is a lot of content that would need to have Intellectual Property rights cleared before it could be made openly available.  So there are a small number of clips but at the moment limited amounts of course materials that have been able to be made open.  So one of the challenges will be to find ways to fund making more material open, both in terms of the effort needed to digitise and check material and the cost of payments to any rights holders.

The digital archive can be found at

I noticed this morning a blog post on the Wellcome Library’s plans to build a cloud-based digital library platform, ‘Moving the Wellcome Library to the cloud’.  It’s a fascinating piece of news.  The Wellcome Library’s ambition and scale, talking about having over 30m digitised pages by 2018 and about building a platform that could potentially be made use of by others, is interesting to see.

As we’ve seen with Library Management Systems, cloud-based systems are becoming commonplace, but where digital libraries are concerned most of them still seem to be operated as locally hosted systems.  The article also talks about the use of IIIF (International Image Interoperability Framework), which is something for digital libraries to take notice of.  It also flags some developments to Wellcome’s media player to create a new Universal Viewer to handle video, audio and other material.  Given how tricky we’ve found getting accessible media players, it will be interesting to keep an eye on these developments.

APIs and commodity services are also mentioned as being in scope.  Something definitely to watch for the future.

We’ve been using Trello as a tool to help us manage the lists of tasks in the digital library/digital archive project that we’ve been running.  After looking at some of our existing tools (such as Mantis Bug Tracker) the team decided that they didn’t really want the detailed tracking features and didn’t feel that our standard project management tools (MS Project and the One Page Project Manager, or Outlook tasks) were quite what we needed to keep track of what is essentially a ‘product backlog’, a list of requirements that need to be developed for the digital archive system.

Trello’s simplicity makes it easy to add and organise a list of tasks and break them down into categories, with colour-coding and the ability to drag tasks around from one stream to another.  Being able to share the board across the team and assign members to the task is good.  You can also set due dates and attach files, which we’ve found useful for attaching design and wireframe illustrations.  You can set up as many different boards as you need, so you can break down your tasks however you want to.  The boards scroll left and right so you can have as many columns as you need.

We’ve been using it to group together priority tasks into a list so the team know which tasks to concentrate on, and when the tasks are done the team member can update the task message so each task can be checked and cleared off the list.

We’re mainly using Trello on the desktop straight from the website, although there is also an iPad app that seems to work well.  For a fairly small team with just a single developer Trello seems to work quite well.  It’s simple and easy to use and doesn’t take a lot of effort to keep up to date.  If you had a larger project you might want to use more sophisticated tools that have some ability to track progress and effort and produce burndown charts, for example, but as a simple way of tracking a list of tasks to be worked on, it’s a practical and useful project tool.




Most of the time my interest is about making sure that users of websites can get access to an appropriate version of the website, or that the site works on a variety of different devices.  But as websites become more personalised, my version of your website might look different to your version.

But one of the other projects that I’m involved with is looking at web archiving of University websites, mainly internal ones that aren’t being captured by the Internet Archive or the UK Web Archive.  Personalisation, and the different forms that websites can take, is one of the really big challenges for capturing websites.  So I was interested to read a recent article, ‘A method for identifying personalised representations in web archives’ by Kelly, Brunelle, Weigle and Nelson (D-Lib Magazine, November/December 2013, Vol. 19, number 11/12, doi:10.1045/november2013-kelly).

This article describes how the user-agent string in mobile browsers is used to serve different versions of webpages.  They show some good examples from CNN of the completely different representations that you might see on iPhones, desktops and Android devices.  The paper goes on to talk through some possible solutions to identify different versions and suggests a modification of the Wayback Machine engine to allow the user to choose which user-agent’s version of a page they want to view from an archive.  Combined with the Memento approach that offers time-based versions of a website, it’s interesting to see an approach that starts to look at ways of capturing the increasingly fragmented and personalised nature of the web.
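
To make the user-agent point a bit more concrete, here is a minimal Python sketch (not taken from the paper) of requesting the same URL several times while changing only the User-Agent header.  The shortened user-agent strings and the function names are made up for illustration; real browsers send much longer strings.

```python
import urllib.request

# Illustrative, shortened user-agent strings (assumptions, not real browser values).
USER_AGENTS = {
    "desktop": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "iphone": "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X)",
    "android": "Mozilla/5.0 (Linux; Android 13)",
}


def build_requests(url, user_agents=USER_AGENTS):
    """Create one request per device type, differing only in User-Agent."""
    return {
        device: urllib.request.Request(url, headers={"User-Agent": ua})
        for device, ua in user_agents.items()
    }


def fetch_all(url):
    """Fetch the same URL once per user-agent and return the raw page bodies."""
    bodies = {}
    for device, request in build_requests(url).items():
        with urllib.request.urlopen(request, timeout=30) as response:
            bodies[device] = response.read()
    return bodies
```

Comparing the bodies returned by `fetch_all` for a site that tailors its pages by device would be one simple way to see whether an archive needs more than one capture of the same address.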

I just remembered to put up my slides onto slideshare from a talk I gave to a group of students about the work that we’ve been doing around linked data, particularly in relation to the STELLAR project.   STELLAR was a Jisc-funded project that finished in July.  It investigated the value of a digital library collection of old course material, carried out an enhancement using linked data technology and then evaluated the impact on perceptions of value.

The slides talk through why semantic web technologies might be important to libraries, cover a very basic outline of linked data and then concentrate on discussing what we did in STELLAR, what we found and how we’ve embedded that technology into our new digital archive.

The slides are on slideshare at and embedded below

News that the Library of Congress have been collecting an archive from Twitter for the last few years (reported by the Library of Congress and also in a story from the Washington Post) caught my attention.  Web archiving now seems to be gaining some attention as something that Libraries, particularly National Libraries, should be engaged in.  So, for example, as well as the work the Library of Congress are doing, the BL have the UK Web Archive and Australia have Pandora.

Although National Libraries and the Internet Archive have been Web archiving for a while, coverage of the Web is never going to be comprehensive and in particular is always likely to exclude material locked away in institutional systems.  For Universities that means that material in their Virtual Learning Environment (VLE), for example, isn’t going to be archived by these web-scale systems, so if you want to preserve a record of how your institution offered online learning, someone has to take steps to actively archive those websites.

What was particularly interesting about the Twitter archive was that although processes have been put in place to capture and archive the material, there is still some way to go to be able to provide access to that material.  Web archiving is something that we’ve been working on for the last few months as part of our digital library work and it has quickly become apparent that collecting the material and presenting the material represent two very different challenges.  I’m not sure that the analogy works entirely, but it seems to me that you could think of the collection stage as being akin to ‘rescue archaeology’, in that often what we are having to do is archive a website before it is deleted or the server/application is closed down.

Collecting web archiving material
We’ve been working on web archiving some of our internal websites, such as our Moodle VLE sites, of which there are several thousand going back to 2006.  So we’ve had to establish some selection criteria, eventually choosing the first and last presentations of individual modules, but recognising that we might also have to capture websites that display particularly significant pedagogical features or aspects of learning design.
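
The first-and-last rule is simple enough to sketch in a few lines of Python.  This is only an illustration of the selection criterion, not the process we actually used; the module codes and the presentation label format in the usage note are hypothetical, and the labels are assumed to sort into date order.

```python
from collections import defaultdict


def select_presentations(sites):
    """Pick the first and last presentation of each module.

    `sites` is an iterable of (module_code, presentation) pairs, where the
    presentation label is assumed to sort chronologically.
    """
    by_module = defaultdict(list)
    for module_code, presentation in sites:
        by_module[module_code].append(presentation)

    selected = {}
    for module_code, presentations in by_module.items():
        ordered = sorted(set(presentations))
        # A module that only ran once yields a single entry, not a duplicate.
        selected[module_code] = sorted({ordered[0], ordered[-1]})
    return selected
```

For example, given sites for a made-up module `A100` presented in `2006B`, `2007B` and `2008B`, the function would keep only `2006B` and `2008B`.  Sites chosen for their pedagogical significance would still need adding by hand.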

To capture the websites our digital library developer initially started with a web archiving tool called Heritrix but discovered that this had problems with our authentication system.  Switching to another tool, Wget, proved to be more successful and has allowed us to archive several hundred sites.  Both tools essentially work by being given a URL and some parameters and then copying the webpage content, following links to retrieve files/images and continuing across the hierarchy of a site.  It is usually a bit of trial and error to get the parameters right so that you archive what you want from the site without straying into other sites.  So there is some work to monitor, stop and restart the processes to capture the right content.  What you get at the end of the process is an archive file in WARC format.
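
As a rough illustration of the kind of parameters involved, here is a minimal Python sketch that assembles a Wget command line for a recursive crawl written out as a WARC file.  The Wget options are standard ones, but the URL, file names, crawl depth and the use of an exported cookie file for authentication are all assumptions made for the example, not the settings our developer actually used.

```python
import shlex


def build_wget_command(start_url, warc_name, cookies_file=None, depth=5):
    """Assemble a Wget command line that crawls one site into a WARC file."""
    command = [
        "wget",
        "--recursive",               # follow links from the start page
        f"--level={depth}",          # how many links deep to follow
        "--no-parent",               # never climb above the start URL's directory
        "--page-requisites",         # also fetch images, CSS and scripts
        f"--warc-file={warc_name}",  # write the capture to <warc_name>.warc.gz
    ]
    if cookies_file:
        # Assumption: a session cookie exported after logging in by hand.
        command.append(f"--load-cookies={cookies_file}")
    command.append(start_url)
    return command


# Hypothetical site address and file names, for illustration only.
cmd = build_wget_command(
    "https://vle.example.org/course/module-123/",
    "module-123",
    cookies_file="cookies.txt",
)
print(shlex.join(cmd))
```

Options such as `--no-parent` and `--level` are the sort of thing that needs trial and error: too tight and you miss content, too loose and the crawl wanders off into other sites.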

We have had some challenges to overcome such as concern being expressed that web archiving shouldn’t take place on live systems as web archiving activity could be seen as being similar to a ‘denial of service’ attack, given that it makes a large number of requests in a short space of time.  Given that organisations such as the Internet Archive will be web archiving our public sites all the time anyway, that one surprised us a little.  Tools like Wget and Heritrix allow you to ‘throttle’ them so they can make limited numbers of requests to minimise the impact on systems.
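
For anyone wondering what throttling looks like in practice, the sketch below lists the Wget options that space out requests and cap bandwidth, plus a back-of-envelope estimate of how long the pauses alone add to a crawl.  The default values and function names are assumptions chosen for illustration, not recommendations.

```python
def throttle_options(wait_seconds=2, rate_limit="200k"):
    """Return Wget flags that space out requests and cap bandwidth."""
    return [
        f"--wait={wait_seconds}",      # pause between successive retrievals
        "--random-wait",               # vary the pause so requests are less regular
        f"--limit-rate={rate_limit}",  # cap the download speed
    ]


def estimated_wait_hours(page_count, wait_seconds=2):
    """Rough time spent just waiting between requests, ignoring transfer time."""
    return page_count * wait_seconds / 3600
```

So a site of 3,600 pages crawled with a two-second pause spends roughly two hours waiting, which is the trade-off for being gentle on a live system.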

Displaying web archived material
Although we have captured several hundred websites we haven’t yet made them all available.  As with the Library of Congress Twitter archive we’ve found that there is quite a significant piece of work to make the websites available.  We’ve concentrated on working with one test website as a proof of concept.  The approach our digital library developer has taken is to use a local copy of the Wayback Machine software to ‘play back’ a version of the website.  We’ve found that this works pretty well and gives us a reasonable representation of the original website with functioning links to content within that particular website.  As part of the digital library work the website has also been pulled apart into its constituent parts and these have been indexed and ingested into the Fedora digital library to allow the digital library search to find websites alongside other content.
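
To give a flavour of what pulling a WARC file apart involves, here is a minimal Python sketch that walks the records in an uncompressed WARC stream and yields each record’s header fields.  It is a simplified reader written for illustration, not the code behind our digital library; real WARC files are usually gzip-compressed, and a maintained WARC library would be the safer choice for real work.

```python
import io


def iter_warc_headers(stream):
    """Yield the header fields of each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return                       # end of file
        if not line.strip():
            continue                     # blank separator between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line!r}")
        headers = {}
        # Header lines run until the first empty line.
        for raw in iter(stream.readline, b"\r\n"):
            if not raw:
                raise ValueError("truncated record header")
            name, _, value = raw.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Skip over the record body; its size is declared in Content-Length.
        stream.read(int(headers["Content-Length"]))
        yield headers


# A tiny hand-made record with a hypothetical URL, for illustration only.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: https://vle.example.org/course/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)
captured_urls = [
    record["WARC-Target-URI"]
    for record in iter_warc_headers(io.BytesIO(sample))
    if record.get("WARC-Type") == "response"
]
print(captured_urls)
```

A list of captured addresses like this is the sort of raw material an index could be built from, with each page or file then described and ingested separately.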

Whilst the process seems to work quite well there’s some work to do to get all the sites loaded into the digital library.  So while we’ve a fairly well-established routine now to archive the sites, we’ve still some work to do to put in place routines to publish the material into the digital library.  But it’s been a good piece of work to do and adds to the content that we can make available through the new digital library once it goes live later this year.

One of the really useful things about being involved with JISC-funded projects is that you get to take part in programme meetings, and they often lead to finding out about interesting tools that I probably wouldn’t otherwise have come across.  So last week I was with the STELLAR project team at the programme meeting for the ‘Enhancing the Sustainability of Digital Content’ programme, and we were introduced to the Harvard Business School Elevator Pitch Builder tool.  For anyone who hasn’t come across the ‘Elevator Pitch’, the idea is that you have the length of a journey in an elevator (lift) to make your pitch for your project or idea.  The thinking is that you might be in a lift with the Vice Chancellor and he asks ‘what do you do?’  Essentially it is a tool to get you to structure and organise a succinct pitch that gets across the key points of what you want to say.

Harvard’s Elevator Pitch tool gets you to create some text to answer WHO, WHAT, WHY and GOAL, then analyses your pitch in terms of the number of words, time it will take to say and how many words are repeated.  The tool suggests suitable words that you might want to use to get the attention of the person you are speaking to. It’s a good tool to use to get a nicely structured pitch for a project.
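
The analysis side of this is easy to imagine in code.  The Python sketch below is not how Harvard’s tool works internally (I don’t know that); it just shows the same three measures, with a speaking rate of 150 words per minute assumed for the timing.

```python
import re
from collections import Counter


def analyse_pitch(text, words_per_minute=150):
    """Count words, estimate speaking time and list repeated words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return {
        "word_count": len(words),
        # Assumption: an average speaking rate of 150 words per minute.
        "seconds_to_say": round(len(words) / words_per_minute * 60),
        "repeated_words": {w: n for w, n in counts.items() if n > 1},
    }
```

Running it over a draft pitch gives a quick sense of whether you would still be talking when the lift doors open.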

JISC programme meetings are a really useful part of being involved in a JISC project.  You generally get the chance to find out at an early stage what the other projects in your programme strand are working on (in our case a range of digital content, from UK Web Archive big data through to archaeology, geospatial and botanical content). That can be really useful as you can find where there is common ground and make a lot of useful contacts amongst people working on similar things.  So we’ve got a few contacts to follow up in the digital libraries area.  And JISC programme managers are really useful people to know as they have a great breadth of knowledge of what is going on in several areas of work.

Latest project
From February I’m going to be involved in a new project, STELLAR: Semantic Technologies Enhancing the Lifecycle of LeArning Resources (funded by JISC).  In some ways the project connects with previous work I’ve been involved with: like the Lucero project it will be employing linked data, and it will be working with learning materials, where I’ve had some involvement with our production and presentation learning systems through the VLE.  But STELLAR will be dealing with a different area for me, in that we’ll be looking at my institution’s store of legacy learning materials.  So it’s a good opportunity to learn more about curation and preservation and digital lifecycles.

STELLAR is particularly going to be looking at trying to understand the value of those legacy learning materials by talking to the academics who have been involved in creating those materials.  There are quite a few reasons why older course materials may still have value.  They might be able to be reused in new courses, on the basis that reusing old materials might be less costly than creating new materials.  They might have value in being able to be transformed into Open Educational Resources.  Or, for example, they might have value in being good historical examples of styles of teaching and learning.  So STELLAR will be exploring different types and models of expressing the value of those materials.

Finding out about the value that is placed on these materials can also be an important factor when trying to understand which materials to preserve as a priority, or where you should expend your resources, and we’d hope that STELLAR would help to inform HE policies as institutions build up increasing amounts of digital learning materials.

As part of STELLAR we will be taking some digital legacy learning material and transforming it into linked data (with some help from our friends in KMi).  This gives us the opportunity to connect old course materials into the OU’s ecosystem by linking to existing datasets on current courses and OER material in OpenLearn.  By transforming the content in this way we can then explore whether making it more discoverable changes the value proposition, makes the content more likely to be reused or opens up other possibilities.  It should be an interesting project and one that I’m looking forward to, as there are going to be a lot of opportunities to build up my understanding of these issues and aspects.
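
For anyone who hasn’t seen linked data written down, the idea is that each statement is a subject, a predicate and an object, with web addresses used as identifiers.  The Python sketch below writes a few such statements out in the N-Triples format.  The `example.org` addresses are hypothetical and the choice of Dublin Core and RDF Schema terms is my own assumption for illustration, not the vocabulary the project will necessarily use.

```python
def to_ntriples(triples):
    """Serialise (subject, predicate, object) tuples as N-Triples lines."""
    lines = []
    for subject, predicate, obj in triples:
        if obj.startswith(("http://", "https://")):
            obj_term = f"<{obj}>"            # the object is another resource
        else:
            escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
            obj_term = f'"{escaped}"'        # the object is a plain text value
        lines.append(f"<{subject}> <{predicate}> {obj_term} .")
    return "\n".join(lines)


# Hypothetical identifiers for an old course unit and its parent module.
unit = "http://example.org/archive/unit/1"
statements = [
    (unit, "http://purl.org/dc/terms/title", "Unit 1: Introduction"),
    (unit, "http://purl.org/dc/terms/isPartOf", "http://example.org/archive/module/A100"),
    (unit, "http://www.w3.org/2000/01/rdf-schema#seeAlso", "http://example.org/current-course/A111"),
]
print(to_ntriples(statements))
```

The last statement is the interesting one: it is the kind of link from an old unit to a current course that could make legacy material easier to stumble across.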

Last time I heard the results of a funding bid we’d submitted I was sitting in a conference in London.  It seems to be becoming a habit, as we had the results of our latest funding bid just before Christmas.  This time I was sitting in a coffee bar in Yorkshire, and it was a nice surprise to hear that we’d been successful as I wasn’t expecting the results before Christmas.  We’d put in a funding bid back in November and, all being well with the clarifications on a few points, we are going to be doing some work starting next month with our digital legacy learning materials and linked data.  We’re looking forward to getting started on STELLAR.


Creative Commons License