The rise of the datastore
The last year or so has seen the growth of a new type of resource available on the web, the “datastore”. Datastores are collections of data, generally but not necessarily government data, and usually authoritative. Examples include DataSF from San Francisco, Chicago City Data, the UK Government datastore data.gov.uk, the London datastore and the Guardian newspaper’s datastore.
The defining aspect of datastores is that they provide ‘raw data’ collected together in one place, rather than being spread across many different government and other websites. That raw data can cover a wide variety of subjects, from mortality statistics and Indices of Deprivation, through to ‘how many miles of high-speed railway’ and FTSE100 Directors’ pay. Generally the data is presented in the form of tables, often in Excel or CSV format (or exportable in those formats).
Alongside the benefits of having the data collected together in one place, and in many cases having data that has never before been made publicly available, datastores offer the potential to start to present and analyse the data through visualisations and data mashups, which combine data from more than one source to expose connections. There are a few examples on Tony Hirst’s blog and an example below of a visualisation of London population built using Many Eyes.
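To make the idea of a mashup concrete in code rather than as a chart, here is a minimal Python sketch of joining two CSV extracts on a shared column. Everything in it is an assumption for illustration: the area names, the column names and the figures are invented and do not come from any of the datastores mentioned, and the `read_table` and `mash_up` helpers are hypothetical names of my own.

```python
import csv
import io

# Hypothetical extracts from two different datastores.
# All names and values are invented for illustration, not real statistics.
POPULATION_CSV = """area,population
Area A,1000
Area B,3000
Area C,4000
"""

LIBRARIES_CSV = """area,libraries
Area A,2
Area B,5
Area D,1
"""


def read_table(text, key):
    """Parse CSV text into a dict of rows keyed on the given column."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def mash_up(population_text, libraries_text):
    """Join the two tables on 'area' and derive residents per library."""
    population = read_table(population_text, "area")
    libraries = read_table(libraries_text, "area")
    combined = []
    # Only areas that appear in both sources can be compared.
    for area in sorted(population.keys() & libraries.keys()):
        people = int(population[area]["population"])
        count = int(libraries[area]["libraries"])
        combined.append({
            "area": area,
            "population": people,
            "libraries": count,
            "people_per_library": people / count,
        })
    return combined


for row in mash_up(POPULATION_CSV, LIBRARIES_CSV):
    print(row)
```

Areas that appear in only one of the two sources are silently dropped here, which is one of the things a librarian assessing a mashup would probably want to check.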
Challenges for libraries and librarians
Although datastores help with discovery by collecting relevant data together in one place, such as on data.gov.uk, they do still present some specific challenges for libraries and librarians. There are still some discovery challenges, but I would consider that the biggest challenges are around librarians getting to grips with exploiting the data within the datastores.
The discovery challenges are about finding the datastores and understanding what is contained within each one. But these are skills that librarians are used to applying when they find and assess resources, so they shouldn’t present much of a challenge. Techniques such as building Google Custom Search engines to search datastores can help with finding relevant data within these resources.
Building a custom search engine to search the London, UK Government and Guardian datastores is fairly straightforward, so I’ve built a quick example at http://www.google.co.uk/cse/home?cx=009989586971183011327:nt82dyi0ehc
Using this form of search engine makes it simple to discover which datastores have datasets that may be of interest.
Where I think it starts to become more difficult for libraries is in exploiting the data in the datastores. There is a question here about the role of the librarian. Is the role just to find the data, check its quality and promote it to academics and students, or is there a role in helping users to find ways of using the data? The latter role implies a much deeper understanding of how the data can be used: not just being able to export the data to a spreadsheet and produce a nice visualisation, but also knowing how to use APIs to dig into datastores and how to use tools such as Yahoo Pipes to take data and transform it. The question is how far librarians and libraries see that as their role, and how far they see their role as being to support students and academics in exploiting the data by learning and teaching the techniques needed to understand and exploit it.
Obviously some librarians are more comfortable playing around with data than others, but the interest among librarians in the Mashed Library events indicates that a growing number are starting to appreciate that this is an area relevant to libraries. Over the years libraries and librarians have had to get to grips with several generations of technological innovation, from CD-ROMs through the world wide web to RFID, and in each case librarians have taken on board new skills to exploit the new technologies and help their users.
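As a rough illustration of what “taking data and transforming it” can involve, here is a small Python sketch that chains a few steps together, loosely in the spirit of a pipe: read a CSV export, filter it, tidy the values and sort the result. It is not how Yahoo Pipes or any datastore API actually works; the column names, the figures and the function names are all assumptions invented for the example.

```python
import csv
import io

# Hypothetical CSV export from a datastore.
# Column names and values are invented for illustration only.
RAW_CSV = """Area,Year,Value
Area A,2009,12
Area B,2009,30
Area A,2010,15
Area B,2010,28
"""


def fetch(text):
    """Source step: yield each CSV row as a dict."""
    yield from csv.DictReader(io.StringIO(text))


def only_year(rows, year):
    """Filter step: keep rows for a single year."""
    return (row for row in rows if row["Year"] == str(year))


def as_numbers(rows):
    """Transform step: rename columns and convert Value from text to int."""
    for row in rows:
        yield {"area": row["Area"], "value": int(row["Value"])}


def sort_desc(rows):
    """Sort step: largest value first."""
    return sorted(rows, key=lambda row: row["value"], reverse=True)


def pipeline(text, year):
    """Chain the steps together, each feeding the next."""
    return sort_desc(as_numbers(only_year(fetch(text), year)))


print(pipeline(RAW_CSV, 2010))
```

Each step is deliberately tiny, so someone learning the technique can swap one step for another without touching the rest.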