
Analytics seems to be a major theme at a lot of conferences at the moment.  I’ve been following a couple of library sector conferences this week on Twitter (Talis Insight #talisinsight and the 17th Distance Library Services Conference #dls16) and analytics has been a common thread at both.

A colleague at the DLS conference tweeted a picture about the impact of a particular piece of practice, and that set us off thinking: did we have that data? Did we have examples of where we’d done something similar?  The good thing now is that rather than thinking ‘it would be good if we could do something like that’, we have a bit more confidence: if we get the examples and the data, we know we can do the analyses, but we also know we should be doing those analyses as a matter of course.

It was also good to see other colleagues (@DrBartRienties) at the university presenting some of the University’s learning analytics work at Talis Insight. Being at a university that is undertaking a lot of academic work on learning analytics is really helpful when you’re trying to look at library analytics, and it provides a valuable source of advice and guidance in some of our explorations.

[As an aside, and having spent much of my library career in public libraries, I’m not sure how much academic librarians realise the value of being able to talk to academics in universities, to hear their talks, discuss their research or get their advice.  In a lot of cases you’re able to talk with world-class researchers doing ground-breaking work and shaping the world around us.]


We’re in the early stages of our work with library data and I thought I’d write up some reflections.  So far we’ve mostly confined ourselves to trying to understand the library data we have and suitable methods to access and manipulate it.  We’re interested in aggregations of the data, e.g. by week, by month, by resource, and in comparison with total student numbers.

Ezproxy data
One of our main sources of data is ezproxy, which we use for both on- and off-campus access to online library resources.  Around 85-90% of our authenticated resource access goes through this system.   One of the first things we learnt when we started investigating this data source is that there are two levels of logfile – the full log of all resource requests and the SPU (Starting Point URL) logfile.   The latter tracks the first request to a domain in a session.

We looked at approaches that others had taken to help shape how we analysed the data.  Wollongong, for example, decided to analyse the timestamps as follows:

  • The day is divided into 144 10-minute sessions
  • If a student has an entry in the log during a 10-minute period, then 1/6 is added to the sum of that student’s access for that session (or week, in the case of the Marketing Cube).
  • Any further log entries during that student’s 10-minute period are not counted.

Using this logic, UWL measures how long students spent using its electronic resources with a reasonable degree of accuracy due to small time periods (10 minutes) being measured.

Discovering the Impact of Library Use and Student Performance, Cox and Jantti 2012
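The Wollongong bucketing logic above can be sketched in a few lines. This is a minimal illustration under assumed inputs (a list of student/timestamp event pairs), not their actual implementation:

```python
from collections import defaultdict
from datetime import datetime

def usage_hours(events):
    """Estimate hours of e-resource use per student, Wollongong-style:
    the day is split into 144 ten-minute buckets; each bucket in which a
    student appears at least once contributes 1/6 of an hour, and any
    further log entries in the same bucket are ignored."""
    seen = set()
    hours = defaultdict(float)
    for student, ts in events:
        # Bucket key: student, date, and which ten-minute slot of the day (0..143).
        bucket = (student, ts.date(), (ts.hour * 60 + ts.minute) // 10)
        if bucket not in seen:
            seen.add(bucket)
            hours[student] += 1 / 6
    return dict(hours)

events = [
    ("s1", datetime(2016, 4, 22, 15, 31)),
    ("s1", datetime(2016, 4, 22, 15, 38)),  # same bucket, not counted again
    ("s1", datetime(2016, 4, 22, 15, 41)),  # next bucket
]
print(usage_hours(events))  # s1: two distinct buckets = 1/3 of an hour
```

Note that this needs every request, not just the first per session, which is exactly why the approach depends on the full log files rather than the SPU logs.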

To adopt this approach we’d need to work from the full log files to pick up each of the 10-minute sessions.  Unfortunately, owing to the size of our full logs, that wasn’t going to be feasible; we’d have to use the SPU version and take a different approach.

Athens data
A small proportion of our resource authentication goes through OpenAthens.   Each month we get a logfile of resource accesses that have been authenticated via this route.   Unlike the ezproxy data there is no date/timestamp; all we know is that the resources were accessed at some point during the month.  Against each resource/user combination you get a count of the number of times that combination occurred during the month.

One of the interesting things we’ve been able to identify is that OpenAthens authentication also gets used for resources other than library content; for example, we use it for library tools such as RefWorks and Library Search.  It’s straightforward to take those out if they aren’t wanted in your analysis.

So one of the things we’ve been looking at is how easy it is to add the OpenAthens and ezproxy data together.   There are similarities between the datasets but some processing is needed to join them up: the ezproxy data can be aggregated to a monthly level, and the few resources that we can access via both routes need their names normalised.

The biggest difference between the two datasets is that whereas you get a logfile entry for each SPU access in the ezproxy dataset, you get a total per month for each user/resource combination in the OpenAthens data.  One approach we’ve tried is simply to duplicate the rows: where the count says a resource/user combination appeared twice in the month, copy the line.  The two sets of data then become comparable and can be analysed together, so if you wanted a headcount of users who’ve accessed one or more library resources in a month you could include data from both ezproxy and OpenAthens authenticated resources.
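The row-duplication idea is simple enough to sketch. The field names and sample values here are hypothetical; the point is just that expanding the OpenAthens counts makes the two sources line up as one-row-per-access:

```python
def expand_athens(rows):
    """Duplicate each OpenAthens (user, resource, count) row `count` times so
    it lines up with one-row-per-access ezproxy SPU data."""
    out = []
    for user, resource, count in rows:
        out.extend([(user, resource)] * count)
    return out

# Hypothetical monthly data from each route:
athens = [("u1", "JSTOR", 2), ("u2", "Scopus", 1)]
ezproxy = [("u1", "JSTOR"), ("u3", "JSTOR")]

combined = expand_athens(athens) + ezproxy

# Headcount of users with one or more accesses this month:
users = {u for u, _ in combined}
print(len(users))  # 3 distinct users across both routes
```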

Numbers and counts
One thing we’ve found is that users of the data want several different counts of users and usage from the library e-resources data.  The sorts of questions we’ve had to think about so far include:

  • What percentage of students have accessed a library resource in 2014-15? – (count of students who’ve accessed 1 or more library resources)
  • What percentage of students have accessed library resources for modules starting in 2014? – a different question to the first one as students can be studying more than one module at a time
  • How much use of library resources is made by the different Faculties?
  • How many resources have students accessed – what’s the average per student, per module, per level?

Those have raised a few interesting questions, including which student number you take when calculating means – the enrolment at the start of the module, at the end, or part-way through?
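The denominator question matters more than it might seem. A toy example (with made-up student IDs and enrolment figures) shows how the same activity data gives quite different percentages depending on which enrolment snapshot you divide by:

```python
def pct_active(active_users, enrolled):
    """Percentage of enrolled students who accessed at least one resource."""
    return 100 * len(active_users & enrolled) / len(enrolled)

active = {"s1", "s2", "s3"}                        # students seen in the logs
start_of_module = {"s1", "s2", "s3", "s4", "s5"}   # enrolment at the start
end_of_module = {"s1", "s2", "s3", "s4"}           # after some withdrawals

print(pct_active(active, start_of_module))  # 60.0
print(pct_active(active, end_of_module))    # 75.0
```

The intersection also guards against counting activity from students who withdrew before the snapshot was taken.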

Next steps
In the New Year we’ve more investigation and more data to tackle and should be able to start to join library data up with data that lets us explore correlations between library use, retention and student success.



In the early usability tests we ran for the discovery system we implemented earlier in the year, one of the aspects we looked at was the search facets.   Included amongst the facets is a feature to let users limit their search by a date range.  That sounds reasonably straightforward: filter your results by the publication date of the resource, narrowing them down by putting in a range of dates.  But one thing that emerged during testing is that there’s a big assumption underlying this concept.

During the testing a user tried to use the date range to restrict results to journals for the current year and was a little baffled that the search system didn’t work as they expected.  Their expectation was that by putting in 2015 it would show them journals in that subject where we had issues for the current year.  But the metadata for continuing journals, whose date range is open-ended, included only a start date for the subscription period, not the current year, so the system didn’t ‘know’ that the journal was available for 2015.

That exposed for me the gulf that exists between user and library understanding, and how our metadata and systems don’t seem to match user expectations.  That usability testing session came to mind when reading the following blog post about linked data.

I would really like my software to tell the user if we have this specific article in a bound print volume of the Journal of Doing Things, exactly which of our location(s) that bound volume is located at, and if it’s currently checked out (from the limited collections, such as off-site storage, we allow bound journal checkout).

My software can’t answer this question, because our records are insufficient. Why? Not all of our bound volumes are recorded at all, because when we transitioned to a new ILS over a decade ago, bound volume item records somehow didn’t make it. Even for bound volumes we have — or for summary of holdings information on bib/copy records — the holdings information (what volumes/issues are contained) are entered in one big string by human catalogers. This results in output that is understandable to a human reading it (at least one who can figure out what “v.251(1984:Jan./June)-v.255:no.8(1986)”  means). But while the information is theoretically input according to cataloging standards — changes in practice over the years, varying practice between libraries, human variation and error, lack of validation from the ILS to enforce the standards, and lack of clear guidance from standards in some areas, mean that the information is not recorded in a way that software can clearly and unambiguously understand it.

From the Bibliographic Wilderness blog

Descriptions that worked for library catalogues or librarians, in this case ‘v.251(1984:Jan./June)-v.255:no.8(1986)’, need translating before a non-librarian, or a computer, can understand what they mean.
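To see why software struggles, consider what even a generous parser can recover from that string. The sketch below pulls out the volume/year pairs with a regular expression; it copes with this one example but, as the quoted post explains, real holdings strings vary far too much for any single pattern, so treat this purely as an illustration:

```python
import re

# Looks for "v.<number>" followed (eventually) by "(<4-digit year>".
# Real holdings strings are far messier than one regex can handle,
# which is rather the point of the quoted post.
HOLDING = re.compile(r"v\.(?P<vol>\d+).*?\((?P<year>\d{4})")

def holdings_span(text):
    """Return ((first_vol, first_year), (last_vol, last_year)) or None."""
    parts = [(int(m.group("vol")), int(m.group("year")))
             for m in HOLDING.finditer(text)]
    if not parts:
        return None
    return parts[0], parts[-1]

print(holdings_span("v.251(1984:Jan./June)-v.255:no.8(1986)"))
# ((251, 1984), (255, 1986))
```

Every variation in cataloguing practice means another special case, which is why free-text holdings defeat software that needs an unambiguous answer to “do we hold this issue?”.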

It’s a good and interesting blog post that raises some important questions about why, despite the seemingly large number of identifiers in use in the library world (or maybe because of them), it is so difficult to pull together metadata and descriptions of material to consolidate versions together.   The issue causes problems across a range of our work.  In discovery systems we end up normalising data from different sources to reduce what look to users like duplicate entries.  In usage reporting, consolidating the usage of a particular article or journal becomes impossible when versions of that article are available from different providers, from institutional repositories, or at different URLs.

One of the areas we started to explore with our digital archive project was web archiving.  The opportunity arose to start capturing course websites from our Moodle virtual learning environment from 2006 onwards.   We made use of the standard web archive format WARC and eventually settled on Wget as the tool to archive the websites from Moodle (we’d started with Heritrix but discovered that it didn’t cope with our authentication processes).  As a proof of concept we included one website in the staff version of our digital archive (the downside of archiving course materials is that they are full of copyright material) and made use of a local instance of the Wayback Machine software from the Internet Archive.  [OpenWayback is the latest development.]   We’ve now archived several hundred module websites and will be starting to think about how we manage access to them and what people might want to do with them (beyond the obvious one of just looking to see what was in those old courses).
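A capture along these lines can be scripted by building up the Wget command. The URL and WARC name below are placeholders, and a real Moodle capture would also need the authentication handling that tripped up Heritrix; the WARC and mirroring flags themselves are standard Wget options:

```python
def warc_command(url, warc_name):
    """Build a Wget invocation that mirrors a site and writes a WARC file."""
    return [
        "wget",
        "--mirror",                # recursive retrieval with timestamping
        "--page-requisites",       # grab the CSS/JS/images needed to render pages
        "--warc-file", warc_name,  # write <warc_name>.warc.gz alongside the files
        url,
    ]

# Hypothetical module website URL:
cmd = warc_command("", "module-2006")
print(" ".join(cmd))

# To actually run the capture:
# import subprocess
# subprocess.run(cmd, check=True)
```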

So I was interested to see a tweet and then a blog post about a tool called warcbase – described as ‘an open-source platform for managing web archives…’ but particularly because the blog post from Ian Milligan combined web archiving with something else that I’d remembered Tony Hirst talking and blogging about, IPython and Jupyter. It also reminded me of a session Tony ran in the library taking us through ipython and his ‘conversations with data’ approach.

The warcbase and Jupyter approach takes the notebook method of keeping track of your explorations and scripting and applies it to web archives, exploring the archive as a researcher might.  It covers the sort of analytical work that we are starting to see with the UK Web Archive data (often written up on the UK Web Archive blog).   It got me wondering whether warcbase might be a useful technology to explore as a way of providing access to the VLE websites archive.  It also made me think about the skills that librarians (or data librarians) might need in order to facilitate the work of researchers who want to run tools like Jupyter across a web archive, about the technology infrastructure we would need to support this type of research, and about the implications for the permissions and access that researchers might need.  A bit of an idle thought about what we might want to think about.

data.path Ryoji.Ikeda - 3 by r2hox


One of the pieces of work we’re just starting off in the team this year is to do some in-depth work on library data.  In the past we’ve looked at activity data and how it can be used for personalised services (e.g. to build recommendations in the RISE project or more recently to support the OpenTree system), but in the last year we’ve been turning our attention to what the data can start to tell us about library use.

There have been a couple of activities that we’ve undertaken so far.  We’ve provided some data to an institutional Learning Analytics project on the breakdown of library use of online resources for a dozen or so target modules.  We’ve been able to take data from the EZproxy logfiles, and show the breakdown by student ID, by week and by resource over the nine-month life of the different modules.  That has put library data alongside other data such as use of the Virtual Learning Environment and allowed module teams to look at how library use might relate to the other data.

Pattern of week by week library use of eresources - first level science course


A colleague has also been able to make use of some data combining library use and satisfaction survey data for a small number of modules, to shed a little light on whether satisfied students were making more use of the library than unsatisfied ones (obviously not a causal relationship, but initial indications are that for some modules there does seem to be a pattern).

Library Analytics roadmap
But these have been really early exploratory steps, so during last year we started to plan out a Library Analytics Roadmap to scope out the range of work we need to do.  This covers not just data analysis, but also some infrastructural developments to help improve access to data and some effort to build skills in the library.  It is backed up with engagement with our institutional learning analytics projects and some work to articulate a strategy around library analytics.  The idea is that the roadmap activities will help us change how we approach data, so we have the necessary skills and processes to provide evidence of how library use relates to vital aspects such as student retention and achievement.

Library data project
We’re working on a definition of Library analytics as being about:

Using data about student engagement with library services and content to help institutions and students understand and improve library services to learners

Part of the roadmap activity this year is to start to carry out a more systematic investigation into library data, to match it against student achievement and retention data.  The aim is to build an evidence base of case studies, based on quantitative data and some qualitative work we hope to do.  Ideally we’d like to be able to follow the paths mapped out by the likes of Minnesota, Wollongong and Huddersfield in their various projects and demonstrate that there is a correlation between library use, student success and retention.

Challenges to address
We know that we’re going to need more data analysis skills, and some expertise from a statistician.  We also have some challenges because of the nature of our institution.  We won’t have library management system book loans, or details of visits to the library; we will mainly have to concentrate on use of online resources, which in some ways simplifies things.  But our model of study also throws up challenges.  At a traditional campus institution students study a degree over three or four years: a cohort of students follows through years 1, 2, 3 and so on, and at the end of that period they do their exams and get their degree classification.  So it is relatively straightforward to see retention as being about students that return in year 2 and year 3, or don’t drop out during the year, and to see success measured as their final degree classification.  But with part-time distance learning, although students sign up to a qualification they follow a pattern of modules, many take longer than six years to complete, often with one or more ‘breaks’ in study, so following a cohort across modules might be difficult.  We might have to concentrate on analysis at the ‘module’ level… but then that raises another question for us.  Our students could be studying more than one module at a time, so how do you easily know whether their library use relates to module A or module B?  Lots of things to think about as we get into the detail.

At the end of November I was at a different sort of conference to the ones I normally get to attend.  This one, Design4Learning, was held at the OU in Milton Keynes but was a more general education conference.  Described as aiming “to advance the understanding and application of blended learning, design4learning and learning analytics”, it covered topics such as MOOCs, e-learning, learning design and learning analytics.

There was a useful series of presentations at the conference and several of them are available from the conference website.   We’d put together a poster talking about the work we’ve started to do in the library on ‘library analytics’, entitled ‘Learning Analytics – exploring the value of library data’, and it was good to talk to a few non-library people about the wealth of data that libraries capture and how that can contribute to the institutional picture of learning analytics.

Our poster covered some of the exploration that we’ve been doing, mainly with online resource usage from our EZProxy logfiles.  In some cases we’ve been able to join that data with demographic and other data from surveys to start to look in a very small way at patterns of online library use.

Design4learning conference poster v3

The poster also highlighted the range of data that libraries capture and the sorts of questions that could be asked and potentially answered.  It also flagged up the leading-edge work by projects such as Huddersfield’s Library Impact Data Project and the work of the Jisc Lamp project.

An interesting conference and an opportunity to talk with a different group of people about the potential of library data.

I was intrigued to read David Weinberger’s blog post ‘Protecting library privacy with a hard opt-in’.  In it he suggests there is a case to be made for asking users to explicitly opt in to publishing details of their checkouts (loans) before you can use that activity data.  I must admit that I’d completely missed the connection between David Weinberger, author of ‘Everything is Miscellaneous’, and his role with the Harvard Innovation Lab, even though I’m sure I’ve blogged about both in the past.

The concern that has been raised is about re-identification, where supposedly ‘anonymous’ datasets can be combined with other data to identify individuals.  There’s a good description of the issue in a 2008 paper by Michael Hay and others at the University of Massachusetts.

Obviously an issue of this type is of critical significance when you might be talking about medical trials data for example, but library data might also be personal or sensitive.  Aside from the personal aspects you could also imagine that a researcher carrying out a literature search for material for a potential new research area would not want ‘competitors’ to know that they were looking at a particular area, particularly now cross-domain research activities are more common.

The issue of anonymity, and of potentially being able to identify an individual from their activity data, has been explored through a number of projects, such as Jisc’s Activity Data programme and its synthesis project outputs, particularly in the section on data protection.   Most of the approaches tackled anonymisation in two ways: by replacing user IDs with a generated ID (described, interestingly, by Hay as ‘naive anonymization’) and by removing data from the dataset where only small numbers of users were included (such as a course with only a few students enrolled).
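Both techniques are easy to sketch. The salt, field names and threshold below are illustrative assumptions, and, as the re-identification literature shows, this kind of ‘naive anonymisation’ on its own is not a guarantee of anonymity:

```python
import hashlib
from collections import Counter

SALT = "keep-this-secret"  # a real release would use a random, unpublished salt

def pseudonymise(user_id):
    """'Naive anonymisation': replace the real ID with a salted hash."""
    return hashlib.sha256((SALT + user_id).encode()).hexdigest()[:12]

def suppress_small_groups(rows, min_size=5):
    """Drop rows whose group (e.g. a course) has fewer than min_size members,
    since tiny groups make individuals easy to re-identify."""
    sizes = Counter(group for _, group in rows)
    return [(pseudonymise(u), g) for u, g in rows if sizes[g] >= min_size]

rows = [(f"user{i}", "BIG101") for i in range(6)] + [("user9", "TINY900")]
released = suppress_small_groups(rows)
print(len(released))  # 6: the single TINY900 row is suppressed
```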

Re-identification techniques seem to work by identifying unique patterns of use, so-called digital fingerprints, that can be used to identify individuals.  When you combine an anonymised dataset with other material you can start to identify individuals.  It certainly seems to be something that needs to be thought about carefully when contemplating releasing datasets.

Is the suggested solution, of asking for explicit permission, the right approach?  If you are planning to release data openly, I’d probably agree.  If you plan to use it only within your systems to generate recommendations, then yes, it’s probably good practice. I worry slightly about the value of the activity data if there is a low opt-in level.  That may significantly diminish its value and usefulness.

I’m not too convinced, though, about the approach of asking users to agree to a public page that lists their activity.  That seems likely to put off people who would otherwise be quite happy for their data to be used, unattributed, in recommendations.  When we’ve asked students about their views on what data we should be able to use, they were quite happy for activity data to be used.   My view would be that it’s fine to show an individual what they have used (and we do that), but not something to share publicly.

To Birmingham today for the second meeting of the Jisc LAMP (library analytics and metrics project) community advisory and planning group. This is a short Jisc-managed project that is working to build a prototype dashboard tool that should allow benchmarking and statistical significance tests on a range of library analytics data.

The LAMP project blog is a good place to start to get up to speed with the work that LAMP is doing, and I’m sure there will be an update on the blog soon to cover some of the things that we discussed during the day.

One of the things that I always find useful about these types of activity, beyond the specific discussions and knowledge sharing about the project and the opportunity to talk to other people working in the sector, is that there is invariably some tool or technique that gets used in the project or programme meetings that you can take away and use more widely. I think I’ve blogged before about the Harvard Elevator pitch from a previous Jisc programme meeting.

This time we were taken through an approach of carrying out a review of the project a couple of years hence, where you had to imagine that the project had failed totally. It hadn’t delivered anything that was useful, so no product, tool or learning came out of the project. It was a complete failure.

We were then asked to try to think about reasons why the project had failed to deliver. So we spent half an hour or so individually writing reasons onto post-it notes. At the end of that time we went round the room reading out the ideas and matching them with similar post-it notes, with Ben and Andy sticking them to a wall and arranging them in groups based on similarity.

It quickly shifted away from going round formally to more of a collective sharing of ideas but that was good and the technique really seemed to be pretty effective at capturing challenges. So we had challenges grouped around technology and data, political and community aspects, and legal aspects for example.

We then spent a bit of time reviewing and recategorising the post-it notes into categories that people were reasonably happy with. Then came the challenge of going through each of the groups of ideas and working out what, if anything, the project could or should do to minimise the risk of that possible outcome happening. That was a really interesting exercise to identify some actions that could be done in the project such as engagement to encourage more take up.

A really interesting demonstration of quite a powerful technique that’s going to be pretty useful for many project settings. It seemed to be a really good way of trying to think about potential hurdles for a project and went beyond what you might normally try to do when thinking about risks, issues and engagement.

It’s interesting to me how so many of the good project management techniques work on the basis of working backwards. Whether that is about writing tasks for a One Page Project Plan by describing the task as if it has been completed, e.g. ‘Site launch completed’, or about working backwards from an end state to plan out the steps and the timescale you will have to go through, both envisage what a successful project looks like, while the pre-mortem thinks about what might go wrong. A useful technique.

Infographics and data visualisations seem to be very popular at the moment, and for a while I’ve been keeping an eye on a site that produces some great infographics and data visualisations.  One of the good things about the infographics is that there is some scope to customise them.  For example there is one about the ‘Life of a hashtag’ that you can customise, and several others around Facebook and Twitter that you can use.

I picked up on Twitter the other week that they had just brought out a Google Analytics infographic.  That immediately got my interest as we make a lot of use of GA.  You just point it to your site through your Google Analytics account and then get a weekly email, ‘Your weekly insights’, created dynamically from your Google Analytics data.

It’s a very neat idea and quite a useful promotional tool to give people a quick snapshot of what is going on.  You get pageviews over the past three weeks, the trends for new and returning visitors, and reports on pages per visit and time on site and how they have changed in the past week.

It’s quite useful for social media traffic showing how facebook and twitter traffic has changed over the past week and as these types of media are things that you often want quite quick feedback on it is a nice visual way of being able to show what difference a particular activity might have had.

Obviously, as a free tool, there’s a limit to the customisation you can do.  It might be nice to have visits or unique visitors to measure change in use of the site, or your top referrals, or the particular pages that have been used most frequently. The time period possibly makes it less useful for me, in that I’m more likely to want to compare against the previous month (or even this month last year).  But no doubt they would build a custom version for you if you wanted something particular.

But as a freely available tool it’s a useful thing to have.  The infographic is nicely presented and gives a visually appealing presentation of analytics data that can often be difficult to present to audiences who don’t necessarily understand the intricacies of web analytics.

The Google Analytics infographic is at

I’d been thinking early this morning about writing up a blog post around some thoughts on ‘Library Analytics’, and thinking it was interesting how ‘Library Analytics’ had been used by Harvard for their ‘Library Analytics Toolkit’ and by others as a way of talking about web analytics, but that neither really seemed to me to be analogous to the way that the learning analytics community, such as SoLAR, views analytics.  There are several definitions of learning analytics.  This one is from Educause’s ‘7 things you should know about first-generation learning analytics’:

Learning analytics (LA) applies the model of analytics to the specific goal of improving learning outcomes. LA collects and analyzes the “digital breadcrumbs” that students leave as they interact with various computer systems to look for correlations between those activities and learning outcomes. The type of data gathered varies by institution and by application, but in general it includes information about the frequency with which students access online materials or the results of assessments from student exercises and activities conducted online. Learning analytics tools can track far more data than an instructor can alone, and at their best, LA applications can identify factors that are unexpectedly associated with student learning and course completion.

Much of the library interest in analytics seems to me to have mainly been about using activity data to understand user behaviour and make service improvements, but I’m increasingly of the view that whilst that is important, it is only half the story.  One of the areas that interests me about both learning analytics and activity data is the empowering potential of that data as a tool for the user, rather than the lecturer or librarian, to find out interesting things about their behaviour, get suggested actions or activities, and essentially make better choices.  And that seems to be the key: just as reviews and ratings on sites like TripAdvisor are helping people become informed consumers, we should be building library systems that help our users to be informed library consumers.

So it was great to see the announcement of the JiscLAMP project this morning announcing the Library Analytics and Metrics project and talking about delivering a prototype shared library analytics service for UK academic libraries.  I was particularly interested to see that the plan is to develop some use-cases for the data and great that Ben Showers shared some of the vision behind the idea.   It’s a great first step to put data on a solid, consistent and sustainable basis, and should build a good platform to be able to exploit that vast reservoir of library data.
