We’re in the early stages of our work with library data, and I thought I’d write up some reflections on what we’ve learnt so far. We’ve mostly confined ourselves to understanding the library data we have and finding suitable methods to access and manipulate it. We’re interested in aggregations of the data, e.g. by week, by month and by resource, and in comparisons with total student numbers.
One of our main sources of data is from ezproxy, which we use for both on and off-campus use of online library resources. Around 85-90% of our authenticated resource access goes through this system. One of the first things we learnt when we started investigating this data source is that there are two levels of logfile – the full log of all resource requests and the SPU (Starting Point URL) logfile. The latter tracks the first request to a domain in a session.
We looked at approaches that others had taken to help shape how we analysed the data. The University of Wollongong (UWL), for example, decided to analyse the timestamp as follows:
- The day is divided into 144 10-minute sessions
- If a student has an entry in the log during a 10-minute period, then 1/6 is added to the sum of that student’s access for that session (or week, in the case of the Marketing Cube).
- Any further log entries during that student’s 10-minute period are not counted.
Using this logic, UWL measures how long students spent using its electronic resources with a reasonable degree of accuracy due to small time periods (10 minutes) being measured.
Discovering the Impact of Library Use and Student Performance, Cox and Jantti 2012 http://er.educause.edu/articles/2012/7/discovering-the-impact-of-library-use-and-student-performance
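As a rough illustration of the counting rule quoted above, here is a minimal Python sketch. It is not UWL’s code: the function name, the (student id, timestamp) input shape and the sample entries are made up for the example, and it assumes the 1/6 stands for a sixth of an hour.

```python
from collections import defaultdict
from datetime import datetime

SLOT_MINUTES = 10                      # 144 slots per day
HOURS_PER_SLOT = SLOT_MINUTES / 60     # assumption: each occupied slot adds 1/6 of an hour


def hours_of_use(entries):
    """Estimate hours of use per student from (student_id, timestamp) pairs.

    A student is credited once for each distinct 10-minute slot in which
    they have at least one log entry; further entries in that slot are ignored.
    """
    occupied = defaultdict(set)
    for student_id, ts in entries:
        slot = (ts.date(), (ts.hour * 60 + ts.minute) // SLOT_MINUTES)
        occupied[student_id].add(slot)
    return {student: len(slots) * HOURS_PER_SLOT for student, slots in occupied.items()}


# Made-up log entries, for illustration only
log = [
    ("s1", datetime(2015, 3, 2, 9, 1)),
    ("s1", datetime(2015, 3, 2, 9, 7)),    # same slot as the entry above, not counted again
    ("s1", datetime(2015, 3, 2, 9, 12)),   # falls in the next slot
    ("s2", datetime(2015, 3, 2, 23, 55)),
]
usage = hours_of_use(log)
```

Here student s1 occupies two slots and s2 one. Note that this needs every request with its timestamp, which is why it depends on the full log rather than the SPU log.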
Adopting this approach would mean working from the full log files to pick up each of the 10-minute sessions. Unfortunately, owing to the size of our full logs, that wasn’t going to be feasible, so we’d have to use the SPU version and take a different approach.
A small proportion of our resource authentication goes through OpenAthens. Each month we get a logfile of resource accesses that have been authenticated using this route. Unlike the ezproxy data, there is no date or timestamp; all we know is that those resources were accessed during the month. Against each resource/user combination there is a count of the number of times that combination occurred during the month.
Looking into the data, one of the interesting things we’ve been able to identify is that OpenAthens authentication is also used for things other than library resources. For example, we use it for some library tools such as RefWorks and Library Search. It’s straightforward to take those out if they aren’t wanted in your analysis.
So one of the things we’ve been looking at is how easy it is to add the OpenAthens and ezproxy data together. There are similarities between the datasets, but some processing is needed to join them up. The ezproxy data can be aggregated to a monthly level to match OpenAthens, and there are a few resources that we have access to via both routes, so those resource names need to be normalised.
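A minimal sketch of that processing in Python might look like the following. The alias table, resource names and field layout are all hypothetical, and summing monthly totals is just one way of joining the two sources.

```python
from collections import Counter
from datetime import datetime

# Hypothetical mapping from the names each source uses to one agreed name
RESOURCE_ALIASES = {
    "www.exampledb.com": "Example Database",
    "ExampleDB (Athens)": "Example Database",
}


def normalise(resource):
    """Return the agreed name for a resource, or the cleaned input if none is mapped."""
    name = resource.strip()
    return RESOURCE_ALIASES.get(name, name)


def monthly_counts(spu_entries):
    """Collapse (user, resource, timestamp) SPU entries to monthly totals."""
    counts = Counter()
    for user, resource, ts in spu_entries:
        counts[(ts.strftime("%Y-%m"), user, normalise(resource))] += 1
    return counts


# Made-up rows, for illustration only
ezproxy = [
    ("u1", "www.exampledb.com", datetime(2015, 1, 5, 10, 0)),
    ("u1", "www.exampledb.com", datetime(2015, 1, 20, 14, 30)),
    ("u2", "journals.example.org", datetime(2015, 1, 9, 9, 15)),
]
openathens = [("2015-01", "u1", "ExampleDB (Athens)", 3)]  # month, user, resource, count

combined = monthly_counts(ezproxy)
for month, user, resource, n in openathens:
    combined[(month, user, normalise(resource))] += n
```

Because both names for the example database map to the same label, u1’s two ezproxy accesses and three OpenAthens accesses end up in a single monthly total.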
The biggest difference between the two datasets is that the ezproxy data has a logfile entry for each SPU access, whereas the OpenAthens data has a monthly total for each user/resource combination. One approach we’ve tried is simply to duplicate the rows: where the count says the resource/user combination appeared twice in the month, we copy the line so it appears twice. That way the two sets of data are comparable and can be analysed together. So if you wanted to do a headcount of users who’ve accessed one or more library resources in a month, you could include data from both ezproxy- and OpenAthens-authenticated resources.
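The row-duplication idea can be sketched in a few lines of Python. The tuple layout and sample rows are made up, and the sketch assumes user identifiers are the same in both sources, which would need checking against the real data.

```python
def expand_rows(openathens_rows):
    """Repeat each (month, user, resource, count) row `count` times."""
    for month, user, resource, count in openathens_rows:
        for _ in range(count):
            yield (month, user, resource)


def headcount(rows, month):
    """Number of distinct users with at least one access in the given month."""
    return len({user for m, user, _ in rows if m == month})


# Made-up rows, for illustration only
openathens = [("2015-01", "u1", "Resource A", 2), ("2015-01", "u3", "Resource B", 1)]
ezproxy = [("2015-01", "u1", "Resource C"), ("2015-01", "u2", "Resource A")]  # already one row per access

all_rows = ezproxy + list(expand_rows(openathens))
users_in_january = headcount(all_rows, "2015-01")
```

The headcount is unaffected by the duplication, since each user is only counted once, but the expanded rows mean access totals can be summed across both sources in the same way.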
Numbers and counts
One thing we’ve found is that users of the data want several different counts of users and data from the library e-resources usage data. The sorts of questions we’ve had to think about so far include:
- What percentage of students have accessed a library resource in 2014-15? – (count of students who’ve accessed 1 or more library resources)
- What percentage of students have accessed library resources for modules starting in 2014? – a different question to the first one as students can be studying more than one module at a time
- How much use of library resources is made by the different Faculties?
- How many resources have students accessed? What’s the average per student, per module, per level?
Those have raised a few interesting questions, including which student number you take when calculating means: the number at the start, at the end, or part-way through?
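To show how much the choice of denominator matters, here is a small Python sketch with made-up figures. The function name and the idea of passing in a set of enrolled student ids are assumptions for the example, not a description of how our student data is actually held.

```python
def usage_summary(accesses, enrolled):
    """Percentage of enrolled students with any access, and mean accesses per student.

    `accesses` maps student id to number of resource accesses; `enrolled` is
    the set of student ids used as the denominator.
    """
    active = [s for s in enrolled if accesses.get(s, 0) > 0]
    total = sum(accesses.get(s, 0) for s in enrolled)
    return {
        "percent_active": 100 * len(active) / len(enrolled),
        "mean_per_student": total / len(enrolled),
    }


# Made-up figures, for illustration only
accesses = {"s1": 10, "s2": 4, "s3": 0, "s4": 6}
at_start = {"s1", "s2", "s3", "s4", "s5"}   # s5 never accessed anything
at_end = {"s1", "s2", "s4"}                 # s3 and s5 have withdrawn

start_view = usage_summary(accesses, at_start)
end_view = usage_summary(accesses, at_end)
```

With the start-of-year cohort, 60% of students have accessed a resource and the mean is 4 accesses each. With the end-of-year cohort, the same usage data gives 100% and a mean of about 6.7.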
In the New Year we’ve more investigation and more data to tackle and should be able to start to join library data up with data that lets us explore correlations between library use, retention and student success.