One of the great things about new projects is that they offer the opportunity to learn new skills as well as build on existing knowledge. So our new library data project is giving plenty of opportunities to learn new things and new tools to help with data extraction and data analysis.
After a bit of experimentation about the best method of getting extracts of library data (including trying to doing it through Access) we settled on using MySQL Workbench version 6.3 with read-only access to the database tables storing the library data. It’s been a bit of a learning curve to understand the tool, the SQL syntax and the structure of our data but direct access to the data means that the team can extract the data needed and quickly test out different options or extracts of data. In the past I’ve mainly used tools such as Cognos or Oracle Business Inteligence which essentially hide the raw SQL queries behind a WYSIWYG interface, so it’s been interesting to use this approach. It’s been really useful to be learning the tool with the project team, because it means that I can get SQL queries checked to make sure they are doing what I think they are doing, and to share queries across the team.
In the main I’m running the SQL query and checking that I’ve got the data I want but then exporting the data as .csv to do further data tidying and cleaning in MS Excel. But I have learnt a few useful things including how add in an anonymised ID as part of the query (useful if you don’t need the real ID but just need to know which users are unique and much easier to do in SQL than in Excel).
I’ve certainly learnt a lot more about Excel. It’s been the tool that I’ve used to process the data extracts, to join data together from other sources and (for the time being at least) to present tables and visualisations of the data. Filtering and pivot tables have been the main techniques, with frequent use of pivot tables to filter data and provide counts. Features such as Excel 2013’s pivot table ‘distinct count’ have been useful.
One of the tasks I’ve been doing in Excel is to join two data sources together, e.g. joining counts of library use via ezproxy and by athens, or joining library use with data on student results. I’d started mainly using VLOOKUP in Excel but have switched (on the recommendation of a colleague) to using INDEX/MATCH as it seems to work much better (if you can get the syntax exactly right.
The project team is starting to think that as we learn more about SQL that we try to do more of the data manipulation and counts directly through the SQL queries as doing them in Excel can be really time-consuming.
SPSS has been a completely new tool to me. We’re using IBM SPSS Statistics version 21 to carry out the statistical analyses. Again it’s got a steep learning curve and I’m finding I need frequent recourse to some of the walk-throughs on sites such as Laerd statistics e.g. https://statistics.laerd.com/spss-tutorials/one-way-anova-using-spss-statistics.php But I’m slowly getting to grips with it and as I get more familiar with it I can start to see more of the value in it. Once you’ve got the data into the data table and organised properly it’s really quick to run correlation or variance tests, although it quickly starts to flag up queries about, which test to use and why, and what do the results mean? I particularly like the output window that it uses to track all the actions and show any charts you’ve created or analyses you’ve undertaken on the data.
The team is in the early stages of exploring the SAS system that is used for our institutional data warehouse. Ultimately we’d want to get library use data into the institutional data warehouse and then query it alongside other institutional data directly from the warehouse. SAS apparently has statistical analysis capabilities but the learning curve seems to be fairly high. We’ve also thought about whether tools such as Open Refine might be useful for cleaning up data but haven’t been able to explore that yet. Similarly I know we have a need for tools to present and visualise the data findings – ultimately that might be met by an institutional SAS Visual Analytics tool.