I was intrigued to read David Weinberger’s blog post ‘Protecting library privacy with a hard opt-in’. In it he suggests there is a case to be made for asking users to explicitly opt in to publishing details of their checkouts (loans) before that activity data can be used.  I must admit that I’d completely missed the connection between David Weinberger, author of ‘Everything is Miscellaneous’, and his role with the Harvard Innovation Lab, even though I’m sure I’ve blogged about both in the past.

The concern that has been raised is about re-identification, where supposedly ‘anonymous’ datasets can be combined with other data to identify individuals.  There’s a good description of the issue in a 2008 paper by Michael Hay and colleagues at the University of Massachusetts: http://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1176&context=cs_faculty_pubs

Obviously an issue of this type is of critical significance when you are talking about, for example, medical trials data, but library data can also be personal or sensitive.  Aside from the personal aspects, you could imagine that a researcher carrying out a literature search for a potential new research area would not want ‘competitors’ to know they were looking at that area, particularly now that cross-domain research is more common.

The issue of anonymity, and of potentially being able to identify an individual from their activity data, has been explored through a number of projects, such as Jisc’s Activity Data programme and its synthesis project outputs at http://www.activitydata.org, particularly in the section on data protection.   Most of the approaches tackled anonymization in two ways: by replacing user IDs with a generated ID (described, interestingly, by Hay as ‘naive anonymization’), and by removing data from the dataset where only small numbers of users were involved (such as a course with only a few students enrolled).
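The two steps above could be sketched roughly as follows. This is a minimal illustration, not any project’s actual code: the record fields, salt, and threshold are all hypothetical choices made for the example.

```python
import hashlib

K_THRESHOLD = 2  # hypothetical cut-off: suppress groups with fewer users than this


def naive_anonymize(records, salt="example-salt"):
    """Replace each user ID with a generated pseudonym ('naive anonymization')."""
    out = []
    for r in records:
        pseudonym = hashlib.sha256((salt + r["user"]).encode()).hexdigest()[:12]
        out.append({**r, "user": pseudonym})
    return out


def suppress_small_groups(records, key="course", k=K_THRESHOLD):
    """Drop records whose group (e.g. a course) has fewer than k distinct users."""
    users_per_group = {}
    for r in records:
        users_per_group.setdefault(r[key], set()).add(r["user"])
    return [r for r in records if len(users_per_group[r[key]]) >= k]


loans = [
    {"user": "alice", "item": "b1", "course": "PHYS101"},
    {"user": "bob", "item": "b2", "course": "PHYS101"},
    {"user": "carol", "item": "b3", "course": "RARE499"},  # only one enrolled user
]
released = suppress_small_groups(naive_anonymize(loans))
# RARE499 is suppressed because only one user borrowed within it
```

Note that, as Hay’s paper points out, pseudonymizing the IDs alone does little: the pattern of activity attached to each pseudonym is what enables re-identification, which is why the small-group suppression step matters.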

Re-identification techniques seem to work by finding unique patterns of use, so-called digital fingerprints.  When you combine an anonymized dataset with other material, those fingerprints can be matched and individuals identified.  It certainly seems to be something that needs to be thought about carefully when contemplating releasing datasets.
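To make the fingerprint idea concrete, here is a toy sketch of how auxiliary knowledge can single out a pseudonym. The pseudonyms, item IDs, and ‘known items’ are all invented for the example; real attacks are statistical rather than this exact set matching.

```python
# Anonymized release: pseudonym -> set of borrowed items.
anonymized = {
    "u_9f3a": {"b1", "b2", "b3"},
    "u_41cc": {"b2", "b4"},
    "u_7de0": {"b1", "b5", "b6"},
}

# Auxiliary knowledge: we happen to know a colleague borrowed b1 and b5.
known_items = {"b1", "b5"}

# Any pseudonym whose history contains everything we know is a candidate.
candidates = [u for u, items in anonymized.items() if known_items <= items]

if len(candidates) == 1:
    # A unique match means the pseudonym is effectively re-identified.
    print("Re-identified pseudonym:", candidates[0])
```

The point is that the pseudonym itself never leaks anything; the distinctive *combination* of items does, which is exactly why unusual borrowing patterns (like a lone researcher probing a new area) are the hardest to protect.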

Is the suggested solution, of asking for explicit permission, the right approach?  If you are planning to release data openly, I’d probably agree.  If you plan to use it only within your own systems to generate recommendations, then it’s probably still good practice. I worry slightly, though, about the value of the activity data if the opt-in rate is low: that may significantly diminish its usefulness.

I’m not so convinced, though, by the approach of asking users to agree to a public page that lists their activity.  That would seem to me to discourage people who might be quite happy for their data to be used, unattributed, in recommendations from opting in at all.  When we’ve asked students for their views on what data we should be able to use, they were quite happy for activity data to be used.   My view would be that it’s fine to show individuals what they have used (and we do that), but not to share it publicly.