You are currently browsing the monthly archive for November 2013.
I was intrigued to read David Weinberger’s blog post ‘Protecting library privacy with a hard opt-in’. In it he suggests that there is a case to be made for asking users to explicitly opt in to publishing details of their checkouts (loans) before that activity data can be used. I must admit that I’d completely missed the connection between David Weinberger, author of ‘Everything is Miscellaneous’, and his role with the Harvard Innovation Lab, and I’m sure I’ve probably blogged about both in the past.
The concern that has been raised is about re-identification, where supposedly ‘anonymous’ datasets can be combined with other data to identify individuals. There’s a good description of the issue in this 2008 paper by Michael Hay and others from the University of Massachusetts: http://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1176&context=cs_faculty_pubs
Obviously an issue of this type is of critical significance when you might be talking about medical trials data, for example, but library data might also be personal or sensitive. Aside from the personal aspects, you could also imagine that a researcher carrying out a literature search for a potential new research area would not want ‘competitors’ to know that they were looking at that area, particularly now that cross-domain research activities are more common.
The issue of anonymity, and of potentially being able to identify an individual from their activity data, has been explored through a number of projects, such as Jisc’s Activity Data programme and the synthesis project outputs at http://www.activitydata.org, particularly in the section on data protection. Most of the approaches tackled anonymization in two ways: by replacing user IDs with a generated ID (described, interestingly, by Hay as ‘naive anonymization’) and by removing data from the dataset if only small numbers of users were included (such as a course with only a few students enrolled).
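To make those two steps concrete, here is a minimal sketch in Python. It is not taken from any of the Jisc projects; the field names (`user_id`, `course`, `item`), the function name and the threshold of five users are all assumptions of mine for illustration.

```python
import secrets
from collections import defaultdict


def pseudonymise(records, min_group_size=5):
    """Illustrative only. records is a list of dicts with the
    hypothetical keys 'user_id', 'course' and 'item'."""
    # Step 1: map each real user ID to a random generated ID.
    generated = {}
    for record in records:
        if record["user_id"] not in generated:
            generated[record["user_id"]] = secrets.token_hex(8)

    # Step 2: count distinct users per course so small groups can be dropped.
    users_per_course = defaultdict(set)
    for record in records:
        users_per_course[record["course"]].add(record["user_id"])

    released = []
    for record in records:
        # Suppress rows from courses with too few users (assumed threshold).
        if len(users_per_course[record["course"]]) < min_group_size:
            continue
        released.append({
            "user": generated[record["user_id"]],
            "course": record["course"],
            "item": record["item"],
        })
    return released
```

Note that each user still keeps a single generated ID across all of their rows, so their pattern of use survives intact in the released data.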
Re-identification techniques seem to work by finding unique patterns of use, called digital fingerprints, that can be used to identify individuals. When you combine data from an anonymized dataset with other material you can start to identify individuals. It certainly seems to be something that needs to be thought about carefully when contemplating releasing datasets.
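As a toy illustration of the fingerprint idea (my own sketch, not the method from the paper linked earlier), suppose a released dataset has generated user IDs and someone knows from another source a couple of items that a particular person borrowed. The field names and the sample data below are made up.

```python
from collections import defaultdict


def reidentify(released, known_items):
    """Return the generated IDs whose loans include every item in known_items.

    released is a list of dicts with the hypothetical keys 'user' and 'item';
    known_items is a set of items learned from some outside source.
    """
    items_by_user = defaultdict(set)
    for row in released:
        items_by_user[row["user"]].add(row["item"])
    # A generated ID is a candidate if the known items are a subset of its loans.
    return [user for user, items in items_by_user.items() if known_items <= items]


released = [
    {"user": "a1", "item": "Intro to Psychology"},
    {"user": "a1", "item": "Beekeeping Annual 1974"},
    {"user": "b2", "item": "Intro to Psychology"},
    {"user": "b2", "item": "Statistics for Beginners"},
]
candidates = reidentify(released, {"Intro to Psychology", "Beekeeping Annual 1974"})
```

If only one candidate comes back, the generated ID has effectively been tied to a person, and everything else borrowed under that ID is exposed with it.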
Is the suggested solution of asking for explicit permission the right approach? If you are planning to release data openly, I’d probably agree. If you plan to use it only within your systems to generate recommendations, then yes, it’s probably good practice. I worry slightly, though, that a low opt-in level may significantly diminish the value and usefulness of the activity data.
I’m not too convinced, though, about the approach where users agree to a public page that lists their activity. That would seem to me to discourage people from opting in who would otherwise be happy for their data to be used unattributed in recommendations. When we’ve asked students what data we should be able to use, they were quite happy for activity data to be used. My view would be that it’s fine to show an individual what they have used (and we do that), but that it is not something to share.
I’ve noticed recently when searching Google on an iPad that I’m seeing a different results display from the standard desktop display. I’m now seeing the results split up into a set of boxes. So there’s a box at the top containing paid advertising, followed by a box with three results from the web, followed by a box with a single result from news and so on.
In landscape orientation you also get a related searches box on the right of the screen. When you turn to a portrait view the related searches drop to the bottom. At the foot of the page is a next button that takes you to more results including images. On this second screen the related searches have dropped to the bottom and have been replaced by more advertising.
Some of the boxes have a ‘More’ link, for news and images for example. When you go on to pages three and four you are into a fairly standard Google web list, but still placed in a box. I’m not sure when Google started doing this or if this is a feature that is just being tested for mobile devices. Not everyone seems to see it on iPads, so I’d be interested to know under what circumstances you get to see this approach.
It is very reminiscent of the ‘bento box’ approach of pulling results from different places, which is something that we’ve been trying. It’s not dissimilar to NCSU’s approach in terms of showing results from different types of content, e.g. http://www.lib.ncsu.edu/search/?q=psychology
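For anyone unfamiliar with the pattern, a bento box search can be sketched roughly as below. This is only my own minimal illustration, not how NCSU or Google actually do it; the function name, the `per_box` limit and the stub sources are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def bento_search(query, sources, per_box=3):
    """Query each source separately and return a few results per box.

    sources maps a box label to a function taking the query and
    returning a list of results (both hypothetical here).
    """
    with ThreadPoolExecutor() as pool:
        # Send the same query to every source at the same time.
        futures = {label: pool.submit(search, query) for label, search in sources.items()}
        # Keep only the top few results for each box.
        return {label: future.result()[:per_box] for label, future in futures.items()}


# Stub sources standing in for separate back-end systems.
sources = {
    "Articles": lambda q: [f"{q} article {n}" for n in range(1, 6)],
    "Books & Media": lambda q: [f"{q} book {n}" for n in range(1, 3)],
    "Library website": lambda q: [],
}
boxes = bento_search("psychology", sources)
```

Each box is filled independently, so the page never needs a single merged relevance ranking across the different systems.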
I think I’m quite surprised to find Google looking at this route. Libraries are looking at it because it is a way to bring results together from several different systems. Those systems are often the front ends of the systems that are used to manage different types of content, and we often seem to struggle to join up all the different types of content into one integrated search solution. Google have come to this from a very different place, in that they have their content organised by themselves in what you would presume is a consistent way, but they still feel the need to highlight content of different types (news, videos, images) to people.
But I think the difference is in the types of things that are being pulled out here. You can see from NCSU that a typical list of different ‘stuff’ for libraries is Articles, Databases, Books & Media, Journals, Library website. Yet for Google it is news, videos, images, maps, essentially quite high-level format concepts. And I’m starting to think that it is one of the real problems for libraries that we have put ourselves in a position where articles and journals are somehow seen to be two different and separate things, when in reality one is just the packaging of several of the others together.