A comment on one of my search blog posts by Preedip Balaji suggested TAPoR text analysis as a useful tool for comparing the lists of search terms I had been collecting from the tabbed search tool on our old website.  At the time we had three tabbed searches covering the library catalogue, the website and, originally, a federated search tool that later migrated to a discovery search tool.  We’d found that there was considerable overlap between the search terms that users put into the different search boxes, and subsequently we’ve moved away from a tabbed approach on the new website in favour of a single discovery search box.  But at the time I wondered whether there were any text analysis tools that would help me assess how similar the search terms were.

TAPoRware seems to be exactly the sort of text comparison tool that I was looking for.  Developed at the University of Alberta, TAPoR (Text Analysis Portal For Research) has a range of HTML, XML and plain text tools that let you analyse words, find patterns and look for data within text.  So I’ve been playing around with the Comparator tool to compare some of the lists of 100 search terms used in the website, federated and catalogue searches.

The comparator tool lets you compare two sets of data at a time, and you can upload each set as a text file from your computer.  For some reason it wouldn’t accept an Excel file, but it will display the results as either HTML or a tab-delimited file.  The tool first reports how many words each file contains and how many of them are unique or appear multiple times.  It then lists the words that are common to both files and those that are unique to each one.
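To make that kind of output concrete, here is a minimal Python sketch of a word-level comparison along the same lines.  It is not TAPoRware's actual implementation: the function names, the tokenising rule and the sample search terms are all assumptions made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def compare_word_lists(text_a, text_b):
    """Summarise each text and split the vocabulary into common and unique words."""
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    words_a, words_b = set(counts_a), set(counts_b)
    return {
        "total_words": (sum(counts_a.values()), sum(counts_b.values())),
        "distinct_words": (len(words_a), len(words_b)),
        # Words that occur more than once within a single file.
        "repeated_words": (
            sorted(w for w, n in counts_a.items() if n > 1),
            sorted(w for w, n in counts_b.items() if n > 1),
        ),
        "common": sorted(words_a & words_b),
        "unique_to_a": sorted(words_a - words_b),
        "unique_to_b": sorted(words_b - words_a),
    }


# Hypothetical sample data: one search term per line, as in a plain text upload.
catalogue_terms = "harry potter\nclimate change\nnursing journals"
website_terms = "opening hours\nclimate change\nlibrary opening times"

result = compare_word_lists(catalogue_terms, website_terms)
print(result["common"])
print(result["unique_to_a"])
print(result["unique_to_b"])
```

Because the comparison is made word by word, a phrase such as "climate change" shows up as two separate common words rather than as one shared search term.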

The tool only lets you compare two files at a time, whereas ideally I’d have liked to compare three, so I’ve had to run three comparisons to compare each file with the other two.  It also compares the words individually, whereas most of the search terms in my files are actually search phrases.  The table below summarises the comparisons and shows what percentage of words are common to both files or unique to each file of search terms.

Comparison            Common         Unique to 1st   Unique to 2nd
                      Number   %     Number   %      Number   %
Catalogue/Federated   80       41    56       29     60       31
Catalogue/Website     77       37    75       36     57       27
Federated/Website     61       28    95       43     75       34
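
As a worked example of the arithmetic, the sketch below turns the three counts from one comparison into percentages of the combined total.  Treating the total as common plus unique-to-each is an assumption on my part, not something taken from the tool's documentation, although it does reproduce the Catalogue/Federated and Catalogue/Website rows.  The function name is hypothetical.

```python
def share_percentages(common, unique_to_first, unique_to_second):
    """Express each count as a rounded percentage of the three counts combined."""
    total = common + unique_to_first + unique_to_second
    return tuple(
        round(100 * count / total)
        for count in (common, unique_to_first, unique_to_second)
    )


# Counts taken from the Catalogue/Federated and Catalogue/Website rows.
catalogue_federated = share_percentages(80, 56, 60)
catalogue_website = share_percentages(77, 75, 57)
print(catalogue_federated)
print(catalogue_website)
```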

If I understand correctly, the implication is that there is more in common between the search terms for the catalogue and federated search than between federated search and the website.  When I looked at the search terms originally, around 45% had been used across all three of the search boxes, and website search terms did seem to differ slightly from the federated and catalogue searches.  That seems to be borne out by the text comparator, which shows the website search data as having fewer words in common with the other two lists.
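
Since the comparator works on two files and on single words, a three-way comparison of whole search phrases would need something else.  The following is only a rough sketch of how that might be done in Python, assuming each list holds one search term per entry; the function names and the sample terms are invented for illustration and are not my real data.

```python
def normalise(term):
    """Lowercase a search term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def three_way_overlap(catalogue, federated, website):
    """Compare three lists of search phrases as whole terms, not single words."""
    named_lists = {
        "catalogue": catalogue,
        "federated": federated,
        "website": website,
    }
    sets = {
        name: {normalise(t) for t in terms if t.strip()}
        for name, terms in named_lists.items()
    }
    all_terms = set.union(*sets.values())
    in_all = set.intersection(*sets.values())

    result = {"in_all_three": sorted(in_all)}
    for name, terms in sets.items():
        # Everything that appears in either of the other two lists.
        others = set().union(*(s for n, s in sets.items() if n != name))
        result[f"only_{name}"] = sorted(terms - others)
    result["percent_in_all_three"] = (
        round(100 * len(in_all) / len(all_terms)) if all_terms else 0
    )
    return result


# Hypothetical sample data.
overlap = three_way_overlap(
    catalogue=["Climate change", "harry potter", "nursing"],
    federated=["climate  change", "nursing", "jstor"],
    website=["opening hours", "Climate Change", "jstor"],
)
print(overlap["in_all_three"])
print(overlap["percent_in_all_three"])
```

Normalising case and spacing before comparing means that "Climate change" and "climate  change" count as the same term, which seems a sensible default for search logs, though it is a design choice rather than a requirement.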

TAPoR looks like a useful tool, although I’ve barely scratched the surface of what it can do.  Now that we’ve changed our website to just a single discovery search, there’s maybe some further work we can do to analyse the terms that people are using now and compare them with what they used on the tabbed search system.