Over the last few weeks we’ve been liaising with our colleagues at the University of Sussex, Cambridge University Library, and Lincoln University to extract data and bring it over here to Mimas for processing. Our aim is to add those sets to the existing API (along with updated data from JRUL and Huddersfield), so that the recommendations, or aggregations of related texts, are less ‘skewed’ to the JRUL context (course reading lists, etc.).
When we ran the SALT project, we worked only with the substantial JRUL set of circulation data. Interestingly (and usefully), the way JRUL set up their system locally allowed us to see both the ISBN and the JRUL-assigned work ID for each item. This meant we could deal with items without ISBNs, which was somewhat critical to our ‘long tail’ hypothesis: that recommenders could help surface under-used items, many of which might predate the 1970s, when ISBNs were phased in.
But now we’re dealing with circulation data from more than one source, and that approach runs into problems. The JRUL local solution for items without ISBNs is not widely applied, so it can’t help us map items between different datasets; the only common ID we have is ISBN. This means that for now we need to shift back to using only ISBN as our ID, and adjust our tables and API accordingly. We do see this as limiting, but for our key objectives in this project, it’s good enough. However, we want to return to this challenge later in the project to see if we can refine the system so it can surface older items.
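As a rough illustration of what using ISBN as the common ID can involve (a sketch under assumptions, not the code running at Mimas): different library systems may record the same ISBN with or without hyphens, or in 10- or 13-digit form, so it helps to reduce every value to one form before matching across datasets. The function names below are hypothetical, and the sketch does not validate the check digit of the incoming value.

```python
import re
from typing import Optional

def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert a cleaned 10-character ISBN to its 978-prefixed 13-digit form."""
    core = "978" + isbn10[:9]  # drop the old check character
    # ISBN-13 check digit: weights alternate 1, 3 across the first 12 digits
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(core))
    return core + str((10 - total % 10) % 10)

def normalise_isbn(raw: str) -> Optional[str]:
    """Return a 13-digit key, or None if the value does not look like an ISBN."""
    # Strip hyphens, spaces and trailing notes such as "(pbk.)"
    cleaned = re.sub(r"[^0-9Xx]", "", raw).upper()
    if re.fullmatch(r"[0-9]{9}[0-9X]", cleaned):
        return isbn10_to_isbn13(cleaned)
    if re.fullmatch(r"[0-9]{13}", cleaned):
        return cleaned
    return None
```

With something like this in place, `normalise_isbn("0-306-40615-2")` and `normalise_isbn("978-0-306-40615-7")` produce the same key, so records from two institutions can be joined on it. Values that fail to normalise (for example, a note containing extra digits) come back as `None` and would need separate handling.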
The other issue currently emerging is that of multiple ISBNs for the same work – a perennial and complex problem, which is coming to the fore in the debate on how to identify eBooks: http://publishingperspectives.com/2010/11/isbns-and-e-books-the-ongoing-dilemma/
With some of our partners’ data, the ISBN field holds only one value. It is difficult to pinpoint exactly where in the supply chain the decision about which ISBN to assign is made (it depends on vendor systems and cataloguing practices), but it’s clear it will vary a great deal by institution and process. In other datasets, multiple ISBNs are recorded for one work, and we need to make a call as to which ISBN we work with. We could just go with the first ISBN that appears, but this will likely result in duplicates appearing in the recommendations list; it also means that loans of the same work can end up split across several IDs, which waters down the algorithm on which the recommendation itself is made (i.e., recommendations will be less meaningful).
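To make the trade-off concrete, here is a minimal sketch (an illustration only, not a description of our pipeline) that compares counting loans by first ISBN with counting them after grouping ISBNs that appear together on the same record. The grouping uses a simple union-find; the function names and the made-up identifiers in the usage note are hypothetical. One known weakness of this assumption is that a record listing ISBNs for genuinely different works (a multi-volume set, say) would merge them too.

```python
from collections import Counter

def build_isbn_groups(records):
    """Map each ISBN to a representative, merging ISBNs that share a record."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps chains short
            x = parent[x]
        return x

    for isbns in records:
        for isbn in isbns:
            find(isbn)  # register every ISBN, even ones that appear alone
        for other in isbns[1:]:
            root_a, root_b = find(isbns[0]), find(other)
            if root_a != root_b:
                # Keep the smaller string as root so results are deterministic
                low, high = sorted((root_a, root_b))
                parent[high] = low
    return {isbn: find(isbn) for isbn in list(parent)}

def count_by_first_isbn(loans):
    """Count loans using only the first ISBN on each record."""
    return Counter(isbns[0] for isbns in loans if isbns)

def count_by_group(loans, groups):
    """Count loans against the representative ISBN of each group."""
    return Counter(groups[isbns[0]] for isbns in loans if isbns)
```

For example, given loans recorded as `[["isbn-hb", "isbn-pb"], ["isbn-pb"], ["isbn-ebook", "isbn-pb"]]`, the first-ISBN count sees three different items with one loan each, whereas the grouped count attributes all three loans to a single representative.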
For now, we’re going to have to settle for grabbing the first ISBN to get the demonstrator working. But we’ll also need to develop a stage in our processing where we map ISBNs to one another, and this would also need to be part of the API (so institutions using the API can map effectively too). Right now we’re trying to find out if there is some sort of service that might help us out here. The general consensus is that ‘there must be something’ (surely we’re not the first people to tackle this), but so far we’ve not come across anything that fits the bill. Any suggestions gratefully received!