Progress so far, and some of the challenges around identifiers and ISBNs we’re facing along the way

Over the last few weeks we’ve been liaising with our cohorts at the University of Sussex, Cambridge University Library, and Lincoln University to extract data and bring it over here to Mimas to start processing. Our aim is to add those sets to the existing API (along with updated data from JRUL and Huddersfield), so that the recommendations or aggregations of related texts produced are less ‘skewed’ to the JRUL context (course reading lists, etc).

When we ran the SALT project, we worked only with the substantial JRUL set of circulation data. Interestingly (and usefully), the way JRUL set up their system locally allowed us to see both ISBNs and the JRUL-assigned work ID used to identify items. This meant we could deal with items without ISBNs, which was somewhat critical to our ‘long tail’ hypothesis: that recommenders could help surface under-used items, many of which might predate the 1970s, when ISBNs were phased in.

But now we’re dealing with circulation data from more than one source, and of course there are issues with this approach. The JRUL local solution for items without ISBNs is not widely applied, so to map items between different datasets the only common ID we have is ISBN. This means that for now we need to shift back to using only ISBN as the ID we deal with, and adjust our tables and API accordingly. We do see this as limiting, but for our key objectives in this project, it’s good enough. However, we want to return to this challenge later in the project to see if we can refine the system so it can surface older items.
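To make the trade-off concrete, here is a minimal sketch of merging loan records from several libraries when ISBN is the only shared key. The function name, the record layout (borrower ID plus ISBN), and the ISBNs themselves are made up for illustration; this is not the project's actual schema or processing code.

```python
from collections import defaultdict

def merge_loans_by_isbn(datasets):
    """Combine loan records from several libraries, keyed on ISBN.

    `datasets` maps a library name to a list of (borrower_id, isbn) pairs.
    The ISBN may be None or empty for items that have none (hypothetical layout).
    """
    merged = defaultdict(set)  # isbn -> set of (library, borrower_id)
    skipped = 0                # loans dropped because the item has no ISBN
    for library, loans in datasets.items():
        for borrower_id, isbn in loans:
            if not isbn:
                skipped += 1
                continue
            merged[isbn].add((library, borrower_id))
    return dict(merged), skipped

# Made-up sample data: one JRUL loan has no ISBN and cannot be matched.
datasets = {
    "jrul": [("b1", "9780000000002"), ("b2", None)],
    "sussex": [("b1", "9780000000002"), ("b7", "9780000000019")],
}
merged, skipped = merge_loans_by_isbn(datasets)
print(len(merged), "ISBNs merged;", skipped, "loan(s) skipped for lack of an ISBN")
```

The `skipped` count is the limitation described above: any item without an ISBN simply falls out of the merged set.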

The other issue emerging currently is that of multiple ISBNs for the same work – a perennial and complex issue, which is particularly coming to the fore in the debate on how to identify eBooks.

With some of our partners’ data, the ISBN field has only one value. It is difficult to pinpoint exactly where in the supply chain the decision as to which ISBN to assign occurs (it depends on vendor systems and cataloguing practices), but it’s clear it will vary a great deal according to institution and processes. In other datasets, multiple ISBNs are recorded for one work, and we need to make a call as to which ISBN we work with. We could just go with the first ISBN that appears, but this will likely result in duplicates appearing in the recommendations list; it also means that the data on which each recommendation is based is watered down, because loans of the same work are split across several ISBNs (i.e., recommendations will be less meaningful).
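A toy illustration of the ‘watering down’ effect, assuming a simple ‘borrowed together’ count rather than the project's actual algorithm. The loans, the ISBNs, and the `same_work` mapping are all invented for the example.

```python
from collections import Counter

# Hypothetical loans as (borrower, isbn). We assume 9780000000002 and
# 9780000000019 are two editions of the same work.
loans = [
    ("b1", "9780000000002"), ("b1", "9780000000026"),
    ("b2", "9780000000019"), ("b2", "9780000000026"),
    ("b3", "9780000000002"), ("b3", "9780000000026"),
]
same_work = {"9780000000019": "9780000000002"}  # assumed edition mapping

def borrowed_with(loans, target, mapping=None):
    """Count how often each other item was borrowed by someone who borrowed `target`."""
    mapping = mapping or {}
    baskets = {}
    for borrower, isbn in loans:
        baskets.setdefault(borrower, set()).add(mapping.get(isbn, isbn))
    counts = Counter()
    for items in baskets.values():
        if target in items:
            counts.update(items - {target})
    return counts

unmapped = borrowed_with(loans, "9780000000026")
mapped = borrowed_with(loans, "9780000000026", same_work)
print(unmapped)  # the same work appears twice, with its count split
print(mapped)    # one entry carrying the combined count
```

Without the mapping, the related work shows up as two separate suggestions with counts of 2 and 1; with it, there is a single suggestion with a count of 3.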

For now, we’re going to have to settle for grabbing the first ISBN to get the demonstrator working. But we’ll also need to develop a stage in our processing where we map ISBNs, and this would also need to be part of the API (so institutions using the API can also map effectively). Right now we’re trying to find out if there is some sort of service that might help us out here. The general consensus is that ‘there must be something’ (surely we’re not the first people to tackle this), but so far we’ve not come across anything that fits the bill. Any suggestions gratefully received!
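One small piece of such a mapping stage can be done without any external service: normalising ISBN-10s to ISBN-13s, so that the same edition written in two forms is not treated as two items. The sketch below is only that first step, under our own assumptions; the function names are hypothetical, it does not validate incoming check digits, and it does not link different editions of a work, which would still need a mapping source of some kind.

```python
def to_isbn13(raw):
    """Return a 13-digit ISBN string, or None if `raw` is not a 10- or 13-character ISBN."""
    s = raw.replace("-", "").replace(" ", "").upper()
    if len(s) == 13 and s.isdigit():
        return s
    if len(s) == 10 and s[:9].isdigit():
        # Prefix 978, drop the old check digit, then compute the ISBN-13
        # check digit using alternating weights of 1 and 3.
        core = "978" + s[:9]
        total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
        return core + str((10 - total % 10) % 10)
    return None

def canonical_isbn(isbns):
    """Pick one key from a record's ISBN list: the smallest normalised ISBN-13.

    An arbitrary but order-independent choice, unlike 'first ISBN listed'.
    """
    normalised = sorted({n for n in map(to_isbn13, isbns) if n})
    return normalised[0] if normalised else None

print(to_isbn13("0-306-40615-2"))
print(canonical_isbn(["978-0-306-40615-7", "0306406152"]))
```

Two records will still end up with different keys if their ISBN lists differ in the smallest entry, so this reduces duplicates rather than eliminating them.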



Announcing the Copac Activity Data Project (otherwise known as SALT 2)

We’re extremely pleased to announce that, thanks to funding from JISC, we are about to commence work that builds on the success of SALT and provides further understanding of the potential of aggregating and sharing library circulation data to support recommender functionality at the local and national levels. From now until July 31st 2012, we want to strengthen the existing business case for openly sharing circulation data to support recommendations, and will produce a scoping and feasibility report for a shared national service to support circulation data aggregation, normalisation, and distribution for reuse via an open API.

To achieve this we plan to aggregate and normalise data from libraries in addition to JRUL and to make this available, along with the dataset from the John Rylands University Library (JRUL), University of Manchester, through a shared API. Our new partners in this include Cambridge University Library, Lincoln University Library, Sussex University Library, and University of Huddersfield Library.

CopacAD will conduct primary research to  investigate the following additional use cases:

  • an undergraduate from a teaching and learning institution searching for course related materials
  • academics/teachers using the recommender to support the development of course reading lists
  • librarians using the recommendations to support academics/lecturers and collections development.

At the same time, we’re going to develop a Shared Service Scoping and Feasibility study that will explore the options for a shared service for aggregating, normalising and hosting circulation data, and the potential range of web services/APIs that could be made available on top of that data.

Issues we’ll address will include: what infrastructure would need to be in place; how scalable the service would need to be, and whether it can scale with demand; the potential use cases for such a service and the benefits to be realised; the projected ongoing costs of such a service; and its technical and financial sustainability, including potential business model options moving forward.

If you’re interested in learning more, here’s the proposal for this work [doc].  And as with SALT, we will be regularly updating the community on our progress and lessons learned through this blog.