Data loading, processing, and more challenges with ISBNs — a technical update

A while back I wrote a post detailing some of the challenges we were encountering in resolving ISBNs through the API to ensure that items were allocated relevant recommendations. This problem meant not only that duplicate items could appear in lists of recommendations, but also that their relevance could be weakened. We said then that we were opting to ‘settle for grabbing the first ISBN to get the demonstrator working’, purely for testing purposes.

But then we began work on aggregating and normalising the data from our four additional partners, and found (of course) that the issue was significantly exacerbated as the quantity of data and the variance between records increased. Processing has also slowed considerably as we tackle these larger pots of data, and if this work were taken further we would be exploring how to enhance and streamline the database and processing workflows. In addition, right now calls to the API can return results very slowly, which is clearly not sustainable in the longer term if the API is to be used more broadly as part of a core service infrastructure. For detailed information on the loading and processing routines we’re using, see this document prepared by our developer, Dave Chaplin.

In terms of the ISBN issue, we found our problem was not so much that duplicates were appearing, but that when we implemented the API in Copac many results had no recommendations at all – quite simply because we couldn’t easily match works with the same ISBN to one another. The level of duplication currently existing in the Copac database compounds this further, and is something we’re tackling separately – calling the API against work-level records will go a long way towards making this issue go away for Copac users.
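
For the technically curious, one small part of any matching step is simply normalising the identifiers themselves, since the same edition can be recorded as a hyphenated ISBN-10 in one file and a bare ISBN-13 in another. A rough sketch of that kind of normalisation (illustrative only, not our actual loading code) might look like this:

    def normalise_isbn(raw):
        """Strip punctuation and, where possible, convert ISBN-10 to ISBN-13.

        Illustrative sketch only - not the project's actual loading code.
        """
        isbn = raw.replace("-", "").replace(" ", "").upper()
        if len(isbn) == 13 and isbn.isdigit():
            return isbn
        if len(isbn) == 10 and isbn[:9].isdigit():
            # Convert ISBN-10 to ISBN-13: prefix 978, drop the old check
            # digit and recompute the EAN-13 check (weights alternate 1, 3).
            core = "978" + isbn[:9]
            total = sum((1 if i % 2 == 0 else 3) * int(d)
                        for i, d in enumerate(core))
            return core + str((10 - total % 10) % 10)
        return None  # not an ISBN we can use for matching

    # e.g. normalise_isbn("0-14-012474-8") and normalise_isbn("9780140124743")
    # both yield the same thirteen-digit string.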

But for testing purposes, the problem of empty results has been resolved by using OCLC’s xISBN service, which allows us to cross-walk from one ISBN to any of its aliases that might appear in the transaction data (see figure below). Right now we’re using the free API, which allows a fairly generous 1,000 calls a day – but with the scale of data and use we’re talking about here, the free service is not going to be a viable solution in the long term.
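
For reference, the kind of call involved is very simple – something along these lines, assuming the getEditions method and JSON response shape as we understand them from OCLC’s public documentation (no error handling or rate limiting shown):

    import json
    import urllib.request

    XISBN_BASE = "http://xisbn.worldcat.org/webservices/xid/isbn/"

    def isbn_aliases(isbn):
        """Return the set of ISBNs that xISBN reports as editions of `isbn`.

        Sketch only: no error handling or rate limiting (the free tier is
        capped at 1,000 calls a day), and the URL and JSON shape reflect
        our reading of the getEditions documentation.
        """
        url = XISBN_BASE + isbn + "?method=getEditions&format=json"
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        if data.get("stat") != "ok":
            return {isbn}  # fall back to the original ISBN on any failure
        aliases = {isbn}
        for entry in data.get("list", []):
            aliases.update(entry.get("isbn", []))
        return aliases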

The diagram below gives an overview of how the API currently works with the loan data from the five institutions. Dave has stripped back the API so that it grabs one ISBN from each search result, and we then use xISBN to return all known variants. These aliases are matched against individual (and anonymised) user circulation data in the database (in other words, we find all the people who have that book in common), and we then trawl the database to see what other books those users have in common. Any item borrowed by 8 or more of the people in that subset is automatically recommended. Note that each recommendation is weighted by the total number of times the item has been borrowed (as per Dave Pattern’s methodology, see http://www.daveyp.com/blog/archives/1453) and ranked accordingly, with the top 40 suggestions offered; this is an attempt to present the user with relevant recommendations, rather than simply the related items that have been borrowed the most, while not swamping them with potentially hundreds, if not thousands, of suggestions.

[Figure: simple overview of how the API works]
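
In code terms, what the diagram boils down to is a couple of set operations over the circulation data, followed by a threshold, a weighting, and a cut-off. The sketch below is illustrative rather than the live implementation: the loans_by_isbn and loans_by_borrower structures are invented stand-ins for our actual schema, it reuses the isbn_aliases sketch above, and the exact weighting formula (shared borrowers divided by the item’s total borrowers) is our reading of the weighting rather than a quotation of Dave Pattern’s code.

    from collections import Counter

    def recommendations(seed_isbn, loans_by_isbn, loans_by_borrower,
                        min_borrowers=8, limit=40):
        """Suggest items borrowed alongside any edition of `seed_isbn`.

        `loans_by_isbn` maps an ISBN to the set of anonymised borrower IDs
        who borrowed it; `loans_by_borrower` is the reverse mapping (repeat
        loans of the same item by the same person already collapsed).
        """
        # 1. Expand the seed ISBN to all known aliases via xISBN
        #    (reusing the isbn_aliases sketch above).
        aliases = isbn_aliases(seed_isbn)

        # 2. Find everyone who borrowed any edition of the seed work.
        borrowers = set()
        for alias in aliases:
            borrowers |= loans_by_isbn.get(alias, set())

        # 3. Count how many of those borrowers each other item shares.
        in_common = Counter()
        for borrower in borrowers:
            for other in loans_by_borrower.get(borrower, set()):
                if other not in aliases:
                    in_common[other] += 1

        # 4. Keep items shared by at least `min_borrowers` of that subset,
        #    weight by overall popularity so ubiquitous titles don't
        #    dominate, and return the top `limit` suggestions.
        candidates = [i for i, n in in_common.items() if n >= min_borrowers]

        def score(isbn):
            total_borrows = len(loans_by_isbn.get(isbn, ())) or 1
            return in_common[isbn] / total_borrows

        candidates.sort(key=score, reverse=True)
        return candidates[:limit]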

This approach has improved matters significantly – but not completely. Behind the scenes there is a hell of a lot of processing going on, which slows things down somewhat – and the call to the xISBN service in each instance is not helping matters. The diagram above rather understates the scale we’re often dealing with.

For example, Foucault’s History of Sexuality has been a seminal text in many humanities, arts, and social science disciplines for several decades now. This work has 71 individual ISBN aliases and 3,327 individual borrowers, with 182,270 cumulative ‘items’ associated with those borrowers (or loans, although we don’t count multiple loans of the same item by the same person). Of those 182,270 books borrowed by those 3,327 people, 12,497 have at least 8 of those people in common. Using our current experimental system, the first time we ran that search it took around 70 seconds to process (!).

So that we can test the qualitative value of the results with academic users, we’re storing that search locally so that the next user does not hit the same delay, although (again) this is not a long-term solution for stable service delivery, as it would be reasonable to argue that it goes beyond fair use of the OCLC API. Obviously, further testing would need to be undertaken once the system had been improved, to evaluate both the functionality and the speed.
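
As a rough illustration of that stop-gap, caching amounts to little more than keying the computed recommendation list on the seed ISBN, so a repeat search is a lookup rather than a full re-computation and another round of xISBN calls (the file and table names here are made up for the example):

    import json
    import sqlite3
    import time

    # Illustrative local cache keyed on the seed ISBN; in practice any
    # key-value store would do. File and table names are invented.
    cache = sqlite3.connect("recommendations_cache.db")
    cache.execute("CREATE TABLE IF NOT EXISTS rec_cache "
                  "(isbn TEXT PRIMARY KEY, recs TEXT, created REAL)")

    def cached_recommendations(seed_isbn, loans_by_isbn, loans_by_borrower):
        row = cache.execute("SELECT recs FROM rec_cache WHERE isbn = ?",
                            (seed_isbn,)).fetchone()
        if row is not None:
            return json.loads(row[0])  # repeat search: no recomputation

        recs = recommendations(seed_isbn, loans_by_isbn, loans_by_borrower)
        cache.execute("INSERT OR REPLACE INTO rec_cache VALUES (?, ?, ?)",
                      (seed_isbn, json.dumps(recs), time.time()))
        cache.commit()
        return recs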

Work is now underway to put the prototype in front of groups of academics, undergraduates, and librarians, so we can better understand the value of the service in supporting learning and research. This will all be reported, along with the technical lessons learned and routes forward, in a final shared services feasibility study. Certainly, working with the data in aggregate and at such a large scale has unearthed challenges we had not anticipated – all of them surmountable, but they mean that if we take this development further we will need to go back to the drawing board on system infrastructure, which works fine as a live proof of concept but is not production-ready in terms of handling large amounts of data processing or usage.

