Progress so far, and some of the challenges around identifiers and ISBNs we’re facing along the way

Over the last few weeks we’ve been liaising with our colleagues at the University of Sussex, Cambridge University Library, and Lincoln University to extract data and bring it over here to Mimas to start processing. Our aim is to add those sets to the existing API (along with updated data from JRUL and Huddersfield), so that the recommendations or aggregations of related texts produced are less ‘skewed’ to the JRUL context (course reading lists, etc.).

When we ran the SALT project, we worked only with the substantial JRUL set of circulation data. Interestingly (and usefully), the way that JRUL set up their system locally allowed us to see both ISBNs and the JRUL-assigned work-ID used to identify items. This meant we could deal with items without ISBNs — somewhat critical to our ‘long tail’ hypothesis, which posited that recommenders could help surface under-used items, many of which might pre-date the 1970s, when ISBNs were phased in.

But now we’re dealing with circulation data from more than one source, and of course there are issues with this approach. The JRUL local solution for items without ISBNs is not widely applied, and since we need to map items between different datasets, the only common ID we have is the ISBN. This means that for now we need to shift back to using only the ISBN as our identifier, and adjust our tables and API accordingly. We do see this as limiting, but for the key objectives of this project it’s good enough. However, we want to return to this challenge later in the project to see if we can refine the system so it can surface older items.

The other issue emerging currently is that of multiple ISBNs for the same work – a perennial and complex issue, which is particularly coming to the fore in the debate on how to identify eBooks:

With some of our partners’ data, this field has only one value. It’s difficult to pinpoint exactly where in the supply chain the decision as to which ISBN to assign occurs (it depends on vendor systems and cataloguing practices), but it’s clear it varies a great deal according to institution and process. In other datasets, multiple ISBNs for one work are recorded, and we need to make a call as to which ISBN we work with. We could just go with the first ISBN that appears, but this will likely result in duplicates appearing in the recommendations list; it also means that the algorithm on which the recommendation itself is made is watered down (i.e., recommendations will be less meaningful).

For now, we’re going to have to settle for grabbing the first ISBN to get the demonstrator working.  But we’ll also need to develop a stage in our processing where we map ISBNs, and this would also need to be part of the API (so institutions using the API can also map effectively). Right now we’re trying to find out if there is some sort of service that might help us out here. General consensus is that ‘there must be something’ (surely we’re not the first people to tackle this) but so far we’ve not come across anything that fits the bill.  Any suggestions gratefully received!
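One small piece of the mapping puzzle we can handle ourselves is making sure the ISBNs we do use are comparable at all: the same book may be recorded as an ISBN-10 in one dataset and an ISBN-13 in another. A minimal sketch of that normalisation step (in Python for illustration; function names are our own) which upgrades everything to ISBN-13 before matching:

```python
def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert a hyphen-free ISBN-10 to its ISBN-13 equivalent."""
    # Drop the old check digit, add the 'Bookland' 978 prefix...
    core = "978" + isbn10[:9]
    # ...then compute the ISBN-13 check digit (alternating 1/3 weights).
    total = sum((1 if i % 2 == 0 else 3) * int(d) for i, d in enumerate(core))
    return core + str((10 - total % 10) % 10)

def normalise_isbn(raw: str) -> str:
    """Strip hyphens/spaces and upgrade ISBN-10s so all keys are ISBN-13."""
    s = raw.replace("-", "").replace(" ", "").upper()
    return isbn10_to_isbn13(s) if len(s) == 10 else s
```

This only makes the two forms of the *same* ISBN match, of course; it does nothing about genuinely different ISBNs assigned to one work, which is the harder problem above.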



Announcing the Copac Activity Data Project (otherwise known as SALT 2)

We’re extremely pleased to announce that thanks to funding from JISC, we are about to commence work that builds on the success of SALT, and provides further understanding of the potential of aggregating and sharing library circulation data to support recommender functionality at the local and national levels. From now until July 31st 2012, we want to strengthen the existing business case for openly sharing circulation data to support recommendations, and will produce a scoping and feasibility report for a shared national service to support circulation data aggregation, normalisation, and distribution for reuse via an open API.

To achieve this we plan to aggregate and normalise data from additional libraries, and to make this available alongside the John Rylands Library, University of Manchester dataset through a shared API. Our new partners in this include: Cambridge University Library, Lincoln University Library, Sussex University Library, and the University of Huddersfield Library.

CopacAD will conduct primary research to investigate the following additional use cases:

  • an undergraduate from a teaching and learning institution searching for course related materials
  • academics/teachers using the recommender to support the development of course reading lists
  • librarians using the recommendations to support academics/lecturers and collections development.

At the same time, we’re going to develop a Shared Service Scoping and Feasibility Study, which will explore the options for a shared service for aggregating, normalising and hosting circulation data, and the potential range of web services/APIs that could be made available on top of that data.

Issues we’ll address will include:

  • what infrastructure would need to be in place, and whether the service could scale with demand
  • the potential use cases for such a service, and the benefits to be realised
  • the projected ongoing costs of such a service
  • technical and financial sustainability, including potential business model options moving forward.

If you’re interested in learning more, here’s the proposal for this work [doc].  And as with SALT, we will be regularly updating the community on our progress and lessons learned through this blog.

Introducing the SALT recommender API (based on 10 years of University of Manchester circulation data)

I’m pleased to announce the release of the SALT recommender API which works with over ten years of circulation data from the University of Manchester’s John Rylands Library.

The data source is currently static, but nonetheless yields excellent results. Please experiment and let us know how you get on. Stay tuned for a future post detailing some work we have planned for continuing this project, which will include assessing additional use cases, aggregating more data sources (and adding them to the API) and producing a shared service feasibility report for JISC.

Refining SALT (techie lessons learned)

While early tests with a sample set of data from JRUL were encouraging (see the earlier post, See SALT – a demo), an overhaul of the methodology behind the recommender API was required once the full set of loan transactions was obtained.

It was feared that processing the data into the nborrowers table – containing, for each combination of two items, a count of the number of unique library users to have borrowed both items – might become too onerous with the anticipated 3 million records. That fear turned to blind panic when 8 million loan records actually arrived!

The approach for processing the data for the API was thus re-jigged. As before, the data was loaded into two MySQL tables, items and loans, and then some simple processing pushed the total number of loans for each item into a further table, nloans. The remainder of the logic for the recommender was moved to run, on demand, in the API.
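As an illustration of that pre-processing (a Python stand-in for the PHP/MySQL steps actually used, with invented borrower IDs and ISBNs), building the nloans counts from the raw loans is just an aggregation:

```python
from collections import Counter

# Each loan record pairs an anonymised borrower ID with the item's ISBN.
loans = [
    ("user1", "isbnA"), ("user1", "isbnB"),
    ("user2", "isbnA"), ("user2", "isbnA"),  # repeat loans still count here
]

# The equivalent of the nloans table: total number of loans per item.
nloans = Counter(isbn for _user, isbn in loans)
```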

Given the ISBN of a certain item, let’s say ITEM A, and a threshold value, the PHP script for the API was coded to do the following:

  1. Find the list of all users in the loans table who have borrowed ITEM A
  2. For each user found in 1. find the list of all items in the loans table that have been borrowed by that user
  3. Sum across the lists of items found in 2. to compile a single list of all possible suggested items which includes, for each of these items, the number of unique users to have borrowed both that item and ITEM A
  4. From the list in 3. remove ITEM A and any items for which the number of unique users falls below the given threshold
  5. For each item in the list derived in 4. divide the number of unique users of that item by the total number of times that item has been borrowed, from the nloans table
  6. Rank the items in the list in 5. by the ratio of unique users to total loans
  7. Find the details of each item in the list in 6. from the items table and return the list of suggestions
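The seven steps above can be sketched in Python (standing in for the PHP implementation; here loans is a list of (user, isbn) pairs and nloans maps each ISBN to its total loan count, both invented for illustration):

```python
from collections import Counter

def suggest(item_a, loans, nloans, threshold):
    # 1. Find all users who have borrowed ITEM A.
    users_a = {user for user, isbn in loans if isbn == item_a}
    # 2-3. For those users, count the unique users per co-borrowed item.
    counts = Counter()
    for user, isbn in set(loans):  # set() de-duplicates repeat loans
        if user in users_a:
            counts[isbn] += 1
    # 4. Remove ITEM A itself and anything below the threshold.
    candidates = {i: n for i, n in counts.items()
                  if i != item_a and n >= threshold}
    # 5-6. Rank by the ratio of unique co-borrowers to total loans.
    ranked = sorted(candidates,
                    key=lambda i: candidates[i] / nloans[i],
                    reverse=True)
    # 7. (Item details would be looked up in the items table here.)
    return ranked
```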

Testing showed that certain queries of the MySQL database involved in the above process were time consuming and affected the responsiveness of the API.  The following extra pre-processing was thus performed:

  • The items table was split into 10 smaller tables
  • The loans table was split into 5 smaller tables

With queries rewritten so that searches access each of these smaller tables in turn, rather than just looking at the original, large tables, there was a significant boost in API performance. The number of divisions for the above splits was somewhat arbitrary but was sufficient to render the API usable for testing.
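The rewritten queries follow a simple pattern: visit each of the smaller tables in turn and pool the rows. A sketch of the idea, using an in-memory SQLite database standing in for MySQL (table and column names invented):

```python
import sqlite3

def borrowers_of(conn, isbn, n_splits=5):
    """Collect borrower IDs for an ISBN by querying each of the
    smaller loans_N tables in turn, instead of one large loans table."""
    rows = []
    for n in range(n_splits):
        rows += conn.execute(
            f"SELECT user_id FROM loans_{n} WHERE isbn = ?", (isbn,)
        ).fetchall()
    return [user_id for (user_id,) in rows]

# A tiny stand-in for the five split loans tables.
conn = sqlite3.connect(":memory:")
for n in range(5):
    conn.execute(f"CREATE TABLE loans_{n} (user_id TEXT, isbn TEXT)")
conn.execute("INSERT INTO loans_0 VALUES ('u1', 'isbnA')")
conn.execute("INSERT INTO loans_3 VALUES ('u2', 'isbnA')")
```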

Further analysis would more than likely bring additional performance benefits, especially as the amount of data is only going to grow (*). Also on the to-do list is expanding the range of output formats for the API; at present only xml and json are offered, though the developers implementing the API in Copac and at JRUL both suggested that jsonp would be easier to work with.
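For reference, jsonp is a small change on top of the json output: the payload is wrapped in a caller-supplied JavaScript callback name, so browsers can load it cross-domain via a script tag. A sketch (the function and callback parameter names are our own invention, not the API’s):

```python
import json

def render(payload, fmt="json", callback="handleSuggestions"):
    """Serialise an API response; 'jsonp' wraps the JSON in a callback call."""
    body = json.dumps(payload)
    return f"{callback}({body});" if fmt == "jsonp" else body
```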

(*) For reference, just over 8 million loan transactions are used for the current SALT recommender covering all available records up to July 2011, and these loans feature around 628,000 individual library items.

See SALT – a demo

A further set of sample data from JRUL, comprising 100,000 loan transactions this time, has been processed and used to test a prototype web API.  Signs are encouraging.

The process begins with data being extracted from the Talis library management system (LMS) at JRUL in CSV format. This data is parsed by a PHP script which separates it into two tables in a MySQL database: the bibliographic details describing an item go into a table called items, and the loan-specific data, including borrower ID, goes into a table called, you’ve guessed it, loans. A further PHP script then processes the data into two additional MySQL tables, nloans and nborrowers; nloans contains the total number of times each item has been borrowed, and nborrowers contains, for each combination of two items, a count of the number of unique library users to have borrowed both items.

With the above steps complete, additional processing is performed on demand by the web API.  When called for a given item, say item_1, the API returns a list of items for suggested reading, where this list is derived as follows.  From the nborrowers table a list of items is compiled from all combinations featuring item_1.  For each item in this list the number of unique borrowers, from the nborrowers table, is divided by the total number of loans for that item, from the nloans table, following the logic used by Dave Pattern at the University of Huddersfield.  The resulting values are ranked in descending order and the details associated with each suggested item are returned by the API.
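In Python terms (a stand-in for the PHP script, with invented item names), the on-demand part reduces to a lookup in the pre-computed tables and a ratio:

```python
def demo_suggestions(item, nborrowers, nloans):
    """Rank items co-borrowed with `item` by the ratio of unique
    co-borrowers (from nborrowers) to total loans (from nloans)."""
    scores = {}
    for (a, b), n_users in nborrowers.items():
        if item in (a, b):
            other = b if a == item else a
            scores[other] = n_users / nloans[other]
    # Highest ratio of co-borrowers to total loans first.
    return sorted(scores, key=scores.get, reverse=True)
```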

For a bit of light relief here’s an image.

A screenshot of a demonstrator for SALT.

This is a screenshot from a piece of code written to demonstrate the web API.  For a given item, identified by the ISBN, the details are retrieved from the items table in the MySQL database and displayed in [A].  An asynchronous call is made to the web API that accepts the ISBN as a parameter, along with threshold and format values which are set using the controls in [B]; threshold is the minimum number of unique borrowers that any given combination of items must have to be considered, and format specifies how the returned data is required (either xml or json).  Results from the web API are displayed in [C], with the actual output from the API reproduced in [D].  Note that all available results are returned by the API but the test code only shows the number set by the third control in [B].

The exact format of the output is yet to be ratified, but the API is in a state where it can now be incorporated into prototype interfaces at JRUL and in Copac. In addition, the remaining 3 million or so loan transactions from JRUL will be loaded and processed in readiness for user testing.

Surfacing the Academic Long Tail — Announcing new work with activity data

We’re pleased to announce that JISC has funded us to work on the SALT (Surfacing the Academic Long Tail) Project, which we’re undertaking with the University of Manchester, John Rylands University Library.

Over the next six months the SALT project will build a recommender prototype for Copac and the JRUL OPAC interface, which will be tested by the communities of users of those services. Following on from the invaluable work undertaken at the University of Huddersfield, we’ll be working with more than ten years of aggregated and anonymised circulation data amassed by JRUL. Our approach will be to develop an API onto that data, which in turn we’ll use to develop the recommender functionality in both services. Obviously, we’re indebted to the previous knowledge acquired by the similar project at Huddersfield, and the SALT project will work closely with colleagues there (Dave Pattern and Graham Stone) to see what happens when we apply this concept in the research library and national library service contexts.

Our overall aim is that by working collaboratively with other institutions and Research Libraries UK, the SALT project will advance our knowledge and understanding of how best to support research in the 21st century. Libraries are a rich source of valuable information, but sometimes the sheer volume of materials they hold can be overwhelming even to the most experienced researcher — and we know that researchers’ expectations of how to discover content are shifting in an increasingly personalised digital world. We know that library users — particularly those researching niche or specialist subjects — are often seeking content based on a recommendation from a contemporary, a peer, colleagues or academic tutors. The SALT project aims to provide libraries with the ability to give users that information. Similar to Amazon’s ‘customers who bought this item also bought…’, the recommendations on this system will appear on a local library catalogue and on Copac, and will be based on circulation data gathered over the past 10 years at the University of Manchester’s internationally renowned research library.

How effective will this model prove to be for users — particularly humanities researchers?

Here’s what we want to find out:

  • Will researchers in the field of humanities benefit from receiving book recommendations, and if so, in what ways?
  • Will the users go beyond the reading list and be exposed to rare and niche collections — will new paths of discovery be opened up?
  • Will collections in the library, previously undervalued and underused find a new appreciative audience — will the Long Tail be exposed and exploited for research?
  • Will researchers see new links in their studies, possibly in other disciplines?

We also want to consider whether there are other potential beneficiaries. By highlighting rarer collections, valuing niche items and bringing to the surface less popular but nevertheless worthy materials, libraries will have the leverage they need to ensure the preservation of these rich materials. Can such data or services assist in decision-making around collections management? We will be consulting with Leeds University Library and the White Rose Consortium, as well as UKRR, in this area.

And finally, as part of our sustainability planning, we want to look at how scalable this approach might be for developing a shared aggregation service of circulation data for UK university libraries. We’re working with potential data contributors such as Cambridge University Library, University of Sussex Library, and the M25 consortium, as well as RLUK, to trial and provide feedback on the project outputs, with specific attention to the sustainability of an API service as a national shared service for HE/FE that supports academic excellence and drives institutional efficiencies.