Posted on February 1st, 2013 by Paul Stainthorp
It’s been a while since we ran the Jerome project at the University of Lincoln, but it’s far from dead, and thanks to the recent leaps forward in establishing a proper data.lincoln.ac.uk (and data.ac.uk) portal, you can now access a permanent copy of our open catalogue data, at:
Just as in the original Jerome application, this data is harvested continuously from our catalogue, one record at a time in an endless cycle that takes a number of days to complete.
It’s a ‘minimally invasive’ method that doesn’t put too heavy a load on the catalogue itself, or require us to run any additional software on our catalogue server – and it means that, on average, no record in the open data is more than a couple of days out of date. The data harvested is stored in Nucleus before being processed and published to data.lincoln.ac.uk.
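Conceptually, the cycle is no more complicated than the sketch below. This is not the production Nucleus code: the two helpers and the record-ID list are hypothetical stand-ins, and the pacing number is only there to illustrate the "minimally invasive" idea.

```php
<?php
// Conceptual sketch of the 'one record at a time, endless cycle' harvest.
// The helpers and the ID list are hypothetical stand-ins, not the real pipeline.
function fetchRecordOverZ3950($recordId) { return "record $recordId"; }  // stand-in
function stageInNucleus($record) { /* stored, then processed and published */ }

$allRecordIds = range(1, 250000);      // placeholder for the catalogue's record IDs

while (true) {                         // the endless cycle
    foreach ($allRecordIds as $recordId) {
        stageInNucleus(fetchRecordOverZ3950($recordId));  // one small request per record
        sleep(2);                      // pace requests so a full pass takes days, not hours
    }
}
```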
If you have any technical questions about the process, it’s worth contacting LNCD (specifically, Nick Jackson).
The biggest difference between the original Jerome and this new process is that Jerome scraped XML views of catalogue records from our web OPAC, while son-of-Jerome harvests the records one at a time over Z39.50, using the YAZ PHP extension. We’re also publishing the data this time as BibJSON, rather than MakeItUpAsWeGoAlongJSON.
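To give a flavour of the Z39.50 side, here is a hedged sketch using the YAZ PHP extension. The host, database, record ID and the hard-coded BibJSON mapping are placeholders rather than the real harvester, which parses the MARC record properly.

```php
<?php
// Hedged sketch (not the production harvester): fetch a single catalogue record
// over Z39.50 with the YAZ PHP extension, then map it into a minimal
// BibJSON-style array. The host, database and record ID are placeholders.
$conn = yaz_connect('z3950.library.example.ac.uk:210/CATALOGUE');
yaz_syntax($conn, 'usmarc');
yaz_search($conn, 'rpn', '@attr 1=12 123456');  // look up one record by local ID
yaz_wait();

if ($err = yaz_error($conn)) {
    exit("Z39.50 error: $err\n");
}

$marcxml = yaz_record($conn, 1, 'xml');  // the single hit, as MARCXML
yaz_close($conn);

// The real harvester parses $marcxml; the values below are hard-coded purely
// to show the shape of a minimal BibJSON record.
$bibjson = array(
    'title'      => 'Example title (from MARC 245$a)',
    'author'     => array(array('name' => 'Example author (from MARC 100$a)')),
    'identifier' => array(array('type' => 'isbn', 'id' => '9780000000000')),
);
echo json_encode($bibjson) . "\n";
```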
There’s a lot more data to come, including:
- Richer bibliographic data on each item (it’s somewhat bare-bones at the minute!)
- Library item data (i.e. copies of particular works)
- Reading lists
- Repository records
- Usage and activity data
Posted on May 16th, 2012 by Paul Stainthorp
Today we published data on approximately 1.8 million items loaned from the University of Lincoln’s libraries since 2001. The data is available to re-use under a CC0 licence, and can be downloaded from:
We’ve done this as part of our involvement in the Copac Activity Data Project, a.k.a. SALT2. Along with data from the universities of Manchester, Sussex, Cambridge and Huddersfield, our circulation data will be used to power a ‘recommender API’, which libraries will be able to use to build “People who borrowed X also borrowed Y”-type services. The API will benefit from the power of aggregated data from multiple institutions of different types, containing tens of millions of circulation events.
You’ll notice as well that we’ve chosen to host the data on our brand-new Orbital (v0.1) research data management application. Each dataset has a persistent citable URI. We’ll be keeping the data up-to-date, and generating a new activity data file from our library circulation logs shortly after the end of each academic year.
The data consists of a number of CSV files (one for each academic year since 2000-01, plus a huge file of all the data), containing the following fields:
- The date and time of the loan event, in the format dd/mm/yyyy hh:mm
- A cryptographic hash of the internal system ID associated with the borrower of the item, as used in the University of Lincoln’s library system
- A cryptographic hash of the internal system ID associated with the bibliographic work borrowed, as used in the University of Lincoln’s library system
- The ISBN of the work borrowed (10 or 13 digits)
- The main author of the work borrowed
- The title of the work
- The publication year of the work, in the form yyyy
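By way of illustration, here is a minimal PHP sketch that reads one of the yearly files and counts loans per ISBN. The filename is a placeholder (check the actual filenames on the dataset page); the column order follows the field list above.

```php
<?php
// Minimal sketch: count loans per ISBN in one of the yearly CSV extracts.
// The filename is a placeholder; columns follow the field list above.
$loansPerIsbn = array();

$fh = fopen('lincoln-circulation-2010-2011.csv', 'r');
while (($row = fgetcsv($fh)) !== false) {
    // [0] date/time, [1] borrower ID hash, [2] work ID hash,
    // [3] ISBN, [4] author, [5] title, [6] publication year
    list($when, $borrower, $work, $isbn, $author, $title, $year) = $row;
    $loansPerIsbn[$isbn] = isset($loansPerIsbn[$isbn]) ? $loansPerIsbn[$isbn] + 1 : 1;
}
fclose($fh);

arsort($loansPerIsbn);
print_r(array_slice($loansPerIsbn, 0, 10, true));  // the ten most-borrowed ISBNs
```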
I’ll blog in detail another time about exactly how we created the data extracts. In short:
- There is a table in the SirsiDynix Horizon library management system called circ_tran which records every instance of item number X borrowed by user number Y at time Z. [#1]
- There is another table which provides a lookup between item numbers and the numbers of the bibliographic works of which they are a copy. [#2]
- Dave Pattern at the University of Huddersfield wrote a Perl script which scrapes all the bibliographic data (title, author, ISBN) for each work from our OPAC (Horizon Information Portal) and writes it to a text file. [#3]
- Developer Jamie Mahoney of CERD/LNCD then stepped in, using some pretty heavy SQL on the original three data extracts, to:
- Hash the internal Horizon user and work ID numbers to provide anonymity;
- Convert the internal Horizon date and time stamps in extract [#1] from a version of Unix time into a readable datestamp (formula hint: cko_date*86400 + cko_time*60; see the sketch after this list);
- Use the item/work lookup table [#2] to pull in the bibliographic details for each loan in [#1] from the bibliographic table [#3] (an epic SQL JOIN query), removing items which are no longer represented in our library system;
- Remove any items without an ISBN, which are of no use to the SALT recommender API;
- Tweak the punctuation and formatting;
- Split the data into separate files for each year.
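To make the date arithmetic and anonymisation concrete, here is a minimal sketch in PHP. The real processing was done in SQL, and the exact hash algorithm isn’t specified above, so the SHA-1 here is purely illustrative.

```php
<?php
// Minimal sketch of two of the steps above (the real processing was done in SQL).
// Horizon stores the checkout date as days and the time as minutes, so
// cko_date*86400 + cko_time*60 gives a standard Unix timestamp in seconds.
function loanTimestamp($cko_date, $cko_time) {
    $unixSeconds = $cko_date * 86400 + $cko_time * 60;
    return gmdate('d/m/Y H:i', $unixSeconds);   // dd/mm/yyyy hh:mm, as published
}

// Anonymise an internal Horizon ID. SHA-1 is illustrative only; the published
// data simply promises 'a cryptographic hash' of the internal ID.
function anonymiseId($internalId) {
    return sha1($internalId);
}

echo loanTimestamp(15476, 754) . "\n";  // prints "16/05/2012 12:34"
echo anonymiseId('1234567') . "\n";
```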
Once again, the data is at:
Thanks are due to Chris Leach and Dave Pattern for Horizon-fu, and to Jamie Mahoney for his patient wrangling of several million lines of data!
You can find out more about the Copac Activity Data Project/SALT2, at: http://copac.ac.uk/innovations/activity-data/