Posts Tagged ‘data’

Hack da Fens: open bib hack day objectives!

Posted on May 17th, 2012 by Paul Stainthorp

Most of the CLOCK project team (AB, EC, CL, TJ, PS) are at CARET in Cambridge today and tomorrow (17-18 May 2012) to hack on bibliographic data and to point the way for the remaining two months of technical development on the CLOCK project.

After coffee on day 1 we agreed our objectives for the next two days. They are:

  1. To review what we’ve done so far and what we need to do. To play with the SPARQL and JSON-parsing search tools that Andrew Beeken has started to develop and to incorporate more data (BL, etc.)
  2. To think about the user interface for CLOCK: how do we present open bib data from multiple sources (Lincoln, Cambridge, Harvard, BL, OpenLibrary, other) in a single UI in a way which helps our users (cataloguers, researchers) solve problems?
  3. What’s the high level architecture for CLOCK? How does data flow thru’ the system – can we draw a meaningful diagram?
  4. A comparison of open data / Discovery projects that Ed Chamberlain is involved in! What can we take and re-use from OpenBiblio2 and the OEM-UK project? What might those projects be able to take and re-use from CLOCK?
  5. What are we going to do with all this data? A plan for http://data.lincoln.ac.uk/, http://data.lib.cam.ac.uk/, and http://data.ac.uk/library (or http://library.data.ac.uk/).
  6. To run interviews and live cognitive workthroughs with cataloguers in Cambridge and Lincoln.

1.8 million library loans from the University of Lincoln under CC0 – Copac Activity Data/SALT2 project

Posted on May 16th, 2012 by Paul Stainthorp

Today we published data on approximately 1.8 million items loaned from the University of Lincoln’s libraries since 2001. The data is available to re-use under a CC0 licence, and can be downloaded from:

We’ve done this as part of our involvement in the Copac Activity Data Project, a.k.a. SALT2. Along with data from the universities of Manchester, Sussex, Cambridge and Huddersfield, our circulation data will be used to power a ‘recommender API’, which libraries will be able to use to build “People who borrowed X also borrowed Y”-type services. The API will benefit from the power of aggregated data from multiple institutions of different types, containing tens of millions of circulation events.
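To make the “People who borrowed X also borrowed Y” idea concrete, here is a minimal Python sketch of co-borrowing counts. It is not the SALT recommender's actual method, which isn't described here; the function names and the sample loans are invented for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_borrow_counts(loans):
    """Count how often each pair of works was borrowed by the same person.

    `loans` is an iterable of (borrower_id, work_id) pairs.
    """
    works_by_borrower = defaultdict(set)
    for borrower_id, work_id in loans:
        works_by_borrower[borrower_id].add(work_id)

    pair_counts = Counter()
    for works in works_by_borrower.values():
        # Sort so that each pair is always stored in the same order.
        for a, b in combinations(sorted(works), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def also_borrowed(pair_counts, work_id, limit=5):
    """Return the works most often borrowed alongside `work_id`."""
    related = Counter()
    for (a, b), n in pair_counts.items():
        if a == work_id:
            related[b] += n
        elif b == work_id:
            related[a] += n
    return [work for work, _ in related.most_common(limit)]


# Invented sample data: (borrower, work) pairs.
loans = [
    ("u1", "X"), ("u1", "Y"),
    ("u2", "X"), ("u2", "Y"), ("u2", "Z"),
    ("u3", "X"), ("u3", "Z"),
    ("u4", "X"), ("u4", "Y"),
]
print(also_borrowed(co_borrow_counts(loans), "X"))  # ['Y', 'Z']
```

Aggregating loans from several institutions simply means feeding more (borrower, work) pairs into the same counts, which is where the benefit of pooled data comes from.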

You’ll notice as well that we’ve chosen to host the data on our brand-new Orbital (v0.1) research data management application. Each dataset has a persistent citable URI. We’ll be keeping the data up-to-date, and generating a new activity data file from our library circulation logs shortly after the end of each academic year.

The data consists of a number of CSV files (one for each academic year since 2000-01, plus a huge file of all the data), containing the following fields:

| Field index | Field name | Description |
|---|---|---|
| 0 | CREATE_DATE | The date and time of the loan event, in the format: dd/mm/yyyy hh:mm |
| 1 | BORROWER_ID | A cryptographic hash of the internal system ID associated with the borrower of the item, as used in the University of Lincoln’s library system. |
| 2 | WORK_ID | A cryptographic hash of the internal system ID associated with the bibliographic work borrowed, as used in the University of Lincoln’s library system. |
| 3 | CONTROL_NUMBER | The ISBN of the work borrowed (10 or 13 digits). |
| 4 | AUTHOR_DISPLAY | The main author of the work borrowed. |
| 5 | TITLE_DISPLAY | The title of the work. |
| 6 | PUB_DATE | The publication year of the work in the form: yyyy |
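If you want to load one of these files, something like the following Python sketch should be close. It assumes the files are comma-separated with fields in the order listed above; whether they carry a header row isn't stated here, so the reader simply skips any row whose first field doesn't parse as a date. The sample row is made up.

```python
import csv
import io
from datetime import datetime

# Field order as described in the table above.
FIELDS = [
    "CREATE_DATE", "BORROWER_ID", "WORK_ID", "CONTROL_NUMBER",
    "AUTHOR_DISPLAY", "TITLE_DISPLAY", "PUB_DATE",
]


def read_loans(handle):
    """Yield one dict per loan event from an open CSV file handle."""
    for row in csv.reader(handle):
        if len(row) != len(FIELDS):
            continue  # skip malformed or blank lines
        record = dict(zip(FIELDS, row))
        try:
            # CREATE_DATE is documented as dd/mm/yyyy hh:mm
            record["CREATE_DATE"] = datetime.strptime(
                record["CREATE_DATE"], "%d/%m/%Y %H:%M"
            )
        except ValueError:
            continue  # e.g. a header row, if the file has one
        yield record


# Invented sample row; the hashes and ISBN are placeholders, not real data.
SAMPLE = '16/05/2012 10:30,3f2a,9c1d,0000000000,"Example, A.","An example title",2010\n'

records = list(read_loans(io.StringIO(SAMPLE)))
print(records[0]["TITLE_DISPLAY"], records[0]["CREATE_DATE"].year)
```

For a real file, replace the `io.StringIO(SAMPLE)` handle with `open(path, newline="", encoding="utf-8")`; the encoding is an assumption too.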

I’ll blog in detail another time about exactly how we created the data extracts. In short:

  1. There is a table in the SirsiDynix Horizon library management system called circ_tran which records every instance of item number X borrowed by user number Y at time Z. [#1]
  2. There is another table which provides a lookup between item numbers and the numbers of the bibliographic works of which they are a copy. [#2]
  3. Dave Pattern at the University of Huddersfield wrote a Perl script which scrapes all the bibliographic data (title, author, ISBN) for each work from our OPAC (Horizon Information Portal) and writes it to a text file. [#3]
  4. Developer Jamie Mahoney of CERD/LNCD then stepped in, using some pretty heavy SQL on the original 3 data extracts, to:
    • Hash the internal Horizon user and work ID numbers to provide anonymity;
    • Convert the internal Horizon date and time stamps in extract [#1] from a version of Unix time into a readable datestamp (formula hint: cko_date*86400 + cko_time*60);
    • Use the item/work lookup table [#2] to pull in the bibliographic details for each loan in [#1] from the bibliographic table [#3] (an epic SQL JOIN query), removing items which are no longer represented in our library system;
    • Remove any items without an ISBN, which are of no use to the SALT recommender API;
    • Tweak the punctuation and formatting;
    • Split the data into separate files for each year.
    • Split the data into separate files for each year.
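The real work was done in SQL, but the date formula and the hashing step are easy to sketch in Python. This is only an illustration: it assumes cko_date counts days since 1 January 1970 and cko_time counts minutes since midnight (which is what the formula hint implies), and the choice of SHA-256 with a salt is my assumption, not necessarily what Jamie used.

```python
import hashlib
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def horizon_to_datestamp(cko_date, cko_time):
    """Convert Horizon's day and minute counts into dd/mm/yyyy hh:mm.

    Assumes cko_date is days since the Unix epoch and cko_time is
    minutes since midnight, per the formula cko_date*86400 + cko_time*60.
    """
    seconds = cko_date * 86400 + cko_time * 60
    return (EPOCH + timedelta(seconds=seconds)).strftime("%d/%m/%Y %H:%M")


def anonymise(internal_id, salt):
    """Return a one-way hash of an internal ID (hypothetical scheme)."""
    return hashlib.sha256(f"{salt}:{internal_id}".encode("utf-8")).hexdigest()


print(horizon_to_datestamp(15476, 630))  # 16/05/2012 10:30
print(anonymise(12345, salt="keep-this-secret"))
```

Hashing with a secret salt means the same borrower always maps to the same BORROWER_ID, so loans can still be grouped by person, without the original ID being easy to recover by hashing every possible ID number.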

Once again, the data is at:

Thanks are due to Chris Leach and Dave Pattern for Horizon-fu, and to Jamie Mahoney for his patient wrangling of several millions of lines of data!

You can find out more about the Copac Activity Data Project/SALT2, at: http://copac.ac.uk/innovations/activity-data/

Repository feeds on university staff profile webpages: some examples

Posted on March 28th, 2012 by Paul Stainthorp

There is a project going on at the University of Lincoln at the moment to rebuild our directory of academic staff profiles on the web, in line with our new corporate website.

As I mentioned in my presentation to library managers last week, it’s turning out to be a nice example of how new web applications can be spun up quickly at Lincoln using our existing [open and non-open] data sources (in this case, HR staff data, BuddyPress social profile data, Repository feeds, Gravatar images, and our OAuth authentication framework/Common Web Design), plus a bit of developer magic.

Screenshot of the new staff directory

You can search our staff profile directory (still in development) at: http://phone.online.lincoln.ac.uk/

There is a growing tendency for universities in all groupings—certainly for the research-intensive universities—to publish the entirety of an author’s publications to their web profile as embedded content from their repository and/or Current Research Information System (CRIS). Here are a few examples of staff profiles on other UK universities’ sites which incorporate publication lists derived from their repositories or CRISes:

We’re pulling the publication details from the Lincoln Repository for each author into their web profile (example), using a search on their University of Lincoln staff ID (which forms part of their standard HR data profile) – e.g. http://lncn.eu/ep/000157. We can then get at the Repository data in almost any format we want (BibTeX, JSON, XML, RSS, etc.). I’m also keeping a close eye on the development of the EPrints Shelves plugin, which might be an interesting tool for giving authors more flexibility and control over how their Repository publication list(s) are displayed on their web profile.

The data! The data!

Posted on October 3rd, 2011 by Paul Stainthorp

The Library Impact Data Project (LIDP), which ran from February-July this year, and in which the University of Lincoln took part, has now released a subset of the library activity data used in the analysis (which, you’ll remember, showed a statistically significant correlation across a number of universities between library activity data and student attainment).

Lincoln’s data is included in the release, which is available for re-use under an open licence, from:

http://eprints.hud.ac.uk/11543/

This data set is made available under the Open Data Commons Attribution License
http://opendatacommons.org/licenses/by/1.0/

The data contains final grade and library usage figures for 33,074 students studying undergraduate degrees at UK universities. More information on the data, and how it’s been generalised in order to preserve students’ anonymity, is on the LIDP project blog.

  • There’s also a detailed report about the statistical breakdown of Lincoln’s own share of the data (this wasn’t published as part of the project reports, as it was down to each individual institution whether to make it public or not) – I’ve made the report available here [PDF].

The LIDP blog also contains information about the project ‘toolkit’, developed to assist other institutions who may want to test their own data against the LIDP’s hypothesis, here and here.

Thanks again to Graham, Bryony and Dave at the University of Huddersfield for inviting Lincoln to take part in the project, and for their help along the way!

On to the next one…

Library Impact Data Project: good news, everybody!

Posted on June 18th, 2011 by Paul Stainthorp

I think this is worth re-posting from the LIDP blog:

We are very pleased to report that we have now received all of the data from our partner organisations and have processed all but two already!

Early results are looking positive and our next step is to report back with a brief analysis to each institution. We are planning to give them our data and a general set of data so that they can compare and contrast. There have been some issues with the data, some of which has been described in previous blogs, however, we are confident we have enough to prove the hypothesis one way or another!

In our final project meeting in July we hope to make a decision on what form the data will take when released under an Open Data Commons Licence. If all the partners agree, we will release the data individually; otherwise we will release the general set for other to analyse further.

I submitted Lincoln’s data on 13 June. It consists of fully anonymised entries for 4,268 students who graduated from the University of Lincoln with a named award, at all levels of study, at the end of the academic year 2009/10 – along with a selection of their library activity over three* years (2007/08, 2008/09, 2009/10).

The library activity data represents:

  1. The number of library items (book loans etc.) issued to each student in each of the three years; taken from the circ_tran (“circulation transactions”, presumably) table within our SirsiDynix Horizon Library Management System (LMS). We also needed a copy of Horizon’s borrower table to associate each transaction with an identifiable student.
  2. The number of times each student visited our main GCW University Library, using their student ID card to pass through the Library’s access control gates in each of the three* years; taken directly from our ‘Sentry’ access control/turnstile system. These data apply only to the main GCW University Library: there is no access control at the University of Lincoln’s other four campus libraries, so many students have ‘0’ for these data. Thanks are due to my colleague Dave Masterson from the Hull Campus Library, who came in early one day, well before any students arrived, in order to break into the Sentry system and extract this data!
  3. The number of times each student was authenticated against an electronic resource via AthensDA; taken from our Portal server access logs. Although by no means all of our e-resources go via Athens, we’re relying on it as a sort of proxy for e-resource usage more generally. Thanks to Tim Simmonds of the Online Services Team (ICT) for recovering these logs from the UL data archive.

I had also hoped to provide numbers of PC/network logins for the same students for the same three years (as Huddersfield themselves have done), but this proved impossible. We do have network login data from 2007 onwards, but while we can associate logins with PCs in the Library for our current PCs, we can’t say with any confidence whether a login to the network in 2007-2010 occurred within the Library or elsewhere: PCs have just been moved around too much in the last four years.

Student data itself—including the ‘primary key’ of the student account ID—was kindly supplied by our Registry department from the University’s QLS student records management system.

Once we’d gathered all these various datasets together, I prevailed upon Alex Bilbie to collate them into one huge .csv file: this he did by knocking up a quick SQL database on his laptop (he’s that kind of developer), rather than the laborious Excel-heavy approach using nested COUNTIF statements which would have been my solution. (I did have a go at this method, which clearly worked well for at least one of the other LIDP partners, but my PC nearly melted under the strain.)
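For anyone facing the same COUNTIF misery, here is roughly what the database approach looks like, sketched in Python with an in-memory SQLite database. It is not Alex's actual code: the table and column names and the sample IDs are invented, and for brevity it totals events per student rather than splitting them by academic year.

```python
import csv
import io
import sqlite3


def collate(students, loans, visits, logins):
    """Return CSV text with one row of activity counts per student.

    `students` is a list of student IDs; `loans`, `visits` and `logins`
    are lists containing one student ID per recorded event.
    """
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE students (student_id TEXT PRIMARY KEY)")
    db.execute("CREATE TABLE events (student_id TEXT, kind TEXT)")
    db.executemany("INSERT INTO students VALUES (?)", [(s,) for s in students])
    for kind, rows in (("loan", loans), ("visit", visits), ("login", logins)):
        db.executemany(
            "INSERT INTO events VALUES (?, ?)", [(sid, kind) for sid in rows]
        )

    # LEFT JOIN keeps students with no activity; COALESCE turns NULL into 0.
    query = """
        SELECT s.student_id,
               COALESCE(SUM(e.kind = 'loan'), 0),
               COALESCE(SUM(e.kind = 'visit'), 0),
               COALESCE(SUM(e.kind = 'login'), 0)
        FROM students AS s
        LEFT JOIN events AS e ON e.student_id = s.student_id
        GROUP BY s.student_id
        ORDER BY s.student_id
    """
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["student_id", "loans", "visits", "eresource_logins"])
    writer.writerows(db.execute(query))
    db.close()
    return out.getvalue()


# Invented sample data.
result = collate(
    students=["s1", "s2", "s3"],
    loans=["s1", "s1"],
    visits=["s1"],
    logins=["s2"],
)
print(result)
```

To get per-year figures, you would add a year column to the events table and include it in the GROUP BY.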

The final .csv data has gone to Huddersfield for analysis and a copy is lodged in our Repository for safe keeping. Once the agreement has been made to release the LIDP data under an open licence, I’ll make the Repository copy publicly accessible.

*N.B. In the end, there was no visitor data for the year 2007/08: the access control / visitor data for that year was missing for almost all students. This may correspond to a re-issuing of library access cards for all users around that time, or the data may be missing for some other reason.

Anonymised library activity data for the academic years 2007/08, 2008/09 and 2009/10: collected for the JISC Library Impact Data Project

Posted on June 13th, 2011 by Paul Stainthorp

These data consist of entries for 4,268 anonymised students who graduated from the University of Lincoln with a named award at the end of the academic year 2009/10, along with a selection of their library activity over three years (2007/08, 2008/09, 2009/10): library item circulation, visits to the main GCW University Library, and e-resources usage represented by authentication against AthensDA.

View this item on the University Repository: http://eprints.lincoln.ac.uk/4540/

Inclusive practice, digital data, and e-books

Posted on April 7th, 2011 by Paul Stainthorp

I attended Sue Watling’s workshop, ‘Promoting Inclusive Practice with Digital Data’, today. (I know that Sue has delivered the same workshop in the past to groups of Library staff.) There’s also a Blackboard community to accompany the workshop.

My particular interest in usability / accessibility / inclusive design, as Sue knows, is around the accessible nature (or otherwise) of Library-digitised and born-digital library subscription resources: e-books, e-journals, and material scanned and digitised under the CLA’s comprehensive HE licence.

In particular, Sue and I have had a number of conversations about the frustrations we share around digital texts: which ought to be inherently accessible and a great asset, but which in practice are often only available in a form (or via a platform) covered in barriers to accessibility. Also around the lack of importance which the University can seem to place on accessibility, usability and access issues.

A little while ago, Sue and I made a start on an e-book usability/accessibility reference guide. To my shame (because I do think it’s important, it’s something that doesn’t get a lot of attention, and it’s something I’m interested in) …I let it fall by the wayside.

I’ve made a start again! It’s made up of a table containing information about the features of the three Library e-book platforms which are available at the University of Lincoln, plus a guide to using e-books. Both parts are publicly-editable Google documents, so feel free to edit them.

L to the I to the D to the P

Posted on March 17th, 2011 by Paul Stainthorp

Representatives of the eight partner institutions in the JISC Activity Data LIDP (Library Impact Data Project) met in person (and in Huddersfield) for the first time last week.

From the project blog:

“In a packed agenda we discussed the project in detail – we’ll be blogging the minutes soon.

“We also approved the project plan and discussed the hypothesis in some detail – look out for our first blog on that soon too! We are now working on getting the focus group questions out to everyone in the next few days.”

The original project hypothesis bears repeating here: if we can prove that it stands up, it’s obviously of some significance to libraries in UKHE.

There is a statistically significant correlation across a number of universities between library activity data and student attainment.

N.B. that’s the first and ‘official’ version of the hypothesis, taken from the project proposal. The language may be tightened up a little bit in the final project plan – i.e. what do we mean by “student attainment” – what measurement of attainment are we taking? (It’s degree classification, btw.)

Project Partners:

  1. University of Huddersfield
  2. University of Bradford
  3. De Montfort University
  4. University of Exeter
  5. University of Lincoln
  6. Liverpool John Moores University
  7. University of Salford
  8. Teesside University

Links:

Library catalogue: Site Search analytics

Posted on March 17th, 2011 by Paul Stainthorp

A while ago (and, as with all things Horizon, with the help of Dave Pattern at Huddersfield), we enabled Google Analytics on our library OPAC (sometimes referred to as HiP, the “Horizon Information Portal”). This takes the form of a piece of Google JavaScript which lives in a ‘footer’ document common to all HiP pages.

Chris Leach gave a presentation about using Google Analytics with HiP at the last SirsiDynix Horizon User Group.

Now Nick Jackson has shown me how to enable Google’s Site Search features on our Analytics profile for the library catalogue. Site Search will allow us to ‘tease out’ the search activity within the library catalogue itself, by analysing the URL structure of HiP queries, recognising and extracting the search terms, then tracking the paths users take from those search queries to destination pages (i.e., individual bibliographic record pages on HiP).

For instance: a typical HiP search query ends up looking something like this:

http://www.library.lincoln.ac.uk/ipac20/ipac.jsp?session=1G00362045UH4.101795&menu=search&aspect=subtab13&npp=10&ipp=20&spp=20&profile=ln&ri=&index=.GW&term=journalism&x=0&y=0&aspect=subtab13

By telling Google Site Search to look in the query parameter “term” for the search keyword(s)—in this case journalism—and to ignore the “session” parameter, Google Analytics can start to group similar queries together and provide us with data about what our users are searching the catalogue for.
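The same extraction is easy to try for yourself on a list of HiP URLs. This Python sketch only mimics the effect of the Site Search settings described above (read “term”, ignore “session”); it says nothing about how Google does it internally, and the function name is my own.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit


def search_term(url, term_param="term"):
    """Return the search keyword(s) from a HiP query URL, or None."""
    params = parse_qs(urlsplit(url).query)
    values = params.get(term_param)
    return values[0] if values else None


url = (
    "http://www.library.lincoln.ac.uk/ipac20/ipac.jsp"
    "?session=1G00362045UH4.101795&menu=search&aspect=subtab13"
    "&npp=10&ipp=20&spp=20&profile=ln&ri=&index=.GW"
    "&term=journalism&x=0&y=0&aspect=subtab13"
)

print(search_term(url))  # journalism

# Tally terms across many URLs; the session value never enters the count,
# so identical searches from different sessions are grouped together.
tally = Counter(t for t in map(search_term, [url, url]) if t)
print(tally.most_common(1))  # [('journalism', 2)]
```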

Screenshot of the setup page for Google Site Search

It’s been running for less than 24 hours, but already we’re starting to build up a record of the keywords people are typing into the catalogue:

Screenshot of Google Site Search top search terms on HiP

What could we [and what should we] do with this data? Are there any Google Site Search experts out there who could give me a few tips? If anyone from within the Library at the University of Lincoln would be interested in helping to analyse the search term data, please let me or Chris know.

One thing we’ve already discussed is the idea of using the HiP search term activity as test data to ‘teach’ the Jerome machine intelligence engine about the kind of things Lincoln library users are interested in… this will help us in determining how the Jerome API’s personalisation features might be used to present and relevance-rank results.