Posts Tagged ‘Cambridge University’

Slides on the CLOCK project for #Mashcat (Cambridge mashed library cataloguing event)

Posted on July 5th, 2012 by Paul Stainthorp

A whole contingent from Lincoln (Andrew Beeken, Trevor Jones, Elif Varol and I) are at the Cambridge University Clinical School at Addenbrooke’s Hospital in Cambridge, for a mashed library event – Mashcat.

Mashcat is “a mashed library event focussing on cataloguing data. For cataloguers, developers and anyone else with an interest in how library catalogue data can be created, manipulated, used and re-used by computers and software”. It’s being sponsored by DevCSI.

We’re presenting about the CLOCK project to a room full of cataloguers. No pressure. The slides are online at: http://lncn.eu/hknp

1.8 million library loans from the University of Lincoln under CC0 – Copac Activity Data/SALT2 project

Posted on May 16th, 2012 by Paul Stainthorp

Today we published data on approximately 1.8 million items loaned from the University of Lincoln’s libraries since 2001. The data is available to re-use under a CC0 licence, and can be downloaded from:

We’ve done this as part of our involvement in the Copac Activity Data Project, a.k.a. SALT2. Along with data from the universities of Manchester, Sussex, Cambridge and Huddersfield, our circulation data will be used to power a ‘recommender API’, which libraries will be able to use to build “People who borrowed X also borrowed Y”-type services. The API will benefit from the power of aggregated data from multiple institutions of different types, containing tens of millions of circulation events.
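To give a flavour of what a “People who borrowed X also borrowed Y” service does with circulation data, here is a minimal Python sketch of simple co-occurrence counting. This is only an illustration: it is not the SALT recommender's actual method, and the function name `also_borrowed` and the sample loans are made up.

```python
from collections import Counter, defaultdict
from itertools import combinations


def also_borrowed(loans):
    """Count how often two works were borrowed by the same borrower.

    `loans` is an iterable of (borrower_id, work_id) pairs.
    Returns a mapping: work_id -> Counter of other work_ids.
    """
    # Collect the distinct works each borrower has taken out.
    works_by_borrower = defaultdict(set)
    for borrower_id, work_id in loans:
        works_by_borrower[borrower_id].add(work_id)

    # Every pair of works sharing a borrower counts once per borrower.
    co_counts = defaultdict(Counter)
    for works in works_by_borrower.values():
        for a, b in combinations(sorted(works), 2):
            co_counts[a][b] += 1
            co_counts[b][a] += 1
    return co_counts


# Hypothetical sample data: two borrowers, three works.
sample = [("u1", "A"), ("u1", "B"), ("u2", "A"), ("u2", "B"), ("u2", "C")]
suggestions = also_borrowed(sample)["A"].most_common()
```

With aggregated data from several institutions the counts get bigger, which is exactly why pooling circulation events is attractive.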

You’ll notice as well that we’ve chosen to host the data on our brand-new Orbital (v0.1) research data management application. Each dataset has a persistent citable URI. We’ll be keeping the data up-to-date, and generating a new activity data file from our library circulation logs shortly after the end of each academic year.

The data consists of a number of CSV files (one for each academic year since 2000-01, plus a huge file of all the data), containing the following fields:

Field index | Field name | Description
0 | CREATE_DATE | The date and time of the loan event, in the format: dd/mm/yyyy hh:mm
1 | BORROWER_ID | A cryptographic hash of the internal system ID associated with the borrower of the item, as used in the University of Lincoln’s library system.
2 | WORK_ID | A cryptographic hash of the internal system ID associated with the bibliographic work borrowed, as used in the University of Lincoln’s library system.
3 | CONTROL_NUMBER | The ISBN of the work borrowed (10 or 13 digits).
4 | AUTHOR_DISPLAY | The main author of the work borrowed.
5 | TITLE_DISPLAY | The title of the work.
6 | PUB_DATE | The publication year of the work in the form: yyyy
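If you want to load one of the files, something like the following Python sketch should be close. It assumes the files are comma-separated with no header row and with the fields in the order listed above; those details, the `read_loans` name and the sample row are assumptions, so check them against the real files first.

```python
import csv
import io
from datetime import datetime

# Field order as listed in the table above.
FIELDS = [
    "CREATE_DATE", "BORROWER_ID", "WORK_ID", "CONTROL_NUMBER",
    "AUTHOR_DISPLAY", "TITLE_DISPLAY", "PUB_DATE",
]


def read_loans(handle):
    """Yield one dict per loan row, with CREATE_DATE parsed to a datetime."""
    for row in csv.reader(handle):
        if len(row) != len(FIELDS):
            continue  # skip rows that do not have exactly seven fields
        record = dict(zip(FIELDS, row))
        # Documented format: dd/mm/yyyy hh:mm
        record["CREATE_DATE"] = datetime.strptime(
            record["CREATE_DATE"], "%d/%m/%Y %H:%M"
        )
        yield record


# Made-up sample row, for illustration only.
sample = io.StringIO(
    '16/05/2012 10:15,abc123,def456,9780000000002,"Doe, Jane",An example title,2010\n'
)
loans = list(read_loans(sample))
```

For a real file, replace the `io.StringIO` sample with `open(path, newline="")`.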

I’ll blog in detail another time about exactly how we created the data extracts. In short:

  1. There is a table in the SirsiDynix Horizon library management system called circ_tran which records every instance of item number X borrowed by user number Y at time Z. [#1]
  2. There is another table which provides a lookup between item numbers and the numbers of the bibliographic works of which they are a copy. [#2]
  3. Dave Pattern at the University of Huddersfield wrote a Perl script which scrapes all the bibliographic data (title, author, ISBN) for each work from our OPAC (Horizon Information Portal) and writes it to a text file. [#3]
  4. Developer Jamie Mahoney of CERD/LNCD then stepped in, using some pretty heavy SQL on the original 3 data extracts, to:
    • Hash the internal Horizon user and work ID numbers to provide anonymity;
    • Convert the internal Horizon date and time stamps in extract [#1] from a version of Unix time into a readable datestamp (formula hint: cko_date*86400 + cko_time*60);
    • Use the item/work lookup table [#2] to pull in the bibliographic details for each loan in [#1] from the bibliographic table [#3] (an epic SQL JOIN query), removing items which are no longer represented in our library system;
    • Remove any items without an ISBN, which are of no use to the SALT recommender API;
    • Tweak the punctuation and formatting;
    • Split the data into separate files for each year.
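The steps above were done in SQL, which isn't reproduced here. As a rough illustration of the same logic, here is a Python sketch. Several things in it are assumptions rather than what Jamie actually did: the choice of salted SHA-256 as the hash, treating the timestamps as UTC, the function names, and the shape of the three input extracts.

```python
import hashlib
from datetime import datetime, timezone


def format_create_date(cko_date, cko_time):
    """Apply the formula hint: days*86400 + minutes*60 gives Unix seconds."""
    seconds = cko_date * 86400 + cko_time * 60
    # Assumption: interpret the result as UTC.
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return stamp.strftime("%d/%m/%Y %H:%M")


def hash_id(internal_id, salt):
    """Anonymise an internal ID (assumed scheme: salted SHA-256)."""
    return hashlib.sha256(f"{salt}{internal_id}".encode("utf-8")).hexdigest()


def build_extract(circ_rows, item_to_work, bib_by_work, salt):
    """Join loans [#1] to works [#2] and bib data [#3], dropping unusable rows.

    circ_rows: iterable of (borrower_id, item_id, cko_date, cko_time)
    item_to_work: dict mapping item number -> work number
    bib_by_work: dict mapping work number -> dict with isbn/author/title/pub_date
    """
    for borrower_id, item_id, cko_date, cko_time in circ_rows:
        work_id = item_to_work.get(item_id)
        bib = bib_by_work.get(work_id)
        # Skip items no longer in the system, and works without an ISBN.
        if bib is None or not bib.get("isbn"):
            continue
        yield {
            "CREATE_DATE": format_create_date(cko_date, cko_time),
            "BORROWER_ID": hash_id(borrower_id, salt),
            "WORK_ID": hash_id(work_id, salt),
            "CONTROL_NUMBER": bib["isbn"],
            "AUTHOR_DISPLAY": bib.get("author", ""),
            "TITLE_DISPLAY": bib.get("title", ""),
            "PUB_DATE": bib.get("pub_date", ""),
        }
```

Splitting by academic year is then just a matter of grouping the output rows on CREATE_DATE.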

Once again, the data is at:

Thanks are due to Chris Leach and Dave Pattern for Horizon-fu, and to Jamie Mahoney for his patient wrangling of several millions of lines of data!

You can find out more about the Copac Activity Data Project/SALT2, at: http://copac.ac.uk/innovations/activity-data/

CLOCK notes – 8 May 2012

Posted on May 8th, 2012 by Paul Stainthorp

This is what the CLOCK project team are currently up to (from meetings over the past couple of weeks and from notes made at the recent ‘Discovery: making sure your resources are discovered, used and reused’ event in Birmingham):

  • Andrew Beeken has been exploring the Cambridge COMET data via its SPARQL endpoints and has already blogged about the process of using SPARQL to “build kind of a ‘Hello World’ of open data querying”. He’s now looking at the recently-released Harvard open bib data and comparing the speed, the use of matching namespaces, and the use of JSON vs RDF/XML.
  • This work is leading up to unified search and presentation of records from several sources (Cambridge/COMET, Harvard, Lincoln/Jerome, OpenLibrary, etc.). Andrew and Trevor Jones are collaborating on drawing up a high-level architecture for CLOCK, and a strategy for expressing Linked Data, which will be shared with the rest of the project team (and publicly) for discussion.
  • To support this, Alex Bilbie in ICT services at Lincoln is helping to get the original Jerome application up and running on the CLOCK server (jerome.library.lincoln.ac.uk), where it can be used as a stable platform for developing and RDF-ifying Lincoln’s own bib data.
  • Trevor Jones and Ed Chamberlain will work together on the user-engagement strand of the project (in parallel, at the University of Lincoln and the University of Cambridge) to clarify users’ requirements for bibliographic data:
    • For cataloguers, based around a rethink of copy cataloguing workflows, we will try to tease out requirements from talking to cataloguers (and associated subject librarians) asking to be ‘positively disrupted’: what do they need to do? What is missing from their data?
    • For researchers, we will build on some initial user walkthrough analysis done by Trevor and Andrew in Lincoln, with performing arts students in LPAC (the Lincoln Performing Arts Centre). What are the research questions that users are trying to answer? How does bib data help them answer those questions? What’s missing? Ed and Trevor will agree on a set of questions and tasks;
    • These requirements will be used to feed the remaining cycles of platform development for CLOCK.
  • Ed Chamberlain will act as the conduit between CLOCK and related projects in the Discovery strand, looking for points of shared interest/technology, and blogging (or asking others to blog) about aspects of one project which can inform the others. The other projects in which Ed is involved are: the Open Education Metadata UK (OEM-UK) project at the Institute of Education (shared interest in new user interfaces for cataloguing – possibly use screencasts to demonstrate alternative workflows?) and the Open Bibliography 2 project (lots of potential technical overlap – BibJSON, JSON-LD, BibSoup.net, expression in RDF container formats).
  • Ed and I (Paul Stainthorp) will work on developing the ‘business case’ / sustainability of CLOCK and data.*.ac.uk, following up on themes discussed in the recent Discovery event, and thinking not only about institutional funding / high-level support for open bib data, but also what it takes to move open bib data publishing from a development environment into an institutionally-supported, ICT-run service.
  • Finally, PS is arranging a couple of internal CLOCK ‘hack days’ (to take place on 17th-18th May, in Cambridge) – more details to follow.

USTLG meeting on research data management

Posted on November 29th, 2011 by Paul Stainthorp

Yesterday I was at Clare College, University of Cambridge for a meeting organised by USTLG, the University Science & Technology Librarians Group. The group (open to any librarians involved with engineering, science or technology in UK universities) has meetings once or twice a year. The theme of yesterday’s meeting (free to attend, thanks to sponsorship from the IEEE) was data management, with an implied focus on research data.

The meeting consisted of a series of presentations (plus a fantastic lunchtime diversion, below) with plenty of time for networking – there were about 40 people there, all with an interest in research data management – though interestingly, a show of hands suggested very few people were actively engaged in looking after their own institution’s researchers’ data.

As usual, this blog post has been partially reconstructed from the Twitter stream (hashtag #ustlg).

First up, Laura Molloy, substituting for Joy Davidson of the Digital Curation Centre (DCC), on a project called the Data Management Skills Support Initiative (DaMSSI), looking at the [shades of information literacy] skills needed by different people involved in the research data curation process. “DaMSSI aims to facilitate the use of tools like Vitae’s Researcher Development Framework (RDF) and the Seven Pillars of Information Literacy model” developed by SCONUL. Key question: how do you assess the effectiveness of research data management training?

Useful links:

Second, Yvonne Nobis of Cambridge’s Central Science Library talked about supporting researchers at Cambridge: data sharing and the role of librarians; including her project (funded through CUL’s Arcadia library staff research scheme) looking at the issues involved in curating not research data per se, but the software code and techniques used to analyse that source data. Key points: [1] there are disincentives (time, and lack of recognition within one’s own field) to researchers’ spending time on code/software for research data manipulation. [2] But without that investment in code, the transparency–openness–replicability of computational-data science is at risk. [3] “Librarians are missing a trick” by not engaging in research data software curation issues. Yvonne also talked about the work of the eScience Centre.

Links and articles…

Before lunch we also got a chance to inspect the USTLG’s brand new website (and smashing new logo), at ustlg.org

Then the highlight of the day… we were invited in groups to go over to the adjacent University Library, where we were treated to a display and commentary on some of Cambridge University’s rare science manuscripts and early printed books. All laid out in a reading room were Isaac Newton’s notebooks containing his notes on the method of fluxions (i.e. early calculus), Darwin’s field notes from the Beagle, Ernest Rutherford’s lab diaries (still slightly radioactive! – “…not ever so, but Health & Safety made us do a risk-assessment…”), plus Prof. Stephen Hawking’s typed and ring-bound first draft of A brief history of time, along with several early printed herbals and a book containing the first known technical drawings (of machines of warfare). Inspiring stuff, and really quite brilliant of them to lay it out for us to see!

In the afternoon—not directly connected with research data, but certainly of interest to the engineers involved in the Orbital project—we heard from Rachel Berrington of the IEEE, about the work of the organisation and some of the planned developments to the IEEE Xplore platform: new journal titles in 2012, a mobile platform, the inclusion of CrossRef data, and new interactive HTML content.

Handful of interesting links:

Finally, a useful presentation from Anna Collins, Research Data and Digital Curation Officer (good job title) for Cambridge’s DSpace repository. Anna spoke about the Incremental project, a joint exercise between Cambridge and the University of Glasgow, aimed at providing a best practice approach to supporting data management techniques amongst research communities. This is really good practical nuts & bolts stuff (e.g. when’s the right time to broach the subject of data curation with a PhD student? Too early, and they won’t care – too late, and the best you can do is help pick up the pieces!). I’ll be recommending my colleagues at Lincoln take a look at the materials on both institutions’ websites. Top quote: “be the boss of your hard drive”!

Links from Anna’s presentation:

(An aside: after the USTLG meeting had ended, I was lucky enough to get a quick tour of [about 1% of] the Cambridge University Library, along with a cup of tea in the staff room(!), thanks to a “badly-encoded” colleague. I won’t blog about it in any detail now—hopefully I should be back in Cambridge in January for another Orbital-related event—but it’s just a jaw-dropping library.)

The new USTLG website is at ustlg.org, and you can follow them on Twitter at @USTLG.

It’s the end of Jerome as we know it (but I feel fine)

Posted on November 28th, 2011 by Paul Stainthorp

The University of Lincoln’s Jerome project finished in August with the successful release of more than 240,000 openly-licensed bibliographic records, available over developer APIs, and a joint hack day with Cambridge University Library’s COMET project.

Now, encouraged by positive JISC feedback, Cambridge and Lincoln have jointly applied for follow-up project funding under the project title CLOCK. If our bid is successful, the new project will run from December 2011 to July 2012, employing a web developer based at the University of Lincoln, and distilling the work of both institutions into the development of innovative new library metadata discovery services for the scholarly community.

You can read the project proposal for CLOCK at http://lncn.eu/ijt4 – the introductory section is below.

The University of Lincoln and Cambridge University Library both delivered successful projects (Jerome and COMET) for the JISC Infrastructure for Resource Discovery Programme in 2011. This is a proposal for the continuation of and elaboration upon the work of both projects, via a programme of development work shared between the two institutions.

Throughout both projects (COMET-Jerome), parallel approaches in technology and data structure were noted and commented upon. A ‘mash day’ workshop event held in Cambridge in August aimed to explore these differences as well as areas of potential synergy. Here project members identified several points of interest to take forward.

Both projects produced outputs of interest to researchers, students, librarians, developers, and designers of bibliographic discovery environments. The CLOCK project will harness the success of these two complementary initiatives and investigate new approaches to data creation and discovery in the library domain. In particular, it will investigate, propose, and develop new, web-based bibliographic tools/APIs which will make it easier for developers, academic libraries and library end-users (esp. researchers) to find Open Bibliographic Data and incorporate that data into systems and workflows.

This project is an opportunity to [1] exploit through real-world applications the significant amount of data released openly by Cambridge University Library; [2] apply the Jerome database architecture, iterative development methodology, and API framework to a bibliographic dataset an order of magnitude greater than the University of Lincoln’s; and [3] build and enable a new set of tools and demonstrator services which will enable the future development of public Open Bib Data web applications of practical utility to libraries and end-users.

The project will be supported by library consultant Owen Stephens, who will help to put the work into a national context, relating CLOCK to the wider movement toward Open Bib Data and the work of the JISC Discovery initiative. It will take place in an environment (Lincoln/Cambridge) where a culture of developer inquiry and experimentation is encouraged and nurtured. It is also endorsed by senior library management at both universities.

Both universities are involved in complementary development work which will both inform and be informed by CLOCK: at Cambridge, Ed Chamberlain is guiding the development of the JISC Open Bibliography 2 project; in Lincoln, Paul Stainthorp is lead researcher on the #jiscmrd Orbital project, which is investigating the management of research data, with some areas of overlap.

CLOCK will operate as part of the wider JISC Digital Infrastructure: Information and library infrastructure: Resource discovery, and support the recent concerted effort to move toward openly licensed library discovery in UK Higher Education and beyond.

Jerome/COMET hack day: Fun in the Fens

Posted on August 10th, 2011 by Paul Stainthorp

Here’s a photo of the CARET (Centre for Applied Research in Educational Technologies) offices at the University of Cambridge, where we held our long-awaited joint Jerome/COMET hack day, on Monday 8 August. Actually, in the end, it turned out to be a kind of Jerome/COMET/SALDA/synthesis/OUseful mashup-AH!


In attendance (for the record):

Train mayhem aside (in the end the Lincoln contingent didn’t arrive until nearly midday), it was a really useful day and well worth doing. Particular thanks to Ed Chamberlain and his colleagues for hosting the event and for arranging the food and refreshments. Thanks also to everyone who travelled from afar for no other reason than they love a good mashup.

Typically, the ever-prolific Tony Hirst has already managed to write up not one, but two blog posts about ideas that came out of the day:

  • Getting Library Catalogue Searches Out There…
  • Open Data Processes: the Open Metadata Laundry (N.B. this one relates specifically to Jerome – in particular, our notion of ‘scrubbing’ dodgy MARC records by taking only the identifiers plus the bare citation-only fields, and using that minimal set to grab additional free and Open data from the web, automatically creating new full versions of records that are inherently Open. ‘Metadata laundry’, me like.)
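As a very rough illustration of the ‘scrubbing’ half of the metadata laundry idea, here is a Python sketch that keeps only the identifiers and bare citation fields of a record. The field names and the `scrub` function are hypothetical: real MARC records are not flat dictionaries, and this is not Jerome's actual code.

```python
# Hypothetical field names standing in for identifiers and
# bare citation-only fields.
KEEP_FIELDS = ("isbn", "issn", "title", "author", "publisher", "pub_date")


def scrub(record):
    """Return a minimal copy of `record` with only non-empty KEEP_FIELDS."""
    return {
        field: record[field]
        for field in KEEP_FIELDS
        if record.get(field)
    }


# Made-up record: the summary is dropped, as is the empty author.
dodgy = {
    "isbn": "9780000000002",
    "title": "An example title",
    "author": "",
    "summary": "Text of uncertain provenance",
}
minimal = scrub(dodgy)
```

The minimal record is then the starting point for pulling in additional free and Open data from the web.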

Here are three more ideas/conversations we had in Cambridge that I thought were going somewhere interesting. Yeah, we might get around to actually doing these, sometime…

1. Using COMET data to enhance Jerome

The idea is similar to the ‘metadata laundry’, above, and to the way Jerome already uses data from the Open Library, JournalTOCs, LibraryThing, etc., to enhance its book records with additional metadata. Jerome constructs a URL in the form http://data.lib.cam.ac.uk/isbn/_______, with the ISBN from the Jerome record dropped in at the end. COMET responds with a link to an open record in RDF and/or JSON, which Jerome gladly sucks in, adding any additional fields to its original source record. Enrichment ensues.
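In Python, the two halves of that idea (build the lookup URL, then merge in whatever comes back) might be sketched like this. The URL pattern is the one given above; everything else is an assumption for illustration, including the function names, the ISBN clean-up, and the rule that existing Jerome fields are never overwritten. The sketch deliberately makes no network request, since the shape of COMET's response isn't described here.

```python
COMET_ISBN_BASE = "http://data.lib.cam.ac.uk/isbn/"


def comet_isbn_url(isbn):
    """Build the COMET lookup URL for an ISBN taken from a Jerome record."""
    # Strip hyphens and spaces; keep digits and a possible ISBN-10 'X'.
    cleaned = "".join(ch for ch in isbn if ch.isdigit() or ch in "Xx").upper()
    if len(cleaned) not in (10, 13):
        raise ValueError(f"not a 10- or 13-character ISBN: {isbn!r}")
    return COMET_ISBN_BASE + cleaned


def enrich(source_record, extra_fields):
    """Add fields the source record lacks; never overwrite existing values."""
    merged = dict(source_record)
    for field, value in extra_fields.items():
        if merged.get(field) in (None, ""):
            merged[field] = value
    return merged


# Made-up example values.
url = comet_isbn_url("978-0-00-000000-2")
record = enrich(
    {"title": "An example title", "subject": ""},
    {"title": "A different title", "subject": "Libraries", "language": "en"},
)
```

Keeping the original values and only filling gaps is one possible merge policy; a real implementation might want to record where each added field came from.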

2. Using Jerome search to ‘skin’ COMET

I called this one “Jerome Scholar” ;-) …we make use of the search aspects of Jerome (in particular, the speed of Sphinx, the ‘mixing desk’ idea, and the neat record presentation) to provide a really smooth way of interacting with the much better-structured (hence “Scholar”) data that resides in COMET.

3. Using the differences between the two datasets to tell us something interesting

I have a notion that there’s something inherently useful about being able to compare two versions of a record for the ‘same’ object. If we could use Jerome+COMET to generate a web application/data feed (one that other discovery services could themselves consume), we’d have ways of ‘sparking off’ whole new avenues of discovery: from misspelled names, variant titles, different subject terms assigned by different cataloguing practices, etc. Like xISBN, but for non-standardised data(?). All right, that’s the fuzziest of the three ideas. And as the eminently sensible Owen Stephens kept asking me, “…what’s the use case?”.
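For what it's worth, the core comparison is easy to sketch in Python. This assumes both records have already been flattened into dictionaries with matching field names, which is the hard part in practice; the function name and the sample records are invented.

```python
def record_differences(record_a, record_b):
    """Return {field: (value_in_a, value_in_b)} for every field that differs.

    A field missing from one record shows up with None on that side.
    """
    differences = {}
    for field in sorted(set(record_a) | set(record_b)):
        value_a = record_a.get(field)
        value_b = record_b.get(field)
        if value_a != value_b:
            differences[field] = (value_a, value_b)
    return differences


# Two made-up versions of the 'same' record.
jerome_version = {"title": "An example title", "author": "Doe, J."}
comet_version = {
    "title": "An example title",
    "author": "Doe, Jane",
    "subject": "Libraries",
}
variants = record_differences(jerome_version, comet_version)
```

Each entry in the result is a candidate variant name, title or subject term that a discovery service could offer as an alternative route in.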

And then we went to the pub.


An elastic bucket down the data well (#rdtf in Manchester)

Posted on April 20th, 2011 by Paul Stainthorp

I was in Manchester on Monday for Opening Data – Opening Doors, a one-day “advocacy workshop” hosted by JISC and RLUK under their Resource Discovery Taskforce (#rdtf) programme. I delivered a five-minute ‘personal pitch’ about Jerome, open data, and the rapid-development ethos that’s developing at Lincoln.

Ken Chad is writing up a report from the day and Helen Harrop is producing a blog, both of which will be signposted from the website: http://rdtf.mimas.ac.uk/

The big data question

All the presentations can be viewed on slideshare, but there were some particular moments that I think are worth picking out:

The JISC deputy, Prof. David Baker, was first up. His presentation, ‘A Vision for Resource Discovery’, should be compulsory reading for university librarians. See, in particular, slides #6 (guiding principles of the RDTF), #8 (a future state of the art by 2012), and #11 (key themes).


Following this introduction, there were three ‘perspectives’, short presentations “reflecting on the real world motivations and efforts involved in opening up bibliographic, archival and museums data to the wider world”: from the National Maritime Museum, the National Archives

…and from Ed Chamberlain of (Jerome’s ‘sister project’) COMET (Cambridge Open METadata), the perspective from Cambridge University Library on opening up access to their non-inconsiderable bibliographic data. N.B. slides #4 (what does COMET entail?), #9 (licensing) and, more than anything else, slide #16 (“beyond bibliography”).


The first breakout/discussion session which I sat in on looked at technical and licensing constraints to opening up access to [bib] data. This was the point at which the tortured business metaphors started to pile up. ‘Buckets’ of data. ‘Elastic’ buckets that can expand to include any kind of data. And (my personal contribution, continuing the wet theme): data often exist at the bottom of a ‘well’. Just because a well is open at the top, it doesn’t necessarily make it easy to get the water out! You need another kind of bucket – a service bucket that makes it possible to extract and make use of the water. Sorry, data. What were we talking about again?

Then a series of 5-minute ‘personal pitches’, including mine just after lunch. I didn’t use slides, but I’m typing up my handwritten notes on Google Docs and I’ll post them as a separate blog post when I get a chance.

David Kay (SERO), Paul Miller (Cloud of Data) and Owen Stephens delivered the meat of the afternoon session in their presentation, ‘The Open Bibliographic Data Guide – Preparing to eat the elephant’. The website containing the Open Bib Data Guide (which has not been formally launched until now) can be found at: http://obd.jisc.ac.uk/

The site itself is going to be invaluable in hand-holding and guiding institutions through the possibilities in opening up access to their own bibliographic data (OBD). Slides from the presentation that are particularly worth noting are #8 (which shows the colour-coding used to distinguish the different OBD use-cases) and #14 (examples of existing OBD).


Paul Walk’s presentation, ‘Technical standards & the RDTF Vision: some considerations’, is the source of the slide which I photographed (at the top of this blog post). Paul talked about ‘safe bets’: aspects of the Web that we can rely on playing a part in allowing us to create a distributed environment for resource discovery, including “ROASOADOA” (Resource- / Service- / Data-Oriented Architecture), persistent identifiers, and a RESTful approach. See also this blog post.

In the second breakout/discussion session, we discussed technical approaches. One of the themes which we kept coming back to was that of two approaches (encapsulated by Paul’s slide) which—while not mutually exclusive—may require different business cases or different explanations in order to be taken up by institutions. We characterised the two approaches as:

  • Raw open data vs Data services
  • Triple store vs RESTful APIs
  • Jerome vs COMET (bit of a caricature, this one, but not entirely unjustified!)

I was gratified that Lincoln’s approach to rapid development and provision of open services was also referred to in non-ungratifying terms, as a model which could be valuable for the HE sector as a whole.

Finally, we heard what’s next for the #rdtf programme. It’s going to be rebranded as ‘Discovery’ and formally re-launched under the new name at another event: ‘Discovery – building a UK metadata ecology’ on Thursday, 26 May 2011, in London. See you there?
