Posts Tagged ‘data’
Posted on May 7th, 2013 by Paul Stainthorp
With the Orbital project at its end, and plans for a University research information / research data service afoot, I’m reviewing the excellent work carried out by our (now-departed) developers Harry Newton and Nick Jackson – work which linked up CKAN, the Orbital ‘bridge’ application, and the Lincoln Repository (EPrints) using SWORD – described in earlier blog posts here and here.
“One important piece of work that we’re undertaking at the moment in Orbital is the facility to deposit the existence of a dataset, from CKAN and the University’s new Awards Management System (AMS), into our (EPrints) Repository via SWORD – at the same time requesting a DOI for the dataset via the DataCite API. The software at the centre of this operation is what we refer to as Orbital Bridge.”
This deposit workflow is now broadly working as it should – I think only a few tweaks would be necessary now to turn this into a working tool for the University of Lincoln.
Most urgent is the need for the University to sign up with the DataCite DOI service, which would secure a DOI for each dataset record deposited from CKAN and hence formally published by the University. This subscription should form part of the new research information service.
The underlying code could also be reused for SWORD-enabled deposit from other sources of metadata (e.g. the Library’s discovery system, Find it at Lincoln) to the Lincoln Repository as the University’s bibliographic ‘system of record’.
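To make the SWORD side of this a little more concrete, here is a minimal sketch of the kind of metadata-only Atom entry a SWORD v2 client might build for a dataset record. It is not the Orbital Bridge code: the function name, the choice of Dublin Core terms and the example values are all assumptions for illustration. In a real deposit the entry would be POSTed to the repository's SWORD collection URL; that step is left out here.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
DCTERMS_NS = "http://purl.org/dc/terms/"

# Use readable prefixes when the entry is serialised.
ET.register_namespace("atom", ATOM_NS)
ET.register_namespace("dcterms", DCTERMS_NS)


def build_sword_entry(title, description, creator, dataset_url):
    """Build an Atom entry carrying basic metadata for one dataset.

    Hypothetical helper: the mapping of fields to Dublin Core terms
    is an assumption, not taken from Orbital Bridge.
    """
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    ET.SubElement(entry, f"{{{DCTERMS_NS}}}abstract").text = description
    ET.SubElement(entry, f"{{{DCTERMS_NS}}}creator").text = creator
    ET.SubElement(entry, f"{{{DCTERMS_NS}}}source").text = dataset_url
    return ET.tostring(entry, encoding="unicode")


# Placeholder values only; none of these refer to a real dataset.
entry_xml = build_sword_entry(
    title="Example dataset",
    description="A short description of the dataset.",
    creator="A. Researcher",
    dataset_url="https://ckan.example.org/dataset/example-dataset",
)
print(entry_xml)
```

Keeping the entry-building step separate from the HTTP request is what would make the same code reusable for other metadata sources: only the function's inputs change.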
Warning: this is an extremely screenshot-heavy blog post! Click on any one of the screenshots below to view a larger image.
Here’s a step-by-step walkthrough of the entire process of adding a dataset to CKAN, and depositing it as a record in the Lincoln Repository.
- Go to the Researcher Dashboard at: https://orbital.lincoln.ac.uk/ and click on “Sign In”.

- Enter your staff account ID and password to sign in to the Researcher Dashboard.

- Once you have been signed in and returned to the Researcher Dashboard, click on your name (in the top right-hand corner) and then click on “My Projects”.

- You will see an overview of your research projects – both funded projects (derived from the AMS), and unfunded projects you have added locally. Click on the name of the project you want to add data to.

- You will be taken to a page for that research project. On the right-hand side of this page, under the heading “Options”, click on “Create Research Data Environment”.


- You will be taken to the University’s CKAN research data platform, where a page/group will have been created which corresponds to your project in the Researcher Dashboard. Sign in to CKAN using your staff account ID and password (there is currently no single sign-on between the Researcher Dashboard and CKAN). You should then be returned to the same page; however, you will probably be sent to the CKAN home page instead, in which case you will have to look for your project again under the “Groups” menu.

- Toward the top of the project screen in CKAN, click on “Add Dataset” > “New Dataset…”.

- Fill in the form with information about the overall dataset, including the following fields:
  - Title
  - URL
  - License (N.B. US spelling!)
  - Description

- Then click on “Add Dataset”

- If you now click on the “Further information” tab on the left-hand menu, you can add the following additional information about the dataset (this is not obvious from the initial dataset form):
  - Author
  - Author email
  - Maintainer
  - Maintainer email
  - Version
  - Summary [of changes]

- To attach individual data document(s)—which CKAN refers to as “resources”—to the dataset, scroll down the page and click on “Upload a file” (there are other options) > “Choose file” > “Upload”.

- Then fill in the form with the following basic information about the “resource”:
  - Name
  - Description
  - Format
  - Resource Type
  - Datastore enabled (ticked by default)
  - Mimetype
  - Mimetype (Inner)
  - “Extra Fields” (user-defined, or used by Orbital)

- To deposit a record for this dataset in the Lincoln Repository, go back to the Orbital Researcher Dashboard at: https://orbital.lincoln.ac.uk/ and navigate to your project. Toward the bottom left of the page you should now see a table containing the dataset(s) you have created in CKAN for this project. Choose which dataset you want to deposit, and hit the “Publish to Lincoln Repository” button.

- The Researcher Dashboard will then display a deposit form containing the following fields (some of which should be autopopulated from CKAN fields, but do not appear to be):
  - Title
  - Description
  - Type of Data
  - Keywords
  - Subjects
  - Divisions
  - Metadata visibility [Show|Hide]
  - People
“Publishing will publicly announce the existence of your dataset on the Lincoln Repository, as well as start the process of long-term preservation of your data.
“Usually you should only publish a dataset either at the end of a research project, or if the data is being cited in a paper. Publishing a dataset will place some restrictions on the changes you can make to the dataset in the future, such as removing your ability to delete the data. It will also generate a DOI, which allows your dataset to be uniquely identified and located using a simple identifier.
“Please check the information in this form and make any necessary changes, as this is the information which will be entered into the published record of the dataset.
“If you have any questions about this process please contact a member of the research services team for advice or assistance.”
- When you hit the “Publish Dataset” button, the dataset record from CKAN will be used to create a record in the Lincoln Repository. The record will be submitted for review by the Repository team, who will then make it live. N.B. for the time being, you will see an error “Validation errors: [doi] is a required string” – this happens because the University does not currently have access to the live DataCite DOI service, which would secure a DOI for each dataset record deposited from CKAN. This should form part of the new research information service.

- Here’s an example of a record in the Lincoln Repository, created from a CKAN dataset and made live by the Repository team.

Problems with the deposit process as it currently stands:
- Permissions are not correctly cascaded from a project in the Researcher Dashboard to a group in CKAN.
- There is currently no single sign-on between the Researcher Dashboard and CKAN.
- When CKAN challenges a user to log in to a group, they should be redirected back to the group page after logging in – instead they get sent back to the CKAN home page, in which case they will have to look again for their project under the “Groups” menu.
- A minor one – in CKAN “License” (noun) appears in US spelling (should be “Licence”).
- In order to add all the information needed to deposit a dataset from CKAN, the user has to click the “Further information” tab on the left-hand menu (this is not obvious from the initial dataset form).
- Some of the field labels in CKAN are a bit opaque or use technical terms (“Mimetype”) which could do with explanation.
- When depositing to EPrints, some of the deposit fields should be autopopulated from CKAN fields – this does not appear to be happening. The fields affected are:
  - “Description” (could be derived from CKAN dataset/resource Description fields)
  - “Type of Data” (could be derived from CKAN resource Format field)
- Repository records created from CKAN have the data “Creator” attached, but not the “Maintainer”.
- Repository records created from CKAN don’t have a link back to the CKAN dataset (should go in the EPrints “Official URL” field) – this will be required to provide access to the data.
- After deposit, users see the error message “Validation errors: [doi] is a required string” – the University does not currently have access to the live DataCite DOI service, which would secure a DOI for each dataset record deposited from CKAN.
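Several of the problems above come down to a missing mapping between CKAN fields and deposit fields. As a rough sketch of what that mapping could look like (not how Orbital does it), the function below derives deposit defaults from a CKAN dataset dictionary. The CKAN keys used (`title`, `notes`, `author`, `maintainer`, `name`, `resources`, `format`) are standard CKAN dataset fields; the output keys, the function name and the example base URL are hypothetical.

```python
def ckan_to_deposit_fields(dataset, ckan_base_url):
    """Derive deposit-form defaults from a CKAN dataset dictionary.

    The keys of the returned dictionary are illustrative names,
    not the actual field names used by Orbital or EPrints.
    """
    resources = dataset.get("resources", [])
    # Collect the distinct resource formats, e.g. "CSV, PDF".
    formats = sorted({r["format"] for r in resources if r.get("format")})
    return {
        "title": dataset.get("title", ""),
        # CKAN stores the dataset description in the "notes" field.
        "description": dataset.get("notes", ""),
        "type_of_data": ", ".join(formats),
        "creator": dataset.get("author", ""),
        "maintainer": dataset.get("maintainer", ""),
        # Link back to the CKAN dataset page (for the "Official URL" field).
        "official_url": f"{ckan_base_url.rstrip('/')}/dataset/{dataset['name']}",
    }


# Placeholder dataset; the values are made up for illustration.
example_dataset = {
    "name": "soil-samples",
    "title": "Soil samples",
    "notes": "Measurements from a field survey.",
    "author": "A. Researcher",
    "maintainer": "B. Maintainer",
    "resources": [{"format": "CSV"}, {"format": "PDF"}, {"format": "CSV"}],
}
fields = ckan_to_deposit_fields(example_dataset, "https://ckan.example.org/")
print(fields["official_url"])
```

A mapping like this would cover the “Description”, “Type of Data”, “Maintainer” and link-back issues in one place, leaving the user to check the values rather than retype them.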
Tags: CKAN, dashboard, data, deposit, end, EPrints, Harry Newton, Lincoln Repository, Nick Jackson, Orbital, problems, research, research data, research project, Researcher Dashboard, SWORD, SWORD v2
Posted in Uncategorized | (0) Comments | Click here to add a comment »
Posted on February 1st, 2013 by Paul Stainthorp
It’s been a while since we ran the Jerome project at the University of Lincoln, but it’s far from dead, and thanks to the recent leaps forward in establishing a proper data.lincoln.ac.uk (and data.ac.uk) portal, you can now access a permanent copy of our open catalogue data, at:

Just as in the original Jerome application, this data is constantly harvested from our catalogue over a number of days, one record at a time in an endless cycle.
It’s a ‘minimally invasive’ method that doesn’t put too heavy a load on the catalogue itself, or require us to run any additional software on our catalogue server – and it means that, on average, no record in the open data is more than a couple of days out of date. The data harvested is stored in Nucleus before being processed and published to data.lincoln.ac.uk.
If you have any technical questions about the process, it’s worth contacting LNCD (specifically, Nick Jackson).
The biggest difference between the original Jerome and this new process is that Jerome scraped XML views of catalogue records from our web OPAC, while son-of-Jerome harvests the records one at a time over Z39.50, using the YAZ PHP extension. We’re also publishing the data this time as BibJSON, rather than MakeItUpAsWeGoAlongJSON.
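For anyone unfamiliar with BibJSON, here is a small sketch of what a converted record might look like. It assumes a flat catalogue record with hypothetical keys (`title`, `author`, `year`, `isbn`) and follows the usual BibJSON conventions of an `author` list of name objects and an `identifier` list; it is not the code running behind data.lincoln.ac.uk, and the record shown is a placeholder.

```python
import json


def to_bibjson(record):
    """Convert a flat catalogue record into a BibJSON-style dictionary.

    The input keys are hypothetical; only fields that are present
    in the record are copied across.
    """
    entry = {"type": "book", "title": record["title"]}
    if record.get("author"):
        entry["author"] = [{"name": record["author"]}]
    if record.get("year"):
        entry["year"] = record["year"]
    if record.get("isbn"):
        entry["identifier"] = [{"type": "isbn", "id": record["isbn"]}]
    return entry


# Placeholder record, not taken from the real catalogue.
bib_entry = to_bibjson({
    "title": "Example title",
    "author": "Example, Author",
    "year": "2009",
    "isbn": "9780000000002",
})
print(json.dumps(bib_entry, indent=2))
```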
There’s a lot more data to come, including:
- Richer bibliographic data on each item (it’s somewhat bare-bones at the minute!)
- Library item data (i.e. copies of particular works)
- Reading lists
- Repository records
- Usage and activity data
Tags: BibJSON, bibliographic data, CC0, data, data.ac.uk, data.lincoln.ac.uk, Jerome, library catalogue, YAZ, z39.50
Posted on July 9th, 2012 by Paul Stainthorp
Tags: BibTeX, data, EPrints, HTML, JSON, Lincoln Repository, record, RefWorks, URLs, XML
Posted on May 17th, 2012 by Paul Stainthorp
Most of the CLOCK project team (AB, EC, CL, TJ, PS) are at CARET in Cambridge today and tomorrow (17-18 May 2012) to generally hack bibliographic data and try and point the way for the remaining 2 months’ technical development for the CLOCK project.
After coffee on day 1 we agreed our objectives for the next two days. They are:
- To review what we’ve done so far and what we need to do. To play with the SPARQL and JSON-parsing search tools that Andrew Beeken has started to develop and to incorporate more data (BL, etc.)
- To think about the user interface for CLOCK: how do we present open bib data from multiple sources (Lincoln, Cambridge, Harvard, BL, OpenLibrary, other) in a single UI in a way which helps our users (cataloguers, researchers) solve problems?
- What’s the high level architecture for CLOCK? How does data flow thru’ the system – can we draw a meaningful diagram?
- A comparison of open data / Discovery projects that Ed Chamberlain is involved in! What can we take and re-use from OpenBiblio2 and the OEM-UK project? What might those projects be able to take and re-use from CLOCK?
- What are we going to do with all this data? A plan for http://data.lincoln.ac.uk/, http://data.lib.cam.ac.uk/, and http://data.ac.uk/library (or http://library.data.ac.uk/).
- To run interviews and live cognitive workthroughs with cataloguers in Cambridge and Lincoln.
Tags: architecture, Cambridge, CARET, cataloguing, data, data sources, data.ac.uk, data.lincoln.ac.uk, JSON, objectives, OEM-UK, openbib2, re-use, SPARQL, UI, user interface design
Posted on May 16th, 2012 by Paul Stainthorp
Today we published data on approximately 1.8 million items loaned from the University of Lincoln’s libraries since 2001. The data is available to re-use under a CC0 licence, and can be downloaded from:
We’ve done this as part of our involvement in the Copac Activity Data Project, a.k.a. SALT2. Along with data from the universities of Manchester, Sussex, Cambridge and Huddersfield, our circulation data will be used to power a ‘recommender API’, which libraries will be able to use to build “People who borrowed X also borrowed Y”-type services. The API will benefit from the power of aggregated data from multiple institutions of different types, containing tens of millions of circulation events.
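To give a flavour of how “People who borrowed X also borrowed Y” can fall out of circulation data, here is a toy sketch based on simple co-borrowing counts. This is an illustration only and makes no claim about how the SALT recommender actually works; the function names and the sample loans are made up.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_borrow_counts(loans):
    """Count how often each pair of works was borrowed by the same person.

    loans is an iterable of (borrower_id, work_id) pairs.
    Returns a dictionary mapping each work to a Counter of other works.
    """
    works_by_borrower = defaultdict(set)
    for borrower, work in loans:
        works_by_borrower[borrower].add(work)

    counts = defaultdict(Counter)
    for works in works_by_borrower.values():
        for a, b in combinations(sorted(works), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(counts, work, n=3):
    """Return up to n works most often borrowed alongside the given work."""
    return [other for other, _ in counts.get(work, Counter()).most_common(n)]


# Made-up loans: (hashed borrower, hashed work).
sample_loans = [
    ("u1", "A"), ("u1", "B"),
    ("u2", "A"), ("u2", "B"),
    ("u3", "A"), ("u3", "C"),
]
counts = co_borrow_counts(sample_loans)
print(recommend(counts, "A"))
```

With data from several institutions the same counts simply get larger, which is where the benefit of aggregation comes from.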
You’ll notice as well that we’ve chosen to host the data on our brand-new Orbital (v0.1) research data management application. Each dataset has a persistent citable URI. We’ll be keeping the data up-to-date, and generating a new activity data file from our library circulation logs shortly after the end of each academic year.
The data consists of a number of CSV files (one for each academic year since 2000-01, plus a huge file of all the data), containing the following fields:
| Field index | Field name | Description |
|---|---|---|
| 0 | CREATE_DATE | The date and time of the loan event, in the format: dd/mm/yyyy hh:mm |
| 1 | BORROWER_ID | A cryptographic hash of the internal system ID associated with the borrower of the item, as used in the University of Lincoln’s library system. |
| 2 | WORK_ID | A cryptographic hash of the internal system ID associated with the bibliographic work borrowed, as used in the University of Lincoln’s library system. |
| 3 | CONTROL_NUMBER | The ISBN of the work borrowed (10 or 13 digits). |
| 4 | AUTHOR_DISPLAY | The main author of the work borrowed. |
| 5 | TITLE_DISPLAY | The title of the work. |
| 6 | PUB_DATE | The publication year of the work in the form: yyyy |
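If you want to work with the files, the field layout above translates fairly directly into code. The sketch below reads rows into dictionaries and parses CREATE_DATE using the documented dd/mm/yyyy hh:mm format. It assumes comma-separated files with no header row, which you should check against the actual downloads; the sample row is invented.

```python
import csv
import io
from datetime import datetime

# Field order as documented in the table of fields.
FIELDS = [
    "CREATE_DATE", "BORROWER_ID", "WORK_ID", "CONTROL_NUMBER",
    "AUTHOR_DISPLAY", "TITLE_DISPLAY", "PUB_DATE",
]


def read_loans(handle):
    """Yield one dictionary per loan event from an open CSV file.

    Assumes no header row; CREATE_DATE is parsed into a datetime.
    """
    for row in csv.reader(handle):
        loan = dict(zip(FIELDS, row))
        loan["CREATE_DATE"] = datetime.strptime(loan["CREATE_DATE"], "%d/%m/%Y %H:%M")
        yield loan


# Invented sample row standing in for a real file.
sample = io.StringIO(
    '16/05/2011 14:30,a1b2c3,d4e5f6,9780000000002,"Example, Author",Example title,2009\n'
)
loans = list(read_loans(sample))
print(loans[0]["TITLE_DISPLAY"], loans[0]["CREATE_DATE"].year)
```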
I’ll blog in detail another time about exactly how we created the data extracts. In short:
- There is a table in the SirsiDynix Horizon library management system called circ_tran which records every instance of item number X borrowed by user number Y at time Z. [#1]
- There is another table which provides a lookup between item numbers and the numbers of the bibliographic works of which they are a copy. [#2]
- Dave Pattern at the University of Huddersfield wrote a Perl script which scrapes all the bibliographic data (title, author, ISBN) for each work from our OPAC (Horizon Information Portal) and writes it to a text file. [#3]
- Developer Jamie Mahoney of CERD/LNCD then stepped in, using some pretty heavy SQL on the original 3 data extracts, to:
  - Hash the internal Horizon user and work ID numbers to provide anonymity;
  - Convert the internal Horizon date and time stamps in extract [#1] from a version of Unix time into a readable datestamp (formula hint: cko_date*86400 + cko_time*60);
  - Use the item/work lookup table [#2] to pull in the bibliographic details for each loan in [#1] from the bibliographic table [#3] (an epic SQL JOIN query), removing items which are no longer represented in our library system;
  - Remove any items without an ISBN, which are of no use to the SALT recommender API;
  - Tweak the punctuation and formatting;
  - Split the data into separate files for each year.
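Two of the steps above are easy to sketch outside SQL. The formula hint suggests that cko_date counts days since 1 January 1970 and cko_time counts minutes since midnight, so multiplying out gives ordinary Unix seconds. The code below works on that reading, and also shows one way of hashing an internal ID. Both are assumptions for illustration: the real work was done in SQL, the hash function and salt actually used are not stated here, and the conversion uses UTC only to keep the example deterministic.

```python
import hashlib
from datetime import datetime, timezone


def horizon_to_datestamp(cko_date, cko_time):
    """Convert Horizon-style day and minute counts to dd/mm/yyyy hh:mm.

    Applies the formula hint cko_date*86400 + cko_time*60 to get
    Unix seconds, then formats the result (in UTC, as an assumption).
    """
    seconds = cko_date * 86400 + cko_time * 60
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%d/%m/%Y %H:%M")


def anonymise_id(internal_id, salt):
    """Return a SHA-256 hex digest of a salted internal ID.

    Hypothetical helper: SHA-256 and the salt are illustrative choices.
    """
    return hashlib.sha256(f"{salt}:{internal_id}".encode("utf-8")).hexdigest()


print(horizon_to_datestamp(15476, 870))
print(anonymise_id(12345, salt="example-salt"))
```

Using a secret salt matters for anonymity: internal ID numbers are short and guessable, so an unsalted hash could be reversed simply by hashing every possible ID.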
Once again, the data is at:
Thanks are due to Chris Leach and Dave Pattern for Horizon-fu, and to Jamie Mahoney for his patient wrangling of several millions of lines of data!
You can find out more about the Copac Activity Data Project/SALT2, at: http://copac.ac.uk/innovations/activity-data/
Tags: #jiscad, activity data, API, Cambridge University, CC0, Chris Leach, circulation, Copac, data, data publishing, Dave Pattern, fields, HiP, Horizon, Jamie Mahoney, LMS, LNCD, loans, Orbital, Perl, recommendation, SALT2, SirsiDynix, University of Huddersfield, University of Manchester, University of Sussex
Posted on March 28th, 2012 by Paul Stainthorp
There is a project going on at the University of Lincoln at the moment to rebuild our directory of academic staff profiles on the web, in line with our new corporate website.
As I mentioned in my presentation to library managers last week, it’s turning out to be a nice example of how new web applications can be spun up quickly at Lincoln using our existing [open and non-open] data sources (in this case, HR staff data, BuddyPress social profile data, Repository feeds, Gravatar images, and our OAuth authentication framework/Common Web Design), plus a bit of developer magic.

You can search our staff profile directory (still in development) at: http://phone.online.lincoln.ac.uk/
There is a growing tendency for universities in all groupings—certainly for the research-intensive universities—to publish the entirety of an author’s publications to their web profile as embedded content from their repository and/or Current Research Information System (CRIS). Here are a few examples of staff profiles on other UK universities’ sites which incorporate publication lists derived from their repositories or CRISes:
We’re pulling the publication details from the Lincoln Repository for each author into their web profile (example), using a search on their University of Lincoln staff ID (which forms part of their standard HR data profile) – e.g. http://lncn.eu/ep/000157. We can then get at the Repository data in almost any format we want (BibTeX, JSON, XML, RSS, etc.). I’m also keeping a close eye on the development of the EPrints Shelves plugin, which might be an interesting tool for giving authors more flexibility and control over how their Repository publication list(s) are displayed on their web profile.
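As a rough idea of what can be done once the Repository data is in hand as JSON, the sketch below turns a list of records into simple citation lines for a profile page. The record layout (a `creators` list of family/given names, plus `title`, `date` and `uri`) is my assumption about a typical EPrints JSON export and should be checked against real output; the records shown are placeholders and this is not the staff directory's actual code.

```python
import json


def format_publications(eprints_json):
    """Turn an EPrints-style JSON export into citation lines, newest first.

    The field names are assumptions about the export format.
    """
    records = json.loads(eprints_json)
    lines = []
    for rec in sorted(records, key=lambda r: str(r.get("date", "")), reverse=True):
        names = [
            f"{c['name']['family']}, {c['name']['given']}"
            for c in rec.get("creators", [])
        ]
        year = str(rec.get("date", ""))[:4]
        lines.append(f"{'; '.join(names)} ({year}) {rec.get('title', '')}. {rec.get('uri', '')}")
    return lines


# Placeholder records, not real Repository items.
sample_export = json.dumps([
    {"title": "First paper", "date": "2010",
     "creators": [{"name": {"family": "Smith", "given": "A."}}],
     "uri": "http://repository.example.org/1/"},
    {"title": "Second paper", "date": "2012-05",
     "creators": [{"name": {"family": "Smith", "given": "A."}},
                  {"name": {"family": "Jones", "given": "B."}}],
     "uri": "http://repository.example.org/2/"},
])
publication_lines = format_publications(sample_export)
for line in publication_lines:
    print(line)
```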
Tags: BuddyPress, CRIS, CWD, data, EPrints, examples, Lincoln Repository, profile, repositories, Shelves, staff, staff ID, universities, website
Posted on October 3rd, 2011 by Paul Stainthorp
The Library Impact Data Project (LIDP), which ran from February-July this year, and in which the University of Lincoln took part, has now released a subset of the library activity data used in the analysis (which, you’ll remember, showed a statistically significant correlation across a number of universities between library activity data and student attainment).
Lincoln’s data is included in the release, which is available for re-use under an open licence, from:
http://eprints.hud.ac.uk/11543/
This data set is made available under the Open Data Commons Attribution License
http://opendatacommons.org/licenses/by/1.0/
The data contains final grade and library usage figures for 33,074 students studying undergraduate degrees at UK universities. More information on the data, and how it’s been generalised in order to preserve students’ anonymity, on the LIDP project blog.
- There’s also a detailed report about the statistical breakdown of Lincoln’s own share of the data (this wasn’t published as part of the project reports, as it was down to each individual institution whether to make it public or not) – I’ve made the report available here [PDF].
The LIDP blog also contains information about the project ‘toolkit’, developed to assist other institutions who may want to test their own data against the LIDP’s hypothesis, here and here.
Thanks again to Graham, Bryony and Dave at the University of Huddersfield for inviting Lincoln to take part in the project, and for their help along the way!
On to the next one…
Tags: activity data, data, licence, LIDP, open data, Open Data Commons, project, toolkit, University of Huddersfield
Posted on June 18th, 2011 by Paul Stainthorp
I think this is worth re-posting from the LIDP blog:
We are very pleased to report that we have now received all of the data from our partner organisations and have processed all but two already!
Early results are looking positive and our next step is to report back with a brief analysis to each institution. We are planning to give them our data and a general set of data so that they can compare and contrast. There have been some issues with the data, some of which has been described in previous blogs, however, we are confident we have enough to prove the hypothesis one way or another!
In our final project meeting in July we hope to make a decision on what form the data will take when released under an Open Data Commons Licence. If all the partners agree, we will release the data individually; otherwise we will release the general set for other to analyse further.
I submitted Lincoln’s data on 13 June. It consists of fully anonymised entries for 4,268 students who graduated from the University of Lincoln with a named award, at all levels of study, at the end of the academic year 2009/10 – along with a selection of their library activity over three* years (2007/08, 2008/09, 2009/10).
The library activity data represents:
- The number of library items (book loans etc.) issued to each student in each of the three years; taken from the circ_tran (“circulation transactions”, presumably) table within our SirsiDynix Horizon Library Management System (LMS). We also needed a copy of Horizon’s borrower table to associate each transaction with an identifiable student.
- The number of times each student visited our main GCW University Library, using their student ID card to pass through the Library’s access control gates in each of the three* years; taken directly from our ‘Sentry’ access control/turnstile system. These data apply only to the main GCW University Library: there is no access control at the University of Lincoln’s other four campus libraries, so many students have ‘0’ for these data. Thanks are due to my colleague Dave Masterson from the Hull Campus Library, who came in early one day, well before any students arrived, in order to break in to the Sentry system and extract this data!
- The number of times each student was authenticated against an electronic resource via AthensDA; taken from our Portal server access logs. Although by no means all of our e-resources go via Athens, we’re relying on it as a sort of proxy for e-resource usage more generally. Thanks to Tim Simmonds of the Online Services Team (ICT) for recovering these logs from the UL data archive.
I had also hoped to provide numbers of PC/network logins for the same students for the same three years (as Huddersfield themselves have done), but this proved impossible. We do have network login data from 2007-, but while we can associate logins with PCs in the Library for our current PCs, we can’t say with any confidence whether a login to the network in 2007-2010 occurred within the Library or elsewhere: PCs have just been moved around too much in the last four years.
Student data itself—including the ‘primary key’ of the student account ID—was kindly supplied by our Registry department from the University’s QLS student records management system.
Once we’d gathered all these various datasets together, I prevailed upon Alex Bilbie to collate them into one huge .csv file: this he did by knocking up a quick SQL database on his laptop (he’s that kind of developer), rather than the laborious Excel-heavy approach using nested COUNTIF statements which would have been my solution. (I did have a go at this method – it clearly worked well for at least one of the other LIDP partners – but my PC nearly melted under the strain.)
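For the curious, the SQL approach amounts to loading each dataset into its own table and counting rows per student. The sketch below is my own reconstruction of the idea rather than Alex’s actual database: the table names, column names and sample rows are all hypothetical, and it uses an in-memory SQLite database so it can be run anywhere.

```python
import sqlite3


def collate(conn, year):
    """Return (student_id, award, loan_count, visit_count) for one year.

    Students with no recorded activity get a count of 0,
    which is the SQL equivalent of a COUNTIF over each dataset.
    """
    query = """
        SELECT s.student_id,
               s.award,
               (SELECT COUNT(*) FROM loans l
                 WHERE l.student_id = s.student_id
                   AND l.academic_year = ?) AS loan_count,
               (SELECT COUNT(*) FROM visits v
                 WHERE v.student_id = s.student_id
                   AND v.academic_year = ?) AS visit_count
          FROM students s
         ORDER BY s.student_id
    """
    return conn.execute(query, (year, year)).fetchall()


# Hypothetical schema and made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (student_id TEXT PRIMARY KEY, award TEXT);
    CREATE TABLE loans (student_id TEXT, academic_year TEXT);
    CREATE TABLE visits (student_id TEXT, academic_year TEXT);
""")
conn.executemany("INSERT INTO students VALUES (?, ?)", [("s1", "BA"), ("s2", "BSc")])
conn.executemany(
    "INSERT INTO loans VALUES (?, ?)",
    [("s1", "2009/10"), ("s1", "2009/10"), ("s2", "2008/09")],
)
conn.executemany("INSERT INTO visits VALUES (?, ?)", [("s1", "2009/10")])

rows = collate(conn, "2009/10")
for row in rows:
    print(row)
```

The database does the counting in one pass per table, which is why it copes with volumes that bring a spreadsheet full of nested COUNTIFs to its knees.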
The final .csv data has gone to Huddersfield for analysis and a copy is lodged in our Repository for safe keeping. Once the agreement has been made to release the LIDP data under an open licence, I’ll make the Repository copy publicly accessible.
*N.B. In the end, there was no visitor data for the year 2007/08: the access control / visitor data for that year was missing for almost all students. This may correspond to a re-issuing of library access cards for all users around that time, or the data may be missing for some other reason.
Tags: access control, AthensDA, circulation, circ_tran, COUNTIF, csv, data, Dave Masterson, GCW, Horizon, impact data, licensing, LIDP, LMS, Microsoft Excel, network, ODCL, Online Services Team, OpenAthens, PCs, Portal, primary key, QLS, registry, repository, Sentry, SirsiDynix, student data, students, Tim Simmonds, turnstiles, University of Huddersfield
Posted on June 13th, 2011 by Paul Stainthorp
These data consist of entries for 4,268 anonymised students who graduated from the University of Lincoln with a named award at the end of the academic year 2009/10, along with a selection of their library activity over three years (2007/08, 2008/09, 2009/10): library item circulation, visits to the main GCW University Library, and e-resources usage represented by authentication against AthensDA.
View this item on the University Repository: http://eprints.lincoln.ac.uk/4540/
Tags: #jiscad, activity, activity data, anonymised, circulation, data, electronic resources, GCW, JISC, library, LIDP, sensitive, usage data
Posted on June 8th, 2011 by Paul Stainthorp
Couple of weeks now since the launch of Discovery, and I’ve singularly failed to blog it up.

Some notes, then, so it doesn’t go completely unremarked-on here.
Discovery (“a metadata ecology for UK education and research”) is a natural progression from the JISC/RLUK Resource Discovery Taskforce programme. Only with a catchier name.
And a nice logo.
The intention is that Discovery will build a UK-wide critical mass of open and reusable data—from libraries, archives and museum collections—through innovative discovery interfaces such as… I dunno… this one
The launch took place at the Wellcome Trust in London under the title ‘Discovery – building a UK metadata ecology’. They’re announcing some useful, practical things, such as:
If I’m being honest – this maybe wasn’t the conference for me. It was a bit teaching-and-learning-theoretical/academic for my tastes. I got more out of the last RDTF event in Manchester where some practical library discovery problems were discussed in detail. (But that’s a problem with my unsubtle mind, not a problem with Discovery!)
Also, I have to say I wasn’t convinced by the ‘long tail’ competitive exercise in the afternoon of the launch event… is the real value (and the real convincing case for library discovery) genuinely to be found in providing easier access to more-and-more obscure collections, even if tied to a popular [populist?] ‘hook’ like the 2012 Olympics? Maybe. But I can’t help but feel that there’s still so much more work to be done in the mainstream of library discovery before we all get too special-collection-y. Quibbling aside – this is important stuff and it’s great to see momentum building. I’m wondering now where I can stick some of these. Discovery bumper sticker tiem?

Tags: #rdtf, #ukdiscovery, conferences, data, discovery, events, Jerome, Jerome blog, JISC, metadata, open data, RLUK