Posts Tagged ‘open data’

On open data licensing and sustainability

Posted on May 17th, 2012 by Paul Stainthorp
Last week I attended a free ‘licensing clinic’ in Birmingham, organised by the Discovery programme – mainly as a means of kick-starting my brain into considering the copyright/licensing issues around the CLOCK project. Here are my notes.
  1. The Jerome project addressed licensing in April, 2011, and the situation hasn’t really changed for us: we’re still intending to expose as much of our bibliographic data as possible using a properly open licence such as CC0.
    • “The licensing of data is an interesting one, since we run into a whole bunch of questions around who actually owns the information in our catalogue. Since it’s all factual information (and you can’t copyright a fact) then surely it’s a free for all – except that EU law introduces a curve ball in the form of database right. Broadly speaking this provides specific protection for collections of records, but not the records themselves.”
  2. Ed Chamberlain and the COMET project also addressed licensing and the ownership of MARC records: work that we should revisit.
  3. The JISC Open Bibliographic Data Guide (obd.jisc.ac.uk) provides very clear advice and information useful in creating an open data business case. E.g.:
    • “[…]if we presume that the rationale for publication is to ensure the widest possible dissemination then adoption of a generic open data license (such as Open Data Commons or CC0) is the most effective way to make the set of potential uses unambiguous. Restrictive licenses are counter-productive[…]”
  4. There is some very helpful guidance coming out of the Discovery project around building a business case for open discovery. This was summarised at the recent Discovery programme meeting (also in Birmingham) by David Kay –
    • N.B. I’ll revisit this in a future blog post. I’m getting almost surprisingly interested in the problem of ‘selling’ the idea of open bib data to an institution, and I’ve found the Discovery work on business cases increasingly useful.
  5. At Lincoln in March, 2012, we had a very useful visit from Sander van der Waal of OSS Watch where we discussed the University of Lincoln’s approach to openness (Open Source, Open Access, as well as Open Data). Joss Winn is following this work up with the University’s IP manager with a view to writing a University policy on open licensing of our IP.
  6. Related to the ‘business case’ aspect is the work of LNCD (and also discussions I’ve had with Ed Chamberlain recently) about how to ensure sustainability of open services in a technical sense – what sort of systems architecture and processes do we need in place, and how do we work with university ICT support departments to ensure that projects become institutionally-supported services when it’s important for them to do so?
  7. At this Birmingham event, Chris Banks of the University of Aberdeen spoke about the benefits and challenges of sharing from a library director’s perspective. I was particularly interested in the metaphor of “metadata as currency”: how are aggregators creating value based on the mass accumulation of metadata, and how are they selling that value back to libraries? See Chris’s blog for more. Aberdeen are clearly doing a lot around the analysis of e-resources usage and relating it back to their library strategy / information literacy, etc.
  8. Paul Miller (Cloud of Data): one key quote: “amateurs tend to do a better job of aggregating content than institutions” (e.g. collections of images on Flickr). This may be in part because individuals don’t have the same risk-averse approach, but whatever the reason
  9. Barrister Frances Davey gave us a quick run-through of IP law as it relates to data. Key quote: “the legal repercussions of publishing data openly are pretty much nil”. Fear and uncertainty poison initiative! Frances also touched on the business / reputation-management arguments for having an active approach to open data: people might well be getting bad copies of your data already (via screenscraping) – release it yourself and take control of the quality. Example of the British Library choosing a CC0 licence precisely because of the lack of an attribution clause – then any subsequent re-use is “nothing more to do with us”.
  10. Then, after lunch, copyright consultant Naomi Korn ran a workshop on the practical aspects of choosing a licence for your data. Naomi spoke about the need to start by deciding how open you want to be as an institution (noting that institutions with a dedicated © person tend to have a greater appetite for risk) – then consider whether you have the resources in place to get where you want to be. Key quote: “Let’s do some attribution mapping!” Some links from Naomi’s workshop:
  11. At the Birmingham clinic we also discussed the risks (including the risk of doing nothing) and benefits of taking an open approach. My contribution: open bibliographic data enables high-level services to be sold back to universities (cf. Chris Banks’ notes on metadata aggregation, above). We shouldn’t be scared of this or see it as a reason not to open up our data (we can’t compete with those companies; we want their services and we’re prepared to pay for them!); but we can build lower-level, locally-relevant services as a result of releasing our own open data, and play on the web by web rules – if we don’t make our data open for re-use on the web, we can’t even have the conversation. Lincoln’s approach is entirely around open data as a means to an end: it’s the best and most natural way of sparking off new, innovative services based on unexpected combinations of our own and other people’s data.
    • The best example of this so far is the new data-driven staff profiles at Lincoln: but we’re going to need more, and more convincing, examples if we’re going to make a persuasive business case.
  12. Final overall quote of the day: “Writing your own open licence is an unpleasant form of vanity”.

Tick tock we don’t stop. Introducing CLOCK, a new JISC-funded resource discovery project at the universities of Lincoln and Cambridge

Posted on December 10th, 2011 by Paul Stainthorp

The title says it all, really. The University of Lincoln, working in consortium with Cambridge University Library and Owen Stephens Consulting, has been awarded £49,877 by JISC to investigate ways of driving innovation in libraries’ interactions with Open Bibliographic Data, through a project we’re calling CLOCK (Cambridge-Lincoln Open Catalogue Knowledgebase).

CLOCK is a continuation of and elaboration upon the work of two recent JISC Discovery projects—Jerome at the University of Lincoln and COMET at the University of Cambridge—via a programme of development work shared between the two institutions, and with library consultant Owen Stephens. JISC were impressed enough with the work of both projects, and sufficiently interested in the potential for collaboration, that they encouraged our joint bid for follow-up funding.

Between now and the end of July, 2012, the CLOCK project will provide us with a framework to:

…[1] exploit through real-world applications the significant amount of data released openly by Cambridge University Library; [2] apply the Jerome database architecture, iterative development methodology, and API framework to a bibliographic dataset an order of magnitude greater than the University of Lincoln’s; and [3] to build and enable a new set of tools and demonstrator services which will enable the future development of public Open Bib Data web applications of practical utility to libraries and end-users.

You can read the full bid document, here.

I’m very much looking forward to working with Ed Chamberlain, Systems Librarian in the University Library at the University of Cambridge, along with Owen Stephens, veteran of a number of campaigns to open up access to library data, and Chris Leach (Systems Librarian) and Ian Snowley (University Librarian) from the University of Lincoln. Thanks are due to all of them for their help in writing the successful bid; to the Research & Enterprise Development office at Lincoln for their invaluable assistance in putting together the project budget; and to the LNCD group at the University of Lincoln for providing the kind of supportive development platform that makes these kinds of projects possible.

Finally, a big thank you to Andy McGregor and the JISC Digital Infrastructure: Information and library infrastructure: Resource discovery programme, for this opportunity to further explore the blossoming environment of open bibliographic data/open discovery in libraries. If you haven’t done so already, you might like to take a look at the following websites:

As with all our projects, we’ll be blogging it comprehensively (so stand by for a steady stream of awful clock-related puns used as blog post titles). Although there’s little to see there yet, the CLOCK project blog is at: http://clock.blogs.lincoln.ac.uk/ – along with its own RSS feed. Watch that space!

The University of Linking (part 2)

Posted on November 24th, 2011 by Paul Stainthorp

I’m determined there’s a better way of dealing with information about academic library opening hours than the mess of PDF documents and abuse of JavaScript we rely on at the moment.

Over coffee this morning, Mr Jackson and I drew a Linked Data graph:

Linked Data graph produced using LucidChart (http://www.lucidchart.com/)

What do you think about it? Is it detailed enough – or is it too pedantic? I suspect I’m not using a consistent level of abstraction across the whole graph. It was a second attempt, over only a very small cup of coffee.

Some if not all of the terms in the graph will already have formal equivalents and HTTP URIs. Anybody care to suggest any? Some are probably already to be found at Chris Gutteridge’s (and colleagues’) data.southampton.ac.uk – can we work the above up into actual real Linked Data triples using a standard notation?

Finally (so many questions): this still doesn’t quite solve the problem of a standard format for publishing the opening hours themselves. Could that be something as simple as a .csv file? (Easy to update by library staff.) Wouldn’t it be amazing if every academic library in the country published its opening hours (along with its geographical location and contact details) in such a format at a stable URL?
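To make the .csv idea a bit more concrete, here is a rough sketch in Python. Everything specific in it is an assumption rather than anything we have actually built: the column names (day, opens, closes), the example.org URI for the library, and the choice of schema.org terms (Library, openingHoursSpecification, dayOfWeek, opens, closes) as the vocabulary. It reads the opening hours from CSV text and writes them out as N-Triples, which is one standard notation for Linked Data triples.

```python
import csv
import io

# Hypothetical CSV layout: one row per day, times in 24-hour HH:MM.
SAMPLE_CSV = """day,opens,closes
Monday,08:30,22:00
Tuesday,08:30,22:00
Saturday,10:00,18:00
"""

SCHEMA = "http://schema.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Placeholder URI; a real one would live on the institution's own domain.
LIBRARY_URI = "http://example.org/id/library/main"


def read_opening_hours(text):
    """Parse CSV text into a list of (day, opens, closes) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["day"], row["opens"], row["closes"]) for row in reader]


def to_ntriples(library_uri, hours):
    """Return a list of N-Triples lines describing the opening hours."""
    lines = [f"<{library_uri}> <{RDF_TYPE}> <{SCHEMA}Library> ."]
    for index, (day, opens, closes) in enumerate(hours):
        node = f"_:hours{index}"  # blank node holding one day's hours
        lines.append(f"<{library_uri}> <{SCHEMA}openingHoursSpecification> {node} .")
        lines.append(f"{node} <{RDF_TYPE}> <{SCHEMA}OpeningHoursSpecification> .")
        lines.append(f"{node} <{SCHEMA}dayOfWeek> <{SCHEMA}{day}> .")
        lines.append(f'{node} <{SCHEMA}opens> "{opens}" .')
        lines.append(f'{node} <{SCHEMA}closes> "{closes}" .')
    return lines


hours = read_opening_hours(SAMPLE_CSV)
triples = to_ntriples(LIBRARY_URI, hours)
print("\n".join(triples))
```

The point of the split is that library staff would only ever touch the CSV file; the triples are generated from it, so the two can never drift apart.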

Rough notes from a JISC emerging bibliographic tools workshop, 5th October 2011

Posted on October 12th, 2011 by Paul Stainthorp

I was at Goodenough College in London last Wednesday, 5th October 2011, for a workshop organised under the JISC Discovery programme (discovery.ac.uk), to discuss approaches to publishing, managing, and using Open Bibliographic data (OBD) on the web. Here are some of the notes that I made on the day. I’ve left them rather rough because I don’t have time to bully them into proper paragraphs.

The workshop started with a general overview and discussion of the current picture of OBD.

  • We’re dealing with a growing number of technologies for open library discovery: Linked Data, BibJSON, OPDS (based on Atom), Lincoln’s NoSQL/API-centric approach, even SuperMARC(!?).
  • Few if any people have a good handle on all of these approaches, but we ought to be at least conversant with them.
  • We’re a room full of experimenters! But how can we communicate Discovery/OBD to others? How can JISC funding be used to support the work? We need to surface not only tools and data but also skills.
  • Possibility of looking to e.g. DevCSI/Netskills to help with addressing the skills gap. Are CompSci graduates being encouraged to exercise their skills in open/community development?

We then split into two groups to brainstorm “what’s interesting in bibliographic data at the moment?”: the two groups managed to fill around 8 flipchart sheets :-)

Photo of a flipchart covered in writing

A few quotes and themes I picked up on:

  • What will be the value of OA repositories in hindsight? Will it be open data (some are skeptical) or rather will it be their effect on the publishing industry?
  • A really useful application would be a one-size-fits-all API to identify possible identifiers within a record/page – “I think this is an identifier, please tell me what sort it is” – which then leads into a web service to aggregate information about the thing itself (rights information, etc.) – jokingly called “Rate my Regex”! – some interest in this as a project.
  • Paul Walk: “Please can we have a day off from Linked Data!?”
  • Idea of the role of “data doctor/data wrangler” gaining some currency in institutions.
  • There are plenty of code libs for dealing with bibliographic data: pymarc, MARC4J, MARC::Record (Perl), solrmarc.
  • Owen Stephens: “MARCXML is the worst of MARC combined with the worst of XML. It’s rubbish.”
  • A colleague of Peter Murray-Rust (sorry, I didn’t catch your name!). Citable data is not copyrightable. Java library containing ~20,000,000 open article records???
  • Mark MacGillivray[?]: “To most people, this [taps laptop] is just a plastic box full of magic.”
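For what it’s worth, the “Rate my Regex” idea is easy to sketch, at least for the commonest identifiers. The Python below is only an illustration of the principle, not anything that was shown on the day: the function name classify and the set of identifier types are my own assumptions. It combines a regular expression for the shape of each identifier with the standard check-digit rules for ISBN-10, ISBN-13 and ISSN, so that a ten-digit number with a bad check digit is not reported as an ISBN.

```python
import re

DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")
ISBN13_PATTERN = re.compile(r"^97[89]\d{10}$")
ISBN10_PATTERN = re.compile(r"^\d{9}[\dX]$")
ISSN_PATTERN = re.compile(r"^\d{7}[\dX]$")


def _mod11_ok(chars):
    """Check digit rule shared by ISBN-10 and ISSN.

    Weights run from len(chars) down to 1; 'X' counts as 10.
    """
    length = len(chars)
    total = sum(
        (length - i) * (10 if c == "X" else int(c))
        for i, c in enumerate(chars)
    )
    return total % 11 == 0


def _isbn13_ok(digits):
    """ISBN-13 rule: alternating weights 1 and 3, total divisible by 10."""
    total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(digits))
    return total % 10 == 0


def classify(candidate):
    """Guess what sort of identifier a string is, or return 'unknown'."""
    raw = candidate.strip()
    if DOI_PATTERN.match(raw):
        return "DOI"
    # Hyphens and spaces are only presentation in ISBNs and ISSNs.
    compact = re.sub(r"[\s-]", "", raw).upper()
    if ISBN13_PATTERN.match(compact) and _isbn13_ok(compact):
        return "ISBN-13"
    if ISBN10_PATTERN.match(compact) and _mod11_ok(compact):
        return "ISBN-10"
    if ISSN_PATTERN.match(compact) and _mod11_ok(compact):
        return "ISSN"
    return "unknown"


for example in ["0-306-40615-2", "978-0-306-40615-7", "0378-5955", "10.1000/182"]:
    print(example, "->", classify(example))
```

A real service would need many more identifier schemes, and the second half of the idea (aggregating information about the thing identified) is a much bigger job than the guessing step.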

After lunch we split again, this time into three groups, each to consider a different aspect of managing Open Bibliographic Data; each to consider opportunities, costs, pitfalls, etc. relating to the technologies themselves as well as to the skills needed in exploiting those technologies:

  1. Transforming data
  2. Munging data (both groups 1. and 2. agreed that the two steps are really the same thing – just “more transformation” – also that ‘munging’ is an awful word…)
  3. Exploitation of data

I was part of the ‘Munging data’ group.

Challenges

  • Problems in the move from a unitary system to distributed data services – loss of control (quality of 3rd-party data can be a problem for the librarian mindset!), worries over sustainability of mashup-style approaches (cf. dbpedia, BBC RDF, the now-defunct Talis Silkworm project). However, openness itself provides some guarantee against things becoming defunct (e.g. Open Source Software).
  • Need to think about the capacity (and the uneven geographic distribution) of local skills
  • “Any data is better than no data”. Use of third-party open data is not really a challenge for management any more (only cataloguers care!)? But still important are notions of provenance, attribution, putting power back in the hands of the end user.
  • We need to think at the citation level – is there a big difference between personal and institutional data?
  • Character encoding!
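As a small illustration of why character encoding earns its exclamation mark, here is the classic failure in Python: UTF-8 bytes read as if they were Latin-1. The function name repair_mojibake and the sample surname are my own, purely for illustration, and the sketch deliberately does not attempt MARC-8, which needs a dedicated library.

```python
def repair_mojibake(text):
    """Undo one round of 'UTF-8 bytes decoded as Latin-1'.

    If the text does not look like that kind of damage, it is
    returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Either the text contains characters outside Latin-1, or the
        # bytes are not valid UTF-8: assume it was fine to begin with.
        return text


original = "Brontë"
# Simulate the damage: correct UTF-8 bytes, wrong decoder.
damaged = original.encode("utf-8").decode("latin-1")
repaired = repair_mojibake(damaged)

print(damaged)
print(repaired)
```

This only handles one specific kind of damage; records that have been through several systems may have been mangled more than once, in more than one way.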

Gaps

  • Skills. Not enough developers. Unevenly distributed geographically. (Can we construct a course/curriculum for open community development skills?).
  • #ukdiscovery is somewhat distant from the mundane concerns of libraries. Ed Chamberlain is speaking to a group of cataloguers in Oxford about OBD – that’s the sort of thing we want!
  • Thinking about the role of CILIP and ‘professionalism’ – keeping [technical] skills up to date. Portfolios/competency framework approaches. Can we get a push from the top of the library profession?
  • Technology gaps, on the other hand, have mostly gone away. There are enough interesting and easy things to keep us busy without having to worry too much about the things that still don’t work. JISC can help to convince (smaller?) institutions that open development should be trusted.

Opportunities

  • Still attempting to overcome legacy licensing issues. Instead of concentrating on dealing with old data, why don’t we just take a “line in the sand” approach and make sure we’re being 100% open from now on. Do the OBD principles need to be extended?
  • Make use of feedback loops. Learn something about your data by feeding how it’s been used back into the system. Use this usage to inform your transformations.

</end>

The data! The data!

Posted on October 3rd, 2011 by Paul Stainthorp

The Library Impact Data Project (LIDP), which ran from February-July this year, and in which the University of Lincoln took part, has now released a subset of the library activity data used in the analysis (which, you’ll remember, showed a statistically significant correlation across a number of universities between library activity data and student attainment).

Lincoln’s data is included in the release, which is available for re-use under an open licence, from:

http://eprints.hud.ac.uk/11543/

This data set is made available under the Open Data Commons Attribution License
http://opendatacommons.org/licenses/by/1.0/

The data contains final grade and library usage figures for 33,074 students studying undergraduate degrees at UK universities. More information on the data, and how it’s been generalised in order to preserve students’ anonymity, on the LIDP project blog.

  • There’s also a detailed report about the statistical breakdown of Lincoln’s own share of the data (this wasn’t published as part of the project reports, as it was down to each individual institution whether to make it public or not) – I’ve made the report available here [PDF].

The LIDP blog also contains information about the project ‘toolkit’, developed to assist other institutions who may want to test their own data against the LIDP’s hypothesis, here and here.

Thanks again to Graham, Bryony and Dave at the University of Huddersfield for inviting Lincoln to take part in the project, and for their help along the way!

On to the next one…

LIDP: end of project. Using libraries = good.

Posted on July 28th, 2011 by Paul Stainthorp

I was in Huddersfield last week for the final project meeting of the Library Impact Data Project (LIDP).

LIDP was successful in proving that:

There is a statistically significant relationship between both book loans and e-resources use and student attainment. And this is true across all of the universities in the study that provided data in these areas.

“We want to stress here again that we realise THIS IS NOT A CAUSAL RELATIONSHIP!  Other factors make a difference to student achievement, and there are always exceptions to the rule, but we have been able to link use of library resources to academic achievement.”

An initial (outline) report on how the University of Lincoln’s own activity-attainment holds up to this same statistical inspection is available to download from here [PDF]. As much as possible of the library activity data used in the project will be released under an Open Data Commons Attribution License in the near future, and hosted on the project blog.

Thanks are due to Graham Stone, Dave Pattern, Bryony Ramsden, and all the project partners for the opportunity for Lincoln to participate in this project. We had fun getting our data together. The end-of-project blog post for LIDP is here – it suggests some very interesting areas for further investigation.

Personally, I’m very interested in looking for cross-institutional comparisons – perhaps trying to explain particular levels of activity-attainment attached to individual subject areas, irrespective of which university the student is at (i.e. does a Lincoln computing student have more in common with a Lincoln business student, or with a Huddersfield computing student?). I’d also be interested in looking particularly at those students whose library activity behaviour changes through the life of their course, and who then go on to get a better degree than they might have been predicted based on their library activity in their first year.

“Finally, we have been astonished by how much interest there has been in our project. To date we have two articles ready for publication imminently and have another 2 in the pipeline. In addition by the end of October we will have delivered 11 conference papers on the project. All articles and conference presentations are accessible at: http://library.hud.ac.uk/blogs/projects/lidp/articles-and-conference-papers/”

I can see this project getting cited, and cited again, simply every time anyone wants to argue that academic libraries are A Good Thing.

#discodev: worldwide software development competition using open library data

Posted on July 5th, 2011 by Paul Stainthorp

Copied verbatim (and under licence!) from the UK Discovery website:

Discovery logo

UK Discovery (http://discovery.ac.uk/) and the Developer Community Supporting Innovation (DevCSI) project based at UKOLN are running a global Developer Competition throughout July 2011 to build open source software applications / tools, using at least one of our 10 open data sources collected from libraries, museums and archives.

…and one of the 10 open data sources is the Jerome API we announced last week!

Enter simply by blogging about your application and emailing the blog post URI to joy.palmer@manchester.ac.uk by the deadline of 2359 (your local time) on Monday 1 August 2011.

Full details of the competition, the data sets and how to enter are at http://discovery.ac.uk/developers/competition/

Follow #discodev on Twitter to see what people are up to.

Is that a Jerome open data API I spy?

Posted on June 28th, 2011 by Paul Stainthorp

Yes. Yes, it is.

http://data.online.lincoln.ac.uk/documentation.html#bib

This is only the initial, bare-bones JSON-only service. A complete (and fully-documented) API will be released in stages over the next month, providing data in a range of output formats. We’re keeping all API and open institutional data documentation in the one place, on our open data site.

Notes from my ‘personal pitch’ (#rdtf in Manchester)

Posted on April 20th, 2011 by Paul Stainthorp

At the JISC/RLUK Opening Data – Opening Doors event in Manchester on Monday I was asked to deliver a five-minute ‘personal pitch’ relating to why the Open Data approach is important/relevant to people/institutions/communities, based around the philosophy driving work at Lincoln.

I didn’t use slides, but here is a verbatim transcription of my handwritten notes (original on Google Docs):

  1. Lincoln has mixture of internal + JISC-funded projects including Jerome, needs two pages of flipchart paper to list all projects —> leading to a project ‘ecology’.
  2. We’re developing platforms for access to space/time (location, room bookings, calendaring), asset, bibliographic, activity, user, course, research data.
  3. It’s less about open data per se (though we are opening up our data!) – more about building openly-accessible platforms for manipulating that data.
  4. ‘Nucleus’ – one platform for services on all opened institutional data. Documented APIs. Inherently rights-based.
  5. ‘Eating our own dog food’. New institutional apps are built on the Nucleus (rather than by exporting and copying data between back-office systems); internal SOA – ‘hearts and minds’ to be won in uni data teams to this approach, but ICT are committed.
  6. Easier migration. Flexible. Integration with third-party services on the same basis.
  7. Concept of Student as Producer – students as active participants in teaching and learning, research, AND in institutional service development & delivery. Conscious rejection of student as passive consumer.
  8. Students building some of the first applications of Lincoln’s open data services – we didn’t ask them to! – stuff we’d never have thought of or not had time to do.
  9. Related: the way we develop open data platforms and services in the first place. Rapid innovation. Joss Winn has approval to establish a new free-floating technology & pedagogy group; will have responsibility to develop + embed new systems.
  10. Benefits – new tools; new methods of working. Quick responses to changes in HE (essential agility!). Partnerships. Active students.
  11. Challenges – licensing (complex history of institution. Many of our MARC records are older than we are!). Too many possibilities? Where do we start?! How to communicate the benefits of this approach succinctly and convincingly. Technical challenges not trivial, but “the great thing about library data standards is that there are so many of them…”