Posted on May 17th, 2012 by Paul Stainthorp
Ed Chamberlain, who is on the CLOCK project team as a researcher, is involved in two other projects under the Discovery strand: OEM-UK and Open Bibliography 2. We’re looking for ways in which CLOCK can re-use data, code, processes and ideas from these projects (and elsewhere) – also what CLOCK could offer in return.
- The Open Biblio project has run over the last few years; its aim is to aggregate large amounts of bibliographic data for scientific discovery.
- Data collected from Cambridge University, the BL, PubMed and held as RDF, used to power an open catalogue called “Bibliographica”.
- Problems around scaling the data/system led to the current JISC-funded Open Biblio 2 project (in the meantime, Cambridge and the BL had started to publish their data openly).
- Open Biblio 2 started looking at a NoSQL approach (CouchDB, Lucene/Solr) – eventually settling on Elastic Search.
- The approach of Open Biblio is to build bottom-up, community tools: BibServer and BibSoup (“Like Wikimedia for bib data”). Raises interesting questions about data quality in an open community-driven system.
- Also looking at JSON as a lightweight way of sharing bib data: the emerging BibJSON convention represents a bibliographic record as a JSON object (Ed wrote a MARC-to-BibJSON parser in Perl). N.B. BibJSON is not a million miles away from the JSON that Jerome spits out! There are three hack days taking place next month in London to look specifically at BibJSON.
- Open Biblio 2 is also looking at JSON-LD (JSON for Linking Data), a ‘real’ JSON standard which does a lot of the things that RDF does.
tl;dr = use their JSON standards and BibSoup as a data source.
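To make the MARC-to-BibJSON idea concrete, here is a minimal sketch in Python (not Ed's Perl parser). It assumes a MARC record has already been parsed into a plain dictionary of tags and subfields; the record values are made up, and the output key names (`title`, `author`, `identifier`, `year`) are my reading of the emerging BibJSON convention rather than a definitive schema.

```python
import json

# Simplified stand-in for an already-parsed MARC record: {tag: {subfield: value}}.
# The values below are invented purely for illustration.
marc_record = {
    "245": {"a": "An example title /", "c": "A. N. Author."},
    "100": {"a": "Author, A. N."},
    "020": {"a": "0000000000"},
    "260": {"c": "2012."},
}


def marc_to_bibjson(record):
    """Map a handful of common MARC fields onto BibJSON-style keys.

    The key names are assumptions based on the BibJSON convention.
    """
    bib = {"type": "book"}

    # 245 $a: title proper, minus trailing ISBD punctuation.
    title = record.get("245", {}).get("a")
    if title:
        bib["title"] = title.rstrip(" /:.")

    # 100 $a: main entry (personal name).
    author = record.get("100", {}).get("a")
    if author:
        bib["author"] = [{"name": author.rstrip(" ,")}]

    # 020 $a: ISBN.
    isbn = record.get("020", {}).get("a")
    if isbn:
        bib["identifier"] = [{"type": "isbn", "id": isbn.strip()}]

    # 260 $c: date of publication; keep the first four digits found.
    date = record.get("260", {}).get("c", "")
    digits = "".join(ch for ch in date if ch.isdigit())
    if digits:
        bib["year"] = digits[:4]

    return bib


bibjson = marc_to_bibjson(marc_record)
print(json.dumps(bibjson, indent=2, sort_keys=True))
```

A real converter would need to cope with repeated fields, indicators and character encodings; the point here is only that the target is an ordinary JSON object.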
- The second project, OEM-UK (Open Education Metadata UK), based at the IoE in London, is focusing on cataloguing workflows.
- Data from the IoE’s SirsiDynix catalogue, plus EPrints, is drawn into a Drupal framework; forms to create data (autopopulation of forms); “cataloguing the Drupal way”.
- Thought from Andrew Beeken: could we replicate this approach, using WordPress custom post types to store and display structured content? Shades of the OPACPress project which Joss Winn and I proposed—but that was not funded—several years ago.
- Some evidence that this approach is capable of speeding up the cataloguing process considerably: the more data you put in the faster it gets! Ed has some screencapture videos from OEM-UK showing workflow, including grabbing data via Zotero.
tl;dr = OEM-UK are also successfully disrupting cataloguing workflows.
Posted on May 17th, 2012 by Paul Stainthorp
Most of the CLOCK project team (AB, EC, CL, TJ, PS) are at CARET in Cambridge today and tomorrow (17-18 May 2012) to generally hack bibliographic data and try and point the way for the remaining 2 months’ technical development for the CLOCK project.
After coffee on day 1 we agreed our objectives for the next two days. They are:
- To review what we’ve done so far and what we need to do. To play with the SPARQL and JSON-parsing search tools that Andrew Beeken has started to develop and to incorporate more data (BL, etc.)
- To think about the user interface for CLOCK: how do we present open bib data from multiple sources (Lincoln, Cambridge, Harvard, BL, OpenLibrary, other) in a single UI in a way which helps our users (cataloguers, researchers) solve problems?
- What’s the high level architecture for CLOCK? How does data flow thru’ the system – can we draw a meaningful diagram?
- A comparison of open data / Discovery projects that Ed Chamberlain is involved in! What can we take and re-use from OpenBiblio2 and the OEM-UK project? What might those projects be able to take and re-use from CLOCK?
- What are we going to do with all this data? A plan for http://data.lincoln.ac.uk/, http://data.lib.cam.ac.uk/, and http://data.ac.uk/library (or http://library.data.ac.uk/).
- To run interviews and live cognitive workthroughs with cataloguers in Cambridge and Lincoln.
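One way to think about the "multiple sources in a single UI" objective is as a grouping problem. The sketch below is only an illustration, not the CLOCK design: it assumes each source hands back simple records with an `isbn` field (a hypothetical shape), and it groups them on a normalised ISBN so that one screen could show which sources hold the same item. The records are made up.

```python
from collections import defaultdict


def normalise_isbn(raw):
    """Strip hyphens and spaces so the same ISBN compares equal across sources."""
    return "".join(ch for ch in raw if ch.isdigit() or ch in "xX").upper()


def group_by_isbn(records):
    """Group records from different sources under a shared, normalised ISBN."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[normalise_isbn(rec["isbn"])].append(rec)
    return dict(grouped)


# Invented records standing in for results from three different sources.
records = [
    {"source": "Lincoln", "isbn": "0-00-000000-0", "title": "An example title"},
    {"source": "Cambridge", "isbn": "0000000000", "title": "An example title."},
    {"source": "BL", "isbn": "1111111111", "title": "Another example"},
]

merged = group_by_isbn(records)
for isbn, group in merged.items():
    print(isbn, [rec["source"] for rec in group])
```

Matching on ISBN alone is a deliberate simplification; records without ISBNs, or with several, would need a different key.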
Posted on May 8th, 2012 by Paul Stainthorp
This is what the CLOCK project team are currently up to (from meetings over the past couple of weeks and from notes made at the recent “Discovery: making sure your resources are discovered, used and reused” event in Birmingham):
- Andrew Beeken has been exploring the Cambridge COMET data via its SPARQL endpoints and has already blogged about the process of using SPARQL to “build kind of a ‘Hello World’ of open data querying”. He’s now looking at the recently-released Harvard open bib data and comparing the speed, the use of matching namespaces, and the use of JSON vs RDF/XML.
- This work is leading up to unified search and presentation of records from several sources (Cambridge/COMET, Harvard, Lincoln/Jerome, OpenLibrary, etc.). Andrew and Trevor Jones are collaborating on drawing up a high-level architecture for CLOCK, and a strategy for expressing Linked Data, which will be shared with the rest of the project team (and publicly) for discussion.
- To support this, Alex Bilbie in ICT services at Lincoln is helping to get the original Jerome application up and running on the CLOCK server (jerome.library.lincoln.ac.uk), where it can be used as a stable platform for developing and RDF-ifying Lincoln’s own bib data.
- Trevor Jones and Ed Chamberlain will work together on developing the work with users (in parallel, at the University of Lincoln and the University of Cambridge) to clarify their requirements for bibliographic data:
- For cataloguers, based around a rethink of copy cataloguing workflows, we will try to tease out requirements from talking to cataloguers (and associated subject librarians) asking to be ‘positively disrupted’: what do they need to do? What is missing from their data?
- For researchers, we will build on some initial user walkthrough analysis done by Trevor and Andrew in Lincoln, with performing arts students in LPAC (the Lincoln Performing Arts Centre). What are the research questions that users are trying to answer? How does bib data help them answer those questions? What’s missing? Ed and Trevor will agree on a set of questions and tasks.
- These requirements will be used to feed the remaining cycles of platform development for CLOCK.
- Ed Chamberlain will act as the conduit between CLOCK and related projects in the Discovery strand, looking for points of shared interest/technology, and blogging (or asking others to blog) about aspects of one project which can inform the others. The other projects in which Ed is involved are: the Open Education Metadata UK (OEM-UK) project at the Institute of Education (shared interest in new user interfaces for cataloguing – possibly use screencasts to demonstrate alternative workflows?) and the Open Bibliography 2 project (lots of potential technical overlap – BibJSON, JSON-LD, BibSoup.net, expression in RDF container formats).
- Ed and I (Paul Stainthorp) will work on developing the ‘business case’ / sustainability of CLOCK and data.*.ac.uk, following up on themes discussed in the recent Discovery event, and thinking not only about institutional funding / high-level support for open bib data, but also what it takes to move open bib data publishing from a development environment into an institutionally-supported, ICT-run service.
- Finally, PS is arranging a couple of internal CLOCK ‘hack days’ (to take place on 17th-18th May, in Cambridge) – more details to follow.
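For anyone wondering what a ‘Hello World’ of open data querying looks like in practice, here is a rough sketch in Python (not Andrew's code). It builds a SPARQL request over HTTP GET and pulls titles out of a SPARQL JSON results document. The endpoint URL is a placeholder, the sample response is hand-written, and the assumption that titles are exposed as `dcterms:title` may not hold for any particular dataset. Nothing is actually sent over the network.

```python
from urllib.parse import urlencode, urlsplit, parse_qs
from urllib.request import Request

# Placeholder endpoint: substitute the real SPARQL endpoint of the dataset.
ENDPOINT = "http://example.org/sparql"

# Ask for ten things that have a Dublin Core title (an assumed predicate).
QUERY = """PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?book ?title
WHERE { ?book dcterms:title ?title }
LIMIT 10"""


def build_request(endpoint, query):
    """Build (but do not send) a SPARQL protocol GET request asking for JSON."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def extract_titles(results):
    """Collect ?title values from a SPARQL JSON results document."""
    return [row["title"]["value"]
            for row in results["results"]["bindings"]
            if "title" in row]


request = build_request(ENDPOINT, QUERY)

# Hand-written response in the standard SPARQL JSON results shape.
sample_response = {
    "head": {"vars": ["book", "title"]},
    "results": {"bindings": [
        {"book": {"type": "uri", "value": "http://example.org/book/1"},
         "title": {"type": "literal", "value": "An example title"}},
    ]},
}

print(request.full_url)
print(extract_titles(sample_response))
```

To run it against a live endpoint, pass `request` to `urllib.request.urlopen` and feed the decoded JSON body to `extract_titles`.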