CLOCK project implementation plan

Posted on April 23rd, 2012 by Paul Stainthorp

The University of Lincoln and Cambridge University Library both delivered successful projects (Jerome and COMET) for the JISC Infrastructure for Resource Discovery Programme in 2011. Both projects produced outputs of interest to researchers, students, librarians, developers, and designers of bibliographic discovery environments.

The CLOCK project is harnessing the success of these two complementary initiatives and investigating new approaches to data creation and discovery in the library domain. In particular, CLOCK will investigate, propose, and develop new, web-based bibliographic tools which will make it easier for different users—cataloguers in academic libraries, and the “serious”, tech-savvy researcher—to find Open Bibliographic Data and incorporate that data into systems and workflows.

This is the CLOCK project implementation plan:

Aims, Objectives and Final Output(s) of the project

The CLOCK project’s overall aim is to challenge assumptions and drive innovation in libraries’ interaction with bibliographic data. The project team believe that an important part of this innovation will be to give serious consideration to the development of a national, open scholarly catalogue knowledgebase for the UK (“data.ac.uk/library” or “library.data.ac.uk”).

As a medium-term step towards this goal, CLOCK will explore options for updating and maintaining the shared platform on data.lincoln.ac.uk as an eventual service. Longer term maintenance of the Cambridge open data service will also be investigated.

The investigation will take an experimental approach, building upon the RDF encoded structured metadata released through the COMET project as a readily accessible resource for enrichment of data within the Jerome software environment. At this preliminary stage, four steps in record enrichment have been identified:

  1. Matching / negotiating of the best available Open bib data through common identifiers;
  2. The importance of a social/reputational aspect in identifying authoritative data;
  3. A process of harvesting a returned record (parts of a record) to be re-used;
  4. Enrichment, repair and cleansing of data in the knowledgebase (positive feedback loop).
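The four steps above can be illustrated with a minimal sketch. It is not the CLOCK implementation: the Record shape, the use of ISBN as the common identifier, and the per-source reputation scores are all assumptions made for illustration, and "enrichment" is reduced here to filling in fields that the local record lacks.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Record:
    """A deliberately simplified bibliographic record (hypothetical shape)."""
    source: str   # e.g. "jerome" or "comet"
    isbn: str     # the common identifier used for matching
    fields: dict = field(default_factory=dict)


def normalise_isbn(raw):
    """Strip hyphens and spaces so that identifiers compare cleanly."""
    return re.sub(r"[^0-9Xx]", "", raw).upper()


def enrich(local, candidates, reputation):
    """Return (enriched record, provenance of each harvested field)."""
    # Step 1: match candidate records on the common identifier.
    key = normalise_isbn(local.isbn)
    matches = [c for c in candidates if normalise_isbn(c.isbn) == key]

    # Step 2: prefer sources with the highest (assumed) reputation score.
    matches.sort(key=lambda c: reputation.get(c.source, 0.0), reverse=True)

    # Steps 3 and 4: harvest only the parts of each record that the local
    # record lacks, and note which source supplied each harvested value.
    enriched = dict(local.fields)
    provenance = {}
    for match in matches:
        for name, value in match.fields.items():
            if name not in enriched:
                enriched[name] = value
                provenance[name] = match.source
    return Record(local.source, local.isbn, enriched), provenance
```

Returning the provenance alongside the record keeps the origin of every harvested value explicit, which is the kind of information a reputational model would need in order to feed corrections back to the knowledgebase.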

By exploring aggregation, ‘data cleansing’ and enrichment through readily available open sources, the project hopes to highlight new distributed approaches to metadata production–cataloguing, storage and delivery, including minimal workflows for cataloguing around individual, disaggregated RDF elements. The project will explore ways to do this using automated techniques built around open reusable metadata.

Whilst the two million records published under the COMET project will act as a starting point for this process, the participants aim to utilize other sources, including images, table of contents data and related supplementary resources (geo-data, author biographies, etc). Through this, there will be an additional social aspect of the project, to identify and document other authoritative open data sources to consume and to report back on successes and failures.

Alongside a focus on enrichment through open data, the project will recognise that a ‘pure open’ information environment is still far from the norm. It will also investigate methods in which open data can be consumed by semi-open and commercial resource discovery services and how such services may themselves benefit from open approaches to data publishing.

Final Project Products

Primary outputs:

  1. An enhanced Open Bibliographic Dataset containing records sourced from Jerome, COMET, and other open data sources, permissively licensed, delivered over a fast API in a range of formats (e.g. MARC, RDF, JSON) as both whole records and disaggregated linked data, along with associated social/reputational metadata making explicit the provenance, history, and ‘pagerank’ measurements of each data element. All data and APIs produced will be published on data.lincoln.ac.uk and access will be maintained by the University of Lincoln on that platform for at least the next three years;
  2. A repository of Open Source software for gathering, manipulating and publishing such data plus public documentation for the APIs, clarifying in particular the utility of the ‘data cleansing’ and the social/reputational metadata in distributed cataloguing environments;
  3. A proposal for the continuation of the work of CLOCK toward the specific aim of establishing a distributed open scholarly catalogue knowledgebase for the UK (“data.ac.uk/library” or “library.data.ac.uk”).
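To make the first primary output more concrete, here is a rough sketch of how a single disaggregated data element might carry its provenance, history and rank, and how whole records could be reassembled from such elements. Every field name here is hypothetical, and the "keep the highest-ranked value" rule is only one possible policy; the actual CLOCK data model had not been defined at the time of writing.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DataElement:
    """One disaggregated element of a record (hypothetical field names)."""
    record_id: str
    name: str            # e.g. "title" or "publisher"
    value: str
    provenance: str      # which source supplied the value
    history: list = field(default_factory=list)   # earlier values, if any
    rank: float = 0.0    # assumed 'pagerank'-style reputation measurement


def assemble_record(elements):
    """Build a whole record by keeping the highest-ranked value per field."""
    best = {}
    for element in elements:
        current = best.get(element.name)
        if current is None or element.rank > current.rank:
            best[element.name] = element
    return {name: element.value for name, element in best.items()}


def to_json(elements):
    """Serialise the disaggregated elements, metadata included, as JSON."""
    return json.dumps([asdict(e) for e in elements], indent=2, sort_keys=True)
```

The same list of elements can therefore be served in two ways: through to_json as disaggregated data with its social/reputational metadata intact, or through assemble_record as a conventional whole record.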

Secondary outputs:

  1. User documentation: a formal clearly-documented user requirements analysis and evidence of user engagement (e.g. user ‘stories’);
  2. Contextual documentation: a published literature review;
  3. Technical documentation: an examination of relevant standards and processes for manipulating Open Bib Data, particularly via API, and a comparison-cum-synthesis of the parallel approaches to open data publishing taken by COMET and Jerome;
  4. Contributions toward the JISC Open Bibliographic Data guide, initially in the form of commentable public plans for implementation of the shared Lincoln-Cambridge datastore. These will be reviewed at regular intervals and will eventually build into guidance for other academic libraries on releasing data openly. An experimental focus will allow mistakes and development ‘wrong turns’ to be shared with a wider community;
  5. We will disseminate our work during and beyond the duration of the project. Progress will be communicated by regular blogging throughout the life of the project. Project members are active within the UK HE development and library communities. Through blogs, social networks and talks at events they will continue to act as champions for open data publishing, furthering the aims and objectives of the Discovery programme. In particular, we will showcase our work at relevant JISC workshops, and will produce public project documents according to the JISC branding guidelines, targeting specific and relevant audiences.

Wider Benefits to Sector & Achievements for Host Institution

Jerome provided a modern API-centric approach to open data services and discovery using NoSQL database technologies and Open Source search. COMET published over two million records under a Public Domain Data License, many of them available for query via an RDF-store/SPARQL endpoint. Tools and techniques to achieve this have also been released under a permissive software license.
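For readers unfamiliar with querying an RDF store, the following is a minimal sketch of how a client might build a request to a SPARQL endpoint for works with a given ISBN. The endpoint address is a placeholder, and the use of the Dublin Core dct:title and Bibliographic Ontology bibo:isbn properties is an assumption made for illustration; the vocabulary actually used in the COMET data may differ. Only the "query" request parameter comes from the SPARQL Protocol itself.

```python
from urllib.parse import urlencode

# Placeholder endpoint: replace with the address of a real SPARQL service.
ENDPOINT = "https://example.org/sparql"


def title_query(isbn):
    """Return a SPARQL query asking for titles of works with this ISBN."""
    # Refuse anything but letters and digits so that the value cannot
    # break out of the quoted literal in the query below.
    if not isbn.isalnum():
        raise ValueError("ISBN should contain only digits (and possibly X)")
    return (
        "PREFIX dct: <http://purl.org/dc/terms/>\n"
        "PREFIX bibo: <http://purl.org/ontology/bibo/>\n"
        "SELECT ?work ?title WHERE {\n"
        f'  ?work bibo:isbn "{isbn}" ;\n'
        "        dct:title ?title .\n"
        "} LIMIT 10"
    )


def request_url(isbn):
    """Build a GET URL carrying the query in the 'query' parameter."""
    return ENDPOINT + "?" + urlencode({"query": title_query(isbn)})
```

The sketch stops at building the URL, so it can be tried without network access; fetching the URL and parsing the response would depend on the result formats a particular endpoint offers.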

The CLOCK project aims to scope and develop powerful and usable API-based web services which will make it easy to locate available Open Bibliographic Data for a given bibliographic work. These services will be aimed predominantly at meeting the needs of:

  1. developers attached to academic libraries looking to build applications exploiting available Open Bibliographic Data, and techniques for interrogating and exploiting that data;
  2. cataloguers and library managers looking to innovate their resource description workflows as well as contribute to the corpus of Open Bib Data;
  3. the ‘serious’ and tech-savvy researcher, who may be keen to incorporate Open Bib Data in tools aimed at the user (discovery, citation/reference management software, repositories).

In addition:

  1. Students and staff at the University of Lincoln will benefit from a substantial increase in the size and ‘weight’ of Jerome/data.lincoln.ac.uk (“Quantity has its own quality”); from a refinement of the discovery interface; and from engagement with RDF/linked data;
  2. Cambridge University Library will benefit from aspects of the Jerome architecture (e.g. the use of schemaless databases and aggregated search indices), from the practical re-use of its own data (N.B. through this consumption of its own output, important lessons on RDF utility can be learned and shared; this methodology has already worked for CUL with its public-facing API project), and from the ‘proving’ of existing approaches through agile distributed development sprints;
  3. To wider HE, the project will demonstrate the value of such data (and the development method) to universities and the wider community, enabling future developments. CLOCK is an opportunity to demonstrate ‘real world’ open web services for libraries, including (i) APIs to enhance existing free or commercial Discovery environments, (ii) the making-accessible of emerging sources of open metadata (the BL table of contents; the outputs from the Open Bibliography 2 project), (iii) a distributed ‘data cleansing’ model (articulated at the COMET-Jerome hack day in August 2011) and a new, more open approach to cataloguing–resource description, and (iv) time and money savings for academic libraries through exploitation of the bibliographic commons and tools for engaging with it.
  4. Both institutions share a firm strategic commitment to open data publishing. It is the ambition of the project participants that any future major developments in national-level resource discovery learn and benefit from the experiences gained in this project. JISC, Discovery, and the community at large will benefit from demonstrations of the ‘real world’ discovery-enhancement tools described above, and from a robust public discussion of the parallel technologies for storing and manipulating bib data: RDF store vs. schemaless approaches.

Risk Analysis and Success Plan

  1. The principal risk to the success of the project would be an inability to appoint a suitable person to the position of developer in time for the start of the project. The recruitment process for this post began in February and was completed in March 2012.
  2. A related risk is that the other members of the project team (Paul Stainthorp; Ed Chamberlain) are involved in other JISC-funded projects. The project manager (PS) will take care to ensure that—while the work of the various projects may complement CLOCK—there is clear distinction between the goals and outputs of the various projects. Weekly–fortnightly iteration meetings for the CLOCK project will help to ensure this, and Lincoln has established the LNCD group to co-ordinate the work of its overlapping commitments.
  3. As always, there is a risk that key staff may be absent through illness. We will mitigate this through close collaboration via the web-based development tools, weekly–fortnightly iteration meetings, and periodic reviews of the project.

Risk Analysis (*overall risk = likelihood × severity):

Risk #   Likelihood 1-10   Severity 1-10   Overall risk 1-50*
1.       2                 9               18
2.       3                 4               12
3.       2                 3               6
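The overall scores follow directly from the stated formula, as this small worked example shows. The likelihood and severity figures are those given above for the three numbered risks; the function and variable names are ours, chosen for illustration.

```python
# (likelihood, severity) for each numbered risk, as stated in the plan.
risks = {
    1: (2, 9),   # developer not appointed in time
    2: (3, 4),   # team members committed to other JISC-funded projects
    3: (2, 3),   # key staff absent through illness
}


def overall_risk(likelihood, severity):
    """Overall risk is simply likelihood multiplied by severity."""
    return likelihood * severity


scores = {number: overall_risk(*pair) for number, pair in risks.items()}

# Rank the risks from most to least serious.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Ranked this way, recruitment of the developer (risk 1) comes out as the most serious of the three, which matches its description above as the principal risk.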

If the CLOCK project is a success, we anticipate it will have the following long-term effects (ETA up to one year after the end of the project):

  • Developers unconnected with Lincoln or Cambridge will exploit the APIs to build or enhance new open (and semi-open) bibliographic discovery services.
  • Academic libraries will incorporate Open Bib Data elements from CLOCK in their cataloguing–resource description workflows.
  • Serious researchers will use Open Bib Data elements from CLOCK in personal citation/reference management software.
  • A new social model of reputation in distributed cataloguing will have gained some traction in academic libraries.
  • Significant steps will have been taken toward a national, distributed open scholarly catalogue knowledgebase for the UK (“data.ac.uk/library” or “library.data.ac.uk”).

IPR

We have no objection to JISC making any part of this proposal available should the contents be requested under the Freedom of Information Act, or, if we are successful in our bid for funding, to our project proposal being made available on JISC’s website.

  1. Any additional bibliographic data or metadata created as a result of this project will be released under an open license that permits unrestricted re-use. Wherever possible, the Open Data Commons PDDL or CC0 will be used.
  2. All software outputs will be released under an appropriate Open Source licence (we will seek further advice from OSSWatch on the most appropriate licence).
  3. All documentation and blog posts will be released under the Creative Commons attribution share-alike licence, CC-BY-SA.

Project Team Relationships and End User Engagement

We intend to use the CLOCK blog (http://clock.blogs.lincoln.ac.uk/) to provide regular updates on the status of the project, and to provide links to working services and data. In addition to the JISC Discovery events, the developer community will be engaged through the growing data.ac.uk community and mailing list, and library staff through events such as a Mashed Library unconference planned for early July, with a cataloguing / open data theme.

The CLOCK project team will consist of:

Role                 Name             Institution               FTE/hours
Project manager      Paul Stainthorp  University of Lincoln     0.2 FTE
Lead researcher      Ed Chamberlain   University of Cambridge   0.2 FTE
Researcher           Chris Leach      University of Lincoln     0.1 FTE
External consultant  Owen Stephens    Owen Stephens Consulting  (set number of days)
Web developer        Andrew Beeken    University of Lincoln     0.2 FTE
Web developer        Trevor Jones     University of Lincoln     0.2 FTE
Project director     Ian Snowley      University of Lincoln     (Uncosted)

Paul Stainthorp is Electronic Resources Librarian at the University of Lincoln. Paul has several years’ experience of working with open metadata systems (repositories, journal article knowledgebases), and he successfully project-managed the Jerome project. Here he will act as project manager and (jointly with Ed Chamberlain) researcher: he will manage the project overall, produce reports and documentation for JISC, and lead the literature review and user engagement workpackages.

Ed Chamberlain is Systems Development Librarian at Cambridge University Library. Ed will act as lead researcher / internal technical consultant and provide additional project guidance. He brings extensive experience of project management, library systems implementation, metadata publishing and open licensing. In addition to managing the COMET project, he was responsible for releasing and documenting Cambridge’s existing APIs to library services. As lead researcher for CLOCK, he will be primarily responsible for the technical standards & methods workpackages, and for guiding the work of the developer.

Chris Leach is Systems Librarian at the University of Lincoln. With more than 30 years’ experience in a range of technical library roles, Chris’s focus in CLOCK will be to support the analysis of existing and emerging library data standards, and to support the work of the developers.

Owen Stephens is a library consultant, and for this project will provide consultancy and advice to put the work into a national context, relating CLOCK to the wider movement toward open data and the work of the JISC Discovery initiative. Owen has a technical background in libraries with experience of service delivery and strategic planning. He has been responsible for a number of innovative projects at both institutional and national levels.

Andrew Beeken and Trevor Jones have been appointed (March 2012) as developers on CLOCK. They will act as lead programmers on the project, making use of the iterative development tools described below. They will also participate in the user requirements analysis and the review of existing data standards.

Ian Snowley, the University Librarian at the University of Lincoln, will act as project sponsor and director.

Projected Timeline, Workplan & Overall Project Methodology

The University of Lincoln has an established and rapidly maturing agile, iterative, distributed approach to web development, supported by tools including CodeIgniter, GitHub, Google Groups, Pivotal Tracker and WordPress. This methodology has served previous JISC-funded projects well and will again be employed here. Tools used will be exclusively web-based, allowing staff from Lincoln, Cambridge, JISC and elsewhere to participate.

The project will end on 31 July 2012. Because of the iterative approach to development, there will be continual gathering, analysis and documentation of user/technical requirements throughout the project. Results will be disseminated via a project blog, community events, the Discovery newsletter, etc., and via more formal channels (e.g. journal articles in scholarly and trade publications for libraries) where appropriate.

High-level plan of workpackages:

Workpackage/Month Feb 2012 Mar 2012 Apr 2012 May 2012 Jun 2012 Jul 2012
1. Project initiation X
2. Community engagement X X X X X X
3. Literature review X X X
4. Gather user requirements X X X X X
5. Assess and describe existing sources of open data for harvesting X X X
6. Evaluation of technical standards & methods X X X
7. Technical development, testing and verification X X X X X
8. Documentation X X X X X X
9. Project evaluation X X
10. Dissemination X X X X X X
11. Project close X

Budget

We propose that the follow-on funding sought will be used to cover the time of the team members at Lincoln and Cambridge, and to fund part of the new post of web developer. In total, 45% of the funding sought will go on staff (incurred appointments and institutional staff allocations): this is appropriate for a project where a high level of expertise must be applied. Apart from on-costs and travel/expenses, the other significant expense is that of the consultancy work, which is necessary to ensure a wider application and scope for the CLOCK project than was the case in Jerome or COMET, both of which were more firmly rooted in their respective institutions.

Breakdown of the budget:

Projects Web Developer, 0.6 FTE 18.63%
Recruitment 0.68%
Equipment 3.02%
Travel 1.51%
Consultancy 4.98%
Directly Incurred Total 28.81%

Directly Allocated Staff* 22.99%
Estates (Lincoln) 6.04%
Estates (Cambridge) 1.03%
Directly Allocated Total 30.06%

Indirect Costs (Lincoln) 33.1%
Indirect Costs (Cambridge) 8.03%
Indirect Costs Total 41.13%

Total Project Cost £ 66,329.20 (100.0%)
Amount Requested from JISC £ 49,879.56 (75.2%)
Institutional Contributions £ 16,449.64 (24.8%)
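As a quick consistency check on the figures above, the short calculation below recomputes the section totals and the split between JISC and institutional funding. All numbers are taken from the breakdown; the variable and function names are ours. Decimal is used rather than floating point so that pence and hundredths of a percent add up exactly.

```python
from decimal import Decimal, ROUND_HALF_UP

total_cost = Decimal("66329.20")
requested = Decimal("49879.56")       # amount requested from JISC
institutional = Decimal("16449.64")   # institutional contributions

# Percentage lines of the "Directly Incurred" section.
directly_incurred = [Decimal(p) for p in ("18.63", "0.68", "3.02", "1.51", "4.98")]

# The three stated section totals, as percentages of total project cost.
section_totals = [Decimal("28.81"), Decimal("30.06"), Decimal("41.13")]


def share(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return (part / whole * 100).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


requested_share = share(requested, total_cost)
institutional_share = share(institutional, total_cost)

# The individual lines sum to 28.82 against a stated total of 28.81;
# the 0.01 difference is presumably a rounding effect in the published figures.
incurred_sum = sum(directly_incurred)
```

The two funding amounts add up to the total project cost exactly, the three section totals add up to 100%, and the computed shares agree with the 75.2% and 24.8% quoted above.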
