Posts Tagged ‘Trevor Jones’

Slides on the CLOCK project for #Mashcat (Cambridge mashed library cataloguing event)

Posted on July 5th, 2012 by Paul Stainthorp

A whole contingent from Lincoln (Andrew Beeken, Trevor Jones, Elif Varol and I) are at the Cambridge University Clinical School at Addenbrooke’s Hospital in Cambridge for a mashed library event: Mashcat.

Mashcat is “a mashed library event focussing on cataloguing data. For cataloguers, developers and anyone else with an interest in how library catalogue data can be created, manipulated, used and re-used by computers and software”. It’s being sponsored by DevCSI.

We’re presenting about the CLOCK project to a room full of cataloguers. No pressure. The slides are online at: http://lncn.eu/hknp

It’s a model and it’s looking good

Posted on May 23rd, 2012 by Paul Stainthorp

Ever since the CLOCK meeting we had in Peterborough, I’ve been trying to describe how open linked bib data might open up new models of ‘cataloguing’, resource description, and (by extension) presentation of bibliographic information to a user of a discovery system.

I’ve found it quite difficult to articulate these ideas without resorting to vague hand gestures and gibberish. At the recent CLOCK hack days at the CARET offices in Cambridge, we finally managed to capture these models on paper [actually, we used Lucidchart]. Thanks to Ed Chamberlain and Trevor Jones for taking notes as we talked through the various models, and to Ed’s colleague @ppetej for acting as a sounding board and critical friend.

The diagrams describe cataloguing processes real and hypothetical. They use a kind of pseudo-scientific notation which I find helpful; feel free to ignore it if you don’t.

Also, a cop-out disclaimer: these are rough sketches, not polished theses. Please feel free to jump in and criticise, tweak, suggest improvements. If you understand Linked Data, we’re really interested in your comments about how these models could be physically represented. We’re not trying to suggest that any one of these models has all the answers or could be a ‘just-plug-it-in’ replacement for current practice, and we don’t intend to write software as part of the CLOCK project that will make these a reality. But somewhere in the middle, we think there might be ideas or threads that are worth tinkering with and following up.

1.

The first diagram attempts to describe copy cataloguing as libraries currently understand it, and involves the transfer of MARC records between institutions. When someone catalogues a book or resource, they tend to copy an existing record from another database, alter it to their needs and use it as they see fit. The record of any changes made is lost. Over time, this convention results in many unconnected versions of a record. N.B.:

  • The ‘donor’ institution (X) has a certain reputation, which is why the ‘recipient’ institution X′ chooses to copy its records.
  • Cataloguers at recipient institutions add, delete or change individual data elements according to local practice, preference or prejudice, or to correct errors. R and R′ are now effectively different entities with no described relationship between them. There is no record of the properties of changes made; no concept of an ‘edit history’.
  • This diagram does not go so far as to include the role of the union catalogue (e.g. Copac, Newton) – where R, R′, R″, R‴, etc., are re-combined (munged was the word we used!) to produce a single, new, averaged record (which is itself just another version of R).
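To make the first model a bit more concrete, here is a minimal Python sketch of copy cataloguing and union-catalogue munging. It is only an illustration under some loud assumptions: a record is flattened to a plain dictionary (real MARC is far richer), the function names `copy_catalogue` and `munge` are made up for this example, and the 'averaging' is a simple most-common-value vote, which is not how any particular union catalogue is known to work.

```python
import copy
from collections import Counter

# Hypothetical, heavily simplified record R held by donor institution X.
record_r = {
    "title": "Middlemarch",
    "author": "Eliot, George",
    "subject": "English fiction",
}

def copy_catalogue(donor_record, local_changes):
    """Copy a donor record and apply local edits.

    Nothing links the copy back to the donor and the edits themselves
    are not recorded: this is the lost 'edit history' of model 1.
    """
    local = copy.deepcopy(donor_record)
    for element, value in local_changes.items():
        if value is None:
            local.pop(element, None)   # a local deletion
        else:
            local[element] = value     # a local addition or change
    return local

def munge(records):
    """Assumed union-catalogue merge: most common value per data element."""
    elements = {element for record in records for element in record}
    merged = {}
    for element in sorted(elements):
        values = [r[element] for r in records if element in r]
        merged[element] = Counter(values).most_common(1)[0][0]
    return merged

# Two recipient institutions each make their own unconnected version.
r_prime = copy_catalogue(record_r, {"subject": "Fiction in English"})
r_double_prime = copy_catalogue(
    record_r, {"author": "Eliot, George, 1819-1880", "subject": None}
)

# The munged record is itself just another version of R.
union_record = munge([record_r, r_prime, r_double_prime])
```

After this runs there are four dictionaries describing the same book, and nothing in any of them says how it relates to the others.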

Cataloguing workflow diagram 1 of 3

 

2.

In the second model, which we described variously (and possibly not entirely accurately) as wiki-ish, Github-ish, OpenLibrary-ish, and LibraryThing-ish, there is only one, shared/community version of a bibliographic record for a given work, out on the web somewhere. Various institutions/their discovery systems all agree to use this one record.

  • The record is changed incrementally, one constituent data element at a time. Probably only the most recent version of the record is viewable/queryable by users and applications, although an edit history may exist and so older versions of records may be recoverable.
  • Changes are made by editors who might be cataloguers-at-institutions-with-reputations… or might not be. We’ve assumed that in this model institutional reputation is far less important. (On the Internet no-one knows you’re a cataloguer.)
  • This model doesn’t necessarily have to exist along a single timeline (although that’s how it’s shown here) – code-repository-style branching and merging is conceivable.
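As a rough sketch of the second model, the shared record can be thought of as nothing more than an ordered list of single-element edits, with the 'current' record rebuilt by replaying them. The class names `Edit` and `SharedRecord`, the editor identifiers and the sample values are all hypothetical, and this sketch follows a single timeline only (no branching or merging).

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Edit:
    editor: str      # who made the change (hypothetical identifier)
    element: str     # the single data element being changed
    value: object    # the new value, or None to delete the element

@dataclass
class SharedRecord:
    history: list = field(default_factory=list)

    def apply(self, edit):
        """Record one incremental change; earlier edits are never removed."""
        self.history.append(edit)

    def version(self, n=None):
        """Rebuild the record as it stood after the first n edits.

        With n=None the latest version is returned, which is what most
        users and applications would see.
        """
        edits = self.history if n is None else self.history[:n]
        state = {}
        for edit in edits:
            if edit.value is None:
                state.pop(edit.element, None)
            else:
                state[edit.element] = edit.value
        return state

record = SharedRecord()
record.apply(Edit("cataloguer@X", "title", "Middlemarch"))
record.apply(Edit("anonymous", "author", "Eliot, George"))
record.apply(Edit("cataloguer@Y", "author", "Eliot, George, 1819-1880"))

latest = record.version()     # what discovery systems would show
earlier = record.version(2)   # an older version, recovered from the history
```

Note that the editor's reputation plays no part in which value wins here: the most recent edit simply replaces the previous one.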

Cataloguing workflow diagram 2 of 3

3.

The third, final, and most speculative model is also the most complex and probably the most poorly defined, but I think the most interesting. It’s also very Linked Data.

In any resource description ‘ecosystem’, there will always be multiple versions of a description of an entity out there somewhere (see scenario #1), each providing some unique or particular value to a specific audience. Cataloguers may benefit from a workflow that allows them to view these multiple descriptions and choose the specific assertions from each description that are most relevant to their target audience. In this model:

  • The notion of a series of discrete, changeable ‘records’ largely disappears (Where to? But should it?), to be replaced by a whole mass of overlapping individual data assertions about different aspects of the entity, derived from all manner of different sources. Multiple assertions which are trying to say the same thing about an entity can co-exist.
  • Assertions have additional properties which define and qualify them.
  • No assertion is ever destroyed – though it may be awarded properties which render it superseded or deprecated. Relationships between assertions are maintained.
  • Assertions are assembled on-the-fly into any number of transient Record Representations (RR) which are not permanently stored (though could be cached) according to a set of criteria which we’ve called here a filter. A filter defines a ‘recipe’ for specific data assertions to be included or excluded in the Record Representation, and/or specifies preferences for assertions with particular properties. A discovery tool becomes a device to store filters, and to build Record Representations. Data assertions may be stored elsewhere – and distributed across multiple datastores.
  • Filters could be defined manually by a user, as a set of preferences within a discovery tool. For instance: a second-year Chinese medical student at a particular university could choose to see assertions in Mandarin, to prefer MeSH subject headings over Library of Congress, and to include notes, URLs and local physical holdings information relevant to the university they study at [added by cataloguers who work_at the same institution at which they work/study]…
  • …alternatively, filters could be defined more passively: using ‘clues’ from the user’s institutional context, geolocation, or profile on external social networks (“show me records like my friends see” or even “show me records like people with similar research interests as me see”) to build a personalised filter (leading to personalised Record Representations that no-one else sees).
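One way to picture the third model is the Python sketch below: a pool of individual assertions, a filter expressing a user's preferences, and a function that assembles a transient Record Representation on the fly. Everything here is an assumption made for illustration (the `Assertion` and `Filter` classes, their property names, the entity identifier and the placeholder values); a real implementation would presumably live in a Linked Data store rather than in Python objects.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Assertion:
    entity: str               # identifier of the thing being described
    prop: str                 # e.g. "title" or "subject"
    value: str
    source: str               # who made the assertion
    language: str = "en"
    scheme: str = ""          # e.g. "MeSH" or "LCSH" for subject headings
    deprecated: bool = False  # superseded assertions are kept, not destroyed

@dataclass
class Filter:
    languages: tuple = ("en",)     # in order of preference
    preferred_schemes: tuple = ()  # in order of preference
    include_deprecated: bool = False

    def accepts(self, a):
        if a.deprecated and not self.include_deprecated:
            return False
        return a.language in self.languages

    def rank(self, a):
        """Lower is better: preferred language first, then preferred scheme."""
        if a.scheme in self.preferred_schemes:
            scheme_rank = self.preferred_schemes.index(a.scheme)
        else:
            scheme_rank = len(self.preferred_schemes)
        return (self.languages.index(a.language), scheme_rank)

def build_representation(entity, assertions, flt):
    """Assemble a transient Record Representation: best assertion per property."""
    candidates = [a for a in assertions if a.entity == entity and flt.accepts(a)]
    representation = {}
    for a in sorted(candidates, key=flt.rank):  # stable sort keeps pool order on ties
        representation.setdefault(a.prop, a.value)
    return representation

# Placeholder assertions from two hypothetical sources.
pool = [
    Assertion("book:1", "title", "Example title (en)", "library-a"),
    Assertion("book:1", "title", "Example title (zh)", "library-b", language="zh"),
    Assertion("book:1", "subject", "Heading B (LCSH)", "library-a", scheme="LCSH"),
    Assertion("book:1", "subject", "Heading A (MeSH)", "library-b", scheme="MeSH"),
    Assertion("book:1", "title", "Old title", "library-a", deprecated=True),
]

student_filter = Filter(languages=("zh", "en"), preferred_schemes=("MeSH", "LCSH"))
student_view = build_representation("book:1", pool, student_filter)
default_view = build_representation("book:1", pool, Filter())
```

The same pool yields two different Record Representations, neither of which is stored anywhere. The discovery tool only needs to hold the filters.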

Key questions: What’s the value added by this model over others? Are there any individual ideas from one model that could be applied to another, even if the model as a whole is too complex?

Cataloguing workflow diagram 3 of 3

CLOCK notes – 8 May 2012

Posted on May 8th, 2012 by Paul Stainthorp

This is what the CLOCK project team are currently up to (from meetings over the past couple of weeks and from notes made at the recent ‘Discovery: making sure your resources are discovered, used and reused’ event in Birmingham):

  • Andrew Beeken has been exploring the Cambridge COMET data via its SPARQL endpoints and has already blogged about the process of using SPARQL to “build kind of a ‘Hello World’ of open data querying”. He’s now looking at the recently-released Harvard open bib data and comparing the speed, the use of matching namespaces, and the use of JSON vs RDF/XML.
  • This work is leading up to unified search and presentation of records from several sources (Cambridge/COMET, Harvard, Lincoln/Jerome, OpenLibrary, etc.). Andrew and Trevor Jones are collaborating on drawing up a high-level architecture for CLOCK, and a strategy for expressing Linked Data, which will be shared with the rest of the project team (and publicly) for discussion.
  • To support this, Alex Bilbie in ICT services at Lincoln is helping to get the original Jerome application up and running on the CLOCK server (jerome.library.lincoln.ac.uk), where it can be used as a stable platform for developing and RDF-ifying Lincoln’s own bib data.
  • Trevor Jones and Ed Chamberlain will work together on developing the work with users (in parallel, at the University of Lincoln and the University of Cambridge) to clarify their requirements for bibliographic data:
    • For cataloguers, based around a rethink of copy cataloguing workflows, we will try to tease out requirements from talking to cataloguers (and associated subject librarians) asking to be ‘positively disrupted’: what do they need to do? What is missing from their data?
    • For researchers, we will build on some initial user walkthrough analysis done by Trevor and Andrew in Lincoln, with performing arts students in LPAC (the Lincoln Performing Arts Centre). What are the research questions that users are trying to answer? How does bib data help them answer those questions? What’s missing? Ed and Trevor will agree on a set of questions and tasks.
    • These requirements will be used to feed the remaining cycles of platform development for CLOCK.
  • Ed Chamberlain will act as the conduit between CLOCK and related projects in the Discovery strand, looking for points of shared interest/technology, and blogging (or asking others to blog) about aspects of one project which can inform the others. The other projects in which Ed is involved are: the Open Education Metadata UK (OEM-UK) project at the Institute of Education (shared interest in new user interfaces for cataloguing – possibly use screencasts to demonstrate alternative workflows?) and the Open Bibliography 2 project (lots of potential technical overlap – BibJSON, JSON-LD, BibSoup.net, expression in RDF container formats).
  • Ed and I (Paul Stainthorp) will work on developing the ‘business case’ / sustainability of CLOCK and data.*.ac.uk, following up on themes discussed in the recent Discovery event, and thinking not only about institutional funding / high-level support for open bib data, but also what it takes to move open bib data publishing from a development environment into an institutionally-supported, ICT-run service.
  • Finally, PS is arranging a couple of internal CLOCK ‘hack days’ (to take place on 17th-18th May, in Cambridge) – more details to follow.

The technical approach: a CLOCK dev stack

Posted on May 2nd, 2012 by Paul Stainthorp

A note on technical development:

We’re beginning to make some progress towards a framework for development in the CLOCK project. Project developers Trevor Jones and Andrew Beeken, with the support of the other developers in LNCD, now have the following at their fingertips:

That list should give you an idea of LNCD’s approach to development. [N.B. some links may not be publicly accessible.]