Ever since the CLOCK meeting we had in Peterborough, I’ve been trying to describe how open linked bib data might open up new models of ‘cataloguing’, resource description, and (by extension) presentation of bibliographic information to a user of a discovery system.
I’ve found it quite difficult to articulate these ideas without resorting to vague hand gestures and gibberish. At the recent CLOCK hack days at the CARET offices in Cambridge, we finally managed to capture these models on paper [actually, we used Lucidchart]. Thanks to Ed Chamberlain and Trevor Jones for taking notes as we talked through the various models, and for Ed’s colleague @ppetej for acting as a sounding board and critical friend.
The diagrams describe cataloguing processes real and hypothetical. They use a kind of pseudo-scientific notation which I find helpful; feel free to ignore it if you don’t.
Cop-out disclaimer: these are rough sketches, not polished theses. Please feel free to jump in and criticise, tweak, suggest improvements. If you understand Linked Data, we’re really interested in your comments about how these models could be physically represented. We’re not trying to suggest that any one of these models has all the answers or could be a ‘just-plug-it-in’ replacement for current practice, and we don’t intend to write software as part of the CLOCK project that will make these a reality. But: somewhere in the middle, we think there might be ideas or threads that are worth tinkering with and following up.
The first diagram attempts to describe copy cataloguing as libraries currently understand it, and involves the transfer of MARC records between institutions. When someone catalogues a book or resource, they tend to copy an existing record from another database, alter it to their needs and use it as they see fit. The record of any changes made is lost. Over time, this convention results in many unconnected versions of a record. N.B.:
- The ‘donor’ institution (X) has a certain reputation, which is why the ‘recipient’ institution (X′) chooses to copy its records.
- Cataloguers at recipient institutions add, delete or change individual data elements according to local practice, preference or prejudice, or to correct errors. R and R′ are now effectively different entities with no described relationship between them. There is no record of the properties of changes made; no concept of an ‘edit history’.
- This diagram does not go so far as to include the role of the union catalogue (e.g. Copac, Newton) – where R, R′, R″, R‴, etc., are re-combined (munged was the word we used!) to produce a single, new, averaged record (which is itself just another version of R).
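For anyone who thinks better in code than in diagrams, here is a minimal Python sketch of the copy cataloguing model described above. It is only an illustration: the record fields, the values and the `copy_catalogue` function are all invented for the example and have nothing to do with real MARC structures.

```python
import copy

# Hypothetical, heavily simplified record R held by donor institution X.
record_r = {"title": "Example title", "subjects": ["Medicine"], "notes": []}


def copy_catalogue(donor_record, local_changes):
    """Return an independent copy of a record with local changes applied.

    Nothing links the copy back to its source, and nothing records
    which fields were changed or why.
    """
    local = copy.deepcopy(donor_record)
    local.update(local_changes)
    return local


# Recipient institution X' takes a copy and edits it to suit local practice.
record_r_prime = copy_catalogue(record_r, {"subjects": ["Medical sciences"]})

# A later correction made at X never reaches X'.
record_r["title"] = "Example title (corrected)"

print(record_r_prime["title"])     # the copy still carries the old title
print(record_r == record_r_prime)  # the two versions have diverged
```

The point of the sketch is what is missing: R′ holds no pointer to R and no list of changes, so there is nothing from which an edit history could be rebuilt.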
In the second model, which we described variously (and possibly not entirely accurately) as wiki-ish, Github-ish, OpenLibrary-ish, and LibraryThing-ish, there is only one, shared/community version of a bibliographic record for a given work, out on the web somewhere. Various institutions/their discovery systems all agree to use this one record.
- The record is changed incrementally, one constituent data element at a time. Probably only the most recent version of the record is viewable/queryable by users and applications, although an edit history may exist and so older versions of records may be recoverable.
- Changes are made by editors who might be cataloguers-at-institutions-with-reputations… or might not be. We’ve assumed that in this model institutional reputation is far less important. (On the Internet no-one knows you’re a cataloguer.)
- This model doesn’t necessarily have to exist along a single timeline (although that’s how it’s shown here) – code-repository-style branching and merging is conceivable.
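The second model could be sketched along the following lines. Again, this is a rough Python illustration under our own assumptions rather than a description of how OpenLibrary, LibraryThing or any wiki actually stores things: the `SharedRecord` and `Edit` classes, the editor names and the field names are all made up, and the sketch only covers a single timeline with no branching, merging or deletion.

```python
from dataclasses import dataclass, field


@dataclass
class Edit:
    editor: str    # who made the change (their reputation is unknown)
    element: str   # the single data element that changed
    old: object    # value before the edit, or None if the element was new
    new: object    # value after the edit


@dataclass
class SharedRecord:
    elements: dict = field(default_factory=dict)  # the current version
    history: list = field(default_factory=list)   # every edit, oldest first

    def change(self, editor, element, value):
        """Change one data element and log the edit."""
        old = self.elements.get(element)
        self.history.append(Edit(editor, element, old, value))
        self.elements[element] = value

    def version(self, n):
        """Rebuild the record as it stood after the first n edits."""
        state = {}
        for edit in self.history[:n]:
            state[edit.element] = edit.new
        return state


record = SharedRecord()
record.change("cataloguer_at_x", "title", "Exmaple title")
record.change("anonymous_editor", "title", "Example title")  # typo fixed
record.change("cataloguer_at_y", "subject", "Medicine")

print(record.elements)    # what users and applications would query
print(record.version(1))  # an older version recovered from the history
```

Everyone reads `record.elements`, the most recent version, while `record.history` is what makes older versions recoverable.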
The third, final, and most speculative model is also the most complex and probably the most poorly defined, but I think the most interesting. It’s also very Linked Data.
In any resource description ‘ecosystem’, there will always be multiple versions of a description of an entity out there somewhere (see scenario #1), each providing some unique or particular value to a specific audience. Cataloguers may benefit from a workflow that allows them to view these multiple descriptions and choose the specific assertions from each description that are most relevant to their target audience. In this model:
- The notion of a series of discrete, changeable ‘records’ largely disappears (Where to? But should it?), to be replaced by a whole mass of overlapping individual data assertions about different aspects of the entity, derived from all manner of different sources. Multiple assertions which are trying to say the same thing about an entity can co-exist.
- Assertions have additional properties which define and qualify them.
- No assertion is ever destroyed – though it may be awarded properties which render it superseded or deprecated. Relationships between assertions are maintained.
- Assertions are assembled on the fly, according to a set of criteria which we’ve called here a filter, into any number of transient Record Representations (RR). These are not permanently stored (though they could be cached). A filter defines a ‘recipe’ for specific data assertions to be included or excluded in the Record Representation, and/or specifies preferences for assertions with particular properties. A discovery tool becomes a device to store filters, and to build Record Representations. Data assertions may be stored elsewhere – and distributed across multiple datastores.
- Filters could be defined manually by a user, as a set of preferences within a discovery tool. For instance: a second year Chinese medical student at a particular university could choose to see assertions in Mandarin, to prefer MeSH subject headings over Library of Congress, and to include notes, URLs and local physical holdings information relevant to the university they study at [added by cataloguers who work_at the same institution at which they work/study]…
- …alternatively, filters could be defined more passively: using ‘clues’ from the user’s institutional context, geolocation, or profile on external social networks (“show me records like my friends see” or even “show me records like people with similar research interests as me see”) to build a personalised filter (leading to personalised Record Representations that no-one else sees).
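To make the assertions-and-filters idea a little more concrete, here is a rough Python sketch of how a filter might pick assertions and build a Record Representation. Everything in it is an assumption made for illustration: the `Assertion` and `Filter` classes, the property names (`language`, `scheme`, `deprecated`), the institution names and all the data values are invented, and a real Linked Data implementation would look quite different.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Assertion:
    aspect: str               # what is being described, e.g. "subject"
    value: str
    source: str               # who made the assertion
    language: str = "en"
    scheme: str = ""          # e.g. "MeSH" or "LCSH" for subject headings
    deprecated: bool = False  # superseded assertions are flagged, not deleted


@dataclass
class Filter:
    languages: list                              # most preferred first
    schemes: list = field(default_factory=list)  # most preferred first
    local_source: str = ""                       # the user's own institution

    def rank(self, a):
        """Lower tuples sort first; values not listed go last."""
        def position(value, preferred):
            return preferred.index(value) if value in preferred else len(preferred)
        return (position(a.language, self.languages),
                position(a.scheme, self.schemes))


def build_representation(assertions, flt):
    """Assemble a transient Record Representation; nothing is stored."""
    representation = {}
    live = [a for a in assertions if not a.deprecated]
    for aspect in {a.aspect for a in live}:
        candidates = [a for a in live if a.aspect == aspect]
        if aspect in ("note", "holding"):
            # Local material: keep only what the user's institution asserted.
            chosen = [a for a in candidates if a.source == flt.local_source]
        else:
            chosen = [min(candidates, key=flt.rank)]
        if chosen:
            representation[aspect] = [a.value for a in chosen]
    return representation


# Invented assertions about one entity, drawn from several sources.
assertions = [
    Assertion("title", "Example title", "library_a"),
    Assertion("title", "示例标题", "library_b", language="zh"),
    Assertion("subject", "Anatomy", "library_a", scheme="MeSH"),
    Assertion("subject", "Human anatomy", "library_c", scheme="LCSH"),
    Assertion("subject", "Anatomy, Human", "library_c", scheme="LCSH",
              deprecated=True),
    Assertion("holding", "Shelfmark QS 4", "university_u"),
    Assertion("holding", "Shelfmark 611 GRA", "library_a"),
]

# The medical student's manually defined filter from the example above.
student_filter = Filter(["zh", "en"], ["MeSH", "LCSH"], "university_u")
student_view = build_representation(assertions, student_filter)

# A different filter over the same assertions gives a different view.
other_filter = Filter(["en"], ["LCSH"], "library_a")
other_view = build_representation(assertions, other_filter)

print(student_view)
print(other_view)
```

Both views are built from the same pool of assertions and neither is saved anywhere. A passively defined filter would only differ in where the values passed to `Filter` come from (institutional context, geolocation, a social profile), not in how the representation is assembled.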
Key questions: What’s the value added by this model over others? Are there any individual ideas from one model that could be applied to another, even if the model as a whole is too complex?