A number of research libraries, including the one where I work, are increasingly interested in catalogs built on linked data rather than on MARC records. Linked data work has been going on for years, in the hopes of reaping benefits that include “decreasing redundant cataloging work, and increasing visibility of library resources and interoperability with non-library systems” (to quote from a Wikipedia summary of the W3C’s Library Linked Data Incubator Group Final Report from 2011).
One of the key concepts introduced in most linked data catalogs is the concept of a “work”, an entity that describes literary and intellectual creations in general that are manifested in one or more specific editions (the entities described in MARC records). Back in 2010, I discussed what works were, and how they were represented in various online systems. In a follow-up post, I discussed a basic model for works that I had started to use in my own Online Books Page. (I continue to use that model there, though to date I’ve only created a few dozen work records.)
When I wrote those posts, most library catalogs were still firmly MARC-based. MARC still dominates in practice, but BIBFRAME and other linked data initiatives have gained traction in libraries, making work information models and data increasingly important. When brought into regular use, work-based catalogs may bring about major changes in how we do cataloging work, and how our users discover our resources. It’s a good time, then, to consider whether the linked data library catalogs and data models we’re beginning to adopt are working in the ways we want.
In considering this question, it’s worth asking: How do we want our work cataloging to work? Here are my answers, and some of the concerns those answers raise for me:
The model for works should be simple and flexible, so that our users can understand it and use it in a wide variety of information-acquisition scenarios. It’s easy to create unnecessary complications in our models of works. In my 2010 posts, I noted that FRBR had a two-tier model of “works” and “expressions”. I recommended instead a general model of “works” that covered pretty much any grouping of information resources that shared a common set of characteristics that a user might be seeking. Depending on what was being sought, work groupings could be as tightly defined as hardcover and paperback editions of identical printed pages, or as loose and wide as all religious text compilations commonly called Bibles. (There could also be many groupings between these extremes.)
I was glad to see early BIBFRAME models collapse the FRBR “Work” and “Expression” distinction into a single Work concept. But it now looks to me like BIBFRAME’s Works might not be defined or handled as flexibly as I had hoped. Hence, Share-VDE (SVDE) has found it necessary to introduce “Superworks”, and the Library of Congress has proposed “Hubs”, both of which appear intended to cover wider groupings than BIBFRAME’s Works cover. It’s not clear to me that those new concepts will be intelligible to users, or that they will suffice to cover all the kinds of groupings that might be relevant to users’ searches.
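To make the idea of a single, flexible grouping concept a little more concrete, here is a minimal sketch in Python. It is only an illustration of the general approach described above, not the model used by BIBFRAME, SVDE, the Library of Congress, or The Online Books Page. The `Work` class, its field names, and the edition identifiers are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Work:
    """A grouping of resources sharing characteristics a user might seek.

    Hypothetical model: one type covers both tight and loose groupings.
    """
    label: str
    editions: set[str] = field(default_factory=set)       # made-up edition ids
    narrower: list["Work"] = field(default_factory=list)  # tighter groupings

    def all_editions(self) -> set[str]:
        """Collect editions from this grouping and every narrower one."""
        found = set(self.editions)
        for work in self.narrower:
            found |= work.all_editions()
        return found


# A tight grouping: hardcover and paperback printings of the same pages.
kjv = Work("King James Bible", editions={"ed:kjv-hardcover", "ed:kjv-paperback"})
# Another grouping at a similar level.
douay = Work("Douay-Rheims Bible", editions={"ed:douay-1899"})
# A loose grouping: compilations commonly called Bibles.
bibles = Work("Bibles", narrower=[kjv, douay])

print(sorted(bibles.all_editions()))
```

In a sketch like this, a wider grouping is just another `Work` that points to narrower ones, so a user can search at whatever level suits them without learning a separate concept for each tier.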
Work catalog data should be creatable and maintainable by either humans or machines, as appropriate. A number of the linked data catalogs now being built create work entities automatically, using algorithms to cluster catalog records that seem to be related. Automated work clustering is indeed important at scale, particularly when you consider not only books but also articles. Projects like Unpaywall cluster millions of published articles with their preprints and other free alternatives, to aid in open access to research, and at that scale need to build most clusters automatically. But we can’t let machines have the last word on creating and clustering work entities. There may be many forms of work groupings that readers find important, and that machines can’t easily sort out. Human catalogers are often the best determiners of how to set up, maintain, describe, and annotate groupings that are relevant to human readers.
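As a rough illustration of machines and humans sharing this job, the Python sketch below clusters records automatically by a normalized title and author, then applies merges that a cataloger has decided on. The matching rule is deliberately naive and is my own assumption, not the algorithm used by Unpaywall or by any particular catalog. The function names, record fields, and record ids are hypothetical.

```python
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    kept = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(kept.split())


def cluster_records(records, merges=()):
    """Group record ids by normalized (title, author), then apply human merges.

    `merges` holds (record_id, record_id) pairs that a cataloger has judged
    to belong to the same work, even though the automatic key differs.
    """
    by_key = defaultdict(set)
    for rec in records:
        key = (normalize(rec["title"]), normalize(rec["author"]))
        by_key[key].add(rec["id"])
    clusters = list(by_key.values())

    # The cataloger's decisions are applied after the automatic pass.
    for a, b in merges:
        ca = next(c for c in clusters if a in c)
        cb = next(c for c in clusters if b in c)
        if ca is not cb:
            ca |= cb
            clusters = [c for c in clusters if c is not cb]
    return clusters


records = [
    {"id": "r1", "title": "Moby-Dick; or, The Whale", "author": "Melville, Herman"},
    {"id": "r2", "title": "Moby Dick, or The Whale", "author": "Melville, Herman"},
    {"id": "r3", "title": "The Whale", "author": "Melville, Herman"},
]

automatic = cluster_records(records)
curated = cluster_records(records, merges=[("r1", "r3")])
```

Here the automatic pass groups the first two records but leaves “The Whale” (the title of the first British edition of Moby-Dick) on its own. The cataloger's merge brings it into the same work, which is the kind of correction a machine can't easily make unaided.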
Work identifiers and data should be maximally reusable. A major potential advantage of cataloging works distinctly from particular editions is that the information about those works can be shared broadly across all libraries that hold the work, and with all users that are interested in information about the work. But those advantages largely go away if every library catalog, vendor, and consortium mints and maintains its own work identifiers and data without coordination, or if the work identifiers and data are kept proprietary and have restrictions on their reuse, or if reusable identifiers can’t easily be created at scale by libraries or scholars interested in a work. Work identifiers should persist over the long term, resolve easily to usable metadata, and grow as comprehensively as our users need them to. The identifiers should also be reusable without restrictions, and the data associated with them should also have minimal restrictions. (In particular, any data necessary to clearly define what a work identifier refers to should be open, so that others can use that identifier without confusion.)
Work cataloging should not waste people’s time. The systems we use to catalog works, if well-designed, should support catalogers doing more with their time, not less. Shared work data can potentially cut down on the time required to catalog instances of works that someone already cataloged. But if work-level linked-data cataloging tools and environments are overly cumbersome, requiring more screens and slower data entry even for routine items than in existing cataloging environments, the worth of the work comes into question. Similarly, work-aware catalogs should make it easier and quicker for users to find what they want, and not harder due to unwanted complexities in work representation and display. Linked data catalogs should also support easy reference to their work identifiers and associated library data by others who want to write about, cite, or associate additional data with those works.
Maybe works in linked data catalogs will have all of the characteristics I’m asking for above. For those who have worked more directly than I have in designing, developing, and putting data into these new catalogs, are the things I’ve described above also the things you want out of works, or are there things I’m missing? How well do you think what’s currently being developed is satisfying these wants, and where do you think we need more work or more discussion? I’d be interested in hearing from (or reading) anyone who has useful thoughts on making the work that goes into works worthwhile.