I’ve recently uploaded two talks I gave this year to my Selected Works site. I presented “How Not to Waste Catalogers’ Time: Making the most of subject headings” at the Code4Lib 2016 conference in Philadelphia in March. “How to Read 100 Million Publications: VIVO and Comprehensive Open Publication Databases” was a talk I gave at VIVO 2016 last month. The versions I’ve deposited in Selected Works are PDF files that include both slides and notes, so you can see what I showed and (approximately) what I said during the presentations.
The talks can be seen as two complementary takes on the question “What’s the value of the catalog, and cataloging, in today’s networked environment?” In “How to Read 100 Million Publications” I highlight the value of an open, comprehensive database of the scholarly record that goes down to the article level. Such a database would be quite large (on the order of 100 million records, as the title suggests), and considerably finer-grained than present-day library-run catalogs, which typically catalog at the title level and only add a little bit of volume information at the holdings level. But as I describe in the talk, 100-million-record-scale databases are now routinely maintained, and regularly replicated, online, and even a “just the facts” database of who published what articles where could be very helpful in supporting preservation, open access, corpus analysis, research networking, and a variety of other applications. It would not need to be built from scratch; the data required for such a database already largely exists online, though all too often in fragmentary or proprietary collections. A number of these collections could be brought together and opened up without requiring much additional cataloging labor. Shared-effort collaborations, such as those that support GOKb and SHARE, as well as more crowd-sourced projects like FictionMags or Wikipedia, could support the scale needed.
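To make the idea concrete, here is a minimal sketch of what one “just the facts” article-level record might contain. The field names are my own illustration, not a proposed standard, and a real effort would need to settle identifier and name-disambiguation questions this sketch glosses over:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ArticleRecord:
    """One 'just the facts' entry: who published what article, where.

    Field names here are illustrative only, not a proposed standard.
    """
    title: str
    authors: list[str]                 # as-published name strings
    container: str                     # journal, proceedings, or magazine title
    year: int
    volume: Optional[str] = None
    issue: Optional[str] = None
    pages: Optional[str] = None
    identifiers: dict[str, str] = field(default_factory=dict)  # e.g. a DOI, if any

# At this granularity, 100 million records at a few hundred bytes each
# come to tens of gigabytes: easy to host, mirror, and analyze in bulk,
# which is what makes routine replication plausible.
```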
While “How to Read 100 Million Publications” proposes something new we could do with library catalogs, “How Not to Waste Catalogers’ Time” urges developers to pay close attention to what catalogers have already been doing with them, particularly the often-overlooked semantic richness of their subject cataloging. At the VIVO conference I saw a demo of a BIBFRAME application where a book’s subjects were represented by an unordered set of FAST properties. That’s easy enough to model as RDF linked data, but it omits the subject ordering that can improve relevance ranking and let readers distinguish the relative importance of subjects in a particular work. It also doesn’t accommodate the detailed, on-the-fly subject categories that coordinated subdivisions allow. It’s possible to accommodate both of these in linked data, but it requires more complicated data models, and most catalogs don’t take full advantage of the headings’ semantic strengths. My talk shows some examples of systems that can take advantage of them (as well as of the strengths of FAST and Wikipedia terminology). Whether or not we implement such systems broadly, I hope we continue to support ordered, coordinated subject headings in the underlying data of the next-generation catalogs we build. It’s easy enough to automatically map from such data to FAST term sets, for discovery systems that work best with those, but you can’t reliably and automatically go back the other way and keep the original degree of precision, as the sketch below illustrates.
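Here is a minimal sketch of that one-way mapping, with made-up example headings. Splitting on subdivision markers is a crude stand-in for a real LCSH-to-FAST faceting process, but it shows the direction of the information flow: flattening is trivial, while distinct sets of coordinated headings can flatten to identical term sets, so the reverse derivation would have to guess.

```python
def flatten_headings(headings):
    """Reduce an ordered list of coordinated headings (subdivisions
    separated by '--') to an unordered set of individual terms.
    A crude stand-in for a real LCSH-to-FAST mapping."""
    terms = set()
    for heading in headings:
        terms.update(part.strip() for part in heading.split("--"))
    return terms

# Two hypothetical works, cataloged with different primary headings
# and different coordinations of the same vocabulary:
work_a = ["France--History--Revolution, 1789-1799", "Art--France"]
work_b = ["Art--History", "France--Revolution, 1789-1799"]

# Both flatten to the same four terms, so nothing in the flattened
# sets can recover which heading came first, or which subdivision
# modified which main term.
assert flatten_headings(work_a) == flatten_headings(work_b)
print(sorted(flatten_headings(work_a)))
# ['Art', 'France', 'History', 'Revolution, 1789-1799']
```

Keeping the ordered, coordinated headings in the underlying data lets a catalog derive flat term sets on demand for systems that want them; going the other way would require reconstructing ordering and coordination that was never recorded.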
In short, I want us to build catalogs that let catalogers and readers do more with metadata, rather than less. I hope that these two talks suggest some ways we can do that, and I’d love to hear from folks who have more ideas, suggestions, or questions.