Everybody's Libraries

Making discovery smarter with open data

I’ve just made a significant data enhancement to subject browsing on The Online Books Page.  It improves the concept-oriented browsing of my catalog of online books via subject maps, where users explore a subject along multiple dimensions from a starting point of interest.

Say you’d like to read some books about logic, for instance.  You’d rather not have to go find and troll all the appropriate shelf sections within math, philosophy, psychology, computing, and wherever else logic books might be found in a physical library.  And you’d rather not have to think of all the different keywords used to identify different logic-related topics in a typical online catalog. In my subject map for logic, you can see lots of suggestions of books filed both under “Logic” itself, and under related concepts.  You can go straight to a book that looks interesting, select a related subject and explore that further, or select the “i” icon next to a particular book to find more books like it.

As I’ve noted previously, the relationships and explanations that enable this sort of exploration depend on a lot of data, which has to come from somewhere.  In previous versions of my catalog, most of it came from a somewhat incomplete and not-fully-up-to-date set of authority records in our local catalog at Penn.  But the Library of Congress (LC) has recently made authoritative subject cataloging data freely available on a new website.  There, you can query it through standard interfaces, or simply download it all for analysis.

I recently downloaded their full data set (38 MB of zipped RDF), processed it, and used it to build new subject maps for The Online Books Page.   The resulting maps are substantially richer than what I had before.  My collection is fairly small by the standards of mass digitization– just shy of 40,000 items– but still, the new data, after processing, yielded over 20,000 new subject relationships, and over 600 new notes and explanations, for the subjects represented in the collection.

That’s particularly impressive when you consider that, in some ways, the RDF data is cruder than what I used before.  The RDF schemas that LC uses omit many of the details and structural cues that are in the MARC subject authority records at the Library of Congress (and at Penn).  And LC’s RDF file is also missing many subjects that I use in my catalog; in particular, at present it omits many records for geographic, personal, and organizational names.

Even so, I lost few relationships that were in my prior maps, and I gained many more.  There were two reasons for this:  First of all, LC’s file includes a lot of data records (many times more than my previous data source), and they’re more recent as well.  Second, a variety of automated inference rules– lexical, structural, geographic, and bibliographic– let me create additional links between concepts with little or no explicit authority data.  So even though LC’s RDF file includes no record for Ontario, for instance, its subject map in my collection still covers a lot of ground.

A few important things make these subject maps possible, and will help them get better in the future:

Recently, there’s been a fair bit of debate about whether to treat cataloging data as an open public good, or to keep it more restricted.  The Library of Congress’ catalog data has been publicly accessible online for years, though until recently only you could only get a little a time via manual searches, or pay a large sum to get a one-time data dump.  By creating APIs, using standard semantic XML formats, and providing free, unrestricted data downloads for their subject authority data, LC has made their data much easier for others to use in a variety of ways. It’s improved my online book catalog significantly, and can also improve many other catalogs and discovery applications.  Those of us who use this data, in turn, have incentives to work to improve and sustain it.

Making the LC Subject Headings ontology open data makes it both more useful and more viable as libraries evolve.  I thank the folks at the Library of Congress for their openness with their data, and I hope to do my part in improving and contributing to their work as well.