Making discovery smarter with open data

I’ve just made a significant data enhancement to subject browsing on The Online Books Page.  It improves the concept-oriented browsing of my catalog of online books via subject maps, where users explore a subject along multiple dimensions from a starting point of interest.

Say you’d like to read some books about logic, for instance.  You’d rather not have to go find and trawl through all the appropriate shelf sections within math, philosophy, psychology, computing, and wherever else logic books might be found in a physical library.  And you’d rather not have to think of all the different keywords used to identify different logic-related topics in a typical online catalog.  In my subject map for logic, you can see lots of suggestions of books filed both under “Logic” itself and under related concepts.  You can go straight to a book that looks interesting, select a related subject and explore that further, or select the “i” icon next to a particular book to find more books like it.

As I’ve noted previously, the relationships and explanations that enable this sort of exploration depend on a lot of data, which has to come from somewhere.  In previous versions of my catalog, most of it came from a somewhat incomplete and not-fully-up-to-date set of authority records in our local catalog at Penn.  But the Library of Congress (LC) has recently made authoritative subject cataloging data freely available on a new website.  There, you can query it through standard interfaces, or simply download it all for analysis.
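
As a quick illustration, here’s a minimal Python sketch of resolving a heading label to its authority URI through the site’s known-label lookup, which answers with a redirect.  Treat the exact URL pattern and redirect details as assumptions to check against the service’s current documentation:

```python
# Minimal sketch: resolve an LCSH label to its authority URI via
# id.loc.gov's known-label lookup service.  The URL pattern below is
# an assumption to verify against the service documentation.
import requests

def lookup_heading(label):
    """Return the authority URI for a subject heading label, if any."""
    resp = requests.head(
        "https://id.loc.gov/authorities/subjects/label/" + label,
        allow_redirects=False,
    )
    if resp.status_code in (301, 302, 303):
        return resp.headers.get("Location")
    return None

print(lookup_heading("Logic"))
```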

I recently downloaded their full data set (38 MB of zipped RDF), processed it, and used it to build new subject maps for The Online Books Page.  The resulting maps are substantially richer than what I had before.  My collection is fairly small by the standards of mass digitization (just shy of 40,000 items), but still, the new data, after processing, yielded over 20,000 new subject relationships, and over 600 new notes and explanations, for the subjects represented in the collection.
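
To give a flavor of the processing step, a sketch along these lines (in Python with the rdflib library; the local filename is my own placeholder) can pull relationships and notes out of the bulk SKOS/RDF download:

```python
# Sketch: mine LC's bulk SKOS/RDF download for the relationships and
# notes that feed subject maps.  "lcsh.rdf" is a placeholder name for
# the unzipped download; the predicates are standard SKOS.
from rdflib import Graph, Namespace, RDF

SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")

g = Graph()
g.parse("lcsh.rdf", format="xml")

relationships = []
notes = {}
for concept in g.subjects(RDF.type, SKOS.Concept):
    for broader in g.objects(concept, SKOS.broader):
        relationships.append((concept, "broader", broader))
    for related in g.objects(concept, SKOS.related):
        relationships.append((concept, "related", related))
    for note in g.objects(concept, SKOS.scopeNote):
        notes[concept] = str(note)

print(len(relationships), "relationships;", len(notes), "notes")
```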

That’s particularly impressive when you consider that, in some ways, the RDF data is cruder than what I used before.  The RDF schemas that LC uses omit many of the details and structural cues that are in the MARC subject authority records at the Library of Congress (and at Penn).  And LC’s RDF file is also missing many subjects that I use in my catalog; in particular, at present it omits many records for geographic, personal, and organizational names.

Even so, I lost few relationships that were in my prior maps, and I gained many more.  There were two reasons for this:  First of all, LC’s file includes a lot of data records (many times more than my previous data source), and they’re more recent as well.  Second, a variety of automated inference rules (lexical, structural, geographic, and bibliographic) let me create additional links between concepts with little or no explicit authority data.  So even though LC’s RDF file includes no record for Ontario, for instance, its subject map in my collection still covers a lot of ground.
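
To make the idea concrete, here’s a toy version of a lexical rule of the sort described above.  It’s my own reconstruction for illustration, not the actual code behind The Online Books Page:

```python
# Toy lexical inference rule: if a heading has a subdivision (e.g.
# "Logic--History"), infer that it's narrower than its base heading,
# even when no authority record explicitly links the two.
def infer_lexical_links(headings):
    """Yield (narrower, broader) pairs inferred from subdivision syntax."""
    known = set(headings)
    for heading in headings:
        if "--" in heading:
            base = heading.rsplit("--", 1)[0].strip()
            if base in known:
                yield (heading, base)

headings = ["Logic", "Logic--History", "Logic--History--19th century"]
for narrower, broader in infer_lexical_links(headings):
    print(narrower, "is narrower than", broader)
```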

A few important things make these subject maps possible, and will help them get better in the future:

  • A large, shared, open knowledge base: The Library of Congress Subject Headings have been built up by dedicated librarians at many institutions over more than a century.  As a shared, evolving resource, the data set supports unified searching and browsing over numerous collections, including mine.  The work of keeping it up to date, and in sync with the terms that patrons use to search, can potentially be spread out among many participants.  As an open resource, the data set can be put to a variety of uses that both increase the value of our libraries and encourage the further development of the knowledge base.
  • Making the most of automation: LC’s website and standards make it easy for me to download and process their data automatically.  Once I’ve loaded their data, and my own records, I then invoke a set of automated rules to infer additional subject relationships.  None of the rules is especially complex, but put together they do a lot to enhance the subject maps.  Since the underlying data is open, anyone else is also free to develop new rules or analyses (or adapt mine, once I release them).  If a community of analyzers develops, we can learn from each other as we go.  And perhaps some of the relationships we infer through automation can be incorporated directly into later revisions of LC’s own subject data.
  • Judicious use of special-purpose data: It is sometimes useful to add to or change data obtained from external sources.  For example, I maintain a small supplementary data file on major geographic areas.  A single data record saying that Ontario is a region within Canada, and is abbreviated “Ont.”, generates much of my subject map for Ontario (a sketch of how such a record might fan out appears after this list).  Soon, I should also be able to re-incorporate local subject records, as well as arbitrary additional overlays, to fill in conceptual gaps in LC’s file.  Since local customizations can take a lot of effort to maintain, however, it’s best to try to incorporate local data into shared knowledge bases when feasible.  That way, others can benefit from, and add on to, your own work.
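
Here’s a hypothetical sketch of how that single Ontario record might fan out into map links.  The record format and matching heuristics are inventions of mine for illustration, not the file format I actually use:

```python
# Hypothetical supplementary record: one line of geographic data can
# generate many subject-map links by matching heading strings.
ONTARIO = {"name": "Ontario", "within": "Canada", "abbrev": "Ont."}

def geographic_links(region, headings):
    """Link headings that mention a region (or its abbreviation) to it."""
    links = [(region["name"], "within", region["within"])]
    for h in headings:
        if h.endswith("--" + region["name"]) or ("(" + region["abbrev"] + ")") in h:
            links.append((h, "located-in", region["name"]))
    return links

headings = ["Frontier and pioneer life--Ontario", "Toronto (Ont.)--History"]
for link in geographic_links(ONTARIO, headings):
    print(link)
```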

Recently, there’s been a fair bit of debate about whether to treat cataloging data as an open public good, or to keep it more restricted.  The Library of Congress’ catalog data has been publicly accessible online for years, though until recently you could only get a little at a time via manual searches, or pay a large sum for a one-time data dump.  By creating APIs, using standard semantic XML formats, and providing free, unrestricted data downloads for their subject authority data, LC has made their data much easier for others to use in a variety of ways.  It’s improved my online book catalog significantly, and can also improve many other catalogs and discovery applications.  Those of us who use this data, in turn, have incentives to work to improve and sustain it.

Treating the LC Subject Headings ontology as open data makes it both more useful and more viable as libraries evolve.  I thank the folks at the Library of Congress for their openness with their data, and I hope to do my part in improving and contributing to their work as well.

Implementing interoperability between library discovery tools and the ILS

Last June I gave a presentation in a NISO webinar about the work a number of colleagues and I did for the Digital Library Federation to recommend standard interfaces for Integrated Library Systems (the systems that keep track of our library’s acquisitions, catalog, and circulation) to support a wide variety of tools and applications for discovery.  Our “ILS-DI” recommendation was published in 2008, and encompassed a number of functions that some ILS’s supported.  But it also included many functions that were not generally, or uniformly, supported by ILS’s of the time.  That’s still the case today.
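
For a sense of what one of these functions looks like from the discovery side, here’s a hedged Python sketch of calling GetAvailability, one of the recommendation’s core functions.  The endpoint URL, parameter names, and response element names below are placeholders, not the spec’s exact binding:

```python
# Hedged sketch of an ILS-DI GetAvailability call from a discovery
# application.  The endpoint and parameter names are placeholders;
# consult the ILS-DI technical recommendation for the real binding.
import requests
import xml.etree.ElementTree as ET

ILS_DI_ENDPOINT = "https://ils.example.edu/ils-di"  # hypothetical

def get_availability(record_id):
    """Ask the ILS whether the items on a bib record are available."""
    resp = requests.get(ILS_DI_ENDPOINT, params={
        "service": "GetAvailability",  # parameter name is an assumption
        "id": record_id,
        "id_type": "bib",
    })
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    # Element name below is illustrative, not the spec's exact schema.
    return [item.attrib for item in root.iter("item")]
```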

As I said in my presentation last June, “If we look at the ILS-DI process as a development spiral, we’ve moved from a specification stage to an implementation stage.”  My hope has been that vendors and other library software implementers would implement the basics of what we recommended (as many agreed to), and that the library community could progress from there.  This often takes longer to achieve than one might hope.

But I’m happy to report that the Code4lib community is now picking up the ball.  At this month’s Code4lib conference, a group met to discuss “collaboratively develop[ing] a middleware infrastructure” to link together ILS’s and discovery tools, based on the work done by the DLF’s ILS-DI group and by the developers of systems like Jangle and XC.  The middleware would help power discovery applications like Blacklight, VuFind, Summon, WorldCat Local, and whatever else the digital library community might invent.

I wasn’t at the Code4lib conference, but the group that met there to kick off the effort has impressive collective expertise and accomplishments.  It includes several members of the DLF’s ILS-DI group, as well as the lead implementers of several relevant systems.  Roy Tennant from OCLC Research is coordinating the initial activity, and Emily Lynema of the ILS-DI group has converted the Google Groups space used by the ILS-DI group for use by the new effort.

And you’re welcome to join too, if you’d like to help out or learn more.  “This is an open, collaborative effort” is how Roy put it in the announcement of the new initiative.  Due to some prior commitments, I’ll personally be observing more than actively participating, at least to begin with, but I’ll be following the work with great interest.  To find out more, and to get involved, see the Google Group.