Everybody's Libraries

May 6, 2010

Making discovery smarter with open data

Filed under: architecture,discovery,online books,open access,sharing,subjects — John Mark Ockerbloom @ 9:06 am

I’ve just made a significant data enhancement to subject browsing on The Online Books Page.  It improves the concept-oriented browsing of my catalog of online books via subject maps, where users explore a subject along multiple dimensions from a starting point of interest.

Say you’d like to read some books about logic, for instance.  You’d rather not have to go find and troll all the appropriate shelf sections within math, philosophy, psychology, computing, and wherever else logic books might be found in a physical library.  And you’d rather not have to think of all the different keywords used to identify different logic-related topics in a typical online catalog. In my subject map for logic, you can see lots of suggestions of books filed both under “Logic” itself, and under related concepts.  You can go straight to a book that looks interesting, select a related subject and explore that further, or select the “i” icon next to a particular book to find more books like it.

As I’ve noted previously, the relationships and explanations that enable this sort of exploration depend on a lot of data, which has to come from somewhere.  In previous versions of my catalog, most of it came from a somewhat incomplete and not-fully-up-to-date set of authority records in our local catalog at Penn.  But the Library of Congress (LC) has recently made authoritative subject cataloging data freely available on a new website.  There, you can query it through standard interfaces, or simply download it all for analysis.

I recently downloaded their full data set (38 MB of zipped RDF), processed it, and used it to build new subject maps for The Online Books Page.   The resulting maps are substantially richer than what I had before.  My collection is fairly small by the standards of mass digitization– just shy of 40,000 items– but still, the new data, after processing, yielded over 20,000 new subject relationships, and over 600 new notes and explanations, for the subjects represented in the collection.
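To give a rough idea of what "processing" such a file can involve, here is a minimal Python sketch. It is not the code behind The Online Books Page. It assumes the subject data uses the SKOS vocabulary (`skos:prefLabel`, `skos:broader`, `skos:narrower`, `skos:related`) and is available in N-Triples form; the post itself only says "zipped RDF". The `example.org` URIs, the `load_subjects` function, and the sample triples are all made up for illustration.

```python
import re
from collections import defaultdict

SKOS = "http://www.w3.org/2004/02/skos/core#"
RELATIONS = {SKOS + "broader", SKOS + "narrower", SKOS + "related"}

# Matches simple N-Triples lines: <subject> <predicate> <uri> .  or  "literal"@lang .
# Typed literals and blank nodes are deliberately ignored in this sketch.
TRIPLE = re.compile(
    r'^<([^>]+)>\s+<([^>]+)>\s+(?:<([^>]+)>|"((?:[^"\\]|\\.)*)"(?:@[\w-]+)?)\s*\.\s*$'
)

def load_subjects(lines):
    """Collect preferred labels and broader/narrower/related links."""
    labels = {}
    links = defaultdict(set)  # (relation name, source URI) -> set of target URIs
    for line in lines:
        match = TRIPLE.match(line.strip())
        if not match:
            continue  # skip blank lines and anything this sketch does not handle
        subject, predicate, target_uri, literal = match.groups()
        if predicate == SKOS + "prefLabel" and literal is not None:
            labels[subject] = literal
        elif predicate in RELATIONS and target_uri is not None:
            links[(predicate[len(SKOS):], subject)].add(target_uri)
    return labels, links

# Hypothetical sample data; real URIs and headings would come from the download.
SAMPLE = '''
<http://example.org/subjects/logic> <http://www.w3.org/2004/02/skos/core#prefLabel> "Logic"@en .
<http://example.org/subjects/logic> <http://www.w3.org/2004/02/skos/core#broader> <http://example.org/subjects/philosophy> .
<http://example.org/subjects/logic> <http://www.w3.org/2004/02/skos/core#related> <http://example.org/subjects/reasoning> .
<http://example.org/subjects/philosophy> <http://www.w3.org/2004/02/skos/core#prefLabel> "Philosophy"@en .
'''.splitlines()

labels, links = load_subjects(SAMPLE)
```

A full RDF library would be a safer choice for real data; the regular expression here only covers the simplest triple shapes.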

That’s particularly impressive when you consider that, in some ways, the RDF data is cruder than what I used before.  The RDF schemas that LC uses omit many of the details and structural cues that are in the MARC subject authority records at the Library of Congress (and at Penn).  And LC’s RDF file is also missing many subjects that I use in my catalog; in particular, at present it omits many records for geographic, personal, and organizational names.

Even so, I lost few relationships that were in my prior maps, and I gained many more.  There were two reasons for this:  First of all, LC’s file includes a lot of data records (many times more than my previous data source), and they’re more recent as well.  Second, a variety of automated inference rules– lexical, structural, geographic, and bibliographic– let me create additional links between concepts with little or no explicit authority data.  So even though LC’s RDF file includes no record for Ontario, for instance, its subject map in my collection still covers a lot of ground.
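The post does not spell out the individual rules, so the following Python sketch is only a guess at what one simple structural rule might look like: a subdivided heading such as "Logic -- History" is linked to the nearest shorter heading that is also in use. The function name `infer_parent_links`, the " -- " separator convention, and the sample headings are assumptions for illustration.

```python
def infer_parent_links(headings):
    """Link each subdivided heading to its nearest parent that is also in use.

    Assumes headings are written with " -- " between subdivisions.
    """
    known = set(headings)
    links = {}
    for heading in headings:
        parts = heading.split(" -- ")
        # Drop trailing subdivisions one at a time until a known heading is found.
        for end in range(len(parts) - 1, 0, -1):
            parent = " -- ".join(parts[:end])
            if parent in known:
                links[heading] = parent
                break
    return links

# Hypothetical headings; note that "Ontario" itself is not in the list.
headings = [
    "Logic",
    "Logic -- History",
    "Logic -- History -- Sources",
    "Ontario -- Description and travel",
]
parent_links = infer_parent_links(headings)
```

Here the two subdivided "Logic" headings each get a parent link, while the "Ontario" heading gets none because no plain "Ontario" record is present. That is the kind of gap a supplementary rule or data record would have to fill.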

A few important things make these subject maps possible, and will help them get better in the future:

  • A large, shared, open knowledge base: The Library of Congress Subject Headings have been built up by dedicated librarians at many institutions over more than a century.  As a shared, evolving resource, the data set supports unified searching and browsing over numerous collections, including mine.  The work of keeping it up to date, and in sync with the terms that patrons use to search, can potentially be spread out among many participants.  As an open resource, the data set can be put to a variety of uses that both increase the value of our libraries and encourage the further development of the knowledge base.
  • Making the most of automation: LC’s website and standards make it easy for me to download and process their data automatically. Once I’ve loaded their data, and my own records, I then invoke a set of automated rules to infer additional subject relationships.  None of the rules is especially complex; but put together, they do a lot to enhance the subject maps. Since the underlying data is open, anyone else is also free to develop new rules or analyses (or adapt mine, once I release them).  If a community of analyzers develops, we can learn from each other as we go.  And perhaps some of the relationships we infer through automation can be incorporated directly into later revisions of LC’s own subject data.
  • Judicious use of special-purpose data: It is sometimes useful to add to or change data obtained from external sources.  For example, I maintain a small supplementary data file on major geographic areas.  A single data record saying that Ontario is a region within Canada, and is abbreviated “Ont.”, generates much of my subject map for Ontario.  Soon, I should also be able to re-incorporate local subject records, as well as arbitrary additional overlays, to fill in conceptual gaps in LC’s file.  Since local customizations can take a lot of effort to maintain, however, it’s best to try to incorporate local data into shared knowledge bases when feasible.  That way, others can benefit from, and add on to, your own work.
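As an illustration of how a single geographic record like the Ontario one could generate many links, here is a small Python sketch. The record layout (`Region`), the function `geographic_links`, and the matching rule (headings whose main part ends in the region's parenthesized abbreviation) are assumptions made for this example; the actual supplementary file format is not described in the post.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Region:
    name: str          # e.g. "Ontario"
    within: str        # the larger area containing the region
    abbreviation: str  # qualifier used in headings, e.g. "Ont."

def geographic_links(region, headings):
    """Derive (narrower, "broader", broader) links from one region record."""
    links = {(region.name, "broader", region.within)}
    qualifier = f"({region.abbreviation})"
    for heading in headings:
        main = heading.split(" -- ")[0]  # ignore subdivisions
        if main.endswith(qualifier):
            links.add((main, "broader", region.name))
    return links

# Hypothetical record and headings.
ontario = Region(name="Ontario", within="Canada", abbreviation="Ont.")
headings = ["Toronto (Ont.) -- History", "Ottawa (Ont.)", "Logic"]
ontario_links = geographic_links(ontario, headings)
```

One record thus places Ontario under Canada and places every "(Ont.)" heading in the collection under Ontario, without any explicit authority record for Ontario itself.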

Recently, there’s been a fair bit of debate about whether to treat cataloging data as an open public good, or to keep it more restricted.  The Library of Congress’ catalog data has been publicly accessible online for years, though until recently you could only get a little at a time via manual searches, or pay a large sum to get a one-time data dump.  By creating APIs, using standard semantic XML formats, and providing free, unrestricted data downloads for their subject authority data, LC has made their data much easier for others to use in a variety of ways.  It’s improved my online book catalog significantly, and can also improve many other catalogs and discovery applications.  Those of us who use this data, in turn, have incentives to work to improve and sustain it.

Making the LC Subject Headings ontology open data makes it both more useful and more viable as libraries evolve.  I thank the folks at the Library of Congress for their openness with their data, and I hope to do my part in improving and contributing to their work as well.

3 Comments

  1. John, thanks so much for blogging about this. It’s great to see you actually using the data at id.loc.gov and reporting back on how the data can be improved … and for highlighting how important it is to get the name authority data on id.loc.gov.

    One thing that LC hasn’t publicized a whole lot is that the loading history for the data can be found here:

    http://id.loc.gov/authorities/loads/

    In addition there is an Atom feed that makes all the create, update and delete activity available:

    http://id.loc.gov/authorities/feed/

    The idea with the Atom feed is to allow you to keep your own local version of the records synchronized with id.loc.gov. You can follow the “next” URLs in the atom:link elements to drill backwards until you’ve seen a change at a record update time you already knew about. It’s a bit of an experimental, more mainstream alternative to OAI-PMH, so I’d be interested in any feedback you might have if it is of interest.

    Comment by Ed Summers — May 6, 2010 @ 10:40 am

  2. Nice. I’m interested in learning more about the technical details of what you’ve done.

    Is it powered by Solr, an rdbms, both, neither?

    How do you determine the narrower term relationships (and other relationships), just from the actual authority file encodings, or also using your own heuristic rules? Sounds like some of your own heuristics too, would be interested in learning more about them.

    When you get new updates from LCSH, do you have to “re-index” or “re-process” your entire db, even for just minor updates?

    Did you consider trying to combine the LCSH authorities and your local authorities, to get a superset of both?

    Are you interested in sharing your code, and what language is it written in?

    And finally, would you like to write a Code4Lib Journal article about this? I think it’d be a good one!

    Comment by Jonathan Rochkind — May 7, 2010 @ 11:37 am

  3. Ed: Thanks for your note, and for telling us about the feeds. Those sound useful too, though I’m happy to work with the big dumps myself for now.

    Jonathan: I’d be happy to consider a code4lib article if there’s interest. For now, here are my quick answers to your questions:

    * It’s not powered by either solr or a dbms. The code is based on a set of Perl modules I started writing back in the 20th century; the data’s kept in something much like Berkeley DB files, but much less space-hungry. I like the idea of moving it to Lucene, which might include Solr, but shifting things over to Java would be nontrivial, Perl versions of Lucene had performance issues last I checked, and using an external Solr instance might or might not improve things (haven’t thought much about this last option; it would at the very least involve more overhead keeping everything running).

    * The NT relationships are determined from the authority file encodings plus my heuristic rules. Some of these rules also look at subject assignments in the collection to be indexed. The exact details are a bit long for a comment but I could detail them elsewhere.

    * My current plan is to reprocess when LCSH updates, but that’s not a big deal for me. New RDFs seem to come out every month or two. Preliminary processing to prepare the new RDF for my indexes takes less than 5 minutes for the entire RDF. I can then regenerate not only the subject maps but also all other indexes for my collection in another 5 minutes; this is typically done daily. The time required scales up somewhat with collection size, but earlier test indexes with Penn’s whole catalog (and a smaller set of authorities and heuristics, granted) took under an hour, IIRC.

    * I do plan to bring in local authorities as well; I would have done it by now except for some character encoding issues. I could see some possible issues with links to obsolete forms, but there should be ways to address this. (Actually, this sort of analysis could also be tweaked to flag obsolete subject terms in one’s catalog, and possibly change them in bulk as well.)

    * Yes, I’d be happy to share code, but see comments above about 20th-century Perl. Not sure how useful it will be in general (and I’m not in a position to provide support for it), but I can show it to folks who are interested.

    Hope this answers your questions; and again, if folks would be interested in learning more about various technical aspects, I’d be happy to consider doing a paper.

    Comment by John Mark Ockerbloom — May 7, 2010 @ 12:09 pm

