Everybody's Libraries

July 31, 2010

Keeping subjects up to date with open data

Filed under: data,discovery,online books,open access,sharing,subjects — John Mark Ockerbloom @ 11:51 pm

In an earlier post, I discussed how I was using the open data from the Library of Congress’ Authorities and Vocabularies service to enhance subject browsing on The Online Books Page.  More recently, I’ve used the same data to make my subjects more consistent and up to date.  In this post, I’ll describe why I need to do this, and why doing it isn’t as hard as I feared that it might be.

The Library of Congress Subject Headings (LCSH) is a standard set of subject names, descriptions, and relationships, begun in 1898, and periodically updated ever since. The names of its subjects have shifted over time, particularly in recent years.  For instance, recently subject terms mentioning “Cookery”, a word more common in the 1800s than now, were changed to use the word “Cooking”, a term that today’s library patrons are much more likely to use.

It’s good for local library catalogs that use LCSH to keep in sync with the most up to date version, not only to better match modern usage, but also to keep catalog records consistent with each other.  Especially as libraries share their online books and associated catalog records, it’s particularly important that books on the same subject use the same, up-to-date terms.  No one wants to have to search under lots of different headings, especially obsolete ones, when they’re looking for books on a particular topic.

Libraries with large, long-standing catalogs often have a hard time staying current, however.  The catalog of the university library where I work, for instance, still has some books on airplanes filed under “Aeroplanes”, a term that recalls the long-gone days when open-cockpit daredevils dominated the air.  With new items arriving every day to be cataloged, though, keeping millions of legacy records up to date can be seen as more trouble than it’s worth.

But your catalog doesn’t have to be big or old to fall out of sync.  It happens faster than you might think.   The Online Books Page currently has just over 40,000 records in its catalog, about 1% of the size of my university’s.   I only started adding LC subject headings in 2006.  I tried to make sure I was adding valid subject headings, and made changes when I heard about major term renamings (such as “Cookery” to “Cooking”).  Still, I was startled to find out that only 4 years after I’d started, hundreds of subject headings I’d assigned were already out of date, or otherwise replaced by other standardized headings.  Fortunately, I was able to find this out, and bring the records up to date, in a matter of hours, thanks to automated analysis of the open data from the Library of Congress.  Furthermore, as I updated my records manually, I became confident I could automate most of the updates, making the job faster still.

Here’s how I did it.  After downloading a fresh set of LC subject headings records in RDF, I ran a script over the data that compiled an index of authorized headings (the proper ones to use), alternate headings (the obsolete or otherwise discouraged headings), and lists of which authorized headings were used for which alternate headings. The RDF file currently contains about 390,000 authorized subject headings, and about 330,000 alternate headings.
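The indexing step can be sketched roughly as follows. This is a minimal illustration, not the script I actually ran: it assumes the RDF has already been parsed into simple (authorized heading, list of alternate headings) pairs, and the function name `build_heading_index` and the sample records are made up for the example.

```python
from collections import defaultdict

def build_heading_index(records):
    """Build lookup structures from (authorized, alternates) pairs.

    `records` is a hypothetical, already-parsed form of the LCSH data:
    an iterable of (authorized_label, [alternate_labels]) tuples.
    """
    authorized = set()
    alternates = defaultdict(set)  # alternate label -> set of authorized labels
    for pref, alts in records:
        authorized.add(pref)
        for alt in alts:
            alternates[alt].add(pref)
    return authorized, dict(alternates)

# Tiny illustrative sample; the real file has hundreds of thousands of headings.
sample = [
    ("Cooking", ["Cookery"]),
    ("Airplanes", ["Aeroplanes"]),
    ("Queens", ["Royalty"]),
    ("Princes", ["Royalty"]),
]
authorized, alternates = build_heading_index(sample)
print(sorted(alternates["Royalty"]))  # ['Princes', 'Queens']
```

Keeping a set of authorized headings per alternate heading matters because, as in the "Royalty" example, one alternate can point to several authorized forms.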

Then I extracted all the subjects from my catalog.  (I currently have about 38,000 unique subjects.)  Then I had a script check each subject to see if it was listed as an authorized heading in the RDF file.  If not, I checked to see if it was an alternate heading.  If neither was the case, and the subject had subdivisions (e.g. “Airplanes — History”) I removed a subdivision from the end and repeated the checks until a term was found in either the authorized or alternate category, or I ran out of subdivisions.
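In code, the checking loop described above might look something like this. It is a sketch under assumptions: subdivisions are taken to be separated by " -- " (adjust `SEP` to whatever your records use), and `classify_subject` and the sample data are hypothetical names for illustration.

```python
SEP = " -- "  # assumed subdivision separator; adjust to match your records

def classify_subject(subject, authorized, alternates):
    """Return (status, matched_heading) for a subject string.

    status is "authorized", "alternate", or "unknown".
    """
    parts = subject.split(SEP)
    while parts:
        candidate = SEP.join(parts)
        if candidate in authorized:
            return "authorized", candidate
        if candidate in alternates:
            return "alternate", candidate
        parts.pop()  # drop the last subdivision and retry
    return "unknown", None

authorized = {"Airplanes", "Cooking"}
alternates = {"Aeroplanes": {"Airplanes"}, "Cookery": {"Cooking"}}

print(classify_subject("Airplanes -- History", authorized, alternates))
print(classify_subject("Aeroplanes -- History", authorized, alternates))
print(classify_subject("Ontario", authorized, alternates))
```

Subjects that come back "unknown" are not necessarily wrong; they may simply be names that the data file does not cover.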

This turned up 286 unique subjects that needed replacement– over 3/4 of 1% of my headings, in less than 4 years.  (My script originally identified even more, until I realized I had to ignore the simple geographic or personal names.  Those aren’t yet in LC’s RDF file, but a few of them show up as alternate headings for other subjects.)  These 286 headings (some of them the same except for subdivisions) represented 225 distinct substitutions.  The bad headings were used in hundreds of bibliographic records, the most popular full heading being used 27 times. The vast majority of the full headings, though, were used in only one record.

What was I to replace these headings with?  Some of the headings had multiple possibilities. “Royalty” was an alternate heading for 5 different authorized headings: “Royal houses”, “Kings and rulers”, “Queens”, “Princes” and “Princesses”.   But that was the exception rather than the rule.  All but 10 of my bad headings were alternates for only one authorized heading.  After “Royalty”, the remaining 9 alternate headings presented a choice between two authorized forms.

When there’s only 1 authorized heading to go to, it’s pretty simple to have a script do the substitution automatically.  As I verified while doing the substitutions manually, nearly all the time the automatable substitution made sense.  (There were a few that didn’t: for instance, when “Mind and body — Early works to 1850” is replaced by “Mind and body — Early works to 1800”, works first published between 1800 and 1850 get misfiled.  But few substitutions were problematic like this, and those involving dates, like this one, can be flagged by a clever script.)
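Here is one way the decision between automatic substitution and manual review could be sketched. The rule for dates (flag anything containing a four-digit year) is my own simplification for illustration, and `plan_substitution`, `SEP`, and the sample data are assumed names, not part of any existing tool.

```python
import re

SEP = " -- "  # assumed subdivision separator
YEAR = re.compile(r"\b\d{4}\b")  # any four-digit year

def plan_substitution(subject, matched, alternates):
    """Decide how to handle a subject whose leading part `matched`
    is an alternate heading.

    Returns (action, replacements), where action is "auto" or "review".
    """
    targets = sorted(alternates[matched])
    rest = subject[len(matched):]  # trailing subdivisions, if any
    replacements = [target + rest for target in targets]
    if len(targets) > 1:
        return "review", replacements  # several authorized forms: a person picks
    if YEAR.search(subject) or YEAR.search(replacements[0]):
        return "review", replacements  # date ranges may have shifted
    return "auto", replacements

alternates = {
    "Cookery": {"Cooking"},
    "Royalty": {"Queens", "Princes"},
    "Mind and body -- Early works to 1850": {"Mind and body -- Early works to 1800"},
}

print(plan_substitution("Cookery -- History", "Cookery", alternates))
print(plan_substitution("Royalty", "Royalty", alternates))
```

The date check is deliberately conservative: it sends some harmless changes to review, which seems a fair price for catching the ones that would misfile books.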

If I were doing the update over again, I’d feel more comfortable letting a script automatically reassign, and not just identify, most of my obsolete headings.  I’d still want to manually inspect changes that affect more than one or two records, to make sure I wasn’t messing up lots of records in the same way; and I’d also want to manually handle cases where more than one term could be substituted.  The rest, the vast majority of the edits, could be done fully automatically.  The occasional erroneous reassignment of a single record would be more than made up by the repair of many more obsolete and erroneous old records.  (And if my script logs changes properly, I can roll back problematic ones later on if need be.)
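A change log that supports rollback does not need to be elaborate. The sketch below assumes catalog records are held as a dictionary mapping a record ID to a list of subject strings; that layout, the function names, and the sample records are all hypothetical.

```python
def apply_changes(records, substitutions):
    """Rewrite subjects in place; return a log of (record_id, index, old, new)."""
    log = []
    for record_id, subjects in records.items():
        for i, old in enumerate(subjects):
            new = substitutions.get(old)
            if new is not None and new != old:
                subjects[i] = new
                log.append((record_id, i, old, new))
    return log

def roll_back(records, log, should_undo):
    """Undo the logged changes for which should_undo(entry) is true."""
    for entry in reversed(log):
        record_id, i, old, _new = entry
        if should_undo(entry):
            records[record_id][i] = old

records = {"rec1": ["Cookery", "Food"], "rec2": ["Aeroplanes"]}
substitutions = {"Cookery": "Cooking", "Aeroplanes": "Airplanes"}

log = apply_changes(records, substitutions)
# Suppose the change to rec2 turned out to be a mistake:
roll_back(records, log, lambda entry: entry[0] == "rec2")
print(records)
```

Recording the position of each changed subject, and not just the old and new text, keeps the rollback exact even when a record ends up with two identical headings.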

Mind you, now that I’ve brought my headings up to date once, I expect that further updates will be quicker anyway.  The Library of Congress releases new LCSH RDF files about every 1-2 months.  There should be many fewer changes in most such incremental updates than there would be when doing years’ worth of updates all at once.

Looking at the evolution of the Library of Congress catalog over time, I suspect that they do a lot of this sort of automatic updating already.  But many other libraries don’t, or don’t do it thoroughly or systematically.  With frequent downloads of updated LCSH data, and good automated procedures, I suspect that many more could.  I have plans to analyze some significantly larger, older, and more diverse collections of records to find out whether my suspicions are justified, and hope to report on my results in a future post.  For now, I’d like to thank the Library of Congress once again for publishing the open data that makes these sorts of catalog investigations and improvements feasible.

May 6, 2010

Making discovery smarter with open data

Filed under: architecture,discovery,online books,open access,sharing,subjects — John Mark Ockerbloom @ 9:06 am

I’ve just made a significant data enhancement to subject browsing on The Online Books Page.  It improves the concept-oriented browsing of my catalog of online books via subject maps, where users explore a subject along multiple dimensions from a starting point of interest.

Say you’d like to read some books about logic, for instance.  You’d rather not have to go find and troll all the appropriate shelf sections within math, philosophy, psychology, computing, and wherever else logic books might be found in a physical library.  And you’d rather not have to think of all the different keywords used to identify different logic-related topics in a typical online catalog. In my subject map for logic, you can see lots of suggestions of books filed both under “Logic” itself, and under related concepts.  You can go straight to a book that looks interesting, select a related subject and explore that further, or select the “i” icon next to a particular book to find more books like it.

As I’ve noted previously, the relationships and explanations that enable this sort of exploration depend on a lot of data, which has to come from somewhere.  In previous versions of my catalog, most of it came from a somewhat incomplete and not-fully-up-to-date set of authority records in our local catalog at Penn.  But the Library of Congress (LC) has recently made authoritative subject cataloging data freely available on a new website.  There, you can query it through standard interfaces, or simply download it all for analysis.

I recently downloaded their full data set (38 MB of zipped RDF), processed it, and used it to build new subject maps for The Online Books Page.   The resulting maps are substantially richer than what I had before.  My collection is fairly small by the standards of mass digitization– just shy of 40,000 items– but still, the new data, after processing, yielded over 20,000 new subject relationships, and over 600 new notes and explanations, for the subjects represented in the collection.

That’s particularly impressive when you consider that, in some ways, the RDF data is cruder than what I used before.  The RDF schemas that LC uses omit many of the details and structural cues that are in the MARC subject authority records at the Library of Congress (and at Penn).  And LC’s RDF file is also missing many subjects that I use in my catalog; in particular, at present it omits many records for geographic, personal, and organizational names.

Even so, I lost few relationships that were in my prior maps, and I gained many more.  There were two reasons for this:  First of all, LC’s file includes a lot of data records (many times more than my previous data source), and they’re more recent as well.  Second, a variety of automated inference rules– lexical, structural, geographic, and bibliographic– let me create additional links between concepts with little or no explicit authority data.  So even though LC’s RDF file includes no record for Ontario, for instance, its subject map in my collection still covers a lot of ground.

A few important things make these subject maps possible, and will help them get better in the future:

  • A large, shared, open knowledge base: The Library of Congress Subject Headings have been built up by dedicated librarians at many institutions over more than a century.  As a shared, evolving resource, the data set supports unified searching and browsing over numerous collections, including mine.  The work of keeping it up to date, and in sync with the terms that patrons use to search, can potentially be spread out among many participants.  As an open resource, the data set can be put to a variety of uses that both increase the value of our libraries and encourage the further development of the knowledge base.
  • Making the most of automation: LC’s website and standards make it easy for me to download and process their data automatically. Once I’ve loaded their data, and my own records, I then invoke a set of automated rules to infer additional subject relationships.  None of the rules is especially complex; but put together, they do a lot to enhance the subject maps. Since the underlying data is open, anyone else is also free to develop new rules or analyses (or adapt mine, once I release them).  If a community of analyzers develops, we can learn from each other as we go.  And perhaps some of the relationships we infer through automation can be incorporated directly into later revisions of LC’s own subject data.
  • Judicious use of special-purpose data: It is sometimes useful to add to or change data obtained from external sources.  For example, I maintain a small supplementary data file on major geographic areas.  A single data record saying that Ontario is a region within Canada, and is abbreviated “Ont.”, generates much of my subject map for Ontario.  Soon, I should also be able to re-incorporate local subject records, as well as arbitrary additional overlays, to fill in conceptual gaps in LC’s file.  Since local customizations can take  a lot of effort to maintain, however, it’s best to try to incorporate local data into shared knowledge bases when feasible.  That way, others can benefit from, and add on to, your own work.
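To give a flavor of how a single supplementary record can generate many links, here is a small sketch of one possible geographic rule. It is an illustration only, not one of my actual rules: it assumes that place names within a region carry the region's abbreviation as a parenthetical qualifier (as in "Toronto (Ont.)"), that subdivisions are separated by " -- ", and the function and field names are invented for the example.

```python
def infer_geographic_links(region, info, subjects):
    """Return a set of (narrower, broader) pairs implied by one region record."""
    links = {(region, info["within"])}  # the region sits within its parent
    qualifier = "(" + info["abbrev"] + ")"
    for subject in subjects:
        head = subject.split(" -- ")[0]  # ignore subdivisions
        if head.endswith(qualifier):
            links.add((head, region))  # qualified place sits within the region
    return links

subjects = [
    "Toronto (Ont.) -- History",
    "Ontario -- Description and travel",
    "Logic",
]
links = infer_geographic_links(
    "Ontario", {"within": "Canada", "abbrev": "Ont."}, subjects
)
print(sorted(links))
```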

Recently, there’s been a fair bit of debate about whether to treat cataloging data as an open public good, or to keep it more restricted.  The Library of Congress’ catalog data has been publicly accessible online for years, though until recently you could only get a little at a time via manual searches, or pay a large sum to get a one-time data dump.  By creating APIs, using standard semantic XML formats, and providing free, unrestricted data downloads for their subject authority data, LC has made their data much easier for others to use in a variety of ways. It’s improved my online book catalog significantly, and can also improve many other catalogs and discovery applications.  Those of us who use this data, in turn, have incentives to work to improve and sustain it.

Making the LC Subject Headings ontology open data makes it both more useful and more viable as libraries evolve.  I thank the folks at the Library of Congress for their openness with their data, and I hope to do my part in improving and contributing to their work as well.

March 9, 2010

Implementing interoperability between library discovery tools and the ILS

Filed under: architecture,discovery — John Mark Ockerbloom @ 4:57 pm

Last June I gave a presentation in a NISO webinar about the work a number of colleagues and I did for the Digital Library Federation to recommend standard interfaces for Integrated Library Systems (the systems that keep track of our library’s acquisitions, catalog, and circulation) to support a wide variety of tools and applications for discovery.   Our “ILS-DI” recommendation was published in 2008, and encompassed a number of functions that some ILS’s supported.  But it also included many functions that were not generally, or uniformly, supported by ILS’s of the time.  That’s still the case today.

As I said in my presentation last June, “If we look at the ILS-DI process as a development spiral, we’ve moved from a specification stage  to an implementation stage.”  My hope has been that vendors and other library software implementers would implement the basics of what we recommended– as many agreed to– and the library community could progress from there.  This often takes longer to achieve than one might hope.

But I’m happy to report that the Code4lib community is now picking up the ball.  At this month’s Code4lib conference, a group met to discuss “collaboratively develop[ing] a middleware infrastructure” to link together ILS’s and discovery tools, based on the work done by the DLF’s ILS-DI group and by the developers of systems like Jangle and XC.  The middleware would help power discovery applications like Blacklight, VuFind, Summon, WorldCat Local, and whatever else the digital library community might invent.

I wasn’t at the Code4lib conference, but the group that met there to kick off the effort has impressive collective expertise and accomplishments.   It includes several members of the DLF’s ILS-DI group, as well as the lead implementors of several relevant systems.  Roy Tennant from OCLC Research is coordinating the initial activity, and Emily Lynema of the ILS-DI group has converted the Google groups space used by the ILS-DI group for the new effort.

And you’re welcome to join too, if you’d like to help out or learn more. “This is an open, collaborative effort” is how Roy put it in the announcement of the new initiative.  Due to some prior commitments, I’ll personally be watching more than actively participating, at least to begin with, but I’ll be watching with great interest.  To find out more, and to get involved, see the Google Group.

June 10, 2009

Learn more about ILS discovery interfaces

Filed under: architecture,discovery,libraries — John Mark Ockerbloom @ 1:01 pm

I’m presenting today at a NISO webinar on interoperability, giving an overview of the work I did with a Digital Library Federation task group to produce recommendations for standard APIs for ILS’s supporting information discovery applications.

I’ll include a link to my presentation later today, after the webinar is over.   I’m also happy to answer questions here about the ILS-DI work.  (I’ve also covered that work here before in the blog.)

To help folks keep track of ILS-DI implementations and related activities, I’ve also created a new page on this site linking to the recommendation, implementations and followons, and related projects.  I’ve started it with just the basics, but plan to fill in more information shortly.

Update: I’ve now posted my slides and speaker notes.

January 28, 2009

Open catalog APIs and data: ALA presentation notes posted

Filed under: architecture,discovery,libraries,open access,sharing — John Mark Ockerbloom @ 3:48 pm

I’ve now posted my materials for the two panels I participated in at ALA Midwinter.

I have slides available for “Opening the ILS for Discovery: The Digital Library Federation’s ILS-Discovery Interface Recommendations”, a presentation for LITA’s Next Generation Catalog interest group, where I gave an overview of the recommendations and their use.   At the same session, Beth Jefferson of BiblioCommons talked about some of the social and legal issues of sharing user content in library catalogs and other discovery applications.

And I have the slides and remarks I prepared for “Open Records, Open Possibilities”, a presentation for the ALCTS panel on shared bibliographic records and the future of WorldCat.  In that one, I argue for more open access to bibliographic records, showing some of the benefits and sustainability strategies of open access models.

Karen Calhoun has also posted the slides from her presentation at that panel.  Peter Murray also presented; I haven’t yet found his slides online, but he’s blogging about what he said.  The fourth panelist, Brian Schottlaender, didn’t present slides, but instead gave thoughtful summaries and follow-on questions to some of the points the rest of us made.  From the audience, Norman Oder of Library Journal took notes and then wrote a useful report on the session.

I’d like to thank the organizers of these sessions, Sharon Shafer and Charles Wilt, for inviting me to speak, and my co-presenters for sharing their ideas and viewpoints.

January 15, 2009

Repository services, Part 2: Supporting deposit and access

Filed under: discovery,formats,repositories — John Mark Ockerbloom @ 6:01 pm

A couple of days ago, I talked about how we provided multiple repository services, and why an institutional scholarship repository needs to provide more than just a place to store stuff.  In this post, I’ll describe some of the useful basic deposit and access services for institutional scholarly repositories (IRs).

The enumeration of services in this series is based in part on discussions I’ve had with our scholarly communications librarian, Shawn Martin, but any wrong-headed or garbled statements you find here can be laid at my own feet.  (Whereupon I can pick them up, smooth them out, and find the right head for them.)

Ingestion:

One of the major challenges of running an institutional repository is filling it up with content: finding it, making sure it can go in, and making sure it goes in properly, in a manageable format, with informative metadata.  Among other things, this calls for:

  • Efficient, flexible, user-friendly deposit workflows. Most of your authors will not bother with anything that looks like it’s wasting their time.  And you shouldn’t waste your staff’s time either, or drive them mad, with needlessly tedious deposit procedures they have to do over and over and over and over again.
  • Conversion to  standard formats on ingestion. Word processing documents, and other formats tied to a particular software product, have a way of becoming opaque and unreadable a few years after the vendor has moved on to a new version, a new product, or that dot-com registry in the sky.  Our institutional repository, for instance, converts text documents to PDF on ingestion, which both helps preserve them and ensures wide readability.  (PDF is an openly specified format, readable by programs from many sources, available on virtually all kinds of computers.)
  • Journal workflows. Much of what our scholars publish is destined for scholarly journals, which in turn are typically reviewed and edited by those scholars.  Letting scholars review, compile, and publish those journals directly in the repository can save their time, and encourage rapid, open electronic access.   (And you don’t have to go back and try to get a copy for your repository when it’s already in the repository.)  Our BePress IR software has journal workflows and publication built into it.  Alternatively, specialized journal editing and publishing systems, such as Open Journal Systems, also serve as repositories for their journal content.
  • Support for automated submission protocols such as SWORD. Manual repository deposit can be tedious and error-prone, especially if there are multiple repositories that want your content (such as a funder-mandated repository, your own institutional repository, and perhaps an independent subject repository).  Manual deposit also often wastes people’s time re-entering information that’s already available online.  If you can work with a protocol that puts content into a repository automatically, though, things can get much better: you can support multiple simultaneous deposits, ingestion procedures designed especially for your own environment that use the automated protocol for deposit, and automated bulk transfer of content from one repository to another.  SWORD is an automated repository deposit protocol that is starting to be supported by various repositories.  (BePress does not yet support it, but we’re hoping they will soon.)

From a practical standpoint, if you want a significant stream of content coming into your repository, you’ll probably need to have a content wrangler as well: someone who makes sure that authors’ content is going into the repository as intended. (In practice, they often end up doing the deposit themselves.)

Discovery:

You want it to be easy and enjoyable for readers to explore your site and find content of interest to them.  Here are a few important ways to enable discovery:

  • Search of full text and/or metadata, either over the repository as a whole, or over selected portions of the repository.  Full text search can be simple and turn up lots of useful content that might not be discovered through metadata search alone.  More precise, metadata-based searches can also be important for specialized needs.   Full text indexing is not always available (in some cases, you might only have page images), but it should be supported where possible.
  • Customization of discovery for different communities and collections.  Different communities may have different ways of organizing and finding things.  Some communities may want to organize primarily by topic, or author, or publication type, or date.  Some may have specialized metadata that should be available for general and targeted searching and browsing.  If you can customize how different collections can be explored, you can make them more usable to their audiences.
  • Aggregator feeds using RSS or Atom, so people can keep track of new items of interest in their favorite feed readers.  This needs to exist at multiple levels of granularity.   Many repositories give RSS feeds of everything added to the repository, but most people will be more interested in following what’s new from a particular department or author, or in a particular subject.
  • Search engine friendliness. Judging from our logs, most of the downloads of our repository papers occur not via our own searching and browsing interfaces, but via Google and other search engines that have crawled the repository.  So you need to make sure your repository is set up to make it easy and inviting for search engines to crawl.  Don’t hide things behind Flash or Javascript unless you don’t want them easily found.  Make sure your pages have informative titles, and the site doesn’t require excessive link-clicking to get to content.  You also need to make sure that your site can handle the traffic produced by search-engine indexers, some of which can be quite enthusiastic about frequently crawling content.
  • Metadata export via protocols like OAI-PMH.  This is useful in a number of ways:  It allows your content to be indexed by content aggregators; it lets you maintain and analyze your own repository’s inventory; and, in combination with automated deposit protocols like SWORD (and content aggregation languages like OAI-ORE), it may eventually make it much simpler to replicate and redeposit content in multiple repositories.
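As a small illustration of the metadata export item above, here is how a harvester might build an OAI-PMH request for records added since a given date. The `ListRecords` verb and the `metadataPrefix`, `from`, and `set` arguments come from the OAI-PMH specification; the base URL and the helper name `list_records_url` are placeholders I made up for the example.

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # harvest only records changed since this date
    if set_spec:
        params["set"] = set_spec  # restrict to one collection, if the repository defines sets
    return base_url + "?" + urlencode(params)

# Hypothetical repository endpoint, for illustration only.
url = list_records_url("https://repository.example.edu/oai", from_date="2009-01-01")
print(url)
```

A real harvester would also need to follow the resumption tokens that repositories return when a result list is too long for one response.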

Access:

  • Persistent URIs for items. Content is easier to find and cite when it doesn’t move away from its original location.  You would think it would be well known that cool URLs don’t change, but I still find a surprisingly large number of documents put in content management systems where I know the only visible URIs will not survive the next upgrade of the system, let alone a migration to a new platform.  If possible, the persistent URI should be the only URI the user sees.  If not, the persistent URI should at least be highly visible, so that users link to it, and not the more transient URI that your repository software might use for its own purposes.
  • An adequate range of access control options for particular collections and items.  I’m all in favor of open access to content, but sometimes this is not possible or appropriate.  Some scholarship includes information that needs to be kept under wraps, or in limited release, temporarily or permanently.  We want to still be able to manage this content in the repository when appropriate.
  • Embargo management is an important part of access control.   In some cases, users may want to keep their content limited-access for a set time period, so that they can get a patent, obey a publishing contract, or prepare for a coordinated announcement.  Currently, because of BePress’ limited embargo support, we sit on embargoed content and have to remember to put it into the repository, or manually turn on open access, when the embargo ends.  It’s much easier if depositors can just say “keep this limited access until this date, and then open it up,” and the repository service handles matters from there.
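The embargo behavior described in the last item can be quite simple at its core. This sketch assumes each item carries an optional `embargo_until` date; the field name, the function names, and the sample items are hypothetical, and a real repository would tie this into its access control system.

```python
from datetime import date

def is_openly_accessible(item, today=None):
    """True if the item has no embargo, or its embargo date has arrived."""
    today = today or date.today()
    embargo_until = item.get("embargo_until")
    return embargo_until is None or today >= embargo_until

def items_to_release(items, today=None):
    """IDs of embargoed items whose embargo has ended as of `today`."""
    return [
        item["id"]
        for item in items
        if item.get("embargo_until") is not None
        and is_openly_accessible(item, today)
    ]

items = [
    {"id": "paper-a", "embargo_until": date(2009, 6, 1)},
    {"id": "paper-b", "embargo_until": None},
]
print(is_openly_accessible(items[0], today=date(2009, 1, 15)))  # False
print(items_to_release(items, today=date(2009, 6, 1)))          # ['paper-a']
```

With a check like this running on a schedule, nobody has to remember to open up content by hand when an embargo ends.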

That may seem like a lot to think about, but we’re not done yet.  In the next part, I’ll talk about services for managing content in the IR, including promoting it, letting depositors know about its impact, and preserving it appropriately.
