Everybody's Libraries

October 7, 2011

My mother’s orphan

Filed under: copyright,findingada,online books,open access,people,preservation,sharing,teaching — John Mark Ockerbloom @ 5:06 pm

Before my mother was pregnant with me, she was working on a book.

The book had begun its gestation at least a year before. She had been teaching math in Massachusetts, and was involved with the Madison Project, one of the initiatives that arose from the “new math” movement of the 1960s.  What excited her, and what I caught from her not long after I was born, was the sense of discovery and play that was encouraged in the Madison teaching style.  The primary focus wasn’t so much on imparting and drilling facts and rules, or on mundane applications, but on finding patterns, solving puzzles, and figuring out the secrets of numbers and geometry and the other mathematical constructs that underlie our world. Some project participants planned a series of books that would help bring out this sense of discovery and exploration in math classes.

Two small children in the house may have delayed my mother’s ambitions, but we didn’t stop her.  When I was in kindergarten, the piles of papers in my parents’ bedroom went away, and my mother proudly showed me her new book.  The book, Discoveries in Essential Mathematics, was co-written with Ramon Steinen, and published by Charles E. Merrill. Though the textbook was written for middle schoolers, I remember reading through the book after my mother showed it to me, solving the simpler problems, and smiling when I saw my name or my sister’s in an example.

She got small royalty checks for a few years, but the book was out of print by the late 1970s, never reaching a second edition.  We kept some copies in our basement, but I didn’t know of any library that held it.  When I visited the Library of Congress as a middle schooler, wrongly convinced that they had every book ever published, I remember my disappointment when I couldn’t find Mom’s book in their card catalog.

My mother eventually retired from teaching, and the enthusiasm and talent I’d gotten from Mom for math shifted into computing, and then into digital libraries.  And when my kids reached school age, I decided to try putting her book online.  In an era of large classes, detailed state standards, and high-stakes standardized tests, it might not be a viable standard textbook any more, but I think it’s still great for curious kids who show an interest in math.

Mom thought that was a great idea.  But she didn’t know if she could grant permission on her own.  Although long out of print, the book’s copyright had automatically renewed in 2000 under US copyright law, and she wasn’t sure if she had to get the consent of her publisher or co-author before she could give me the go-ahead. She didn’t know how to reach her co-author, and her old imprint was long gone.  Even its acquirer had itself been acquired by a large conglomerate some time ago.  So I let the idea drop, thinking I’d come back to it later when I had a little time to research the copyright.

But not long after, she started a long slide into dementia, and was soon in no position to give permission to anyone.  If her book had been practically an “orphan work” before, due to uncertainty over rights, it was even more so now.  There was no trouble locating the author, but there was no way of getting valid permission from someone definitely known to hold the rights.

Mom died this past winter, four years after my Dad had reluctantly moved her into the nursing home for good, and four weeks after he’d made his usual daily visit, gone back home, and had a fatal heart attack.  After we paid the last of the bills, and threw out the contents of the basement (where a burst pipe ruined all the books, papers, and other things they kept down there), what remained of what they had would now go to me and my siblings.

I still had a copy at home of the teacher’s edition of Mom’s book that she had once given to Grandma.  And between my mother’s funeral and the burst pipe, I’d taken a student edition out of their basement for my kids to read.  But any faint hope of finding publishing contracts or rights assignment documents was obliterated after the pipe burst.  The basic questions were: had Mom signed her rights to the book away, as many academic authors do? If so, had she gotten them back at some point?  Or had she never had the rights in the first place, as sometimes happens with textbook authors under “work for hire” contracts?

The copyright page of the book, and the record in the 1972 Catalog of Copyright Entries, show the publisher as the copyright claimant, so I couldn’t assume she had the rights.   But I also doubted whether I could get a clear answer, or reasonable licensing terms, from the company that had eventually acquired the assets of Mom’s original publisher.

I eventually found what I needed to know on a trip to Washington, DC.  While attending a meeting on digital format registries, I realized that I was in the same building as the Copyright Office.   So after the meeting, I got a reader’s card, went upstairs, and consulted the librarians there.  We confirmed that, under the automatic renewal laws of the time, the copyright to Mom’s book would have reverted in 2000 to whoever had been declared the “author” in the book in the original registration record.   Moreover, in the absence of any contrary arrangement, any co-owner of a copyright can authorize publication, as long as they split any proceeds with the other copyright owners.

Since I was planning just to put the book online for free, the only question remaining was: who was listed as the author on the original registration: the publisher who claimed the copyright, or my mother and Dr. Steinen?  It’s not clear from the Catalog of Copyright Entries, but the original registration certificate would state it.  And the one copy known to exist of that certificate was in the archives of the Copyright Office where I was sitting.

Twenty minutes later, I had the certificate in front of me.  The name on the “claimant” line was indeed the publisher’s, but the names on the “author” line were Steinen and Ockerbloom.  My mother’s orphan was mine to claim.

There are a lot more books out there like hers.  Since I added records for Hathi Trust’s public domain books to The Online Books Page, I’ve gotten requests to curate hundreds of out-of-print, largely forgotten books that are still meaningful to readers online.  Many of the people who opt to leave contact information live in places where books tend to be hard to get or pay for. Many others, judging from their names, seem to be related to the authors of the books they suggest. These readers have found the books after Hathi, or Google, or the Internet Archive, has resurfaced them online, and the readers want these books to live on.  If there were an easy, inexpensive, uncontroversially legal way to also bring back books that are still in copyright, but no longer commercially exploited, I’m sure I could fulfill a lot of requests for those books too.

For now, though, I’ll bring back the one orphan book I’ve been given. And I thank my mother for writing it, and the other women and men who have poured so much of their energy and teaching into their books, and the librarians of all kinds who help ensure those books stay accessible to readers who value them.  I’ll try my best to keep your legacies alive.

April 9, 2011

Opt in for open access

Filed under: copyright,libraries,online books,open access — John Mark Ockerbloom @ 8:40 am

There’s been much discussion online about Judge Chin’s long-awaited decision to reject the settlement proposed by Google and authors’ and publishers’ organizations over the Google Books service. Settlement discussions continue (and the court has ordered a status conference for April 25).  But it’s clear that it will be a while before this case is fully settled or decided.

Don’t count on a settlement to produce a comprehensive library

When the suit is finally resolved, it will not enable the comprehensive retrospective digital library I had been hoping for.  That, Chin clearly indicated, was an over-reach.  The  proposed settlement would have allowed Google to sell access to most pre-2009 books published in the English-speaking world whose rightsholders had not opted out.   But, as Chin wrote, “the case was about the use of an indexing and searching tool, not the sale of complete copyrighted works.”  The changes in the American copyright regime that the proposed settlement entailed, he wrote, were too sweeping for a court to approve.

Unless Congress makes changes in copyright law, then, a rightsholder has to opt in for a copyrighted book to be made readable on Google (or on another book site).  Chin’s opinion ends with a strong recommendation for the parties to craft a settlement that would largely be based on “opt-in”.  Of course, an “opt in” requirement necessarily excludes orphan works, where one cannot find a rightsholder to opt in.  And as John Wilkin recently pointed out, it’s likely that a lot of the books held by research libraries are orphan works.

Don’t count on authors to step up spontaneously

Chin expects that many authors will naturally want to opt in to make their works widely available, perhaps even without payment.  “Academic authors, almost by definition, are committed to maximizing access to knowledge,” he writes.  Indeed, one of the reasons he gives for rejecting the settlement is the argument, advanced by Pamela Samuelson and some other objectors, that the interests of academic and other non-commercially motivated authors are different from those of the commercial organizations that largely drove the settlement negotiations.

I think that Chin is right that many authors, particularly academics, care more about having their work appreciated by readers than about making money off of it.  And even those who want to maximize their earnings on new releases may prefer freely sharing their out of print books to keeping them locked away, or making a pittance on paywall-mediated access.  But that doesn’t necessarily mean that we’ll see all, or even most, of these works “opted in” to a universally accessible library.  We’ve had plenty of experience with institutional repositories showing us that even when authors are fine in principle with making their work freely available, most will not go out of their way to put their work in open-access repositories, unless there are strong forces mandating or proactively encouraging it.

Don’t count on Congress to solve the problem

The closest analogue to a “mandate” for making older books generally available would be orphan works legislation.    If well crafted, such a law could make a lot of books available to the public that now have no claimants, revenue, or current audience, and I hope that a coalition can come together to get a good law passed. But an orphan works law could take years to adopt (indeed, it’s already been debated for years). There’s no guarantee on how useful or fair the law that eventually gets passed would be, after all the committees and interest groups are done with it.  And even the best law would not cover many books that could go into a universal digital library.

Libraries have what it takes, if they’re proactive

On the other hand, we have an unprecedented opportunity right now to proactively encourage authors (academic or otherwise) to make their works freely available online.  As Google and various other projects continue to scan books from library collections, we now have millions of these authors’ books deposited in “dark” digital archives.  All an interested author has to do is say the word, and the dark  copy can be lit up for open access.  And libraries are uniquely positioned to find and encourage the authors in their communities to do this.

It’s now pretty easy to do, in many cases.  Hathi Trust, a coalition of a growing number of research institutions, currently has over 8 million volumes digitized from member libraries.  Most of the books are currently inaccessible due to copyright.  But they’ve published a permission agreement form that an author or other rightsholder can fill out and send in if they want to make their book freely readable online.  The form could be made a bit clearer and more visible, but it’s workable as it is.  As editor of The Online Books Page, I not infrequently hear from people who want to share their out of print books, or those of their ancestors, with the world.  Previously, I had to worry about how the books would get online.  Now I usually can just verify it’s in Hathi’s collection, and then refer them to the form.

Google Books also lets authors grant access rights through their partner program.  Joining the program is more complicated than sending in the Hathi form, and it’s more oriented towards selling books than sharing them.  But Google Books partners can declare their books freely readable in full if they wish, and can give them Creative Commons licenses (as they can with Hathi).  Google has even more digitized books in its archives than Hathi does.

So, all those who would love to see a wide-ranging (if not entirely comprehensive), globally accessible digital library now have a real opportunity to make it happen.  We don’t have to wait for Congress to act, or some new utopian digital library to arise.  Thanks to mass digitization, library coalitions like Hathi’s, and the development of simplified, streamlined rights and permissions processes, it’s easier than ever for interested authors (and heirs, and publishers) to make their work freely available online.  If those of us involved in libraries, scholarship, and the open access movement work to open up our own books, and those of our colleagues, we can light up access to the large, universal digital library that’s now waiting for us online.

July 31, 2010

Keeping subjects up to date with open data

Filed under: data,discovery,online books,open access,sharing,subjects — John Mark Ockerbloom @ 11:51 pm

In an earlier post, I discussed how I was using the open data from the Library of Congress’ Authorities and Vocabularies service to enhance subject browsing on The Online Books Page.  More recently, I’ve used the same data to make my subjects more consistent and up to date.  In this post, I’ll describe why I need to do this, and why doing it isn’t as hard as I feared that it might be.

The Library of Congress Subject Headings (LCSH) is a standard set of subject names, descriptions, and relationships, begun in 1898, and periodically updated ever since. The names of its subjects have shifted over time, particularly in recent years.  For instance, recently subject terms mentioning “Cookery”, a word more common in the 1800s than now, were changed to use the word “Cooking”, a term that today’s library patrons are much more likely to use.

It’s good for local library catalogs that use LCSH to keep in sync with the most up to date version, not only to better match modern usage, but also to keep catalog records consistent with each other.  Especially as libraries share their online books and associated catalog records, it’s particularly important that books on the same subject use the same, up-to-date terms.  No one wants to have to search under lots of different headings, especially obsolete ones, when they’re looking for books on a particular topic.

Libraries with large, long-standing catalogs often have a hard time staying current, however.  The catalog of the university library where I work, for instance, still has some books on airplanes filed under “Aeroplanes”, a term that recalls the long-gone days when open-cockpit daredevils dominated the air.  With new items arriving every day to be cataloged, though, keeping millions of legacy records up to date can be seen as more trouble than it’s worth.

But your catalog doesn’t have to be big or old to fall out of sync.  It happens faster than you might think.   The Online Books Page currently has just over 40,000 records in its catalog, about 1% of the size of my university’s.   I only started adding LC subject headings in 2006.  I tried to make sure I was adding valid subject headings, and made changes when I heard about major term renamings (such as “Cookery” to “Cooking”).  Still, I was startled to find out that only 4 years after I’d started, hundreds of subject headings I’d assigned were already out of date, or otherwise replaced by other standardized headings.  Fortunately, I was able to find this out, and bring the records up to date, in a matter of hours, thanks to automated analysis of the open data from the Library of Congress.  Furthermore, as I updated my records manually, I became confident I could automate most of the updates, making the job faster still.

Here’s how I did it.  After downloading a fresh set of LC subject headings records in RDF, I ran a script over the data that compiled an index of authorized headings (the proper ones to use), alternate headings (the obsolete or otherwise discouraged headings), and lists of which authorized headings were used for which alternate headings. The RDF file currently contains about 390,000 authorized subject headings, and about 330,000 alternate headings.
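As a rough illustration of the index described above, here is a minimal Python sketch. It is not the script I ran: it assumes the RDF has already been parsed into (authorized heading, list of alternate headings) pairs, and the function name `build_heading_index` and the sample records are hypothetical.

```python
from collections import defaultdict


def build_heading_index(records):
    """Build lookup structures from (authorized, alternates) pairs.

    `records` is an assumed, already-parsed form of the subject data:
    an iterable of (authorized_heading, list_of_alternate_headings).
    """
    authorized = set()
    alternates = defaultdict(set)  # alternate heading -> authorized headings
    for preferred, alts in records:
        authorized.add(preferred)
        for alt in alts:
            alternates[alt].add(preferred)
    return authorized, dict(alternates)


# Tiny illustrative sample, using headings mentioned in this post.
sample_records = [
    ("Cooking", ["Cookery"]),
    ("Airplanes", ["Aeroplanes"]),
    ("Queens", ["Royalty"]),
    ("Princes", ["Royalty"]),
]

authorized, alternates = build_heading_index(sample_records)
print(sorted(alternates["Royalty"]))  # ['Princes', 'Queens']
```

Mapping each alternate heading to a set of authorized headings, rather than to a single one, keeps the cases with several possible replacements visible instead of silently picking one.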

Then I extracted all the subjects from my catalog.  (I currently have about 38,000 unique subjects.)  Then I had a script check each subject to see if it was listed as an authorized heading in the RDF file.  If not, the script checked to see if it was an alternate heading.  If neither was the case, and the subject had subdivisions (e.g. “Airplanes -- History”), it removed a subdivision from the end and repeated the checks until a term was found in either the authorized or alternate category, or it ran out of subdivisions.
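The checking loop can be sketched in a few lines of Python. This is an illustration under assumptions, not my actual script: the function name `classify_subject`, the " -- " subdivision separator, and the small sample sets are all made up for the example.

```python
SEP = " -- "  # assumed separator between a heading and its subdivisions


def classify_subject(subject, authorized, alternates):
    """Return (status, matched_term, dropped_subdivisions).

    Tries the full subject first, then drops one subdivision at a time
    from the end until a term is found or nothing is left to drop.
    """
    parts = subject.split(SEP)
    for end in range(len(parts), 0, -1):
        term = SEP.join(parts[:end])
        dropped = parts[end:]
        if term in authorized:
            return "authorized", term, dropped
        if term in alternates:
            return "alternate", term, dropped
    return "unknown", None, []


# Hypothetical sample data for illustration.
authorized = {"Airplanes", "Cooking"}
alternates = {"Aeroplanes": {"Airplanes"}, "Cookery": {"Cooking"}}

print(classify_subject("Airplanes -- History", authorized, alternates))
print(classify_subject("Aeroplanes -- History", authorized, alternates))
print(classify_subject("Zeppelins", authorized, alternates))
```

Subjects that come back as "alternate" are the candidates for replacement. Subjects that come back as "unknown" are not necessarily wrong, since some kinds of headings may simply be absent from the data file.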

This turned up 286 unique subjects that needed replacement– over 3/4 of 1% of my headings, in less than 4 years.  (My script originally identified even more, until I realized I had to ignore the simple geographic or personal names.  Those aren’t yet in LC’s RDF file, but a few of them show up as alternate headings for other subjects.)  These 286 headings (some of them the same except for subdivisions) represented 225 distinct substitutions.  The bad headings were used in hundreds of bibliographic records, the most popular full heading being used 27 times. The vast majority of the full headings, though, were used in only one record.

What was I to replace these headings with?  Some of the headings had multiple possibilities. “Royalty” was an alternate heading for 5 different authorized headings: “Royal houses”, “Kings and rulers”, “Queens”, “Princes” and “Princesses”.   But that was the exception rather than the rule.  All but 10 of my bad headings were alternates for only one authorized heading.  After “Royalty”, the remaining 9 alternate headings presented a choice between two authorized forms.

When there’s only 1 authorized heading to go to, it’s pretty simple to have a script do the substitution automatically.  As I verified while doing the substitutions manually, the automatable substitution nearly always made sense.  (There were a few that didn’t: for instance, when “Mind and body -- Early works to 1850” is replaced by “Mind and body -- Early works to 1800”, works first published between 1800 and 1850 get misfiled.  But few substitutions were problematic like this– and those involving dates, like this one, can be flagged by a clever script.)
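One way a script might flag the date-related cases is to compare the four-digit years in the old and new headings. The following Python sketch is only an illustration. The function name `propose_substitution` and the rule of treating any change in years as needing review are my assumptions here, not a description of an existing tool.

```python
import re

YEAR = re.compile(r"\b\d{4}\b")  # crude match for four-digit years


def propose_substitution(old_heading, targets):
    """Return (new_heading, needs_review) for one obsolete heading.

    `targets` is the set of authorized headings the old one maps to.
    With several candidates, no automatic choice is made.
    """
    if len(targets) != 1:
        return None, True
    new_heading = next(iter(targets))
    # A change in the years mentioned may shift which works belong here.
    years_changed = set(YEAR.findall(old_heading)) != set(YEAR.findall(new_heading))
    return new_heading, years_changed


print(propose_substitution("Cookery", {"Cooking"}))
print(propose_substitution(
    "Mind and body -- Early works to 1850",
    {"Mind and body -- Early works to 1800"},
))
print(propose_substitution("Royalty", {"Queens", "Princes"}))
```

The first call yields a substitution that can be applied directly, the second yields one flagged for a human to look at, and the third declines to choose.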

If I were doing the update over again, I’d feel more comfortable letting a script automatically reassign, and not just identify, most of my obsolete headings.  I’d still want to manually inspect changes that affect more than one or two records, to make sure I wasn’t messing up lots of records in the same way; and I’d also want to manually handle cases where more than one term could be substituted.  The rest– the vast majority of the edits– could be done fully automatically.  The occasional erroneous reassignment of a single record would be more than made up for by the repair of many more obsolete and erroneous old records.  (And if my script logs changes properly, I can roll back problematic ones later on if need be.)
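The policy above could be sketched roughly as follows in Python. Everything here is hypothetical: the function names, the threshold of two records, the record layout (a dictionary from record ID to a list of headings), and the usage counts are all invented for illustration.

```python
def triage(substitutions, usage_counts, max_auto_records=2):
    """Split obsolete headings into automatic and manual-review groups."""
    auto, manual = [], []
    for old, targets in substitutions.items():
        count = usage_counts.get(old, 0)
        if len(targets) == 1 and count <= max_auto_records:
            auto.append((old, next(iter(targets))))
        else:
            manual.append(old)  # several candidates, or too many records
    return auto, manual


def apply_with_log(records, auto):
    """Apply substitutions in place; return a log that allows rollback."""
    mapping = dict(auto)
    log = []
    for rec_id, headings in records.items():
        for i, heading in enumerate(headings):
            if heading in mapping:
                log.append((rec_id, i, heading, mapping[heading]))
                headings[i] = mapping[heading]
    return log


def roll_back(records, log):
    """Undo logged changes, most recent first."""
    for rec_id, i, old, _new in reversed(log):
        records[rec_id][i] = old


# Made-up sample data.
substitutions = {
    "Aeroplanes": {"Airplanes"},
    "Cookery": {"Cooking"},
    "Royalty": {"Queens", "Princes"},
}
usage_counts = {"Aeroplanes": 1, "Cookery": 5, "Royalty": 1}
records = {"rec1": ["Aeroplanes", "Flight"], "rec2": ["Cookery"]}

auto, manual = triage(substitutions, usage_counts)
log = apply_with_log(records, auto)
print(auto, manual, log)
```

Keeping the old and new heading together in each log entry is what makes a later rollback a simple replay in reverse.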

Mind you, now that I’ve brought my headings up to date once, I expect that further updates will be quicker anyway.  The Library of Congress releases new LCSH RDF files about every 1-2 months.  There should be many fewer changes in most such incremental updates than there would be when doing years’ worth of updates all at once.

Looking at the evolution of the Library of Congress catalog over time, I suspect that they do a lot of this sort of automatic updating already.  But many other libraries don’t, or don’t do it thoroughly or systematically.  With frequent downloads of updated LCSH data, and good automated procedures, I suspect that many more could.  I have plans to analyze some significantly larger, older, and more diverse collections of records to find out whether my suspicions are justified, and hope to report on my results in a future post.  For now, I’d like to thank the Library of Congress once again for publishing the open data that makes these sorts of catalog investigations and improvements feasible.

June 20, 2010

How we talk about the president: A quick exploration in Google Books

Filed under: data,online books,sharing — John Mark Ockerbloom @ 10:28 pm

On The Online Books Page, I’ve been indexing a collection of memorial sermons on President Abraham Lincoln, all published shortly after his assassination, and digitized by Emory University.  Looking through them, I was struck by how often Lincoln was referred to as “our Chief Magistrate”.  That’s a term you don’t hear much nowadays, but was once much more common. Lincoln himself used the term in his first inaugural address, and he was far from the first person to do so.

Nowadays you’re more likely to hear the president referred to in different terms, with somewhat different connotations, such as “chief executive” or “commander-in-chief”.  The Constitution uses that last term in reference to the president’s command over the Armed Forces. Lately, though, I’ve heard “commander in chief” used as if it referred to the country in general.  As someone wary of the expansion of executive power in recent years, I find that usage unsettling.

I wondered, as I went through the Emory collection, whether the terms we use for the president reflect shifts in the role he has played over American history.  Is he called “commander in chief” more in times of war or military buildup, for instance?  How often was he instead called “chief magistrate” or “chief executive” over the course of American history?  And how long did “chief magistrate” stay in common use, and what replaced it?

Not too long ago, those questions would have simply remained idle curiosity.  Perhaps, if I’d had the time and patience, I could have painstakingly compiled a small selection of representative writings from various points in US history, read through them, and tried to draw conclusions from them.  But now I– and anyone else on the web– also have a big searchable, dated corpus of text to query: the Google Books collection.  Could that give me any insight into my questions?

It looks like it can, and without too much expenditure of time.  I’m by no means an expert on corpus analysis, but in a couple of hours of work, I was able to assemble promising-looking data that turned up some unexpected (but plausible) results.  Below, I’ll describe what I did and what I found out.

I started out by going to the advanced book search for Google.  From there, I specified particular decades for publications over the last 200 years: 1810-1819, 1820-1829, and so on up to 2000-2009.  For each decade, I recorded how many hits Google reported for the phrases “chief magistrate”, “chief executive”, and “commander in chief”, in volumes that also contained the word “president”.  Because the scope of Google’s collection may vary in different decades, I also recorded the total number of volumes in each decade containing the word “president”.  I then divided the number of phrase+”president” hits by the number of “president” hits, and graphed the proportional occurrences of each phrase in each decade.
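For anyone who wants to repeat the arithmetic, here is a small Python sketch of the normalization step. The hit counts in it are invented placeholders, not the numbers I recorded, and the function names are hypothetical.

```python
def relative_frequency(phrase_hits, president_hits):
    """Divide phrase+"president" hits by "president" hits, per decade."""
    return {
        decade: phrase_hits[decade] / president_hits[decade]
        for decade in phrase_hits
        if president_hits.get(decade)  # skip decades with no baseline hits
    }


def axis_label(decade_start):
    """Map a decade's first year to a 1-20 axis number (1810 -> 1)."""
    return (decade_start - 1800) // 10


decades = list(range(1810, 2010, 10))  # 1810-1819 up to 2000-2009

# Placeholder counts for two decades, purely for illustration.
phrase_hits = {1810: 50, 1820: 120}
president_hits = {1810: 1000, 1820: 2000}

for decade, share in relative_frequency(phrase_hits, president_hits).items():
    print(axis_label(decade), decade, round(share, 3))
```

Dividing by the number of volumes mentioning “president” in the same decade is what makes decades with very different corpus sizes comparable.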

The graph below shows the results.  The blue line tracks “chief magistrate”, the orange line tracks “chief executive”, and the green line tracks “commander in chief”.  The numbers in the horizontal axis refer to the decade+1800s; e.g. 1 is the 1810s, 2 is 1820s, all the way up to 20 being the 2000s.

Relative frequencies of "chief magistrate", "chief executive", and "commander in chief" used along with "president", by decade, 1810s-2000s

The graph suggests that “chief magistrate” was popular in the 19th century, peaking in the 1830s.  “Chief executive” arose from obscurity in the late 19th century,  overtook “chief magistrate” in the early 20th century, and then became widely used, apparently peaking in the 1980s.  (Though by then some corporate executives– think “chief executive officer”– are in the result set along with the US president.)

We don’t see a strong trend with “commander in chief”.  There are some peaks in usage in the 1830s, and the 1960s and 1970s, but they’re not dominant, and they don’t obviously correspond to any particular set of events.  What’s going on?  Was I just imagining a relation between its usage and military buildups?  Is the Google data skewed somehow?  Or is something else going on?

It’s true that the Google corpus is imperfect, as I and others have noted before.  The metadata isn’t always accurate; the number of reported hits is approximate when more than 100 or so, and the mix of volumes in Google’s corpus varies in different time periods.  (For instance, recent years of the corpus may include more magazine content than earlier years; and reprints can make texts reappear decades after they were actually written.  The rise of print-on-demand scans of old public-domain books in the 2000s may be partly responsible for the uptick in “chief magistrate” that decade, for instance.)

But I might also not be looking at the right data.  There are lots of reasons to mention “commander-in-chief” at various times.  The apparent trend that concerned me, though, was the use of “commander in chief” as an all-encompassing term.  Searching for the phrase “our commander in chief” with “president” might be better at identifying that. That search doesn’t distinguish military from civilian uses of that phrase, but an uptick in usage would indicate either a greater military presence in the published record, or a more militarized view among civilians.  So either way, it should reflect a more militaristic view of the president’s role.

Indeed, when I graph the relative occurrences of “our commander in chief” over time, the trend line looks rather different than before.  Here it is below, with the decades labeled the same way as in the first graph:

Scaled frequency of "our commander in chief" used along with "president", by decade, 1810s-2000s

Here we see increases in decades that saw major wars, including the War of 1812, the Mexican War of the 1840s, the Civil War of the 1860s, and the Vietnam War expanding in the 1970s.  This past decade had the second most-frequent usage (by a small margin) of “our commander in chief” in the last 200 years of this corpus.  But it’s dwarfed by the use during the 1940s, when Americans fought in World War II.  That’s not something I’d expected, but given the total mobilization that occurred between 1941 and 1945, it makes sense.

If we look more closely at the frequency of “our commander in chief” in the last 20 years, we also find interesting results. The graph below looks at 1991 through 2009 (add 1990 to each number on the horizontal axis):

Scaled frequency of "our commander in chief" used along with "president", by year, 1991-2009

Not too surprisingly, after the successful Gulf War in early 1991, usage starts to decrease.  And not long after 9/11, usage increases notably, and stays high in the years to follow.  (Books take some time to go from manuscript to publication, but we see a local high by 2002, and higher usage in most of the subsequent years.)  I was a bit surprised, though, to find an initial spike in usage in 1999.  As seen in this timeline, Bill Clinton’s impeachment and trial took place in late 1998 and early 1999, and a number of the hits during this time period are in the context of questioning Clinton’s fitness to be “our commander in chief” in the light of the Lewinsky scandal.  But once public interest moved on to the 2000 elections, in which Clinton was not a candidate, usage dropped off again until the 9/11 attacks and the wars that followed.

I don’t want to oversell the importance of these searches.  Google Books search is a crude instrument for literary analysis, and I’m still a novice at corpus analysis (and at generating Excel graphs).  But my searches suggest that the corpus can be a useful tool for identifying and tracking large-scale trends in certain kinds of expression.  It’s not a substitute for the close reading that most humanities scholarship requires.  And even with the “distant reading” of searches, you still need to look at a sampling of your results to make sure you understand what you’re finding, and aren’t copying down numbers blindly.

But with those caveats, the Google Books corpus supports an enlightening high-altitude perspective on literature and culture.  The corpus is valuable not just for its size and searchability, but also for its public accessibility.  When I report on an experiment like this, anyone else who wants to can double-check my results, or try some followup searches of their own.  (Exact numbers will naturally shift somewhat over time as more volumes get added to the corpus.)  To the extent that searching, snippets, and text are open to all, Google Books can be everybody’s literary research laboratory.

May 6, 2010

Making discovery smarter with open data

Filed under: architecture,discovery,online books,open access,sharing,subjects — John Mark Ockerbloom @ 9:06 am

I’ve just made a significant data enhancement to subject browsing on The Online Books Page.  It improves the concept-oriented browsing of my catalog of online books via subject maps, where users explore a subject along multiple dimensions from a starting point of interest.

Say you’d like to read some books about logic, for instance.  You’d rather not have to go find and troll all the appropriate shelf sections within math, philosophy, psychology, computing, and wherever else logic books might be found in a physical library.  And you’d rather not have to think of all the different keywords used to identify different logic-related topics in a typical online catalog. In my subject map for logic, you can see lots of suggestions of books filed both under “Logic” itself, and under related concepts.  You can go straight to a book that looks interesting, select a related subject and explore that further, or select the “i” icon next to a particular book to find more books like it.

As I’ve noted previously, the relationships and explanations that enable this sort of exploration depend on a lot of data, which has to come from somewhere.  In previous versions of my catalog, most of it came from a somewhat incomplete and not-fully-up-to-date set of authority records in our local catalog at Penn.  But the Library of Congress (LC) has recently made authoritative subject cataloging data freely available on a new website.  There, you can query it through standard interfaces, or simply download it all for analysis.

I recently downloaded their full data set (38 MB of zipped RDF), processed it, and used it to build new subject maps for The Online Books Page.   The resulting maps are substantially richer than what I had before.  My collection is fairly small by the standards of mass digitization– just shy of 40,000 items– but still, the new data, after processing, yielded over 20,000 new subject relationships, and over 600 new notes and explanations, for the subjects represented in the collection.

That’s particularly impressive when you consider that, in some ways, the RDF data is cruder than what I used before.  The RDF schemas that LC uses omit many of the details and structural cues that are in the MARC subject authority records at the Library of Congress (and at Penn).  And LC’s RDF file is also missing many subjects that I use in my catalog; in particular, at present it omits many records for geographic, personal, and organizational names.

Even so, I lost few relationships that were in my prior maps, and I gained many more.  There were two reasons for this:  First of all, LC’s file includes a lot of data records (many times more than my previous data source), and they’re more recent as well.  Second, a variety of automated inference rules– lexical, structural, geographic, and bibliographic– let me create additional links between concepts with little or no explicit authority data.  So even though LC’s RDF file includes no record for Ontario, for instance, its subject map in my collection still covers a lot of ground.
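To give a feel for what a simple lexical rule of this kind might look like, here is a minimal sketch in Python.  It is not the code behind The Online Books Page; the function name `infer_subdivision_links` and the use of " -- " as the subdivision separator are assumptions made for illustration.  The idea is that a subdivided heading can be linked to its parent heading without any explicit authority record saying so.

```python
from collections import defaultdict

# Assumed convention for this sketch: subdivisions are joined with " -- ".
SEPARATOR = " -- "

def infer_subdivision_links(headings):
    """Link each subdivided heading to its parent heading, if the parent is known.

    Returns a dict mapping a parent heading to a sorted list of the
    narrower, subdivided headings found under it.
    """
    known = set(headings)
    links = defaultdict(set)
    for heading in headings:
        parts = heading.split(SEPARATOR)
        if len(parts) > 1:
            # Drop the last subdivision to get the candidate parent.
            parent = SEPARATOR.join(parts[:-1])
            if parent in known:
                links[parent].add(heading)
    return {parent: sorted(children) for parent, children in links.items()}

# Hypothetical sample headings, not taken from any real data file.
headings = [
    "Logic",
    "Logic -- History",
    "Logic -- Early works to 1800",
    "Logic, Symbolic and mathematical",
]
links = infer_subdivision_links(headings)
```

A rule like this is purely string-based, so it needs no authority record at all; a real system would likely combine several such rules and check them against the explicit data where it exists.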

A few important things make these subject maps possible, and will help them get better in the future:

  • A large, shared, open knowledge base: The Library of Congress Subject Headings have been built up by dedicated librarians at many institutions over more than a century.  As a shared, evolving resource, the data set supports unified searching and browsing over numerous collections, including mine.  The work of keeping it up to date, and in sync with the terms that patrons use to search, can potentially be spread out among many participants.  As an open resource, the data set can be put to a variety of uses that both increase the value of our libraries and encourage the further development of the knowledge base.
  • Making the most of automation: LC’s website and standards make it easy for me to download and process their data automatically. Once I’ve loaded their data, and my own records, I then invoke a set of automated rules to infer additional subject relationships.  None of the rules is especially complex; but put together, they do a lot to enhance the subject maps. Since the underlying data is open, anyone else is also free to develop new rules or analyses (or adapt mine, once I release them).  If a community of analyzers develops, we can learn from each other as we go.  And perhaps some of the relationships we infer through automation can be incorporated directly into later revisions of LC’s own subject data.
  • Judicious use of special-purpose data: It is sometimes useful to add to or change data obtained from external sources.  For example, I maintain a small supplementary data file on major geographic areas.  A single data record saying that Ontario is a region within Canada, and is abbreviated “Ont.”, generates much of my subject map for Ontario.  Soon, I should also be able to re-incorporate local subject records, as well as arbitrary additional overlays, to fill in conceptual gaps in LC’s file.  Since local customizations can take  a lot of effort to maintain, however, it’s best to try to incorporate local data into shared knowledge bases when feasible.  That way, others can benefit from, and add on to, your own work.
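The Ontario example in the last point can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the record layout (`name`, `within`, `abbrev`), the function `geographic_links`, and the relationship labels are all hypothetical, and the actual supplementary file and rules may look quite different.

```python
def geographic_links(region, headings):
    """Infer subject links from one supplementary geographic record.

    `region` is a hypothetical record with the keys "name", "within",
    and "abbrev".  Returns a list of (heading, relation, target) tuples.
    """
    name = region["name"]
    parent = region["within"]
    qualifier = "(" + region["abbrev"] + ")"

    # The region itself sits under its containing region.
    links = [(name, "broader", parent)]

    for heading in headings:
        if heading.endswith(qualifier):
            # A place qualified by the abbreviation, e.g. "Toronto (Ont.)".
            links.append((heading, "broader", name))
        elif heading.startswith(name + " -- "):
            # A subdivided heading maps to the same subdivision of the parent.
            links.append((heading, "related", parent + heading[len(name):]))
    return links

# Hypothetical record and headings, for illustration only.
ontario = {"name": "Ontario", "within": "Canada", "abbrev": "Ont."}
headings = ["Toronto (Ont.)", "Ontario -- History", "Logic"]
result = geographic_links(ontario, headings)
```

The point of the sketch is that one small record can fan out into many links once it is matched against the headings already in a catalog.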

Recently, there’s been a fair bit of debate about whether to treat cataloging data as an open public good, or to keep it more restricted.  The Library of Congress’ catalog data has been publicly accessible online for years, though until recently you could only get a little at a time via manual searches, or pay a large sum to get a one-time data dump.  By creating APIs, using standard semantic XML formats, and providing free, unrestricted data downloads for their subject authority data, LC has made their data much easier for others to use in a variety of ways. It’s improved my online book catalog significantly, and can also improve many other catalogs and discovery applications.  Those of us who use this data, in turn, have incentives to work to improve and sustain it.

Making the LC Subject Headings ontology open data makes it both more useful and more viable as libraries evolve.  I thank the folks at the Library of Congress for their openness with their data, and I hope to do my part in improving and contributing to their work as well.
