Everybody's Libraries

July 31, 2010

Keeping subjects up to date with open data

Filed under: data,discovery,online books,open access,sharing,subjects — John Mark Ockerbloom @ 11:51 pm

In an earlier post, I discussed how I was using the open data from the Library of Congress’ Authorities and Vocabularies service to enhance subject browsing on The Online Books Page.  More recently, I’ve used the same data to make my subjects more consistent and up to date.  In this post, I’ll describe why I need to do this, and why doing it isn’t as hard as I feared that it might be.

The Library of Congress Subject Headings (LCSH) is a standard set of subject names, descriptions, and relationships, begun in 1898, and periodically updated ever since. The names of its subjects have shifted over time, particularly in recent years.  For instance, subject terms mentioning “Cookery”, a word more common in the 1800s than now, were recently changed to use the word “Cooking”, a term that today’s library patrons are much more likely to use.

It’s good for local library catalogs that use LCSH to keep in sync with the most up to date version, not only to better match modern usage, but also to keep catalog records consistent with each other.  Especially as libraries share their online books and associated catalog records, it’s particularly important that books on the same subject use the same, up-to-date terms.  No one wants to have to search under lots of different headings, especially obsolete ones, when they’re looking for books on a particular topic.

Libraries with large, long-standing catalogs often have a hard time staying current, however.  The catalog of the university library where I work, for instance, still has some books on airplanes filed under “Aeroplanes”, a term that recalls the long-gone days when open-cockpit daredevils dominated the air.  With new items arriving every day to be cataloged, though, keeping millions of legacy records up to date can be seen as more trouble than it’s worth.

But your catalog doesn’t have to be big or old to fall out of sync.  It happens faster than you might think.   The Online Books Page currently has just over 40,000 records in its catalog, about 1% of the size of my university’s.   I only started adding LC subject headings in 2006.  I tried to make sure I was adding valid subject headings, and made changes when I heard about major term renamings (such as “Cookery” to “Cooking”).  Still, I was startled to find out that only 4 years after I’d started, hundreds of subject headings I’d assigned were already out of date, or otherwise replaced by other standardized headings.  Fortunately, I was able to find this out, and bring the records up to date, in a matter of hours, thanks to automated analysis of the open data from the Library of Congress.  Furthermore, as I updated my records manually, I became confident I could automate most of the updates, making the job faster still.

Here’s how I did it.  After downloading a fresh set of LC subject headings records in RDF, I ran a script over the data that compiled an index of authorized headings (the proper ones to use), alternate headings (the obsolete or otherwise discouraged headings), and lists of which authorized headings were used for which alternate headings. The RDF file currently contains about 390,000 authorized subject headings, and about 330,000 alternate headings.
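In outline, that indexing step might look something like this (a simplified Python sketch, not the actual script; the record structure and sample headings are illustrative stand-ins for whatever your RDF parser emits):

```python
def build_heading_index(records):
    """Build lookup tables from parsed subject authority records.

    Each record is assumed to hold the authorized form under "pref"
    and any alternate (deprecated) forms under "alts".
    """
    authorized = set()
    alt_to_pref = {}  # alternate heading -> list of authorized headings using it
    for rec in records:
        authorized.add(rec["pref"])
        for alt in rec.get("alts", []):
            alt_to_pref.setdefault(alt, []).append(rec["pref"])
    return authorized, alt_to_pref

# Illustrative records, mirroring renamings discussed in the post:
records = [
    {"pref": "Cooking", "alts": ["Cookery"]},
    {"pref": "Airplanes", "alts": ["Aeroplanes"]},
]
authorized, alt_to_pref = build_heading_index(records)
print(alt_to_pref["Cookery"])  # ['Cooking']
```

Note that an alternate heading can map to more than one authorized heading, so the index keeps a list rather than a single value.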

Then I extracted all the subjects from my catalog.  (I currently have about 38,000 unique subjects.)  Next, I had a script check each subject to see if it was listed as an authorized heading in the RDF file.  If not, I checked to see if it was an alternate heading.  If neither was the case, and the subject had subdivisions (e.g. “Airplanes — History”), I removed a subdivision from the end and repeated the checks until a term was found in either the authorized or alternate category, or I ran out of subdivisions.
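The checking loop just described can be sketched as follows (a hedged Python sketch, using " -- " as the subdivision separator in MARC-like style, and toy lookup tables in place of the full LCSH index):

```python
def resolve_heading(heading, authorized, alt_to_pref, sep=" -- "):
    """Return the heading itself if authorized, a suggested replacement
    if (a prefix of) it is an alternate heading, or None otherwise."""
    parts = heading.split(sep)
    # Try the full heading first, then progressively drop subdivisions.
    for n in range(len(parts), 0, -1):
        prefix = sep.join(parts[:n])
        rest = parts[n:]
        if prefix in authorized:
            return heading  # already up to date
        if prefix in alt_to_pref:
            prefs = alt_to_pref[prefix]
            if len(prefs) == 1:
                # Unambiguous: substitute and keep remaining subdivisions.
                return sep.join([prefs[0]] + rest)
            return None  # ambiguous (e.g. "Royalty"): flag for manual review
    return None  # not found at all (e.g. a name not in the RDF file)

authorized = {"Airplanes", "Airplanes -- History"}
alt_to_pref = {"Aeroplanes": ["Airplanes"]}
print(resolve_heading("Aeroplanes -- History", authorized, alt_to_pref))
```

Here "Aeroplanes -- History" matches the alternate heading "Aeroplanes" after one subdivision is dropped, and the suggested replacement keeps the "History" subdivision attached to the authorized form.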

This turned up 286 unique subjects that needed replacement– over 3/4 of 1% of my headings, in less than 4 years.  (My script originally identified even more, until I realized I had to ignore simple geographic and personal names.  Those aren’t yet in LC’s RDF file, but a few of them show up as alternate headings for other subjects.)  These 286 headings (some of them the same except for subdivisions) represented 225 distinct substitutions.  The bad headings were used in hundreds of bibliographic records, the most popular full heading being used 27 times.  The vast majority of the full headings, though, were used in only one record.

What was I to replace these headings with?  Some of the headings had multiple possibilities. “Royalty” was an alternate heading for 5 different authorized headings: “Royal houses”, “Kings and rulers”, “Queens”, “Princes” and “Princesses”.   But that was the exception rather than the rule.  All but 10 of my bad headings were alternates for only one authorized heading.  After “Royalty”, the remaining 9 alternate headings presented a choice between two authorized forms.

When there’s only 1 authorized heading to go to, it’s pretty simple to have a script do the substitution automatically.  As I verified while doing the substitutions manually, nearly all the time the automatable substitution made sense.  (There were a few that didn’t: for instance, when “Mind and body — Early works to 1850” is replaced by “Mind and body — Early works to 1800”, works first published between 1800 and 1850 get misfiled.  But few substitutions were problematic like this– and those involving dates, like this one, can be flagged by a clever script.)
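The date check mentioned above is easy to approximate: a substitution whose old and new forms contain different four-digit years is worth flagging for human review (a rough heuristic sketch, not LC's own rules):

```python
import re

def needs_date_review(old_heading, new_heading):
    """Flag substitutions where the embedded dates differ, such as
    'Early works to 1850' -> 'Early works to 1800'."""
    years = lambda h: re.findall(r"\b\d{4}\b", h)
    return years(old_heading) != years(new_heading)

print(needs_date_review(
    "Mind and body -- Early works to 1850",
    "Mind and body -- Early works to 1800"))  # True: the dates differ
```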

If I were doing the update over again, I’d feel more comfortable letting a script automatically reassign, and not just identify, most of my obsolete headings.  I’d still want to manually inspect changes that affect more than one or two records, to make sure I wasn’t messing up lots of records in the same way; and I’d also want to manually handle cases where more than one term could be substituted.  The rest– the vast majority of the edits– could be done fully automatically.  The occasional erroneous reassignment of a single record would be more than made up for by the repair of many more obsolete and erroneous old records.  (And if my script logs changes properly, I can roll back problematic ones later on if need be.)

Mind you, now that I’ve brought my headings up to date once, I expect that further updates will be quicker anyway.  The Library of Congress releases new LCSH RDF files about every 1-2 months.  There should be many fewer changes in most such incremental updates than there would be when doing years’ worth of updates all at once.

Looking at the evolution of the Library of Congress catalog over time, I suspect that they do a lot of this sort of automatic updating already.  But many other libraries don’t, or don’t do it thoroughly or systematically.  With frequent downloads of updated LCSH data, and good automated procedures, I suspect that many more could.  I have plans to analyze some significantly larger, older, and more diverse collections of records to find out whether my suspicions are justified, and hope to report on my results in a future post.  For now, I’d like to thank the Library of Congress once again for publishing the open data that makes these sorts of catalog investigations and improvements feasible.

June 20, 2010

How we talk about the president: A quick exploration in Google Books

Filed under: data,online books,sharing — John Mark Ockerbloom @ 10:28 pm

On The Online Books Page, I’ve been indexing a collection of memorial sermons on President Abraham Lincoln, all published shortly after his assassination, and digitized by Emory University.  Looking through them, I was struck by how often Lincoln was referred to as “our Chief Magistrate”.  That’s a term you don’t hear much nowadays, but was once much more common. Lincoln himself used the term in his first inaugural address, and he was far from the first person to do so.

Nowadays you’re more likely to hear the president referred to in different terms, with somewhat different connotations, such as “chief executive” or “commander-in-chief”.  The Constitution uses that last term in reference to the president’s command over the Armed Forces. Lately, though, I’ve heard “commander in chief” used as if it referred to the country in general.  As someone wary of the expansion of executive power in recent years, I find that usage unsettling.

I wondered, as I went through the Emory collection, whether the terms we use for the president reflect shifts in the role he has played over American history.  Is he called “commander in chief” more in times of war or military buildup, for instance?  How often was he instead called “chief magistrate” or “chief executive” over the course of American history?  And how long did “chief magistrate” stay in common use, and what replaced it?

Not too long ago, those questions would have simply remained idle curiosity.  Perhaps, if I’d had the time and patience, I could have painstakingly compiled a small selection of representative writings from various points in US history, read through them, and tried to draw conclusions from them.  But now I– and anyone else on the web– also have a big searchable, dated corpus of text to query: the Google Books collection.  Could that give me any insight into my questions?

It looks like it can, and without too much expenditure of time.  I’m by no means an expert on corpus analysis, but in a couple of hours of work, I was able to assemble promising-looking data that turned up some unexpected (but plausible) results.  Below, I’ll describe what I did and what I found out.

I started out by going to the advanced book search for Google.  From there, I specified particular decades for publications over the last 200 years: 1810-1819, 1820-1829, and so on up to 2000-2009.  For each decade, I recorded how many hits Google reported for the phrases “chief magistrate”, “chief executive”, and “commander in chief”, in volumes that also contained the word “president”.  Because the scope of Google’s collection may vary in different decades, I also recorded the total number of volumes in each decade containing the word “president”.  I then divided the number of phrase+”president” hits by the number of “president” hits, and graphed the proportional occurrences of each phrase in each decade.
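The normalization step is just a ratio per decade.  A minimal sketch, with hypothetical hit counts standing in for the numbers recorded from the advanced search (Google only reports approximate counts, so these inputs are purely illustrative):

```python
def relative_frequency(phrase_hits, president_hits):
    """Proportion of volumes containing 'president' in a decade
    that also contain the phrase of interest."""
    return phrase_hits / president_hits if president_hits else 0.0

# Hypothetical counts for one decade (not Google's actual numbers):
print(relative_frequency(120, 4800))  # 0.025
```

Dividing by the count of "president" volumes, rather than graphing raw hits, controls for the varying size of Google's corpus from decade to decade.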

The graph below shows the results.  The blue line tracks “chief magistrate”, the orange line tracks “chief executive”, and the green line tracks “commander in chief”.  The numbers in the horizontal axis refer to the decade+1800s; e.g. 1 is the 1810s, 2 is 1820s, all the way up to 20 being the 2000s.

Relative frequencies of "chief magistrate", "chief executive", and "commander in chief" used along with "president", by decade, 1810s-2000s

You can see a larger view of this graph, and the other graphs in this post, by clicking on it.

The graph suggests that “chief magistrate” was popular in the 19th century, peaking in the 1830s.  “Chief executive” arose from obscurity in the late 19th century,  overtook “chief magistrate” in the early 20th century, and then became widely used, apparently peaking in the 1980s.  (Though by then some corporate executives– think “chief executive officer”– are in the result set along with the US president.)

We don’t see a strong trend with “commander in chief”.  There are some peaks in usage in the 1830s, and the 1960s and 1970s, but they’re not dominant, and they don’t obviously correspond to any particular set of events.  What’s going on?  Was I just imagining a relation between its usage and military buildups?  Is the Google data skewed somehow?  Or is something else going on?

It’s true that the Google corpus is imperfect, as I and others have noted before.  The metadata isn’t always accurate; the number of reported hits is approximate when more than 100 or so, and the mix of volumes in Google’s corpus varies in different time periods.  (For instance, recent years of the corpus may include more magazine content than earlier years; and reprints can make texts reappear decades after they were actually written.  The rise of print-on-demand scans of old public-domain books in the 2000s may be partly responsible for the uptick in “chief magistrate” that decade, for instance.)

But I might also not be looking at the right data.  There are lots of reasons to mention “commander-in-chief” at various times.  The apparent trend that concerned me, though, was the use of “commander in chief” as an all-encompassing term.  Searching for the phrase “our commander in chief” with “president” might be better at identifying that. That search doesn’t distinguish military from civilian uses of that phrase, but an uptick in usage would indicate either a greater military presence in the published record, or a more militarized view among civilians.  So either way, it should reflect a more militaristic view of the president’s role.

Indeed, when I graph the relative occurrences of “our commander in chief” over time, the trend line looks rather different than before.  Here it is below, with the decades labeled the same way as in the first graph:

Scaled frequency of "our commander in chief" used along with "president", by decade, 1810s-2000s

Here we see increases in decades that saw major wars, including the War of 1812, the Mexican War of the 1840s, the Civil War of the 1860s, and the expanding Vietnam War of the 1970s.  This past decade had the second most-frequent usage (by a small margin) of “our commander in chief” in the last 200 years of this corpus.  But it’s dwarfed by the use during the 1940s, when Americans fought in World War II.  That’s not something I’d expected, but given the total mobilization that occurred between 1941 and 1945, it makes sense.

If we look more closely at the frequency of “our commander in chief” in the last 20 years, we also find interesting results. The graph below looks at 1991 through 2009 (add 1990 to each number on the horizontal axis; and as always, click on the image for a closer look):

Scaled frequency of "our commander in chief" used along with "president", by year, 1991-2009

Not too surprisingly, after the successful Gulf War in early 1991, usage starts to decrease.  And not long after 9/11, usage increases notably, and stays high in the years to follow.  (Books take some time to go from manuscript to publication, but we see a local high by 2002, and higher usage in most of the subsequent years.)  I was a bit surprised, though, to find an initial spike in usage in 1999.  As seen in this timeline, Bill Clinton’s impeachment and trial took place in late 1998 and early 1999, and a number of the hits during this time period are in the context of questioning Clinton’s fitness to be “our commander in chief” in the light of the Lewinsky scandal.  But once public interest moved on to the 2000 elections, in which Clinton was not a candidate, usage dropped off again until the 9/11 attacks and the wars that followed.

I don’t want to oversell the importance of these searches.  Google Books search is a crude instrument for literary analysis, and I’m still a novice at corpus analysis (and at generating Excel graphs).  But my searches suggest that the corpus can be a useful tool for identifying and tracking large-scale trends in certain kinds of expression.  It’s not a substitute for the close reading that most humanities scholarship requires.  And even with the “distant reading” of searches, you still need to look at a sampling of your results to make sure you understand what you’re finding, and aren’t copying down numbers blindly.

But with those caveats, the Google Books corpus supports an enlightening high-altitude perspective on literature and culture.  The corpus is valuable not just for its size and searchability, but also for its public accessibility.  When I report on an experiment like this, anyone else who wants to can double-check my results, or try some followup searches of their own.  (Exact numbers will naturally shift somewhat over time as more volumes get added to the corpus.)  To the extent that searching, snippets, and text are open to all, Google Books can be everybody’s literary research laboratory.

May 6, 2010

Making discovery smarter with open data

Filed under: architecture,discovery,online books,open access,sharing,subjects — John Mark Ockerbloom @ 9:06 am

I’ve just made a significant data enhancement to subject browsing on The Online Books Page.  It improves the concept-oriented browsing of my catalog of online books via subject maps, where users explore a subject along multiple dimensions from a starting point of interest.

Say you’d like to read some books about logic, for instance.  You’d rather not have to go find and troll all the appropriate shelf sections within math, philosophy, psychology, computing, and wherever else logic books might be found in a physical library.  And you’d rather not have to think of all the different keywords used to identify different logic-related topics in a typical online catalog. In my subject map for logic, you can see lots of suggestions of books filed both under “Logic” itself, and under related concepts.  You can go straight to a book that looks interesting, select a related subject and explore that further, or select the “i” icon next to a particular book to find more books like it.

As I’ve noted previously, the relationships and explanations that enable this sort of exploration depend on a lot of data, which has to come from somewhere.  In previous versions of my catalog, most of it came from a somewhat incomplete and not-fully-up-to-date set of authority records in our local catalog at Penn.  But the Library of Congress (LC) has recently made authoritative subject cataloging data freely available on a new website.  There, you can query it through standard interfaces, or simply download it all for analysis.

I recently downloaded their full data set (38 MB of zipped RDF), processed it, and used it to build new subject maps for The Online Books Page.   The resulting maps are substantially richer than what I had before.  My collection is fairly small by the standards of mass digitization– just shy of 40,000 items– but still, the new data, after processing, yielded over 20,000 new subject relationships, and over 600 new notes and explanations, for the subjects represented in the collection.

That’s particularly impressive when you consider that, in some ways, the RDF data is cruder than what I used before.  The RDF schemas that LC uses omit many of the details and structural cues that are in the MARC subject authority records at the Library of Congress (and at Penn).  And LC’s RDF file is also missing many subjects that I use in my catalog; in particular, at present it omits many records for geographic, personal, and organizational names.

Even so, I lost few relationships that were in my prior maps, and I gained many more.  There were two reasons for this:  First of all, LC’s file includes a lot of data records (many times more than my previous data source), and they’re more recent as well.  Second, a variety of automated inference rules– lexical, structural, geographic, and bibliographic– let me create additional links between concepts with little or no explicit authority data.  So even though LC’s RDF file includes no record for Ontario, for instance, its subject map in my collection still covers a lot of ground.
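One of the structural rules can serve as an illustration.  This is a simplified sketch of my own devising, not LC's data or the actual rules used: any heading of the form "X -- Y" implies a broader heading "X", even when "X" has no authority record of its own.

```python
def inferred_broader(headings, sep=" -- "):
    """Link each subdivided heading to the broader heading it implies."""
    links = {}
    for h in headings:
        if sep in h:
            links[h] = h.rsplit(sep, 1)[0]
    return links

headings = ["Ontario -- History", "Ontario -- Maps", "Logic"]
print(inferred_broader(headings))
```

A rule this simple already groups all the "Ontario -- ..." headings under a common parent, which is one way a subject map for Ontario can cover a lot of ground without an Ontario authority record.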

A few important things make these subject maps possible, and will help them get better in the future:

  • A large, shared, open knowledge base: The Library of Congress Subject Headings have been built up by dedicated librarians at many institutions over more than a century.  As a shared, evolving resource, the data set supports unified searching and browsing over numerous collections, including mine.  The work of keeping it up to date, and in sync with the terms that patrons use to search, can potentially be spread out among many participants.  As an open resource, the data set can be put to a variety of uses that both increase the value of our libraries and encourage the further development of the knowledge base.
  • Making the most of automation: LC’s website and standards make it easy for me to download and process their data automatically. Once I’ve loaded their data, and my own records, I then invoke a set of automated rules to infer additional subject relationships.  None of the rules is especially complex; but put together, they do a lot to enhance the subject maps. Since the underlying data is open, anyone else is also free to develop new rules or analyses (or adapt mine, once I release them).  If a community of analyzers develops, we can learn from each other as we go.  And perhaps some of the relationships we infer through automation can be incorporated directly into later revisions of LC’s own subject data.
  • Judicious use of special-purpose data: It is sometimes useful to add to or change data obtained from external sources.  For example, I maintain a small supplementary data file on major geographic areas.  A single data record saying that Ontario is a region within Canada, and is abbreviated “Ont.”, generates much of my subject map for Ontario.  Soon, I should also be able to re-incorporate local subject records, as well as arbitrary additional overlays, to fill in conceptual gaps in LC’s file.  Since local customizations can take a lot of effort to maintain, however, it’s best to try to incorporate local data into shared knowledge bases when feasible.  That way, others can benefit from, and add on to, your own work.
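The kind of supplementary record described in the last point can be very small.  Here is a hedged sketch (the field names are invented for illustration, not the actual file format) of how one geographic record might expand into subject-map relationships:

```python
# A single supplementary record (invented field names, for illustration):
ontario = {"name": "Ontario", "within": "Canada", "abbrev": "Ont."}

def geographic_links(rec):
    """Derive a few subject-map relationships from one geographic record."""
    return [
        (rec["name"], "broader", rec["within"]),  # Ontario is within Canada
        (rec["abbrev"], "see", rec["name"]),      # map the abbreviation to the full name
    ]

for link in geographic_links(ontario):
    print(link)
```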

Recently, there’s been a fair bit of debate about whether to treat cataloging data as an open public good, or to keep it more restricted.  The Library of Congress’ catalog data has been publicly accessible online for years, though until recently you could only get a little at a time via manual searches, or pay a large sum for a one-time data dump.  By creating APIs, using standard semantic XML formats, and providing free, unrestricted downloads of their subject authority data, LC has made their data much easier for others to use in a variety of ways.  It’s improved my online book catalog significantly, and can also improve many other catalogs and discovery applications.  Those of us who use this data, in turn, have incentives to work to improve and sustain it.

Making the LC Subject Headings ontology open data makes it both more useful and more viable as libraries evolve.  I thank the folks at the Library of Congress for their openness with their data, and I hope to do my part in improving and contributing to their work as well.

January 1, 2010

Public domain day 2010: Drawing up the lines

Filed under: copyright,online books,open access — John Mark Ockerbloom @ 12:01 am

As we celebrate the beginning of the New Year, we also mark Public Domain Day (a holiday I’ve been regularly celebrating on this blog.)  This is the day when a year’s worth of copyrights expire in many countries around the world, and the works they cover become free for anyone to use and adapt for any purpose.

In many countries, this is a bittersweet time for fans of the public domain.  For instance, this site notes the many authors whose works enter the public domain today in Europe, now that they’ve been dead for at least 70 years.  But for many European countries, this just represents reclaimed ground that had previously been lost.  Europe retroactively extended and revived copyrights from life+50 to life+70 years in 1993, so it’s still three more years before Europe’s public domain is back to what it was then.  Many other countries, including the United States, Australia, Russia, and Mexico, are in the midst of public domain freezes.  For instance, due to a 1998 copyright extension, no copyrights of published works will expire here in the US due to age for another 9 years, at least.

In the past, many people have had only a vague idea of what’s in the public domain and what isn’t.  But thanks to mass book digitization projects, the dividing line is becoming clearer.  Millions of books published before 1923 (the year of the oldest copyrights still in force in the US) are now digitized, and can be found with a simple Google search and read in full online.  At the same time, millions more digitized books from 1923 and later can also be found with searches, but are not freely readable online.

Many of those works not freely readable online have languished in obscurity for a long time.   Some of them can be shown to be in the public domain after research, and groups like Hathi Trust are starting to clear and rescue many such works.  Some of them are still under copyright, but long out of print, and may have unknown or unreachable rightsholders.  The current debate over Google Books has raised the profile of these  works, so much so that the New York Times cited “orphan books”, a term used to describe such unclearable works, as one of the buzzwords of 2009.

The dividing line between the public domain and the world of copyright could well have been different.   In 1953, for instance, US copyrights ran for a maximum of 56 years, and the last of that year’s copyrights would have expired today, were it not for extensions.  Duke’s Center for the Study of the Public Domain has a page showing what could have been entering the public domain today— everything up to the close of the Korean War.  In contrast, if the current 95-year US terms had been in effect all of last century, the copyrights of 1914 would have only expired today.  Only now would we be able to start freely digitizing the first set of books from the start of World War I.

With the dividing line better known nowadays, do we have hope of protecting the public domain against more expansions of copyright?  Many countries, including Canada and New Zealand, still stick to the life+50 years term of the Berne Convention.  In those countries, works from authors who died in 1959 enter the public domain for the first time today.  There’s pressure on some of these countries to increase their terms, so far resisted.  Efforts to extend copyrights on sound recordings continue in Europe, and recently succeeded in Argentina.  And secret ACTA treaty negotiations are also aimed at increasing the power of copyright holders over Internet and computer users.

But resistance to these expansions of copyright is on the rise, and public awareness of copyright extensions and their deleterious effects is quite a bit higher now than when Europe and the US extended their copyrights in the 1990s.  And with concerns expressed by a number of parties over a possible Google monopoly on orphan books, one can envision building up a critical mass of interest in freeing more of these books for all to use.

So today I celebrate the incremental expansion of the public domain, and hope to help increase it further. To that end, I have a few gifts of my own.  As in previous years, I’m freeing all the copyrights I control for publications (including public online postings) that are more than 14 years old today, so any such works published in 1995 and before are now dedicated to the public domain.  Unfortunately, I don’t control the copyright of the 1995 paper that is my most widely cited work, but at least there’s an early version openly accessible online.

I can also announce the completion of a full set of digitized active copyright renewal records for drama and works prepared for oral delivery, available from this page.  This should make it easier for people to verify the public domain status of plays, sermons, lectures, radio programs, and similar works from the mid-20th century that to date have not been clearable using online resources.  We’ve also put online many copyright renewal records for images, and hope to have a complete set of active records not too far into 2010.  Among other things, this will help enable the full digitization of book illustrations, newspaper photographs, and other important parts of the historical record that might be otherwise omitted or skipped by some mass digitization projects.

Happy Public Domain Day!  May we have much to enjoy this day, and on many more Public Domain Days to come.

(Edited later in the day January 1 to fix an inaccurately worded sentence.)

October 26, 2009

Promoting access to the best literature of the past

Filed under: online books,open access — John Mark Ockerbloom @ 3:24 pm

Last week saw widespread observance of Open Access Week 2009.  The week primarily focused on opening access to current research and scholarship (though there’s also been a growing community working on opening access to teaching and learning content).  You can find lots of open access resources at the Open Access Directory.

Current scholarship is not spontaneously generated from the brain or lab of the writer.  Useful scholarship must understand and interpret past work, to be effective in the present.  In many fields, and not just the classical humanities, the relevant past work may stretch back hundreds or even thousands of years.  Current scholarship and study will be more effective if its source material is also made openly accessible, and if proper attention is drawn to the most useful sources.  And now is an especially opportune time for scholars of all sorts, professional and amateur, to get involved in the process.

This may seem a strange thing to say at a time when the digitization of old books and other historic materials is increasingly dominated by large-scale projects like Google and the Internet Archive.  With mass digitizers putting millions of public domain book and journal volumes online, and with a near-term possibility of millions more copyrighted volumes going online as well, how much of a role is left for individual scholars and readers?

A very important role, as it turns out.  Mass digitization projects can quickly produce large-scale aggregations of past content, but as many have pointed out, aggregation is not the same as curation, and as aggregations grow larger, being able to find the right items in a growing collection becomes increasingly important.  That’s what curation helps us do, and the large-scale digitizers are not doing a very effective job of it themselves.  Google’s PageRank algorithm may take advantage of implicit curation of web pages (through authors’ choices of page links), but Google and other aggregators have had a much harder time drawing attention to the most useful books, scholarly articles, or other works created without built-in hyperlinks.

Sometimes this is because they haven’t digitized them, even as they’ve digitized inferior substitutes.  Over three years after Paul Duguid lamented the republication of a bowdlerized translation of Knut Hamsun’s Pan by Project Gutenberg, that version remains the only freely available one of this book there, or at Google Books, or anywhere else online that I’ve found.  Even though an unexpurgated version of this translation was published before the bowdlerized version, no digitizer that I know of has gotten around to finding and digitizing it; and countless readers may have used the existing online copies without even knowing that they’ve been censored.  Extra bibliographic and copyright research may be necessary to determine whether a better resource is available for digitization, as it is in this case.

Sometimes the content is digitized, but can’t be found easily.  Geoff Nunberg’s post on Google Books’ “metadata train wreck” shows plenty of examples of how difficult it can be to find and properly identify a particular edition in Google Books, much less figure out which edition is the best one to use.  I’ve commented in the past about the challenges of finding multi-volume works in that corpus.  And Peter Jacso has pointed out Google’s problems indexing current scholarship.  If you can’t find the paper or book you need for your research, your work will be no better than it would be if the source had never existed.

This is where scholars can potentially play a useful role.  We don’t individually digitize books by the thousands, but we do individually find, cite, and recommend useful sources, down to the particular edition, as we find them and use them in our own writings and teaching.  These citations and recommendations now often go online, in various locations.  It would be very useful to have these recommendations made more visible, and tied to freely available online copies of the sources cited, whenever legally possible. Sometimes, we also create or digitize our own editions of past works, with useful annotations, for our classes or our own work.  It would be very useful to have these made visible and persistent as well, whenever appropriate.

I hope that large resource aggregations will make it easier for scholars and others to curate the collections to make them more useful to their readers.  In the meantime, we can start with resources we have.  For example, on The Online Books Page, my catalog entry for Hamsun’s Pan notes its limitations.  My public requests page includes information on a better edition that could be digitized, by someone who has access to the edition and has some time to spare.  And my suggestion form is ready to accept links to better editions of this book, or to other online books that merit special attention.  Indeed, most of the books that I now add to my catalog derive from submissions made by various readers on this form, and I invite scholars to suggest the freely accessible books and serials that they find most useful for my catalog.

As the Little Professor notes in a recent post, the sort of bibliographic work I’ve described can be time-consuming but vitally important for making effective use of old sources, and that work has often not been done by anyone for many books outside the usual classical canons.  Yet it’s the sort of thing that scholars do, bit by bit, as part of their everyday work.  The aggregate effect of their curation and digitization, appropriately harnessed in open-access form, could greatly improve our ability to build upon the work of the past.

September 17, 2009

Google Book settlement: Alternatives and alterations

Filed under: copyright,online books,open access — John Mark Ockerbloom @ 1:35 pm

In my previous post, I worried that the Google Books settlement might fall apart in the face of opposition from influential parties like the Copyright Office, and that such a collapse might deprive the public of meaningful access to millions of out of print books.

Not everyone sees it that way.  I’ve seen various suggestions of alternatives to the settlement for making these books available.  In this post, I’ll describe some of the suggested alternatives, explain why they don’t seem to me as likely to succeed on their own, and discuss how some of them could still go forward under a settlement.

Compulsory licenses

Both the Open Book Alliance’s court filings and the Copyright Office’s testimony mention the possibility of compulsory licensing, which essentially lets people use a copyrighted work without getting permission, provided that they meet standard conditions determined by the government.  Compulsory licenses already exist in certain areas, such as musical performances and broadcasts.  If I want to cover a Beatles song on my new record, I can, as long as I meet some basic conditions, including paying a standard royalty.  The (remaining) Beatles can’t hold out for a higher rate, or say that no one else is allowed to cover the songs they’ve released.

The Google Books settlement has some similarities to a compulsory license, but with some important differences, including:

  1. Book rightsholders can choose to deny public uses of their work, or hold out for higher compensation, which they generally can’t do under a compulsory license regime. (They have to explicitly request this, though.  So it’s really what one might call a “default” license.)
  2. The license has been negotiated through a court settlement rather than Congressional action. (This was one of the main complaints of the Copyright Office.)
  3. The license given in the settlement is granted only to Google, not to other digitizers. (This has justifiably raised monopoly concerns.)

I do have a problem with the last difference as it stands.  I’d like to see the license widened so that anyone, not just Google, could digitize and make available out of print books under the same terms as Google. But there are various ways we can get to that point from the settlement.  The Book Rights Registry created by the settlement could extend Google-like rights to anyone else under the same terms, as the settlement permits them to do.  The Justice Department could require them to do so as part of an antitrust supervision.  Or Congress could decide to codify the license to apply generally.  (They’ve done this sort of thing before with fair use and the first sale doctrine, both of which originated in the courts.)

If the settlement falls apart, though, negotiation over an appropriate license has to start over from scratch, and has to persuade Congress to loosen copyrights for benefits they might not clearly see. As I suggested in my previous post, Congress’ recent tendencies have heavily favored tightening, rather than loosening, copyright control.   And I haven’t yet seen a strong coalition pushing for laws granting compulsory (or default) licenses that are as broad as would be needed.

For instance, the Open Book Alliance’s amicus brief suggests the possibility of a compulsory license, but only as “but one approach”, and that suggestion seems as much aimed at getting hold of Google’s scans as at licensing the book copyrights themselves.  Their front page at present shows no explicit advocacy of compulsory copyright licenses.  Perhaps they will unite behind a workable Google Books-style compulsory license proposal in the future, but I’m not counting on that.  (Update: Just after I posted this, I saw this statement of principles go up on the OBA site.  We’ll see what develops from that.)

The Copyright Office’s congressional brief also mentions but tries to damp down the idea.  It repeatedly characterizes compulsory licensing as something that Congress only does “reluctantly” and “in the face of marketplace failure”. But despite its strong words on other subjects, it does not appear concerned over whether we in fact have a marketplace failure around broad access to out-of-print books.

Orphan works legislation

The Copyright Office filing also suggests passing orphan works legislation (as have various other parties, including Google).  An orphan works limitation on copyrights would be nice, but it’s not going to enable the sort of large, comprehensive historical corpus that the Google Books settlement would allow.

As Denise Troll Covey has pointed out, the orphan works certification requirements recommended in last year’s bill, like many other case-by-case copyright clearance procedures, are labor-intensive and slow, and may be legally risky.  (In particular, the overhead for copyright clearance, not including license payment, can be several times the cost of digitization.)  Hence, these methods are not likely to scale well.  And they would not cover the many out-of-print books that aren’t, strictly speaking, orphans.  I don’t consider it likely that a near-comprehensive library of millions of out-of-print 20th century books will come about by this route alone any time soon.

Even so, despite its limited reach, last year’s orphan works legislation was stopped in Congress after some creator organizations objected to it.  Some of the objectors, including the National Writers Union and the American Society of Journalists and Authors, are now members of the Open Book Alliance, which makes me wonder how effectively that group would act as a united coalition for copyright reform.

Private negotiation

Some critics suggest that Google and other digitizers simply negotiate with each rightsholder, or with a mediator designated by each rightsholder.  It’s possible that this actually might work for many future books, if authors and publishers set up comprehensive clearinghouses (much as ASCAP and the Harry Fox Agency mediate music licensing).  If new books get registered with agents like these going forward, with simple, streamlined digital rights clearing, private arrangement could work well for future books both in print and out of print.  Indeed, Google’s default settlement license privileges don’t apply to new books from 2009 onward.

But it’s much less likely that this will be a practical solution to build a comprehensive collection of past out of print books from the 20th and early 21st century, because of the sheer difficulty and cost of determining and locating all the current rightsholders of books long out of print.   The friction involved in such negotiation (involving high average cost for low average compensation) is too great.  Without the settlement and/or legal reform, we risk having what James Boyle called a “20th century black hole” for books.

Copyright law reform

As James Boyle points out, it would solve a lot of the problems that keep old books in obscurity if books didn’t get exceedingly long copyrights purely by default.  It would also help if fair use and public domain determination weren’t as risky as they are now.  I’d love to see all that come to pass, but no one I know who’s knowledgeable about copyright issues is holding their breath waiting for it to happen any time soon.

Moving forward

As I’ve previously mentioned, the settlement is imperfect.  It may well need antitrust supervision, and future elaboration and extension.  (And I’ve suggested some ways that libraries and others can work to improve on it.)  It’s still the most promising starting point I’ve seen for making comprehensive, widely usable, historic digital book collections possible.  I hope that we get the chance to build on it, instead of throwing away the opportunity.  In any case, I’d be happy to hear people’s thoughts and comments about the best way to move forward.

September 15, 2009

Google Books, and missing the opportunities you don’t see

Filed under: copyright,online books,open access — John Mark Ockerbloom @ 9:12 pm

The Google Books settlement fairness hearing is still a few weeks away, but in the last few weeks the deal has been talked and shouted about with ever-higher volume.  Still, it wasn’t until the other day, in a House Judiciary Committee hearing where US Copyright Register Marybeth Peters came loaded for bear, that I started thinking there was a significant likelihood that the settlement might fall apart.

There are a number of people in different communities, including libraries, who hope this happens.  I’m not one of them.  I’m not a lawyer, so I can’t comment with authority on whether the settlement is sound law.  But I’m quite confident that it advances good policy.  In particular, it’s one of the best feasible opportunities to bring a near-comprehensive view of the knowledge and culture of the 20th and early 21st centuries into widespread use.  And I worry that, should the settlement break down, we will not have another opportunity like it any time soon.  The settlement has flaws, as the Google Books Project itself has; but, also like Google Books itself, the deal it offers is incredibly useful to readers, while giving writers new opportunities to revive, and be paid for, their out-of-print work.

The potential

Under the status quo, millions of books are greatly under-utilized.  It isn’t just that people don’t have easy access to them; it’s that people don’t know that particular books useful to them exist in the first place.  I work in a library that has collected millions of volumes, many of which are hardly ever checked out. Not only would Google’s expanded books offerings give our users access to millions more books, but it would also make millions of books that we already own easier for our users to find and use effectively.

Want to know what books make mention of a particular event, ancestor, or idea?  With existing libraries, and good search skills, you might be able to find books, if any, that are written primarily about those things. But you’ll probably miss much other information on those same topics, information in works that are primarily about something else.  With expanded search, and the ability to preview old book content, it could be much easier to get a more comprehensive view on a topic, and find out which books are worth obtaining for learning more.

And if that’s a big advance for people in big universities like ours, it’s an even bigger step forward for people who have not had easy access to big research libraries.  Once a search turns up a book of interest, Google Books would offer a searcher various ways of getting that book: buying online access; reading it at their library’s computer (either via a paid subscription, or via a free public access terminal); buying a print copy; or getting an inter-library loan.  These options all involve various trade-offs of cost and convenience, as is the case with libraries today.  While one could wish for better tradeoff terms, the ones proposed still represent big advances from what one can easily do today.

And as with other large online collections like Wikipedia or WorldCat, or the Web as a whole, the advantages to large book corpuses like Google’s aren’t just in the individual items, but in what can be done with the aggregation.  I don’t know exactly what new kinds of things people will find to do with a near-comprehensive collection of  20th century books, but having seen all that people have done with other information aggregated on the Internet, I’m confident that there would be many great uses found, large and small.

The peril

If the Google settlement does fall apart, are we likely to see any collection like the one it envisions any time soon?  I’m not at all confident we will.  The basic problem is that, without some sort of blanket license, it’s impractical (and in the case of true orphan works, currently impossible) to clear all the copyrights that would be required to build such a collection.  This represents a failure in copyright law.  Instead of “promot[ing] the progress of science and useful arts”, as the Constitution requires, current US copyright law effectively keeps millions of out-of-print books in obscurity, not producing significant benefits either to their creators or to their potential users.

The current proposed Google Books settlement is, among other things, an attempt to get around this failure.  If the settlement fails, would the parties make a new agreement that would allow a readable collection of millions of post-1922 online books?  The divergence in the complaints I’ve seen (for instance, on the one hand that the collection would cost readers too much, and on the other that it would pay writers too little) suggests the difficulty of coming to a new consensus that satisfies all the parties, if negotiations have to start again from scratch.  And, if the arguments of the Copyright Office and some of the other parties carry the day, even if such an agreement were reached, it could not be ratified by a court anyway.  Instead, it would require acts of Congress, and maybe even renegotiations of international treaties.

History suggests two things that would make the government likely to reform copyright law to permit mass reuse of out-of-print books.  Either there needs to be a clear example of the benefits of such a reform, or there needs to be a strong coalition pushing for one.  Clear examples have usually come from businesses actually in operation; for example, the player piano roll industry successfully persuaded Congress to streamline music copyright clearance in the previous century (and the Betamax persuaded a slender majority of the Supreme Court to declare the VCR legal).

If the proposed Google Books library service goes online, even under a flawed initial settlement, it too could provide a compelling example to encourage general copyright reform.  But without such an example, it can be hard to move Congress to act.   It’s easy to undervalue the opportunities you don’t clearly see.

What about a strong coalition pushing for a reform in the law that would let anyone create the comprehensive online collections of out-of-print books I’d described?  I’d like to see one, but I haven’t yet.  (Yes, there’s the Open Book Alliance, but its members don’t seem to be united on much beyond objecting to the settlement.)  In my next post, I’ll discuss reforms that might do the job, and the reasons I believe they would be difficult to enact without the settlement.

May 5, 2009

April 23, 2009

David Reed: Some extracts from his life and letters

Filed under: online books,people — John Mark Ockerbloom @ 11:36 pm

Last summer I was looking for a particular book. I couldn’t find it in any library in my State. Went interlibrary loans and found one copy at the library of Congress. Only one copy in the whole country. One of the best stories I ever [heard] about this is one when one of my professors was working on a trash pile of papyrus sheets and came across one that said [it] was the works of Meander. He went through that pile of papyrus with a fine tooth comb. He didn’t find anything but that single piece. He said that it felt as though he was looking across the centuries and saying, “Somewhere out there are the works of Meander.” [Friends,] this is how things get lost forever.

David Reed, 1997

Today, there are thousands of important books that will likely never share that fate as long as civilization lasts, because they were digitized and sent all over the world.  Many of these books were first put online by Project Gutenberg.  And many of the Project Gutenberg texts are online thanks to the work of David Reed.

I scanned and released Gibbon’s Decline and Fall of the Roman Empire and hardly a day goes by when I don’t get an email from someone thanking me for releasing it on the web. At one site I know that it has been downloaded 1800+ times in all six volumes.

David Reed, 2001

In the mid-1990s, Project Gutenberg had an outlandish-sounding goal: to make 10,000 books freely available online by the start of the 21st century.  They’d only managed to put a couple hundred online by then.  Authors like Clifford Stoll were skeptical that they, or anyone else, would ever reach such a goal.

But Gutenberg was soon publishing more and more texts every month, at an ever-increasing pace.   Lots of those texts had David Reed’s name on them.  Working persistently with his own scanner, well before the era of well-funded mass digitization, he digitized and proofread long works that few other people at the time would have taken on: Gibbon’s Decline and Fall; Shakespeare’s First Folio;  Josephus’ Antiquities of the Jews; Frazer’s Golden Bough; Tocqueville’s Democracy in America.  He also scanned numerous works weighty and light from authors like Rudyard Kipling, Louisa May Alcott, Robert Frost, James Joyce, and the US government.

Some critics in academia complained that the books David and others put up for Gutenberg were not up to the standards of scholarly editions.  David didn’t begrudge the work of scholars, but he wanted to put up more works, more quickly, to reach a broader audience.  As he put it in 1999:

[I] think that [it’s] important to remember that we do all this work because we like to read and we like to share our discoveries with others…. I see no reason why the text specialists can’t have the specialist collections and the general people (like myself) have the general collections. There is room enough on the web for all of us. The real enemy are those who want to lock up all the books in the world. The real enemy are those who don’t read a single book.

David was fighting another enemy besides illiteracy, one closer to home. He had diabetes, and in the last few years of his life his health slowly worsened from complications of that disease. He didn’t mention it in this post (nor, as far as I can remember, in any of the posts he made to the Book People mailing list, from which these quotations are taken). But even while his health was failing, he continued to put books online, like this emergency childbirth manual that was posted this past October.  He was working to fulfill a dream that he described back in his 1999 post:

I dream of the day when we have 50,000 and 100,000 etext libraries on the web. Where there are 100 new etexts being released a week or every couple of days. When I can’t keep up with reading every etext that pops up on the Online Book Page or that Project Gutenberg releases. I appreciate all the work that you are all doing. I love reading the work that you are all doing.

David died on April 21, 2009, according to the email his son Chris sent to David’s contacts list.  By then, Google Books and the Internet Archive’s book collection had made over 1 million books freely available online, the various Gutenberg projects had posted just over 30,000 books, and many smaller projects had posted numerous unique titles as well.  He lived long enough to see his dream come true, thanks in part to his own pioneering work and dedication.

I have dedicated etexts in honor of my daughter, my sons, my wife, parents and in honor of my companies I work for, even in honor of myself.

David Reed, 2001

Out there all over the Net, in millions of replicas, are the works of David Reed, transcribing many of the great authors that have also passed on.  In some sense, all of those works are dedicated  to him.  Through them, I hope his name lives on for generations to come.

March 30, 2009

How to find complete multi-volume works in Google Books

Filed under: online books — John Mark Ockerbloom @ 10:15 pm

While Google’s agreement on copyrighted books has been the subject of much discussion lately, they’ve also been continuing to add public domain titles at a brisk pace.  For instance, they announced in February that they now had 1.5 million public domain volumes formatted for mobile devices.  And last week, they noted that they had completed their scans of hundreds of thousands of volumes of 19th century public domain books from Oxford’s Bodleian library.

If you look at the three example book links in their Oxford post, you’ll notice that each of them goes to a volume of a multi-volume edition.   Works from the nineteenth century and before were often originally published in multiple volumes, such as the “three-decker” format common for Victorian novels.  When such books are reprinted today, they’re usually printed as a single volume, but to read all of many Google titles, you’ll have to range over multiple volumes.

Unfortunately, as various readers have noted, it can be quite difficult to find readable copies of all of the volumes in a multi-volume edition.  For various reasons, they often don’t all come up when you do a search for a particular title.  This can make readers think there are no complete digital editions of a work they’re seeking, even when there are.

In working with people who have helped me fill requests for public domain books, I’ve compiled a series of techniques for finding complete multi-volume sets in Google Books.  I’d be happy to hear additional tips from readers.

  • First, do a search for full-view volumes of the work you’re looking for.  One good way to do this is to go to Google’s advanced book search page, select the “full view only” option, and enter author and title words in the appropriate blanks.
  • If you get a hit, check the start and the end of the scan, to verify which volumes are actually present. Sometimes you’ll find more than one volume in the scan, either because multiple volumes were bound together, or because Google combined volumes in its scan.
  • Go to the “about this book” page for the scan, and look in the lower regions to see if there is an “Other editions” section. This often includes links to other volumes, not just other editions. If there’s a “See more” at the bottom of such a section, click on it to see more volumes or editions.  (Sometimes Google will have multiple editions as well as multiple volumes for the same work.  It’s best when possible to compile volumes from the same edition.  You can do this by matching publishers and dates between volumes, though keep in mind that some multivolume editions came out over the course of multiple years.  Editions from different publishers, or from different times, may have inconsistent content, and might not divide into volumes at the same points.)
  • If the book is from the University of Michigan (as reported either in the “about this book” page or in the scanned front pages) check the Mirlyn catalog for the book. Sometimes this will turn up volumes scanned by Google that have been put in the Hathi Trust repository, or in Google Book Search itself, but that for some reason don’t show up in an ordinary Google books search. Some other Hathi Trust libraries also have links to digitizations of their content; see this page for details.
  • If this didn’t turn up all the volumes you’re looking for, repeat the process above for the other volumes in your initial hit list. Sometimes those will have “Other editions” links to additional volumes that didn’t appear with the earlier hits.
  • If you manage to complete a set this way, consider sharing your success with other readers.  If you fill in my book suggestion form with the volumes you find, I can list a neatly consolidated edition of all the volumes on The Online Books Page, and help other people avoid going through all the trouble you just did.  (Give the book’s title, URL for the first volume, and other information in the appropriate blanks, and then add URLs for subsequent volumes in the “Anything else we should know?” section of the form.)
  • Even if you only partially succeeded, if it’s a work you’re particularly interested in you can use my suggestion form to let me know what you’ve been able to find.  If I can’t easily find the other volumes myself, I can at least list what was found on my works-in-progress page. With luck, someone coming along later will find or digitize the remaining volumes, and I can list the set.
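For readers who end up doing this for many titles, the edition-matching step above (matching publishers and dates between volumes, while allowing for sets published over several years) can be sketched in code.  Here is a small, illustrative Python sketch; the record fields and the publisher names in the sample data are hypothetical, not drawn from Google’s actual metadata, so treat it as a pattern rather than a working tool:

```python
# Group candidate volume scans by publisher (a rough proxy for "same edition"),
# tolerate a small spread of publication years within one edition, and report
# which volume numbers are still missing from each group.

from collections import defaultdict

def group_by_edition(scans, year_spread=5):
    """Group volume records by publisher, tolerating multi-year editions."""
    editions = defaultdict(list)
    for scan in scans:
        editions[scan["publisher"]].append(scan)
    results = {}
    for publisher, vols in editions.items():
        years = [v["year"] for v in vols]
        found = {v["volume"] for v in vols}
        # We can only infer missing volumes up to the highest number seen;
        # the true volume count should be checked against a catalog record.
        expected = set(range(1, max(found) + 1))
        results[publisher] = {
            "volumes_found": sorted(found),
            "missing": sorted(expected - found),
            "years_consistent": max(years) - min(years) <= year_spread,
        }
    return results

# Hypothetical search hits for a three-volume Victorian novel:
hits = [
    {"publisher": "Smith, Elder", "year": 1853, "volume": 1},
    {"publisher": "Smith, Elder", "year": 1854, "volume": 3},
    {"publisher": "Tauchnitz",    "year": 1860, "volume": 2},
]

report = group_by_edition(hits)
# The Smith, Elder edition is missing volume 2; the lone Tauchnitz hit
# can't fill that gap, since different editions may divide into volumes
# at different points.
```

The key design point mirrors the advice above: volumes are only treated as a matching set when publisher and dates line up, because editions from different publishers or eras may have inconsistent content.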

Similar techniques can be used for compiling runs of historic serials, which are also present in Google, and can be of great interest to readers.

If you find these suggestions useful, I hope you’ll help me compile sets of your favorite public domain works, so we can take advantage of all this wonderful old material that Google and others are digitizing.
