Everybody's Libraries

August 31, 2010

As living arrows sent forth

Filed under: discovery,sharing — John Mark Ockerbloom @ 9:52 pm

It’s that time of year when offspring start to leave home and strike out on their own.  Young children may be starting kindergarten.  Older ones may be heading off to university.  And in between, children slowly gain a little more independence every year.  If we parents are fortunate and do our job well, we set our children going in good directions, but they then make paths for themselves.

Standards are a little like children that way.  You can invest lots of time, thought, and discussion into specifying how some set of interactions, expressions, or representations should work.  But, if you do well, what you specified will take on a life apart from you and its other parents, and make its own way in the world.  So it’s rather gratifying for me to see a couple of specifications that I’d helped parent move out into the world that way.

I’ve mentioned them both previously on this blog.  One was a fairly traditional committee effort: the DLF ILS-Discovery Interface recommendation.  After the original DLF group finished its work, a new group of folks affiliated with OCLC and the Code4lib community formed to implement the types of interfaces we’d recommended.  The new group has recently announced they’ll be supporting and contributing code to the Extensible Catalog NCIP toolkit.  This is an important step towards realizing the goal of standardized patron interaction with integrated library systems.  I’m looking forward to seeing how the project progresses, and hope I’ll hear more about it at the upcoming Digital Library Federation forum.

The other specification I’ve worked on that’s recently taken on a life of its own is the Free Decimal Correspondence (FDC).   This was a purely personal project of mine to develop a simple, freely reusable classification that was reasonably compatible with the Dewey Decimal System and the Library of Congress Subject Headings.  I created it for Public Domain Day last year, and did a few updates on it afterwards, but have largely left it on the shelf for the last while.  Now, however, it’s being used as one of the bases of the “Melvil Decimal System”, part of the Common Knowledge metadata maintained at LibraryThing.

It’s nice to see both of these efforts start to make their mark in the larger world.  I’ve seen the ILS-DI implementation work develop in good hands for a while, and I’m content at this point to watch its progress from a distance.  The Free Decimal Correspondence adoption was a bit more of a surprise, though one that was quite welcome.  (I put FDC in the public domain in part to encourage that sort of unexpected reuse.)  When the Melvil project’s use of FDC was announced, I quickly put out an update of the specification, so that recent additions and corrections I’d made could be easily reused by Melvil.

I’m still trying to figure out what further updating, if any, I should do for FDC.  Melvil already goes into more detail than FDC in many cases, and as a group project, it will most likely further outstrip FDC in size as time passes.  On the other hand, keeping in sync with LC Subject Headings terminology, as FDC does, is not necessarily a goal of Melvil’s.  I’m not sure at this point, though, whether that specific feature of FDC is important to any existing or planned project out there.  And as I stated in my FDC FAQ, I don’t intend to spend a whole lot of time maintaining or supporting FDC over the long term.

But since it is getting noticeable outside use, I’ll probably spend at least some time working up to a 1.0 release.  This might simply involve making a few corrections and then declaring it done.  Or it could involve incorporating some of the information from Melvil back into FDC, to the extent that I can do so while keeping FDC in the public domain.  Or it could involve some further independent development.  To help me decide, I’d be interested in hearing from anyone who’s interested in using or developing FDC further.

Projects are never really finished until you let them go.  I’m glad to see these particular ones take flight, and hope that we in the online library community will release lots of other creations in the years to come.

July 31, 2010

Keeping subjects up to date with open data

Filed under: data,discovery,online books,open access,sharing,subjects — John Mark Ockerbloom @ 11:51 pm

In an earlier post, I discussed how I was using the open data from the Library of Congress’ Authorities and Vocabularies service to enhance subject browsing on The Online Books Page.  More recently, I’ve used the same data to make my subjects more consistent and up to date.  In this post, I’ll describe why I need to do this, and why doing it isn’t as hard as I feared that it might be.

The Library of Congress Subject Headings (LCSH) is a standard set of subject names, descriptions, and relationships, begun in 1898, and periodically updated ever since. The names of its subjects have shifted over time, particularly in recent years.  For instance, subject terms mentioning “Cookery”, a word more common in the 1800s than now, were recently changed to use the word “Cooking”, a term that today’s library patrons are much more likely to use.

It’s good for local library catalogs that use LCSH to keep in sync with the most up-to-date version, not only to better match modern usage, but also to keep catalog records consistent with each other.  Especially as libraries share their online books and associated catalog records, it’s particularly important that books on the same subject use the same, up-to-date terms.  No one wants to have to search under lots of different headings, especially obsolete ones, when they’re looking for books on a particular topic.

Libraries with large, long-standing catalogs often have a hard time staying current, however.  The catalog of the university library where I work, for instance, still has some books on airplanes filed under “Aeroplanes”, a term that recalls the long-gone days when open-cockpit daredevils dominated the air.  With new items arriving every day to be cataloged, though, keeping millions of legacy records up to date can be seen as more trouble than it’s worth.

But your catalog doesn’t have to be big or old to fall out of sync.  It happens faster than you might think.   The Online Books Page currently has just over 40,000 records in its catalog, about 1% of the size of my university’s.   I only started adding LC subject headings in 2006.  I tried to make sure I was adding valid subject headings, and made changes when I heard about major term renamings (such as “Cookery” to “Cooking”).  Still, I was startled to find out that only 4 years after I’d started, hundreds of subject headings I’d assigned were already out of date, or otherwise replaced by other standardized headings.  Fortunately, I was able to find this out, and bring the records up to date, in a matter of hours, thanks to automated analysis of the open data from the Library of Congress.  Furthermore, as I updated my records manually, I became confident I could automate most of the updates, making the job faster still.

Here’s how I did it.  After downloading a fresh set of LC subject headings records in RDF, I ran a script over the data that compiled an index of authorized headings (the proper ones to use), alternate headings (the obsolete or otherwise discouraged headings), and lists of which authorized headings were used for which alternate headings. The RDF file currently contains about 390,000 authorized subject headings, and about 330,000 alternate headings.
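
A rough sketch of that indexing step might look like the following.  This is a sketch only, not my actual script: it assumes the bulk download is SKOS RDF, with skos:prefLabel on authorized headings and skos:altLabel on alternates (as LC’s Authorities and Vocabularies site uses), and the file name is illustrative.

```python
from collections import defaultdict
from rdflib import Graph, Namespace

SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")

g = Graph()
g.parse("lcsh.rdf")  # hypothetical local copy of the downloaded RDF

authorized = set()                          # the proper headings to use
alternate_to_authorized = defaultdict(set)  # discouraged heading -> replacements

for concept, _, label in g.triples((None, SKOS.prefLabel, None)):
    authorized.add(str(label))

for concept, _, label in g.triples((None, SKOS.altLabel, None)):
    # Map each alternate label to the authorized label(s) of its concept.
    for pref in g.objects(concept, SKOS.prefLabel):
        alternate_to_authorized[str(label)].add(str(pref))
```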

Then I extracted all the subjects from my catalog.  (I currently have about 38,000 unique subjects.)  Then I had a script check each subject to see if it was listed as an authorized heading in the RDF file.  If not, I checked to see if it was an alternate heading.  If neither was the case, and the subject had subdivisions (e.g. “Airplanes — History”), I removed a subdivision from the end and repeated the checks until a term was found in either the authorized or alternate category, or I ran out of subdivisions.
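
In outline, that check is just a loop that trims subdivisions off the end.  Here’s a minimal sketch, reusing the indexes from the sketch above and assuming subject strings flatten their subdivisions with " -- ":

```python
def classify_subject(subject, authorized, alternate_to_authorized):
    """Return ('ok' | 'replace' | 'unknown') plus the heading that matched."""
    parts = subject.split(" -- ")
    while parts:
        candidate = " -- ".join(parts)
        if candidate in authorized:
            return "ok", candidate
        if candidate in alternate_to_authorized:
            return "replace", candidate
        parts.pop()  # drop the last subdivision and try again
    return "unknown", None

# e.g. classify_subject("Aeroplanes -- History", ...) reports a needed
# replacement once "History" is stripped and "Aeroplanes" turns up as
# an alternate heading.
```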

This turned up 286 unique subjects that needed replacement– over 3/4 of 1% of my headings, in less than 4 years.  (My script originally identified even more, until I realized I had to ignore simple geographic and personal names.  Those aren’t yet in LC’s RDF file, but a few of them show up as alternate headings for other subjects.)  These 286 headings (some of them the same except for subdivisions) represented 225 distinct substitutions.  The bad headings were used in hundreds of bibliographic records, the most popular full heading being used 27 times. The vast majority of the full headings, though, were used in only one record.

What was I to replace these headings with?  Some of the headings had multiple possibilities. “Royalty” was an alternate heading for 5 different authorized headings: “Royal houses”, “Kings and rulers”, “Queens”, “Princes” and “Princesses”.   But that was the exception rather than the rule.  All but 10 of my bad headings were alternates for only one authorized heading.  After “Royalty”, the remaining 9 alternate headings presented a choice between two authorized forms.

When there’s only one authorized heading to go to, it’s pretty simple to have a script do the substitution automatically.  As I verified while doing the substitutions manually, nearly all the time the automatable substitution made sense.  (There were a few that didn’t: for instance, when “Mind and body — Early works to 1850” is replaced by “Mind and body — Early works to 1800”, works first published between 1800 and 1850 get misfiled.  But few substitutions were problematic like this– and those involving dates, like this one, can be flagged by a clever script.)

If I were doing the update over again, I’d feel more comfortable letting a script automatically reassign, and not just identify, most of my obsolete headings.  I’d still want to manually inspect changes that affect more than one or two records, to make sure I wasn’t messing up lots of records in the same way; and I’d also want to manually handle cases where more than one term could be substituted.  The rest– the vast majority of the edits– could be done fully automatically.  The occasional erroneous reassignment of a single record would be more than made up by the repair of many more obsolete and erroneous old records.  (And if my script logs changes properly, I can roll back problematic ones later on if need be.)
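
To make that concrete, here’s a sketch of the decision rule I have in mind; the helper names and the two-record threshold are illustrative, not a finished policy:

```python
import re

def plan_update(old_heading, alternate_to_authorized, record_count):
    """Decide whether a replacement is safe to apply automatically."""
    replacements = alternate_to_authorized.get(old_heading, set())
    if len(replacements) != 1:
        return "manual", None          # e.g. "Royalty" maps to 5 headings
    (new_heading,) = replacements
    if re.search(r"\d{4}", old_heading) or re.search(r"\d{4}", new_heading):
        return "manual", new_heading   # date spans can misfile works
    if record_count > 2:
        return "manual", new_heading   # inspect changes touching many records
    return "auto", new_heading         # apply, logging it for possible rollback
```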

Mind you, now that I’ve brought my headings up to date once, I expect that further updates will be quicker anyway.  The Library of Congress releases new LCSH RDF files about every 1-2 months.  There should be many fewer changes in most such incremental updates than there would be when doing years’ worth of updates all at once.
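
An incremental pass can reuse the same indexes: build them for both the old and new downloads, and flag only the headings whose status changed.  A sketch, with illustrative variable names:

```python
# `old_authorized` and `new_authorized` come from running the indexing
# sketch above on the previous and current downloads, respectively.
demoted = old_authorized - new_authorized   # headings no longer authorized

updates_needed = {
    h: alternate_to_authorized[h]           # where the new file points them
    for h in demoted
    if h in alternate_to_authorized
}
# Anything demoted but absent from `updates_needed` was dropped outright,
# and needs manual review.
```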

Looking at the evolution of the Library of Congress catalog over time, I suspect that they do a lot of this sort of automatic updating already.  But many other libraries don’t, or don’t do it thoroughly or systematically.  With frequent downloads of updated LCSH data, and good automated procedures, I suspect that many more could.  I have plans to analyze some significantly larger, older, and more diverse collections of records to find out whether my suspicions are justified, and hope to report on my results in a future post.  For now, I’d like to thank the Library of Congress once again for publishing the open data that makes these sorts of catalog investigations and improvements feasible.

June 20, 2010

How we talk about the president: A quick exploration in Google Books

Filed under: data,online books,sharing — John Mark Ockerbloom @ 10:28 pm

On The Online Books Page, I’ve been indexing a collection of memorial sermons on President Abraham Lincoln, all published shortly after his assassination, and digitized by Emory University.  Looking through them, I was struck by how often Lincoln was referred to as “our Chief Magistrate”.  That’s a term you don’t hear much nowadays, but was once much more common. Lincoln himself used the term in his first inaugural address, and he was far from the first person to do so.

Nowadays you’re more likely to hear the president referred to in different terms, with somewhat different connotations, such as “chief executive” or “commander-in-chief”.  The Constitution uses that last term in reference to the president’s command over the Armed Forces. Lately, though, I’ve heard “commander in chief” used as if it referred to the country in general.  As someone wary of the expansion of executive power in recent years, I find that usage unsettling.

I wondered, as I went through the Emory collection, whether the terms we use for the president reflect shifts in the role he has played over American history.  Is he called “commander in chief” more in times of war or military buildup, for instance?  How often was he instead called “chief magistrate” or “chief executive” over the course of American history?  And how long did “chief magistrate” stay in common use, and what replaced it?

Not too long ago, those questions would have simply remained idle curiosity.  Perhaps, if I’d had the time and patience, I could have painstakingly compiled a small selection of representative writings from various points in US history, read through them, and tried to draw conclusions from them.  But now I– and anyone else on the web– also have a big searchable, dated corpus of text to query: the Google Books collection.  Could that give me any insight into my questions?

It looks like it can, and without too much expenditure of time.  I’m by no means an expert on corpus analysis, but in a couple of hours of work, I was able to assemble promising-looking data that turned up some unexpected (but plausible) results.  Below, I’ll describe what I did and what I found out.

I started out by going to the advanced book search for Google.  From there, I specified particular decades for publications over the last 200 years: 1810-1819, 1820-1829, and so on up to 2000-2009.  For each decade, I recorded how many hits Google reported for the phrases “chief magistrate”, “chief executive”, and “commander in chief”, in volumes that also contained the word “president”.  Because the scope of Google’s collection may vary in different decades, I also recorded the total number of volumes in each decade containing the word “president”.  I then divided the number of phrase+”president” hits by the number of “president” hits, and graphed the proportional occurrences of each phrase in each decade.
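
The normalization arithmetic is simple enough to show in a few lines.  Here’s a sketch with made-up hit counts; the real numbers came from manually recorded advanced-search results:

```python
decades = ["1810s", "1820s", "1830s"]    # ...continuing through the 2000s
phrase_hits = {                          # phrase + "president" hits (made up)
    "chief magistrate": [420, 610, 890],
    "chief executive": [5, 8, 12],
    "commander in chief": [300, 280, 410],
}
president_hits = [9000, 12000, 15000]    # volumes containing "president"

for phrase, counts in phrase_hits.items():
    relative = [c / p for c, p in zip(counts, president_hits)]
    print(phrase, [f"{r:.4f}" for r in relative])
```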

The graph below shows the results.  The blue line tracks “chief magistrate”, the orange line tracks “chief executive”, and the green line tracks “commander in chief”.  The numbers on the horizontal axis count decades from the 1800s: 1 is the 1810s, 2 is the 1820s, and so on up to 20 for the 2000s.

[Graph: Relative frequencies of “chief magistrate”, “chief executive”, and “commander in chief” used along with “president”, by decade, 1810s-2000s]

You can see a larger view of this graph, and the other graphs in this post, by clicking on it.

The graph suggests that “chief magistrate” was popular in the 19th century, peaking in the 1830s.  “Chief executive” arose from obscurity in the late 19th century,  overtook “chief magistrate” in the early 20th century, and then became widely used, apparently peaking in the 1980s.  (Though by then some corporate executives– think “chief executive officer”– are in the result set along with the US president.)

We don’t see a strong trend with “commander in chief”.  There are some peaks in usage in the 1830s, and the 1960s and 1970s, but they’re not dominant, and they don’t obviously correspond to any particular set of events.  What’s going on?  Was I just imagining a relation between its usage and military buildups?  Is the Google data skewed somehow?  Or is something else going on?

It’s true that the Google corpus is imperfect, as I and others have noted before.  The metadata isn’t always accurate; the number of reported hits is approximate when more than 100 or so, and the mix of volumes in Google’s corpus varies in different time periods.  (For instance, recent years of the corpus may include more magazine content than earlier years; and reprints can make texts reappear decades after they were actually written.  The rise of print-on-demand scans of old public-domain books in the 2000s may be partly responsible for the uptick in “chief magistrate” that decade, for instance.)

But I might also not be looking at the right data.  There are lots of reasons to mention “commander-in-chief” at various times.  The apparent trend that concerned me, though, was the use of “commander in chief” as an all-encompassing term.  Searching for the phrase “our commander in chief” with “president” might be better at identifying that. That search doesn’t distinguish military from civilian uses of that phrase, but an uptick in usage would indicate either a greater military presence in the published record, or a more militarized view among civilians.  So either way, it should reflect a more militaristic view of the president’s role.

Indeed, when I graph the relative occurrences of “our commander in chief” over time, the trend line looks rather different than before.  Here it is below, with the decades labeled the same way as in the first graph:

Scaled frequency of "Our commander in chief" used along with "President", by decade

Scaled frequency of "our commander in chief" used along with "president", by decade, 1810s-2000s

Here we see increases in decades that saw major wars, including the War of 1812, the Mexican War of the 1840s, the Civil War of the 1860s, and the expanding Vietnam War of the 1970s.  This past decade had the second most-frequent usage (by a small margin) of “our commander in chief” in the last 200 years of this corpus.  But it’s dwarfed by the use during the 1940s, when Americans fought in World War II.  That’s not something I’d expected, but given the total mobilization that occurred between 1941 and 1945, it makes sense.

If we look more closely at the frequency of “our commander in chief” in the last 20 years, we also find interesting results. The graph below looks at 1991 through 2009 (add 1990 to each number on the horizontal axis; and as always, click on the image for a closer look):

Scaled frequency of "our commander in chief" used along with "president", by year, 1991-2009

Not too surprisingly, after the successful Gulf War in early 1991, usage starts to decrease.  And not long after 9/11, usage increases notably, and stays high in the years to follow.  (Books take some time to go from manuscript to publication, but we see a local high by 2002, and higher usage in most of the subsequent years.)  I was a bit surprised, though, to find an initial spike in usage in 1999.  As seen in this timeline, Bill Clinton’s impeachment and trial took place in late 1998 and early 1999, and a number of the hits during this time period are in the context of questioning Clinton’s fitness to be “our commander in chief” in the light of the Lewinsky scandal.  But once public interest moved on to the 2000 elections, in which Clinton was not a candidate, usage dropped off again until the 9/11 attacks and the wars that followed.

I don’t want to oversell the importance of these searches.  Google Books search is a crude instrument for literary analysis, and I’m still a novice at corpus analysis (and at generating Excel graphs).  But my searches suggest that the corpus can be a useful tool for identifying and tracking large-scale trends in certain kinds of expression.  It’s not a substitute for the close reading that most humanities scholarship requires.  And even with the “distant reading” of searches, you still need to look at a sampling of your results to make sure you understand what you’re finding, and aren’t copying down numbers blindly.

But with those caveats, the Google Books corpus supports an enlightening high-altitude perspective on literature and culture.  The corpus is valuable not just for its size and searchability, but also for its public accessibility.  When I report on an experiment like this, anyone else who wants to can double-check my results, or try some followup searches of their own.  (Exact numbers will naturally shift somewhat over time as more volumes get added to the corpus.)  To the extent that searching, snippets, and text are open to all, Google Books can be everybody’s literary research laboratory.

June 11, 2010

Journal liberation: A primer

Filed under: copyright,libraries,open access,publishing,sharing — John Mark Ockerbloom @ 10:07 am

As Dorothea Salo recently noted, the problem of limited access to high-priced scholarly journals may be reaching a crisis point.  Researchers who are not at a university, or are at a not-so-wealthy one, have long been frustrated by journals that are too expensive for them to read (except via slow and cumbersome inter-library loan, or distant library visits).  Now, major universities are feeling the pain as well, as bad economic news has forced budget cuts in many research libraries, even as further price increases are expected for scholarly journals.  This has forced many libraries to consider dropping even the most prestigious journals, when their prices have risen too high to afford.

Recently, for instance, the University of California, which has been subject to significant budget cuts and furloughs, sent out a letter protesting Nature Publishing Group’s proposal to raise its subscription fees by 400%.  The letter raised the possibility of cancelling all university subscriptions to NPG, and having scholars boycott the publisher.

Given that Nature is one of the most prestigious academic journals now publishing, one that has both groundbreaking current articles and a rich history of older articles, these are strong words.  But dropping subscriptions to journals like Nature might not be as much of a hardship for readers as it once might have been.  Increasingly, it’s possible to liberate the research content of academic journals, both new and old, for the world.  And, as I’ll explain below, now may be an especially opportune time to do that.

Liberating new content

While some of the content of journals like Nature is produced by the journal’s editorial staff or other writers for hire, the research papers are typically written by outside researchers, employed by universities and other research institutions.  These researchers hold the original copyright to their articles, and even if they sign an agreement with a journal to hand over rights to them (as they commonly do), they retain whatever rights they don’t sign over.  For many journals, including the ones published by Nature Publishing Group, researchers retain the right to post the submitted version of their paper (known as a “preprint”) in local repositories.  (According to the SHERPA/RoMEO database, they can also eventually post the “postprint”– the final draft resulting from peer review, but before actual publication in the journal– under certain conditions.)  These drafts aren’t necessarily identical to the version of record published in the journal itself, but they usually contain the same essential information.

So if you, as a reader, find a reference to a Nature paper that you can’t access, you can search to see if the authors have placed a free copy in an open access repository. If they haven’t, you can contact one of them to encourage them to do so.  To find out more about providing open access to research papers, see this guide.

If a journal’s normal policies don’t allow authors to share their work freely in an open access repository, authors may still be able to retain their rights with a contract addendum or negotiation.  When that hasn’t worked, some academics have decided to publish in, or review for, other journals, as the California letter suggests.  (When pushed too far, some professors have even resigned en masse from editorial boards to start new journals that are friendlier to authors and readers.)

If nothing else, scholarly and copyright conventions generally respect the right of authors to send individual copies of their papers to colleagues that request them.  Some repository software includes features that make such copies extremely easy to request and send out.  So even if you can’t find a free copy of a paper online already, you can often get one if you ask an author for it.

Liberating historic content

Many journals, including Nature, are important not only for their current papers, but for the historic record of past research contained in their back issues.  Those issues may be difficult to get a hold of, especially as many libraries drop print subscriptions, deaccession old journal volumes, or place them in remote storage.  And electronic access to old content, when it’s available at all, can be surprisingly expensive.  For instance, if I want to read this 3-paragraph letter to the editor from 1872 on Nature‘s web site, and I’m not signed in at a subscribing institution, the publisher asks me to pay them $32 to read it in full.

Fortunately, sufficiently old journals are in the public domain, and digitization projects are increasingly making them available for free.  At this point, nearly all volumes of Nature published before 1922 can now be read freely online, thanks to scans made available to the public by the University of Wisconsin, Google, and Hathi Trust.  I can therefore read the letters from that 1872 issue, on this page, without having to pay $32.

Mass digitization projects typically stop providing public access to content published after 1922, because copyright renewals after that year might still be in force.  However, most scholarly journals– including, as it turns out, Nature — did not file copyright renewals.  Because of this, Nature issues are actually in the public domain in the US all the way through 1963 (after which copyright renewal became automatic).  By researching copyrights for journals, we can potentially liberate lots of scholarly content that would otherwise be inaccessible to many. You can read more about journal non-renewal in this presentation, and research copyright renewals via this site.

Those knowledgeable about copyright renewal requirements may worry that the renewal requirement doesn’t apply to Nature, since it originates in the UK, and renewal requirements currently only apply to material that was published in the US before, or around the same time as, it was published abroad.  However, offering to distribute copies in the US counts as US publication for the purposes of copyright law.  Nature did just that when they offered foreign subscriptions to journal issues and sent them to the US; and as one can see from the stamp of receipt on this page, American universities were receiving copies within 30 days of the issue date, which is soon enough to retain the US renewal requirement.  Using similar evidence, one can establish US renewal requirements for many other journals originating in other countries.

Minding the gap

This still leaves a potential gap between the end of the public domain period and the present.  That gap is only going to grow wider over time, as copyright extensions continue to freeze the growth of the public domain in the US.

But the gap is not yet insurmountable, particularly for journals that are public domain into the 1960s.  If a paper published in 1964 included an author who was a graduate student or a young researcher, that author may well still be alive (and maybe even still working) today, 46 years later.  It’s not too late to try to track authors down (or their immediate heirs), and encourage and help them to liberate their old work.

Moreover, even if those authors signed away all their rights to journal publishers long ago, or don’t remember if they still have any rights over their own work, they (or their heirs) may have an opportunity to reclaim their rights.  For some journal contributions between 1964 and 1977, copyright may have reverted to authors (or their heirs) at the time of copyright renewal, 28 years after initial publication.  In other cases, authors or heirs can reclaim rights assigned to others, using a termination of transfer.  Once authors regain their rights over their articles, they are free to do whatever they like with them, including making them freely available.

The rules for reversion of authors’ rights are rather arcane, and I won’t attempt to explain them all here.  Terminations of transfer, though, involve various time windows when authors have the chance to give notice of termination, and reclaim their rights.  Some of the relevant windows are open right now.   In particular, if I’ve done the math correctly, 2010 marks the first year one can give notice to terminate the transfer of a paper copyrighted in 1964, the earliest year in which most journal papers are still under US copyright.  (For pre-1978 copyrights, the termination window opens 56 years after copyright was secured, and notice can be served up to 10 years in advance: 1964 + 56 = 2020, and 2020 − 10 = 2010.  The actual termination of a 1964 copyright’s transfer won’t take effect for another 10 years, though.)  There’s another window open now for copyright transfers from 1978 to 1985; some of those terminations can take effect as early as 2013.  In the future, additional years will become available for author recovery of copyrights assigned to someone else.  To find out more about taking back rights you, or researchers you know, may have signed away decades ago, see this tool from Creative Commons.

Recognizing opportunity

To sum up, we have opportunities now to liberate scholarly research over the full course of scholarly history, if we act quickly and decisively.  New research can be made freely available through open access repositories and journals.  Older research can be made freely available by establishing its public domain status, and making digitizations freely available.  And much of the research in the not-so-distant past, still subject to copyright, can be made freely available by looking back through publication lists, tracking down researchers and rights information, and where appropriate reclaiming rights previously assigned to journals.

Journal publishing plays an important role in the certification, dissemination, and preservation of scholarly information.  The research content of journals, however, is ultimately the product of scholars themselves, for the benefit of scholars and other knowledge seekers everywhere.   However the current dispute between Nature Publishing Group and the University of California is ultimately resolved, we would do well to remember the opportunities we have to liberate journal content for all.

May 6, 2010

Making discovery smarter with open data

Filed under: architecture,discovery,online books,open access,sharing,subjects — John Mark Ockerbloom @ 9:06 am

I’ve just made a significant data enhancement to subject browsing on The Online Books Page.  It improves the concept-oriented browsing of my catalog of online books via subject maps, where users explore a subject along multiple dimensions from a starting point of interest.

Say you’d like to read some books about logic, for instance.  You’d rather not have to go find and troll all the appropriate shelf sections within math, philosophy, psychology, computing, and wherever else logic books might be found in a physical library.  And you’d rather not have to think of all the different keywords used to identify different logic-related topics in a typical online catalog. In my subject map for logic, you can see lots of suggestions of books filed both under “Logic” itself, and under related concepts.  You can go straight to a book that looks interesting, select a related subject and explore that further, or select the “i” icon next to a particular book to find more books like it.

As I’ve noted previously, the relationships and explanations that enable this sort of exploration depend on a lot of data, which has to come from somewhere.  In previous versions of my catalog, most of it came from a somewhat incomplete and not-fully-up-to-date set of authority records in our local catalog at Penn.  But the Library of Congress (LC) has recently made authoritative subject cataloging data freely available on a new website.  There, you can query it through standard interfaces, or simply download it all for analysis.

I recently downloaded their full data set (38 MB of zipped RDF), processed it, and used it to build new subject maps for The Online Books Page.   The resulting maps are substantially richer than what I had before.  My collection is fairly small by the standards of mass digitization– just shy of 40,000 items– but still, the new data, after processing, yielded over 20,000 new subject relationships, and over 600 new notes and explanations, for the subjects represented in the collection.

That’s particularly impressive when you consider that, in some ways, the RDF data is cruder than what I used before.  The RDF schemas that LC uses omit many of the details and structural cues that are in the MARC subject authority records at the Library of Congress (and at Penn).  And LC’s RDF file is also missing many subjects that I use in my catalog; in particular, at present it omits many records for geographic, personal, and organizational names.

Even so, I lost few relationships that were in my prior maps, and I gained many more.  There were two reasons for this:  First of all, LC’s file includes a lot of data records (many times more than my previous data source), and they’re more recent as well.  Second, a variety of automated inference rules– lexical, structural, geographic, and bibliographic– let me create additional links between concepts with little or no explicit authority data.  So even though LC’s RDF file includes no record for Ontario, for instance, its subject map in my collection still covers a lot of ground.
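
To give a flavor of what such rules can look like, here are sketches of a structural rule and a lexical rule.  These are simplified inventions for illustration, not my actual code, and they assume subdivisions flattened with " -- ":

```python
def structural_links(heading):
    """A subdivided heading implies its parents: 'Logic -- History' -> 'Logic'."""
    parts = heading.split(" -- ")
    return [(" -- ".join(parts[:i]), "broader")
            for i in range(len(parts) - 1, 0, -1)]

def lexical_links(heading, authorized):
    """An inverted heading often names a related concept:
    'Logic, Symbolic' ~ 'Symbolic logic'."""
    if ", " in heading:
        tail, head = heading.rsplit(", ", 1)
        candidate = (head + " " + tail).capitalize()
        if candidate in authorized:
            return [(candidate, "related")]
    return []
```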

A few important things make these subject maps possible, and will help them get better in the future:

  • A large, shared, open knowledge base: The Library of Congress Subject Headings have been built up by dedicated librarians at many institutions over more than a century.  As a shared, evolving resource, the data set supports unified searching and browsing over numerous collections, including mine.  The work of keeping it up to date, and in sync with the terms that patrons use to search, can potentially be spread out among many participants.  As an open resource, the data set can be put to a variety of uses that both increase the value of our libraries and encourage the further development of the knowledge base.
  • Making the most of automation: LC’s website and standards make it easy for me to download and process their data automatically. Once I’ve loaded their data, and my own records, I then invoke a set of automated rules to infer additional subject relationships.  None of the rules is especially complex; but put together, they do a lot to enhance the subject maps. Since the underlying data is open, anyone else is also free to develop new rules or analyses (or adapt mine, once I release them).  If a community of analyzers develops, we can learn from each other as we go.  And perhaps some of the relationships we infer through automation can be incorporated directly into later revisions of LC’s own subject data.
  • Judicious use of special-purpose data: It is sometimes useful to add to or change data obtained from external sources.  For example, I maintain a small supplementary data file on major geographic areas.  A single data record saying that Ontario is a region within Canada, and is abbreviated “Ont.”, generates much of my subject map for Ontario (a minimal sketch of this idea appears below).  Soon, I should also be able to re-incorporate local subject records, as well as arbitrary additional overlays, to fill in conceptual gaps in LC’s file.  Since local customizations can take a lot of effort to maintain, however, it’s best to try to incorporate local data into shared knowledge bases when feasible.  That way, others can benefit from, and add on to, your own work.
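
Here’s the kind of supplementary geographic record I mean, and the sort of links it can seed.  The field names are invented for this sketch:

```python
geo_supplement = {
    "Ontario": {"within": "Canada", "abbrev": "Ont."},
}

def geographic_links(heading, geo_supplement):
    """Seed a subject map for a place that LC's RDF file omits."""
    info = geo_supplement.get(heading)
    if not info:
        return []
    return [
        ("broader", info["within"]),   # Ontario sits inside Canada
        ("abbrev", info["abbrev"]),    # matches legacy subdivisions like "-- Ont."
    ]
```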

Recently, there’s been a fair bit of debate about whether to treat cataloging data as an open public good, or to keep it more restricted.  The Library of Congress’ catalog data has been publicly accessible online for years, though until recently you could only get a little at a time via manual searches, or pay a large sum to get a one-time data dump.  By creating APIs, using standard semantic XML formats, and providing free, unrestricted data downloads for their subject authority data, LC has made their data much easier for others to use in a variety of ways. It’s improved my online book catalog significantly, and can also improve many other catalogs and discovery applications.  Those of us who use this data, in turn, have incentives to work to improve and sustain it.

Making the LC Subject Headings ontology open data makes it both more useful and more viable as libraries evolve.  I thank the folks at the Library of Congress for their openness with their data, and I hope to do my part in improving and contributing to their work as well.

April 7, 2010

Copyright information is busting out all over

Filed under: copyright,sharing — John Mark Ockerbloom @ 3:43 pm

Like the crocuses and daffodils now coming up all over our front garden, new copyright registration information has been popping up all over the net lately.  As I’ve described in various previous posts, this information can be extremely useful for folks who want to revive, disseminate, or reuse works from the past.

Here’s a summary of some of the recent highlights:

Copyright renewals for maps and commercial prints are now all online, and join what is now a complete set of renewals of active copyrights for still images.  The scanning was done here at the Penn Libraries by me and by the Schoenberg Center for Electronic Text and Image, from microfilms and volumes loaned by the Free Library of Philadelphia.  I thank all the folks who helped out with this project.

With the addition of this latest set of records, you can now find copyright renewals online for nearly anything you’d find in a book, if they’re recent enough to still be in force.  The only active copyright renewals of any sort not yet online at this point to my knowledge are renewals for most music prior to 1978, and a few small sets of pre-1978 renewals for film (about 2 years’ worth in all).

Original copyright registrations are also going online at a rapid rate.   The biggest publicly accessible set of original registrations from 1923 onward (the date of the oldest copyrights still in force) is at Hathi Trust, and consists of digitized volumes that have been scanned by Google for Hathi member libraries.  I’ve included them in a list of registration volumes organized by year and type of work on my Catalog of Copyright Entries Page, which has now been reorganized to combine all the original and renewal registrations known to be available online.  I’ve also added direct page links to renewal and other important sections of the volumes, so that researchers looking for those can go to them directly.  In many cases, the renewal sections can be downloaded for offline use.  I’ve also brought out statistics from the volumes, to help give readers a sense of the rate of registrations and renewals.

Google is making enhanced versions of book copyright registration volumes available online. Specifically, they’ve digitized the full set of original and renewal registrations for books from 1922-1977, in a set of scans that are of generally higher quality than the ones at Hathi Trust.  You can search the full text of the entire set at once, or search or browse individual volumes.

These scans were done specially for copyright research purposes, and seem to involve more careful scanning than the normal mass-book-digitization procedures Google used for the Hathi Trust volumes.  They aren’t entirely free of problems– I identify a few trouble spots in my listings– and they also don’t include registrations for other types of work, which has apparently confused some folks who have contacted me.  But they’re quite high quality overall, and could be a very good basis for structured data records of these copyright registrations.  Google has previously made such records available for book copyright renewals; I hope we’ll see a release of records based on these new scans before long as well.

Also in the pipeline: Based on conversations I’ve had with others interested in copyright issues, we may well see a complete set of copyright registrations and renewals online (at least in the form of page images from the Catalog of Copyright Entries) by the end of this year.  And a number of projects are working on making this digitized information more useful for practical copyright clearance.  Today, for instance, I heard about the Durationer project, being presented at the Copyright@300 conference at Berkeley later this week.  The project is developing a tool to help people determine the copyright status of specific works in specific jurisdictions, based on copyright registrations and other relevant information.

Some possible future directions: As I described in more detail in a 2007 paper, a thorough determination of a work’s copyright status depends not just on registration information, but on various other kinds of information, much of which can be found in a work’s bibliographic records.  Copyright registration data can also be used to build new bibliographic data structures.  Therefore, the interests of copyright clearance and the interests of access to bibliographic data tend to converge.  I elaborate on this idea in a guest blog post for the Open Knowledge Foundation, who I’ve started to work with in these areas.  (For folks following the debate over OCLC’s WorldCat, this convergence is also worth keeping in mind when reading the just-released WorldCat Rights and Responsibilities draft, which I hope to comment on in the not-too-distant future.)

I hope you find this new copyright information useful.  And I’m very interested in hearing what you’re doing with it, or would like to do with it.

March 24, 2010

Erin McKean and inclusive entrepreneurship

Filed under: people — John Mark Ockerbloom @ 11:33 pm

Today is Ada Lovelace Day, a day celebrating the achievements of women in science and technology.  There are all kinds of ways to be a scientist or a technologist, and just in the fields of computing and information technology I can think of a number of first-class inventors, investigators, developers, teachers, and integrators who are women.

When it comes to entrepreneurs, though, the list of women who come to my mind gets much shorter.  And that’s unfortunate, because in computing and information technology, as in many other technical fields, entrepreneurs play a major role in bringing the fruits of technology to the world at large.   If you mention Steve and Steve, Bill and Paul, or Larry and Sergey, lots of people will know who you’re talking about, and also know the stories of the companies they founded and their world-changing products.  Women tech-company founders, though, have not been so noticeable.  Jessica Livingston, one of the founders of the tech-startup catalyst Y Combinator, reports that nearly all of their applicants have Y chromosomes; only about 7 percent have been women. And in mid-2008, the San Jose Mercury News reported that there were no female CEOs in the top 150 Silicon Valley companies.

Despite these discouraging statistics, you can find women pioneering new technological businesses.  One who I find particularly notable (and not just because the woman technologist closest to me works for her) is Erin McKean, co-founder and CEO of Wordnik, a company that’s reinventing the dictionary for the Internet age. If you watch this 15-minute video of a 2007 TED talk, or just read the About and FAQ pages at Wordnik and play around with the site some, you’ll pick up the general ideas.   There are a few aspects of her venture that I think are worth special note:

Rethinking the familiar. Dictionaries have been around in print for centuries, and even computerized dictionaries have existed for decades.  But as Erin makes clear in the TED video, once you have the Internet, and lots of data and computing power, you can discard many of the limitations of prior conceptions of the dictionary, and do lots of interesting new things.  You can include all the words in a language, not just the ones that pass some sort of notability test.  You can show examples from a huge array of formal, informal, and ephemeral sources.  You can do statistical analysis to track word usage over different times, places, and genres.  In short, you can make a reference that meets a known need in new and useful ways.

Risk-taking. It’s a truism that startups are risky ventures.   You spend years of your life working long hours, often with low pay and few benefits at the start, for a project that may make you rich, but statistically is more likely to come to nothing.   And some believe that online dictionaries, in particular, may be obsolete, and vulnerable to the same sorts of disruptive markets that blew the encyclopedia business to bits.  But thus far, Erin’s risk-taking seems to be paying off; the site was named one of PC Magazine’s Top 100 products of 2009 (one of only a few websites to be included in their list), and it continues to attract funding even in the midst of a severe recession.

Inclusion. One of the key distinguishing features of Wordnik, alluded to above, is its inclusiveness.  By collecting all the words and examples that it can, from a wide variety of sources, it stands out from other dictionaries.  And along with the usual definitions, pronunciations, and example sentences for words, Wordnik also brings in many other unconventional sources — Flickr photo streams, Scrabble scoring references, and user tags and comments, to name a few– to enhance understanding and enjoyment of words.  It remains to be seen which of these sources will prove most useful or popular in the long run, but openness to new information and ideas is often a crucial part of inventing and improving new technologies.   It also helps ventures evaluate and adapt their technologies and their business models, so that they can thrive rather than perish under changing conditions.

Inviting collaboration. If inclusiveness is important to you, it’s not enough just to wait for things to come to you; you need to go out and invite people to work with you.  The Wordnik site does this to a certain extent by design, encouraging people to contribute notes, tags, lists of favorite words, and other information.  (Our 7-year-old daughter was an ardent contributor of Pokemon character names and pronunciations during the site’s alpha test.)  More recently, Wordnik has partnered with the Internet Archive and a variety of publishers and publishing sites to develop Smartwords, a forthcoming open standard for querying and embedding word information on demand as people read online, or communicate via social software.  Erin and her collaborators hope that this will extend the reach of Wordnik’s services into many more contexts than just “going to look up a word in the dictionary”.

While the details above are specific to Wordnik, the four basic qualities they embody – rethinking the familiar, risk-taking, inclusion, and inviting collaboration — have general applicability, and are especially worth consideration on Ada Lovelace Day.  The low numbers quoted earlier for women tech founders and CEOs suggest to me that women may not always be seriously considered (by themselves or others) for those roles.  Thinking in terms of the qualities I’ve mentioned makes it easier to envision women in those positions.  These are also qualities that can help both men and women take a more entrepreneurial approach to the technologies (and the libraries) they develop, so that they can have a lasting, positive effect on the world.

March 23, 2010

Lots of conversation keeps stuff sustainable

Filed under: libraries,people,preservation,sharing — John Mark Ockerbloom @ 10:12 pm

Among the hats I wear at my place of work is that of LOCKSS cache administrator. LOCKSS is a useful distributed preservation system built around the principle “Lots of copies keep stuff safe” (whose initials give the system its name).  The idea is that, with the cooperation of publishers, a bunch of libraries each harvest copies of selected online content, and keep backups on our own LOCKSS caches, which are hooked up to local library proxy services.  Then, if the material ever becomes inaccessible from the publisher, our users will automatically be routed to our local copies.  Each LOCKSS cache also periodically checks with other LOCKSS caches to ensure that our copies are still in good shape, and to repair or replace copies that have been lost or damaged.  (Various security features protect against leaks of restricted content, or unauthorized revisions of content.)

LOCKSS is open source software that runs on commodity hardware.  It was originally envisioned to run virtually automatically.  As Chris Dobson described the ideal in a 2003 Searcher article, “Take a computer a generation past its prime…. Hook it up to the Internet and put it in a closet. Stick in the LOCKSS CD-ROM and boot it up. Close the closet door.”  And then presumably walk away and forget about it.

Of course, it’s not that simple in practice, particularly if your library is proactive about its preservation strategy.  The thing about preservation at scale is there’s always something that needs attention.  It might be something technical, or content-related, or planning-related, but preserving a growing collection requires ongoing thought.  And if you want to think as clearly and sensibly as you can, you’ll want to collaborate.

Right now, for instance, I’m trying to get my cache to harvest the full run of a journal that’s just been made available for LOCKSS harvesting, where we hope to provide post-cancellation access through LOCKSS.  Someone at Stanford just gave me a useful tip on how to give this journal priority over the other volumes I’ve got queued up for harvest.  Unfortunately, I can’t try it out until I get my cache back up after it failed to reboot cleanly after a power failure. While I wait to hear back about how best to remedy this, I wonder whether switching to a new Linux-based version of LOCKSS might make such operating system-level problems easier to deal with.  But it would be useful to hear from folks who are running that version to see what their experience has been.

Meanwhile, we’re wondering how best to approach new publishers who have content that our bibliographers would like to preserve via LOCKSS. Our special collections folks wonder whether we should preserve some of our own home-grown content via a private LOCKSS network.  I’m also doing some ongoing monitoring and testing of our LOCKSS cache’s behavior (some of which I’ve reported on earlier), and would be interested in knowing if others are seeing some of the same kinds of things that I see on the cache I administer.

In short, there are a lot of things to think about, when LOCKSS plays a significant role in a preservation plan.  And a lot of the issues I’ve mentioned above are ones that others may be thinking about as well.  So let’s talk about them.  As the LOCKSS group has said, “A vibrant, active, and engaged user community is key to the success of Open-Source efforts like LOCKSS.”

One thing you need for such an engaged community is a forum for them to talk to each other.  As it turns out, the LOCKSS group at Stanford tell me they created a LOCKSS Forum mailing list a while back, but I haven’t yet seen it publicized.   Its information page is at https://mailman.stanford.edu/mailman/listinfo/lockss-forum .  (Currently, archived email messages are not visible on the open web, though this may change in the future.)  If you’re interested in talking with others about how you use or might use LOCKSS to preserve access to digital content, I invite you to sign up and help get the conversation going.

March 9, 2010

Implementing interoperability between library discovery tools and the ILS

Filed under: architecture,discovery — John Mark Ockerbloom @ 4:57 pm

Last June I gave a presentation in a NISO webinar about the work a number of colleagues and I did for the Digital Library Federation to recommend standard interfaces for Integrated Library Systems (the systems that keep track of our library’s acquisitions, catalog, and circulation) to support a wide variety of tools and applications for discovery.   Our “ILS-DI” recommendation was published in 2008, and encompassed a number of functions that some ILS’s supported.  But it also included many functions that were not generally, or uniformly, supported by ILS’s of the time.  That’s still the case today.

As I said in my presentation last June, “If we look at the ILS-DI process as a development spiral, we’ve moved from a specification stage  to an implementation stage.”  My hope has been that vendors and other library software implementers would implement the basics of what we recommended– as many agreed to– and the library community could progress from there.  This often takes longer to achieve than one might hope.

But I’m happy to report that the Code4lib community is now picking up the ball.  At this month’s Code4lib conference, a group met to discuss “collaboratively develop[ing] a middleware infrastructure” to link together ILS’s and discovery tools, based on the work done by the DLF’s ILS-DI group and by the developers of systems like Jangle and XC.  The middleware would help power discovery applications like Blacklight, VuFind, Summon, WorldCat Local, and whatever else the digital library community might invent.

I wasn’t at the Code4lib conference, but the group that met there to kick off the effort has impressive collective expertise and accomplishments.   It includes several members of the DLF’s ILS-DI group, as well as the lead implementors of several relevant systems.  Roy Tennant from OCLC Research is coordinating the initial activity, and Emily Lynema of the ILS-DI group has converted the Google groups space used by the ILS-DI group for the new effort.

And you’re welcome to join too, if you’d like to help out or learn more. “This is an open, collaborative effort” is how Roy put it in the announcement of the new initiative.  Due to some prior commitments, I’ll personally be watching more than actively participating, at least to begin with, but I’ll be watching with great interest.  To find out more, and to get involved, see the Google Group.

February 8, 2010

Shedding light on images in the public domain

Filed under: copyright,sharing — John Mark Ockerbloom @ 3:05 pm

For years, I’ve regularly gotten requests from authors and publishers for licenses to reproduce images in books listed on The Online Books Page, or included in the local collection of A Celebration of Women Writers.  Sometimes these requests relate to copyrighted books that I list but don’t control rights for; in those cases, I do my best to refer the request to the book’s copyright holder.  But often, they’re for images in our own collections, from books published over 100 years ago.  In those cases, I respond that the image is in the public domain (and our digitization, which adds no originality, is also in the public domain), so no license is necessary or appropriate.

Usually that response receives a thankful reply, sometimes with signs of surprise that an image can be reused without permission.  But sometimes I’ll get back a more alarmed reply.  “My publisher says I need a license for every image in my book, or I can’t use it,” it might say, followed by a plea for help in tracking down some long-defunct 19th century publisher.

I wish I could say this was an atypical anecdote.   But, if you look around the Web, you’ll find that there are huge numbers of historic images– paintings, photographs, figures, and the like– that are behind access barriers, or closed off altogether from online access, when they don’t have to be.  Artstor has over a million images of thousands of years of art that you can’t look at unless you’re at an institution that has a subscription.  The fine arts image catalog at my own library has over 100,000 digital images, none of which can be seen online by the public outside of Penn, except in thumbnails.  Neither Artstor nor Penn want to keep art away from the public; both are nonprofit educational institutions. But clearing images for free public access on a large scale has to date been impractical for these institutions.

Restrictions on images also create holes in other works.  For instance, under the proposed Google Books settlement, images in books that might be under copyright would be blanked out unless the rightsholder to the book also asserted they held the rights in the images.  These sorts of omissions can cut the heart out of many works.  In a recent New Republic article, “For the Love of Culture“, Lawrence Lessig described how a critical table was omitted in an otherwise free article about his daughter’s possible illness, due to rights-clearance issues.  “I could not believe that we were this far down the path to insanity already,” he wrote of the incident.

Part of the insanity is that many of these images from our cultural heritage are actually in the public domain.  Many people are aware that copyrights prior to 1923 have expired in the US.  But so have many copyrights from later in the 20th century.  Pre-1964 copyrights generally had to be renewed 28 years after the start of their term, or they would expire.   (Exceptions and further details are described here.)  But most copyrights were never renewed; and that’s especially true for images.

In 1923, there were copyright registrations for 3,059 works of art, 1,149 scientific and technical drawings, 7,533 photographs, and 11,289 prints and pictorial illustrations, making a total of 23,030 copyright registrations for these classes of image.  In 1951, 28 years later, there were 198 copyright renewals for all of these image classes combined.  This represents a renewal rate of less than 1%.

We have just completed posting scans that make all active copyright renewals for artwork viewable online.  In fact, once we finish scanning one last batch of renewals for maps and for commercial prints (meaning images created for product packaging and promotion) all active copyright renewals for any type of still image will be viewable online.   In later years, the number of image copyright renewals grows slightly, but not by much. But the number of images published in those years grows substantially.

Images without a copyright registration of their own might still be under copyright if they were first published as part of a copyrighted book, newspaper, magazine, or other larger work.  Fortunately, we have complete online renewal records for those kinds of works too.  It becomes much easier to establish the public domain status of a newspaper photograph, for instance, if you know (as I previously revealed) that no newspaper outside New York renewed copyright for any issue published before the end of World War II.

Having copyright renewals online for artwork is an important step towards freeing the public domain in images.  But there’s more needed to make copyright clearance practical at a large scale.  Putting scanned renewal records into a searchable database (perhaps combined with fair use image thumbnails) will make it easier to find any copyright renewals that might exist for a particular image.  (A similar database for book renewals already exists, and there are more book renewals than image renewals.)  Making original copyright registrations available as well (as we now have for artwork through 1949, and soon will have for later years) lets us determine when the copyright for an image began, and whether it was renewed in time to prevent it from expiring.

Furthermore, establishing the history and provenance of images will let us determine when unregistered artwork enters the public domain.  Registered or not, the copyright to an image created before 1964 began no later than its first US publication, and the copyright for many such images therefore ended after 28 years due to a lack of renewal.  And the mostly-frozen American public domain still gains more work each year that was never published before 2003.  On Public Domain Day last month, all such work by artists who died in 1939 entered the public domain in the US.  (I won’t get now into the rather baroque rules for establishing “publication” of an artwork, but you can determine it if the history of the image is documented.)

So we have a rich treasure trove of images in the public domain that’s been largely buried under presumptions and uncertainties about copyright.  By finding and sharing information about their copyrights, we can protect and enjoy these images in the commons of the public domain, where they can be viewed freely, included in new works, and reused in any way we can imagine.  If you find this prospect intriguing, I hope you’ll help bring these images to light.
