June 20, 2010

How we talk about the president: A quick exploration in Google Books

On The Online Books Page, I’ve been indexing a collection of memorial sermons on President Abraham Lincoln, all published shortly after his assassination, and digitized by Emory University.  Looking through them, I was struck by how often Lincoln was referred to as “our Chief Magistrate”.  That’s a term you don’t hear much nowadays, but it was once much more common. Lincoln himself used the term in his first inaugural address, and he was far from the first person to do so.

Nowadays you’re more likely to hear the president referred to in different terms, with somewhat different connotations, such as “chief executive” or “commander-in-chief”.  The Constitution uses that last term in reference to the president’s command over the Armed Forces. Lately, though, I’ve heard “commander in chief” used as if it referred to the country in general.  As someone wary of the expansion of executive power in recent years, I find that usage unsettling.

I wondered, as I went through the Emory collection, whether the terms we use for the president reflect shifts in the role he has played over American history.  Is he called “commander in chief” more in times of war or military buildup, for instance?  How often was he instead called “chief magistrate” or “chief executive” over the course of American history?  And how long did “chief magistrate” stay in common use, and what replaced it?

Not too long ago, those questions would have simply remained idle curiosity.  Perhaps, if I’d had the time and patience, I could have painstakingly compiled a small selection of representative writings from various points in US history, read through them, and tried to draw conclusions from them.  But now I– and anyone else on the web– also have a big searchable, dated corpus of text to query: the Google Books collection.  Could that give me any insight into my questions?

It looks like it can, and without too much expenditure of time.  I’m by no means an expert on corpus analysis, but in a couple of hours of work, I was able to assemble promising-looking data that turned up some unexpected (but plausible) results.  Below, I’ll describe what I did and what I found out.

I started out by going to the advanced book search for Google.  From there, I specified particular decades for publications over the last 200 years: 1810-1819, 1820-1829, and so on up to 2000-2009.  For each decade, I recorded how many hits Google reported for the phrases “chief magistrate”, “chief executive”, and “commander in chief”, in volumes that also contained the word “president”.  Because the scope of Google’s collection may vary in different decades, I also recorded the total number of volumes in each decade containing the word “president”.  I then divided the number of phrase+”president” hits by the number of “president” hits, and graphed the proportional occurrences of each phrase in each decade.
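The normalization step described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the spreadsheet actually used for the graphs: the function name `relative_frequencies` is hypothetical, and the counts in the usage example are made-up placeholders, not the hit counts Google reported.

```python
def relative_frequencies(phrase_hits, baseline_hits):
    """For each decade, divide the phrase+"president" hit count
    by the count of volumes containing "president"."""
    ratios = {}
    for decade, baseline in baseline_hits.items():
        if baseline == 0:
            continue  # no "president" volumes recorded; ratio undefined
        ratios[decade] = phrase_hits.get(decade, 0) / baseline
    return ratios


# Placeholder counts for illustration only (NOT real search results).
# Keys are the first year of each decade.
president_hits = {1810: 1000, 1820: 2000}
chief_magistrate_hits = {1810: 150, 1820: 400}

print(relative_frequencies(chief_magistrate_hits, president_hits))
```

Dividing by the per-decade "president" count is what keeps a decade with many more scanned volumes from looking like a decade in which the phrase was more popular.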

The graph below shows the results.  The blue line tracks “chief magistrate”, the orange line tracks “chief executive”, and the green line tracks “commander in chief”.  The numbers in the horizontal axis refer to the decade+1800s; e.g. 1 is the 1810s, 2 is 1820s, all the way up to 20 being the 2000s.

Relative frequencies of "chief magistrate", "chief executive", and "commander in chief" used along with "president", by decade, 1810s-2000s

You can see a larger view of this graph, and the other graphs in this post, by clicking on it.

The graph suggests that “chief magistrate” was popular in the 19th century, peaking in the 1830s.  “Chief executive” arose from obscurity in the late 19th century,  overtook “chief magistrate” in the early 20th century, and then became widely used, apparently peaking in the 1980s.  (Though by then some corporate executives– think “chief executive officer”– are in the result set along with the US president.)

We don’t see a strong trend with “commander in chief”.  There are some peaks in usage in the 1830s, and the 1960s and 1970s, but they’re not dominant, and they don’t obviously correspond to any particular set of events.  What’s going on?  Was I just imagining a relation between its usage and military buildups?  Is the Google data skewed somehow?  Or is something else going on?

It’s true that the Google corpus is imperfect, as I and others have noted before.  The metadata isn’t always accurate; the number of reported hits is approximate when more than 100 or so, and the mix of volumes in Google’s corpus varies in different time periods.  (For instance, recent years of the corpus may include more magazine content than earlier years; and reprints can make texts reappear decades after they were actually written.  The rise of print-on-demand scans of old public-domain books in the 2000s may be partly responsible for the uptick in “chief magistrate” that decade, for instance.)

But I might also not be looking at the right data.  There are lots of reasons to mention “commander-in-chief” at various times.  The apparent trend that concerned me, though, was the use of “commander in chief” as an all-encompassing term.  Searching for the phrase “our commander in chief” with “president” might be better at identifying that. That search doesn’t distinguish military from civilian uses of that phrase, but an uptick in usage would indicate either a greater military presence in the published record, or a more militarized view among civilians.  So either way, it should reflect a more militaristic view of the president’s role.

Indeed, when I graph the relative occurrences of “our commander in chief” over time, the trend line looks rather different than before.  Here it is below, with the decades labeled the same way as in the first graph:

Scaled frequency of "our commander in chief" used along with "president", by decade, 1810s-2000s

Here we see increases in decades that saw major wars, including the War of 1812, the Mexican War of the 1840s, the Civil War of the 1860s, and the Vietnam War expanding in the 1970s.  This past decade had the second most-frequent usage (by a small margin) of “our commander in chief” in the last 200 years of this corpus.  But it’s dwarfed by the use during the 1940s, when Americans fought in World War II.  That’s not something I’d expected, but given the total mobilization that occurred between 1941 and 1945, it makes sense.

If we look more closely at the frequency of “our commander in chief” in the last 20 years, we also find interesting results. The graph below looks at 1991 through 2009 (add 1990 to each number on the horizontal axis; and as always, click on the image for a closer look):

Scaled frequency of "our commander in chief" used along with "president", by year, 1991-2009

Not too surprisingly, after the successful Gulf War in early 1991, usage starts to decrease.  And not long after 9/11, usage increases notably, and stays high in the years to follow.  (Books take some time to go from manuscript to publication, but we see a local high by 2002, and higher usage in most of the subsequent years.)  I was a bit surprised, though, to find an initial spike in usage in 1999.  As seen in this timeline, Bill Clinton’s impeachment and trial took place in late 1998 and early 1999, and a number of the hits during this time period are in the context of questioning Clinton’s fitness to be “our commander in chief” in the light of the Lewinsky scandal.  But once public interest moved on to the 2000 elections, in which Clinton was not a candidate, usage dropped off again until the 9/11 attacks and the wars that followed.

I don’t want to oversell the importance of these searches.  Google Books search is a crude instrument for literary analysis, and I’m still a novice at corpus analysis (and at generating Excel graphs).  But my searches suggest that the corpus can be a useful tool for identifying and tracking large-scale trends in certain kinds of expression.  It’s not a substitute for the close reading that most humanities scholarship requires.  And even with the “distant reading” of searches, you still need to look at a sampling of your results to make sure you understand what you’re finding, and aren’t copying down numbers blindly.

But with those caveats, the Google Books corpus supports an enlightening high-altitude perspective on literature and culture.  The corpus is valuable not just for its size and searchability, but also for its public accessibility.  When I report on an experiment like this, anyone else who wants to can double-check my results, or try some followup searches of their own.  (Exact numbers will naturally shift somewhat over time as more volumes get added to the corpus.)  To the extent that searching, snippets, and text are open to all, Google Books can be everybody’s literary research laboratory.

June 11, 2010

Journal liberation: A primer

As Dorothea Salo recently noted, the problem of limited access to high-priced scholarly journals may be reaching a crisis point.  Researchers that are not at a university, or are at a not-so-wealthy one, have long been frustrated by journals that are too expensive for them to read (except via slow and cumbersome inter-library loan, or distant library visits).  Now, major universities are feeling the pain as well, as bad economic news has forced budget cuts in many research libraries, even as further price increases are expected for scholarly journals.  This has forced many libraries to consider dropping even the most prestigious journals, when their prices have risen too high to afford.

Recently, for instance, the University of California, which has been subject to significant budget cuts and furloughs, sent out a letter in protest of Nature Publishing Group’s proposal to raise their subscription fees by 400%.  The letter raised the possibility of cancelling all university subscriptions to NPG, and having scholars boycott the publisher.

Given that Nature is one of the most prestigious academic journals now publishing, one that has both groundbreaking current articles and a rich history of older articles, these are strong words.  But dropping subscriptions to journals like Nature might not be as much of a hardship for readers as it once might have been.  Increasingly, it’s possible to liberate the research content of academic journals, both new and old, for the world.  And, as I’ll explain below, now may be an especially opportune time to do that.

Liberating new content

While some of the content of journals like Nature is produced by the journal’s editorial staff or other writers for hire, the research papers are typically written by outside researchers, employed by universities and other research institutions.  These researchers hold the original copyright to their articles, and even if they sign an agreement with a journal to hand over rights to them (as they commonly do), they retain whatever rights they don’t sign over.  For many journals, including the ones published by Nature Publishing Group, researchers retain the right to post the submitted version of their paper (known as a “preprint”) in local repositories.  (According to the Romeo database, they can also eventually post the “postprint”– the final draft resulting after peer review, but before actual publication in the journal– under certain conditions.)  These drafts aren’t necessarily identical to the version of record published in the journal itself, but they usually contain the same essential information.

So if you, as a reader, find a reference to a Nature paper that you can’t access, you can search to see if the authors have placed a free copy in an open access repository. If they haven’t, you can contact one of them to encourage them to do so.  To find out more about providing open access to research papers, see this guide.

If a journal’s normal policies don’t allow authors to share their work freely in an open access repository, authors may still be able to retain their rights with a contract addendum or negotiation.  When that hasn’t worked, some academics have decided to publish in, or review for, other journals, as the California letter suggests.  (When pushed too far, some professors have even resigned en masse from editorial boards to start new journals that are friendlier to authors and readers.)

If nothing else, scholarly and copyright conventions generally respect the right of authors to send individual copies of their papers to colleagues that request them.  Some repository software includes features that make such copies extremely easy to request and send out.  So even if you can’t find a free copy of a paper online already, you can often get one if you ask an author for it.

Liberating historic content

Many journals, including Nature, are important not only for their current papers, but for the historic record of past research contained in their back issues.  Those issues may be difficult to get a hold of, especially as many libraries drop print subscriptions, deaccession old journal volumes, or place them in remote storage.  And electronic access to old content, when it’s available at all, can be surprisingly expensive.  For instance, if I want to read this 3-paragraph letter to the editor from 1872 on Nature’s web site, and I’m not signed in at a subscribing institution, the publisher asks me to pay them $32 to read it in full.

Fortunately, sufficiently old journals are in the public domain, and digitization projects are increasingly making them available for free.  Nearly all volumes of Nature published before 1923 can now be read freely online, thanks to scans made available to the public by the University of Wisconsin, Google, and Hathi Trust.  I can therefore read the letters from that 1872 issue, on this page, without having to pay $32.

Mass digitization projects typically stop providing public access to content published after 1922, because copyright renewals after that year might still be in force.  However, most scholarly journals– including, as it turns out, Nature — did not file copyright renewals.  Because of this, Nature issues are actually in the public domain in the US all the way through 1963 (after which copyright renewal became automatic).  By researching copyrights for journals, we can potentially liberate lots of scholarly content that would otherwise be inaccessible to many. You can read more about journal non-renewal in this presentation, and research copyright renewals via this site.
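The year cutoffs in the paragraph above can be written down as a small decision rule. The Python sketch below is a simplification for illustration only: the function name `us_status_as_of_2010` is hypothetical, and it assumes the issue counts as a US publication that met the other formalities of its day, which a real copyright determination would have to verify.

```python
def us_status_as_of_2010(pub_year, renewal_filed):
    """Rough US copyright status of a journal issue, as of 2010.

    Assumes US publication with the formalities of the time satisfied;
    a real determination needs actual research into the renewal records.
    """
    if pub_year <= 1922:
        return "public domain"
    if pub_year <= 1963:
        # Renewal had to be filed, or the copyright lapsed.
        if renewal_filed:
            return "possibly still in copyright"
        return "public domain (not renewed)"
    # From 1964 on, renewal became automatic.
    return "possibly still in copyright"


print(us_status_as_of_2010(1872, renewal_filed=False))
print(us_status_as_of_2010(1950, renewal_filed=False))
print(us_status_as_of_2010(1964, renewal_filed=False))
```

The middle branch is the one that matters for journal liberation: for issues from 1923 through 1963, the answer turns entirely on whether a renewal was filed.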

Those knowledgeable about copyright renewal requirements may worry that the renewal requirement doesn’t apply to Nature, since it originates in the UK, and renewal requirements currently only apply to material that was published in the US before, or around the same time as, it was published abroad.  However, offering to distribute copies in the US counts as US publication for the purposes of copyright law.  Nature did just that when they offered foreign subscriptions to journal issues and sent them to the US; and as one can see from the stamp of receipt on this page, American universities were receiving copies within 30 days of the issue date, which is soon enough to retain the US renewal requirement.  Using similar evidence, one can establish US renewal requirements for many other journals originating in other countries.

Minding the gap

This still leaves a potential gap between the end of the public domain period and the present.  That gap is only going to grow wider over time, as copyright extensions continue to freeze the growth of the public domain in the US.

But the gap is not yet insurmountable, particularly for journals that are public domain into the 1960s.  If a paper published in 1964 included an author who was a graduate student or a young researcher, that author may well be still alive (and maybe even be still working) today, 46 years later.  It’s not too late to try to track authors down (or their immediate heirs), and encourage and help them to liberate their old work.

Moreover, even if those authors signed away all their rights to journal publishers long ago, or don’t remember if they still have any rights over their own work, they (or their heirs) may have an opportunity to reclaim their rights.  For some journal contributions between 1964 and 1977, copyright may have reverted to authors (or their heirs) at the time of copyright renewal, 28 years after initial publication.  In other cases, authors or heirs can reclaim rights assigned to others, using a termination of transfer.  Once authors regain their rights over their articles, they are free to do whatever they like with them, including making them freely available.

The rules for reversion of author’s rights are rather arcane, and I won’t attempt to explain them all here.  Terminations of transfer, though, involve various time windows when authors have the chance to give notice of termination, and reclaim their rights.  Some of the relevant windows are open right now.   In particular, if I’ve done the math correctly, 2010 marks the first year one can give notice to terminate the transfer of a paper copyrighted in 1964, the earliest year in which most journal papers are still under US copyright.  (The actual termination of a 1964 copyright’s transfer won’t take effect for another 10 years, though.)  There’s another window open now for copyright transfers from 1978 to 1985; some of those terminations can take effect as early as 2013.  In the future, additional years will become available for author recovery of copyrights assigned to someone else.  To find out more about taking back rights you, or researchers you know, may have signed away decades ago, see this tool from Creative Commons.
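As a rough check on the arithmetic above, here is a minimal Python sketch of the two windows. The 56-year and 35-year offsets and the 10-year maximum notice lead are assumptions drawn from a reading of the US termination provisions (17 U.S.C. 304(c) and 203); they are not spelled out in this post. The function and class names are hypothetical, and everything is rounded to whole years, whereas real deadlines run from exact dates and the rules have further conditions.

```python
from typing import NamedTuple


class Window(NamedTuple):
    earliest_notice: int     # first year notice of termination could be given
    earliest_effective: int  # first year a termination could take effect


# Assumption: notice may be given at most 10 years before the effective date.
MAX_NOTICE_LEAD_YEARS = 10


def pre_1978_copyright(copyright_year):
    """Window for a transfer of a copyright secured before 1978.

    Assumes termination can first take effect 56 years after the
    copyright year.
    """
    effective = copyright_year + 56
    return Window(effective - MAX_NOTICE_LEAD_YEARS, effective)


def transfer_from_1978_on(transfer_year):
    """Window for a transfer made in 1978 or later.

    Assumes termination can first take effect 35 years after the
    transfer year.
    """
    effective = transfer_year + 35
    return Window(effective - MAX_NOTICE_LEAD_YEARS, effective)


print(pre_1978_copyright(1964))     # notice from 2010, effective from 2020
print(transfer_from_1978_on(1978))  # effective from 2013
print(transfer_from_1978_on(1985))  # notice from 2010
```

Under these assumptions the sketch reproduces the figures in the text: notice for a 1964 copyright first becomes possible in 2010 and takes effect no sooner than 10 years later, while a 1978 transfer can be terminated as early as 2013.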

Recognizing opportunity

To sum up, we have opportunities now to liberate scholarly research over the full course of scholarly history, if we act quickly and decisively.  New research can be made freely available through open access repositories and journals.  Older research can be made freely available by establishing its public domain status, and making digitizations freely available.  And much of the research in the not-so-distant past, still subject to copyright, can be made freely available by looking back through publication lists, tracking down researchers and rights information, and where appropriate reclaiming rights previously assigned to journals.

Journal publishing plays an important role in the certification, dissemination, and preservation of scholarly information.  The research content of journals, however, is ultimately the product of scholars themselves, for the benefit of scholars and other knowledge seekers everywhere.   However the current dispute is ultimately resolved between Nature Publishing Group and the University of California, we would do well to remember the opportunities we have to liberate journal content for all.

