Everybody's Libraries

January 1, 2014

Public Domain Day 2014: The fight for the public domain is on now

Filed under: copyright,data,open access,sharing — John Mark Ockerbloom @ 2:42 pm

New Year’s Day is upon us again, and with it, the return of Public Domain Day, which I’m happy to see has become a regular celebration in many places over the last few years.  (I’ve observed it here since 2008.)  In Europe, the Open Knowledge Foundation gives us a “class picture” of authors who died in 1943, and whose works are now entering the public domain there and in other “life+70 years” countries.  Meanwhile, countries that still hold to the Berne Convention’s “life+50 years” copyright term, including Canada, Japan, New Zealand, and many others, get the works of authors who died in 1963.  (The Open Knowledge Foundation also has highlights for those countries, where Narnia/Brave-New-World/purloined-plums crossover fanfic is now completely legal.)  And Duke’s Center for the Study of the Public Domain laments that, for the 16th straight year, the US gets no more published works entering the public domain, and highlights the works that would have gone into the public domain here were it not for later copyright extensions.

It all starts to look a bit familiar after a few years, and while we may lament the delays in works entering the public domain, it may seem like there’s not much to do about it right now.  After all, most of the world is getting another year’s worth of public domain again on schedule, and many commentators on the US’s frozen public domain don’t see much changing until we approach 2019, when remaining copyrights on works published in 1923 are scheduled to finally expire.  By then, writers like Timothy Lee speculate, public domain supporters will be ready to fight the passage of another copyright term extension bill in Congress like the one that froze the public domain here back in 1998.

We can’t afford that sense of complacency.  In fact, the fight to further extend copyright is raging now, and the most significant campaigns aren’t happening in Congress or other now-closely-watched legislative chambers.  Instead, they’re happening in the more secretive world of international trade negotiations, where major intellectual property hoarders have better access than the general public, and where treaties can be used to later force extensions of the length and impact of copyright laws at the national level, in the name of “harmonization”.   Here’s what we currently have to deal with:

Remaining Berne holdouts are being pushed to add 20 more years of copyright.  Remember how I said that Canada, Japan, and New Zealand were all enjoying another year of “life+50 years” copyright expirations?  Quite possibly not for long.  All of those countries are also involved in the Trans-Pacific Partnership (TPP) negotiations, which include a strong push for more extensive copyright control.  The exact terms are being kept secret, but a leaked draft of the intellectual property chapter from August 2013 shows agreement by many of the countries’ trade negotiators to mandate “life+70 years” terms across the partnership.  That would mean a loss of 20 years of public domain for many TPP countries, and ultimately increased pressure on other countries to match the longer terms of major trade partners.  Public pressure from citizens of those countries can prevent this from happening– indeed, a leak from December hints that some countries that had favored extensions back in August are reconsidering.  So now is an excellent time to do as Gutenberg Canada suggests and let legislators and trade representatives know that you value the public domain and oppose further extensions of copyright.

Life+70 years countries still get further copyright extensions.   The push to extend copyrights further doesn’t end when a country abandons the “life+50 years” standard.  Indeed, just this past year the European Union saw another 20 years added on to the terms of sound recordings (which previously had a 50-year term of their own in addition to the underlying life+70 years copyrights on the material being recorded.)  This extension is actually less than the 95 years that US lobbyists had pushed for, and are still pushing for in the Trans-Pacific Partnership, to match terms in the US.

(Why does the US have a 95-year term in the first place that it wants Europe to harmonize with?  Because of the 20-year copyright extension that was enacted in 1998 in the name of harmonizing with Europe.  As with climbers going from handhold to handhold and foothold to foothold higher in a cliff, you can always find a way to “harmonize” copyright ever upward if you’re determined to do so.)

The next major plateau for international copyright terms, life+100 years, is now in sight.  The leaked TPP draft from August also includes a proposal from Mexico to add yet another 30 years onto copyright terms, to life+100 years, which that country adopted not many years ago.  It doesn’t have much chance of passage in the TPP negotiations, where to my knowledge only Mexico has favored the measure.  But it makes “life+70” seem reasonable in comparison, and sets a precedent for future, smaller-scale trade deals that could eventually establish longer terms.  It’s worth remembering, for instance, that Europe’s “life+70” terms started out in only a couple of countries, spread to the rest of Europe in European Union trade deals, and then to the US and much of the rest of the world.  Likewise, Mexico’s “life+100” proposal might be more influential in smaller-scale Latin American trade deals, and once established there, spread to the US and other countries.  With 5 years to go before US copyrights are scheduled to expire again in significant numbers, there’s time for copyright maximalists to get momentum going for more international “harmonization”.

What’s in the public domain now isn’t guaranteed to stay there.  That’s been the case for a while in Europe, where the public domain is only now getting back to where it was 20 years ago.  (The European Union’s 1990s extension directive rolled back the public domain in many European countries, so in places like the United Kingdom, where the new terms went into effect in 1996, the public domain is only now getting to where it was in 1994.)  But now in the US as well, where “what enters the public domain stays in the public domain” has been a long-standing custom, the Supreme Court has ruled that Congress can in fact remove works from the public domain in certain circumstances.  The circumstances at issue in the case they ruled on?  An international trade agreement– which as we’ve seen above is now the prevailing way of getting copyrights extended in the first place.  Even an agreement that just establishes life+70 years as a universal requirement, but doesn’t include the usual grandfathered exception for older works, could put the public domain status of works going back as far as the 1870s into question, as we’ve seen with HathiTrust international copyright determinations.

But we can help turn the tide.  It’s also possible to cooperate internationally to improve access to creative works, and not just lock them up further.  We saw that start to happen this past year, for instance, with the signing of the Marrakesh Treaty on copyright exceptions and limitations, intended to ensure that those with disabilities that make it hard to read books normally can access the wealth of literature and learning available to the rest of the world.  The treaty still needs to be ratified before it can go into effect, so we need to make sure ratification goes through in our various countries.  It’s a hopeful first step toward international cooperation increasing access instead of raising barriers to it.

Another improvement now being discussed is to require rightsholders to register ongoing interest in a work if they want to keep it under copyright past a certain point.  That idea, which reintroduces the concept of “formalities”, has been floated by some prominent figures like US Copyright Register Maria Pallante.  Such formalities would alleviate the problem of “orphan works”: works that are no longer being exploited by their owners but are not available for free use.  (And a sensible, uniform formalities system could be simpler and more straightforward than the old country-by-country formalities that Berne got rid of, or than the formalities people already accept for property like motor vehicles and real estate.)  Pallante’s initial proposal represents a fairly small step; for compatibility with the Berne Convention, formalities would not be required until the last 20 years of a full copyright term.  But with enough public support, it could help move copyright away from a “one size fits all” approach to one that more sensibly balances the interests of various kinds of creators and readers.

We can also make our own work more freely available.  For the last several years, I’ve been applying my own personal “formalities” program, in which I release into the public domain works I’ve created that I don’t need to further limit.  So in keeping with the original 14-year renewable terms of US copyright law, I now declare that all work that I published in 1999, and that I have sole control of rights over, is hereby dedicated to the public domain via a CC0 grant.  (They join other works from the 1990s that I’ve also dedicated to the public domain in previous years.)  For 1999, this mostly consists of material I put online, including all versions of Catholic Resources on the Net, one of the first websites of its kind, which I edited from 1993 to 1999.  It also includes another year’s history of The Online Books Page.

Not that you have to wait 14 years to free your work.  Earlier this year, I released much of the catalog data from the Online Books Page into the public domain.  The metadata in that site’s “curated collection” continues to be released as open data under a CC0 grant as soon as it is published, so other library catalogs, aggregators, and other sites can freely reuse, analyze, and republish it as they see fit.

We can do more with work that’s under copyright, or that seems to be.  Sometimes we let worries about copyright keep us from taking full advantage of what copyright law actually allows us to do with works.  In the past couple of years, we saw court rulings supporting the rights of Google and HathiTrust to use digitized, but not publicly readable, copies of in-copyright books for indexing, search, and preservation purposes.  (Both cases are currently being appealed by the Authors Guild.)  HathiTrust has also researched hundreds of thousands of book copyrights, and as of a month ago they’d enabled access to nearly 200,000 volumes that were classified as in-copyright under simple precautionary guidelines, but determined to be actually in the public domain after closer examination.

In the coming year, I’d like to see if we can do similar work to open up access to historical journals and other serials as well.  For instance, Duke’s survey of the lost public domain mentions that articles from the 1957 issues of major science journals like Nature, Science, and JAMA are behind paywalls, but as far as I’ve been able to tell, none of those three journals renewed copyrights for their 1957 issues.  Scientists are also increasingly making current work openly available through open access journals, open access repositories, and even discipline-wide initiatives like SCOAP3, which also debuts today.

There are also some potentially useful copyright exemptions for libraries in Section 108 of US copyright law that we could use to provide more access to brittle materials, materials nearing the end of their copyright term, and materials used by print-impaired users.

Supporters of the public domain who sit around and wait for the next copyright extension to be introduced in their legislatures are like generals expecting victory by fighting the last war.  There’s a lot that public domain supporters can do, and need to do, now.  That includes countering the ongoing extension of copyright through international trade agreements, promoting initiatives to restore a proper balance of interest between rightsholders and readers, improving access to copyrighted work where allowed, making work available that’s new to the public domain (or that we haven’t yet figured out is out of copyright), and looking for opportunities to share our own work more widely with the world.

So enjoy the New Year and the Public Domain Day holiday.  And then let’s get to work.

September 27, 2011

Libraries: Be careful what your web sites “Like”

Filed under: crimes and misdemeanors,data,libraries,people,privacy — John Mark Ockerbloom @ 6:15 pm

Imagine you’re working in a library, and someone with a suit and a buzz cut comes up to you, gestures towards a patron who’s leaving the building, and says “That guy you were just helping out; can you tell me what books he was looking at?”

Many librarians would react to this request with alarm.  The code of ethics adopted by the American Library Association states “We protect each library user’s right to privacy and confidentiality with respect to information sought or received and resources consulted, borrowed, acquired or transmitted.”  Librarians will typically refuse to give such information without a carefully verified search warrant, and many are also campaigning against the particularly intrusive search demands authorized by the PATRIOT Act.

Yet it’s possible that the library in this scenario is routinely giving out that kind of information, without the knowledge or consent of librarians or patrons, via its web site.  These days, many sites, including those of libraries, invoke a variety of third-party services to construct their web pages.  For instance, some library sites use Google services to analyze site usage trends or to display book covers.  Those third party services often know what web page has been visited when they’re invoked, either through an identifier in the HTML or Javascript code used to invoke the service, or simply through the Referer information passed from the user’s web browser.

Patron privacy is particularly at risk when the third party also knows the identity of users visiting sensitive pages (like pages disclosing books they’re interested in).  The social networking sites that many library patrons use, for instance, can often track where their users go on the Web, even after they’ve left the social sites themselves.

For instance, if you go to the website of the Farmington Public Library (a library I used a lot when growing up in Connecticut), and search through their catalog, you may see Facebook “Like” buttons on the results.  On this page, for example, you may see that four people (possibly more by the time you read this) have told Facebook they Liked the book Indistinguishable from Magic.  Now, you can probably easily guess that if you click the Like button, and have a Facebook account, then Facebook will know that you liked the book too.  No big surprise there.

But what you can’t easily tell is that Facebook is informed that you’ve looked at this book page, even if you don’t click on anything.  If you’re a Facebook user and haven’t logged out– and for a while recently, even if you have logged out– Facebook knows your identity.  And if Facebook knows who you are and what you’re looking at, it has the power to pass along this information.  It might do it through a “frictionless sharing” app you decided to try.  Or it might quietly provide it to organizations that it can sell your data to, as permitted in its frequently changing data use policies.  (Which for a while even included tracking non-members.)

For some users, it might not be a big deal if it’s generally known what books they’re looking at online.  But for others it definitely is a big deal, at least some of the time.  The problem with third-party inclusions like the Facebook “Like” button in catalogs is that library patrons may be denied the opportunity to give informed consent to sharing their browsing with others.  Libraries committed to protecting their patrons’ privacy as part of their freedom to read need to carefully consider what third party services they invite to “tag along” when patrons browse their sites.

This isn’t just a Facebook issue.  Similar issues come up with other third-party services that also track individuals, as for instance Google does.  Libraries also have good reasons to partner with third party sites for various purposes.  For some of these purposes, like ebook provision, privacy concerns are fairly well understood and carefully considered by most libraries.  But librarians might not keep as close track of the development of their own web sites, where privacy leaks can spring up unnoticed.

So if any of your web sites (especially your online catalogs or other discovery and delivery services) use third party web services, consider carefully where and how they’re being invoked.  For each third party, you should ask what information they can get from users browsing your web site, what other information they have from other sources (like the “real names” and exact birthdates that sites like Facebook and Google+ demand), and what real guarantees, if any, they make about the privacy of the information.  If you can’t easily get satisfactory answers to these questions, then reconsider your use of these services.
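If you want a starting point for that review, even a short script can surface the most obvious cases.  Below is a minimal sketch in Python (standard library only) that fetches a page and lists the outside hosts its HTML asks browsers to contact through script, img, and iframe tags.  The catalog URL is a placeholder for a page on your own site, and a real audit would also need to catch resources that Javascript adds after the page loads.

    from html.parser import HTMLParser
    from urllib.parse import urlparse
    from urllib.request import urlopen

    class ThirdPartyFinder(HTMLParser):
        def __init__(self, base_host):
            super().__init__()
            self.base_host = base_host
            self.outside_hosts = set()

        def handle_starttag(self, tag, attrs):
            # script, img, and iframe tags with a src attribute make the
            # browser issue a request (Referer header and all) on page load
            if tag in ("script", "img", "iframe"):
                for name, value in attrs:
                    if name == "src" and value:
                        host = urlparse(value).netloc
                        if host and host != self.base_host:
                            self.outside_hosts.add(host)

    page_url = "http://catalog.example.edu/record/12345"  # placeholder URL
    html = urlopen(page_url).read().decode("utf-8", "replace")
    finder = ThirdPartyFinder(urlparse(page_url).netloc)
    finder.feed(html)
    for host in sorted(finder.outside_hosts):
        print(host)

Each host the script prints is a party that learns, at a minimum, which of your pages a visitor loaded; that list is where the questions above should start.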

July 31, 2010

Keeping subjects up to date with open data

Filed under: data,discovery,online books,open access,sharing,subjects — John Mark Ockerbloom @ 11:51 pm

In an earlier post, I discussed how I was using the open data from the Library of Congress’ Authorities and Vocabularies service to enhance subject browsing on The Online Books Page.  More recently, I’ve used the same data to make my subjects more consistent and up to date.  In this post, I’ll describe why I need to do this, and why doing it isn’t as hard as I feared that it might be.

The Library of Congress Subject Headings (LCSH) is a standard set of subject names, descriptions, and relationships, begun in 1898, and periodically updated ever since.  The names of its subjects have shifted over time, particularly in recent years.  For instance, subject terms mentioning “Cookery”, a word more common in the 1800s than now, were recently changed to use the word “Cooking”, a term that today’s library patrons are much more likely to use.

It’s good for local library catalogs that use LCSH to keep in sync with the most up to date version, not only to better match modern usage, but also to keep catalog records consistent with each other.  Especially as libraries share their online books and associated catalog records, it’s particularly important that books on the same subject use the same, up-to-date terms.  No one wants to have to search under lots of different headings, especially obsolete ones, when they’re looking for books on a particular topic.

Libraries with large, long-standing catalogs often have a hard time staying current, however.  The catalog of the university library where I work, for instance, still has some books on airplanes filed under “Aeroplanes”, a term that recalls the long-gone days when open-cockpit daredevils dominated the air.  With new items arriving every day to be cataloged, though, keeping millions of legacy records up to date can be seen as more trouble than it’s worth.

But your catalog doesn’t have to be big or old to fall out of sync.  It happens faster than you might think.   The Online Books Page currently has just over 40,000 records in its catalog, about 1% of the size of my university’s.   I only started adding LC subject headings in 2006.  I tried to make sure I was adding valid subject headings, and made changes when I heard about major term renamings (such as “Cookery” to “Cooking”).  Still, I was startled to find out that only 4 years after I’d started, hundreds of subject headings I’d assigned were already out of date, or otherwise replaced by other standardized headings.  Fortunately, I was able to find this out, and bring the records up to date, in a matter of hours, thanks to automated analysis of the open data from the Library of Congress.  Furthermore, as I updated my records manually, I became confident I could automate most of the updates, making the job faster still.

Here’s how I did it.  After downloading a fresh set of LC subject headings records in RDF, I ran a script over the data that compiled an index of authorized headings (the proper ones to use), alternate headings (the obsolete or otherwise discouraged headings), and lists of which authorized headings were used for which alternate headings. The RDF file currently contains about 390,000 authorized subject headings, and about 330,000 alternate headings.
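For the curious, that indexing step can be quite short.  Here’s a rough sketch of how it might look in Python with the rdflib library, assuming the SKOS form of the LC data, where skos:prefLabel marks authorized headings and skos:altLabel marks discouraged ones; the filename is a placeholder for whatever dump you’ve downloaded.

    from rdflib import Graph, Namespace

    SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")

    graph = Graph()
    graph.parse("lcsh-subjects.nt", format="nt")  # placeholder filename

    authorized = set()   # headings that are proper to use
    alternates = {}      # discouraged heading -> authorized replacement(s)

    for concept, _, label in graph.triples((None, SKOS.prefLabel, None)):
        authorized.add(str(label))

    for concept, _, label in graph.triples((None, SKOS.altLabel, None)):
        # map each alternate label to the authorized label(s) of its concept
        for pref in graph.objects(concept, SKOS.prefLabel):
            alternates.setdefault(str(label), set()).add(str(pref))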

Then I extracted all the subjects from my catalog.  (I currently have about 38,000 unique subjects.)  Then I had a script check each subject to see if it was listed as an authorized heading in the RDF file.  If not, I checked to see if it was an alternate heading.  If neither was the case, and the subject had subdivisions (e.g. “Airplanes — History”), I removed a subdivision from the end and repeated the checks until a term was found in either the authorized or alternate category, or I ran out of subdivisions.
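In code, that loop might look something like the sketch below, which assumes subdivisions are separated by “ -- ” in the catalog data; adjust the separator to match your own records.

    def classify(subject, authorized, alternates):
        # Check the full heading first, then strip subdivisions off the
        # end one at a time, e.g. "Airplanes -- History" -> "Airplanes".
        # The " -- " separator is an assumption; use your catalog's own.
        parts = subject.split(" -- ")
        while parts:
            heading = " -- ".join(parts)
            if heading in authorized:
                return ("authorized", heading)
            if heading in alternates:
                return ("alternate", heading)
            parts.pop()  # drop the trailing subdivision and retry
        return ("unknown", None)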

This turned up 286 unique subjects that needed replacement– over 3/4 of 1% of my headings, in less than 4 years.  (My script originally identified even more, until I realized I had to ignore simple geographic and personal names.  Those aren’t yet in LC’s RDF file, but a few of them show up as alternate headings for other subjects.)  These 286 headings (some of them the same except for subdivisions) represented 225 distinct substitutions.  The bad headings were used in hundreds of bibliographic records, the most popular full heading being used 27 times.  The vast majority of the full headings, though, were used in only one record.

What was I to replace these headings with?  Some of the headings had multiple possibilities. “Royalty” was an alternate heading for 5 different authorized headings: “Royal houses”, “Kings and rulers”, “Queens”, “Princes” and “Princesses”.   But that was the exception rather than the rule.  All but 10 of my bad headings were alternates for only one authorized heading.  After “Royalty”, the remaining 9 alternate headings presented a choice between two authorized forms.

When there’s only one authorized heading to go to, it’s pretty simple to have a script do the substitution automatically.  As I verified while doing the substitutions manually, nearly all the time the automatable substitution made sense.  (There were a few that didn’t: for instance, when “Mind and body — Early works to 1850” is replaced by “Mind and body — Early works to 1800”, works first published between 1800 and 1850 get misfiled.  But few substitutions were problematic like this– and those involving dates, like this one, can be flagged by a clever script.)

If I were doing the update over again, I’d feel more comfortable letting a script automatically reassign, and not just identify, most of my obsolete headings.  I’d still want to manually inspect changes that affect more than one or two records, to make sure I wasn’t messing up lots of records in the same way; and I’d also want to manually handle cases where more than one term could be substituted.  The rest– the vast majority of the edits– could be done fully automatically.  The occasional erroneous reassignment of a single record would be more than made up for by the repair of many more obsolete and erroneous old records.  (And if my script logs changes properly, I can roll back problematic ones later on if need be.)
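Put together, those rules amount to only a few lines.  Here’s a sketch of a substitution pass along the lines I describe: substitute automatically only when exactly one authorized heading is available, set aside date-bearing or multi-choice headings for manual review, and log every change so it can be rolled back later.  (The function name and the four-digit-year test are my own illustrative choices, not anything from LC’s data.)

    import re

    def substitute(heading, alternates, log):
        targets = alternates.get(heading, set())
        if len(targets) != 1:
            return None  # several candidates (like "Royalty"): review by hand
        replacement = next(iter(targets))
        if re.search(r"\d{4}", heading) or re.search(r"\d{4}", replacement):
            return None  # date spans like "Early works to 1850" need review
        log.append((heading, replacement))  # logged so changes can be rolled back
        return replacement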

Mind you, now that I’ve brought my headings up to date once, I expect that further updates will be quicker anyway.  The Library of Congress releases new LCSH RDF files about every 1-2 months.  There should be many fewer changes in most such incremental updates than there would be when doing years’ worth of updates all at once.

Looking at the evolution of the Library of Congress catalog over time, I suspect that they do a lot of this sort of automatic updating already.  But many other libraries don’t, or don’t do it thoroughly or systematically.  With frequent downloads of updated LCSH data, and good automated procedures, I suspect that many more could.  I have plans to analyze some significantly larger, older, and more diverse collections of records to find out whether my suspicions are justified, and hope to report on my results in a future post.  For now, I’d like to thank the Library of Congress once again for publishing the open data that makes these sorts of catalog investigations and improvements feasible.

June 20, 2010

How we talk about the president: A quick exploration in Google Books

Filed under: data,online books,sharing — John Mark Ockerbloom @ 10:28 pm

On The Online Books Page, I’ve been indexing a collection of memorial sermons on President Abraham Lincoln, all published shortly after his assassination, and digitized by Emory University.  Looking through them, I was struck by how often Lincoln was referred to as “our Chief Magistrate”.  That’s a term you don’t hear much nowadays, but was once much more common. Lincoln himself used the term in his first inaugural address, and he was far from the first person to do so.

Nowadays you’re more likely to hear the president referred to in different terms, with somewhat different connotations, such as “chief executive” or “commander-in-chief”.  The Constitution uses that last term in reference to the president’s command over the Armed Forces. Lately, though, I’ve heard “commander in chief” used as if it referred to the country in general.  As someone wary of the expansion of executive power in recent years, I find that usage unsettling.

I wondered, as I went through the Emory collection, whether the terms we use for the president reflect shifts in the role he has played over American history.  Is he called “commander in chief” more in times of war or military buildup, for instance?  How often was he instead called “chief magistrate” or “chief executive” over the course of American history?  And how long did “chief magistrate” stay in common use, and what replaced it?

Not too long ago, those questions would have simply remained idle curiosity.  Perhaps, if I’d had the time and patience, I could have painstakingly compiled a small selection of representative writings from various points in US history, read through them, and tried to draw conclusions from them.  But now I– and anyone else on the web– also have a big searchable, dated corpus of text to query: the Google Books collection.  Could that give me any insight into my questions?

It looks like it can, and without too much expenditure of time.  I’m by no means an expert on corpus analysis, but in a couple of hours of work, I was able to assemble promising-looking data that turned up some unexpected (but plausible) results.  Below, I’ll describe what I did and what I found out.

I started out by going to the advanced book search for Google.  From there, I specified particular decades for publications over the last 200 years: 1810-1819, 1820-1829, and so on up to 2000-2009.  For each decade, I recorded how many hits Google reported for the phrases “chief magistrate”, “chief executive”, and “commander in chief”, in volumes that also contained the word “president”.  Because the scope of Google’s collection may vary in different decades, I also recorded the total number of volumes in each decade containing the word “president”.  I then divided the number of phrase+”president” hits by the number of “president” hits, and graphed the proportional occurrences of each phrase in each decade.
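The arithmetic at the end is simple enough for a spreadsheet, but here’s a sketch of it in Python, with made-up placeholder counts standing in for the numbers I actually read off the search results pages.

    # decade -> (hits for phrase AND "president", hits for "president" alone);
    # these counts are placeholders, not the numbers actually recorded
    counts = {
        1810: (1200, 95000),
        1820: (2100, 110000),
        1830: (3400, 125000),
    }

    for decade, (phrase_hits, president_hits) in sorted(counts.items()):
        ratio = phrase_hits / president_hits
        print(f"{decade}s: {ratio:.4f}")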

The graph below shows the results.  The blue line tracks “chief magistrate”, the orange line tracks “chief executive”, and the green line tracks “commander in chief”.  The numbers on the horizontal axis count decades from 1800: 1 is the 1810s, 2 is the 1820s, and so on up to 20 for the 2000s.

[Graph: Relative frequencies of “chief magistrate”, “chief executive”, and “commander in chief” used along with “president”, by decade, 1810s-2000s]

You can see a larger view of this graph, and the other graphs in this post, by clicking on it.

The graph suggests that “chief magistrate” was popular in the 19th century, peaking in the 1830s.  “Chief executive” arose from obscurity in the late 19th century,  overtook “chief magistrate” in the early 20th century, and then became widely used, apparently peaking in the 1980s.  (Though by then some corporate executives– think “chief executive officer”– are in the result set along with the US president.)

We don’t see a strong trend with “commander in chief”.  There are some peaks in usage in the 1830s, and the 1960s and 1970s, but they’re not dominant, and they don’t obviously correspond to any particular set of events.  What’s going on?  Was I just imagining a relation between its usage and military buildups?  Is the Google data skewed somehow?  Or is something else going on?

It’s true that the Google corpus is imperfect, as I and others have noted before.  The metadata isn’t always accurate; the number of reported hits is approximate when more than 100 or so, and the mix of volumes in Google’s corpus varies in different time periods.  (For instance, recent years of the corpus may include more magazine content than earlier years; and reprints can make texts reappear decades after they were actually written.  The rise of print-on-demand scans of old public-domain books in the 2000s may be partly responsible for the uptick in “chief magistrate” that decade, for instance.)

But I might also not be looking at the right data.  There are lots of reasons to mention “commander-in-chief” at various times.  The apparent trend that concerned me, though, was the use of “commander in chief” as an all-encompassing term.  Searching for the phrase “our commander in chief” with “president” might be better at identifying that. That search doesn’t distinguish military from civilian uses of that phrase, but an uptick in usage would indicate either a greater military presence in the published record, or a more militarized view among civilians.  So either way, it should reflect a more militaristic view of the president’s role.

Indeed, when I graph the relative occurrences of “our commander in chief” over time, the trend line looks rather different than before.  Here it is below, with the decades labeled the same way as in the first graph:

[Graph: Scaled frequency of “our commander in chief” used along with “president”, by decade, 1810s-2000s]

Here we see increases in decades that saw major wars, including the War of 1812, the Mexican War of the 1840s, the Civil War of the 1860s, and the expansion of the Vietnam War in the 1970s.  This past decade had the second most-frequent usage (by a small margin) of “our commander in chief” in the last 200 years of this corpus.  But it’s dwarfed by the use during the 1940s, when Americans fought in World War II.  That’s not something I’d expected, but given the total mobilization that occurred between 1941 and 1945, it makes sense.

If we look more closely at the frequency of “our commander in chief” in the last 20 years, we also find interesting results. The graph below looks at 1991 through 2009 (add 1990 to each number on the horizontal axis; and as always, click on the image for a closer look):

[Graph: Scaled frequency of “our commander in chief” used along with “president”, by year, 1991-2009]

Not too surprisingly, after the successful Gulf War in early 1991, usage starts to decrease.  And not long after 9/11, usage increases notably, and stays high in the years to follow.  (Books take some time to go from manuscript to publication, but we see a local high by 2002, and higher usage in most of the subsequent years.)  I was a bit surprised, though, to find an initial spike in usage in 1999.  As seen in this timeline, Bill Clinton’s impeachment and trial took place in late 1998 and early 1999, and a number of the hits during this time period are in the context of questioning Clinton’s fitness to be “our commander in chief” in the light of the Lewinsky scandal.  But once public interest moved on to the 2000 elections, in which Clinton was not a candidate, usage dropped off again until the 9/11 attacks and the wars that followed.

I don’t want to oversell the importance of these searches.  Google Books search is a crude instrument for literary analysis, and I’m still a novice at corpus analysis (and at generating Excel graphs).  But my searches suggest that the corpus can be a useful tool for identifying and tracking large-scale trends in certain kinds of expression.  It’s not a substitute for the close reading that most humanities scholarship requires.  And even with the “distant reading” of searches, you still need to look at a sampling of your results to make sure you understand what you’re finding, and aren’t copying down numbers blindly.

But with those caveats, the Google Books corpus supports an enlightening high-altitude perspective on literature and culture.  The corpus is valuable not just for its size and searchability, but also for its public accessibility.  When I report on an experiment like this, anyone else who wants to can double-check my results, or try some followup searches of their own.  (Exact numbers will naturally shift somewhat over time as more volumes get added to the corpus.)  To the extent that searching, snippets, and text are open to all, Google Books can be everybody’s literary research laboratory.
