Everybody's Libraries

October 7, 2011

My mother’s orphan

Filed under: copyright,findingada,online books,open access,people,preservation,sharing,teaching — John Mark Ockerbloom @ 5:06 pm

Before my mother was pregnant with me, she was working on a book.

The book had begun its gestation at least a year before. She had been teaching math in Massachusetts, and was involved with the Madison Project, one of the initiatives that arose from the “new math” movement of the 1960s.  What excited her, and what I caught from her not long after I was born, was the sense of discovery and play that was encouraged in the Madison teaching style.  The primary focus wasn’t so much on imparting and drilling facts and rules, or on mundane applications, but on finding patterns, solving puzzles, and figuring out the secrets of numbers and geometry and the other mathematical constructs that underlie our world. Some project participants planned a series of books that would help bring out this sense of discovery and exploration in math classes.

Two small children in the house may have delayed my mother’s ambitions, but we didn’t stop her.  When I was in kindergarten, the piles of papers in my parents’ bedroom went away, and my mother proudly showed me her new book.  The book, Discoveries in Essential Mathematics, was co-written with Ramon Steinen, and published by Charles E. Merrill. Though the textbook was written for middle schoolers, I remember reading through the book after my mother showed it to me, solving the simpler problems, and smiling when I saw my name or my sister’s in an example.

She got small royalty checks for a few years, but the book was out of print by the late 1970s, never reaching a second edition.  We kept some copies in our basement, but I didn’t know of any library that held it.  When I visited the Library of Congress as a middle schooler, wrongly convinced that they had every book ever published, I remember my disappointment when I couldn’t find Mom’s book in their card catalog.

My mother eventually retired from teaching, and the enthusiasm and talent I’d gotten from Mom for math shifted into computing, and then into digital libraries.  And when my kids reached school age, I decided to try putting her book online.  In an era of large classes, detailed state standards, and high-stakes standardized tests, it might not be a viable standard textbook any more, but I think it’s still great for curious kids who show an interest in math.

Mom thought that was a great idea.  But she didn’t know if she could grant permission on her own.  Although long out of print, the book’s copyright had automatically renewed in 2000 under US copyright law, and she wasn’t sure if she had to get the consent of her publisher or co-author before she could give me the go-ahead. She didn’t know how to reach her co-author, and her old imprint was long gone.  Even its acquirer had itself been acquired by a large conglomerate some time ago.  So I let the idea drop, thinking I’d come back to it later when I had a little time to research the copyright.

But not long after, she started a long slide into dementia, and was soon in no position to give permission to anyone.  If her book had been practically an “orphan work” before, due to uncertainty over rights, it was even more so now.  There was no trouble locating the author, but there was no way of getting valid permission from someone definitely known to hold the rights.

Mom died this past winter, four years after my Dad had reluctantly moved her into the nursing home for good, and four weeks after he’d made his usual daily visit, gone back home, and had a fatal heart attack.  After we paid the last of the bills, and threw out the contents of the basement (where a burst pipe ruined all the books, papers, and other things they kept down there), what remained of what they had would now go to me and my siblings.

I still had a copy at home of the teacher’s edition of Mom’s book that she had once given to Grandma.  And between my mother’s funeral and the burst pipe, I’d taken a student edition out of their basement for my kids to read.  But any faint hope of finding publishing contracts or rights assignment documents was obliterated after the pipe burst.  The basic questions were: had Mom signed her rights to the book away, as many academic authors do? If so, had she gotten them back at some point?  Or had she never had the rights in the first place, as sometimes happens with textbook authors under “work for hire” contracts?

The copyright page of the book, and the record in the 1972 Catalog of Copyright Entries, show the publisher as the copyright claimant, so I couldn’t assume she had the rights.   But I also doubted whether I could get a clear answer, or reasonable licensing terms, from the company that had eventually acquired the assets of Mom’s original publisher.

I eventually found what I needed to know on a trip to Washington, DC.  While attending a meeting on digital format registries, I realized that I was in the same building as the Copyright Office.  So after the meeting, I got a reader’s card, went upstairs, and consulted the librarians there.  We confirmed that, under the automatic renewal laws of the time, the copyright to Mom’s book would have reverted in 2000 to whoever had been declared the “author” of the book in the original registration record.  Moreover, in the absence of any contrary arrangement, any co-owner of a copyright can authorize publication, as long as they split any proceeds with the other copyright owners.

Since I was planning just to put the book online for free, the only question remaining was: who was listed as the author on the original registration: the publisher who claimed the copyright, or my mother and Dr. Steinen?  It’s not clear from the Catalog of Copyright Entries, but the original registration certificate would state it.  And the one copy known to exist of that certificate was in the archives of the Copyright Office where I was sitting.

Twenty minutes later, I had the certificate in front of me.  The name on the “claimant” line was indeed the publisher’s, but the names on the “author” line were Steinen and Ockerbloom.  My mother’s orphan was mine to claim.

There are a lot more books out there like hers.  Since I added records for Hathi Trust’s public domain books to The Online Books Page, I’ve gotten requests to curate hundreds of out of print, largely forgotten books that are still meaningful to readers online.  Many of the people who opt to leave contact information live in places where books tend to be hard to get or pay for. Many others, judging from their names, seem to be related to the authors of the books they suggest. These readers have found the books after Hathi, or Google, or the Internet Archive, has resurfaced them online, and the readers want these books to live on.  If there were an easy, inexpensive, uncontroversially legal way to also bring back books that are still in copyright, but no longer commercially exploited, I’m sure I could fulfill a lot of requests for those books too.

For now, though, I’ll bring back the one orphan book I’ve been given. And I thank my mother for writing it, and the other women and men who have poured so much of their energy and teaching into their books, and the librarians of all kinds who help ensure those books stay accessible to readers who value them.  I’ll try my best to keep your legacies alive.

September 23, 2011

Early journals from JSTOR and others

Filed under: copyright,open access,serials,sharing — John Mark Ockerbloom @ 11:26 am

Earlier this month, JSTOR announced that it would provide free open access to its earliest scholarly journal content, published before 1923.  All of this material should be old enough to be in the public domain.  (Or at least it is in the US.  Since copyrights can last longer elsewhere, JSTOR is only showing pre-1870 volumes openly outside the US.)  I was very pleased to hear they would be opening up this content; it’s something I’d asked them to consider ever since they ended a small trial of open, public domain volumes in their early years.

Lots of early journal content now openly readable online

The time was ripe to open access at JSTOR.  (And not just because of growing discontent over limited access to public domain and publicly funded research.) Thanks to mass-digitization initiatives and other projects, much of the early journal content found in JSTOR is now also available from other sources.  For instance, after Gregory Maxwell posted a torrent of pre-1923 JSTOR volumes of the Philosophical Transactions of the Royal Society of London, I surveyed various free digital text sites and found nearly all the same volumes, and more, available for free from Hathi Trust, Google, the Internet Archive, Gallica, PubMed Central, and the Royal Society itself.  The content needed to be organized to be usefully browsable across sites, but that required a bit of basic librarianship and a bit of time.

Philosophical Transactions is not an anomaly.  After collating volumes of this journal, I looked at the first ten journals that signed on to JSTOR back in the mid-1990s.  (The list can be found below.)  I again found that nearly all of the pre-1923 content of these journals was also available from various free online sites.  Now, when you look them up on The Online Books Page, you’ll find links to both the JSTOR copies and the copies at other sites.

Comparing the sites that provide this content is enlightening.  In general, the JSTOR copies are better presented,  with article-level tables of contents, cross-volume searching, article downloads, and consistently high scan quality.  But the copies at other sites are generally usable as well, and sometimes include interesting non-editorial material, such as advertisements, that might not be present in JSTOR’s archive.  By opening up access to its early content now, though, JSTOR will remain the preferred access point to this early content for most researchers — and that, hopefully, will help attract and sustain paid support for the larger body of scholarly content that JSTOR provides and preserves for its subscribers.

And there’s a lot more in the public domain

JSTOR currently only provides open access for volumes up to 1922 (or up to 1869, if you’re not in the US).   But there’s lots more public domain journal content that can be made available.  Looking again at the initial ten JSTOR journals, I found that all of them have additional public domain content that is currently not available as open access on JSTOR, or as of yet on other sites.  That’s because journals published in the US before 1964 had to renew their copyrights after 28 years or enter the public domain.  But most scholarly journals, including these 10, did not renew the copyrights to all their issues.  Here’s a list of the 10 journals, and their first issue copyright renewals:

  1. The American Historical Review – began 1895; issues first renewed in 1931
  2. Econometrica – began 1933; issues first renewed in 1942
  3. The American Economic Review – began 1911; issues not renewed before 1964 (when renewal became automatic)
  4. Journal of Political Economy – began 1892; issues first renewed in 1953
  5. Journal of Modern History – began 1929; issues first renewed in 1953
  6. The William and Mary Quarterly – began 1892; issues first renewed in 1946
  7. The Quarterly Journal of Economics – began 1886; issues first renewed in 1934
  8. The Mississippi Valley Historical Review (now the Journal of American History) – began 1914; issues first renewed in 1939
  9. Speculum – began 1926; issues first renewed in 1934
  10. Review of Economic Statistics (now the Review of Economics and Statistics) – began 1919; issues first renewed in 1935

This list reflects more proactive renewal policies than were typical for scholarly journals. A few years ago, I did a survey of JSTOR journals (summarized in this presentation) that were publishing between 1923 and 1950, and found that only 49 out of 298, or about 1/6, renewed any of their issue copyrights for that time period.  (JSTOR has since added more journals covering this time period, so the numbers will be different now, but I suspect the renewal rate won’t be any higher now than it was then.)

Currently JSTOR has no plans to open up access to post-1922 journal volumes.  But many of those volumes have been digitized, and are in Google’s or Hathi Trust’s collections; or they could be digitized by contributors to the Internet Archive or similar text archives.

If someone does want to open up these volumes, they should re-check their copyright status.   In particular, I have not yet checked the copyright status of individual articles in these journals, which can in theory be renewed separately.  In practice, I’ve found this rarely done for scholarly articles, but not completely unknown.  It might be feasible for me to do a “first article renewal” inventory for journals, like I’ve done for first issue renewal, which could speed up clearances.
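For anyone doing a first pass over a journal run, the rules of thumb in this post can be sketched in a few lines of Python.  This is only a rough triage under simplifying assumptions: it looks only at the US publication year and at whether a renewal was found for the issue or for any article in it, and it ignores everything else that can affect copyright status.  The function name and the status labels are made up for illustration; they are not part of any existing tool.

```python
def triage_us_journal_issue(pub_year, issue_renewed=False, article_renewed=False):
    """Rough first-pass status for a US journal issue, as of 2011.

    Hypothetical helper: it encodes only the rules of thumb from this post,
    and a real clearance still needs a proper check of the renewal records.
    """
    if pub_year < 1923:
        # Published before 1923: old enough to be in the US public domain.
        return "public domain"
    if pub_year < 1964:
        # Renewal was required after 28 years, for the issue or for articles.
        if issue_renewed:
            return "issue renewed: still in copyright"
        if article_renewed:
            return "article renewed: check each article"
        return "no renewal found: likely public domain"
    # From 1964 on, renewal became automatic.
    return "still in copyright"


# Example: a few made-up issues from different periods.
for year, issue, article in [(1910, False, False), (1930, False, False),
                             (1935, True, False), (1970, False, False)]:
    print(year, triage_us_journal_issue(year, issue, article))
```

A sketch like this is only useful for sorting volumes into "probably fine", "needs a closer look", and "leave alone" piles before doing the real research.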

Opportunities for open librarianship

JSTOR’s recent open access release of early journals, then, is just the beginning of the open access historic journal content that can be available online.  JSTOR provides a valuable service to libraries in providing and preserving comprehensive digital back runs of major scholarly journals, both public domain and copyrighted.  But while our libraries pay for that service, let’s also remember our mission to provide access to knowledge for all whenever possible.  JSTOR’s opening of its pre-1923 journal volumes is a much-appreciated contribution to a high-quality open record of early scholarship.  We can build on that further, with copyright research, digitization, and some basic public librarianship.  (I’ve discussed the basics of journal liberation in previous posts.)

For my part, I plan to start by gradually incorporating the open access JSTOR offerings into the serial listings of the Online Books Page, as time permits.  I can also gather further copyright information on these and other journals as I bring them in.  I’m also happy to hear about more journals that are online or could go online (whether they’re JSTOR journals or not); you can submit them via my suggestion interface.

How about you?  What would you like to see from the early scholarly record, and what can you do to help open it up?

June 15, 2011

A digital public library we still need, and could build now

Filed under: citizen librarians,copyright,libraries,people,sharing — John Mark Ockerbloom @ 12:39 pm

It’s been more than half a year since the Digital Public Library of America project was formally launched, and I’m still trying to figure out what the project organizers really want it to be.  The idea of “a digital library in service of the American public” is a good one, and many existing digital libraries already play that role in a variety of ways.  As I said when I christened this blog, I’m all for creating a multitude of libraries to serve a diversity of audiences and information needs.

At a certain point after an enthusiastic band of performers says “Let’s put on a show!”, though, someone has to decide what their show’s going to be about, and start focusing effort there.  So far, the DPLA seems to be taking an opportunistic approach.  Instead of promulgating a particular blueprint for what they’ll do, they’re asking the community for suggestions, in a “beta sprint” that ends today.  Whether this results in a clear, distinctive direction for the project, or a mishmash of ideas from other digitization, aggregation, preservation, and public service initiatives, remains to be seen.

Just about every digital project I’ve seen is opportunistic to some extent.   In particular, most of the big ones are opportunistic when it comes to collection development.  We go after the books, documents, and other knowledge resources that are close to hand in our physical collections, or that we find people putting on the open web, or that our users suggest, or volunteer to provide on their own.

There are a number of good reasons for this sort of opportunism.  It lets us reuse work that we don’t have to redo ourselves.  It can inform us of audience interests and needs (at least as far as the interests of the producers we find align with the interests of the consumers we serve).  And it’s cheap, and that’s nothing to sneer at when budgets are tight.

But the public libraries that my family prefers to use don’t, on the whole, have opportunistically built collections.  Rather, they have collections shaped primarily by the needs of their patrons, and not primarily by the types of materials they can easily acquire.   The “opportunistic” community and school library collections I’ve seen tend to be the underfunded ones, where books in which we have yet to land on the Moon, the Soviet Union is still around, or Alaska is not yet a state may be more visible than books that reflect current knowledge or world events.  The better libraries may still have older titles in their research stacks, but they lead with books that have current relevance to their community, and they go out of their way to acquire reliable, readable resources for whatever information needs their users have.  In other words, their collections and services are driven by  demand, not supply.

In the digital realm, we have yet to see a library that freely provides such a digital collection at large scale for American public library users.   Which is not to say we don’t have large digital book collections– the one I maintain, for instance, has over a million freely readable titles, and Google Books and lots of other smaller digital projects have millions more.  But they function more as research or special-purpose collections than as collections for general public reference, education, or enjoyment.

The big reason for this, of course, is copyright.  In the US, anyone can freely digitize books and other resources published before 1923, but providing anything published after that requires copyright research and, usually, licensing, that tends to be both complex and expensive.  So the tendency of a lot of digital library projects is to focus on the older, obviously free material, and have little current material.  But a generally useful digital public library needs to be different.

And it can be, with the right motivation, strategy, and support.  The key insight is that while a strong digital public library needs to have high-quality, current knowledge resources, it doesn’t need to have all such resources, or even the most popular or commercially successful ones.  It just needs to acquire and maintain a few high-quality resources for each of the significant needs and aptitudes of its audience. Mind you, that’s still a lot of ground to cover, especially when you consider all the ages, education levels, languages, physical and mental abilities, vocational needs, interests, and demographic backgrounds that even a midsized town’s public library serves.  But it’s still a substantially smaller problem, and involves a smaller cost, than the enticing but elusive idea of providing instant free online access to everything for everyone.

There are various ways public digital libraries could acquire suitable materials proactively.  The America.gov books collection provides one interesting example.  The US State Department wanted to create a library of easy-to-read books on civics and American culture and history for an international audience.  Some of these books were created in-house by government staff.  Others were commissioned from outside authors.  Still others were adapted from previously published works, for which the State Department acquired rights.

A public digital library could similarly create, commission, solicit, or acquire rights to books that meet unfilled information needs of its patrons.  Ideally it would aim to acquire rights not just to distribute a work as-is, but also to adapt and remix it into new works, as many Creative Commons licenses allow.  This can potentially greatly increase the impact of any given work.  For instance, a compellingly written, beautifully illustrated book on dinosaurs might be originally written for 9-12 year old English speakers, and be noticeably obsolete due to new discoveries after 5 or 10 years.  But if a library’s community has reuse and adaptation rights, library members can translate, adapt, and update the book, so it becomes useful to a larger audience over a longer period of time.

This sort of collection building can potentially be expensive; indeed, it’s sobering that America.gov has now ceased being updated, due to budget cuts.  But there’s a lot that can be produced relatively inexpensively.  Khan Academy, for example, contains thousands of short, simple educational videos, exercises, and assessments created largely by one person, with the eventual goal of systematically covering the entire standard K-12 curriculum.  While I think a good educational library will require the involvement of many more people, the Khan example shows how much one person can get accomplished with a small budget, and projects like Wikipedia show that there’s plenty of cognitive surplus to go around, that a public library effort might usefully tap into.

Moreover, the markets for rights to previously authored content can potentially be made much more efficient than they are now.  Most books, for instance, go out of print relatively quickly, with little or no commercial exploitation thereafter.  And as others have noted, just trying to get permission to use  a work digitally, even apart from any royalties, can be very expensive and time-consuming.  But new initiatives like Gluejar aim to make it easier to match up people who would be happy to share their book rights with people who want to reuse them. Authors can collect a small fee (which could easily be higher than the residual royalties on an out-of-print book); readers get to share and adapt books that are useful to them.   And that can potentially be much cheaper than acquiring the rights to a new work, or creating one from scratch.

As I’ve described above, then, a digital public library could proactively build an accessible collection of high-quality, up to date online books and other knowledge resources, by finding, soliciting, acquiring, creating, and adapting works in response to the information needs of its users.  It would build up its collection proactively and systematically, while still being opportunistic enough to spot and pursue fruitful new collection possibilities.  Such a digital library could be a very useful supplement to local public libraries, would be open any time anywhere online, and could provide more resources and accessibility options than a local public library could provide on its own.  It would require a lot of people working together to make it work, including bibliographers, public service liaisons, authors, technical developers, and volunteers, both inside and outside existing libraries.  And it would require ongoing support, like other public libraries do, though a library that successfully serves a wide audience could also potentially tap into a wide base of funds and in-kind contributions.

Whether or not the DPLA plans to do it, I think a large-scale digital free public library with a proactively-built, high-quality, broad-audience general collection is something that a civilized society can and should build.  I’d be interested in hearing if others feel the same, or have suggestions, critiques, or alternatives to offer.

November 11, 2010

You do the math

Filed under: open access,publishing,serials,sharing — John Mark Ockerbloom @ 6:02 pm

I recently heard from Peter Murray-Rust that the Central European Journal of Mathematics (CEJM) is looking for graduate students to edit the language of papers they publish.  CEJM is co-published by Versita and Springer Science+Business Media.

Would-be editors are promised their name on the masthead, and references and recommendations from the folks who run the journal.  These perks are tempting to a student (or postdoc) hoping for stable employment, but you can get such benefits working with just about any scholarly journal.  There’s no mention of actual pay for any of this editing work.  (Nor is there any pay for the associate editors they also seek, though those editors are also promised access to the journal’s content.)

The reader’s side of things looks rather different, when it comes to paying. If we look at Springer’s price lists for 2011, for instance, we see that the list price for a 1-year institutional subscription to CEJM is $1401 US for “print and free access or e-only”, or $1681 US for “enhanced access”.  An additional $42 is assessed for postage and handling, presumably waived if you only get the electronic version, but charged otherwise.

This is a high subscription rate even by the standards of commercial math journals.  At universities like mine, scholars don’t pay for the journal directly, but the money the library uses for the subscription is money that can’t be used to buy monographs, or to buy non-Springer journals, or to improve library service to our mathematics scholars.  Mind you, many universities get this journal as part of a larger package deal with Springer.  This typically lowers the price for each journal, but the package often includes a number of lower-interest journals that wouldn’t otherwise be bought.  Large amounts of money are tied up in these “big deals” with large for-profit publishers such as Springer.

If you can’t, or won’t, lay out the money for a subscription or larger package, you can pay for articles one at a time.  When I tried to look at a recent CEJM article from home, for instance, I was asked to pay $34 before I could read it.  Another option is author-paid open access.  CEJM authors who want to make their papers available through the journal without a paywall can do so through Springer’s Open Choice program.  This will cost the author $3000 US.

So there’s plenty of money involved in this journal.  It’s just that none of it goes to the editors they’re seeking.  Or to the authors of the papers, who submit them for free (or with a $3000 payment).  Or to the peer reviewers of the papers, if this journal works like most other scholarly journals and uses volunteer scholars as referees.  A scholar might justifiably wonder where all this money is going, or what value they get in return for it.

As the editor job ads imply, much of what scholars get out of editing and publishing in journals like these is recognition and prestige.  That, indeed, has value, but the cost-value function can be optimized much better than in this case.  CEJM’s website mentions that it’s tracked by major citation services, and has a 0.361 impact factor (a number often used, despite some notable problems, to give a general sense of a journal’s prestige).  Looking through the mathematics section of the Directory of Open Access Journals, I find a number of scholarly journals that are also tracked by citation services, but don’t charge anything to readers, and as far as I can tell don’t charge anything to authors either.   Here are some of them:

Central Europe, besides being the home of CEJM, is also the home of several open access math journals such as Documenta Mathematica (Germany), the Balkan Journal of Geometry and its Applications (Romania), and the Electronic Journal of Qualitative Theory of Differential Equations (Hungary).  For what it’s worth, all of these journals, and all the other open access journals mentioned in this post, currently show higher impact factors in Journal Citation Reports than CEJM does.

Free math journals aren’t limited to Central Europe.  Here in the US, the American Mathematical Society makes the Bulletin of the American Mathematical Society free to read online, through the generosity of its members.  And on the campus where I work, Penn’s math department sponsors the Electronic Journal of Combinatorics.

A number of other universities also sponsor open-access journals, promoting their programs, and the findings of scholars worldwide, with low overhead.  For instance, there are two relatively high-impact math journals from Japanese universities: the Kyushu Journal of Mathematics and the Osaka Journal of Mathematics.  The latter journal’s online presence is provided by Project Euclid, a US-based initiative to support low-cost, non-profit mathematics publishing.

Ad-hoc groups of scholars can also organize their own open access journals in their favored specialty.  For instance, Homology, Homotopy and Applications was founded and is entirely run by working mathematicians.  Some journals, such as the open access Discrete Mathematics and Theoretical Computer Science, use Open Journal Systems, a free open source publishing software package, to produce high-quality journal websites with little expenditure.

The Proceedings of the Indian Academy of Sciences: Mathematical Sciences is an interesting case.  Like many scholarly societies, the Indian Academy has recently made a deal with a for-profit publisher (Springer, as it turns out) to distribute their journals in print and electronic form.  Unlike many such societies, though, the Academy committed to continuing a free online version of this journal on their own website.

This is a fortunate decision for readers, because libraries that acquire the commercially published version will have to pay Springer $280 per year for basic access and $336 for “enhanced access”, according to their 2011 price list.  True, libraries get a print copy with this more expensive access (if they’re willing to pay Springer another $35 in postage and handling charges).  But the Academy sends out print editions within India for a total subscription price (postage included) of 320 rupees per year.   At today’s exchange rates, that’s less than $8 US.
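To put those figures side by side, here is a small back-of-the-envelope calculation in Python.  It uses only the prices quoted above.  The $8 figure is the upper bound from the exchange-rate estimate, not an exact conversion, so the ratios it prints are lower bounds.  The variable names are just for illustration.

```python
# Prices quoted in this post (US dollars per year).
springer_basic = 280
springer_enhanced = 336
springer_postage = 35
academy_print_upper_bound = 8  # 320 rupees is "less than $8"

# The Academy's price is below $8, so each ratio here is a lower bound.
basic_ratio = springer_basic / academy_print_upper_bound
print_ratio = (springer_enhanced + springer_postage) / academy_print_upper_bound

print(f"Basic access costs more than {basic_ratio:.0f} times the Academy's print price")
print(f"Enhanced access plus postage costs more than {print_ratio:.0f} times as much")
```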

Virtually all journals, whether in mathematics or other scholarly fields, depend heavily on unpaid academic labor for the authorship, refereeing, and in some cases editing of their content.  But, as you can see with CEJM and the no-fee open access journals mentioned above, journals vary widely in the amount of money they also extract from the academic community.  In between these two poles, there are also lots of other high-impact math journals with lower subscription prices, as well as commercial open access math journals with much lower author fees than Springer’s Open Choice.  These journals further diversify the channels of communication among mathematicians, without draining as much of  their funds.

I certainly hope mathematicians and other scholars will continue to volunteer their time and talents to the publication process, both for their benefit and for ours.  But if we optimize where and how we give our time and talent (and our institutional support), both scholars and the public will be better off.  As I’ve shown above, with a little bit of information and attention, there’s no shortage of low-cost, high-quality publishing venues that scholars can use as alternatives to overpriced journals.

October 15, 2010

Journal liberation: A community enterprise

Filed under: copyright,discovery,open access,publishing,serials,sharing — John Mark Ockerbloom @ 2:53 pm

The fourth annual Open Access Week begins on Monday.  If you follow the official OAW website, you’ll be seeing a lot of information about the benefits of free access to scholarly research.  The amount of open-access material grows every day, but much of the research published in scholarly journals through the years is still practically inaccessible to many, due to prohibitive cost or lack of an online copy.

That situation can change, though, sometimes more dramatically than one might expect.  A post I made back in June, “Journal liberation: A Primer”, discussed the various ways in which people can open access to journal content, past and present,  one article or scanned volume at a time.  But things can go much faster if you have a large group of interested liberators working towards a common goal.

Consider the New England Journal of Medicine (NEJM), for example.  It’s one of the most prominent journals in the world, valued both for its reports on groundbreaking new research, and for its documentation, in its back issues, of nearly 200 years of American medical history.  Many other journals with lesser value still cannot be read without paying for a subscription, or visiting a research library that has paid for a subscription.  But you can find and read most of NEJM’s content freely online, both past and present. Several groups of people made this possible.  Here are some of them.

The journal’s publisher has for a number of years provided open access to all research articles more than 6 months old, from 1993 onward.  (Articles less than 6 months old are also freely available to readers in certain developing countries, and in some cases for readers elsewhere as well.)  A registration requirement was dropped in 2007.

Funders of medical research, such as the National Institutes of Health, the Wellcome Trust, and the Howard Hughes Medical Institute, have encouraged publishers in the medical field to maintain or adopt such open access policies, by requiring their grantees (who publish many of the articles in journals like the NEJM) to make their articles openly accessible within months of publication.  Some of these funders also maintain their own repositories of scholarly articles that have appeared in NEJM and similar journals.

Google Books has digitized most of the back run of the NEJM and its predecessor publications as part of its Google Books database.  Many of these volumes are freely accessible to the public.  This is not the only digital archive of this material; there’s also one on NEJM’s own website, but access there requires either a subscription or a $15 payment per article.   Google’s scans, unlike the ones on the NEJM website, include the advertisements that appeared along with the articles.  These ads document important aspects of medical history that are not as easily seen in the articles, on subjects ranging from the evolving requirements and curricula of 19th-century medical schools to the early 20th-century marketing of heroin for patients as young as 3 years old.

It’s one thing to scan journal volumes, though; it’s another to make them easy to find and use– which is why NEJM’s for-pay archive got a fair bit of publicity when it was released this summer, while Google’s scans went largely unnoticed.  As I’ve noted before, it can be extremely difficult to find all of the volumes of a multi-volume work in Google Books; and it’s even more difficult in the case of NEJM, since issues prior to 1928 were published under different journal titles.  Fortunately, many of the libraries that supplied volumes for Google’s scanners have also organized links to the scanned volumes, making it easier to track down specific volumes.  The Harvard Libraries, for instance, have a chronologically ordered list of links to most of the volumes of the journal from 1828 to 1922, a period when it was known as the Boston Medical and Surgical Journal.

For many digitized journals, open access stops after 1922, because of uncertainty about copyright.  However, most scholarly journals have public domain content after that date, so it’s possible to go further if you research journal copyrights.  Thanks to records provided by the US Copyright Office and volunteers for The Online Books Page, we can determine that issues and articles of the NEJM prior to the 1950s did not have their copyrights renewed.  With this knowledge, Hathi Trust has been able and willing to open access to many volumes from the 1930s and 1940s.

We at The Online Books Page can then pull together these volumes and articles from various sources, and create a cover page that allows people to easily get to free versions of this journal and its predecessors all the way back to 1812.

Most of the content of the New England Journal of Medicine has thus been liberated by the combined efforts of several different organizations (and other interested people).  There’s still more that can be done, both in liberating more of the content, and in making the free content easier to find and use.  But I hope this shows how widespread journal liberation efforts of various sorts can free lots of scholarly research.  And I hope we’ll hear about many more free scholarly articles and journals being made available, or more accessible and usable, during Open Access Week and beyond.

I’ve also had another liberation project in the works for a while, related to books, but I’ll wait until Open Access Week itself to announce it.  Watch this blog for more open access-related news, after the weekend.

September 17, 2010

August 31, 2010

As living arrows sent forth

Filed under: discovery,sharing — John Mark Ockerbloom @ 9:52 pm

It’s that time of year when offspring start to leave home and strike out on their own.  Young children may be starting kindergarten.  Older ones may be heading off to university.  And in between, children slowly gain a little more independence every year.  If parents are fortunate, and do our job well, we set our children going in good directions, but they then make paths for themselves.

Standards are a little like children that way.  You can invest lots of time, thought, and discussion into specifying how some set of interactions, expressions, or representations should work.  But, if you do well, what you specified will take on a life apart from you and its other parents, and make its own way in the world.  So it’s rather gratifying for me to see a couple of specifications that I’d helped parent move out into the world that way.

I’ve mentioned them both previously on this blog.  One was a fairly traditional committee effort: the DLF ILS-Discovery Interface recommendation.  After the original DLF group finished its work, a new group of folks affiliated with OCLC and the Code4lib community formed to implement the types of interfaces we’d recommended.  The new group has recently announced they’ll be supporting and contributing code to the Extensible Catalog NCIP toolkit.  This is an important step towards realizing the goal of standardized patron interaction with integrated library systems.  I’m looking forward to seeing how the project progresses, and hope I’ll hear more about it at the upcoming Digital Library Federation forum.

The other specification I’ve worked on that’s recently taken on a life of its own is the Free Decimal Correspondence (FDC).  This was a purely personal project of mine to develop a simple, freely reusable classification that was reasonably compatible with the Dewey Decimal System and the Library of Congress Subject Headings.  I created it for Public Domain Day last year, and did a few updates on it afterwards, but have largely left it on the shelf for the last while.  Now, however, it’s being used as one of the bases of the “Melvil Decimal System”, part of the Common Knowledge metadata maintained at LibraryThing.

It’s nice to see both of these efforts start to make their mark in the larger world.  I’ve seen the ILS-DI implementation work develop in good hands for a while, and I’m content at this point to watch its progress from a distance.  The Free Decimal Correspondence adoption was a bit more of a surprise, though one that was quite welcome.  (I put FDC in the public domain in part to encourage that sort of unexpected reuse.)  When the Melvil project’s use of FDC was announced, I quickly put out an update of the specification, so that recent additions and corrections I’d made could be easily reused by Melvil.

I’m still trying to figure out what further updating, if any, I should do for FDC.  Melvil already goes into more detail than FDC in many cases, and as a group project, it will most likely further outstrip FDC in size as time passes.  On the other hand, keeping in sync specifically with LC Subject Headings terminology is not necessarily a goal of Melvil’s, as it has been for FDC.  Though I’m not sure at this point if that specific feature of FDC is important to any existing or planned project out there.  And as I stated in my FDC FAQ, I don’t intend to spend a whole lot of time maintaining or supporting FDC over the long term.

But since it is getting noticeable outside use, I’ll probably spend at least some time working up to a 1.0 release.  This might simply involve making a few corrections and then declaring it done.  Or it could involve incorporating some of the information from Melvil back into FDC, to the extent that I can do so while keeping FDC in the public domain.  Or it could involve some further independent development.  To help me decide, I’d be interested in hearing from anyone who’s interested in using or developing FDC further.

Projects are never really finished until you let them go.  I’m glad to see these particular ones take flight, and hope that we in the online library community will release lots of other creations in the years to come.

July 31, 2010

Keeping subjects up to date with open data

Filed under: data,discovery,online books,open access,sharing,subjects — John Mark Ockerbloom @ 11:51 pm

In an earlier post, I discussed how I was using the open data from the Library of Congress’ Authorities and Vocabularies service to enhance subject browsing on The Online Books Page.  More recently, I’ve used the same data to make my subjects more consistent and up to date.  In this post, I’ll describe why I need to do this, and why doing it isn’t as hard as I feared that it might be.

The Library of Congress Subject Headings (LCSH) is a standard set of subject names, descriptions, and relationships, begun in 1898, and periodically updated ever since. The names of its subjects have shifted over time, particularly in recent years.  For instance, recently subject terms mentioning “Cookery”, a word more common in the 1800s than now, were changed to use the word “Cooking”, a term that today’s library patrons are much more likely to use.

It’s good for local library catalogs that use LCSH to keep in sync with the most up to date version, not only to better match modern usage, but also to keep catalog records consistent with each other.  Especially as libraries share their online books and associated catalog records, it’s particularly important that books on the same subject use the same, up-to-date terms.  No one wants to have to search under lots of different headings, especially obsolete ones, when they’re looking for books on a particular topic.

Libraries with large, long-standing catalogs often have a hard time staying current, however.  The catalog of the university library where I work, for instance, still has some books on airplanes filed under “Aeroplanes”, a term that recalls the long-gone days when open-cockpit daredevils dominated the air.  With new items arriving every day to be cataloged, though, keeping millions of legacy records up to date can be seen as more trouble than it’s worth.

But your catalog doesn’t have to be big or old to fall out of sync.  It happens faster than you might think.   The Online Books Page currently has just over 40,000 records in its catalog, about 1% of the size of my university’s.   I only started adding LC subject headings in 2006.  I tried to make sure I was adding valid subject headings, and made changes when I heard about major term renamings (such as “Cookery” to “Cooking”).  Still, I was startled to find out that only 4 years after I’d started, hundreds of subject headings I’d assigned were already out of date, or otherwise replaced by other standardized headings.  Fortunately, I was able to find this out, and bring the records up to date, in a matter of hours, thanks to automated analysis of the open data from the Library of Congress.  Furthermore, as I updated my records manually, I became confident I could automate most of the updates, making the job faster still.

Here’s how I did it.  After downloading a fresh set of LC subject headings records in RDF, I ran a script over the data that compiled an index of authorized headings (the proper ones to use), alternate headings (the obsolete or otherwise discouraged headings), and lists of which authorized headings were used for which alternate headings. The RDF file currently contains about 390,000 authorized subject headings, and about 330,000 alternate headings.
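For readers who want to try something similar, here is a minimal Python sketch of that kind of index.  It is not the script described above.  It assumes the RDF has already been parsed into simple (authorized heading, list of alternate headings) pairs; the `build_heading_index` function, the `records` format, and the sample data are all hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict

def build_heading_index(records):
    """Build lookup structures from already-parsed heading records.

    `records` is an assumed format: an iterable of
    (authorized_heading, [alternate_heading, ...]) tuples.
    """
    authorized = set()
    alternates = defaultdict(set)  # alternate heading -> authorized headings it maps to
    for preferred, alts in records:
        authorized.add(preferred)
        for alt in alts:
            alternates[alt].add(preferred)
    return authorized, dict(alternates)

# Illustrative sample only; the real file has hundreds of thousands of entries.
sample = [
    ("Cooking", ["Cookery"]),
    ("Airplanes", ["Aeroplanes"]),
    ("Queens", ["Royalty"]),
    ("Princes", ["Royalty"]),
]
authorized, alternates = build_heading_index(sample)
```

Mapping each alternate heading to a set, rather than to a single value, matters because one alternate heading can point to several authorized headings.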

Then I extracted all the subjects from my catalog.  (I currently have about 38,000 unique subjects.)  Then I had a script check each subject to see if it was listed as an authorized heading in the RDF file.  If not, I checked to see if it was an alternate heading.  If neither was the case, and the subject had subdivisions (e.g. “Airplanes — History”), I removed a subdivision from the end and repeated the checks until a term was found in either the authorized or alternate category, or I ran out of subdivisions.
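The checking loop can be sketched roughly as follows.  This is an illustration under stated assumptions, not the actual script: the `classify_subject` function is hypothetical, subdivisions are assumed to be joined with " -- ", and the two small lookup structures stand in for the full index of authorized and alternate headings.

```python
SEPARATOR = " -- "  # assumed subdivision delimiter; adjust to match your catalog

def classify_subject(subject, authorized, alternates):
    """Classify one catalog subject against the heading index.

    Returns (status, matched_term, replacements), where status is
    "authorized", "alternate", or "unknown".
    """
    parts = subject.split(SEPARATOR)
    while parts:
        term = SEPARATOR.join(parts)
        if term in authorized:
            return "authorized", term, set()
        if term in alternates:
            return "alternate", term, alternates[term]
        parts.pop()  # drop the last subdivision and check the shorter term
    return "unknown", None, set()

# Tiny stand-in index for illustration.
authorized = {"Airplanes", "Cooking"}
alternates = {"Aeroplanes": {"Airplanes"}, "Cookery": {"Cooking"}}

current = classify_subject("Airplanes -- History", authorized, alternates)
obsolete = classify_subject("Aeroplanes -- History", authorized, alternates)
missing = classify_subject("Zeppelins", authorized, alternates)
```

Subjects that come back as "unknown" are not necessarily wrong.  Simple geographic or personal names, for example, may just be absent from the data being checked against.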

This turned up 286 unique subjects that needed replacement– over 3/4 of 1% of my headings, in less than 4 years.  (My script originally identified even more, until I realized I had to ignore the simple geographic or personal names.  Those aren’t yet in LC’s RDF file, but a few of them show up as alternate headings for other subjects.)  These 286 headings (some of them the same except for subdivisions) represented 225 distinct substitutions.  The bad headings were used in hundreds of bibliographic records, the most popular full heading being used 27 times. The vast majority of the full headings, though, were used in only one record.

What was I to replace these headings with?  Some of the headings had multiple possibilities. “Royalty” was an alternate heading for 5 different authorized headings: “Royal houses”, “Kings and rulers”, “Queens”, “Princes” and “Princesses”.   But that was the exception rather than the rule.  All but 10 of my bad headings were alternates for only one authorized heading.  After “Royalty”, the remaining 9 alternate headings presented a choice between two authorized forms.

When there’s only 1 authorized heading to go to, it’s pretty simple to have a script do the substitution automatically.  As I verified while doing the substitutions manually, nearly all the time the automatable substitution made sense.  (There were a few that didn’t: for instance, when “Mind and body — Early works to 1850” is replaced by “Mind and body — Early works to 1800”, works first published between 1800 and 1850 get misfiled.  But few substitutions were problematic like this– and those involving dates, like this one, can be flagged by a clever script.)

If I were doing the update over again, I’d feel more comfortable letting a script automatically reassign, and not just identify, most of my obsolete headings.  I’d still want to manually inspect changes that affect more than one or two records, to make sure I wasn’t messing up lots of records in the same way; and I’d also want to manually handle cases where more than one term could be substituted.  The rest– the vast majority of the edits– could be done fully automatically.  The occasional erroneous reassignment of a single record would be more than made up for by the repair of many more obsolete and erroneous old records.  (And if my script logs changes properly, I can roll back problematic ones later on if need be.)
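As a rough sketch of those rules, a script might sort each obsolete heading into “do automatically” or “needs review” along these lines.  The `plan_substitution` function, the " -- " delimiter, the digit-matching date test, and the two-record threshold are all assumptions made for illustration; a real script would want a more careful date check.

```python
import re

DATE_PATTERN = re.compile(r"\d{3,4}")  # crude test for years in a heading

def plan_substitution(heading, matched, replacements, record_count):
    """Decide whether one obsolete heading can be rewritten automatically.

    `matched` is the obsolete leading portion of `heading`; `replacements`
    is the set of authorized headings it maps to.
    Returns (action, new_heading), where action is "auto" or "review".
    """
    if len(replacements) != 1:
        return "review", None  # several candidates: a person has to choose
    (target,) = replacements
    new_heading = target + heading[len(matched):]  # keep trailing subdivisions
    if DATE_PATTERN.search(heading) or DATE_PATTERN.search(new_heading):
        return "review", new_heading  # date ranges can shift and misfile works
    if record_count > 2:
        return "review", new_heading  # widely used headings get a manual look
    return "auto", new_heading

simple = plan_substitution("Aeroplanes -- History", "Aeroplanes", {"Airplanes"}, 1)
ambiguous = plan_substitution("Royalty", "Royalty", {"Queens", "Princes"}, 1)
dated = plan_substitution(
    "Mind and body -- Early works to 1850",
    "Mind and body -- Early works to 1850",
    {"Mind and body -- Early works to 1800"},
    1,
)
```

Each “auto” result could then be applied and written to a change log as (record, old heading, new heading), which is what would make a later rollback possible.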

Mind you, now that I’ve brought my headings up to date once, I expect that further updates will be quicker anyway.  The Library of Congress releases new LCSH RDF files about every 1-2 months.  There should be many fewer changes in most such incremental updates than there would be when doing years’ worth of updates all at once.

Looking at the evolution of the Library of Congress catalog over time, I suspect that they do a lot of this sort of automatic updating already.  But many other libraries don’t, or don’t do it thoroughly or systematically.  With frequent downloads of updated LCSH data, and good automated procedures, I suspect that many more could.  I have plans to analyze some significantly larger, older, and more diverse collections of records to find out whether my suspicions are justified, and hope to report on my results in a future post.  For now, I’d like to thank the Library of Congress once again for publishing the open data that makes these sorts of catalog investigations and improvements feasible.

June 20, 2010

How we talk about the president: A quick exploration in Google Books

Filed under: data,online books,sharing — John Mark Ockerbloom @ 10:28 pm

On The Online Books Page, I’ve been indexing a collection of memorial sermons on President Abraham Lincoln, all published shortly after his assassination, and digitized by Emory University.  Looking through them, I was struck by how often Lincoln was referred to as “our Chief Magistrate”.  That’s a term you don’t hear much nowadays, but was once much more common. Lincoln himself used the term in his first inaugural address, and he was far from the first person to do so.

Nowadays you’re more likely to hear the president referred to in different terms, with somewhat different connotations, such as “chief executive” or “commander-in-chief”.  The Constitution uses that last term in reference to the president’s command over the Armed Forces. Lately, though, I’ve heard “commander in chief” used as if it referred to the country in general.  As someone wary of the expansion of executive power in recent years, I find that usage unsettling.

I wondered, as I went through the Emory collection, whether the terms we use for the president reflect shifts in the role he has played over American history.  Is he called “commander in chief” more in times of war or military buildup, for instance?  How often was he instead called “chief magistrate” or “chief executive” over the course of American history?  And how long did “chief magistrate” stay in common use, and what replaced it?

Not too long ago, those questions would have simply remained idle curiosity.  Perhaps, if I’d had the time and patience, I could have painstakingly compiled a small selection of representative writings from various points in US history, read through them, and tried to draw conclusions from them.  But now I– and anyone else on the web– also have a big searchable, dated corpus of text to query: the Google Books collection.  Could that give me any insight into my questions?

It looks like it can, and without too much expenditure of time.  I’m by no means an expert on corpus analysis, but in a couple of hours of work, I was able to assemble promising-looking data that turned up some unexpected (but plausible) results.  Below, I’ll describe what I did and what I found out.

I started out by going to the advanced book search for Google.  From there, I specified particular decades for publications over the last 200 years: 1810-1819, 1820-1829, and so on up to 2000-2009.  For each decade, I recorded how many hits Google reported for the phrases “chief magistrate”, “chief executive”, and “commander in chief”, in volumes that also contained the word “president”.  Because the scope of Google’s collection may vary in different decades, I also recorded the total number of volumes in each decade containing the word “president”.  I then divided the number of phrase+“president” hits by the number of “president” hits, and graphed the proportional occurrences of each phrase in each decade.
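The arithmetic is simple enough to do in a spreadsheet, but for anyone who would rather script it, a minimal Python sketch might look like this.  The function names are hypothetical, and the counts are made-up placeholders that only show the calculation; they are not the numbers Google reported.

```python
def relative_frequency(phrase_hits, president_hits):
    """Scale per-decade phrase counts by the count of "president" volumes."""
    return {
        decade: phrase_hits[decade] / president_hits[decade]
        for decade in phrase_hits
        if president_hits.get(decade)  # skip decades with no baseline count
    }

def axis_label(decade):
    """Map a decade's starting year to a graph position (1810 -> 1, 2000 -> 20)."""
    return (decade - 1800) // 10

decades = list(range(1810, 2010, 10))  # 1810s through 2000s

# Placeholder counts for two decades, for illustration only.
phrase_hits = {1810: 50, 1820: 120}
president_hits = {1810: 1000, 1820: 2000}
ratios = relative_frequency(phrase_hits, president_hits)
```

Dividing by the “president” count is what keeps a decade with more scanned volumes from looking like a decade with more usage.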

The graph below shows the results.  The blue line tracks “chief magistrate”, the orange line tracks “chief executive”, and the green line tracks “commander in chief”.  The numbers in the horizontal axis refer to the decade+1800s; e.g. 1 is the 1810s, 2 is 1820s, all the way up to 20 being the 2000s.

Relative frequencies of "chief magistrate", "chief executive", and "commander in chief" used along with "president", by decade, 1810s-2000s

You can see a larger view of this graph, and the other graphs in this post, by clicking on it.

The graph suggests that “chief magistrate” was popular in the 19th century, peaking in the 1830s.  “Chief executive” arose from obscurity in the late 19th century,  overtook “chief magistrate” in the early 20th century, and then became widely used, apparently peaking in the 1980s.  (Though by then some corporate executives– think “chief executive officer”– are in the result set along with the US president.)

We don’t see a strong trend with “commander in chief”.  There are some peaks in usage in the 1830s, and the 1960s and 1970s, but they’re not dominant, and they don’t obviously correspond to any particular set of events.  What’s going on?  Was I just imagining a relation between its usage and military buildups?  Is the Google data skewed somehow?  Or is something else going on?

It’s true that the Google corpus is imperfect, as I and others have noted before.  The metadata isn’t always accurate; the number of reported hits is approximate when more than 100 or so, and the mix of volumes in Google’s corpus varies in different time periods.  (For instance, recent years of the corpus may include more magazine content than earlier years; and reprints can make texts reappear decades after they were actually written.  The rise of print-on-demand scans of old public-domain books in the 2000s may be partly responsible for the uptick in “chief magistrate” that decade, for instance.)

But I might also not be looking at the right data.  There are lots of reasons to mention “commander-in-chief” at various times.  The apparent trend that concerned me, though, was the use of “commander in chief” as an all-encompassing term.  Searching for the phrase “our commander in chief” with “president” might be better at identifying that. That search doesn’t distinguish military from civilian uses of that phrase, but an uptick in usage would indicate either a greater military presence in the published record, or a more militarized view among civilians.  So either way, it should reflect a more militaristic view of the president’s role.

Indeed, when I graph the relative occurrences of “our commander in chief” over time, the trend line looks rather different than before.  Here it is below, with the decades labeled the same way as in the first graph:

Scaled frequency of "our commander in chief" used along with "president", by decade, 1810s-2000s

Here we see increases in decades that saw major wars, including the War of 1812, the Mexican War of the 1840s, the Civil War of the 1860s, and the Vietnam War expanding in the 1970s.  This past decade had the second most-frequent usage (by a small margin) of “our commander in chief” in the last 200 years of this corpus.  But it’s dwarfed by the use during the 1940s, when Americans fought in World War II.  That’s not something I’d expected, but given the total mobilization that occurred between 1941 and 1945, it makes sense.

If we look more closely at the frequency of “our commander in chief” in the last 20 years, we also find interesting results. The graph below looks at 1991 through 2009 (add 1990 to each number on the horizontal axis; and as always, click on the image for a closer look):

Scaled frequency of "our commander in chief" used along with "president", by year, 1991-2009

Not too surprisingly, after the successful Gulf War in early 1991, usage starts to decrease.  And not long after 9/11, usage increases notably, and stays high in the years to follow.  (Books take some time to go from manuscript to publication, but we see a local high by 2002, and higher usage in most of the subsequent years.)  I was a bit surprised, though, to find an initial spike in usage in 1999.  As seen in this timeline, Bill Clinton’s impeachment and trial took place in late 1998 and early 1999, and a number of the hits during this time period are in the context of questioning Clinton’s fitness to be “our commander in chief” in the light of the Lewinsky scandal.  But once public interest moved on to the 2000 elections, in which Clinton was not a candidate, usage dropped off again until the 9/11 attacks and the wars that followed.

I don’t want to oversell the importance of these searches.  Google Books search is a crude instrument for literary analysis, and I’m still a novice at corpus analysis (and at generating Excel graphs).  But my searches suggest that the corpus can be a useful tool for identifying and tracking large-scale trends in certain kinds of expression.  It’s not a substitute for the close reading that most humanities scholarship requires.  And even with the “distant reading” of searches, you still need to look at a sampling of your results to make sure you understand what you’re finding, and aren’t copying down numbers blindly.

But with those caveats, the Google Books corpus supports an enlightening high-altitude perspective on literature and culture.  The corpus is valuable not just for its size and searchability, but also for its public accessibility.  When I report on an experiment like this, anyone else who wants to can double-check my results, or try some followup searches of their own.  (Exact numbers will naturally shift somewhat over time as more volumes get added to the corpus.)  To the extent that searching, snippets, and text are open to all, Google Books can be everybody’s literary research laboratory.
