Improving the millions

Michigan’s announcement earlier this month that over one million volumes from its collection had been digitized was widely hailed online. That million includes both copyrighted and public domain content. According to Michigan’s John Wilkin, whom I talked with shortly after the announcement, about 15-20% of what’s digitized is certified as public domain and made freely accessible.

Those familiar with the OAI-PMH protocol can download catalog records for the public-domain-certified books; as of today, there are over 116,000 such records. (According to John, there are about 3 catalog items for every 4 volumes, due to things like multi-volume works, so one million volumes translates to about 750,000 works. About 15% of those are in the public domain OAI feed, which squares with the record count above.)
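For those who haven’t worked with OAI-PMH before, here’s a minimal harvesting sketch in Python, using only the standard library. The endpoint URL below is a placeholder rather than Michigan’s actual feed address, and the set name is hypothetical; the verbs and parameters (ListRecords, metadataPrefix, set, resumptionToken) are the standard ones defined by the protocol, and oai_dc is the baseline Dublin Core format every OAI-PMH repository must support.

```python
# Minimal OAI-PMH harvester (sketch). Standard library only.
# BASE_URL is a placeholder, not any repository's actual endpoint.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"  # OAI-PMH XML namespace
BASE_URL = "https://repository.example.edu/oai"  # hypothetical endpoint

def harvest(base_url, metadata_prefix="oai_dc", oai_set=None):
    """Yield every <record> element, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if oai_set:
        params["set"] = oai_set  # e.g. a public-domain set, if one is offered
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as response:
            tree = ET.parse(response)
        for record in tree.iter(OAI + "record"):
            yield record
        # The repository returns results in batches; an empty or absent
        # resumptionToken means we've reached the end of the list.
        token = tree.find(".//" + OAI + "resumptionToken")
        if token is None or not (token.text or "").strip():
            break
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

# Print the identifiers of the first few harvested records.
for i, record in enumerate(harvest(BASE_URL)):
    if i >= 5:
        break
    header = record.find(OAI + "header")
    print(header.findtext(OAI + "identifier"))
```

The resumption-token loop is what lets a harvester pull all 116,000-plus records rather than just the first batch a repository hands back.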

Most of Michigan’s million volumes were digitized by Google. Jessamyn West’s post on the milestone included a link to an interesting article in Campus Technology about the workflow of the Google project. Jessamyn pulls out a telling quote from it:

When it comes down to it, then, this brave new world of book search probably needs to be understood as Book Search 1.0. And maybe participants should not get so hung up on quality that they obstruct the flow of an astounding amount of information. Right now, say many, the conveyor belt is running and the goal is to manage quantity…

Quantity is certainly an important aspect of Google Books, but even Google knows it can’t ignore quality. Indeed, it’s worth remembering that Michigan’s wasn’t the first million-book milestone to be announced. Back in November, the Universal Digital Library announced that it had over 1.5 million volumes available digitally. But that project doesn’t seem to have made as big a splash as Google or Michigan, and part of the reason, I think, is quality. Though it pains me to say it (I’ve worked with them in the past, and know and like many of the folks involved), their digitized editions are so often unusable or unreliable that I don’t regularly list them. Google’s not perfect either: they have their share of cut-off, missing, and illegible pages. But in my experience, their books are often good enough to be usable (though some others disagree). Michigan also reportedly has its own review process to weed out or improve the worst scans. (Google also has a bad-page-reporting feature, though I don’t have a sense yet of how quickly they fix reported errors.)

The mass digitization I’ve seen with the highest consistent quality is in the American Libraries and Canadian Libraries collections of the Internet Archive text repository, which between them now provide nearly a quarter million volumes online. Since all of these are publicly accessible, that’s more books freely readable by the public than Michigan’s scanned million, of which only the 15-20% certified as public domain is open to all. And they tend to be of considerably higher quality than Google’s offerings. If you’ve been following the new books listings of The Online Books Page, you may have noticed that their editions are often the ones we’ve been picking to fill open reader requests. These books were produced through the Open Content Alliance, often with help from Microsoft or Yahoo.

There’s still room for improvement. The Internet Archive doesn’t seem to handle heavy loads as well as Google does. They don’t have very good ways of handling multi-volume works (whether monographs or serials), though they at least usually make volume numbers more visible than Google does. And none of the mass-digitization projects yet does a good job of consistently providing readable transcribed text along with the page images. (The text is often good enough to search, but not to read.)

I’m hopeful, though, that we’ll continue to see the quality standard rise over time, as the UDL, Google, Michigan, the OCA, and others all digitize free content, and have to compete for the attention of readers seeking the best online books. In the meantime, there’s much that individuals can do to improve on the scans the giants are providing, whether it’s organizing disparate volumes, putting works into context, producing high quality transcriptions, repackaging them into convenient reader formats, or providing tags and reviews to help people find the most suitable books among the millions.

In libraries, the size of your collection is important, but what your readers can do with that collection matters even more. In the early going of mass digitization, quantity makes the big headlines; in the long run, improving quality may well have the greatest impact.
