Promoting access to the best literature of the past

Last week saw widespread observance of Open Access Week 2009.  The week primarily focused on opening access to current research and scholarship (though there’s also been a growing community working on opening access to teaching and learning content).  You can find lots of open access resources at the Open Access Directory.

Current scholarship is not spontaneously generated from the brain or lab of the writer.  To be effective in the present, useful scholarship must understand and interpret past work.  In many fields, and not just the classical humanities, the relevant past work may stretch back hundreds or even thousands of years.  Current scholarship and study will be more effective if its source material is also made openly accessible, and if proper attention is drawn to the most useful sources.  And now is an especially opportune time for scholars of all sorts, professional and amateur, to get involved in the process.

This may seem a strange thing to say at a time when the digitization of old books and other historic materials is increasingly dominated by large-scale projects like Google and the Internet Archive.  With mass digitizers putting millions of public domain book and journal volumes online, and with a near-term possibility of millions more copyrighted volumes going online as well, how much of a role is left for individual scholars and readers?

A very important role, as it turns out.  Mass digitization projects can quickly produce large-scale aggregations of past content, but as many have pointed out, aggregation is not the same as curation, and as aggregations grow larger, being able to find the right items in a growing collection becomes increasingly important.  That’s what curation helps us do, and the large-scale digitizers are not doing a very effective job of it themselves.  Google’s PageRank algorithm may take advantage of implicit curation of web pages (through authors’ choices of which pages to link), but Google and other aggregators have had a much harder time drawing attention to the most useful books, scholarly articles, or other works created without built-in hyperlinks.
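To see how linking works as implicit curation, here’s a minimal sketch of the power-iteration idea behind PageRank, using a toy four-page web of my own invention (real deployments involve vastly more data and engineering):

    # Toy web graph: page -> pages it links to.  Every link is, in effect,
    # a small act of curation: the author's vote that the target is worth
    # a reader's attention.
    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

    damping = 0.85
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}

    for _ in range(50):  # iterate until the ranks settle
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            for target in outlinks:
                new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank

    # Page "c" comes out on top: the most authors chose to link to it.
    print(sorted(rank.items(), key=lambda kv: -kv[1]))

A book or journal article digitized from print casts no such votes, which is part of why ranking those works well is so much harder for the aggregators.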

Sometimes this is because they haven’t digitized them, even as they’ve digitized inferior substitutes.  Over three years after Paul Duguid lamented the republication of a bowdlerized translation of Knut Hamsun’s Pan by Project Gutenberg, that version remains the only freely available one of this book, there, at Google Books, or anywhere else online that I’ve found.  Even though an unexpurgated version of this translation was published before the bowdlerized version, no digitizer that I know of has gotten around to finding and digitizing it; and countless readers may have used the existing online copies without even knowing that they’ve been censored.  Extra bibliographic and copyright research may be necessary to determine whether a better resource is available for digitization, as it is in this case.

Sometimes the content is digitized, but can’t be found easily.  Geoff Nunberg’s post on Google Books’ “metadata train wreck” shows plenty of examples of how difficult it can be to find and properly identify a particular edition in Google Books, much less figure out which edition is the best one to use.  I’ve commented in the past about the challenges of finding multi-volume works in that corpus.  And Peter Jacso has pointed out Google’s problems indexing current scholarship.  If you can’t find the paper or book you need for your research, your work will be no better than it would be if the source had never existed.

This is where scholars can potentially play a useful role.  We don’t individually digitize books by the thousands, but we do individually find, cite, and recommend useful sources, down to the particular edition, as we find them and use them in our own writings and teaching.  These citations and recommendations now often go online, in various locations.  It would be very useful to have these recommendations made more visible, and tied to freely available online copies of the sources cited, whenever legally possible. Sometimes, we also create or digitize our own editions of past works, with useful annotations, for our classes or our own work.  It would be very useful to have these made visible and persistent as well, whenever appropriate.

I hope that large resource aggregations will make it easier for scholars and others to curate the collections to make them more useful to their readers.  In the meantime, we can start with the resources we have.  For example, on The Online Books Page, my catalog entry for Hamsun’s Pan notes its limitations.  My public requests page includes information on a better edition that could be digitized by someone who has access to it and some time to spare.  And my suggestion form is ready to accept links to better editions of this book, or to other online books that merit special attention.  Indeed, most of the books that I now add to my catalog derive from submissions made by various readers on this form, and I invite scholars to suggest for my catalog the freely accessible books and serials that they find most useful.

As the Little Professor notes in a recent post, the sort of bibliographic work I’ve described can be time-consuming but vitally important for making effective use of old sources, and that work has often not been done by anyone for many books outside the usual classical canons.  Yet it’s the sort of thing that scholars do, bit by bit, as part of their everyday work.  The aggregate effect of their curation and digitization, appropriately harnessed in open-access form, could greatly improve our ability to build upon the work of the past.


Remember this

I am eating a sandwich at the end of Pier 14 in San Francisco.  The sun has set behind the downtown skyscrapers, and the colors in the sky are slowly fading to grey.  I’m not the only diner out here.  Pelicans soar close off the pier, about 100 feet above the water, and one by one dive straight down with a loud splash, resurfacing in a moment, ruffling their feathers and jerking their beaks to get down the fish they’ve caught.  Other splashes in the water come from seals surfacing for air.  As an orange-tinted full moon comes up over the East Bay hills and under the span of the Bay Bridge, I see a pair of seals surface side by side, with their mouths meeting as they float at the water’s surface for a few seconds.  I am delighted to see all this, so different from what I usually see at home, and at the same time I wish I could be back there with the people I love instead of alone here.

I don’t have a camera right now, or anything to draw with, so I can only record this scene in words and in memory.   When there was still sun shining low on Yerba Buena Island and the coastline to the east, there were several people out here with tripods and light umbrellas, photographing human couples standing against the pier railings, in each other’s arms.  Judging from the clothing and the poses, I suspect these shots are for wedding or engagement albums.  And I can understand the motivation.   When Mary and I were married, 14 years ago this month, we too had pictures taken of us against a striking background, in our case the bright orange and yellow trees of a Pennsylvania fall.  I see one of those pictures every time I return home. Remember this, the picture says, and it brings back memories of the vows we made to each other that day.  The words we said, and the way we looked when we said them, were not recorded in fixed form, but, God willing, will stay in our hearts as long as we live.

There are more memories recorded out on the pier.  Plaques along the rails quote lines of poetry by Lawrence Ferlinghetti and Thomas Lovell Beddoes about the bay I’m looking out on.  Ceramic tile art depicts boats that have plied its waters, from the early days of European exploration to the present.  A display on the sidewalk in front relates the history of the pier, the ferries that ran (and still run, in smaller numbers) from the terminal nearby, the freeway that was built and then removed again from the water’s edge, and some of the people who played a part in all of these developments.  Remember this, they say, and I bring bits back with me to record in words.

It’s a basic need that we have, as intelligent, reflective, and social creatures, to remember the things we’ve experienced, seen, and learned about.  We make records of these things in various forms, to help us remember, and to prompt others to remember as well.  They help us go beyond and above what’s immediately in front of us, telling us things we need to know, people we can relate to, pasts that were different, futures that can be better.

Technology can make it easier for us to record these things, and sometimes easier to lose them.  We took many pictures of our kids on digital cameras as they grew up, and kept hundreds of them on my laptop, which let me easily recall them and show them to friends and family when I traveled.  Then one day I was robbed of my laptop, without my having backed up my photo collection, and most of those pictures were lost.  I’ve also seen many other personal and family memoirs posted on the Web stay up for a few years, and then vanish with the demise of the web sites they were on.  I kept paper tapes of early BASIC programs I wrote in middle school for years after I last had access to any device that could read them.  They’re gone now; I presume they were thrown out when my parents cleaned house sometime after I left home.

I know better now how to keep what remains.  Apple’s Time Machine makes it easy for me to incrementally back up my laptop every time I come home from work and plug a cheap external drive into my USB port.  The pictures of my kids that survived the laptop theft were mostly the ones that I had shared with others (either by copying them onto prints, or by putting them up on the Web). And the older family pictures that are most meaningful to us are ones where we know what the pictures represent, either because we are in them, or because others have told us, in person or in writing, who is in the pictures and the context in which they were taken.

I am here in San Francisco for iPRES 2009, a conference promoting the preservation of digital content.  There are a lot of smart, dedicated people scheduled to speak, and I hope to learn about new technologies and methods to help us preserve the content we want our libraries and their users to remember.

While some of these techniques may be complex, many of them are essentially elaborations on basic principles I’ve touched on in what I’ve related above: Help people record what’s important to them.  Make it easy for them to preserve these records in their everyday activity.  Encourage them to copy and share what they record, and allow others to build on them.  Make what they record easy to interpret, through informative description and straightforward formats.  And finally, try to understand and appreciate the connection between the record and the people for whom the record is important.

Which is why I sit now with my laptop in my hotel room, looking out on a bay that is now as dark as the night sky overhead, and trying to connect my experiences with the preservation challenges and proposals to come. Remember this, I mean to say.  It’s important.


Google Book settlement: Alternatives and alterations

In my previous post, I worried that the Google Books settlement might fall apart in the face of opposition from influential parties like the Copyright Office, and that such a collapse might deprive the public of meaningful access to millions of out of print books.

Not everyone sees it that way.  I’ve seen various suggestions of alternatives to the settlement for making these books available.  In this post, I’ll describe some of the suggested alternatives, explain why they don’t seem to me as likely to succeed on their own, and discuss how some of them could still go forward under a settlement.

Compulsory licenses

Both the Open Book Alliance’s court filings and the Copyright Office’s testimony mention the possibility of compulsory licensing, which essentially lets people use a copyrighted work without getting permission, provided that they meet standard conditions determined by the government.  Compulsory licenses already exist in certain areas, such as musical performances and broadcasts.  If I want to cover a Beatles song on my new record, I can, as long as I meet some basic conditions, including paying a standard royalty.  The (remaining) Beatles can’t hold out for a higher rate, or say that no one else is allowed to cover the songs they’ve released.

The Google Books settlement has some similarities to a compulsory license, but with some important differences, including:

  1. Book rightsholders can choose to deny public uses of their work, or hold out for higher compensation, which they generally can’t do under a compulsory license regime. (They have to explicitly request this, though.  So it’s really what one might call a “default” license.)
  2. The license has been negotiated through a court settlement rather than Congressional action. (This was one of the main complaints of the Copyright Office.)
  3. The license given in the settlement is granted only to Google, not to other digitizers. (This has justifiably raised monopoly concerns.)

I do have a problem with the last difference as it stands.  I’d like to see the license widened so that anyone, not just Google, could digitize and make available out of print books under the same terms as Google. But there are various ways we can get to that point from the settlement.  The Book Rights Registry created by the settlement could extend Google-like rights to anyone else under the same terms, as the settlement permits them to do.  The Justice Department could require them to do so as part of an antitrust supervision.  Or Congress could decide to codify the license to apply generally.  (They’ve done this sort of thing before with fair use and the first sale doctrine, both of which originated in the courts.)

If the settlement falls apart, though, negotiation over an appropriate license has to start over from scratch, and its advocates have to persuade Congress to loosen copyrights for benefits it might not clearly see.  As I suggested in my previous post, Congress’ recent tendencies have heavily favored tightening, rather than loosening, copyright control.  And I haven’t yet seen a strong coalition pushing for laws granting compulsory (or default) licenses that are as broad as would be needed.

For instance, the Open Book Alliance’s amicus brief suggests the possibility of a compulsory license, though only as “but one approach”, and that suggestion seems as much aimed at getting hold of Google’s scans as at licensing the book copyrights themselves.  Their front page at present shows no explicit advocacy of compulsory copyright licenses.  Perhaps they will unite behind a workable Google Books-style compulsory license proposal in the future, but I’m not counting on that.  (Update: Just after I posted this, I saw this statement of principles go up on the OBA site.  We’ll see what develops from that.)

The Copyright Office’s congressional brief also mentions but tries to damp down the idea.  It repeatedly characterizes compulsory licensing as something that Congress only does “reluctantly” and “in the face of marketplace failure”. But despite its strong words on other subjects, it does not appear concerned over whether we in fact have a marketplace failure around broad access to out-of-print books.

Orphan works legislation

The Copyright Office filing also suggests passing orphan works legislation (as have various other parties, including Google).  An orphan works limitation on copyrights would be nice, but it’s not going to enable the sort of large, comprehensive historical corpus that the Google Books settlement would allow.

As Denise Troll Covey has pointed out, the orphan works certification requirements recommended in last year’s bill, like many other case-by-case copyright clearance procedures, are labor-intensive and slow, and may be legally risky.  (In particular, the overhead for copyright clearance, not including license payment, can be several times the cost of digitization.)  Hence, these methods are not likely to scale well.  And they would not cover the many out-of-print books that aren’t, strictly speaking, orphans.  I don’t consider it likely that a near-comprehensive library  of millions of out-of-print 20th century books will come about by this route alone any time soon.

Even so, despite its limited reach, last year’s orphan works legislation was stopped in Congress after some creator organizations objected to it.  Some of the objectors, including the  National Writers Union and the American Society of Journalists and Authors, are now members of  the Open Book Alliance, which makes me wonder how effectively that group would act as a united coalition for copyright reform.

Private negotiation

Some critics suggest that Google and other digitizers simply negotiate with each rightsholder, or with a mediator designated by each rightsholder.  It’s possible that this might work for many future books, if authors and publishers set up comprehensive clearinghouses (as ASCAP and the Harry Fox Agency do for music licensing).  If new books get registered with agents like these going forward, with simple, streamlined digital rights clearing, private arrangements could work well for future books, both in print and out of print.  Indeed, Google’s default settlement license privileges don’t apply to new books from 2009 onward.

But it’s much less likely that this will be a practical solution to build a comprehensive collection of past out of print books from the 20th and early 21st century, because of the sheer difficulty and cost of determining and locating all the current rightsholders of books long out of print.   The friction involved in such negotiation (involving high average cost for low average compensation) is too great.  Without the settlement and/or legal reform, we risk having what James Boyle called a “20th century black hole” for books.

Copyright law reform

As James Boyle points out, it would solve a lot of the problems that keep old books in obscurity if books didn’t get exceedingly long copyrights purely by default.  It would also help if fair use and public domain determination weren’t as risky as they are now.  I’d love to see all that come to pass, but no one I know who’s knowledgeable about copyright issues is holding their breath waiting for it to happen any time soon.

Moving forward

As I’ve previously mentioned, the settlement is imperfect.  It may well need antitrust supervision, and future elaboration and extension.  (And I’ve suggested some ways that libraries and others can work to improve on it.)  It’s still the most promising starting point I’ve seen for making comprehensive, widely usable, historic digital book collections possible.  I hope that we get the chance to build on it, instead of throwing away the opportunity.  In any case, I’d be happy to hear people’s thoughts and comments about the best way to move forward.


Google Books, and missing the opportunities you don’t see

The Google Books settlement fairness hearing is still a few weeks away, but in the last few weeks the deal has been talked and shouted about at ever-higher volume.  Still, it wasn’t until the other day, in a House Judiciary Committee hearing where US Copyright Register Marybeth Peters came loaded for bear, that I started thinking there was a significant likelihood that the settlement might fall apart.

There are a number of people in different communities, including libraries, who hope this happens.  I’m not one of them.  I’m not a lawyer, so I can’t comment with authority on whether the settlement is sound law.  But I’m quite confident that it advances good policy.  In particular, it’s one of the best feasible opportunities to bring a near-comprehensive view of the knowledge and culture of the 20th and early 21st centuries into widespread use.  And I worry that, should the settlement break down, we will not have another opportunity like it any time soon.  The settlement has flaws, as the Google Books project itself does, but like Google Books itself, the deal it offers is incredibly useful to readers, while also giving writers new opportunities to revive, and be paid for, their out-of-print work.

The potential

Under the status quo, millions of books are greatly under-utilized.  It isn’t just that people don’t have easy access to them; it’s that people don’t know that particular books useful to them exist in the first place.  I work in a library that has collected millions of volumes, many of which are hardly ever checked out. Not only would Google’s expanded books offerings give our users access to millions more books, but it would also make millions of books that we already own easier for our users to find and use effectively.

Want to know what books make mention of a particular event, ancestor, or idea?  With existing libraries, and good search skills, you might be able to find books, if any, that are written primarily about those things. But you’ll probably miss much other information on those same topics, information in works that are primarily about something else.  With expanded search, and the ability to preview old book content, it could be much easier to get a more comprehensive view on a topic, and find out which books are worth obtaining for learning more.

And if that’s a big advance for people in big universities like ours, it’s an even bigger step forward for people who have not had easy access to big research libraries.  Once a search turns up a book of interest, Google Books would offer a searcher various ways of getting that book: buying online access; reading it at their library’s computer (either via a paid subscription, or via a free public access terminal); buying a print copy; or getting an inter-library loan.  These options all involve various trade-offs of cost and convenience, as is the case with libraries today.  While one could wish for better tradeoff terms, the ones proposed still represent big advances from what one can easily do today.

And as with other large online collections like Wikipedia or WorldCat, or the Web as a whole, the advantages to large book corpuses like Google’s aren’t just in the individual items, but in what can be done with the aggregation.  I don’t know exactly what new kinds of things people will find to do with a near-comprehensive collection of  20th century books, but having seen all that people have done with other information aggregated on the Internet, I’m confident that there would be many great uses found, large and small.

The peril

If the Google settlement does fall apart, are we likely to see any collection like the one it envisions any time soon?  I’m not at all confident we will.  The basic problem is that, without some sort of blanket license, it’s impractical (and in the case of true orphan works, currently impossible) to clear all the copyrights that would be required to build such a collection.  This represents a failure in copyright law.  Instead of “promot[ing] the progress of science and useful arts”, as the Constitution requires, current US copyright law effectively keeps millions of out-of-print books in obscurity, not producing significant benefits either to their creators or to their potential users.

The current proposed Google Books settlement is, among other things, an attempt to get around this failure.  If the settlement fails, would the parties make a new agreement that would allow a readable collection of millions of post-1922 online books?  The divergence in the complaints I’ve seen (for instance, on the one hand that the collection would cost readers too much, and on the other that it would pay writers too little) suggests the difficulty of coming to a new consensus that satisfies all the parties, if negotiations have to start again from scratch.  And, if the arguments of the Copyright Office and some of the other parties carry the day, even if such an agreement were reached, it could not be ratified by a court anyway.  Instead, it would require acts of Congress, and maybe even renegotiations of international treaties.

Based on past history, there are two things that would make the government likely to reform copyright law to permit mass reuse of out-of-print books.  Either there needs to be a clear example of the benefits of such a reform, or there needs to be a strong coalition pushing for it.  Clear examples have usually come from businesses that are actually in operation; for example, the player piano roll industry that successfully persuaded Congress to streamline music copyright clearance in the previous century (or the Betamax that persuaded a slender majority of the Supreme Court to declare the VCR legal).

If the proposed Google Books library service goes online, even under a flawed initial settlement, it too could provide a compelling example to encourage general copyright reform.  But without such an example, it can be hard to move Congress to act.   It’s easy to undervalue the opportunities you don’t clearly see.

What about a strong coalition pushing for a reform in the law that would let anyone create the comprehensive online collections of out-of-print books I’ve described?  I’d like to see one, but I haven’t yet.  (Yes, there’s the Open Book Alliance, but its members don’t seem to be united on much in particular other than objecting to the settlement.)  In my next post, I’ll discuss reforms that might do the job, and the reasons I believe they would be difficult to enact without the settlement.


Why should reuse be hard?

By far the most widely cited paper with my name on it is a 1995 paper on architectural mismatch.  The journal version of the paper was subtitled “Why reuse is so hard”.  It was a paper about failure rather than success, even though most researchers prefer to write about success when discussing their own work.  We discussed the problems we’d encountered trying to build a new software system from existing parts, analyzed some of the reasons for the failures, and suggested how systems could be improved in the future to make reuse easier.

The paper was unexpectedly well received, and was recently named as one of the most influential papers to appear in IEEE Software.  (I can’t claim too much credit for this myself; my adviser David Garlan and my fellow grad student Robert Allen rightly appear ahead of me in the author credits.)  ISI Web of Knowledge, which tracks the journal version of the paper, reports it’s been cited over 100 times in other journal articles; Google Scholar, which tracks both the journal version and the conference version that was published earlier the same year, reports hundreds more citations.

Google Scholar also reports an unexpected statistic: even though the journal version of a computer science paper is generally considered more authoritative than the earlier conference version (and rightly so, in our case), the conference paper has been cited even more often than the journal version.  Why is this?  I can’t say for sure, but there’s one important difference between the two versions: the conference paper has been freely accessible on the web for years, and the journal paper hasn’t.  It’s in a highly visible journal, mind you: pretty much anywhere with a CS department subscribes to IEEE Software, and many individual computer practitioners subscribe as well.  So I suspect that most of the authors who cited our paper could have cited the journal paper (especially since it came out only a few months after the conference paper did).  But the conference paper was that much more easily accessible, and it was the one that got the wider reuse.

We’ve recently published a followup to our paper, appearing in the July/August issue of IEEE Software.  As we note in the followup, the problem of architectural mismatch has not gone away, but several developments have made it easier to avoid.  One of them is the great proliferation of open source software since the mid-1990s, which provides a wide selection of software components to choose from in many areas, and “a body of experience and examples that clarify which architectural assumptions and application domains go with a particular collection of software” (to quote from our paper).

Just as the growth of open source has made software easier to reuse, the growth of open access to research can make ideas and research results easier to reuse.  We saw that with our initial paper, I think, and I hope we’ll see it again with the followup. I’ve made it available as open access, with IEEE’s blessing.  Interested folks can check it out here.


For those wanting more drama in their lives

Thanks to the scanning services of Penn’s Schoenberg Center for Electronic Text and Image (SCETI), and a loan of over a dozen volumes from Stanford University Libraries, we have now posted online records of copyright renewals for drama (and works intended for oral delivery) up to 1968.  They can be found in page image form at our Catalog of Copyright Entries information page.

We’ve had copyright renewals for regular books and periodicals online for some years now.  But we’ve been missing renewal records for one important type of literary text: the kind primarily meant for performing or proclaiming, rather than reading.  And it turns out that copyright renewal records for works that aren’t books are harder to find than you’d expect: I could find no copies at all in the Philadelphia area for the years 1955-1968.  So I’m grateful to Stanford for the loan of their volumes.  There are still a few more years of drama renewals left to digitize, but I’m hoping to find the later volumes in local libraries.

So far, we’ve only posted page images.  But I’m hoping that they’ll help people to make transcriptions and structured data records, as has been done with earlier page images we’ve posted.  I’ve tried to post easily readable but not-too-big copies of the pages online, to support manual searching and OCR.  Folks who need to see the masters, which have higher resolution but are substantially larger, can contact me.

Because Google is also starting to post original copyright registrations, it’s now possible to correlate renewals with original registrations, and look for interesting statistical phenomena and trends.  Already, for instance, one can see both that most copyrights were not renewed, and that the renewal rate can vary a lot by genre.  In 1931, there were 552 “Class C” copyright registrations, covering lectures, sermons, addresses, and other works prepared for oral delivery.  (See page 1659 of this volume.)  There were also 5,993 drama registrations (“Class D”; see page 406 of this volume).  In 1959, around the time these copyrights were up for renewal, there were 780 Class D renewals, a renewal rate of about 13%.  There was only 1 Class C renewal in 1959, a renewal rate of well under 1%.  (A few previous years have slightly higher Class C renewal rates, but not by much.)
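If you want to tabulate renewal rates yourself as more of these records come online, the arithmetic is simple division.  Here’s a minimal Python sketch using just the 1931/1959 figures above; the class labels and data structure are my own illustration, but the counts come straight from the Catalog of Copyright Entries pages cited:

    # Registration counts from 1931, and renewal counts from 1959 (when
    # those copyrights came up for renewal), per the CCE volumes cited above.
    registrations = {"Class C (oral works)": 552, "Class D (drama)": 5993}
    renewals      = {"Class C (oral works)": 1,   "Class D (drama)": 780}

    for cls, registered in registrations.items():
        renewed = renewals[cls]
        rate = renewed / registered   # fraction of registrations renewed
        print(f"{cls}: {renewed} of {registered} renewed ({rate:.1%})")

Running this prints a renewal rate of 0.2% for Class C and 13.0% for Class D, matching the figures above.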

After I get through with drama, there’s one more class of renewal I’d like to see online as soon as possible: image renewals.  Only a few years’ worth of image renewals (from the early 1950s) are online, but what’s there suggests that the renewal rate for most published images was also quite low.  Since a large number of books and periodicals include images of one kind or another, knowing which images are still under copyright will be very important for folks who want to look at complete facsimiles of older works.

The proposed Google Books settlement, for instance, proposes blanking out anything in a book that might be an image (unless rights for that image have been cleared).  This could make many books substantially less useful than they would be if the images were cleared (especially since Google’s processing algorithms have to guess where images are, and may blank out not only images, but also text that can’t be easily auto-recognized.)

In the meantime, the drama renewals just posted will also help me look into some requests I’ve received for mid-20th-century plays and speeches whose status I’ve been unable to determine previously.  I hope it will prove useful for others as well.  Thanks again to SCETI and Stanford for these.


Learn more about ILS discovery interfaces

I’m presenting today at a NISO webinar on interoperability, giving an overview of the work I did with a Digital Library Federation task group to produce recommendations for standard APIs for integrated library systems (ILSs) to support information discovery applications.

I’ll include a link to my presentation later today, after the webinar is over.   I’m also happy to answer questions here about the ILS-DI work.  (I’ve also covered that work here before in the blog.)

To help folks keep track of ILS-DI implementations and related activities, I’ve also created a new page on this site linking to the recommendation, implementations and follow-ons, and related projects.  I’ve started it with just the basics, but plan to fill in more information shortly.
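For those curious about what calling one of these interfaces looks like in practice, here’s a rough Python sketch of a client invoking the recommendation’s GetAvailability function over a simple HTTP binding.  The endpoint URL here is hypothetical, and parameter names and response formats vary by implementation, so treat this as an illustration of the general shape rather than a reference:

    from urllib.request import urlopen
    from urllib.parse import urlencode

    # Hypothetical ILS-DI endpoint; the real URL depends on your ILS and
    # which binding its ILS-DI implementation exposes.
    BASE_URL = "http://ils.example.edu/ilsdi"

    def get_availability(record_id, id_type="bib"):
        """Ask the ILS whether items on a record are available,
        returning the raw XML response for the caller to parse."""
        query = urlencode({"service": "GetAvailability",
                           "id": record_id,
                           "id_type": id_type})
        with urlopen(BASE_URL + "?" + query) as response:
            return response.read().decode("utf-8")

    # Example: check availability for a (made-up) bibliographic record.
    print(get_availability("1234"))

The point of standardizing functions like this is that a discovery application can ask any compliant ILS the same question in the same way, instead of needing custom code for each vendor’s system.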

Update: I’ve now posted my slides and speaker notes.


Getting bugs out of our systems

Very soon after we start learning to program, we start learning to deal with bugs.   Folks who have programmed for a while might forget that effective bug handling, like effective programming, is a skill that doesn’t come entirely naturally.

Many of us instinctively avoid criticism, ignore it, minimize it, or even argue against our critics.  But our programs will almost invariably include bugs, and to handle them, we have to go against the grain of our instincts.  If we’re smart, we make it as easy as possible to report bugs to us, so we minimize their impact.  We respect and listen carefully to what our clients tell us, to understand the problems they’re encountering with our product.  After we fix the bugs, we often  review our code and our practices to avoid similar problems in the future.

It helps a lot if we can keep our egos out of the bug-fixing process.  I know that my work will sometimes have bugs, and that a bug report should not be taken as a personal attack.  Rather, I try to make it an opportunity to improve my products and my future work.

Bugs exist at various levels.  Bugs that cause crashes are often the easiest to deal with: it’s clear that something is going wrong, and it usually isn’t hard to figure out what to do about it.  But less obvious bugs can be worse.  One product our library uses, for example, implemented boolean searches incorrectly, omitting important results. This kind of bug can mislead lots of people who never notice the problem.   (And it can also take longer to address.  I had to send multiple emails and examples to the developers of this product before they admitted that their implementation was buggy.)
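I don’t know exactly how that product’s boolean logic went wrong, but it’s easy to construct a similar failure.  Here’s a contrived Python sketch (the toy index and both functions are invented for illustration) in which a plausible-looking AND search silently drops matching documents, while the correct version simply intersects the posting sets:

    # A toy inverted index: term -> set of IDs of documents containing it.
    index = {
        "digital":      {1, 2, 5},
        "preservation": {2, 3, 5},
    }

    def and_search_buggy(terms):
        # Bug: zip() walks the sorted posting lists in lockstep, so a
        # document only "matches" if it occupies the same position in
        # every list.  Document 2 matches both terms but is silently
        # dropped; the kind of quiet omission users never notice.
        lists = [sorted(index.get(t, set())) for t in terms]
        return [row[0] for row in zip(*lists) if len(set(row)) == 1]

    def and_search_correct(terms):
        # Correct: a document matches if it's in every term's postings.
        postings = [index.get(t, set()) for t in terms]
        return sorted(set.intersection(*postings))

    print(and_search_buggy(["digital", "preservation"]))    # [5]: doc 2 lost
    print(and_search_correct(["digital", "preservation"]))  # [2, 5]

A test suite that only checked whether returned results were relevant, and never checked whether all the relevant documents came back, would pass the buggy version; that’s what makes this class of bug so hard to notice and report.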

Bugs at the overall system level can be the worst.  The reservation system with interminable holds, the customer support service that never returns our calls, the open source effort that repels key constituencies it should be attracting: all of these are buggy systems, and they can drive people away just as surely as a crashing program.  As Michael Bolton puts it, “a bug is something that bugs somebody who matters.”  System-level bugs can be challenging to fix, but they can be the most essential to repair.

I hope none of these principles seems new or controversial.  But I’ve recently seen a few bug reports concerning the Ruby on Rails community that drew many responses dismissing them.  The reports concerned buggy systems, not buggy code.  In particular, they noted a professional developer conference that attracted very few women, and an accepted presentation at that conference that included blatantly unprofessional themes, themes that one could easily predict would put off many of the people who could benefit from the talk.  (The themes would be particularly problematic if you were one of the few women there, but I found them distinctly off-putting as well.)

The comments on those two posts include plenty of examples of denial, minimization, rationalization, and attacking the reporters of the bugs.  (Indeed, some read as if they were cribbed right from this checklist of cliched defenses of in-group privilege.)

Assuming that the respondents are active members of the Ruby community, the responses suggest that there are still serious social bugs in that community.  I recently came back from another open source-focused conference (one that had a significantly higher proportion of women, though still far from 50%), where there were some good things said about using Ruby on Rails for library application development.  I like open source projects with good technical bases, but if I’m going to rely on a technology, I want its developer community to be healthy.  Healthy communities generally provide more reliable, longer-lasting development support, and can be much easier and more pleasant to work with.

It can be positively uncomfortable for many of us to confront social problems, particularly ones in our own communities that we might be partly responsible for.  (And Ruby is not the only community that’s had this kind of problem.)  Perhaps if we get used to thinking of these problems as bugs, welcoming and paying close attention to reports, and getting our egos out of the way, we’ll find it easier to fix them.

Gender inequities are bugs in our systems.  Bugs happen.  But they can be fixed.  As my library considers involvement in various community source development projects, I want to find out more about what these communities are doing, going forward, to fix and prevent these sorts of bugs.
