Surpassing all records

What will happen to all the White House emails after George W. Bush leaves office in January? Who will take charge of all the other electronic records of the government, after they’re no longer in everyday use? How can you archive 1 million journal articles a month from dozens of different publishers? Can the virtual world handle the Large Hadron Collider’s generation of 15 petabytes of data per year without being swallowed by a singularity? And how can we find what we need in all these bits, anyway?

These were some of the digital archiving challenges discussed this week at the Partnerships in Innovation II symposium in College Park, Maryland. Co-sponsored by the National Archives and Records Administration and the University of Maryland, the symposium brought together experts and practitioners in digital preservation for a day and a half of talks, panels, and demonstrations. It looked to me like over 200 people attended.

This conference was a sequel to an earlier symposium that was held in 2004. Many of the ideas and plans presented at the earlier forum have now come to fruition. The symposium opened with an overview of NARA’s Electronic Records Archives (ERA), a long-awaited system for preserving massive amounts of records from all federal government agencies, which went live this summer. It’s still in pilot mode with a limited number of agencies, but will be importing lots of electronic records soon, including the Bush administration files after the next president is inaugurated.

The symposium also reviewed progress with older systems and concepts. The OAIS reference model, a framework for thinking about and planning long-term preservation repositories, influences not only NARA’s ERA, but many other initiatives and repositories, including familiar open source systems like Fedora and DSpace. Some of the developers of OAIS, including NASA’s Don Sawyer, reviewed their experiences with the model, and the upcoming revision of the standard. Fedora and DSpace themselves have been around long enough to be subjects of a “lessons learned” panel featuring speakers who have built ambitious institutional repositories around them.

The same panel also featured Evan Owens of Portico discussing the extensive testing and redesign they had to do to scale up their repository to handle the million articles per month mentioned at the top of this post. Heavily automated workflows were a big part of this scaling up, a strategy echoed by the ERA developers and a number of the other repository practitioners, some of whom showed interesting tools for automatically validating content, and for creating audit trails for certification and rollback of repository content.

Networks of interoperating repositories may allow digital preservation to scale up further still. That theme arose in a couple of the other panels, including the last one, dedicated to a new massive digital archiving initiative: the National Science Foundation’s Datanet. NSF envisions large interoperating global networks of scientific data that could handle many Large Hadron Colliders’ worth of data, and would make the collection, sharing, reuse, and long-term preservation of scientific data an integral part of scientific research and education. The requirements and sizes of the grants are both prodigious– $20 million each to four or five multi-year projects that have to address a wide range of problems and disciplines– but NSF expects that the grants will go to wide-ranging partnerships. (This forum is one place interested parties can find partners.)

I gave a talk as part of the Tools and Technologies panel, where I stressed the importance of discovery as part of effective preservation of content, and discussed the design of architectures (and example tools and interfaces) that can promote discovery and use of repository content. My talk echoed in part a talk I gave earlier this year at a Palinet symposium, but focused on repository access rather than cataloging.

I’m told that all the presentations were captured on video, and hopefully those videos, and the slides from the presentations, will all be placed online by the conference organizers. In the meantime, my selected works site has a PDF of the slides and a draft of the script I used for my presentation. I scripted it to make sure I’d stay within the fairly short time slot while still speaking clearly. The talk as delivered was a bit different (and hopefully more polished) than this draft script, but I hope this file will let folks contemplate at leisure the various points I went through rather quickly.

I’d like to thank the folks at the National Archives and UMD (especially Ken Thibodeau, Robert Chadduck, and Joseph Jaja) for putting on such an interesting and well-run symposium, and giving me the opportunity to participate. I hope to see more forums bringing together large-scale digital preservation researchers and practitioners in the years to come.

Posted in architecture, discovery, preservation, repositories

What are the marketers of EndNote afraid of?

If you write papers on a regular basis, you’ll find it worthwhile to keep track of sources you might cite. When I was in grad school, I manually edited a BibTeX file to keep track of the references for my dissertation and other papers. Nowadays there are easier to use, Web-aware tools that let you automatically import citations as you do your research, organize, edit, and annotate them, and then include appropriate ones in your paper’s bibliography. One of the first products of this type was EndNote, a Mac and Windows application marketed by Thomson Reuters. It’s still widely used, but it’s hardly alone in this field. Also popular among scholars is the web-based RefWorks, marketed by ProQuest. And a new free entry, the open-source browser-plugin-based Zotero from George Mason University, is gaining popularity.

I don’t currently use any of these tools, but have lately been thinking about adopting one. And just now one of them, Zotero, got an unusual bit of marketing that makes me think it’s worth a try: its makers have just been sued for $10 million by Thomson Reuters, marketers of EndNote.

The text of the complaint filed in Virginia court is interesting both for what it says and what it doesn’t say. Thomson isn’t claiming that Zotero violated their copyrights or stole their trade secrets; they’re claiming rather that GMU violated the license of the software. The violation? Reverse-engineering the proprietary file format used by EndNote for style files, and allowing Zotero users to import EndNote-formatted style files into Zotero and export them into an open format.

Style files specify how bibliographic references should be formatted for different publishers. They allow you to automatically format the same citation in different ways depending on, say, whether you’re writing for Urban Studies or the Journal of the Royal Society of Medicine. The citation formats are specified by the publishers, not the bibliographic software developers; the style files should simply be an encoding of the publisher’s guidelines in a machine-actionable format.

Thomson claims that the ability to read these publisher guidelines in their proprietary format is a grave threat to EndNote. As they put it in their complaint:

GMU is willfully and intentionally destroying Thomson’s customer base for the EndNote Software […] by allowing and encouraging users of Zotero to freely convert the EndNote Software’s proprietary .ens style files into open source Zotero .csl style files and further distributing such converted files to others.

(I should note that the facts of this allegation are in question. GMU has yet to make an official statement on the claims of the suit, but Peter Murray claims that the Zotero code simply reads and interprets .ens files (which can be created either by Thomson or by EndNote users), and does not export them into other formats.)

Does Thomson have a legal case? That may depend on the language of the license and its enforceability. (On the one hand, Virginia is one of two states that passed UCITA, a law that gives software vendors wide leeway in dictating and enforcing software license terms. On the other hand, the license as quoted prohibits “reverse engineering [the] Software”, and the only reverse engineering I’ve seen has been on the file formats the EndNote software produces, not the software itself.) Folks interested in commentary from legal experts might find these posts of interest.

Thomson’s actions suggest serious weakness in its marketing case, at any rate. As a potential customer, I look for organizations that provide the best software and services for what I need to do, and that empower me to get my work done in the way I see fit. I’m willing to pay a fair price for this software and service, and to respect developers’ copyrights, but I in turn want value for money, and respect for the customer’s needs.

If EndNote provides service and value that’s superior to its competitors, that should be enough to retain and grow its customer base. It shouldn’t need to try to lock the data the software outputs in proprietary formats, to impose license terms on its customers to keep them opaque, or to sue its customers when they nonetheless figure out how to decode them. Whatever their internal motivation, Thomson’s actions appear from the outside to be driven more by fear of competition than recovery of ill-gotten gains.

For users of Zotero, the suit is at worst an inconvenience. Even if the ability to read .ens files is removed from Zotero, users can simply create and share their own style files. (There’s some sweat of the brow involved, but much less than $10 million worth.) Zotero’s style repository has already grown to include styles for over 1100 journals, according to the Zotero blog, and instructions are available for anyone who wants to create and contribute additional styles. And if George Mason is forced or intimidated into stopping development of Zotero, anyone else is welcome to pick up where George Mason left off, thanks to Zotero’s open source license. (But if you want to develop without risking this kind of suit from Thomson, you might want to first make sure you’re not an EndNote customer when you start your work.)

Thomson is hardly the only software company to make its customers deal with proprietary formats, constrictive software licenses, and threats of legal action for disobeying license terms. One of the attractions of using free (“as in freedom”) software is not having to work under these burdens. But companies that sell software and services don’t necessarily have to impose them either. Copyright and trademark laws already prohibit users from misappropriating software and commercial brands. Lots of products do quite well in the marketplace with open formats.

So if you’re considering buying software or services from someone, and they use their own proprietary formats, or say that using their product requires assent to a complicated, onerous license agreement, you might want to ask yourself: “What are they afraid of?” Perhaps you might want to ask them as well.

Posted in citations, copyright, crimes and misdemeanors, formats, sharing

Why Banned Books Week matters

It’s Banned Books Week again, and Amnesty International, the American Booksellers Foundation for Free Expression and the American Library Association are among the groups noting the occasion. I’ve also updated the links on my ongoing exhibit Banned Books Online in preparation for this week, a time when the exhibit gets an especially large volume of visits.

Banned Books Week is really about two different, but related, things. The first of these, the focus of sites like Amnesty’s and the “Books Suppressed or Censored by Legal Authorities” section of my exhibit, deals with attempts to restrict who is allowed to speak about what matters to them. And in a lot of the world, the right to speak out is severely and violently repressed. The other day I added to my online books collection a number of titles from Human Rights Watch, which has many books, press releases, and other publications about grave threats to freedom of the press and freedom to protest in places like Burma, Chile, China, Cuba, Pakistan, Turkey, Venezuela, various Middle Eastern and African countries, former Soviet republics, and many other places around the world.

Americans enjoy a country with a much freer press than the countries above (and indeed, a freer press than we had in my grandparents’ day). We’re not perfect; our legal system does sometimes suppress legitimate expression, for a time at least, in the name of security, copyright, or “the children”. (And sometimes the threat of criminal violence can suppress books when the law does not.) It is worth remembering the important books that can be published thanks to the free press, and not to take them for granted.

But the banned books lists you’ll find in many libraries and bookstores (or in dubious chain emails) don’t focus much on the political samizdat, security exposés, or portrayals of Mohammed that are the objects of forcible suppression today. Instead, they’re often full of classics and popular titles sold widely in bookstores and online– or dominated by books written for young readers, or assigned for school reading. Some of the titles in these lists have been the targets of publication suppression at some point, but many (like those in the Harry Potter series) have not.

So is it wrong to call these books banned? Are lists like these just “shameless propaganda”, as some conservatives charge, or a hapless attempt to market classic literature to teens, as satirized in an Onion piece?

Not if you take readers seriously. An unread book, after all, has as little impact as an unpublished book. The bans that dominate the ALA lists are the obverse of publication bans: they’re attempts to restrict who is allowed to hear about what matters to them. True, their reach may be smaller than the government bans that can keep a book out of an entire state or country. And it may often be easier to circumvent these kinds of bans. (Particularly if you have a driver’s license, a credit card, and easy Internet access, things that adults often take for granted but that many kids lack.) But censorship at the reader’s end can be just as injurious as censorship at the writer’s end.

Librarians and teachers necessarily select certain books, and not others, for their collections and classes, and decide where they will best work. And it’s right for patrons of the schools and libraries to have some say in these selections (even if the professionals should generally be allowed to do their jobs). So simply counting “challenges” to a book isn’t very informative. But there’s a world of difference between asking “Isn’t this more appropriate for the YA shelves than for the early readers section?” or “Would this title be a better fourth-grade book on this topic than the one currently being used?”, and insisting “None of our kids should be reading about this kind of thing!” when “this kind of thing” is already on the minds of those kids, or something that they should be thinking about. The “Unfit for Schools and Minors?” section of my Banned Books Online exhibit describes some of the more dubious attempts to keep books out of the hands of young readers.

My oldest child is only 8, but he’s already coming up with new and challenging questions on an almost-daily basis. By the time kids reach double-digit ages (which is the young end of the audience for most of the controversial books) they have lots of questions about life, death, sexuality, unfairness, hatred, violence, drugs, and religion. They deserve the chance to explore answers to these questions in their reading and in their conversations.

In the process, they may encounter some ideas they’re not ready to deal with fully. (But encounters with text are often naturally self-regulated. When I was a young precocious reader, I’d usually skim over difficult parts or lose interest in a book that had them. More than once I’ve been surprised going back to a book as an adult and seeing what I’d missed as a kid.) Kids will also certainly encounter lots of dubious ideas and counsels. But mainstream culture is full of these as well, and I hope that I and other parents will teach our kids how to evaluate those wisely, whether or not they come from sources we usually think of as “controversial”.

Banned Books Week is thus about twin freedoms: the freedom to write about what matters to you, and the freedom to read about what matters to you. In this week’s observance, I hope we grow to better appreciate these freedoms and the power of books and ideas.

Posted in censorship, crimes and misdemeanors, libraries, online books, reading

Repositories: Benefits, costs, contingencies (with an example)

(This is the third post in a slow-cooking series on repositories.)

In my last repository post, I listed a variety of repository types that we maintain at our institution, each with different content, operation, and policies. At the end of the post, I wrote:

Once we have a clear understanding of why we would benefit from a particular repository, and what it would manage, we can consider various options for who would run it, where, and how. (And of course, what its costs would be, and how we can realistically expect those costs to be covered….)

Without a clear sense of benefits and costs, you won’t have a sensible repository strategy. And, as Dorothea Salo reminds us today, without a sensible strategy you’re likely to burn through a lot of money, labor, and goodwill with little to show for it at the end. You have to go in knowing what you want, and being realistic about what you’re willing to invest to produce it. (For instance, if you’re planning to build a repository of your community’s own scholarship, and hope to get lots of free help from your community just by doing some marketing, you really need to read Dorothea’s post for a reality check.)

Even when your initial plan is sound, you have to be prepared for change, and the unexpected. Technology changes quickly. Online tools, communities, and scholarly societies also change. Methods of scholarship also change, often more slowly, but sometimes in significant ways. Even if you’ve done your homework, you may eventually find that the repository that seemed just fine a few years ago doesn’t really meet your needs like it used to. Maybe the software hasn’t been updated as you’d like it, and there’s a better system available now. Maybe you’re storing different kinds of things, or you’ve found a new application that your scholars really want to use that’s not compatible with your existing setup. Maybe the formats you’re managing have gone out of date. Maybe it becomes more cost effective to move to a big externally managed repository that your scholars are flocking to already– or away from one that they’re not finding useful. Maybe you even decide it no longer makes sense for you to maintain a particular repository.

You need to start thinking about strategies for change (and for exit) the moment you start planning a repository. Remember, repositories ultimately don’t exist for themselves, but for their content (and for the people using that content). And the kind of content that libraries often care about is likely to remain relevant much longer than any particular repository configuration. You want to ensure that the content remains useable for as long as your patrons care about it, even as it moves and migrates between systems (and possibly, between caretakers).

An example: Planning for data repository services

What does it mean, practically, to plan with benefits, costs, and contingencies in mind? Well, at Penn, we’re starting to consider repository services for data sets. We have a general idea of the benefits of archiving data sets, because we’ve heard from faculty in various departments who want to analyze data previously collected by research groups (their own or others’), who are having a hard time managing their own data, or who are required by their journals or support agencies to publish or maintain their data sets. Before we commit to providing a new data repository service, though, we need a better sense of these benefits. How broad and deep is the desire for data services among our faculty? Where is it most acute, in terms of disciplines and services? What would be gained from having our institution provide our own data repository services, rather than just having our scholars use someone else’s services, or fend for themselves? What are the benefits of introducing services specifically for data, rather than just, say, saving data sets alongside other files in existing repositories? If we’re considering a significant investment, we need more than just anecdotal answers to these questions. A survey of faculty in various disciplines can give us a better idea of how they could benefit from and support data repository services.

We also have to consider costs. What options do we have for creating, acquiring, or contracting with a data repository or repository service? What do they cost to install and run, both in monetary and staffing terms? What are the costs of acquiring content (again in money and labor, where the labor might come from librarians, scholars, or students)? How about costs of maintaining, accessing, and migrating the content? How will these costs be covered? What about costs associated specifically with this kind of content? Are there confidentiality, security, intellectual property, or liability concerns we have to consider? To help answer these questions, we should evaluate various data repository systems in existence and in development. The faculty survey mentioned above could also help us answer some of the questions about labor and support.

Contingencies, by their nature, tend not to be fully foreseeable. But there are a few obvious things we can ask about and plan for. Will our data still be readable for decades to come? Can we migrate it to new formats, and if so, what would be involved? Can we make sure we have good enough metadata and annotation to know how to read, use, and migrate the data in the future? Do we have clear identifiers for our content that will survive a move to a new platform (and leave a workable forwarding address, if necessary)? What happens to our content if our repository loses funding, our machine room is sucked into a mini-black-hole, or we simply decide it’s not worth the trouble of keeping the repository going? What do we do if we’re told to withdraw or change the data we’re maintaining, by the person who deposited it, by someone else using or mentioned in the data, or by the government? We won’t necessarily come up with definitive answers to all these questions, but brainstorming and thinking through possible and likely scenarios should help us know what to expect and reduce the chance of our getting caught unawares by a costly problem.

Is it worth it?

That’s a lot to do, you might be thinking, before you even get started. Can’t we just put this cool system up and see what happens? Well, you could, if you and your community will be satisfied with something that might be here today and gone tomorrow, and that doesn’t have any support or reliability guarantees. But if you have scholars to serve, and you’d like them to take the time and trouble to entrust their content to your repository, they’re probably going to want some reassurance that the repository will have staying power, and give them benefits worth their time. Otherwise, they have plenty of other, more important things to do.

Running a large, successful, long-lasting repository takes a lot of work over its lifespan. Better to do some planning work up front than get stuck with a lot of costly and unnecessary work later on.

Posted in repositories

Getting fair use right: Maximize what you give, minimize what you take

This week’s Harry Potter court decision in New York is well worth reading for anyone who’s interested in knowing whether something is fair use or an illegal copyright infringement. The case involved an unauthorized lexicon of the Harry Potter books that was to have been published as a book last fall. (The book was adapted from a free web site edited by the lexicon’s writer.) J. K. Rowling and Warner Brothers, who created the Harry Potter books and movies, sued. The PDF of the resulting court decision has been posted at Groklaw, which also has a text version with commentary. There’s also some interesting discussion on Teleread, including a long comment from someone involved in a similar case with a different outcome.

Neither side got all they wanted in the Potter case. The judge, Robert Patterson, ruled that the lexicon violated Rowling’s copyrights, put the kibosh on the book, and fined the publisher. But he imposed the minimum fine prescribed by statute, and made it clear that, contrary to Rowling’s claims, other people were welcome to publish lexicons and other nonfiction books that comment on Harry Potter or other works of fiction, without having to get the copyright holder’s permission. They just had to be more sparing in their reuse of the work than the author of this lexicon was.

A key question in the case had to do with the purpose of the book at issue. Was it “transformative”; that is, was it trying to do something essentially new and original, using the older work as base material? Or was it simply a rearrangement of the work, or derivative variation on a theme? Fan fiction, for instance, is usually considered derivative rather than transformative work (since it, like the original it’s based on, is typically a story meant for entertainment, based on the same characters, settings, and plot structure). As a derivative work, it gets minimal fair use protection. Likewise, in the same circuit that decided the Harry Potter case, an unauthorized Seinfeld trivia game was ruled not to be fair use, since it simply retold imagined events from the TV show in a new arrangement, without adding significant original content. (The Seinfeld case, known as Castle Rock vs. Carol Publishing, was repeatedly cited in the Harry Potter decision, as were a number of “unauthorized guidebook” cases.)

Patterson ruled that a lexicon was transformative use of Rowling’s novels, since a set of stories was transmuted into a reference guide that included original commentary on the story elements. Unfortunately, there wasn’t that much original commentary in the lexicon, and the amount of material quoted from Rowling was a good deal more than what was needed for that commentary, the judge ruled. (Note that I have not read the lexicon myself; for the purposes of this post, I’m relying on Patterson’s findings of fact.) Moreover, the lexicon also borrowed heavily from two companion volumes by Rowling, Quidditch Through the Ages and Fantastic Beasts and Where to Find Them, that already were very similar in form and intent to the lexicon.

A pure lexicon could simply have had short definitions (say, 1 or 2 sentences of original prose) for each character or concept in Rowling’s books, and then simply cited places in Rowling’s works where the character or concept appears or is further described. Instead, all too often the author apparently wanted to mention everything significant Rowling had to say about things in the lexicon, and borrowed extensively from Rowling’s text, either literally quoted or closely paraphrased. (Paraphrasing doesn’t avoid the problem of copyright infringement, if you’re still copying the author’s imagery or other original expression.)

Extensive reuse of Rowling’s expression might still have been okay if the author needed to comment specifically on that expression. (For instance, a critic might quote Rowling’s use of imagery for magical spells to compare it to, say, Tolkien’s imagery for the same concept.) But too often, the copying of Rowling’s expression in the lexicon was not used to back up original commentary by the author, but was used instead of original comment. This happened often enough, Patterson decided, that he could not uphold a claim of fair use.

The lexicon’s publication as a book sold commercially, as opposed to its earlier form as a noncommercial website, was also a factor in the final ruling. But it wasn’t as decisive as one might imagine, and the judge devotes relatively little text in the decision to this factor. Even a commercial book on Harry Potter can be fair use; and a noncommercial website on Harry Potter (such as one that posts complete copies of Rowling’s books) can be infringing.

The take-away from this decision is that authors of commentaries and guides to other works of fiction can proceed in many cases without permission, provided that they’re making significant original contributions to readers’ understanding of the works they comment on, and that they reuse or quote only what is necessary to provide these contributions. In other words, if you’re writing one of these guides, the focus should be on what you’re giving to the reader, rather than on what you’re taking from the earlier writer.

Online opinion of Patterson’s decision has been mixed, with some applauding the final ruling and some arguing against it. I’m not a lawyer, and don’t presume to say whether he got the ruling exactly right. But I think his extensive discussion of the facts and precedents behind his decision provides a valuable guide for writers who want to maintain the proper and legal focus in their own fair use of others’ work.

Posted in copyright

Changing the subject(s)

When I implemented subject maps for browsing the Online Books Page by subject a while back, I had a big problem to face: I didn’t actually have subject terms for the books. How could I implement subject browsing without subjects?

By bootstrapping. I did have call numbers for the books, which arrange books by discipline. (That’s what lets you see similar books grouped together in a library’s nonfiction shelves.) It’s possible to infer a subject from a call number, though you’ll sometimes get a more general subject than the book’s really about, or miss secondary subjects. The Library of Congress authority records for subjects include call number ranges that apply to many of their authorized terms. My library had some of these authority records in its catalog. They’re often a few years behind the state of LC’s official list, but they’re still recent enough to be useful.

So I downloaded those records and then wrote a program that, given a call number, would try to find the subject with the smallest range that included that call number. Doing that helps you get specific subjects instead of general ones; if your call number is HD1306, you want to match the range HD1301-HD1306 (for “Land, Nationalization of”) rather than the wider range HD101-HD1395 (for the more general subject “Land use”). After filtering out some bad data in a few authority records, and suppressing some terms to break ties, I ran the program, and instantly got subject terms for tens of thousands of books. Some of them were pretty generic (thousands of books were simply labeled “English literature”, for instance), but many were quite specific, and I’d say over 90% of the time were useful descriptions of the book. The maps I built largely based on these assigned subjects worked pretty well from day one.
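The matching step described above might be sketched roughly as follows, in Python. This is not the program actually used for the Online Books Page. The names `SubjectRange`, `parse_call_number`, and `narrowest_subject` are hypothetical, the sample table is an illustrative stand-in rather than real authority data, and ties are simply left to whichever candidate comes first instead of being broken by suppressing terms.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class SubjectRange(NamedTuple):
    letters: str   # classification letters, e.g. "HD"
    low: float     # first number covered by the range
    high: float    # last number covered by the range (inclusive)
    subject: str   # subject term associated with the range


# Illustrative stand-in data, not copied from real authority records.
SAMPLE_RANGES: List[SubjectRange] = [
    SubjectRange("HD", 101, 1395, "Land use"),
    SubjectRange("HD", 1301, 1306, "Land, Nationalization of"),
    SubjectRange("PR", 1, 9680, "English literature"),
]

# Leading class letters followed by a (possibly decimal) class number.
_CALL_NUMBER = re.compile(r"^\s*([A-Z]{1,3})\s*(\d+(?:\.\d+)?)")


def parse_call_number(call_number: str) -> Optional[Tuple[str, float]]:
    """Split a call number such as 'HD1306' into ('HD', 1306.0)."""
    match = _CALL_NUMBER.match(call_number.upper())
    if match is None:
        return None
    return match.group(1), float(match.group(2))


def narrowest_subject(call_number: str,
                      ranges: List[SubjectRange]) -> Optional[str]:
    """Return the subject whose range is the smallest one containing the call number."""
    parsed = parse_call_number(call_number)
    if parsed is None:
        return None
    letters, number = parsed
    candidates = [r for r in ranges
                  if r.letters == letters and r.low <= number <= r.high]
    if not candidates:
        return None
    # The narrowest enclosing range gives the most specific subject.
    return min(candidates, key=lambda r: r.high - r.low).subject
```

With this sample table, `narrowest_subject("HD1306", SAMPLE_RANGES)` picks the narrow nationalization range over the broad land-use one, while a call number like HD500 falls back to the broader subject. A linear scan is fine for a sketch; with many thousands of ranges, grouping them by class letters first would likely be worthwhile.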

I didn’t stop there, though. Little by little I went back and found more precise and appropriate subjects for books in various parts of my collection. When I did this, I could also assign multiple subjects, instead of having to make do with one. (I’m not trained as a cataloger, but I know the basics of how LCSH subject assignment works, and can look over terms assigned by the Library of Congress or other libraries and choose the ones that seem to make the most sense for a given title.) I also kept track of which books had the automated subject assignments, and which had human-overseen cataloging.

As of today, I’ve assigned these more precise and comprehensive subjects to all the non-literature books in my collection that had call numbers. A lot of the fiction, and some of the more obscure nonfiction without a call number, still lacks subject cataloging. (As far as I can tell from Worldcat, many fiction books have never been subject-cataloged by anyone.) Some of these books will eventually get subject terms as well. But the only automatic subject assignments left are for a few (mostly large) generic literature categories, which by now are mostly getting in the way of discovering the other books. So today I’m turning those automated categorizations off.

Now that I’ve completed this phase of subject browsing enhancement, I’m excited to think about what might come next. I know from my usage logs that lots of people are browsing by subject (and that, based on the bad link reports I get, they’re finding books that had largely been overlooked before). Now that I have consistent high-quality subject metadata for nonfiction, I can think of various ways to improve subject-based discovery, both for this collection and for others. I can work on ways to keep the subject map up to date with the latest changes in subject vocabularies. I can implement techniques for establishing more relevant connections between subjects. I can investigate ways to integrate data from less consistent sources (such as most large library catalogs) into subject maps and compensate for (or even automatically correct) their inconsistencies.

For now, though, I’ll stop for a moment, take a breath, and come up to blog, before diving back into this and other projects.

Posted in discovery, libraries, meta, online books | 2 Comments

Celebrating freedom, in various ways

This week marks both the anniversary of Canadian confederation (Canada Day, July 1) and the anniversary of American independence (Independence Day, July 4, or should that be July 2?). This week I’m also finishing up subject cataloging of online books on both countries (or at least, all the ones for which I previously had less precise US and Canadian history subjects automatically assigned). So if you’d like to read up on Canadian or United States history, there are plenty of free online books on various relevant topics you can browse through.

There’s a lot more that I could potentially index, so please suggest titles or topics you’d like to see. (I sometimes find suggestions myself, such as this book on India’s history recommended in this blog comment, but it’s generally quicker and more reliable to tell me directly.)

Even if I can’t list the precise title you’re asking for, it’s often possible to find similar titles if I know what kinds of books you’re interested in. I recently received a request for a few published diaries that are still under copyright and that I couldn’t list, but knowing the person’s interests, I could point them to this subject map, where you can browse through published diaries from all kinds of people, times, and places online, arranged by subject.

I’m very thankful to the folks who put the materials online that I list. You never know when something you post will suddenly become important or attract widespread interest. Who would have guessed, for instance, that a 1957 Air Force study of Chinese Communist techniques to extract false confessions would become relevant to present-day investigations of interrogations at Guantanamo? Thanks to the folks at PubMedCentral, we have not only that article, but the full journal issue on various aspects of Communist prisoner mistreatment (which I only wish were not so relevant today) in which this article appeared, and indeed, nearly the entire run of that medical journal. (I’ll list the journal later today in the serials listings of The Online Books Page.) The articles in that journal issue remind us of some of the oppressions that Americans have fought against since our country’s founding, and that I hope we can defeat again.

Sometimes the digitizers need a little help. A while back, legal threats led to the shutdown of the International Music Score Library Project, based in Canada and the US. The scores in the library were public domain in those countries, but cease-and-desist notices were sent based on the possibility of scores being downloaded by musicians in countries with longer copyright terms. I’m happy to hear that the library is back online, thanks in part to legal support from both Canada and the US. I’ll be adding this library to my archives and indexes page today as well.

Increasingly, good information resources on just about any subject are freely and legitimately available online, or can be. As we celebrate freedom this week in much of North America, let’s also remember and thank the folks who help free knowledge and culture online.

Posted in online books

Repositories: What they are, and what we use them for

(Note: This is the second of an ongoing series of posts on repositories. The first post is here.)

The JISC Repositories Support Project defines a digital repository as “a mechanism for managing and storing digital content.” I find this a useful definition, both for what it says and what it doesn’t say. It notes that repositories, as such, focus on content and its management. It doesn’t say anything about the kind of digital content managed by the repository, or about the use this content is put to.

A repository’s focus is related to, but distinct from, the focus of a library or an application. Repositories focus on particular information content. Applications (like Zotero, FeedReader, or Google Docs) focus on particular information tasks, like tracking citations, getting news, or authoring documents. Libraries focus on the information needs of particular communities (which might be towns, schools, peer researchers, or Internet users with particular interests). Applications and libraries may use repositories to support their tasks or communities, and some may be primarily built around one specific repository (as most libraries in the pre-computer age were built around what was in their physical stacks). But they are not identical to their repositories, and it’s often useful to distinguish the functions of a library and the functions of the repositories that it uses.

At the same time, though, you can’t plan the development of a library without thinking about its repositories. Repositories really are essential infrastructure for libraries, but not simply as a place to “capture and preserve the intellectual output of university communities” (as a 2002 SPARC white paper put it), or, more pessimistically, as “a place where you dump stuff and then nothing happens to it” (as a 2005 JISC workshop annex put it). The Penn Libraries today rely on hundreds of digital repositories, mostly run by various publishers. We also manage a few important ones ourselves. Here are a few that we manage, or are considering managing:

  • A repository providing open access to the scholarly output of our researchers (what is often thought of as the traditional “institutional repository”). For this repository, we manage the content, and contract with an outside company to manage the servers and develop the software. While many faculty cooperate in populating this repository, and some faculty deposit their own work themselves, librarians do much of the work to populate it.
  • A repository preserving content from some of our electronic subscription resources. This repository is normally only seen by library staff, but it’s an important part of our preservation strategy, and will be exposed selectively when subscription resources it preserves are no longer available from the publisher. We run this repository on a local server, using open source software developed elsewhere, and its content is selected by us and ingested and preserved largely automatically, in cooperation with other users of the same repository software. (We also subscribe to another preservation initiative, involving a centralized preservation repository system that we don’t manage.)
  • The repository used to store content in our main courseware management system. The server is managed by us, using proprietary software, and is populated by instructors from all over the university. It is largely torn down and built anew every semester (sometimes carrying over material from previous semesters’ incarnations). While this isn’t a permanent repository, it has very strong and definite persistence requirements that we have to take pains to support. And if some of our users just think of this as a place to do their teaching, and the “repository” aspects just come along for the ride, that’s a feature, not a bug.
  • Repositories for various digital image collections and digitized special collections. Historically these collections have been a mishmash of systems developed ad-hoc, involving filesystems, metadata in a database, custom-built websites, backup procedures, and sometimes little else. We’re currently locally developing a digital library architecture that will unify discovery and usage of many of these collections, and we hope to similarly unify repository management for many of these collections as well. Traditionally, the content is selected by bibliographers and the repositories and collection sites created by techies; we hope that the new architecture will let the bibliographers do more repository management and site design, and let the techies do less site-by-site management and more unified service management.
  • We have also tested repositories for managing numeric data, which are increasingly important shared research resources in many fields. We do not currently have a repository in production for this, but the repositories developed by projects like this one have important features for data-centric research that are not supported to the same extent by “traditional” repository systems.

As you can see from these examples, libraries like ours have all kinds of different uses for repositories, and various ways we can develop and manage them. We’re not starting repositories because they’re what all the cool Research I libraries are doing this year. We’re managing them because they help us provide what we see as important services to our communities. We recognize that different repositories have different uses, and that it often makes more sense to integrate multiple repositories into a single library than to build One Repository to Rule Them All. Once we have a clear understanding of why we would benefit from a particular repository, and what it would manage, we can consider various options for who would run it, where, and how. (And of course, what its costs would be, and how we can realistically expect those costs to be covered. But that’s a topic for another post.)

Posted in repositories

Now it’s official

As I hoped, the good news was announced by Peter Brantley of the Digital Library Federation while I was away in Canada: the recommendations of the ILS-Discovery interface task group, which we’ve been talking about and drafting over the last many months online and off, have now been officially released. You can find the official release on the DLF website. We’ll be putting some supplementary information on there shortly as well; for now, you can still find background and supplementary material on our wiki.

I’d like to thank the members of the task group for all their work in putting the recommendation together; the Digital Library Federation for sponsoring this work; our steering group (Dale Flecker, Robert Wolven, Marty Kurth, Terry Ryan, and especially Peter Brantley) for all sorts of help and support in making this initiative viable; the Penn Libraries for supporting my chairing the task group this past year (as well as hosting one of the early meetings); the vendors that signed the Berkeley Accord for meeting with us and agreeing to support the basic discovery interface functions described in our recommendation; and the many library folks, developers, and vendors that gave us suggestions and publicity.

We’ve intended the recommendations to be a first step in an ongoing process of supporting interoperability between the online data and services of libraries and a wide range of discovery applications. The recommendations we produced give fairly detailed proposals for a basic level of interoperability, and more open-ended proposals for higher levels. But you should only spend so long on proposals before it’s time to shift emphasis onto implementing them. With the official version now out, I hope we can start implementing these functions in earnest. (And once we’ve accumulated some experience with implementations, I hope that folks will revisit and refine the recommendations to further help things along.)

Locally, we already have one demonstration implementation, and we hope to now work on getting the basic functions implemented for our actual ILS. And I hope that many others will be working on or using implementations soon. The DLF is now planning a developer’s workshop for folks interested in implementing the ILS-DI recommendations, which hopefully will convene later this summer. There should also be online forums of various kinds to support folks who are interested in implementing the recommendations or using them in their applications. Exactly how these forums will develop over time remains to be seen; but for now, the ILS-DI Google Group is one good place to look for news and discussion of activities related to the ILS-DI recommendation.

I’m thankful myself for having the opportunity to work with so many good people on this project, and look forward to getting to work on implementations, and to continuing the conversations that have started to make the most of library resources and services.

Posted in architecture, libraries

A break, and coming attractions

I’m about to head off to the wilds (okay, the farms) of Saskatchewan to relax with family on a much-welcomed break. I’ve got to the point in packing where we’re trying to figure out which books to bring. (Which involves some careful selection to narrow it down to the number of books we can bring on the ever-more-limited-space airlines without excess baggage problems.)

I leave the ILS-Discovery Interface work in good hands, and there should be good news shortly (hopefully, quite shortly, and well before I return) for folks who are interested in this initiative. I’ll have more to say on what comes next after I get back. Also after I come back, a couple of weeks from now, I’ll be picking up on the repositories series I started last month, with a review of the what-why-who-and-where of the various kinds of repositories that libraries may find of use.

Online book fans may also be interested in following a debate going on now about ebook publishing, business models, and piracy. Author David Pogue had a Times Blog post a couple of weeks ago giving his reasons for not issuing electronic editions of his titles, which drew a long set of reader comments. Now Adam Engst has posted an interesting and detailed rebuttal, where he describes his own sales successes with his ebooks (piracy notwithstanding).

You might also enjoy “Reading sets you free”, an article posted about a month ago by K. G. Schneider (whom I had the pleasure of meeting in person recently at a NISO discovery forum). I was reminded of it again just now as I was trying to think of what books the kids might bring. As in the picture accompanying her article, both of them are very much read-under-the-covers kids at this point, as were both their parents. We’re all looking forward to spending a lot of time conversing with each other and with our books these next couple of weeks.

Posted in architecture, online books, reading | 1 Comment