Surpassing all records

What will happen to all the White House emails after George W. Bush leaves office in January? Who will take charge of all the other electronic records of the government, after they’re no longer in everyday use? How can you archive 1 million journal articles a month from dozens of different publishers? Can the virtual world handle the Large Hadron Collider’s generation of 15 petabytes of data per year without being swallowed by a singularity? And how can we find what we need in all these bits, anyway?

These were some of the digital archiving challenges discussed this week at the Partnerships in Innovation II symposium in College Park, Maryland. Co-sponsored by the National Archives and Records Administration and the University of Maryland, the symposium brought together experts and practitioners in digital preservation for a day and a half of talks, panels, and demonstrations. It looked to me like over 200 people attended.

This conference was a sequel to an earlier symposium that was held in 2004. Many of the ideas and plans presented at the earlier forum have now grown into fruition. The symposium opened with an overview of NARA’s Electronic Records Archives (ERA), a long-awaited system for preserving massive amounts of records from all federal government agencies, that went live this summer. It’s still in pilot mode with a limited number of agencies, but will be importing lots of electronic records soon, including the Bush administration files after the next president is inaugurated.

The symposium also reviewed progress with older systems and concepts. The OAIS reference model, a framework for thinking about and planning long-term preservation repositories, influences not only NARA’s ERA, but many other initiatives and repositories, including familiar open source systems like Fedora and DSpace. Some of the developers of OAIS, including NASA’s Don Sawyer, reviewed their experiences with the model, and the upcoming revision of the standard. Fedora and DSpace themselves have been around long enough to be subjects of a “lessons learned” panel featuring speakers who have built ambitious institutional repositories around them.

The same panel also featured Evan Owens of Portico discussing the extensive testing and redesign they had to do to scale up their repository to handle the million articles per month mentioned at the top of this post. Heavily automated workflows were a big part of this scaling up, a strategy echoed by the ERA developers and a number of the other repository pracitioners, some of whom showed some interesting tools for automatically validating content, and for creating audit trails for certification and rollback of repository content.

Networks of interoperating repositories may allow digital preservation to scale up further still. That theme arose in a couple of the other panels, including the last one, dedicated to a new massive digital archiving initiative: the National Science Foundation‘s Datanet. NSF envisions large interoperating global networks of scientific data that could handle many Large Hadron Colliders worth of data, and would make the collection, sharing, reuse, and long-term preservation of scientific data an integral part of scientific research and education. The requirements and sizes of the grants are both prodigious– $20 million each to four or five multi-year projects that have to address a wide range of problems and disciplines– but NSF expects that the grants will go to wide-ranging partnerships. (This forum is one place interested parties can find partners.)

I gave a talk as part of the Tools and Technologies panel, where I stressed the importance of discovery as part of effective preservation and content, and discussed the design of architectures (and example tools and interfaces) that can promote discovery and use of repository content. My talk echoed in part a talk I gave earlier this year at a Palinet symposium, but focused on repository access rather than cataloging.

I’m told that all the presentations were captured on video, and hopefully those videos, and the slides from the presentations, will all be placed online by the conference organizers. In the meantime, my selected works site has a PDF of the slides and a draft of the script I used for my presentation. I scripted it to make sure I’d stay within the fairly short time slot while still speaking clearly. The talk as delivered was a bit different (and hopefully more polished) than this draft script, but I hope this file will let folks contemplate at leisure the various points I went through rather quickly.

I’d like to thank the folks at the National Archives and UMD (especially Ken Thibodeau, Robert Chadduck, and Joseph Jaja) for putting on such an interesting and well-run symposium, and giving me the opportunity to participate. I hope to see more forums bringing together large-scale digital preservation researchers and practitioners in the years to come.

Author: John Mark Ockerbloom

I'm a digital library architect and planner at the University of Pennsylvania.