DLF ILS Discovery Interfaces: Revised recommendation draft open for comments

Today we released a draft of “revision 1.1” of the ILS Discovery Interfaces recommendation. As I discussed in my previous post, this revision is intended to clarify the implementation of the Basic Discovery Interfaces recommended for integrated library systems (ILSs), and make them more useful for discovery applications.

On the DLF ILS Discovery Interfaces web site, you’ll find the revision draft and the accompanying schema, along with the initial official recommendation (or “revision 1.0”). My last post included a summary of the major changes from version 1.0.

We’d like to give folks a chance to comment on the changes before we make them official. We’ll take comments until November 18, shortly after the end of the DLF Fall Forum, so folks wanting to go to our birds of a feather session on implementing the recommendations can talk with us there and still have some time to send in written comments. (Or, you can send them in ahead of time so we can think on them at the forum.) Comments may be emailed to me, and I will pass them along to the rest of the task group. There’s also still the open Google Group for discussions.

I’m hoping we’ll start to see Basic Discovery Interfaces implementations, clients, and test suites soon based on the new recommendations and schema. They’re not that different from version 1.0, but should be more useful. I’m working on revising my example implementation now, and hope to see more implementations in the not too distant future. And I look forward to hearing interested people’s thoughts and comments as well.

Update on ILS-Discovery Interface work

It’s been a while since I posted about the official release of the Digital Library Federation’s ILS Discovery interface recommendation. Marshall Breeding recently posted a useful update on the further development of the interfaces at Library Technology Guides. As the chair of the ILS-DI task group, which is now charged with some followup work described in Marshall’s article, I’d like to add some further updates.

As Marshall mentions, the DLF convened a meeting in August inviting potential developers of the ILS-Discovery interfaces to discuss implementations of recommendations of the DLF’s ILS-Discovery Interface task group. In the course of the discussion, a few changes were suggested and generally agreed upon by the participants. Updating the recommendation was not the main purpose of the meeting, but as we discussed things, it became clear that some clarifications and small updates to the recommendation would be helpful for producing more consistent and useful implementations of the Basic Discovery Interfaces, the interoperability “Level 1” that was agreed to in the Berkeley Accord.

The ILS-DI task group is therefore preparing a slight revision, to be known as “version 1.1” of the recommendation. A draft of this revision will be released for comment shortly, and will include the following changes, summarized here to give developers some idea of what to expect:

  • For the HarvestBibliographicRecords and HarvestExpandedRecords functions, it will be clarified that the function should return the records that are available for discovery. (That is, suppressed records and others that might be in the ILS but aren’t intended for discovery will not be shown, except possibly as deleted records as described below).
  • Support for the OAI-PMH binding for these functions will be noted as required. (That is, it must be supported for full ILS-BDI compliance; other bindings can be supported too.) It will also be noted that Dublin Core is a minimum requirement for returned records (as it is for OAI-PMH in general), and that if MARC records exist in the ILS (or are produced by it), MARC XML should also be available.
  • We also will require some level of support for deleted records (which includes records no longer available for discovery), to make it feasible for discovery apps to keep in sync with the ILS’s records via incremental harvesting. We’ll note that ILSs should document how long they keep deleted-record information.
  • For GetAvailability, the simple availability schema defined in the document will be noted as required. (That is, it should be returned for full ILS-BDI compliance; other schemas can also be supported and returned on request.) There was some talk at the August meeting about completely dropping the alternative NCIP and ILS-Holdings schemas as replies to GetAvailability, because of their complexity. The draft at this point doesn’t go that far, but it will specify the simple availability schema as the default and required schema to support in the ILS-BDI profile.
  • That simple availability schema will also be augmented slightly to include an optional location element, distinct from the availability-message element. Location was the one specific data field that many implementors said was essential to include that wasn’t in the original schema.
  • We will also add a request parameter to GetAvailability for specifying whether bib- or item-level availability is desired when a bib identifier is given. (Formerly the server had the option of choosing the level in that case; there was a strong sentiment in discussions that the client should be able to specify this.)
  • We expect to leave GoToBibliographicRequestPage alone.
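Since the OAI-PMH binding will be required for the harvesting functions, a discovery application can stay in sync with the ILS by issuing periodic ListRecords requests and honoring deleted-record headers. Here is a minimal sketch in Python of how a harvester might process such a response; the sample response, repository domain, and record identifiers are invented for illustration.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def parse_list_records(xml_text):
    """Parse an OAI-PMH ListRecords response into (identifier, datestamp,
    deleted, title) tuples, so a harvester can update or purge its copies."""
    root = ET.fromstring(xml_text)
    results = []
    for record in root.iter(OAI + "record"):
        header = record.find(OAI + "header")
        identifier = header.findtext(OAI + "identifier")
        datestamp = header.findtext(OAI + "datestamp")
        # Deleted records carry status="deleted" on the header, with no metadata.
        deleted = header.get("status") == "deleted"
        title = None
        if not deleted:
            title_el = record.find(".//" + DC + "title")
            title = title_el.text if title_el is not None else None
        results.append((identifier, datestamp, deleted, title))
    return results

# A hypothetical two-record response: one live record, one deletion.
SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.edu:bib1</identifier>
        <datestamp>2008-10-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample record</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header status="deleted">
        <identifier>oai:example.edu:bib2</identifier>
        <datestamp>2008-10-02</datestamp>
      </header>
    </record>
  </ListRecords>
</OAI-PMH>
"""

for ident, stamp, deleted, title in parse_list_records(SAMPLE):
    print(ident, stamp, "DELETED" if deleted else title)
```

An incremental harvest would add from and until arguments to the ListRecords request, and purge local copies of any records returned with a deleted status; this is why the draft asks ILSs to document how long they retain deleted-record information.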

The new draft will be released shortly, and be open to public comment for at least a couple of weeks before we make a last edit for an official release. Feedback is welcome and encouraged, and public discussion can take place in the ILS-DI Google Group, among other places.

The new draft will be accompanied by a revised XML schema. The current schema, reflecting the original or “version 1.0” official recommendation, can be found here. For the location of the new one (which is not yet posted), substitute “1.1” for “1.0” in the schema URL. (We intend to keep the old schema up for a good while after the new one is posted, for compatibility with implementations based on the original recommendation.)
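To make the GetAvailability changes above more concrete, here is a sketch of how a client might read a simple availability response, including the new optional location element. The namespace URI and element names below are my guesses based on the schema URL pattern and the draft's description (the posted schema is authoritative), and the item identifier and location value are made up.

```python
import xml.etree.ElementTree as ET

# Assumed namespace, following the "substitute 1.1 for 1.0" URL pattern.
DLF = "{http://diglib.org/ilsdi/1.1}"

def summarize_availability(xml_text):
    """Extract (identifier, status, location) from a simple availability
    response; location is the optional element added in revision 1.1."""
    root = ET.fromstring(xml_text)
    out = []
    for avail in root.iter(DLF + "simpleavailability"):
        out.append((
            avail.findtext(DLF + "identifier"),
            avail.findtext(DLF + "availabilitystatus"),
            avail.findtext(DLF + "location"),
        ))
    return out

# A hypothetical response for one available item.
SAMPLE = """<?xml version="1.0"?>
<dlf:collection xmlns:dlf="http://diglib.org/ilsdi/1.1">
  <dlf:record>
    <dlf:simpleavailability>
      <dlf:identifier>item:1234</dlf:identifier>
      <dlf:availabilitystatus>available</dlf:availabilitystatus>
      <dlf:location>Main Library stacks</dlf:location>
    </dlf:simpleavailability>
  </dlf:record>
</dlf:collection>
"""

print(summarize_availability(SAMPLE))
```

Because location is distinct from the free-text availability message, a discovery application can display or facet on it directly rather than parsing it out of a human-readable string.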

I will also be leading a Birds of a Feather session at the upcoming Digital Library Federation fall forum in Providence next month. This will be an opportunity for developers of interfaces implementing the DLF’s ILS-Discovery interface recommendations to present their work to others, ask and answer questions about the recommendations and their implementations, and discuss further development initiatives and coordination. If you’d like us to set aside some time to show or discuss a particular initiative or project you’re working on, let me know.

Watch this space and the ILS-DI Google Group for further developments. And if you can come to the session at DLF in November, I hope we’ll have an interesting and enlightening discussion there as well.

(Update, Oct. 30: The draft of the revision is now out for comment.)

What repositories do: The OAIS model

(Another post in an ongoing series on repositories.)

In my previous post, I mentioned the OAIS reference model as an influential framework for thinking about and planning repositories intended for long-term preservation. If you’re familiar with some of the literature or marketing for digital repositories, you may well have seen OAIS mentioned, or seen a particular system marketed as “OAIS compliant”. You may have also noticed remarks that it’s not always clear in practice what OAIS compliance means. The JISC Standards Catalogue notes “The [OAIS] documentation is quite long and complex and this may prove to be a barrier to smaller repositories or archives.” A common impression I’ve heard of OAIS is that it’s a nice idea that one should really try to pay more attention to, but complex enough that one will have to wait for some less busy time to think about it. Perhaps, one might think, if we just pick a repository system whose marketing says it’s OAIS compliant, we can be spared thinking about it ourselves.

I think we can do better than that, even in smaller projects. The basics of the OAIS model can be understood without having to be conversant with all 148 pages of the reference document. Those basics can help you think about what you need to be doing if you’re planning on preserving information for a long term (as most libraries do). The basics of OAIS also make it clear that following the model isn’t just a matter of installing the right product, but of having the right processes. It’s made very explicit that repository curators need to work with the people who produce and use the information in the repository, and make sure that the repository acquires all the information necessary for its primary audience to use and understand this information far into the future.

To help folks get oriented, here’s a quick introduction to OAIS. It won’t tell you everything about the model, but it should let you see why it’s useful, how you can use it, and what else you might need to consider in your repository planning.

What OAIS is and isn’t

First, let’s start with some basics: OAIS is a reference model for Open Archival Information Systems (hence the acronym) that’s now an ISO standard, but is also freely available. It was developed by NASA’s Consultative Committee for Space Data Systems, who have had to deal with large volumes of data and other records generated by decades of space missions and observations, so they’ve had to think hard about how to manage and preserve it. To develop OAIS, they had open discussions with lots of other people and groups (like the National Archives) who were also interested in long-term preservation. OAIS is called “Open” because of the open process that went into creating it. It does not require that archives be open access or have an open architecture, and it has no direct relation to the similarly-acronymed Open Archives Initiative (OAI). (Though all of these things are also useful to know about in their own right.) An “archival information system” or “archive” can simply be thought of as a repository that’s responsible for long-term preservation of the information it manages.

Unlike many standards, OAIS specifies no particular implementation, API, data format, or protocol. Instead, it’s an abstract model that provides four basic things:

  • A vocabulary for talking about common operations, services, and information structures of a repository. (This alone can provide very useful common ground for different people who use and produce repositories to talk to each other.) A glossary of this vocabulary can be found in section 1 of the reference model.
  • A simple data model for the information that a repository takes in (or “ingests”, to use the OAIS vocabulary), manages internally, and provides to others. This information is assumed to be in distinct, discrete packages known as Submission Information Packages (SIPs) for ingestion, Archival Information Packages (AIPs) for internal management, and Dissemination Information Packages (DIPs) for providing the information to consumers (or to other repositories). These packages include not just raw content, but also metadata and other information necessary for interpreting, preserving, and packaging this content. They have different names because the information they contain can take different forms as it goes into, through, and out of the archive. They are described in more detail in sections 2 and 4 of the reference model.
  • A set of required responsibilities of the archive. In brief, the archive (or its curators) must negotiate with producers of information to get appropriate content and contextual information, work with a designated community of consumers to make sure they can independently understand this information, and follow well-defined and well-documented procedures for obtaining, preserving, authenticating, and providing this information. Section 3 of the model goes into more detail about these responsibilities, and section 5 discusses some of the basic methodologies involved in preservation.
  • A set of recommended functions for carrying out the archive’s required responsibilities. These are broken up into 6 functional modules: ingest, data management, archival storage, access, administration, and preservation planning. The model describes about half a dozen functions in each module (ingest, for example, includes things like “receive submission”, “quality assurance”, and “generate AIP”) and data flows and dependencies that might exist between the functions. Some of these functions are automated, some (like “monitor technology”) are carried out by humans, and some may involve a combination of human oversight and automated assistance. The functions are described in more detail in section 4 of the model (with issues of multi-archive interoperability discussed in Section 6.)
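As a way of internalizing the data model described above, here is a much-simplified illustrative sketch in Python. The field breakdown (content, representation information, and preservation description information such as provenance and fixity) follows the model's own terminology, but the class design and the placeholder ingest step are just my shorthand, not anything the standard prescribes.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class InformationPackage:
    """Much-simplified sketch of an OAIS information package:
    content plus the information needed to interpret and preserve it."""
    content: bytes            # the data object being preserved
    representation_info: str  # how to interpret the content (e.g. format docs)
    provenance: list = field(default_factory=list)  # PDI: history of the content
    fixity: str = ""          # PDI: e.g. a checksum
    # OAIS also calls for reference and context information in the PDI.

def ingest(sip: InformationPackage) -> InformationPackage:
    """Turn a submitted package (SIP) into an archival package (AIP).
    A real ingest function would validate and possibly normalize content;
    here we just record a checksum and a provenance note as a placeholder."""
    return InformationPackage(
        content=sip.content,
        representation_info=sip.representation_info,
        provenance=sip.provenance + ["ingested from SIP"],
        fixity=hashlib.sha256(sip.content).hexdigest(),
    )

sip = InformationPackage(b"hello", "UTF-8 text")
aip = ingest(sip)
print(aip.fixity[:8], aip.provenance)
```

A DIP would be derived from the AIP in a similar way on the access side, repackaging the content in whatever form the designated community of consumers needs.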

OAIS conformance and usage

It is important to note that OAIS compliance simply requires fulfilling the required responsibilities, and supporting the basic OAIS data model of information packages. A repository is not required to implement all the functions recommended in the OAIS model, or replicate the detailed internal data flows, to be OAIS compliant. But it can be very useful to look through the functions in any case, both to make sure that your repository is doing everything it needs to do, and to see how the big problem of reliable data preservation can be broken down into smaller, more manageable operations and workflows.

You may also find the functions a useful reference point for detailed descriptions of the exact formats and protocols your repository uses for ingesting and storing information, providing content to users, and migrating it to other repositories. Although the OAIS model does not itself provide specific formats or protocols to use, it makes it clear that a repository provider needs to specify these so it can receive information from producers and make it clearly understandable to consumers.

The OAIS model has been used to help construct more detailed criteria for trusted repositories, as well as checklists for repository audit and certification. In most cases, repositories will operate perfectly well without satisfying every last criterion or checklist item. At the Partnerships in Innovation symposium I attended last week, Don Sawyer, one of the main people behind OAIS, remarked that the archives where he worked satisfied about 80% of the trusted repository checklist items. But he still found it useful to go through the whole list to verify that certain functions were not relevant or required for their repository needs, as well as to spot aspects of the repositories (like disaster recovery or provenance tracking) that might need more attention. Similarly, you can go through the recommended OAIS functions and data-model breakdowns to evaluate what’s important to have in your repository, what can be safely omitted, and what might need more careful attention or documentation.

What else you need to think about

Although the OAIS model includes examples of various kinds of repositories that might use it, it’s at its heart a fairly generic, domain-independent model, largely concerned with preservation needs. It doesn’t say a whole lot about how a repository needs to interact with specific communities to fulfill its purposes. For instance, in the talk I gave last week, I stressed the importance of designing the architecture of repositories to support rich discovery mechanisms. As Ken Thibodeau noted in later conversation, the access model of OAIS is more primitive than the architectures I described. OAIS is not incompatible with those architectures, but designing the right kinds of discovery architectures requires going beyond the criteria of OAIS itself.

You’ll also need to think carefully about the needs of the communities you’re collecting from and serving. The OAIS model notes this requirement, but doesn’t pursue it in depth. I can understand why it doesn’t, since those needs are highly dependent on the domain you’re working in. A repository intended to preserve static, published text documents for possible use in legal deposition will need to interact with its community very differently from, say, a repository intended to manage, capture, and ultimately preserve works in progress used in ongoing research and teaching. They both have preservation requirements that OAIS may well address effectively, but designing effective repositories for these disparate needs may require going well beyond OAIS, doing detailed requirements analyses, and assessing benefits and costs of various options.

I’ll talk more about requirements for particular kinds of repositories in later posts. But I hope I’ve made it clear how the OAIS model can be useful for general thinking and planning what a repository needs to do to manage and preserve its content. If it sounds promising, you can download the full OAIS model as a PDF. A revised document that will clarify some of the terminology and recommendations, but will not substantially change the model, is expected to be released in early 2009.

Surpassing all records

What will happen to all the White House emails after George W. Bush leaves office in January? Who will take charge of all the other electronic records of the government, after they’re no longer in everyday use? How can you archive 1 million journal articles a month from dozens of different publishers? Can the virtual world handle the Large Hadron Collider’s generation of 15 petabytes of data per year without being swallowed by a singularity? And how can we find what we need in all these bits, anyway?

These were some of the digital archiving challenges discussed this week at the Partnerships in Innovation II symposium in College Park, Maryland. Co-sponsored by the National Archives and Records Administration and the University of Maryland, the symposium brought together experts and practitioners in digital preservation for a day and a half of talks, panels, and demonstrations. It looked to me like over 200 people attended.

This conference was a sequel to an earlier symposium that was held in 2004. Many of the ideas and plans presented at the earlier forum have now come to fruition. The symposium opened with an overview of NARA’s Electronic Records Archives (ERA), a long-awaited system for preserving massive amounts of records from all federal government agencies, which went live this summer. It’s still in pilot mode with a limited number of agencies, but will be importing lots of electronic records soon, including the Bush administration files after the next president is inaugurated.

The symposium also reviewed progress with older systems and concepts. The OAIS reference model, a framework for thinking about and planning long-term preservation repositories, influences not only NARA’s ERA, but many other initiatives and repositories, including familiar open source systems like Fedora and DSpace. Some of the developers of OAIS, including NASA’s Don Sawyer, reviewed their experiences with the model, and the upcoming revision of the standard. Fedora and DSpace themselves have been around long enough to be subjects of a “lessons learned” panel featuring speakers who have built ambitious institutional repositories around them.

The same panel also featured Evan Owens of Portico discussing the extensive testing and redesign they had to do to scale up their repository to handle the million articles per month mentioned at the top of this post. Heavily automated workflows were a big part of this scaling up, a strategy echoed by the ERA developers and a number of the other repository practitioners, some of whom showed some interesting tools for automatically validating content, and for creating audit trails for certification and rollback of repository content.

Networks of interoperating repositories may allow digital preservation to scale up further still. That theme arose in a couple of the other panels, including the last one, dedicated to a new massive digital archiving initiative: the National Science Foundation’s Datanet. NSF envisions large interoperating global networks of scientific data that could handle many Large Hadron Colliders worth of data, and would make the collection, sharing, reuse, and long-term preservation of scientific data an integral part of scientific research and education. The requirements and sizes of the grants are both prodigious ($20 million each to four or five multi-year projects that have to address a wide range of problems and disciplines), but NSF expects that the grants will go to wide-ranging partnerships. (This forum is one place interested parties can find partners.)

I gave a talk as part of the Tools and Technologies panel, where I stressed the importance of discovery as part of effective preservation and use of content, and discussed the design of architectures (and example tools and interfaces) that can promote discovery and use of repository content. My talk echoed in part a talk I gave earlier this year at a Palinet symposium, but focused on repository access rather than cataloging.

I’m told that all the presentations were captured on video, and hopefully those videos, and the slides from the presentations, will all be placed online by the conference organizers. In the meantime, my selected works site has a PDF of the slides and a draft of the script I used for my presentation. I scripted it to make sure I’d stay within the fairly short time slot while still speaking clearly. The talk as delivered was a bit different from (and hopefully more polished than) this draft script, but I hope this file will let folks contemplate at leisure the various points I went through rather quickly.

I’d like to thank the folks at the National Archives and UMD (especially Ken Thibodeau, Robert Chadduck, and Joseph Jaja) for putting on such an interesting and well-run symposium, and giving me the opportunity to participate. I hope to see more forums bringing together large-scale digital preservation researchers and practitioners in the years to come.