Repository services, Part 1: Galleries vs. self-storage units

Back near the start of my occasional series on repositories, I noted that we had not just one but a number of repositories, each serving different purposes.

In tight budgetary times, this approach might seem questionable.  Right now, we’re putting up a new repository structure (in addition to our existing ones) to keep our various digitized special collections and make them available for discovery and use.  We hope this will make our digital special collections more uniformly manageable, and less costly to maintain.

At the same time, we’re continuing to maintain an institutional repository of our scholars’ work on a completely different platform, one for which we pay an annual subscription fee.  I’ve heard more than one person ask “Well, once our new repository is up, can’t we just move the existing institutional repository content into it, and drop our subscription?”

To which I generally answer: “We might do that at some point, but right now it’s worth maintaining the subscription past the opening date of our new repository.”  The basic reason is that the two repositories not only have different purposes, but also, at least in their current uses, support very different kinds of interactions, with different kinds of audiences.

The interactions we need initially for the repository we’re building for our special collections are essentially internal ones.  Special collections librarians create (or at least digitize) a thematic set of items, give them detailed cataloging, and deposit them en masse into the collection.  The items are then exposed via machine interfaces to our discovery applications, which then let users find and interact with the contents in ways that our librarians think will best show them off.

The repository itself, then, can work much like a self-storage unit.  Every now and then we move in a bunch of stuff, and then later we bring it out into a nicer setting when people want to look at it.  Access, discovery, and delivery are built on top of the repository, in separate applications that emphasize things like faceted browsing, image panning and zooming, and rare book page display and page turning.

Our institutional repository interacts with our community quite differently.  Here, the content is created by various scholars who are largely outside the library, and who may deposit items bit by bit whenever they get around to it (or when library staff can find the time to bring in their content).  They want to see their work widely read, cited, and appreciated.  They don’t want to spend more time than they have to putting stuff in (they’ve got work to do), and they want their work quickly and easily accessible.  And they’d like to know when their work is being viewed.  In short, they need a gallery, not just a self-storage unit.  They want something that lets them show off and distribute their work in elegant ways.

Our institutional repository applications, bundled with the repository, thus emphasize things like full text search and search-engine openness, instant downloads of content, and notification of colleagues uploading and downloading papers.

We could in theory build similar applications ourselves, and layer them on top of the same “self-storage” repository structure we use for special collections.  (Museums likewise often have their exhibit galleries literally on top of the bulk of their collections, which are kept in basements or other compact storage areas.)  But it would take us a while to build the applications we need, so for now we see it as a better use of our resources to rely on the applications bundled with our institutional repository service.

(An alternative, of course, would be to see if an existing open source application would serve our needs.  I hope to talk more about open source repository software in a future post, but we haven’t to date decided to run our institutional repository that way.)

I hope I’ve at least made it clear that for a viable institutional repository, you need quite a bit more than just “a place to put stuff”: you need a suite of services that support its purposes.  In Part 2, I’ll enumerate some of the specific services that we need or find useful in our institutional scholarship repository.

Posted in repositories | 1 Comment

Public Domain Day 2009: Freeing the libraries

In many countries, January 1 isn’t just the start of a new year: it’s the time when a new year’s worth of works are welcomed into the public domain.  As I noted in last year’s Public Domain Day post, countries that use the copyright terms specified by the Berne Convention bring works into the public domain on the first January 1 that’s more than 50 years after the death of their authors.  So today, most works by authors who died in 1958 join the public domain in those countries.  This page at authorandbookinfo.com lists many such authors, and their books.  Some of the more notable names include James Branch Cabell, Rachel Crothers, Dorothy Canfield Fisher, C. M. Kornbluth, Mary Roberts Rinehart, Robert W. Service, and Ralph Vaughan Williams.

Many countries, however, have extended their copyright terms in recent years.  Most European Union countries, for instance, took 20 years’ worth of works out of the public domain in the 1990s when the EU mandated that copyright terms be extended to run for the life of the author plus 70 years.  This year, they come a little bit closer to recovering their lost public domain, welcoming back works by authors who died in 1938, including people like Karel Čapek, Zona Gale, Georges Méliès, Constantin Stanislavsky, Osip Mandelstam, Owen Wister, and Thomas Wolfe.

In some other countries, very little is entering the public domain today.  Here in the US, we’re midway through a freeze on most copyright expirations, resulting from a term extension enacted in 1998.  We now have 10 years to go until copyrights on published works start expiring again due to age.  (By 1998, all works copyrighted prior to 1923 had entered the public domain.  Remaining copyrights from 1923 are scheduled to expire at the start of 2019.)  Some special interests would like to make copyright terms even longer (even “forever less one day”, as Congresswoman Mary Bono requested on behalf of the movie industry).  Those of us who value the public domain will need to ensure that it is not further eroded, and that copyrights are allowed to expire on schedule.  This is in keeping with the intent of the country’s founders, who specified in the Constitution that copyrights were meant to last only for “limited times”.

But even though few works are entering the public domain in the US today, many more works are now freely and easily available to the public than a year ago.  Much of this is thanks to initiatives like Google Books and the Open Content Alliance, which are digitizing books and other works that libraries have acquired and preserved.  Many of the digitized works are in the public domain, and these projects have been making them freely readable and downloadable when they can confirm their public domain status.  And now that Google has negotiated a settlement with book publisher and author groups, they plan to be more proactive about identifying and releasing public domain works, including works published after 1922 that are out of copyright (but are not as easily identified as public domain as older books are).

These works have been part of the public domain for years, but when they were simply sitting on the shelves of a few research libraries, they weren’t doing the public much good.  Once they’re digitized, though, and their digitizations and descriptions are shared online, they can be much more easily found, read, adapted, and reused by anyone online.  By opening up the treasure trove of public domain expression that libraries have preserved, we magnify its value.  When libraries share their intellectual endowment, they better fulfill their mission to bring art and knowledge to readers, and make it easy for readers to learn, build on, and be enriched by this knowledge.

I wish I could say that libraries always acted with this understanding.   Unfortunately, all too often libraries and affiliated organizations have been resistant or slow to share the information they compile and control.  The effective value of what libraries offer has been significantly diminished as a result.

Sometimes libraries simply have not moved as quickly as they could.  The Copyright Office has long provided online access to copyright records, but only from 1978 onward.  I started digitizing older copyright records over 10 years ago, and a few libraries have started doing so as well, but many older records have not yet been publicly digitized, though they’re available in printed form in many government depository libraries.  These records can make it much easier to verify the public domain status of many works, and then make those works available to the public.

Sometimes libraries and affiliated organizations put up their own restrictions on sharing information they already have in digital form.  I had a series of posts in November, for instance, criticizing OCLC’s newly revised restrictions on sharing and reusing catalog records that libraries have contributed to WorldCat, the largest shared cataloging resource for libraries.  The data in WorldCat can be the basis for many useful and innovative applications to direct readers towards useful information resources, and information about those resources.  And in December, an extremely useful downloadable semantic web representation of Library of Congress subject headings, the basis for information discovery applications like this one, was ordered taken down by LC administrators.

In the new year, I hope to encourage libraries to be more open in sharing their knowledge resources (and to support partners that also enable such openness).  My gifts to the public domain this year are in that spirit.

The first one, dedicated immediately to the public domain, is the start of a simple, free decimal classification system, intended to be reasonably compatible with certain existing library standards, but freely available and usable by anyone for any purpose.  (I created this after someone requested such a system for their institutional repository, and I found out that the current Dewey Decimal system is subject to usage restrictions based on copyright and trademark.)  While this is more of a proof of concept than something I expect libraries to adopt in great numbers, I hope it inspires further open sharing of library metadata and standards.

Also, as I did last year, I’m dedicating another year’s worth of copyrights that I can control, this time from 1994, to the public domain, so that they follow the initial 14-year copyright term originally prescribed by this country’s founders.  These copyrights include the first versions of Banned Books Online, and the first database-driven versions of The Online Books Page.  Versions of these resources from 1994 and earlier are now given to the public domain.

I hope readers find value in these, and all the other public domain and freely licensed works they can enjoy and use online. Happy Public Domain Day!

Update: See also the Public Domain Day posts at Creative Commons, and the Center for Internet and Society.

Posted in copyright, discovery, open access | 3 Comments

Revised ILS-Discovery interface recommendation released

I’ve just sent the following announcement out to the ILS-Discovery Interface Google Group:

The Digital Library Federation’s ILS-DI task group has officially released revision 1.1 of their recommendation for standard interfaces for integrating the data and services of the Integrated Library System (ILS) with new applications supporting user discovery.

Our initial official release (“revision 1.0”) was made in June, and included a recommendation of a basic level of interoperability (the Basic Discovery Interfaces, or “Level 1” interoperability) that was agreed to by many ILS vendors in the “Berkeley Accord”.

In August, the DLF convened an implementors’ meeting in Berkeley that was attended by a number of developers and vendors of ILS and discovery software.  In the meeting, we agreed to make certain changes to clarify the requirements of the basic level of compliance, and to make them more useful for discovery applications.  A revised draft that included these changes was made available for comment at the end of October.  We now release the final version.

We hope that this revision will be useful for people implementing ILS’s, ILS interaction layers, and discovery applications, and enable easier interoperation between ILS’s (existing and planned) and innovative discovery applications of all kinds.  We look forward to seeing implementations of these recommendations (some of which are already in progress), and further progress towards interoperability and improved discovery of the knowledge resources of libraries.

I’d like to echo the thanks I gave on the release of our “revision 1.0” back in the summer, and again thank everyone who helped write, comment on, and support this recommendation.

And now, I think I’ve got some implementation work to do…

Posted in architecture, discovery, libraries | Comments Off on Revised ILS-Discovery interface recommendation released

DLF ILS Discovery Interfaces: Revised recommendation draft open for comments

Today we released a draft of “revision 1.1” of the ILS Discovery Interfaces recommendation. As I discussed in my previous post, this revision is intended to clarify the implementation of the Basic Discovery Interfaces recommended for integrated library systems (ILS’s), and make them more useful for discovery applications.

On the DLF ILS Discovery Interfaces web site, you’ll find the revision draft and the accompanying schema, along with the initial official recommendation (or “revision 1.0”). My last post included a summary of the major changes from version 1.0.

We’d like to give folks a chance to comment on the changes before we make them official. We’ll take comments until November 18, shortly after the end of the DLF Fall Forum, so folks wanting to go to our birds of a feather session on implementing the recommendations can talk with us there and still have some time to send in written comments. (Or, you can send them in ahead of time so we can think on them at the forum.) Comments may be emailed to me, and I will pass them along to the rest of the task group. There’s also still the open Google Group for discussions.

I’m hoping we’ll start to see Basic Discovery Interfaces implementations, clients, and test suites soon based on the new recommendations and schema. They’re not that different from version 1.0, but should be more useful. I’m working on revising my example implementation now, and hope to see more implementations in the not too distant future. And I look forward to hearing interested people’s thoughts and comments as well.

Posted in architecture, discovery, libraries | Comments Off on DLF ILS Discovery Interfaces: Revised recommendation draft open for comments

Update on ILS-Discovery Interface work

It’s been a while since I posted about the official release of the Digital Library Federation’s ILS Discovery interface recommendation. Marshall Breeding recently posted a useful update on the further development of the interfaces at Library Technology Guides. As the chair of the ILS-DI task group, which is now charged with some followup work described in Marshall’s article, I’d like to add some further updates.

As Marshall mentions, the DLF convened a meeting in August inviting potential developers of the ILS-Discovery interfaces to discuss implementations of recommendations of the DLF’s ILS-Discovery Interface task group. In the course of the discussion, a few changes were suggested and generally agreed upon by the participants. Updating the recommendation was not the main purpose of the meeting, but as we discussed things, it became clear that some clarifications and small updates to the recommendation would be helpful for producing more consistent and useful implementations of the Basic Discovery Interfaces, the interoperability “Level 1” that was agreed to in the Berkeley Accord.

The ILS-DI task group is therefore preparing a slight revision, to be known as “version 1.1” of the recommendation. A draft of this revision will be released for comment shortly, and will include the following changes, summarized here to give developers some idea of what to expect:

  • For the HarvestBibliographicRecords and HarvestExpandedRecords functions, it will be clarified that the function should return the records that are available for discovery.  (That is, suppressed records and others that might be in the ILS but aren’t intended for discovery will not be shown, except possibly as deleted records as described below.)
  • Support for the OAI-PMH binding for these functions will be noted as required. (That is, it must be supported for full ILS-BDI compliance; other bindings can be supported too.) It will also be noted that Dublin Core is a minimum requirement for returned records (as it is for OAI-PMH in general), and that if MARC records exist in the ILS (or are produced by it), MARC XML should also be available.
  • We also will require some level of support for deleted records (which includes records no longer available for discovery), to make it feasible for discovery apps to keep in sync with the ILS’s records via incremental harvesting. We’ll note that ILSs should document how long they keep deleted-record information.
  • For GetAvailability, the simple availability schema defined in the document will be noted as required. (That is, it should be returned for full ILS-BDI compliance; other schemas can be supported too if asked for and supported.) There was some talk at the August meeting about completely dropping the alternative NCIP and ILS-Holdings schemas as replies to GetAvailability, because of their complexity. The draft at this point doesn’t go that far, but it will specify the simple availability schema as the default, and the required, schema to support in the ILS-BDI profile.
  • That simple availability schema will also be augmented slightly to include an optional location element, distinct from the availability-message element. Location was the one specific data field that many implementors said was essential to include that wasn’t in the original schema.
  • We will also add a request parameter to GetAvailability for specifying whether bib-level or item-level availability is desired when a bibliographic identifier is given.  (Formerly the server had the option of choosing the level in that case; there was a strong sentiment in discussions that the client should be able to specify this.)
  • We expect to leave GoToBibliographicRequestPage alone.
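
To sketch what the required OAI-PMH binding and deleted-record support mean for a harvesting client, here’s a minimal Python fragment that separates live records from deleted ones in a ListRecords response.  The element names follow the OAI-PMH 2.0 specification, but the sample response (and its identifiers) are invented for illustration.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

def split_records(listrecords_xml):
    """Separate live records from deleted ones in an OAI-PMH
    ListRecords response, as an incremental harvester must do
    to keep its discovery index in sync with the ILS."""
    root = ET.fromstring(listrecords_xml)
    live, deleted = [], []
    for record in root.iter(OAI + "record"):
        header = record.find(OAI + "header")
        ident = header.findtext(OAI + "identifier")
        if header.get("status") == "deleted":
            deleted.append(ident)   # remove from the discovery index
        else:
            live.append(ident)      # (re)index this record
    return live, deleted

# A made-up two-record response: one live, one deleted.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.edu:1</identifier>
        <datestamp>2008-10-30</datestamp>
      </header>
      <metadata/>
    </record>
    <record>
      <header status="deleted">
        <identifier>oai:example.edu:2</identifier>
        <datestamp>2008-10-31</datestamp>
      </header>
    </record>
  </ListRecords>
</OAI-PMH>"""

live, deleted = split_records(SAMPLE)
```

A real harvester would fetch such responses over HTTP, using the protocol’s from parameter for incremental updates, and drop the deleted identifiers from its index on each pass.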

The new draft will be released shortly, and will be open to public comment for at least a couple of weeks before we make a last edit for an official release.  Feedback is welcome and encouraged, and public discussion can take place in the ILS-DI Google Group, among other places.

The new draft will be accompanied by a revised XML schema. The current schema, reflecting the original or “version 1.0” official recommendation, can be found here. For the location of the new one (which is not yet posted), substitute “1.1” for “1.0” in the schema URL. (We intend to keep the old schema up for a good while after the new one is posted, for compatibility with implementations based on the original recommendation.)
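
As a rough illustration of the kind of reply the simple availability schema describes, here’s a Python sketch that builds a toy response with the new optional location element kept distinct from the human-readable availability message.  The element names and identifiers here are invented for illustration only, not copied from the actual schema; consult the posted schema for the real ones.

```python
import xml.etree.ElementTree as ET

def build_availability(item_id, status, message, location=None):
    """Build a toy availability reply in the spirit of the simple
    availability schema; element names are invented."""
    root = ET.Element("availability", {"id": item_id})
    ET.SubElement(root, "status").text = status
    ET.SubElement(root, "message").text = message
    if location is not None:
        # Location is now its own element, distinct from the
        # free-text availability message.
        ET.SubElement(root, "location").text = location
    return root

reply = build_availability(
    "item:12345", "available",
    "On shelf; ask at the desk if not found",
    location="Main Library, 3rd floor stacks")
xml_text = ET.tostring(reply, encoding="unicode")
```

Keeping location as a separate element, rather than burying it in the message text, is what lets a discovery application display or facet on it directly.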

I will also be leading a Birds of a Feather session at the upcoming Digital Library Federation fall forum in Providence next month. This will be an opportunity for developers of interfaces implementing the DLF’s ILS-Discovery interface recommendations to present their work to others, ask and answer questions about the recommendations and their implementations, and discuss further development initiatives and coordination. If you’d like us to set aside some time to show or discuss a particular initiative or project you’re working on, let me know.

Watch this space and the ILS-DI Google Group for further developments. And if you can come to the session at DLF in November, I hope we’ll have an interesting and enlightening discussion there as well.

(Update, Oct. 30: The draft of the revision is now out for comment.)

Posted in architecture, discovery, libraries | 3 Comments

What repositories do: The OAIS model

(Another post in an ongoing series on repositories.)

In my previous post, I mentioned the OAIS reference model as an influential framework for thinking about and planning repositories intended for long-term preservation. If you’re familiar with some of the literature or marketing for digital repositories, you may well have seen OAIS mentioned, or seen a particular system marketed as “OAIS compliant”. You may have also noticed remarks that it’s not always clear in practice what OAIS compliance means. The JISC Standards Catalogue notes “The [OAIS] documentation is quite long and complex and this may prove to be a barrier to smaller repositories or archives.” A common impression I’ve heard of OAIS is that it’s a nice idea that one should really try to pay more attention to, but complex enough that one will have to wait for some less busy time to think about it. Perhaps, one might think, if we just pick a repository system whose marketing says it’s OAIS compliant, we can be spared thinking about it ourselves.

I think we can do better than that, even in smaller projects. The basics of the OAIS model can be understood without having to be conversant with all 148 pages of the reference document. Those basics can help you think about what you need to be doing if you’re planning on preserving information for a long term (as most libraries do). The basics of OAIS also make it clear that following the model isn’t just a matter of installing the right product, but of having the right processes. It’s made very explicit that repository curators need to work with the people who produce and use the information in the repository, and make sure that the repository acquires all the information necessary for its primary audience to use and understand this information far into the future.

To help folks get oriented, here’s a quick introduction to OAIS. It won’t tell you everything about the model, but it should let you see why it’s useful, how you can use it, and what else you might need to consider in your repository planning.

What OAIS is and isn’t

First, let’s start with some basics: OAIS is a reference model for Open Archival Information Systems (whose initials make up the OAIS), that’s now an ISO standard, but is also freely available.  It was developed by the Consultative Committee for Space Data Systems (CCSDS), an international group coordinated by NASA, whose member space agencies have had to deal with large volumes of data and other records generated by decades of space missions and observations, and so have had to think hard about how to manage and preserve them.  To develop OAIS, they held open discussions with lots of other people and groups (like the National Archives) who were also interested in long-term preservation.  OAIS is called “Open” because of the open process that went into creating it.  It does not require that archives be open access or have an open architecture, and it has no direct relation to the similarly-acronymed Open Archives Initiative (OAI).  (Though all of these things are also useful to know about in their own right.)  An “archival information system” or “archive” can simply be thought of as a repository that’s responsible for long-term preservation of the information it manages.

Unlike many standards, OAIS specifies no particular implementation, API, data format, or protocol. Instead, it’s an abstract model that provides four basic things:

  • A vocabulary for talking about common operations, services, and information structures of a repository. (This alone can provide very useful common ground for different people who use and produce repositories to talk to each other.) A glossary of this vocabulary can be found in section 1 of the reference model.
  • A simple data model for the information that a repository takes in (or “ingests”, to use the OAIS vocabulary), manages internally, and provides to others. This information is assumed to be in distinct, discrete packages known as Submission Information Packages (SIPs) for ingestion, Archival Information Packages (AIPs) for internal management, and Dissemination Information Packages (DIPs) for providing the information to consumers (or to other repositories). These packages include not just raw content, but also metadata and other information necessary for interpreting, preserving, and packaging this content. They have different names because the information they contain can take different forms as it goes into, through, and out of the archive. They are described in more detail in sections 2 and 4 of the reference model.
  • A set of required responsibilities of the archive. In brief, the archive (or its curators) must negotiate with producers of information to get appropriate content and contextual information, work with a designated community of consumers to make sure they can independently understand this information, and follow well-defined and well-documented procedures for obtaining, preserving, authenticating, and providing this information. Section 3 of the model goes into more detail about these responsibilities, and section 5 discusses some of the basic methodologies involved in preservation.
  • A set of recommended functions for carrying out the archive’s required responsibilities.  These are broken up into 6 functional modules: ingest, data management, archival storage, access, administration, and preservation planning.  The model describes about half a dozen functions in each module (ingest, for example, includes things like “receive submission”, “quality assurance”, and “generate AIP”) and data flows and dependencies that might exist between the functions.  Some of these functions are automated, some (like “monitor technology”) are carried out by humans, and some may involve a combination of human oversight and automated assistance.  The functions are described in more detail in section 4 of the model (with issues of multi-archive interoperability discussed in section 6).
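
As a very loose sketch of the package flow (OAIS itself prescribes no data structures or code, and all the field names below are invented), an archive’s ingest and dissemination steps might be modeled like this:

```python
from dataclasses import dataclass, field

@dataclass
class InformationPackage:
    """A toy OAIS-style package: content plus the metadata needed
    to interpret and preserve it.  Field names are invented."""
    content: bytes
    descriptive_metadata: dict
    preservation_metadata: dict = field(default_factory=dict)

def ingest(sip: InformationPackage) -> InformationPackage:
    """Turn a SIP into an AIP: quality-check the submission and
    add preservation information the archive will manage."""
    assert sip.content, "empty submission fails quality assurance"
    return InformationPackage(
        content=sip.content,
        descriptive_metadata=dict(sip.descriptive_metadata),
        preservation_metadata={"fixity": hash(sip.content),
                               "provenance": ["ingested from producer"]})

def disseminate(aip: InformationPackage) -> InformationPackage:
    """Turn an AIP into a DIP: package content and descriptive
    metadata for a consumer, leaving internal records behind."""
    return InformationPackage(content=aip.content,
                              descriptive_metadata=aip.descriptive_metadata)

sip = InformationPackage(b"page scans...", {"title": "Example item"})
aip = ingest(sip)
dip = disseminate(aip)
```

The point of the three package types is exactly this: the same underlying content can carry different accompanying information as it enters, sits in, and leaves the archive.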

OAIS conformance and usage

It is important to note that OAIS compliance simply requires fulfilling the required responsibilities, and supporting the basic OAIS data model of information packages. A repository is not required to implement all the functions recommended in the OAIS model, or replicate the detailed internal data flows, to be OAIS compliant. But it can be very useful to look through the functions in any case, both to make sure that your repository is doing everything it needs to do, and to see how the big problem of reliable data preservation can be broken down into smaller, more manageable operations and workflows.

You may also find the functions a useful reference point for detailed descriptions of the exact formats and protocols your repository uses for ingesting and storing information, providing content to users, and migrating it to other repositories. Although the OAIS model does not itself provide specific formats or protocols to use, it makes it clear that a repository provider needs to specify these so it can receive information from producers and make it clearly understandable to consumers.

The OAIS model has been used to help construct more detailed criteria for trusted repositories, as well as checklists for repository audit and certification. In most cases, repositories will operate perfectly well without satisfying every last criterion or checklist item. At the Partnerships in Innovation symposium I attended last week, Don Sawyer, one of the main people behind OAIS, remarked that the archives where he worked satisfied about 80% of the trusted repository checklist items. But he still found it useful to go through the whole list to verify that certain functions were not relevant or required for their repository needs, as well as to spot aspects of the repositories (like disaster recovery or provenance tracking) that might need more attention. Similarly, you can go through the recommended OAIS functions and data-model breakdowns to evaluate what’s important to have in your repository, what can be safely omitted, and what might need more careful attention or documentation.

What else you need to think about

Although the OAIS model includes examples of various kinds of repositories that might use it, it’s at its heart a fairly generic, domain-independent model, largely concerned with preservation needs. It doesn’t say a whole lot about how a repository needs to interact with specific communities to fulfill its purposes. For instance, in the talk I gave last week, I stressed the importance of designing the architecture of repositories to support rich discovery mechanisms. As Ken Thibodeau noted in later conversation, the access model of OAIS is more primitive than the architectures I described. OAIS is not incompatible with those architectures, but designing the right kinds of discovery architectures requires going beyond the criteria of OAIS itself.

You’ll also need to think carefully about the needs of the communities you’re collecting from and serving. The OAIS model notes this requirement, but doesn’t pursue it in depth. I can understand why it doesn’t, since those needs are highly dependent on the domain you’re working in. A repository intended to preserve static, published text documents for possible use in legal deposition will need to interact with its community very differently from, say, a repository intended to manage, capture, and ultimately preserve works in progress used in ongoing research and teaching. They both have preservation requirements that OAIS may well address effectively, but designing effective repositories for these disparate needs may require going well beyond OAIS, doing detailed requirements analyses, and assessing benefits and costs of various options.

I’ll talk more about requirements for particular kinds of repositories in later posts. But I hope I’ve made it clear how the OAIS model can be useful for general thinking and planning what a repository needs to do to manage and preserve its content. If it sounds promising, you can download the full OAIS model as a PDF. A revised document that will clarify some of the terminology and recommendations, but will not substantially change the model, is expected to be released in early 2009.

Posted in preservation, repositories | 2 Comments