Everybody's Libraries

January 15, 2009

Repository services, Part 2: Supporting deposit and access

Filed under: discovery,formats,repositories — John Mark Ockerbloom @ 6:01 pm

A couple of days ago, I talked about why we provide multiple repositories, and why an institutional scholarship repository needs to provide more than just a place to store stuff.  In this post, I’ll describe some of the basic deposit and access services that are useful for institutional repositories (IRs).

The enumeration of services in this series is based in part on discussions I’ve had with our scholarly communications librarian, Shawn Martin, but any wrong-headed or garbled statements you find here can be laid at my own feet.  (Whereupon I can pick them up, smooth them out, and find the right head for them.)

Ingestion:

One of the major challenges of running an institutional repository is filling it up with content: finding it, making sure it can go in, and making sure it goes in properly, in a manageable format, with informative metadata.  Among other things, this calls for:

  • Efficient, flexible, user-friendly deposit workflows. Most of your authors will not bother with anything that looks like it’s wasting their time.  And you shouldn’t waste your staff’s time either, or drive them mad, with needlessly tedious deposit procedures they have to do over and over and over and over again.
  • Conversion to standard formats on ingestion. Word processing documents, and other formats tied to a particular software product, have a way of becoming opaque and unreadable a few years after the vendor has moved on to a new version, a new product, or that dot-com registry in the sky.  Our institutional repository, for instance, converts text documents to PDF on ingestion, which both helps preserve them and ensures wide readability.  (PDF is an openly specified format, readable by programs from many sources, available on virtually all kinds of computers.)  A minimal conversion sketch follows this list.
  • Journal workflows. Much of what our scholars publish is destined for scholarly journals, which in turn are typically reviewed and edited by those scholars.  Letting scholars review, compile, and publish those journals directly in the repository can save their time, and encourage rapid, open electronic access.   (And you don’t have to go back and try to get a copy for your repository when it’s already in the repository.)  Our BePress IR software has journal workflows and publication built into it.  Alternatively, specialized journal editing and publishing systems, such as Open Journal Systems, also serve as repositories for their journal content.
  • Support for automated submission protocols such as SWORD. Manual repository deposit can be tedious and error-prone, especially if there are multiple repositories that want your content (such as a funder-mandated repository, your own institutional repository, and perhaps an independent subject repository).  Manual deposit also often wastes people’s time re-entering information that’s already available online.  If you can work with an automated protocol that can automatically put content into a repository, though, things can get much better: you can support multiple simultaneous deposits, ingestion procedures designed especially for your own environment that use the automated protocol for deposit, and automated bulk transfer of content from one repository to another.  SWORD is an automated repository deposit protocol that is starting to be supported by various repositories. (BePress does not yet support it, but we’re hoping they will soon.)  A rough deposit sketch appears at the end of this section.
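To make the conversion point above concrete, here’s a minimal sketch of batch-converting word-processing files to PDF outside the repository. It assumes a recent LibreOffice installation and its headless command-line converter; it’s purely illustrative, not how our repository (or BePress) actually does its conversions.

    import pathlib
    import subprocess

    def convert_to_pdf(source: pathlib.Path, out_dir: pathlib.Path) -> pathlib.Path:
        """Convert a word-processing document to PDF with LibreOffice in headless mode."""
        out_dir.mkdir(parents=True, exist_ok=True)
        subprocess.run(
            ["libreoffice", "--headless", "--convert-to", "pdf",
             "--outdir", str(out_dir), str(source)],
            check=True,
        )
        return out_dir / (source.stem + ".pdf")

    # e.g. convert_to_pdf(pathlib.Path("working-paper.doc"), pathlib.Path("converted"))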

From a practical standpoint, if you want a significant stream of content coming into your repository, you’ll probably need to have a content wrangler as well: someone who makes sure that authors’ content is going into the repository as intended. (In practice, they often end up doing the deposit themselves.)
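As for the automated deposit sketch promised above: a SWORD deposit is essentially an HTTP POST of a packaged item to a repository’s collection URI. This Python sketch assumes the requests library, plus a made-up deposit URL, credentials, and zip package; the packaging identifier shown is one published SWORD 1.3 type, and a real repository’s profile may well differ.

    import requests

    # Hypothetical values: substitute your repository's SWORD collection URI and credentials.
    DEPOSIT_URL = "https://repository.example.edu/sword/collection/faculty-papers"
    PACKAGE = "article-bundle.zip"  # zipped content plus metadata, packaged per the repository's profile

    with open(PACKAGE, "rb") as package:
        response = requests.post(
            DEPOSIT_URL,
            data=package,
            headers={
                "Content-Type": "application/zip",
                "Content-Disposition": "attachment; filename=" + PACKAGE,
                # SWORD 1.3 uses X-Packaging to describe how the zip is structured.
                "X-Packaging": "http://purl.org/net/sword-types/METSDSpaceSIP",
            },
            auth=("depositor", "secret"),
            timeout=60,
        )

    response.raise_for_status()
    # On success the server returns an Atom entry describing where the item ended up.
    print(response.status_code)
    print(response.text)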

Discovery:

You want it to be easy and enjoyable for readers to explore your site and find content of interest to them.  Here are a few important ways to enable discovery:

  • Search of full text and/or metadata, either over the repository as a whole, or over selected portions of the repository.  Full text search can be simple and turn up lots of useful content that might not be discovered through metadata search alone.  More precise, metadata-based searches can also be important for specialized needs.   Full text indexing is not always available (in some cases, you might only have page images), but it should be supported where possible.
  • Customization of discovery for different communities and collections.  Different communities may have different ways of organizing and finding things.  Some communities may want to organize primarily by topic, or author, or publication type, or date.  Some may have specialized metadata that should be available for general and targeted searching and browsing.  If you can customize how different collections can be explored, you can make them more usable to their audiences.
  • Aggregator feeds using RSS or Atom, so people can keep track of new items of interest in their favorite feed readers.  This needs to exist at multiple levels of granularity.   Many repositories give RSS feeds of everything added to the repository, but most people will be more interested in following what’s new from a particular department or author, or in a particular subject.
  • Search engine friendliness. Judging from our logs, most of the downloads of our repository papers occur not via our own searching and browsing interfaces, but via Google and other search engines that have crawled the repository.  So you need to make sure your repository is set up to make it easy and inviting for search engines to crawl.  Don’t hide things behind Flash or JavaScript unless you don’t want them easily found.  Make sure your pages have informative titles, and the site doesn’t require excessive link-clicking to get to content.  You also need to make sure that your site can handle the traffic produced by search-engine indexers, some of which can be quite enthusiastic about frequently crawling content.
  • Metadata export via protocols like OAI-PMH.  This is useful in a number of ways:  It allows your content to be indexed by content aggregators; it lets you maintain and analyze your own repository’s inventory; and, in combination with automated deposit protocols like SWORD (and content aggregation languages like OAI-ORE), it may eventually make it much simpler to replicate and redeposit content in multiple repositories.  (A small harvesting sketch follows this list.)
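To give a flavor of what harvesting looks like from the aggregator’s side, here’s a rough sketch of an OAI-PMH client that issues ListRecords requests and follows resumption tokens. The endpoint URL and set name are made up for illustration, and the sketch only pulls identifiers and titles out of the standard oai_dc metadata.

    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI = "{http://www.openarchives.org/OAI/2.0/}"
    DC = "{http://purl.org/dc/elements/1.1/}"

    # Hypothetical endpoint; substitute your repository's actual OAI-PMH base URL.
    BASE_URL = "https://repository.example.edu/oai"

    def harvest(base_url, metadata_prefix="oai_dc", set_spec=None):
        """Yield (identifier, title) pairs for each record, following resumption tokens."""
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
        while True:
            with urllib.request.urlopen(base_url + "?" + urllib.parse.urlencode(params)) as response:
                tree = ET.parse(response)
            for record in tree.iter(OAI + "record"):
                identifier = record.findtext(OAI + "header/" + OAI + "identifier")
                title = record.findtext(".//" + DC + "title")
                yield identifier, title
            token = tree.findtext(".//" + OAI + "resumptionToken")
            if not token:
                break
            # After the first request, the token must be the only argument besides the verb.
            params = {"verb": "ListRecords", "resumptionToken": token}

    for identifier, title in harvest(BASE_URL, set_spec="economics"):
        print(identifier, title)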

Access:

  • Persistent URIs for items. Content is easier to find and cite when it doesn’t move away from its original location.  You would think it would be well known that cool URLs don’t change, but I still find a surprisingly large number of documents put in content management systems where I know the only visible URIs will not survive the next upgrade of the system, let alone a migration to a new platform.  If possible, the persistent URI should be the only URI the user sees.  If not, the persistent URI should at least be highly visible, so that users link to it, and not the more transient URI that your repository software might use for its own purposes.
  • An adequate range of access control options for particular collections and items.  I’m all in favor of open access to content, but sometimes this is not possible or appropriate.  Some scholarship includes information that needs to be kept under wraps, or in limited release, temporarily or permanently.  We want to still be able to manage this content in the repository when appropriate.
  • Embargo management is an important part of access control.  In some cases, users may want to keep their content limited-access for a set time period, so that they can get a patent, obey a publishing contract, or prepare for a coordinated announcement.  Currently, because of BePress’ limited embargo support, we sit on embargoed content and have to remember to put it into the repository, or manually turn on open access, when the embargo ends.  It’s much easier if depositors can just say “keep this limited-access until this date, and then open it up,” and the repository service handles matters from there.  (A trivial sketch of that behavior follows this list.)
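And here’s what “handles matters from there” could mean in the simplest possible terms: a check the repository runs whenever an item is requested, so that access opens automatically once the embargo date passes. The function and field names are hypothetical, not anything BePress or our own systems actually expose.

    import datetime
    from typing import Optional

    def effective_access(stated_access: str, embargo_until: Optional[datetime.date],
                         today: Optional[datetime.date] = None) -> str:
        """Return the access level to enforce right now.

        The depositor records the eventual access level (say, "open") and an
        optional embargo end date; the item stays restricted until that date,
        then opens up without anyone having to remember to flip a switch.
        """
        today = today or datetime.date.today()
        if embargo_until and today < embargo_until:
            return "restricted"
        return stated_access

    # A paper under a one-year publisher embargo, checked before and after the date:
    print(effective_access("open", datetime.date(2010, 1, 15), datetime.date(2009, 6, 1)))  # restricted
    print(effective_access("open", datetime.date(2010, 1, 15), datetime.date(2010, 2, 1)))  # open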

That may seem like a lot to think about, but we’re not done yet.  In the next part, I’ll talk about services for managing content in the IR, including promoting it, letting depositors know about its impact, and preserving it appropriately.

January 13, 2009

Repository services, Part 1: Galleries vs. self-storage units

Filed under: repositories — John Mark Ockerbloom @ 6:00 pm

Back near the start of my occasional series on repositories, I noted that we had not just one but a number of repositories, each serving different purposes.

In tight budgetary times, this approach might seem questionable.  Right now, we’re putting up a new repository structure (in addition to our existing ones) to keep our various digitized special collections and make them available for discovery and use.  We hope this will make our digital special collections more uniformly manageable, and less costly to maintain.

At the same time, we’re continuing to maintain an institutional repository of our scholars’ work on a completely different platform, one for which we pay a subscription fee annually.  I’ve heard more than one person ask “Well, once our new  repository is up, can’t we just move the existing institutional repository content into it, and drop our subscription?”

To which I generally answer: “We might do that at some point, but right now it’s worth maintaining the subscription past the opening date of our new repository.”  The basic reason is that the two repositories not only have different purposes, but also, at least in their current uses, support very different kinds of interactions, with different kinds of audiences.

The interactions we need initially for the repository we’re building for our special collections are essentially internal ones.  Special collections librarians create (or at least digitize) a thematic set of items, give them detailed cataloging, and deposit them en masse into the collection.  The items are then exposed via machine interfaces to our discovery applications, which then let users find and interact with the contents in ways that our librarians think will best show them off.

The repository itself, then, can work much like a self-storage unit.  Every now and then we move in a bunch of stuff, and then later we bring it out into a nicer setting when people want to look at it.  Access, discovery, and delivery are built on top of the repository, in separate applications that emphasize things like faceted browsing, image panning and zooming, and rare book page display and page turning.

Our institutional repository interacts with our community quite differently.  Here, the content is created by various scholars who are largely outside the library, who may deposit items bit by bit whenever they get around to it (or when library staff can find the time to bring in their content).  They want to see their work widely read, cited, and appreciated.  They don’t want to spend more time than they have to putting stuff in– they’ve got work to do– and they want their work quickly and easily accessible.  And they’d like to know when their work is being viewed.  In short, they need a gallery, not just a self-storage unit.  They want something that lets them show off and distribute their work in elegant ways.

Our institutional repository applications, bundled with the repository, thus emphasize things like full text search and search-engine openness, instant downloads of content, and notification of colleagues uploading and downloading papers.

We could in theory build similar applications ourselves, and layer them on top of the same “self-storage” repository structure we use for special collections.   (Museums likewise often have their exhibit galleries literally on top of the bulk of their collection kept in their basements, or other compact storage areas.)  But it would take us a while to build the applications we need, so for now we see it as a better use of our resources to rely on the applications bundled with our institutional repository service.

(An alternative, of course, would be to see if an existing open source application would serve our needs.  I hope to talk more about open source repository software in a future post, but we haven’t to date decided to run our institutional repository that way.)

I hope I’ve at least made it clear that for a viable institutional repository, you need quite a bit more than just “a place to put stuff”: you need a suite of services that support its purposes.  In Part 2,  I’ll enumerate some of the specific services that we need or find useful in our institutional scholarship repository.

October 13, 2008

What repositories do: The OAIS model

Filed under: preservation,repositories — John Mark Ockerbloom @ 11:23 pm

(Another post in an ongoing series on repositories.)

In my previous post, I mentioned the OAIS reference model as an influential framework for thinking about and planning repositories intended for long-term preservation. If you’re familiar with some of the literature or marketing for digital repositories, you may well have seen OAIS mentioned, or seen a particular system marketed as “OAIS compliant”. You may have also noticed remarks that it’s not always clear in practice what OAIS compliance means. The JISC Standards Catalogue notes “The [OAIS] documentation is quite long and complex and this may prove to be a barrier to smaller repositories or archives.” A common impression I’ve heard of OAIS is that it’s a nice idea that one should really try to pay more attention to, but complex enough that one will have to wait for some less busy time to think about it. Perhaps, one might think, if we just pick a repository system whose marketing says it’s OAIS compliant, we can be spared thinking about it ourselves.

I think we can do better than that, even in smaller projects. The basics of the OAIS model can be understood without having to be conversant with all 148 pages of the reference document. Those basics can help you think about what you need to be doing if you’re planning on preserving information for a long term (as most libraries do). The basics of OAIS also make it clear that following the model isn’t just a matter of installing the right product, but of having the right processes. It’s made very explicit that repository curators need to work with the people who produce and use the information in the repository, and make sure that the repository acquires all the information necessary for its primary audience to use and understand this information far into the future.

To help folks get oriented, here’s a quick introduction to OAIS. It won’t tell you everything about the model, but it should let you see why it’s useful, how you can use it, and what else you might need to consider in your repository planning.

What OAIS is and isn’t

First, let’s start with some basics: OAIS is a reference model for Open Archival Information Systems (whose initials make up the acronym) that’s now an ISO standard, but is also freely available. It was developed by the Consultative Committee for Space Data Systems (a group in which NASA plays a leading role), whose members have had to deal with large volumes of data and other records generated by decades of space missions and observations, and so have had to think hard about how to manage and preserve them. To develop OAIS, they had open discussions with lots of other people and groups (like the National Archives) who were also interested in long-term preservation. OAIS is called “Open” because of the open process that went into creating it. It does not require that archives be open access or have an open architecture, and it has no direct relation to the similarly-acronymed Open Archives Initiative (OAI). (Though all of these things are also useful to know about in their own right.) An “archival information system” or “archive” can simply be thought of as a repository that’s responsible for long-term preservation of the information it manages.

Unlike many standards, OAIS specifies no particular implementation, API, data format, or protocol. Instead, it’s an abstract model that provides four basic things:

  • A vocabulary for talking about common operations, services, and information structures of a repository. (This alone can provide very useful common ground for different people who use and produce repositories to talk to each other.) A glossary of this vocabulary can be found in section 1 of the reference model.
  • A simple data model for the information that a repository takes in (or “ingests”, to use the OAIS vocabulary), manages internally, and provides to others. This information is assumed to be in distinct, discrete packages known as Submission Information Packages (SIPs) for ingestion, Archival Information Packages (AIPs) for internal management, and Dissemination Information Packages (DIPs) for providing the information to consumers (or to other repositories). These packages include not just raw content, but also metadata and other information necessary for interpreting, preserving, and packaging this content. They have different names because the information they contain can take different forms as it goes into, through, and out of the archive. They are described in more detail in sections 2 and 4 of the reference model.  (A small illustrative sketch of this package structure follows this list.)
  • A set of required responsibilities of the archive. In brief, the archive (or its curators) must negotiate with producers of information to get appropriate content and contextual information, work with a designated community of consumers to make sure they can independently understand this information, and follow well-defined and well-documented procedures for obtaining, preserving, authenticating, and providing this information. Section 3 of the model goes into more detail about these responsibilities, and section 5 discusses some of the basic methodologies involved in preservation.
  • A set of recommended functions for carrying out the archive’s required responsibilities. These are broken up into 6 functional modules: ingest, data management, archival storage, access, administration, and preservation planning. The model describes about half a dozen functions in each module (ingest, for example, includes things like “receive submission”, “quality assurance”, and “generate AIP”) and data flows and dependencies that might exist between the functions. Some of these functions are automated, some (like “monitor technology”) are carried out by humans, and some may involve a combination of human oversight and automated assistance. The functions are described in more detail in section 4 of the model (with issues of multi-archive interoperability discussed in section 6).
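The reference model prescribes no particular data structure for these packages, but as a very rough illustration (with made-up field names), the common shape they share might look something like this:

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class InformationPackage:
        """Simplified sketch of an OAIS information package: content objects plus
        the metadata needed to find, interpret, and preserve them."""
        content_files: List[str]              # the data objects themselves (paths or URIs)
        descriptive_metadata: Dict[str, str]  # e.g. title, creator, subject
        representation_info: Dict[str, str]   # what's needed to interpret the content, e.g. formats
        provenance: List[str] = field(default_factory=list)  # what has happened to the content so far

    # The same content takes different package forms going into, through, and out of the archive.
    sip = InformationPackage(["survey.csv"], {"title": "Survey results"}, {"format": "text/csv"})
    aip = InformationPackage(
        sip.content_files,
        sip.descriptive_metadata,
        dict(sip.representation_info, checksum="sha1:<digest>"),
        provenance=["received from producer", "format identified", "fixity recorded"],
    )
    dip = InformationPackage(aip.content_files, aip.descriptive_metadata, aip.representation_info)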

OAIS conformance and usage

It is important to note that OAIS compliance simply requires fulfilling the required responsibilities, and supporting the basic OAIS data model of information packages. A repository is not required to implement all the functions recommended in the OAIS model, or replicate the detailed internal data flows, to be OAIS compliant. But it can be very useful to look through the functions in any case, both to make sure that your repository is doing everything it needs to do, and to see how the big problem of reliable data preservation can be broken down into smaller, more manageable operations and workflows.

You may also find the functions a useful reference point for detailed descriptions of the exact formats and protocols your repository uses for ingesting and storing information, providing content to users, and migrating it to other repositories. Although the OAIS model does not itself provide specific formats or protocols to use, it makes it clear that a repository provider needs to specify these so it can receive information from producers and make it clearly understandable to consumers.

The OAIS model has been used to help construct more detailed criteria for trusted repositories, as well as checklists for repository audit and certification. In most cases, repositories will operate perfectly well without satisfying every last criterion or checklist item. At the Partnerships in Innovation symposium I attended last week, Don Sawyer, one of the main people behind OAIS, remarked that the archives where he worked satisfied about 80% of the trusted repository checklist items. But he still found it useful to go through the whole list to verify that certain functions were not relevant or required for their repository needs, as well as to spot aspects of the repositories (like disaster recovery or provenance tracking) that might need more attention. Similarly, you can go through the recommended OAIS functions and data-model breakdowns to evaluate what’s important to have in your repository, what can be safely omitted, and what might need more careful attention or documentation.

What else you need to think about

Although the OAIS model includes examples of various kinds of repositories that might use it, it’s at its heart a fairly generic, domain-independent model, largely concerned with preservation needs. It doesn’t say a whole lot about how a repository needs to interact with specific communities to fulfill its purposes. For instance, in the talk I gave last week, I stressed the importance of designing the architecture of repositories to support rich discovery mechanisms. As Ken Thibodeau noted in later conversation, the access model of OAIS is more primitive than the architectures I described. OAIS is not incompatible with those architectures, but designing the right kinds of discovery architectures requires going beyond the criteria of OAIS itself.

You’ll also need to think carefully about the needs of the communities you’re collecting from and serving. The OAIS model notes this requirement, but doesn’t pursue it in depth. I can understand why it doesn’t, since those needs are highly dependent on the domain you’re working in. A repository intended to preserve static, published text documents for possible use in legal deposition will need to interact with its community very differently from, say, a repository intended to manage, capture, and ultimately preserve works in progress used in ongoing research and teaching. They both have preservation requirements that OAIS may well address effectively, but designing effective repositories for these disparate needs may require going well beyond OAIS, doing detailed requirements analyses, and assessing benefits and costs of various options.

I’ll talk more about requirements for particular kinds of repositories in later posts. But I hope I’ve made it clear how the OAIS model can be useful for general thinking and planning what a repository needs to do to manage and preserve its content. If it sounds promising, you can download the full OAIS model as a PDF. A revised document that will clarify some of the terminology and recommendations, but will not substantially change the model, is expected to be released in early 2009.

October 9, 2008

Surpassing all records

Filed under: architecture,discovery,preservation,repositories — John Mark Ockerbloom @ 10:47 pm

What will happen to all the White House emails after George W. Bush leaves office in January? Who will take charge of all the other electronic records of the government, after they’re no longer in everyday use? How can you archive 1 million journal articles a month from dozens of different publishers? Can the virtual world handle the Large Hadron Collider’s generation of 15 petabytes of data per year without being swallowed by a singularity? And how can we find what we need in all these bits, anyway?

These were some of the digital archiving challenges discussed this week at the Partnerships in Innovation II symposium in College Park, Maryland. Co-sponsored by the National Archives and Records Administration and the University of Maryland, the symposium brought together experts and practitioners in digital preservation for a day and a half of talks, panels, and demonstrations. It looked to me like over 200 people attended.

This conference was a sequel to an earlier symposium that was held in 2004. Many of the ideas and plans presented at the earlier forum have now come to fruition. The symposium opened with an overview of NARA’s Electronic Records Archives (ERA), a long-awaited system for preserving massive amounts of records from all federal government agencies, which went live this summer. It’s still in pilot mode with a limited number of agencies, but will be importing lots of electronic records soon, including the Bush administration files after the next president is inaugurated.

The symposium also reviewed progress with older systems and concepts. The OAIS reference model, a framework for thinking about and planning long-term preservation repositories, influences not only NARA’s ERA, but many other initiatives and repositories, including familiar open source systems like Fedora and DSpace. Some of the developers of OAIS, including NASA’s Don Sawyer, reviewed their experiences with the model, and the upcoming revision of the standard. Fedora and DSpace themselves have been around long enough to be subjects of a “lessons learned” panel featuring speakers who have built ambitious institutional repositories around them.

The same panel also featured Evan Owens of Portico discussing the extensive testing and redesign they had to do to scale up their repository to handle the million articles per month mentioned at the top of this post. Heavily automated workflows were a big part of this scaling up, a strategy echoed by the ERA developers and a number of the other repository practitioners, some of whom showed some interesting tools for automatically validating content, and for creating audit trails for certification and rollback of repository content.

Networks of interoperating repositories may allow digital preservation to scale up further still. That theme arose in a couple of the other panels, including the last one, dedicated to a new massive digital archiving initiative: the National Science Foundation’s Datanet. NSF envisions large interoperating global networks of scientific data that could handle many Large Hadron Colliders’ worth of data, and would make the collection, sharing, reuse, and long-term preservation of scientific data an integral part of scientific research and education. The requirements and sizes of the grants are both prodigious– $20 million each to four or five multi-year projects that have to address a wide range of problems and disciplines– but NSF expects that the grants will go to wide-ranging partnerships. (This forum is one place interested parties can find partners.)

I gave a talk as part of the Tools and Technologies panel, where I stressed the importance of discovery as part of effective preservation and use of content, and discussed the design of architectures (and example tools and interfaces) that can promote discovery and use of repository content. My talk echoed in part a talk I gave earlier this year at a Palinet symposium, but focused on repository access rather than cataloging.

I’m told that all the presentations were captured on video, and hopefully those videos, and the slides from the presentations, will all be placed online by the conference organizers. In the meantime, my selected works site has a PDF of the slides and a draft of the script I used for my presentation. I scripted it to make sure I’d stay within the fairly short time slot while still speaking clearly. The talk as delivered was a bit different from (and hopefully more polished than) this draft script, but I hope this file will let folks contemplate at leisure the various points I went through rather quickly.

I’d like to thank the folks at the National Archives and UMD (especially Ken Thibodeau, Robert Chadduck, and Joseph Jaja) for putting on such an interesting and well-run symposium, and giving me the opportunity to participate. I hope to see more forums bringing together large-scale digital preservation researchers and practitioners in the years to come.

September 11, 2008

Repositories: Benefits, costs, contingencies (with an example)

Filed under: repositories — John Mark Ockerbloom @ 4:53 pm

(This is the third post in a slow-cooking series on repositories.)

In my last repository post, I listed a variety of repository types that we maintain at our institution, each with different content, operation, and policies. At the end of the post, I wrote:

Once we have a clear understanding of why we would benefit from a particular repository, and what it would manage, we can consider various options for who would run it, where, and how. (And of course, what its costs would be, and how we can realistically expect those costs to be covered….)

Without a clear sense of benefits and costs, you won’t have a sensible repository strategy. And, as Dorothea Salo reminds us today, without a sensible strategy you’re likely to burn through a lot of money, labor, and goodwill with little to show for it at the end. You have to go in knowing what you want, and being realistic about what you’re willing to invest to produce it. (For instance, if you’re planning to build a repository of your community’s own scholarship, and hope to get lots of free help from your community just by doing some marketing, you really need to read Dorothea’s post for a reality check.)

Even when your initial plan is sound, you have to be prepared for change, and the unexpected. Technology changes quickly. Online tools, communities, and scholarly societies also change. Methods of scholarship also change, often more slowly, but sometimes in significant ways. Even if you’ve done your homework, you may eventually find that the repository that seemed just fine a few years ago doesn’t really meet your needs like it used to. Maybe the software hasn’t been updated as you’d like it, and there’s a better system available now. Maybe you’re storing different kinds of things, or you’ve found a new application that your scholars really want to use that’s not compatible with your existing setup. Maybe the formats you’re managing have gone out of date. Maybe it becomes more cost effective to move to a big externally managed repository that your scholars are flocking to already– or away from one that they’re not finding useful. Maybe you even decide it no longer makes sense for you to maintain a particular repository.

You need to start thinking about strategies for change (and for exit) the moment you start planning a repository. Remember, repositories ultimately don’t exist for themselves, but for their content (and for the people using that content). And the kind of content that libraries often care about is likely to remain relevant much longer than any particular repository configuration. You want to ensure that the content remains usable for as long as your patrons care about it, even as it moves and migrates between systems (and possibly, between caretakers).

An example: Planning for data repository services

What does it mean, practically, to plan with benefits, costs, and contingencies in mind? Well, at Penn, we’re starting to consider repository services for data sets. We have a general idea of the benefits of archiving data sets, because we’ve heard from faculty in various departments who want to analyze data previously collected by research groups (their own or others), who are having a hard time managing their own data, or who are required by their journals or support agencies to publish or maintain their data sets. Before we commit to providing a new data repository service, though, we need a better sense of these benefits. How broad and deep is the desire for data services among our faculty? Where is it most acute, in terms of disciplines and services? What would be gained from having our institution provide our own data repository services, rather than just having our scholars use someone else’s services, or fend for themselves? What are the benefits of introducing services specifically for data, rather than just, say, saving data sets alongside other files in existing repositories? If we’re considering a significant investment, we need more than just anecdotal answers to these questions. A survey of faculty in various disciplines can give us a better idea of how they could benefit from and support data repository services.

We also have to consider costs. What options do we have for creating, acquiring, or contracting with a data repository or repository service? What do they cost to install and run, both in monetary and staffing terms? What are the costs of acquiring content (again in money and labor, where the labor might come from librarians, scholars, or students)? How about costs of maintaining, accessing, and migrating the content? How will these costs be covered? What about costs associated specifically with this kind of content? Are there confidentiality, security, intellectual property, or liability concerns we have to consider? To help answer these questions, we should evaluate various data repository systems in existence and in development. The faculty survey mentioned above could also help us answer some of the questions about labor and support.

Contingencies, by their nature, tend not to be fully foreseeable. But there are a few obvious things we can ask about and plan for. Will our data still be readable for decades to come? Can we migrate it to new formats, and if so, what would be involved? Can we make sure we have good enough metadata and annotation to know how to read, use, and migrate the data in the future? Do we have clear identifiers for our content that will survive a move to a new platform (and leave a workable forwarding address, if necessary)? What happens to our content if our repository loses funding, our machine room is sucked into a mini-black-hole, or we simply decide it’s not worth the trouble of keeping the repository going? What do we do if we’re told to withdraw or change the data we’re maintaining, by the person who deposited it, by someone else using or mentioned in the data, or by the government? We won’t necessarily come up with definitive answers to all these questions, but brainstorming and thinking through possible and likely scenarios should help us know what to expect and reduce the chance of our getting caught unawares by a costly problem.

Is it worth it?

That’s a lot to do, you might be thinking, before you even get started. Can’t we just put this cool system up and see what happens? Well, you could, if you and your community will be satisfied with something that might be here today and gone tomorrow, and that doesn’t have any support or reliability guarantees. But if you have scholars to serve, and you’d like them to take the time and trouble to entrust their content to your repository, they’re probably going to want some reassurance that the repository will have staying power, and give them benefits worth their time. Otherwise, they have plenty of other, more important things to do.

Running a large, successful, long-lasting repository takes a lot of work over its lifespan. Better to do some planning work up front than get stuck with a lot of costly and unnecessary work later on.

June 26, 2008

Repositories: What they are, and what we use them for

Filed under: repositories — John Mark Ockerbloom @ 3:15 pm

(Note: This is the second of an ongoing series of posts on repositories. The first post is here.)

The JISC Repositories Support Project defines a digital repository as “a mechanism for managing and storing digital content.” I find this a useful definition, both for what it says and what it doesn’t say. It notes that repositories, as such, focus on content and its management. It doesn’t say anything about the kind of digital content managed by the repository, or about the use this content is put to.

A repository’s focus is related to, but distinct from, the focus of a library or an application. Repositories focus on particular information content. Applications (like Zotero, FeedReader, or Google Docs) focus on particular information tasks, like tracking citations, getting news, or authoring documents. Libraries focus on the information needs of particular communities (which might be towns, schools, peer researchers, or Internet users with particular interests). Applications and libraries may use repositories to support their tasks or communities, and some may be primarily built around one specific repository (as most libraries in the pre-computer age were built around what was in their physical stacks). But they are not identical to their repositories, and it’s often useful to distinguish the functions of a library and the functions of the repositories that it uses.

At the same time, though, you can’t plan the development of a library without thinking about its repositories. Repositories really are essential infrastructure for libraries, but not simply as a place to “capture and preserve the intellectual output of university communities” (as a 2002 SPARC white paper put it), or, more pessimistically, as “a place where you dump stuff and then nothing happens to it” (as a 2005 JISC workshop annex put it). The Penn Libraries today rely on hundreds of digital repositories, mostly run by various publishers. We also manage a few important ones ourselves. Here are a few that we manage, or are considering managing:

  • A repository providing open access to the scholarly output of our researchers (what is often thought of as the traditional “institutional repository”). For this repository, we manage the content, and contract with an outside company to manage the servers and develop the software. While many faculty cooperate in populating this repository, and some faculty deposit their own work themselves, librarians do much of the work to populate it.
  • A repository preserving content from some of our electronic subscription resources. This repository is normally only seen by library staff, but it’s an important part of our preservation strategy, and will be exposed selectively when subscription resources it preserves are no longer available from the publisher. We run this repository on a local server, using open source software developed elsewhere, and its content is selected by us and ingested and preserved largely automatically, in cooperation with other users of the same repository software. (We also subscribe to another preservation initiative, involving a centralized preservation repository system that we don’t manage.)
  • The repository used to store content in our main courseware management system. The server is managed by us, using proprietary software, and is populated by instructors from all over the university. It is largely torn down and built anew every semester (sometimes carrying over material from previous semesters’ incarnations). While this isn’t a permanent repository, it has very strong and definite persistence requirements that we have to take pains to support. And if some of our users just think of this as a place to do their teaching, and the “repository” aspects just come along for the ride, that’s a feature, not a bug.
  • Repositories for various digital image collections and digitized special collections. Historically these collections have been a mishmash of systems developed ad-hoc, involving filesystems, metadata in a database, custom-built websites, backup procedures, and sometimes little else. We’re currently locally developing a digital library architecture that will unify discovery and usage of many of these collections, and we hope to similarly unify repository management for many of these collections as well. Traditionally, the content is selected by bibliographers and the repositories and collection sites created by techies; we hope that the new architecture will let the bibliographers do more repository management and site design, and let the techies do less site-by-site management and more unified service management.
  • We have also tested repositories for managing numeric data, which are increasingly important shared research resources in many fields. We do not currently have a repository in production for this, but the repositories developed by projects like this one have important features for data-centric research that are not supported to the same extent by “traditional” repository systems.

As you can see from these examples, libraries like ours have all kinds of different uses for repositories, and various ways we can develop and manage them. We’re not starting repositories because they’re what all the cool Research I libraries are doing this year. We’re managing them because they help us provide what we see as important services to our communities. We recognize that different repositories have different uses, and that it often makes more sense to integrate multiple repositories into a single library than to build One Repository to Rule Them All. Once we have a clear understanding of why we would benefit from a particular repository, and what it would manage, we can consider various options for who would run it, where, and how. (And of course, what its costs would be, and how we can realistically expect those costs to be covered. But that’s a topic for another post.)

May 7, 2008

Everybody’s repositories (first of a series)

Filed under: repositories — John Mark Ockerbloom @ 11:50 am

The library where I work has decided to think long and hard about its digital repository strategy. Your library may be doing this too, or may have recently done so and is now working on carrying out that strategy. If it’s not, it probably should be.

Libraries have for a long time hosted repositories of content in paper form; indeed, such repositories account for a large portion of both the budget and the floor space of many libraries. But many of them have been slow to take on responsibility for digital repositories, or have only done so in a very limited way, compared to their physical repository investments.

But while established libraries have often hesitated in taking up digital repositories, the rest of the world has not. As folks in research libraries have known for a while, a lot of the money we now spend on content pays for electronic resources held in publisher repositories. In typical arrangements, libraries no longer own this content (as they owned the print content the electronic versions supplant) but lease it. And even if a library has a “perpetual access” contract that lets it download publisher content after ending a subscription, for practical purposes many libraries are not ready to host it or make it available as readily and seamlessly as their patrons have grown to expect.

However, even if publisher repositories, or scholar-run discipline repositories like the social scientists’ SSRN, aren’t directly run by traditional libraries, those libraries are among their primary customers. Therefore, the folks who run those repositories have incentives to provide the kinds of services that those libraries need to carry out their missions (at least, if the libraries know to ask for them).

Increasingly, though, people are using new kinds of repositories that have little or no connection to traditional libraries. Some of these repositories are on their users’ own computers– their digital music collection and photo library, managed by programs like iTunes, iPhoto, and Picasa. Some of these repositories are on Internet sites like YouTube, Flickr, Google Docs and Google Base, and the various WikiMedia sites. We often don’t think of all of these as “repositories”, but that’s how people are using them: to manage and provide access to information in a stable way, potentially over a long period of time.

I’m not using “repository” here to mean just “glorified filesystem or website”. The everyday repositories I mention above typically put substantial effort into managing metadata, supporting discovery, providing for access control (and often backup and version control), and supporting long-term access and use of the content. They tend to do all these things much more quietly and unobtrusively than the repositories typically designed for and marketed to libraries, but that’s a feature, not a bug. We who work in research libraries need to consider these “repositories for everybody” very carefully. A lot of the digital content that libraries will want to include in our own collections will come out of those repositories. And those repositories can potentially teach us a lot about how to design and run our own.

That’s one big reason why I want to discuss my library’s strategic thinking about repositories in open forums like this one. True, the Penn Libraries don’t have exactly the same uses and needs for repositories as other people and groups. But I think there are a lot of repository issues where we and many others share common interests, or have common questions we all need to answer. Over a series of posts, I hope to discuss repository purposes, infrastructure, technologies, ingest, workflow, labor allocation, lifecycles, legal concerns, integration, policy, and community, all of which are relevant to our repository plans. The strategies and issues most salient for Penn may or may not be the same as yours. But if repositories matter to you, I hope that discussing our issues in a broader context will give you useful things to think about for your own situation. And I hope that we will learn from you as well.

Lots of other people have already written thoughtfully on repositories. I hope to reuse (a politer word for steal) and build on their ideas wherever I can. A good introduction to many of the issues can be found at JISC’s Repository Support Project, a website to help institutions planning repositories, starting from “What is a repository, anyway?” and working from there. (It’s not a given, by the way, that libraries should always run their own repositories for their digital content– but more on that later.)

Repository planners should be familiar with both the theory and practice of repositories. You don’t have to know all the details of the OAIS reference model, for instance, but it’s helpful to know the general principles it sets out, both for issues to think about in running a repository over a long term, and for a conceptual vocabulary for understanding and interacting with other repository initiatives. Likewise it helps to at least be conversant with standard metadata schemas, protocols, recommended procedures, and the like. But you also very much need to know how repositories are working, or not working, in practice. The JISC site I mentioned earlier has an interesting case studies section, where folks who have run repositories describe their experiences, and how they may have differed from expectations. Some repository managers also run blogs where they talk about their day-to-day experiences with repositories, good and bad. Les Carr’s RepositoryMan and Dorothea Salo’s Caveat Lector are two blogs that I find must-reads, for keeping track of new developments repository maintainers can use and practical problems that repository planners can’t afford to ignore.

Future installments in this series will be posted under the “repositories” category. In the meantime, if you’re interested in these issues, I recommend you check out the resources above. And I’d be very interested in hearing about particular issues that should be discussed here.
