Repository services, Part 2: Supporting deposit and access

A couple of days ago, I talked about how we provided multiple repository services, and why an institutional scholarship repository needs to provide more than just a place to store stuff.  In this post, I’ll describe some of the useful basic deposit and access services for institutional scholarly repositories (IRs).

The enumeration of services in this series is based in part on discussions I’ve had with our scholarly communications librarian, Shawn Martin, but any wrong-headed or garbled statements you find here can be laid at my own feet.  (Whereupon I can pick them up, smooth them out, and find the right head for them.)

Ingestion:

One of the major challenges of running an institutional repository is filling it up with content: finding it, making sure it can go in, and making sure it goes in properly, in a manageable format, with informative metadata.  Among other things, this calls for:

  • Efficient, flexible, user-friendly deposit workflows. Most of your authors will not bother with anything that looks like it’s wasting their time.  And you shouldn’t waste your staff’s time either, or drive them mad, with needlessly tedious deposit procedures they have to do over and over and over and over again.
  • Conversion to standard formats on ingestion. Word processing documents, and other formats tied to a particular software product, have a way of becoming opaque and unreadable a few years after the vendor has moved on to a new version, a new product, or that dot-com registry in the sky.  Our institutional repository, for instance, converts text documents to PDF on ingestion, which both helps preserve them and ensures wide readability.  (PDF is an openly specified format, readable by programs from many sources, available on virtually all kinds of computers.)
  • Journal workflows. Much of what our scholars publish is destined for scholarly journals, which in turn are typically reviewed and edited by those scholars.  Letting scholars review, compile, and publish those journals directly in the repository can save them time and encourage rapid, open electronic access.  (And you don’t have to go back and try to get a copy for your repository when it’s already there.)  Our BePress IR software has journal workflows and publication built into it.  Alternatively, specialized journal editing and publishing systems, such as Open Journal Systems, also serve as repositories for their journal content.
  • Support for automated submission protocols such as SWORD. Manual repository deposit can be tedious and error-prone, especially if there are multiple repositories that want your content (such as a funder-mandated repository, your own institutional repository, and perhaps an independent subject repository).  Manual deposit also often wastes people’s time re-entering information that’s already available online.  If your repository accepts deposits through an automated protocol, though, things can get much better: you can support multiple simultaneous deposits, ingestion procedures designed especially for your own environment that use the protocol for deposit, and automated bulk transfer of content from one repository to another.  SWORD is an automated repository deposit protocol that is starting to be supported by various repositories.  (BePress does not yet support it, but we’re hoping they will soon.)
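
To make the "multiple simultaneous deposits" idea concrete, here is a minimal Python sketch that prepares the same packaged item for deposit in two repositories.  It only builds the HTTP requests; it does not send them.  The collection URLs are hypothetical, and the header names and packaging identifier reflect my reading of the later SWORD 2.0 profile, so treat them as assumptions and check them against the specification and your repository's service document.

```python
from urllib.request import Request

# Hypothetical collection endpoints (assumptions for illustration only).
# A real SWORD server advertises its collections in a service document.
COLLECTION_URLS = [
    "https://ir.example.edu/sword/collection/papers",
    "https://funder.example.org/sword/collection/grants",
]

# Packaging identifier for a plain zip file, as assumed from the SWORD 2.0 profile.
SIMPLE_ZIP = "http://purl.org/net/sword/package/SimpleZip"


def build_deposit_requests(package: bytes, filename: str,
                           collection_urls: list[str]) -> list[Request]:
    """Build one HTTP POST per repository for the same packaged content."""
    requests = []
    for url in collection_urls:
        req = Request(url, data=package, method="POST")
        req.add_header("Content-Type", "application/zip")
        req.add_header("Content-Disposition", f"attachment; filename={filename}")
        req.add_header("Packaging", SIMPLE_ZIP)
        requests.append(req)
    return requests


# b"PK..." is placeholder data standing in for the bytes of a real zip package.
requests = build_deposit_requests(b"PK...", "article.zip", COLLECTION_URLS)
for req in requests:
    print(req.get_method(), req.full_url)
```

Actually sending each request (for example with `urllib.request.urlopen`) would also need credentials and error handling, which depend on the repository and are left out here.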

From a practical standpoint, if you want a significant stream of content coming into your repository, you’ll probably need to have a content wrangler as well: someone who makes sure that authors’ content is going into the repository as intended. (In practice, they often end up doing the deposit themselves.)

Discovery:

You want it to be easy and enjoyable for readers to explore your site and find content of interest to them.  Here are a few important ways to enable discovery:

  • Search of full text and/or metadata, either over the repository as a whole, or over selected portions of the repository.  Full text search can be simple and turn up lots of useful content that might not be discovered through metadata search alone.  More precise, metadata-based searches can also be important for specialized needs.   Full text indexing is not always available (in some cases, you might only have page images), but it should be supported where possible.
  • Customization of discovery for different communities and collections.  Different communities may have different ways of organizing and finding things.  Some communities may want to organize primarily by topic, or author, or publication type, or date.  Some may have specialized metadata that should be available for general and targeted searching and browsing.  If you can customize how different collections can be explored, you can make them more usable to their audiences.
  • Aggregator feeds using RSS or Atom, so people can keep track of new items of interest in their favorite feed readers.  These feeds need to exist at multiple levels of granularity.  Many repositories give RSS feeds of everything added to the repository, but most people will be more interested in following what’s new from a particular department or author, or in a particular subject.
  • Search engine friendliness. Judging from our logs, most of the downloads of our repository papers occur not via our own searching and browsing interfaces, but via Google and other search engines that have crawled the repository.  So you need to make sure your repository is set up to make it easy and inviting for search engines to crawl.  Don’t hide things behind Flash or JavaScript unless you don’t want them easily found.  Make sure your pages have informative titles, and that the site doesn’t require excessive link-clicking to get to content.  You also need to make sure that your site can handle the traffic produced by search-engine indexers, some of which can be quite enthusiastic about frequently crawling content.
  • Metadata export via protocols like OAI-PMH.  This is useful in a number of ways:  It allows your content to be indexed by content aggregators; it lets you maintain and analyze your own repository’s inventory; and, in combination with automated deposit protocols like SWORD (and content aggregation languages like OAI-ORE), it may eventually make it much simpler to replicate and redeposit content in multiple repositories.
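
As a rough illustration of the OAI-PMH metadata export described above, the following Python sketch builds a `ListRecords` request URL and pulls identifiers and titles out of a response.  The base URL, set name, and sample records are made up for the example; the `verb`, `metadataPrefix`, `set`, and `from` parameters and the XML namespaces come from the OAI-PMH 2.0 and Dublin Core specifications.  It ignores resumption tokens and error responses, so it is a sketch rather than a complete harvester.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, set_spec=None, from_date=None):
    """Build a ListRecords request for simple Dublin Core metadata."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    if set_spec:
        params["set"] = set_spec      # restrict to one collection or department
    if from_date:
        params["from"] = from_date    # only records changed since this date
    return f"{base_url}?{urlencode(params)}"


def parse_records(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier",
                                     default="", namespaces=NS)
        title = record.findtext(".//dc:title", default="", namespaces=NS)
        records.append((identifier, title))
    return records


# A hand-written sample response (hypothetical records, trimmed to essentials).
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:ir.example.edu:papers-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A Sample Paper</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:ir.example.edu:papers-2</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Another Sample Paper</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

url = list_records_url("https://ir.example.edu/oai",
                       set_spec="physics", from_date="2009-01-01")
records = parse_records(SAMPLE_RESPONSE)
print(url)
for identifier, title in records:
    print(identifier, "|", title)
```

The same list of identifiers and titles can serve as a simple inventory of your own repository, which is one of the uses mentioned above.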

Access:

  • Persistent URIs for items. Content is easier to find and cite when it doesn’t move away from its original location.  You would think it would be well known that cool URLs don’t change, but I still find a surprisingly large number of documents put in content management systems where I know the only visible URIs will not survive the next upgrade of the system, let alone a migration to a new platform.  If possible, the persistent URI should be the only URI the user sees.  If not, the persistent URI should at least be highly visible, so that users link to it, and not the more transient URI that your repository software might use for its own purposes.
  • An adequate range of access control options for particular collections and items.  I’m all in favor of open access to content, but sometimes this is not possible or appropriate.  Some scholarship includes information that needs to be kept under wraps, or in limited release, temporarily or permanently.  We want to still be able to manage this content in the repository when appropriate.
  • Embargo management is an important part of access control.  In some cases, users may want to keep their content limited-access for a set time period, so that they can get a patent, obey a publishing contract, or prepare for a coordinated announcement.  Currently, because of BePress’ limited embargo support, we sit on embargoed content and have to remember to put it into the repository, or manually turn on open access, when the embargo ends.  It’s much easier if depositors can just say “keep this limited access until this date, and then open it up,” and the repository service handles matters from there.
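
The embargo behavior described above amounts to a date check that the repository runs on the depositor's behalf.  Here is a minimal Python sketch of that logic; the `Item` class, its field names, and the sample items are hypothetical and do not describe how BePress or any other repository platform actually stores embargoes.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Item:
    """Hypothetical repository item; embargo_until of None means no embargo."""
    title: str
    embargo_until: Optional[date] = None


def is_open(item: Item, today: date) -> bool:
    """An item is open if it has no embargo or the embargo date has arrived."""
    return item.embargo_until is None or today >= item.embargo_until


def newly_released(items: list[Item], last_run: date, today: date) -> list[Item]:
    """Items whose embargo ended after the last check, up to and including today."""
    return [
        item for item in items
        if item.embargo_until is not None and last_run < item.embargo_until <= today
    ]


items = [
    Item("Open working paper"),
    Item("Patent-pending report", embargo_until=date(2010, 6, 1)),
    Item("Dissertation under contract", embargo_until=date(2011, 1, 1)),
]

today = date(2010, 6, 15)
open_titles = [item.title for item in items if is_open(item, today)]
released = newly_released(items, last_run=date(2010, 5, 31), today=today)
print(open_titles)
print([item.title for item in released])
```

Running a check like `newly_released` on a schedule is what lets the repository, rather than a staff member's memory, open content up when the embargo ends.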

That may seem like a lot to think about, but we’re not done yet.  In the next part, I’ll talk about services for managing content in the IR, including promoting it, letting depositors know about its impact, and preserving it appropriately.

About John Mark Ockerbloom

I'm a digital library strategist at the University of Pennsylvania, in Philadelphia.
This entry was posted in discovery, formats, repositories.