Open catalog APIs and data: ALA presentation notes posted

I’ve now posted my materials for the two panels I participated in at ALA Midwinter.

I have slides available for “Opening the ILS for Discovery: The Digital Library Federation’s ILS-Discovery Interface Recommendations”, a presentation for LITA’s Next Generation Catalog Interest Group, where I gave an overview of the recommendations and their use.  At the same session, Beth Jefferson of BiblioCommons talked about some of the social and legal issues of sharing user content in library catalogs and other discovery applications.

And I have the slides and remarks I prepared for “Open Records, Open Possibilities”, a presentation for the ALCTS panel on shared bibliographic records and the future of WorldCat.  In that one, I argue for more open access to bibliographic records, showing some of the benefits and sustainability strategies of open access models.

Karen Calhoun has also posted the slides from her presentation at that panel.  Peter Murray also presented; I haven’t yet found his slides online, but he’s blogging about what he said.  The fourth panelist, Brian Schottlaender, didn’t present slides, but instead gave thoughtful summaries and follow-on questions to some of the points the rest of us made.  From the audience, Norman Oder of Library Journal took notes and then wrote a useful report on the session.

I’d like to thank the organizers of these sessions, Sharon Shafer and Charles Wilt, for inviting me to speak, and my co-presenters for sharing their ideas and viewpoints.

Neil Gaiman wins Newbery medal; more Newbery honorees go online

I just got back from a whirlwind trip to Denver for ALA Midwinter.  While I was there, they announced the winner of this year’s Newbery medal: Neil Gaiman’s The Graveyard Book.  I’ve been hoping to get around to this book– but if it’s anywhere near as well-written as Gaiman’s other juvenile titles (like Coraline), the Newbery committee chose well.  (You can hear an interview with Neil, and an excerpt from the book, on NPR’s website.)  Congratulations to Neil, and to the other authors who won Newbery honors and ALA’s other awards for children’s books this year.

When I blogged about last year’s Newbery awards, I noted that most of the 1922 medalists and honorees were online, and that about a dozen later Newbery honorees were also out of copyright and could go online.  Since then, my partner-in-bookery Mary has found and digitized many of those later books for a special Celebration of Women Writers exhibit, “Newbery Honor Books and Medal Winners by Women, 1922-1964”.  You can also find these and other online prize-winners from my Prize Winning Books Online page.

(As I mentioned in a previous post, I went out to ALA Midwinter to give a couple of talks on the future of library catalog interfaces and data.  I’ll have a post with pointers to slides and notes for those talks shortly.)

Repository services, Part 2: Supporting deposit and access

A couple of days ago, I talked about how we provided multiple repository services, and why an institutional scholarship repository needs to provide more than just a place to store stuff.  In this post, I’ll describe some of the useful basic deposit and access services for institutional scholarly repositories (IRs).

The enumeration of services in this series is based in part on discussions I’ve had with our scholarly communications librarian, Shawn Martin, but any wrong-headed or garbled statements you find here can be laid at my own feet.  (Whereupon I can pick them up, smooth them out, and find the right head for them.)


One of the major challenges of running an institutional repository is filling it up with content: finding it, making sure it can go in, and making sure it goes in properly, in a manageable format, with informative metadata.  Among other things, this calls for:

  • Efficient, flexible, user-friendly deposit workflows. Most of your authors will not bother with anything that looks like it’s wasting their time.  And you shouldn’t waste your staff’s time either, or drive them mad, with needlessly tedious deposit procedures they have to do over and over and over and over again.
  • Conversion to standard formats on ingestion. Word processing documents, and other formats tied to a particular software product, have a way of becoming opaque and unreadable a few years after the vendor has moved on to a new version, a new product, or that dot-com registry in the sky.  Our institutional repository, for instance, converts text documents to PDF on ingestion, which both helps preserve them and ensures wide readability.  (PDF is an openly specified format, readable by programs from many sources, available on virtually all kinds of computers.)
  • Journal workflows. Much of what our scholars publish is destined for scholarly journals, which in turn are typically reviewed and edited by those scholars.  Letting scholars review, compile, and publish those journals directly in the repository can save their time, and encourage rapid, open electronic access.   (And you don’t have to go back and try to get a copy for your repository when it’s already in the repository.)  Our BePress IR software has journal workflows and publication built into it.  Alternatively, specialized journal editing and publishing systems, such as Open Journal Systems, also serve as repositories for their journal content.
  • Support for automated submission protocols such as SWORD. Manual repository deposit can be tedious and error-prone, especially if there are multiple repositories that want your content (such as a funder-mandated repository, your own institutional repository, and perhaps an independent subject repository).  Manual deposit also often wastes people’s time re-entering information that’s already available online.  If you can work with an automated protocol that can automatically put content into a repository, though, things can get much better: you can support multiple simultaneous deposits, ingestion procedures designed especially for your own environment that use the automated protocol for deposit, and automated bulk transfer of content from one repository to another.  SWORD is an automated repository deposit protocol that is starting to be supported by various repositories. (BePress does not yet support it, but we’re hoping they will soon).
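To make the deposit mechanics a bit more concrete, here’s a rough sketch of what a SWORD (version 1.3 style) deposit looks like on the wire: an HTTP POST of a packaged item to a repository collection URL, with headers describing the package.  The URL, credentials, and packaging identifier below are placeholders for illustration, not our actual setup, and a real client would also parse the Atom entry the server returns.

```python
import base64
import urllib.request

def build_sword_deposit(collection_url, package_bytes, filename,
                        username, password,
                        packaging="http://purl.org/net/sword-types/METSDSpaceSIP"):
    """Build (but don't send) a SWORD 1.3-style deposit request:
    an HTTP POST of a packaged item to a repository collection URL."""
    credentials = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        collection_url,
        data=package_bytes,
        method="POST",
        headers={
            "Content-Type": "application/zip",
            "Content-Disposition": f"filename={filename}",
            "X-Packaging": packaging,  # tells the server how the zip is structured
            "X-No-Op": "true",         # dry run: ask the server to validate only
            "Authorization": f"Basic {credentials}",
        })

# Hypothetical deposit of a zipped paper to a (made-up) collection URL:
req = build_sword_deposit(
    "https://repository.example.edu/sword/collection/papers",
    b"...zip bytes...", "paper.zip", "depositor", "secret")
```

The point of a standard like this is that the same small client can feed a funder’s repository, your institutional repository, and a subject repository from one deposit action.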

From a practical standpoint, if you want a significant stream of content coming into your repository, you’ll probably need to have a content wrangler as well: someone who makes sure that authors’ content is going into the repository as intended. (In practice, they often end up doing the deposit themselves.)


You want it to be easy and enjoyable for readers to explore your site and find content of interest to them.  Here are a few important ways to enable discovery:

  • Search of full text and/or metadata, either over the repository as a whole, or over selected portions of the repository.  Full text search can be simple and turn up lots of useful content that might not be discovered through metadata search alone.  More precise, metadata-based searches can also be important for specialized needs.   Full text indexing is not always available (in some cases, you might only have page images), but it should be supported where possible.
  • Customization of discovery for different communities and collections.  Different communities may have different ways of organizing and finding things.  Some communities may want to organize primarily by topic, or author, or publication type, or date.  Some may have specialized metadata that should be available for general and targeted searching and browsing.  If you can customize how different collections can be explored, you can make them more usable to their audiences.
  • Aggregator feeds using RSS or Atom, so people can keep track of new items of interest in their favorite feed readers.  This needs to exist at multiple levels of granularity.   Many repositories give RSS feeds of everything added to the repository, but most people will be more interested in following what’s new from a particular department or author, or in a particular subject.
  • Search engine friendliness. Judging from our logs, most of the downloads of our repository papers occur not via our own searching and browsing interfaces, but via Google and other search engines that have crawled the repository.  So you need to make sure your repository is set up to make it easy and inviting for search engines to crawl.  Don’t hide things behind Flash or JavaScript unless you don’t want them easily found.  Make sure your pages have informative titles, and the site doesn’t require excessive link-clicking to get to content.  You also need to make sure that your site can handle the traffic produced by search-engine indexers, some of which can be quite enthusiastic about frequently crawling content.
  • Metadata export via protocols like OAI-PMH.  This is useful in a number of ways:  It allows your content to be indexed by content aggregators; it lets you maintain and analyze your own repository’s inventory; and, in combination with automated deposit protocols like SWORD (and content aggregation languages like OAI-ORE), it may eventually make it much simpler to replicate and redeposit content in multiple repositories.
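For readers who haven’t worked with OAI-PMH: an aggregator harvests a repository by fetching URLs like `?verb=ListRecords&metadataPrefix=oai_dc` and following resumption tokens to page through the results.  Here’s a minimal sketch of parsing one page of such a response; the sample response below is a trimmed-down illustration (the identifier and title are made up), though the namespace URIs are the real ones fixed by the OAI-PMH and Dublin Core specs.

```python
import xml.etree.ElementTree as ET

# These namespace URIs are fixed by the OAI-PMH and Dublin Core specs
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def parse_list_records(xml_text):
    """Pull (identifier, title) pairs and the resumption token out of a
    ListRecords response; the token, if present, fetches the next page."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(OAI + "record"):
        records.append((rec.findtext(f"{OAI}header/{OAI}identifier"),
                        rec.findtext(f".//{DC}title")))
    return records, root.findtext(f".//{OAI}resumptionToken")

# A trimmed-down example of what one page of a ListRecords response looks like:
SAMPLE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
 <ListRecords>
  <record>
   <header><identifier>oai:repository.example.edu:123</identifier></header>
   <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
     <dc:title>A Sample Paper</dc:title>
    </oai_dc:dc>
   </metadata>
  </record>
  <resumptionToken>page-2-token</resumptionToken>
 </ListRecords>
</OAI-PMH>"""

records, token = parse_list_records(SAMPLE)
```

A full harvester would simply loop: request, parse, and re-request with the returned resumption token until the server stops sending one.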


  • Persistent URIs for items. Content is easier to find and cite when it doesn’t move away from its original location.  You would think it would be well known that cool URLs don’t change, but I still find a surprisingly large number of documents put in content management systems where I know the only visible URIs will not survive the next upgrade of the system, let alone a migration to a new platform.  If possible, the persistent URI should be the only URI the user sees.  If not, the persistent URI should at least be highly visible, so that users link to it, and not the more transient URI that your repository software might use for its own purposes.
  • An adequate range of access control options for particular collections and items.  I’m all in favor of open access to content, but sometimes this is not possible or appropriate.  Some scholarship includes information that needs to be kept under wraps, or in limited release, temporarily or permanently.  We want to still be able to manage this content in the repository when appropriate.
  • Embargo management is an important part of  access control.   In some cases, users may want to keep their content limited-access for a set time period, so that they can get a patent, obey a publishing contract, or prepare for a coordinated announcement.  Currently, because of BePress’ limited embargo support, we sit on embargoed content and have to remember to put it into the repository, or manually turn on open access, when the embargo ends.  It’s much easier if depositors can just say “keep this limited access until this date, and then open it up,” and the repository service handles matters from there.
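The embargo logic a repository should handle for depositors is very simple, which is part of what makes it frustrating to manage by hand.  As a sketch (a deliberately minimal one, not BePress’ or any particular product’s implementation), the access decision is just a date comparison recorded once at deposit time:

```python
from datetime import date

def is_open(embargo_ends, today=None):
    """An item with no embargo, or whose embargo date has passed, is open;
    otherwise access stays limited until the recorded date arrives."""
    today = today or date.today()
    return embargo_ends is None or today >= embargo_ends

# An article embargoed until 2010-01-01, checked mid-2009, is still closed:
still_closed = not is_open(date(2010, 1, 1), today=date(2009, 6, 1))
```

With a check like this running in the repository itself, nobody has to remember to flip the switch when the embargo ends.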

That may seem like a lot to think about, but we’re not done yet.  In the next part, I’ll talk about services for managing content in the IR, including promoting it, letting depositors know about its impact, and preserving it appropriately.

Repository services, Part 1: Galleries vs. self-storage units

Back near the start of my occasional series on repositories, I noted that we had not just one but a number of repositories, each serving different purposes.

In tight budgetary times, this approach might seem questionable.  Right now, we’re putting up a new repository structure (in addition to our existing ones) to keep our various digitized special collections and make them available for discovery and use.  We hope this will make our digital special collections more uniformly manageable, and less costly to maintain.

At the same time, we’re continuing to maintain an institutional repository of our scholars’ work on a completely different platform, one for which we pay a subscription fee annually.  I’ve heard more than one person ask “Well, once our new  repository is up, can’t we just move the existing institutional repository content into it, and drop our subscription?”

To which I generally answer: “We might do that at some point, but right now it’s worth maintaining the subscription past the opening date of our new repository.”  The basic reason is that the two repositories not only have different purposes, but also, at least in their current uses, support very different kinds of interactions, with different kinds of audiences.

The interactions we need initially for the repository we’re building for our special collections are essentially internal ones.  Special collections librarians create (or at least digitize) a thematic set of items, give them detailed cataloging, and deposit them en masse into the collection.  The items are then exposed via machine interfaces to our discovery applications, which then let users find and interact with the contents in ways that our librarians think will best show them off.

The repository itself, then, can work much like a self-storage unit.  Every now and then we move in a bunch of stuff, and then later we bring it out into a nicer setting when people want to look at it.  Access, discovery, and delivery are built on top of the repository, in separate applications that emphasize things like faceted browsing, image panning and zooming, and rare book page display and page turning.

Our institutional repository interacts with our community quite differently.  Here, the content is created by various scholars who are largely outside the library, who may deposit items bit by bit whenever they get around to it (or when library staff can find the time to bring in their content).  They want to see their work widely read, cited, and appreciated.  They don’t want to spend more time than they have to putting stuff in– they’ve got work to do– and they want their work quickly and easily accessible.  And they’d like to know when their work is being viewed.  In short, they need a gallery, not just a self-storage unit.  They want something that lets them show off and distribute their work in elegant ways.

Our institutional repository applications, bundled with the repository, thus emphasize things like full text search and search-engine openness, instant downloads of content, and notification of colleagues uploading and downloading papers.

We could in theory build similar applications ourselves, and layer them on top of the same “self-storage” repository structure we use for special collections.   (Museums likewise often have their exhibit galleries literally on top of the bulk of their collections, which are kept in basements or other compact storage areas.)  But it would take us a while to build the applications we need, so for now we see it as a better use of our resources to rely on the applications bundled with our institutional repository service.

(An alternative, of course, would be to see if an existing open source application would serve our needs.  I hope to talk more about open source repository software in a future post, but we haven’t to date decided to run our institutional repository that way.)

I hope I’ve at least made it clear that for a viable institutional repository, you need quite a bit more than just “a place to put stuff”: you need a suite of services that support its purposes.  In Part 2,  I’ll enumerate some of the specific services that we need or find useful in our institutional scholarship repository.

Public Domain Day 2009: Freeing the libraries

In many countries, January 1 isn’t just the start of a new year: it’s the time when a new year’s worth of works are welcomed into the public domain.  As I noted in last year’s Public Domain Day post, countries that use the copyright terms specified by the Berne Convention bring works into the public domain on the first January 1 that’s more than 50 years after the death of their authors.  So today, most works by authors who died in 1958 join the public domain in those countries.  This page lists many such authors, and their books.  Some of the more notable names include James Branch Cabell, Rachel Crothers, Dorothy Canfield Fisher, C. M. Kornbluth, Mary Roberts Rinehart, Robert W. Service, and Ralph Vaughan Williams.

Many countries, however, have extended their copyright terms in recent years.   Most European Union countries, for instance, took 20 years worth of works out of the public domain in the 1990s when the EU mandated that copyright terms be extended to run for the life of the author plus 70 years.  This year, they come a little bit closer to recovering their lost public domain, welcoming back works by authors who died in 1938, including people like Karel Capek, Zona Gale, Georges Melies, Constantin Stanislavsky, Osip Mandelstam, Owen Wister, and Thomas Wolfe.

In some other countries, very little is entering the public domain today.  Here in the US, we’re midway through a freeze on most copyright expirations, resulting from a term extension enacted in 1998.  We now have 10 years to go until copyrights on published works start expiring again due to age. (By 1998, all works copyrighted prior to 1923 had entered the public domain.  Remaining copyrights from 1923 are scheduled to expire at the start of 2019.) Some special interests would like to make copyright terms even longer (even “forever less one day”, as Congresswoman Mary Bono requested on behalf of the movie industry).  Those of us who value the public domain will need to ensure that it is not further eroded, and that copyrights are allowed to expire on schedule.  This is in keeping with the intents of the country’s founders, who specified in the Constitution that copyrights were meant to last only for “limited times”.

But even though few works are entering the public domain in the US today, many more works are now freely and easily available to the public today than a year ago.  Much of this is thanks to initiatives like Google Books and the Open Content Alliance, which are digitizing books and other works that libraries have acquired and preserved.  Many of the digitized works are in the public domain, and these projects have been making them freely readable and downloadable when they can confirm their public domain status.  And now that Google has negotiated a settlement with book publisher and author groups,  they plan to be more proactive about identifying and releasing public domain works, including works published after 1922 that are out of copyright (but are not so easily identified as public domain as older books are).

These works have been part of the public domain for years, but when they were simply sitting on the shelves of a few research libraries, they weren’t doing the public much good.  Once they’re digitized, though, and their digitizations and descriptions are shared online, they can be much more easily found, read, adapted, and reused by anyone online.  By opening up the treasure trove of public domain expression that libraries have preserved, we magnify its value.  When libraries share their intellectual endowment, they better fulfill their mission to bring art and knowledge to readers, and make it easy for readers to learn, build on, and be enriched by this knowledge.

I wish I could say that libraries always acted with this understanding.   Unfortunately, all too often libraries and affiliated organizations have been resistant or slow to share the information they compile and control.  The effective value of what libraries offer has been significantly diminished as a result.

Sometimes libraries simply have not moved as quickly as they could.  The Copyright Office has long provided online access to copyright records, but only from 1978 onward.  I started digitizing older copyright records over 10 years ago, and a few libraries started doing so as well, but many older records have not yet been publicly digitized, though they’re available in printed form in many government depository libraries.  These records can make it much easier to verify public domain status of many works, and then make them available to the public.

Sometimes libraries and affiliated organizations put up their own restrictions on sharing information they already have in digital form.  I had a series of posts in November, for instance, criticizing OCLC’s newly revised restrictions on sharing and reusing catalog records that libraries have contributed to WorldCat, the largest shared cataloging resource for libraries. The data in WorldCat can be the basis for many useful and innovative applications to direct readers towards useful information resources, and information about those resources.  And in December, an extremely useful downloadable semantic web representation of Library of Congress subject headings, the basis for information discovery applications like this one, was ordered taken down by LC administrators.

In the new year, I hope to encourage libraries to be more open in sharing their knowledge resources (and to support partners that also enable such openness).  My gifts to the public domain this year are in that spirit.

The first one, dedicated immediately to the public domain, is the start of a simple, free decimal classification system, intended to be reasonably compatible with certain existing library standards, but freely available and usable by anyone for any purpose.  (I created this after someone requested such a system for their institutional repository, and I found that the current Dewey Decimal system is subject to usage restrictions based on copyright and trademark.)  While this is more of a proof of concept than something I expect libraries to adopt in great numbers, I hope it inspires further open sharing of library metadata and standards.

Also, as I did last year, I’m dedicating another year’s worth of copyrights that I can control, this time from 1994, to the public domain, so that they follow the initial 14 year copyright term originally prescribed by this country’s founders.  These copyrights include the first versions of Banned Books Online, and the first database-driven versions of The Online Books Page.  Versions of these resources from 1994 and earlier are now given to the public domain.

I hope readers find value in these, and all the other public domain and freely licensed works they can enjoy and use online. Happy Public Domain Day!

Update: See also the Public Domain Day posts at Creative Commons and the Center for Internet and Society.