Everybody's Libraries

Lots of conversation keeps stuff sustainable

Posted on March 23, 2010 by John Mark Ockerbloom

Among the hats I wear at my place of work is that of LOCKSS cache administrator. LOCKSS is a useful distributed preservation system built around the principle “Lots of copies keep stuff safe” (whose initials give the system its name). The idea is that, with the cooperation of publishers, a bunch of libraries each harvest copies of selected online content, and keep backups on our own LOCKSS caches, which are hooked up to local library proxy services. Then, if the material ever becomes inaccessible from the publisher, our users will automatically be routed to our local copies. Each LOCKSS cache also periodically checks with other LOCKSS caches to ensure that our copies are still in good shape, and to repair or replace copies that have been lost or damaged. (Various security features protect against leaks of restricted content, or unauthorized revisions of content.)

LOCKSS is open source software that runs on commodity hardware. It was originally envisioned to run virtually automatically. As Chris Dobson described the ideal in a 2003 Searcher article, “Take a computer a generation past its prime…. Hook it up to the Internet and put it in a closet. Stick in the LOCKSS CD-ROM and boot it up. Close the closet door.” And then presumably walk away and forget about it.

Of course, it’s not that simple in practice, particularly if your library is proactive about its preservation strategy. The thing about preservation at scale is there’s always something that needs attention. It might be something technical, or content-related, or planning-related, but preserving a growing collection requires ongoing thought. And if you want to think as clearly and sensibly as you can, you’ll want to collaborate.

Right now, for instance, I’m trying to get my cache to harvest the full run of a journal that’s just been made available for LOCKSS harvesting, where we hope to provide post-cancellation access through LOCKSS. Someone at Stanford just gave me a useful tip on how to give this journal priority over the other volumes I’ve got queued up for harvest. Unfortunately, I can’t try it out until I get my cache back up after it failed to reboot cleanly after a power failure. While I wait to hear back instructions about how best to remedy this, I wonder whether switching to a new Linux-based version of LOCKSS might make such operating system-level problems easier to deal with. But it would be useful to hear from folks who are running that version to see what their experience has been.

Meanwhile, we’re wondering how best to approach new publishers who have content that our bibliographers would like to preserve via LOCKSS. Our special collections folks wonder whether we should preserve some of our own home-grown content via a private LOCKSS network. I’m also doing some ongoing monitoring and testing of our LOCKSS cache’s behavior (some of which I’ve reported on earlier), and would be interested in knowing if others are seeing some of the same kinds of things that I see on the cache I administer.

In short, there are a lot of things to think about, when LOCKSS plays a significant role in a preservation plan. And a lot of the issues I’ve mentioned above are ones that others may be thinking about as well. So let’s talk about them. As the LOCKSS group has said, “”A vibrant, active, and engaged user community is key to the success of Open-Source efforts like LOCKSS.”

One thing you need for such an engaged community is a forum for them to talk to each other. As it turns out, the LOCKSS group at Stanford tell me they created a LOCKSS Forum mailing list a while back, but I haven’t yet seen it publicized. Its information page is at https://mailman.stanford.edu/mailman/listinfo/lockss-forum . (Currently, archived email messages are not visible on the open web, though this may change in the future.) If you’re interested in talking with others about how you use or might use LOCKSS to preserve access to digital content, I invite you to sign up and help get the conversation going.

Posted in libraries, people, preservation, sharing | Comments Off

Implementing interoperability between library discovery tools and the ILS

Posted on March 9, 2010 by John Mark Ockerbloom

Last June I gave a presentation in a NISO webinar about the work a number of colleagues and I did for the Digital Library Federation to recommend standard interfaces for Integrated Library Systems (the systems that keep track of our library’s acquisitions, catalog, and circulation) to support a wide variety of tools and applications for discovery. Our “ILS-DI” recommendation was published in 2008, and encompassed a number of functions that some ILS’s supported. But it also included many functions that were not generally, or uniformly, supported by ILS’s of the time. That’s still the case today.

As I said in my presentation last June, “If we look at the ILS-DI process as a development spiral, we’ve moved from a specification stage to an implementation stage.” My hope has been that vendors and other library software implementers would implement the basics of what we recommended– as many agreed to— and the library community could progress from there. This often takes longer to achieve than one might hope.

But I’m happy to report that the Code4lib community is now picking up the ball. At this month’s Code4lib conference, a group met to discuss “collaboratively develop[ing] a middleware infrastructure” to link together ILS’s and discovery tools, based on the work done by the DLF’s ILS-DI group and by the developers of systems like Jangle and XC. The middleware would help power discovery applications like Blacklight, VuFind, Summon, WorldCat Local, and whatever else the digital library community might invent.

I wasn’t at the Code4lib conference, but the group that met there to kick off the effort has impressive collective expertise and accomplishments. It includes several members of the DLF’s ILS-DI group, as well as the lead implementors of several relevant systems. Roy Tennant from OCLC Research is coordinating the initial activity, and Emily Lynema of the ILS-DI group has converted the Google groups space used by the ILS-DI group for the new effort.

And you’re welcome to join too, if you’d like to help out or learn more. “This is an open, collaborative effort” is how Roy put it in the announcement of the new initiative. Due to some prior commitments, I’ll personally be watching more than actively participating, at least to begin with, but I’ll be watching with great interest. To find out more, and to get involved, see the Google Group.

Posted in architecture, discovery | 1 Comment

Shedding light on images in the public domain

Posted on February 8, 2010 by John Mark Ockerbloom

For years, I’ve regularly gotten requests from authors and publishers for licenses to reproduce images in books listed on The Online Books Page, or included in the local collection of A Celebration of Women Writers. Sometimes these requests relate to copyrighted books that I list but don’t control rights for; in those cases, I do my best to refer the request to the book’s copyright holder. But often, they’re for images in our own collections, from books published over 100 years ago. In those cases, I respond that the image is in the public domain (and our digitization, which adds no originality, is also in the public domain), so no license is necessary or appropriate.

Usually that response receives a thankful reply, sometimes with signs of surprise that an image can be reused without permission. But sometimes I’ll get back a more alarmed reply. “My publisher says I need a license for every image in my book, or I can’t use it,” it might say, followed by a plea for help in tracking down some long-defunct 19th century publisher.

I wish I could say this was an atypical anecdote. But, if you look around the Web, you’ll find that there are huge numbers of historic images– paintings, photographs, figures, and the like– that are behind access barriers, or closed off altogether from online access, when they don’t have to be. Artstor has over a million images of thousands of years of art that you can’t look at unless you’re at an institution that has a subscription. The fine arts image catalog at my own library has over 100,000 digital images, none of which can be seen online by the public outside of Penn, except in thumbnails. Neither Artstor nor Penn want to keep art away from the public; both are nonprofit educational institutions. But clearing images for free public access on a large scale has to date been impractical for these institutions.

Restrictions on images also create holes in other works. For instance, under the proposed Google Books settlement, images in books that might be under copyright would be blanked out unless the rightsholder to the book also asserted they held the rights in the images. These sorts of omissions can cut the heart out of many works. In a recent New Republic article, “For the Love of Culture“, Lawrence Lessig described how a critical table was omitted in an otherwise free article about his daughter’s possible illness, due to rights-clearance issues. “I could not believe that we were this far down the path to insanity already,” he wrote of the incident.

Part of the insanity is that many of these images from our cultural heritage are actually in the public domain. Many people are aware that copyrights prior to 1923 have expired in the US. But so have many copyrights from later in the 20th century. Pre-1964 copyrights generally had to be renewed 28 years after the start of their term, or they would expire. (Exceptions and further details are described here.) But most copyrights were never renewed; and that’s especially true for images.

In 1923, there were copyright registrations for 3,059 works of art, 1,149 scientific and technical drawings, 7,533 photographs, and 11,289 prints and pictorial illustrations, making a total of 23,030 copyright registrations for these classes of image. In 1951, 28 years later, there were 198 copyright renewals for all of these image classes combined. This represents a renewal rate of less than 1%.

We have just completed posting scans that make all active copyright renewals for artwork viewable online. In fact, once we finish scanning one last batch of renewals for maps and for commercial prints (meaning images created for product packaging and promotion) all active copyright renewals for any type of still image will be viewable online. In later years, the number of image copyright renewals grows slightly, but not by much. But the number of images published in those years grows substantially.

Images without a copyright registration of their own might still be under copyright if they were first published as part of a copyrighted book, newspaper, magazine, or other larger work. Fortunately, we have complete online renewal records for those kinds of works too. It becomes much easier to establish the public domain status of a newspaper photograph, for instance, if you know (as I previously revealed) that no newspaper outside New York renewed copyright for any issue published before the end of World War II.

Having copyright renewals online for artwork is an important step towards freeing the public domain in images. But there’s more needed to make copyright clearance practical at a large scale. Putting scanned renewal records into a searchable database (perhaps combined with fair use image thumbnails) will make it easier to find any copyright renewals that might exist for a particular image. (A similar database for book renewals already exists, and there are more book renewals than image renewals.) Making original copyright registrations available as well (as we now have for artwork through 1949, and soon will have for later years) lets us determine when the copyright for an image began, and whether it was renewed in time to prevent it from expiring.

Furthermore, establishing the history and provenance of images will let us determine when unregistered artwork enter the public domain. Registered or not, the copyright to an image created before 1964 began no later than its first US publication, and the copyright for many such images therefore ended after 28 years due to a lack of renewal. And the mostly-frozen American public domain still includes more work each year that was never published before 2003. On Public Domain Day last month, all such work by artists who died in 1939 entered the public domain in the US. (I won’t get now into the rather baroque rules for establishing “publication” of an artwork, but you can determine it if the history of the image is documented.)

So we have a rich treasure trove of images in the public domain that’s been largely buried under presumptions and uncertainties about copyright. By finding and sharing information about their copyrights, we can protect and enjoy these images in the commons of the public domain, where they can be viewed freely, included in new works, and reused in any way we can imagine. If you find this prospect intriguing, I hope you’ll help bring these images to light.

Posted in copyright, sharing | 3 Comments

Every book its libraries: or, Taking care in withdrawal

Posted on January 28, 2010 by John Mark Ockerbloom

The question of when to withdraw materials from libraries has gotten heightened attention lately. Everyday readers may not always realize it, but most libraries get rid of books and other materials on a regular basis. Libraries typically have limited space, but keep acquiring new materials to serve their audience’s needs. As they acquire new materials, they typically make room by getting rid of materials that no longer serve their audience as well; this is variously known as “withdrawing”, “deaccessioning”, or “weeding”.

Some libraries weed more aggressively than others. School and public libraries tend to turn over their collections more quickly than academic research libraries. There’s not much value a middle-schooler can get out of an outdated science book, for instance, compared to a current one. And a public library user looking for a book on how to use their new Windows 7 computer shouldn’t have to wade through stacks clogged with TRS-80 programming guides and the like. You can find amusing anecdotes about books that have outlived their usefulness in these kinds of collections in the blog Awful Library Books, one of the blogs on LISNews’ 10 Librarian Blogs to Read in 2010.

Academic libraries typically don’t weed as aggressively. The larger research libraries aim to have a broad selection of thought on subjects from various points in history, as well as whatever happens to be of current interest. A book on science that no longer reflects current scientific understanding may still be useful for researchers that want to look at the history of science, or at how science interacted with culture at the time. Even the peripheral details can be of interest; for instance the photographs in an obsolete computer guide can tell us what what the computers looked like, and how they were expected to be used. The most interesting aspect of many old periodicals nowadays is often the advertisements, rather than the editorial content.

Especially when they’re digitized, large corpuses can also be of major interest even when the individual items might not be particularly noteworthy. They can help you track the use and evolution of language, for instance, or quash unwarranted patents. I’ve talked before about the great potential of Google Books and similarly comprehensive corpuses.

Even so, research libraries still get rid of materials, or move them to offsite warehouses, when space is short. As more users access materials online instead of print, we often ship out print volumes that have online surrogates. Recently Ithaka published a report called What To Withdraw that recommends gives guidelines for withdrawing materials that are online in sustainable archives (such as Ithaka’s own JSTOR), and that have a few physical copies in print archives somewhere. Doing this responsibly may help many research libraries grow their collections, or repurpose their spaces, in useful ways. Selling particularly valuable items to more appropriate libraries can also help fund additional library acquisition and activity.

Carefully considered, then, withdrawal can greatly benefit libraries and their users. But libraries need to think not only about their own collection’s purposes, but about the systemic risks of individual library collection decisions. For instance, many of the “Awful Library Books” justifiably withdrawn from public libraries might still be of historical research interest to someone. Even if academic research libraries would keep them, many of the books intended for popular or specialized non-academic audiences were not collected by academic libraries in the first place. If all the public libraries with these books simply throw them out, and no copy gets transferred to a library or archive with a longer-term interest, the materials may disappear forever.

Online access, as an alternative to retaining print copies, may not be as reliable as one expects. Recently, the archives of many popular magazines that were available through various subscription databases became part of an exclusive deal from one database vendor. This is likely to raise the costs of access to many libraries, both because they may have to subscribe to a new database to keep providing these magazines, and because the price of the new exclusive bundle is likely to increase. But even if vendors keep prices reasonable, libraries’ own situations may change. Here in Pennsylvania, funding to libraries has been cut severely enough that many now have to cancel subscriptions to heavily-used databases. The linked story has a heartbreaking quote from one of the public librarians that’s had to drop their formerly free Power Library subscription: “I got rid of [our old magazines] because everything was in the database.”

How can we insure against these sorts of cultural loss, even as we withdraw items? A key principle is replication. In the words of one well-known digital preservation program, “Lots of Copies Keep Stuff Safe”. When we consider withdrawing something, we stop to think if some other library or institution might find it of value. If we’re considering dropping print originals for digital surrogates, we check to see if other institutions we trust are keeping the originals safe, or would be willing to do so. We also make digital copies of print materials that may be at risk, and we try to spread around these copies as widely as practicality and copyright law allows. And we develop and support efficient inter-library transfer networks so that we can quickly move locally deaccessioned materials to where they’re needed or valued.

Many librarians have a philosophy of public service that draws on Ranganathan’s famous set of Five Laws of Library Science, which includes principles like “every reader his book” and “every book its reader”. As we try to preserve our broad cultural heritage in the midst of withdrawal, loss, and replication, a related principle, “Every book its libraries”, is a useful one to keep in mind.

[Edited slightly 4:12pm Jan 28, in response to a comment below: deleted struck-through text, and added italicized text]

Posted in preservation, sharing | 9 Comments

Concepts in catalogs: Where the data comes from

Posted on January 15, 2010 by John Mark Ockerbloom

I’ve now made a few posts about concept-oriented catalogs, describing the basic idea, showing some examples, and talking about the kinds of context they should provide for users. As I mentioned in my first post, concepts in such catalogs are “first-class locuses of information to help readers find useful knowledge resources”. The catalogs I’m describing include a variety of concepts (beyond the bibliographic record) that have data associated with them, and this data gives users a helpful context for finding appropriate knowledge resources.

As I said in my example post, “The concepts come from, and are maintained by, various groups of people…. [They] may be derived in part from existing MARC bibliographic metadata (sometimes through automated analysis), but often draw from additional data sources.”

If you’ve worked in cataloging lately, you might be thinking, “that’s nice, but we’ve got our hands full just providing MARC catalog records for all the books and other stuff coming through the door now. Where’s all this other ‘concept’ data going to come from? And how will it be practical to use and maintain?”

In this post, I’d like to take a stab at answering those questions. I’ll draw a lot from my experience with subject maps, but a lot of what I say should apply to other kinds of concept data as well.

The conceptual data behind subject maps consists of annotations on different subjects, and links between related subjects. A lot of what I need to build these maps can simply be reused from existing data. In particular, the Library of Congress Subject Headings system (LCSH) provides a large set of subjects with standardized names. We also have a set of authority records associated with those subjects that gave alternate names, notes, and links to related subjects.

To make it practical to build a subject map for this data, I bulk-loaded authority records from our local catalog. While the Library of Congress Authorities are more up to date than our local catalog, I could only look up records there one at a time, through an interface designed for manual browsing. Fortunately, since then the Library of Congress has provided ways to download subject authority data in bulk. It’s in a format that omits some details, but it should still help fill out our maps when we start including these records as well. Because our library and the Library of Congress are both using a common system of identifiers for subjects, as well as compatible formats for expressing subject relationships, I’ll be able to combine our authority information with theirs to provide useful maps. The identifiers we use are not always in sync; LCSH subject terms do get renamed and discontinued from time to time. But the cross-references in LCSH authority records, which often include the old terms as aliases of the new terms, help reduce the pain involved in moving from old terms to newer terms.

Subject maps built just on authority records turn out to be pretty generic, and not as useful as they could be. To make them more useful, we need more data. As I describe in more detail in this white paper, I also analyze our bibliographic corpus to see what subject terms we actually use in our catalog, look at the structure of those terms (which are often coordinated from multiple components), and also look at correlations between terms that get used together in the same bibliographic records. This analysis lets me create additional useful relationships between subjects. In short, I use automated analysis of a large data corpus to create new concept data from existing data.

In order to link together the many subjects that have geographic aspects, I need some extra data that isn’t in authority records. Once I created a data record that noted that “Pennsylvania” is a US state that gets abbreviated “Pa.” in some subject headings, I was able to build all kinds of relationships between “Philadelphia (Pa.)” and related subjects, none of which are directly stated in the authority records for these subjects, but all of which can be derived by automated analysis. (It helps that subject terms in LCSH have a fairly well-defined structure that’s amenable to lexical analysis.) A couple hundred other brief geographic data records are enough to let users zoom in and out of locations all over the globe. So a small amount of well-designed and curated supplementary data can often enhance lots of concepts, with minimal maintenance cost.

While I can easily zoom in and out between the US, Pennsylvania, Philadelphia, and locations within Philadelphia, I’d need more data to move side to side. I don’t have any data, for instance, that tells me that Philadelphia is right next to Camden, New Jersey. But fortunately, I can mine external data sources to find this information. I recently read about a source of public domain global map data, for instance, that I (or any other geographic-concept catalog builder) could use to link subjects or other resources to a world map.

Increasing amounts of public data are distributed online. If the data is public domain, or available with a liberal license, I don’t have to worry about legal roadblocks to downloading it, analyzing it, and using it in my own work. Sharing data helps everyone build not only smarter catalogs, but smarter applications of all kinds.

Data sharing does not always happen painlessly. I may have different concepts, or different names for concepts, than someone else whose data I might find useful. We may have different ideas about how to structure our data. But there are now systems that provide links between different names, and crosswalks between different structures that can help bridge the gap between my data and that of others.

With large enough corpuses of data to draw on, I can even make use of unstructured information from large groups of ordinary users. For example, LibraryThing’s tag cloud displays a number of terms that are useful to include in one’s own library catalog. Not all of them are formally defined subjects, but they’re used enough that we should expect most of them to be used in patron searches. It should be possible to analyze the cloud and the things tagged in the cloud to associate many informal terms with particular subjects or library resources.

To summarize, it becomes much easier to derive the data needed for concept-oriented catalogs if

We have stable (or at least smoothly evolving) identifiers for concepts
We can use, swipe, and reuse a large domain of [meta]data for concept analysis (including automated analysis)
We carefully consider what additional concept data would enhance our services, and use standard, recognized forms to represent it
We have correspondences and crosswalks between different concept identifiers and formats
We share our concept data (and bibliographic data in general) as openly and broadly as possible
And we share information, expertise, and code that supports the innovative, useful catalogs we build.

There’s a non-trivial technical infrastructure implied by these requirements. But it’s one that we can build. (Quite a bit of it’s in place already.) A lot of it depends on a healthy social infrastructure to create, maintain, share, and work with all the data and services that we create and adopt. I hope to talk more about this social infrastructure in future posts.

Posted in architecture, discovery, formats, metadata, sharing, subjects | Comments Off

Content and context in concept-oriented catalogs

Posted on January 7, 2010 by John Mark Ockerbloom

2010 has been designated the year of cataloging research by the Association for Library Collections and Technical Services. So it’s a good time for me to continue my series on concept-oriented catalogs. Before I talk about specific ways to implement them, I’d like to give some attention to what they look like, and the ways they interact with users.

If you’ve explored the links from my post giving examples of such catalogs, you may have noticed that there’s no single look and feel across the catalogs. Most of them diverge substantially from the familiar “search box followed by uniform list of hits” paradigm we see with Google and other generic search engines. But what should such catalogs provide, beyond the simple hit list? And what do different well-designed concept-oriented catalogs have in common, from a user’s point of view?

Generally speaking, concept-oriented catalog displays are more complex than traditional catalogs. Since they use a variety of concepts to help users find knowledge resources, the displays will generally show both a set of resources, and some information about other concepts in the catalog.

In theory, a catalog could start by just showing concepts, and requiring users to choose a particular concept before displaying the resources associated with it. That’s in fact how many traditional library catalogs work, when users try to browse by author or by subject: you just see lists of headings, and you have to click through each heading individually before you can see what resources are associated with it. But forcing this kind of interaction gets in people’s way unnecessarily. I think that’s why I don’t see a lot of this sort of browsing in library catalogs, and why the most frequently used Internet search engines take pains to show useful resource results as quickly as possible.

So good concept-oriented catalogs will feature resource “hits” quickly and prominently, just as other good catalogs will. Concept-oriented catalogs will generally also show other concepts that provide context for these knowledge resources. They can do this in a variety of ways. They may mix in “concept” hits with “resource” hits (as with these search results for “Hava Nagila” in the Freedman archive, which combine hits for tracks, albums, and compositions). They may display concept-related information alongside resource hit information. (In some displays, like in WorldCat Identities’ display of Charles Dickens, resource hits are displayed after information about the concept. In other displays, like the Online Books Page subject map for “Prisoners of war”, they’re literally side by side.) Or they may have complementary displays that overlay concepts and resources. Consider, for instance, this Philadelphia neighborhood history display, where resources are represented as push-pins that overlay a background of geographic representations, a display also used in many Google Maps applications.

In designing a concept-oriented catalog, one needs to avoid confusing users with over-complex, under-explained displays. Rather, the catalog should work with the user by providing an easily understood context for their search. Users should be given a clear idea of what concepts they’re looking at (e.g. by clearly naming and describing them), and what resources are associated with those concepts (e.g. by showing the resources along with the concepts, in some sensible organization).

In well-designed concept-oriented catalogs, the concepts and the resources complement each other. The concepts, and the information displayed about them, helps users understand and choose appropriate resources. And the resources, in turn, can help users understand what a concept is about, and what is known about the concept. Looking over the available resources, the conceptual information, and mentions of related concepts, users can decide whether they want to delve more deeply into a particular set of concepts, or move to different concepts for further investigation. (Facet-oriented displays let you do all this, for instance, and are useful even when they provide no information about concepts other than their names.)

In the Internet age, we also shouldn’t assume that users just stick to our own catalogs to do their research. There’s a whole Web’s worth of context out there. Concept-oriented catalogs should link out to (or include outright) external resources that can help users with concepts and resources, as appropriate. Going the other way, they should also make it easy for external resources to stably link to a particular concept (or concept set) in the catalog. (It’d be great if someone looking up an article on Wikipedia or another database could see a prominent link to “Stuff your library has on this topic”.)

I’ve mentioned or implied a number of design rules in what I’ve said above. I’d like to point out 4 of them in particular, for a moment:

Use clear names for your concepts.
Make your concepts linkable via the Web.
Provide useful information about concepts when people come across them.
Include links to knowledge resources, other concepts, and external resources, so people can find more things.

Now I’d like you to look at the 4 rules for linked data by Tim Berners-Lee, inventor of the Web.

Notice a resemblance? I didn’t start out trying to make my design requirements match Sir Tim’s principles, and I don’t necessarily think that Linked Data and its associated technologies are the be-all and end-all of information management. But there are enough points in common to suggest some interesting synergies between the design of concept-oriented catalogs and the work of the Linked Data community.

And with that to think on, I’ll take my leave till next time, when I’ll talk more about the data that powers concept-oriented catalogs.

Posted in architecture, discovery, metadata | Comments Off

Public domain day 2010: Drawing up the lines

Posted on January 1, 2010 by John Mark Ockerbloom

As we celebrate the beginning of the New Year, we also mark Public Domain Day (a holiday I’ve been regularly celebrating on this blog.) This is the day when a year’s worth of copyrights expire in many countries around the world, and the works they cover become free for anyone to use and adapt for any purpose.

In many counties, this is a bittersweet time for fans of the public domain. For instance, this site notes the many authors whose works enter the public domain today in Europe, now that they’ve been dead for at least 70 years. But for many European countries, this just represents reclaimed ground that had been previously lost. Europe retroactively extended and revived copyrights from life+50 to life+70 years in 1993, so it’s still three more years before Europe’s public domain is back to what it was then. Many other countries, including the United States, Australia, Russia, and Mexico, are in the midst of public domain freezes. For instance, due to a 1998 copyright extension, no copyrights of published works will expire here in the US due to age for another 9 years, at least.

In the past, many people have had only a vague idea of what’s in the public domain and what isn’t. But thanks to mass book digitization projects, the dividing line is becoming clearer. Millions of books published before 1923 (the year of the oldest US copyrights) are now digitized, and can be found with a simple Google search and read in full online. At the same time, millions more digitized books from 1923 and later can also be found with searches, but are not freely readable online.

Many of those works not freely readable online have languished in obscurity for a long time. Some of them can be shown to be in the public domain after research, and groups like Hathi Trust are starting to clear and rescue many such works. Some of them are still under copyright, but long out of print, and may have unknown or unreachable rightsholders. The current debate over Google Books has raised the profile of these works, so much so that the New York Times cited “orphan books”, a term used to describe such unclearable works, as one of the buzzwords of 2009.

The dividing line between the public domain and the world of copyright could well have been different. In 1953, for instance, US copyrights ran for a maximum of 56 years, and the last of that year’s copyrights would have expired today, were it not for extensions. Duke’s Center for the Study of the Public Domain has a page showing what could have been entering the public domain today— everything up to the close of the Korean War. In contrast, if the current 95-year US terms had been in effect all of last century, the copyrights of 1914 would have only expired today. Only now would we be able to start freely digitizing the first set of books from the start of World War I.

With the dividing line better known nowadays, do we have hope of protecting the public domain against more expansions of copyright? Many countries still stick to the life+50 years term of the Berne Convention, including Canada and New Zealand. In those countries, works from authors who died in 1959 enter the public domain for the first time. There’s pressure on some of these countries to increase their terms, so far resisted. Efforts to extend copyrights on sound recordings continues in Europe, and recently succeeded in Argentina. And secret ACTA treaty negotiations are also aimed at increasing the power of copyright holders over Internet and computer users.

But resistance to these expansions of copyright is on the rise, and public awareness of copyright extensions and their deleterious effects is quite a bit higher now than when Europe and the US extended their copyrights in the 1990s. And with concerns expressed by a number of parties over a possible Google monopoly on orphan books, one can envision building up a critical mass of interest in freeing more of these books for all to use.

So today I celebrate the incremental expansion of the public domain, and hope to help increase it further. To that end, I have a few gifts of my own. As in previous years, I’m freeing all the copyrights I control for publications (including public online postings) that are more than 14 years old today, so any such works published in 1995 and before are now dedicated to the public domain. Unfortunately, I don’t control the copyright of the 1995 paper that is my most widely cited work, but at least there’s an early version openly accessible online.

I can also announce the completion of a full set of digitized active copyright renewal records for drama and works prepared for oral delivery, available from this page. This should make it easier for people to verify the public domain status of plays, sermons, lectures, radio programs, and similar works from the mid-20th century that to date have not been clearable using online resources. We’ve also put online many copyright renewal records for images, and hope to have a complete set of active records not too far into 2010. Among other things, this will help enable the full digitization of book illustrations, newspaper photographs, and other important parts of the historical record that might be otherwise omitted or skipped by some mass digitization projects.

Happy Public Domain Day! May we have much to enjoy this day, and on many more Public Domain Days to come.

(Edited later in the day January 1 to fix an inaccurately worded sentence.)

Posted in copyright, online books, open access | 2 Comments

Some concepts and their catalogs

Posted on December 10, 2009 by John Mark Ockerbloom

This is the second of a series of think-aloud posts on what I’m calling “concept-oriented catalogs”; catalogs that, as Lorcan Dempsey aptly describes them, go “beyond the bibliographic record”. This post will present examples of concept-oriented catalogs, describe the concepts they use, and describe some of the features that make them work.

Reflections on the story so far

As I stated in my first post, a catalog helps a user get from some set of concepts (ideas, citations, words, people, places, etc.) of the information they’re seeking, to some useful knowledge resources (books, articles, web sites, songs, etc.) A concept-oriented catalog uses a variety of concepts, not just those of knowledge resources themselves, as first-class entities to help locate useful resources.

Note that this is essentially a functional view of a catalog, from a user’s perspective, as a means of knowledge discovery. Library catalogs also serve a number of other important functions. In particular, they also manage the inventory of resources the library has acquired and makes available. That function is necessarily resource-oriented. Some concept-oriented catalogs don’t need to be resource-oriented; Wikipedia, for example, maintains no inventory of the sites it links to from its concept-based articles. A concept-oriented library catalog, however, will probably also need to be resource-oriented to fulfill all its functions. That’s fine– the two qualities are not mutually exclusive.

Concepts can come from a variety of places; from libraries, from experts, and from ordinary readers. I’ll show examples of all three.

Fun with FRBR

I noted in the last post that current library catalogs are largely based around the entities that FRBR describes as manifestations and items. FRBR defines a number of other entities as well. These entities can also provide useful focuses for concept-oriented catalogs.

Fiction Finder, mentioned in my last post, is not the only example of a work-oriented catalog. A recent status report from the OpenLibrary Project indicates that they are moving to make their catalog work-oriented as well. The Amazon book catalog, and its aggregation of various editions of a book (and their reviews) on one catalog page, is oriented around expressions in the FRBR sense.

In practice, the line between works and expressions tends to be blurry in catalogs. But we’re definitely seeing more catalogs present search results at these higher bibliographic levels. (See, for instance, the Worldcat.org search results for War and Peace, which include various “view all editions and formats” links.) Catalogs that can’t easily and sensibly aggregate their manifestations of works or expressions will be at a competitive disadvantage over catalogs that can.

Numerous catalogs also provide special information for authors, whether persons or corporate bodies. I mentioned WorldCat Identities in my previous post; a commenter to that post noted the New Zealand Electronic Text Centre’s catalog, which also aggregates useful information on its authors, as well as other concepts.

Google Maps and other geospatial platforms provide the basis for many catalogs oriented around places, mediating access to information resources of all types, from local histories to violent incident reports, to hospital ratings. Catalog interfaces can also be built around events, as demonstrated in MIT’s Simile Timelines widget and in Google’s Living Stories news tracker. Objects, concepts (which in FRBR-ese denote abstract subjects, rather than the broader use I’m making of “concept” in this post), and other kinds of subjects are the focus of the subject map-based catalogs I mentioned in my last post.

Catalogs need not limit their focus to one kind of concept. They often get more interesting when users can move between different kinds of concepts. For instance, if you look up the author Fanny Jackson Coppin on The Online Books Page, you’ll see that there are not only resources by her, but resources about her. If you look for the latter, you’ll also discover that Coppin was an African-American teacher in Philadelphia, and then be able to follow further links to more books about African American teachers, or Philadelphia, or other related subjects as you see fit. Or, if you started a search with African American teachers in mind, you’ll find that concept linked to Coppin. So users can move back and forth between particular subjects, and people that are relevant to them, as they browse the catalog.

What makes this sort of navigation work? Part of it involves analysis of existing bibliographic records: in this case, catalogs like The Online Books Page (as well as others like WorldCat Identities) analyze patterns of subject headings to relate particular people to particular subjects. Having common identifiers helps as well: the authority-controlled string “Coppin, Fanny Jackson” is used both for author and subject metadata, making it possible to link author- and subject-based concepts. Subject maps also require additional data beyond just bibliographic records. Mine use a variety of data sources, including authority records and a small but essential set of records I created for certain geographical entities.

Consult the experts

Librarians have lots of understanding and experience working with FRBR-like concepts. But users also find value in many other concepts. A recent research report from Project Information Literacy on information seeking patterns of college students included some interesting findings on favored starting points. Wikipedia, as I expected, was a very popular starting point for everyday research, but for course-related research, the most popular starting point, edging out not only Wikipedia but even Google itself, was course readings.

Course readings are highly focused guideposts for academic research. The list of readings is a set of knowledge resources chosen carefully by the instructor to educate students about the subject of the course. The readings themselves typically have bibliographies, or at least lists of references, that support their own assertions and suggest further avenues of research. Essentially, the syllabus and the readings represent a careful curation by subject experts of important knowledge resources for a particular topic. The value of that curation often extends well beyond a particular class. At Penn, many of the reading lists originally developed for a class get adapted into research guides on our library’s Web site, helping others researching in similar areas. And interesting things also start to happen as you aggregate multiple scholars’ reading and citation lists, as we’ve seen with services like PennTags (originally designed, and still frequently used, for class projects), CiteULike, and Google Scholar.

There are also many collections assembled and curated by non-faculty experts. The curation of many of these collections often involves a rich set of concepts appropriate to the focus of each collection. Consider, for instance, the Freedman Jewish Sound Archive, curated by a lawyer and his wife and now housed at Penn. The catalog for the archive is oriented around several important concepts: among them songs that are musical compositions, tracks that record performances of the songs, albums that contain the tracks, sheet music in which songs are published, and artists that have various kinds of relationships with each of those other concepts. Each of these concepts has an expressive metadata schema that includes relations to the other concepts, and to resources that can be read or listened to onsite or online. The catalog is not only a guide to knowledge resources, but a valuable knowledge resource in its own right.

Because of the complexity and interrelationship between concepts, we had to write a specialized interface to bring out the full expressiveness of the catalog. Simply providing a flat search index of tracks was not enough. But many of the special concepts used to provide depth for this catalog can also be mapped to some extent into more common bibliographic data structures and interfaces. And new technologies I’ll discuss later on can make it easier to build, and share data from, these sorts of specialized catalogs.

There are a lot more scholars, and expert curators, of all types, than there are professional librarians. And they often know more about knowledge resources in their areas of interest than we do. We can potentially build much more broad, conceptually rich, and carefully curated catalogs, if we develop effective ways to work with them.

Power to the people

Ordinary readers or scholars might not have the information science training or metadata expertise that professional librarians have. But they can still help us greatly in finding useful knowledge resources, just through their ongoing reading and commentary, if we have some way of tracking and aggregating what they do.

Social software gives us a way to do that, and its benefits have been widely discussed in recent years. Social software introduces the user as a first-class concept that can be associated with particular descriptions and knowledge resources. Implicitly or explicitly, users build up collections of resources that they have noted, possibly using tags to describe them, or posts to comment on them. The tags and posts are typically informal expressions, rather than controlled terms or highly structured records. Tags can often describe resources in more accessible language, or to greater specificity, than established controlled descriptive terms. And users often find useful resource recommendations from other users who share their interests.

Many social communities have built up around different user groups, types of resource, or styles of communication. PennTags (oriented around Penn scholars), Flickr (oriented around photos and other images), and Twitter (oriented around short real-time posts) are just a few examples. I’m far from the only person who relies heavily on the users I follow on Twitter for reading tips and current awareness. Longer-form posts that review and invite comments on books and other works can also be very helpful in finding useful knowledge resources.

Social software has been making its way into libraries for a few years now. Many library catalogs, ours included, now let users tag and annotate bibliographic records. The catalogs provide a smattering of new functions based on this tagging, but tagging is essentially an add-on, not fully integrated into the catalog.

Catalogs designed from the ground up to be socially powered can look quite different. We’re starting to see some of them now. One notable “social catalog” is LibraryThing, which has been developed by Tim Spalding, a staff of about a dozen professionals, and a membership of nearly a million readers. Tim gave a talk to librarians in October called “What is Social Cataloging?“. In the talk, he describes and demonstrates the features and the concepts of his catalog (including many of the concepts I’ve mentioned in this post), and the many benefits that emerge when hundreds of thousands of readers collaboratively catalog their personal libraries. The video of the full talk runs just under an hour, and is well worth watching.

LibraryThing does not provide all of the functionality you’d find in a good research library catalog. But it provides many useful forms of guidance that most traditional catalogs lack, and it also suggests some ways in which libraries and their users can join forces in building comprehensive, user-focused, concept-oriented catalogs. (Much of the bibliographic metadata used in LibraryThing comes from traditional library catalog records; and LibraryThing in turn now sells a service that embeds some of the other information it aggregates back into traditional library catalog displays.)

Recap and coming attractions

In this post, I’ve shown how catalogs can use a wide variety of concepts to help users find resources. The concepts come from, and are maintained by, various groups of people, including librarians, scholars, collectors and other domain experts, and lots and lots of everyday readers. The catalog concepts may be derived in part from existing MARC bibliographic metadata (sometimes through automated analysis), but often draw from additional data sources. The catalogs may require new user interface designs, going beyond the standard search-box and list-of-hits paradigm, for users to take full advantage of their concepts.

The examples I’ve shown should demonstrate the versatility of concept-oriented catalogs, and also suggest some of the challenges of implementing them. In future posts, I hope to discuss the technologies, data models, system architectures, and social structures that can help make useful concept-oriented catalogs practical for libraries to build and maintain.

Posted in architecture, discovery, libraries, metadata | Comments Off

Understanding concept-oriented catalogs

Posted on December 4, 2009 by John Mark Ockerbloom

I’ve recently been thinking a fair bit about the future of library catalogs, particularly after reading Diane Hillmann’s review of the recent Library of Congress-commissioned Study of the North American MARC Records Marketplace. Hillmann laments that the study tries to propose economic tweaks to the current system of MARC bibliographic record distribution, but doesn’t consider the more basic questions of whether this sort of distribution is what cataloging should be focusing on in the first place, or how the nature of cataloging (broadly conceived) in the Internet age is already changing. She concludes “The change we need is not really about records, or catalogers; it’s a new way to think about information and added value.”

I agree with a lot of what she has to say in the review, but I also worry that, by itself, what she recommends might be too easily dismissed by library planners. They might imagine this sort of thinking as utopian, airy speculation with little relevance to the work they have to do now, getting books and other resources into their catalogs more efficiently. But in fact, new ways of cataloging and adding value to knowledge are already being practiced in many places online, and can be practically built on much of what libraries are already doing. So I’d like to spend a few blog posts thinking out loud about the potential I see developing for catalogs, and practical ways for realizing that potential.

What does a catalog do?

Let’s start with the basics: What are people doing when they’re using a library catalog, or engaging in other kinds of search? Typically, they have in mind some concept [1], or combination of concepts, that they want to find information about. Their concepts might not be fully formed or understood at the start of a search, and people may not be able to ideally express the concepts in ways their interface expects, but the concepts are there.

As they search, users often will refine their concepts, change the way they express them, explore related concepts, or altogether shift the concepts they’re interested in. They do this as they see what happens when they search, what resources come up, and what other relevant concepts (such as particular people, works, or focuses of study) become apparent. But the user’s ultimate goal is to obtain useful knowledge. A catalog, then, is a way of helping people get from concepts to useful knowledge resources.

There are many kinds of concepts people that searchers might be thinking of. Familiar kinds of library concepts include books, people, places, and topics of study, but there are many additional important kinds that we’ll discuss later on. Sometimes, users already have a concept of a specific knowledge resource. They might have a citation they’re trying to follow, or a particular book they want to read. Library catalogs tend to be particularly good at this kind of scenario, known in the trade as a “known item search“. [2]

This is not surprising, because today’s library catalogs tend to be strongly oriented around resources, and much less around other concepts. They primarily use MARC bibliographic records (which describe particular manifestations, to use FRBR terminology), and holdings records (which describe particular items that can be accessed or borrowed). Library science has meticulously defined and curated a number of other concepts: subjects with complex taxonomies; authors with wide-ranging authority control; uniform titles and series that link many bibliographic items together, and other bibliographic features that have carefully defined controlled vocabularies.

These concepts are represented in our MARC records, but as distinctly second-class entities. They’re typically attributes of the records that are the focus of the catalog, rather than focused records in their own right. The closest things to first-class non-resource concepts in most catalogs are authority records. But those records often contain minimal information about the concepts they describe, they’re typically invisible to most catalog users, and most online catalog interfaces don’t effectively exploit the information those records do contain. [3]

Concepts as first-class entities in catalogs

One of the long-recognized strengths of research libraries is the knowledge and organization they have of their resources, and of the concepts that these resources represent. Since a catalog maps concepts to resources, it should be possible for catalogs to represent important concepts more explicitly and expressively in the catalog, and use them as guideposts to help people find useful resources. We’re starting to see some prototype catalogs of this sort develop in the library world: consider, for instance, OCLC’s Worldcat Identities, oriented around authors; or their Fiction Finder, oriented around works with multiple manifestations; or the Online Books Page’s subject map views (such as this one on alphabets). In these catalogs, the concepts aren’t simply metadata attributes, or headings in a list of choices, but information-rich reference points for finding knowledge resources. These systems are all what I would call concept-oriented catalogs: catalogs that use various concepts (and not just concepts of the resources themselves) as first-class locuses of information to help readers find useful knowledge resources.

The promise of concept-oriented catalogs is still largely unrealized in the library world. Not only do current library catalogs often do little with conceptual knowledge, but proposals for future catalog architectures also often stick to keeping a tight focus on resources and make other concepts secondary. For instance, Coyle and Hillmann note in their 2007 review of the proposed RDA cataloging standard that “the focus of RDA is called ‘the resource’ and the resource is a FRBR manifestation/item described using the same concept of a pre-coordinated ‘record’ as we find in AACR2” (the older cataloging standard first published in 1978). Similarly, the final project report for OLE, which aims to build a next-generation library architecture, treats resources as first-class entities, but not the descriptions of those resources. The metadata describing knowledge resources are simply attributes of the resources, and not entities that can managed and shared in their own right. [4]

Concept-oriented catalogs at Internet scale

As more and more knowledge resources become available to users, via the expansion of the Internet, the streamlining of interlibrary loan services, and the mass digitization of print library materials, well-defined, well-documented, and well-connected concepts will become increasingly important for readers that want to find what is most useful to them in a sea of information. While we will never have well-defined concepts for everything readers might be interested in, the concepts that have been defined by someone, somewhere, can serve as valuable guideposts for subsequent information seekers, if we’re smart about managing and using them.

This view might strike some longtime users of the Internet as hopelessly naive and ignorant of history. After all, the Web started out in the 1990s with various conceptually organized catalogues like the WWW Virtual Library and the Yahoo Directory. But most people soon forsook these in favor of search engines like Google that are much more comprehensive, and that work with arbitrary keywords instead of predefined concepts. Why should we think that concept-oriented catalogs will work at the Internet scale if they’ve already been tried and rejected on the Web?

In fact, though, concept-oriented cataloging is still very widely used in online information seeking today. It just isn’t the same kind of concept-oriented cataloging. The most popular concept-oriented catalog online today doesn’t force readers to go through a particular concept hierarchy before they can get at the resources they want. Instead, it typically shows up prominently whenever you do a Google search for one of the concepts it includes. Each of its millions of concepts gets reviewed for naming, redundancy, and relevance, and can be easily linked to from elsewhere on the Web. Most of its concepts have links to various external online knowledge resources, as well as to related concepts. The catalog also has ever-increasing amounts of harvestable, structured, semantic metadata about its concepts. And it’s very often characterized, even by users who acknowledge its many flaws, as a useful starting point for finding information resources online. Oh, and anyone can edit it. You’ve probably figured out by now that I’m talking about Wikipedia. While Wikipedia is not usually billed as a catalog, my description should make it clear that it does in fact serve that function, and in a concept-oriented way.

Coming attractions

I hope what I’ve written so far gives you a good idea of what concept-oriented catalogs are, and why they’re worth thinking about as we plan for the future of libraries. In posts to come, I hope to discuss various examples of concept-oriented catalogs inside and outside the library world, and talk about how they work, what concepts (and related information) they focus on, and how they are built up and maintained. And I hope to show how we construct useful, practical concept-oriented catalogs for the future, building both on the knowledge and expertise we have in libraries, and on the contributions of others.

(Update: The second post in the series is now available.)

Notes

[1] I’m using “concept” here in the broad sense of any kind of thing that might be the object of someone’s search, rather than the narrower sense used in FRBR and elsewhere of a particular kind of abstract subject. It’s surprisingly difficult to find a straightforward word for this general idea that hasn’t already been claimed for some other purpose.

[2] Though even here, with the explosion of resources available online, we’ve found it useful to augment the traditional catalog with link resolvers to help people find resources like journal articles that we don’t specifically include in our own catalogs.

[3] I’ve discussed this at length in the past with respect to LCSH subjects; see my work on subject maps for some attempts to address this problem. I’ll talk more about subject maps as one type of concept-oriented catalog in later posts.

[4] This may well change as OLE develops, and the data architecture is fleshed out. I’m not directly involved in OLE at present, but I do work for one of the lead development partners.

Posted in architecture, discovery, libraries, metadata | 5 Comments

Respecting failure: Some thoughts, and a proposal

Posted on December 1, 2009 by John Mark Ockerbloom

Last month’s Digital Library Federation forum involved a number of interesting discussions, both at the conference site and online. This forum, different from previous ones, centered around discussions of strategies for innovation in libraries. It also involved discussions of the future of DLF itself, which earlier this year ended its independent existence and was merged into CLIR.

One theme that quickly emerged in discussions was the importance of failure. It’s a topic we often feel uncomfortable discussing, especially when we had a hand in whatever failed. Part of the discomfort in the digital library community has to do with the dual nature of what many of us do: We manage programs and services, and we also try to innovate. As managers, we don’t want our programs to fail. (If that happens anyway, we’d at least like to avoid being blamed for the failure.) And libraries have long-term ongoing service and preservation obligations that make certain kinds of catastrophic failure unacceptable.

But as innovators, we want to be open to failure as a way of learning. “Fail faster!” is a common slogan of innovative labs and ventures, and knowing the “thousands of ways that don’t work” (part of a quote often attributed to Thomas Edison) help us better understand the ways that do. I’ve mentioned before that my most widely-cited paper was written about the failure of a software development project I helped work on. And a new scientific theory isn’t usually worth considering until it is capable of failure– that is, it makes definite predictions that subsequent observations can either confirm or refute.

If we are really serious about innovating, we need to respect failure, and leave room for it. We need to let people try things that might not work, allow time for encountering dead ends, have contingency plans that let us continue to carry out our missions even as failures occur, and note both what worked and what didn’t in the things we try. It’s especially useful to note things that we found didn’t work before they were obvious to others, since we might well save others a lot of time avoiding the same pitfalls.

How should these failures be reported, though? It’s often easier and more gratifying to publish stories of success than stories of failure. And if we talk about our failures in public, whether in a journal, at a conference of like-minded professionals, or even in a blog or tweet, we may have good reason to worry that it might hurt our own positions in our organizations, or the work that we do.

The question was raised at the DLF Forum whether future forums could be a “safe” place to discuss failures. I’m not sure any public gathering can be an entirely safe place for that. People you see are identifiable, and word tends to spread over time, particularly when a large number of people are present.

One alternative that’s been proposed to address this problem, and still make useful information about failure available to a wide audience, is to anonymize reports. We can still learn a lot from failures in our field even if we don’t know exactly who failed and where. And perhaps anonymity can produce more useful information about failure, by producing more “safe” places to talk about it.

So, I’d like to try a little experiment. From now until the end of February, I invite folks involved in library-related failures to send me reports of their failures, for possible publication in this blog. I will choose which ones to publish (and may or may not request revisions), but whether or not they get published, I will do my best not to disclose your identity. (I can’t absolutely prevent email hacks or subpoenas, but I consider them unlikely.) Please clearly note the start and end of the report you’re submitting for publication; I’ll assume any information that might help identify you or your organization in the middle of the report is information that you’re comfortable disclosing. Reports should be sent to the email address shown on this page.

I’m interested in particular in failures of initiatives you were personally involved with, and what can be learned from the failures. While others may also be involved in the failures, I’m not particularly interested in reports intended primarily to focus blame on a particular third party. You should use similar care to obscure the identities of others involved as you do in obscuring your own identity.

Some folks may feel more comfortable working with someone else, or under different selection criteria than the ones I use. So I invite others to make similar offers if they see fit (and I might mention them here if I hear about any similar offers).

Will this produce something useful to the library community? I don’t know yet, but it seems worth risking a bit of failure to find out. If you have any suggestions about the proposal, or have some ideas of kinds of reports you’d like to see, feel free to mention them in the comments. And if you have a report you’d like to make, send me email.

Posted in failure, libraries | 3 Comments

Everybody's Libraries

Implementing interoperability between library discovery tools and the ILS

Content and context in concept-oriented catalogs

Public domain day 2010: Drawing up the lines

Some concepts and their catalogs

Understanding concept-oriented catalogs

Respecting failure: Some thoughts, and a proposal

Pages

Recent Posts

Recent Comments

Archives

Access for all

Copyrights and wrongs

General library-related news and comment

Interesting folks

Metadata and friends

Shiny tech

Tales from the repository

Writing and publishing