Everybody's Libraries

May 6, 2010

Making discovery smarter with open data

Filed under: architecture,discovery,online books,open access,sharing,subjects — John Mark Ockerbloom @ 9:06 am

I’ve just made a significant data enhancement to subject browsing on The Online Books Page.  It improves the concept-oriented browsing of my catalog of online books via subject maps, where users explore a subject along multiple dimensions from a starting point of interest.

Say you’d like to read some books about logic, for instance.  You’d rather not have to go find and troll all the appropriate shelf sections within math, philosophy, psychology, computing, and wherever else logic books might be found in a physical library.  And you’d rather not have to think of all the different keywords used to identify different logic-related topics in a typical online catalog. In my subject map for logic, you can see lots of suggestions of books filed both under “Logic” itself, and under related concepts.  You can go straight to a book that looks interesting, select a related subject and explore that further, or select the “i” icon next to a particular book to find more books like it.

As I’ve noted previously, the relationships and explanations that enable this sort of exploration depend on a lot of data, which has to come from somewhere.  In previous versions of my catalog, most of it came from a somewhat incomplete and not-fully-up-to-date set of authority records in our local catalog at Penn.  But the Library of Congress (LC) has recently made authoritative subject cataloging data freely available on a new website.  There, you can query it through standard interfaces, or simply download it all for analysis.

I recently downloaded their full data set (38 MB of zipped RDF), processed it, and used it to build new subject maps for The Online Books Page.   The resulting maps are substantially richer than what I had before.  My collection is fairly small by the standards of mass digitization– just shy of 40,000 items– but still, the new data, after processing, yielded over 20,000 new subject relationships, and over 600 new notes and explanations, for the subjects represented in the collection.
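
For readers curious what "processing" a subject file like this might involve, here is a minimal sketch. It is not my actual processing code. It assumes the records use the W3C SKOS vocabulary in RDF/XML, and the inline sample, its `example.org` URIs, and the note text are all made up for illustration; a real download would be far larger and may be serialized differently.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Namespace URIs for RDF and SKOS (the vocabulary assumed in this sketch).
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Tiny made-up sample standing in for the real (much larger) download.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/subjects/logic">
    <skos:prefLabel xml:lang="en">Logic</skos:prefLabel>
    <skos:broader rdf:resource="http://example.org/subjects/philosophy"/>
    <skos:related rdf:resource="http://example.org/subjects/reasoning"/>
    <skos:scopeNote xml:lang="en">Illustrative note text.</skos:scopeNote>
  </skos:Concept>
  <skos:Concept rdf:about="http://example.org/subjects/philosophy">
    <skos:prefLabel xml:lang="en">Philosophy</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>
"""


def load_concepts(rdf_xml):
    """Return (labels, links, notes) dictionaries keyed by concept URI."""
    root = ET.fromstring(rdf_xml)
    labels = {}
    links = defaultdict(list)
    notes = {}
    for concept in root.findall(f"{{{SKOS}}}Concept"):
        uri = concept.get(f"{{{RDF}}}about")
        label = concept.find(f"{{{SKOS}}}prefLabel")
        if label is not None:
            labels[uri] = label.text
        # Collect each kind of relationship as (kind, target URI) pairs.
        for kind in ("broader", "narrower", "related"):
            for element in concept.findall(f"{{{SKOS}}}{kind}"):
                links[uri].append((kind, element.get(f"{{{RDF}}}resource")))
        note = concept.find(f"{{{SKOS}}}scopeNote")
        if note is not None:
            notes[uri] = note.text
    return labels, dict(links), notes


labels, links, notes = load_concepts(SAMPLE)
```

The relationships and notes collected this way are the raw material for a subject map; matching them against the headings actually used in a catalog is a separate step.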

That’s particularly impressive when you consider that, in some ways, the RDF data is cruder than what I used before.  The RDF schemas that LC uses omit many of the details and structural cues that are in the MARC subject authority records at the Library of Congress (and at Penn).  And LC’s RDF file is also missing many subjects that I use in my catalog; in particular, at present it omits many records for geographic, personal, and organizational names.

Even so, I lost few relationships that were in my prior maps, and I gained many more.  There were two reasons for this:  First of all, LC’s file includes a lot of data records (many times more than my previous data source), and they’re more recent as well.  Second, a variety of automated inference rules– lexical, structural, geographic, and bibliographic– let me create additional links between concepts with little or no explicit authority data.  So even though LC’s RDF file includes no record for Ontario, for instance, its subject map in my collection still covers a lot of ground.

A few important things make these subject maps possible, and will help them get better in the future:

  • A large, shared, open knowledge base: The Library of Congress Subject Headings have been built up by dedicated librarians at many institutions over more than a century.  As a shared, evolving resource, the data set supports unified searching and browsing over numerous collections, including mine.  The work of keeping it up to date, and in sync with the terms that patrons use to search, can potentially be spread out among many participants.  As an open resource, the data set can be put to a variety of uses that both increase the value of our libraries and encourage the further development of the knowledge base.
  • Making the most of automation: LC’s website and standards make it easy for me to download and process their data automatically. Once I’ve loaded their data, and my own records, I then invoke a set of automated rules to infer additional subject relationships.  None of the rules is especially complex; but put together, they do a lot to enhance the subject maps. Since the underlying data is open, anyone else is also free to develop new rules or analyses (or adapt mine, once I release them).  If a community of analyzers develops, we can learn from each other as we go.  And perhaps some of the relationships we infer through automation can be incorporated directly into later revisions of LC’s own subject data.
  • Judicious use of special-purpose data: It is sometimes useful to add to or change data obtained from external sources.  For example, I maintain a small supplementary data file on major geographic areas.  A single data record saying that Ontario is a region within Canada, and is abbreviated “Ont.”, generates much of my subject map for Ontario.  Soon, I should also be able to re-incorporate local subject records, as well as arbitrary additional overlays, to fill in conceptual gaps in LC’s file.  Since local customizations can take  a lot of effort to maintain, however, it’s best to try to incorporate local data into shared knowledge bases when feasible.  That way, others can benefit from, and add on to, your own work.
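
To give a flavor of how a single supplementary record can generate many links, here is a small sketch of the kind of lexical, structural, and geographic rules described above. The rules, the `Region` record layout, the " -- " subdivision separator, and the sample headings are all hypothetical illustrations, not the rules I actually run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """One supplementary geographic record (hypothetical layout)."""
    name: str
    within: str
    abbrev: str


def infer_links(region, headings):
    """Infer (subject, relation, other subject) links for one region."""
    # Geographic rule: the region sits under its containing area.
    links = {(region.name, "broader", region.within)}
    for heading in headings:
        if heading == region.name:
            continue
        if heading.startswith(region.name + " -- "):
            # Structural rule: subdivisions of the region's own heading.
            links.add((region.name, "narrower", heading))
        elif heading.endswith(f"({region.abbrev})"):
            # Lexical rule: places qualified by the region's abbreviation.
            links.add((region.name, "narrower", heading))
        elif heading.endswith(" -- " + region.name):
            # Structural rule: topics subdivided by the region.
            links.add((region.name, "related", heading))
    return links


ontario = Region(name="Ontario", within="Canada", abbrev="Ont.")
sample_headings = [
    "Ontario -- History",
    "Toronto (Ont.)",
    "Education -- Ontario",
    "Logic",
]
ontario_links = infer_links(ontario, sample_headings)
```

Each rule is simple on its own; the value comes from running all of them over every heading in the catalog.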

Recently, there’s been a fair bit of debate about whether to treat cataloging data as an open public good, or to keep it more restricted.  The Library of Congress’ catalog data has been publicly accessible online for years, though until recently you could only get a little at a time via manual searches, or pay a large sum to get a one-time data dump.  By creating APIs, using standard semantic XML formats, and providing free, unrestricted data downloads for their subject authority data, LC has made their data much easier for others to use in a variety of ways. It’s improved my online book catalog significantly, and can also improve many other catalogs and discovery applications.  Those of us who use this data, in turn, have incentives to work to improve and sustain it.

Making the LC Subject Headings ontology open data makes it both more useful and more viable as libraries evolve.  I thank the folks at the Library of Congress for their openness with their data, and I hope to do my part in improving and contributing to their work as well.

April 7, 2010

Copyright information is busting out all over

Filed under: copyright,sharing — John Mark Ockerbloom @ 3:43 pm

Like the crocuses and daffodils now coming up all over our front garden, new copyright registration information has been popping up all over the net lately.  As I’ve described in various previous posts, this information can be extremely useful for folks who want to revive, disseminate, or reuse works from the past.

Here’s a summary of some of the recent highlights:

Copyright renewals for maps and commercial prints are now all online, and join what is now a complete set of renewals of active copyrights for still images.  The scanning was done here at the Penn Libraries by me and by the Schoenberg Center for Electronic Text and Image, from microfilms and volumes loaned by the Free Library of Philadelphia.  I thank all the folks who helped out with this project.

With the addition of this latest set of records, you can now find copyright renewals online for nearly anything you’d find in a book, if they’re recent enough to still be in force.  To my knowledge, the only active copyright renewals of any sort not yet online at this point are renewals for most music prior to 1978, and a few small sets of pre-1978 renewals for film (about 2 years’ worth in all).

Original copyright registrations are also going online at a rapid rate.   The biggest publicly accessible set of original registrations from 1923 onward (the date of the oldest copyrights still in force) is at Hathi Trust, and consists of digitized volumes that have been scanned by Google for Hathi member libraries.  I’ve included them in a list of registration volumes organized by year and type of work on my Catalog of Copyright Entries Page, which has now been reorganized to combine all the original and renewal registrations known to be available online.  I’ve also added direct page links to renewal and other important sections of the volumes, so that researchers looking for those can go to them directly.  In many cases, the renewal sections can be downloaded for offline use.  I’ve also brought out statistics from the volumes, to help give readers a sense of the rate of registrations and renewals.

Google is making enhanced versions of book copyright registration volumes available online. Specifically, they’ve digitized the full set of original and renewal registrations for books from 1922-1977, in a set of scans that are of generally higher quality than the ones at Hathi Trust.  You can search the full text of the entire set at once, or search or browse individual volumes.

These scans were done specially for copyright research purposes, and seem to involve more careful scanning than the normal mass-book-digitization procedures Google used for the Hathi Trust volumes.  They aren’t entirely free of problems– I identify a few trouble spots in my listings– and they also don’t include registrations for other types of work, which has apparently confused some folks who have contacted me.  But they’re quite high quality overall, and could be a very good basis for structured data records of these copyright registrations.  Google has previously made such records available for book copyright renewals; I hope we’ll see a release of records based on these new scans before long as well.

Also in the pipeline: Based on conversations I’ve had with others interested in copyright issues, we may well see a complete set of copyright registrations and renewals online (at least in the form of page images from the Catalog of Copyright Entries) by the end of this year.  And a number of projects are working on making this digitized information more useful for practical copyright clearance.  Today, for instance, I heard about the Durationer project, being presented at the Copyright@300 conference at Berkeley later this week.  The project is developing a tool to help people determine the copyright status of specific works in specific jurisdictions, based on copyright registrations and other relevant information.

Some possible future directions: As I described in more detail in a 2007 paper, a thorough determination of a work’s copyright status depends not just on registration information, but on various other kinds of information, much of which can be found in a work’s bibliographic records.  Copyright registration data can also be used to build new bibliographic data structures.  Therefore, the interests of copyright clearance and the interests of access to bibliographic data tend to converge.  I elaborate on this idea in a guest blog post for the Open Knowledge Foundation, who I’ve started to work with in these areas.  (For folks following the debate over OCLC’s WorldCat, this convergence is also worth keeping in mind when reading the just-released WorldCat Rights and Responsibilities draft, which I hope to comment on in the not-too-distant future.)

I hope you find this new copyright information useful.  And I’m very interested in hearing what you’re doing with it, or would like to do with it.

March 24, 2010

Erin McKean and inclusive entrepreneurship

Filed under: people — John Mark Ockerbloom @ 11:33 pm

Today is Ada Lovelace Day, a day celebrating the achievements of women in science and technology.  There are all kinds of ways to be a scientist or a technologist, and just in the fields of computing and information technology I can think of a number of first-class inventors, investigators, developers, teachers, and integrators who are women.

When it comes to entrepreneurs, though, the list of women who come to my mind gets much shorter.  And that’s unfortunate, because in computing and information technology, as in many other technical fields, entrepreneurs play a major role in bringing the fruits of technology to the world at large.   If you mention Steve and Steve, Bill and Paul, or Larry and Sergey, lots of people will know who you’re talking about, and also know the stories of the companies they founded and their world-changing products.  Women tech-company founders, though, have not been so noticeable.  Jessica Livingston, one of the founders of the tech-startup catalyst Y Combinator, reports that nearly all of their applicants have Y chromosomes; only about 7 percent have been women. And in mid-2008, the San Jose Mercury News reported that there were no female CEOs in the top 150 Silicon Valley companies.

Despite these discouraging statistics, you can find women pioneering new technological businesses.  One who I find particularly notable (and not just because the woman technologist closest to me works for her) is Erin McKean, co-founder and CEO of Wordnik, a company that’s reinventing the dictionary for the Internet age. If you watch this 15-minute video of a 2007 TED talk, or just read the About and FAQ pages at Wordnik and play around with the site some, you’ll pick up the general ideas.   There are a few aspects of her venture that I think are worth special note:

Rethinking the familiar. Dictionaries have been around in print for centuries, and even computerized dictionaries have existed for decades.  But as Erin makes clear in the TED video, once you have the Internet, and lots of data and computing power, you can discard many of the limitations of prior conceptions of the dictionaries, and do lots of interesting new things.  You can include all the words in a language, not just the ones that pass some sort of notability test.  You can show examples from a huge array of  formal, informal, and ephemeral sources.  You can do statistical analysis to track word usage over different times, places, and genres.  In short, you can make a reference that meets a known need in new and useful ways.

Risk-taking. It’s a truism that startups are risky ventures.   You spend years of your life working long hours, often with low pay and few benefits at the start, for a project that may make you rich, but statistically is more likely to come to nothing.   And some believe that online dictionaries, in particular, may be obsolete, and vulnerable to the same sorts of disruptive markets that blew the encyclopedia business to bits.  But thus far, Erin’s risk-taking seems to be paying off; the site was named one of PC Magazine’s Top 100 products of 2009 (one of only a few websites to be included in their list), and it continues to attract funding even in the midst of a severe recession.

Inclusion. One of the key distinguishing features of Wordnik, alluded to above, is its inclusiveness.  By collecting all the words and examples that it can, from a wide variety of sources, it stands out from other dictionaries.  And along with the usual definitions, pronunciations, and example sentences for words, Wordnik also brings in many other unconventional sources– Flickr photo streams, Scrabble scoring references, and user tags and comments, to name a few– to enhance understanding and enjoyment of words.  It remains to be seen which of these sources will prove most useful or popular in the long run, but openness to new information and ideas is often a crucial part of inventing and improving new technologies.   It also helps ventures evaluate and adapt their technologies and their business models, so that they can thrive rather than perish under changing conditions.

Inviting collaboration. If inclusiveness is important to you, it’s not enough just to wait for things to come to you; you need to go out and invite people to work with you.  The Wordnik site does this to a certain extent by design, encouraging people to contribute notes, tags, lists of favorite words, and other information.  (Our 7-year-old daughter was an ardent contributor of Pokemon character names and pronunciations during the site’s alpha test.)  More recently, Wordnik has partnered with the Internet Archive and a variety of publishers and publishing sites to develop Smartwords, a forthcoming open standard for querying and embedding word information on demand as people read online, or communicate via social software.  Erin and her collaborators hope that this will extend the reach of Wordnik’s services into many more contexts than just “going to look up a word in the dictionary”.

While the details above are specific to Wordnik, the four basic qualities they embody– rethinking the familiar, risk-taking, inclusion, and inviting collaboration– have general applicability, and are especially worth consideration on Ada Lovelace Day.  The low numbers quoted earlier for women tech founders and CEOs suggest to me that women may not always be seriously considered (by themselves or others) for those roles.  Thinking in terms of the qualities I’ve mentioned makes it easier to envision women in those positions.  These are also qualities that can help both men and women take a more entrepreneurial approach to the technologies (and the libraries) they develop, so that they can have a lasting, positive effect on the world.

March 23, 2010

Lots of conversation keeps stuff sustainable

Filed under: libraries,people,preservation,sharing — John Mark Ockerbloom @ 10:12 pm

Among the hats I wear at my place of work is that of LOCKSS cache administrator. LOCKSS is a useful distributed preservation system built around the principle “Lots of copies keep stuff safe” (whose initials give the system its name).  The idea is that, with the cooperation of publishers, a bunch of libraries each harvest copies of selected online content, and keep backups on our own LOCKSS caches, which are hooked up to local library proxy services.  Then, if the material ever becomes inaccessible from the publisher, our users will automatically be routed to our local copies.  Each LOCKSS cache also periodically checks with other LOCKSS caches to ensure that our copies are still in good shape, and to repair or replace copies that have been lost or damaged.  (Various security features protect against leaks of restricted content, or unauthorized revisions of content.)
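
The check-and-repair idea can be illustrated with a deliberately simplified sketch. This is not the actual LOCKSS polling protocol, which is considerably more elaborate; it only shows the basic principle of comparing content hashes across copies and replacing a copy that disagrees with the majority. The cache names and the function names are hypothetical.

```python
import hashlib
from collections import Counter


def digest(content: bytes) -> str:
    """Hash a copy so caches can compare content without exchanging it."""
    return hashlib.sha256(content).hexdigest()


def audit_and_repair(copies: dict[str, bytes]):
    """Return (repaired copies, names of the caches that were repaired)."""
    digests = {name: digest(data) for name, data in copies.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    # Without a strict majority, no copy can be trusted automatically.
    if votes * 2 <= len(copies):
        raise ValueError("no majority among copies; needs human attention")
    good = next(data for name, data in copies.items()
                if digests[name] == winner)
    repaired = [name for name in copies if digests[name] != winner]
    return {name: good for name in copies}, repaired


# Three hypothetical caches, one of which holds a damaged copy.
caches = {
    "cache-a": b"article text",
    "cache-b": b"article text",
    "cache-c": b"article tex?",
}
fixed, repaired = audit_and_repair(caches)
```

With more copies spread across more institutions, a single damaged or lost copy is easier to detect and replace.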

LOCKSS is open source software that runs on commodity hardware.  It was originally envisioned to run virtually automatically.  As Chris Dobson described the ideal in a 2003 Searcher article, “Take a computer a generation past its prime…. Hook it up to the Internet and put it in a closet. Stick in the LOCKSS CD-ROM and boot it up. Close the closet door.”  And then presumably walk away and forget about it.

Of course, it’s not that simple in practice, particularly if your library is proactive about its preservation strategy.  The thing about preservation at scale is there’s always something that needs attention.  It might be something technical, or content-related, or planning-related, but preserving a growing collection requires ongoing thought.  And if you want to think as clearly and sensibly as you can, you’ll want to collaborate.

Right now, for instance, I’m trying to get my cache to harvest the full run of a journal that’s just been made available for LOCKSS harvesting, where we hope to provide post-cancellation access through LOCKSS.  Someone at Stanford just gave me a useful tip on how to give this journal priority over the other volumes I’ve got queued up for harvest.  Unfortunately, I can’t try it out until I get my cache back up; it failed to reboot cleanly after a power failure. While I wait to hear back with instructions on how best to remedy this, I wonder whether switching to a new Linux-based version of LOCKSS might make such operating system-level problems easier to deal with.  But it would be useful to hear from folks who are running that version to see what their experience has been.

Meanwhile, we’re wondering how best to approach new publishers who have content that our bibliographers would like to preserve via LOCKSS. Our special collections folks wonder whether we should preserve some of our own home-grown content via a private LOCKSS network.  I’m also doing some ongoing monitoring and testing of our LOCKSS cache’s behavior (some of which I’ve reported on earlier), and would be interested in knowing if others are seeing some of the same kinds of things that I see on the cache I administer.

In short, there are a lot of things to think about when LOCKSS plays a significant role in a preservation plan.  And a lot of the issues I’ve mentioned above are ones that others may be thinking about as well.  So let’s talk about them.  As the LOCKSS group has said, “A vibrant, active, and engaged user community is key to the success of Open-Source efforts like LOCKSS.”

One thing you need for such an engaged community is a forum for them to talk to each other.  As it turns out, the LOCKSS group at Stanford tell me they created a LOCKSS Forum mailing list a while back, but I haven’t yet seen it publicized.   Its information page is at https://mailman.stanford.edu/mailman/listinfo/lockss-forum .  (Currently, archived email messages are not visible on the open web, though this may change in the future.)  If you’re interested in talking with others about how you use or might use LOCKSS to preserve access to digital content, I invite you to sign up and help get the conversation going.

March 9, 2010

Implementing interoperability between library discovery tools and the ILS

Filed under: architecture,discovery — John Mark Ockerbloom @ 4:57 pm

Last June I gave a presentation in a NISO webinar about the work a number of colleagues and I did for the Digital Library Federation to recommend standard interfaces for Integrated Library Systems (the systems that keep track of our library’s acquisitions, catalog, and circulation) to support a wide variety of tools and applications for discovery.   Our “ILS-DI” recommendation was published in 2008, and encompassed a number of functions that some ILS’s supported.  But it also included many functions that were not generally, or uniformly, supported by ILS’s of the time.  That’s still the case today.

As I said in my presentation last June, “If we look at the ILS-DI process as a development spiral, we’ve moved from a specification stage  to an implementation stage.”  My hope has been that vendors and other library software implementers would implement the basics of what we recommended– as many agreed to– and the library community could progress from there.  This often takes longer to achieve than one might hope.

But I’m happy to report that the Code4lib community is now picking up the ball.  At this month’s Code4lib conference, a group met to discuss “collaboratively develop[ing] a middleware infrastructure” to link together ILS’s and discovery tools, based on the work done by the DLF’s ILS-DI group and by the developers of systems like Jangle and XC.  The middleware would help power discovery applications like Blacklight, VuFind, Summon, WorldCat Local, and whatever else the digital library community might invent.

I wasn’t at the Code4lib conference, but the group that met there to kick off the effort has impressive collective expertise and accomplishments.   It includes several members of the DLF’s ILS-DI group, as well as the lead implementors of several relevant systems.  Roy Tennant from OCLC Research is coordinating the initial activity, and Emily Lynema of the ILS-DI group has converted the Google groups space used by the ILS-DI group for the new effort.

And you’re welcome to join too, if you’d like to help out or learn more. “This is an open, collaborative effort” is how Roy put it in the announcement of the new initiative.  Due to some prior commitments, I’ll personally be watching more than actively participating, at least to begin with, but I’ll be watching with great interest.  To find out more, and to get involved, see the Google Group.

February 8, 2010

Shedding light on images in the public domain

Filed under: copyright,sharing — John Mark Ockerbloom @ 3:05 pm

For years, I’ve regularly gotten requests from authors and publishers for licenses to reproduce images in books listed on The Online Books Page, or included in the local collection of A Celebration of Women Writers.  Sometimes these requests relate to copyrighted books that I list but don’t control rights for; in those cases, I do my best to refer the request to the book’s copyright holder.  But often, they’re for images in our own collections, from books published over 100 years ago.  In those cases, I respond that the image is in the public domain (and our digitization, which adds no originality, is also in the public domain), so no license is necessary or appropriate.

Usually that response receives a thankful reply, sometimes with signs of surprise that an image can be reused without permission.  But sometimes I’ll get back a more alarmed reply.  “My publisher says I need a license for every image in my book, or I can’t use it,” it might say, followed by a plea for help in tracking down some long-defunct 19th century publisher.

I wish I could say this was an atypical anecdote.   But, if you look around the Web, you’ll find that there are huge numbers of historic images– paintings, photographs, figures, and the like– that are behind access barriers, or closed off altogether from online access, when they don’t have to be.  Artstor has over a million images of thousands of years of art that you can’t look at unless you’re at an institution that has a subscription.  The fine arts image catalog at my own library has over 100,000 digital images, none of which can be seen online by the public outside of Penn, except in thumbnails.  Neither Artstor nor Penn want to keep art away from the public; both are nonprofit educational institutions. But clearing images for free public access on a large scale has to date been impractical for these institutions.

Restrictions on images also create holes in other works.  For instance, under the proposed Google Books settlement, images in books that might be under copyright would be blanked out unless the rightsholder to the book also asserted they held the rights in the images.  These sorts of omissions can cut the heart out of many works.  In a recent New Republic article, “For the Love of Culture”, Lawrence Lessig described how a critical table was omitted in an otherwise free article about his daughter’s possible illness, due to rights-clearance issues.  “I could not believe that we were this far down the path to insanity already,” he wrote of the incident.

Part of the insanity is that many of these images from our cultural heritage are actually in the public domain.  Many people are aware that copyrights prior to 1923 have expired in the US.  But so have many copyrights from later in the 20th century.  Pre-1964 copyrights generally had to be renewed 28 years after the start of their term, or they would expire.   (Exceptions and further details are described here.)  But most copyrights were never renewed; and that’s especially true for images.

In 1923, there were copyright registrations for 3,059 works of art, 1,149 scientific and technical drawings, 7,533 photographs, and 11,289 prints and pictorial illustrations, making a total of  23,030 copyright registrations for these classes of image.  In 1951, 28 years later, there were 198 copyright renewals for all of these image classes combined.  This represents a renewal rate of less than 1%.
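
The arithmetic behind that figure is easy to check. The short script below uses only the counts quoted above; like the comparison itself, it treats the 1951 renewals as a rough proxy for the fate of the 1923 registrations.

```python
# Copyright registrations for image classes in 1923, as quoted above.
registrations_1923 = {
    "works of art": 3059,
    "scientific and technical drawings": 1149,
    "photographs": 7533,
    "prints and pictorial illustrations": 11289,
}
# Renewals for all of these image classes combined, 28 years later.
renewals_1951 = 198

total = sum(registrations_1923.values())
rate = renewals_1951 / total

print(f"{total:,} registrations, {rate:.2%} renewed")
# prints "23,030 registrations, 0.86% renewed"
```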

We have just completed posting scans that make all active copyright renewals for artwork viewable online.  In fact, once we finish scanning one last batch of renewals for maps and for commercial prints (meaning images created for product packaging and promotion) all active copyright renewals for any type of still image will be viewable online.   In later years, the number of image copyright renewals grows slightly, but not by much. But the number of images published in those years grows substantially.

Images without a copyright registration of their own might still be under copyright if they were first published as part of a copyrighted book, newspaper, magazine, or other larger work.  Fortunately, we have complete online renewal records for those kinds of works too.  It becomes much easier to establish the public domain status of a newspaper photograph, for instance, if you know (as I previously revealed) that no newspaper outside New York renewed copyright for any issue published before the end of World War II.

Having copyright renewals online for artwork is an important step towards freeing the public domain in images.  But there’s more needed to make copyright clearance practical at a large scale.  Putting scanned renewal records into a searchable database (perhaps combined with fair use image thumbnails) will make it easier to find any copyright renewals that might exist for a particular image.  (A similar database for book renewals already exists, and there are more book renewals than image renewals.)  Making original copyright registrations available as well (as we now have for artwork through 1949, and soon will have for later years) lets us determine when the copyright for an image began, and whether it was renewed in time to prevent it from expiring.

Furthermore, establishing the history and provenance of images will let us determine when unregistered artwork enters the public domain.  Registered or not, the copyright to an image created before 1964 began no later than its first US publication, and the copyright for many such images therefore ended after 28 years due to a lack of renewal.  And the mostly-frozen American public domain still includes more work each year that was never published before 2003.  On Public Domain Day last month, all such work by artists who died in 1939 entered the public domain in the US.  (I won’t get into the rather baroque rules for establishing “publication” of an artwork now, but you can determine it if the history of the image is documented.)

So we have a rich treasure trove of images in the public domain that’s been largely buried under presumptions and uncertainties about copyright.  By finding and sharing information about their copyrights, we can protect and enjoy these images in the commons of the public domain, where they can be viewed freely, included in new works, and reused in any way we can imagine.  If you find this prospect intriguing, I hope you’ll help bring these images to light.

January 28, 2010

Every book its libraries: or, Taking care in withdrawal

Filed under: preservation,sharing — John Mark Ockerbloom @ 1:42 pm

The question of when to withdraw materials from libraries has gotten heightened attention lately.  Everyday readers may not always realize it, but most libraries get rid of books and other materials on a regular basis.  Libraries typically have limited space, but keep acquiring new materials to serve their audience’s needs.   As they acquire new materials, they typically make room by getting rid of materials that no longer serve their audience as well; this is variously known as “withdrawing”, “deaccessioning”, or “weeding”.

Some libraries weed more aggressively than others.  School and public libraries tend to turn over their collections more quickly than academic research libraries.  There’s not much value a middle-schooler can get out of an outdated science book, for instance, compared to a current one.  And a public library user looking for a book on how to use their new Windows 7 computer shouldn’t have to wade through stacks clogged with TRS-80 programming guides and the like.  You can find amusing anecdotes about books that have outlived their usefulness in these kinds of collections in the blog Awful Library Books, one of the blogs on LISNews’ 10 Librarian Blogs to Read in 2010.

Academic libraries typically don’t weed as aggressively.  The larger research libraries aim to have a broad selection of thought on subjects from various points in history, as well as whatever happens to be of current interest.   A book on science that no longer reflects current scientific understanding may still be useful for researchers who want to look at the history of science, or at how science interacted with culture at the time.  Even the peripheral details can be of interest; for instance, the photographs in an obsolete computer guide can tell us what the computers looked like, and how they were expected to be used.  The most interesting aspect of many old periodicals nowadays is often the advertisements, rather than the editorial content.

Especially when they’re digitized, large corpuses can also be of major interest even when the individual items might not be particularly noteworthy.  They can help you track the use and evolution of language, for instance, or quash unwarranted patents.  I’ve talked before about the great potential of Google Books and similarly comprehensive corpuses.

Even so, research libraries still get rid of materials, or move them to offsite warehouses, when space is short.  As more users access materials online instead of print, we often ship out print volumes that have online surrogates.  Recently Ithaka published a report called What To Withdraw that gives guidelines for withdrawing materials that are online in sustainable archives (such as Ithaka’s own JSTOR), and that have a few physical copies in print archives somewhere.  Doing this responsibly may help many research libraries grow their collections, or repurpose their spaces, in useful ways.  Selling particularly valuable items to more appropriate libraries can also help fund additional library acquisition and activity.

Carefully considered, then, withdrawal can greatly benefit libraries and their users.  But libraries need to think not only about their own collection’s purposes, but about the systemic risks of individual library collection decisions.  For instance, many of the “Awful Library Books” justifiably withdrawn from public libraries might still be of historical research interest to someone.  Even if academic research libraries would keep them, many of the books intended for popular or specialized non-academic audiences were not collected by academic libraries in the first place.  If all the public libraries with these books simply throw them out, and no copy gets transferred to a library or archive with a longer-term interest, the materials may disappear forever.

Online access, as an alternative to retaining print copies, may not be as reliable as one expects.  Recently, the archives of many popular magazines that were available through various subscription databases became part of an exclusive deal from one database vendor.  This is likely to raise the costs of access to many libraries, both because they may have to subscribe to a new database to keep providing these magazines, and because the price of the new exclusive bundle is likely to increase.  But even if vendors keep prices reasonable, libraries’ own situations may change.  Here in Pennsylvania, funding to libraries has been cut severely enough that many now have to cancel subscriptions to heavily-used databases. The linked story has a heartbreaking quote from one of the public librarians that’s had to drop their formerly free Power Library subscription: “I got rid of [our old magazines] because everything was in the database.”

How can we insure against these sorts of cultural loss, even as we withdraw items?  A key principle is replication.  In the words of one well-known digital preservation program, “Lots of Copies Keep Stuff Safe”.  When we consider withdrawing something, we stop to think if some other library or institution might find it of value. If we’re considering dropping print originals for digital surrogates, we check to see if other institutions we trust are keeping the originals safe, or would be willing to do so.  We also make digital copies of print materials that may be at risk, and we try to spread around these copies as widely as practicality and copyright law allow.  And we develop and support efficient inter-library transfer networks so that we can quickly move locally deaccessioned materials to where they’re needed or valued.

Many librarians have a philosophy of public service that draws on Ranganathan’s famous set of Five Laws of Library Science, which includes principles like “every reader his book” and “every book its reader”.   As we try to preserve our broad cultural heritage in the midst of withdrawal, loss, and replication, a related principle, “Every book its libraries”, is a useful one to keep in mind.

[Edited slightly 4:12pm Jan 28, in response to a comment below: deleted struck-through text, and added italicized text]

January 15, 2010

January 7, 2010

January 1, 2010

Public domain day 2010: Drawing up the lines

Filed under: copyright,online books,open access — John Mark Ockerbloom @ 12:01 am

As we celebrate the beginning of the New Year, we also mark Public Domain Day (a holiday I’ve been regularly celebrating on this blog).  This is the day when a year’s worth of copyrights expire in many countries around the world, and the works they cover become free for anyone to use and adapt for any purpose.

In many countries, this is a bittersweet time for fans of the public domain.  For instance, this site notes the many authors whose works enter the public domain today in Europe, now that they’ve been dead for at least 70 years.  But for many European countries, this just represents reclaimed ground that had been previously lost.   Europe retroactively extended and revived copyrights from life+50 to life+70 years in 1993, so it’s still three more years before Europe’s public domain is back to what it was then.  Many other countries, including the United States, Australia, Russia, and Mexico, are in the midst of public domain freezes.  For instance, due to a 1998 copyright extension, no copyrights of published works will expire here in the US due to age for another 9 years, at least.

In the past, many people have had only a vague idea of what’s in the public domain and what isn’t.  But thanks to mass book digitization projects, the dividing line is becoming clearer.  Millions of books published before 1923 (the year of the oldest US copyrights) are now digitized, and can be found with a simple Google search and read in full online.  At the same time, millions more digitized books from 1923 and later can also be found with searches, but are not freely readable online.

Many of those works not freely readable online have languished in obscurity for a long time.   Some of them can be shown to be in the public domain after research, and groups like Hathi Trust are starting to clear and rescue many such works.  Some of them are still under copyright, but long out of print, and may have unknown or unreachable rightsholders.  The current debate over Google Books has raised the profile of these works, so much so that the New York Times cited “orphan books”, a term used to describe such unclearable works, as one of the buzzwords of 2009.

The dividing line between the public domain and the world of copyright could well have been different.   In 1953, for instance, US copyrights ran for a maximum of 56 years, and the last of that year’s copyrights would have expired today, were it not for extensions.  Duke’s Center for the Study of the Public Domain has a page showing what could have been entering the public domain today– everything up to the close of the Korean War.  In contrast, if the current 95-year US terms had been in effect all of last century, the copyrights of 1914 would have only expired today.  Only now would we be able to start freely digitizing the first set of books from the start of World War I.
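The term arithmetic behind these dates can be sketched in a few lines of Python.  This is only an illustration, not a legal test: the function name is hypothetical, and it assumes a fixed term that runs through the end of its final calendar year, so that the work becomes free on the following January 1.

```python
def public_domain_year(start_year: int, term_years: int) -> int:
    """Return the year on whose January 1 a work enters the public domain.

    Assumption (for illustration only): the term is counted from
    start_year and runs through December 31 of start_year + term_years.
    For publication-based terms, start_year is the year copyright began;
    for life-based terms, it is the year of the author's death.
    """
    return start_year + term_years + 1


# The 56-year maximum applied to copyrights from 1953
print(public_domain_year(1953, 56))

# The current 95-year US term applied to copyrights from 1914
print(public_domain_year(1914, 95))
```

Both calls give the same year, 2010, which is why the two hypothetical dividing lines fall on the same Public Domain Day despite being 39 years of publications apart.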

With the dividing line better known nowadays, do we have hope of protecting the public domain against more expansions of copyright?  Many countries still stick to the life+50 years term of the Berne Convention, including Canada and New Zealand.  In those countries, works from authors who died in 1959 enter the public domain for the first time.  There’s pressure on some of these countries to increase their terms, so far resisted.  Efforts to extend copyrights on sound recordings continue in Europe, and recently succeeded in Argentina.  And secret ACTA treaty negotiations are also aimed at increasing the power of copyright holders over Internet and computer users.

But resistance to these expansions of copyright is on the rise, and public awareness of copyright extensions and their deleterious effects is quite a bit higher now than when Europe and the US extended their copyrights in the 1990s.  And with concerns expressed by a number of parties over a possible Google monopoly on orphan books, one can envision building up a critical mass of interest in freeing more of these books for all to use.

So today I celebrate the incremental expansion of the public domain, and hope to help increase it further. To that end, I have a few gifts of my own.  As in previous years, I’m freeing all the copyrights I control for publications (including public online postings) that are more than 14 years old today, so any such works published in 1995 and before are now dedicated to the public domain.  Unfortunately, I don’t control the copyright of the 1995 paper that is my most widely cited work, but at least there’s an early version openly accessible online.

I can also announce the completion of a full set of digitized active copyright renewal records for drama and works prepared for oral delivery, available from this page.  This should make it easier for people to verify the public domain status of plays, sermons, lectures, radio programs, and similar works from the mid-20th century that to date have not been clearable using online resources.  We’ve also put online many copyright renewal records for images, and hope to have a complete set of active records not too far into 2010.  Among other things, this will help enable the full digitization of book illustrations, newspaper photographs, and other important parts of the historical record that might be otherwise omitted or skipped by some mass digitization projects.

Happy Public Domain Day!  May we have much to enjoy this day, and on many more Public Domain Days to come.

(Edited later in the day January 1 to fix an inaccurately worded sentence.)
