I’ve seen a fair bit of buzz about the Library of Congress’s recent announcement of a new initiative for transforming the bibliographic framework of the library community. The announcement notes a number of recent developments that are drivers for such a transformation, including RDA, FRBR, the semantic web of linked data, and a growing consensus that the library community needs to move beyond MARC as its central data standard.
There’s another very important driver not explicitly mentioned in the announcement: the rise of open library data. More and more bibliographic and other library-related data is now freely available and reusable online, and it enables all kinds of improvements in resource discovery. I’ve previously discussed how open data for Library of Congress Subject Headings and Hathi Trust online books allows me to manage a catalog of over a million books, create rich subject browsing interfaces, and improve the quality of library catalogs at Penn. OCLC has also compiled data on millions of authors in VIAF, and now makes it available as open data. I recently downloaded their data set, and hope to get at least as much benefit out of it as I have with subject data to date. (I hope to report on early experiments before long.)
And there’s more big news from OCLC. At last month’s Global Council meeting, Karen Calhoun gave a presentation saying they were considering letting members release WorldCat data under an open license. I’m very excited to see this development; as I said a couple of years ago (in a panel discussion that also featured Karen Calhoun), it would make a huge trove of the library community’s bibliographic intelligence available to exploit and extend in all kinds of ways that could make it easier for library patrons to discover information resources. This is no small step for OCLC: at the time of our panel, OCLC was concerned that opening access to WorldCat data might threaten the sustainability of the WorldCat cooperative. If they’re seriously considering opening their data now, I suspect that a good number of WorldCat members have indicated that it’s an important and worthwhile thing to do, and that the benefits outweigh the risks.
If you also see potential in open library data, now is an excellent time to join in the discussions that the Library of Congress and OCLC are inviting. The more these and other leading organizations in the library community see how open data can advance the goals of the community, and how open data initiatives can get the support needed to be sustainable, the richer the knowledge base that our evolving bibliographic framework will support.
I don’t yet know where and how the LC and OCLC conversations will be taking place. But I’d love to hear readers’ thoughts and pointers about using and supporting open bibliographic data in the comments here.
Updated May 27 to clarify: I’ve heard from Roy Tennant that the OCLC proposal under consideration refers to WorldCat members releasing “their catalog data containing WorldCat records”, not OCLC itself releasing data (as my post originally stated). This is still potentially a big deal, even if it doesn’t mean opening up all of WorldCat. I’ve updated the post, marking edits with strikethrough and bold, and will talk about this more in the comment thread.
Opening the data is a no-brainer for a cooperative – I explain why at http://www.software.coop/info/coopdev.html – and it would be brilliant to see OCLC do this and help build a library cooperative commonwealth. Please, pingback this post when you know where and how the LC and OCLC conversations will be taking place.
Interesting, this is the first I had heard of OCLC considering that.
What would be the best way to ‘join the conversation’, as you say? Oh, you say you don’t know either, heh.
Really, any open license is good news. A “BY” license, while it sounds nice, can still be an odd barrier when it comes to data, especially because re-use of data often involves re-using data from _multiple_ (potentially dozens or in the future even hundreds) of databases. AND because it doesn’t necessarily involve leaving the records intact, it can involve mixing and matching individual pieces of data from multiple sources, in a continual and iterative process of re-mixing. After such a process, you don’t really know what piece of data came from where without spending lots of resources on tracking it, and if you are doing that only to meet the needs of the license, the license has added significant cost to your project. So if you don’t know what piece of data came from where, and there are dozens or hundreds of sources, where exactly do you need to put the attribution? Do you need all dozens or hundreds attributed on every page (and every API response?) from your app?
Of course, it can all come down to the ‘terms as specified by the licensor’ for attribution. If it’s just on your About/splash page, “Some data from X”, that’s do-able.
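To make the tracking cost concrete, here is a minimal Python sketch of the bookkeeping a remixer would need under an attribution license. Everything in it is an assumption made for illustration: the source names, the field names, the "first value wins" merge rule, and the `merge_records` and `attribution_line` helpers are hypothetical, not part of any real license or library.

```python
def merge_records(records):
    """Merge (source, fields) pairs field by field.

    Illustrative rule (an assumption): the first value seen for a field
    wins, and we remember which source supplied it.
    """
    merged = {}
    provenance = {}
    for source, fields in records:
        for name, value in fields.items():
            if name not in merged:
                merged[name] = value
                provenance[name] = source
    return merged, provenance


def attribution_line(provenance):
    """Build a single 'Some data from ...' credit listing each source once."""
    sources = sorted(set(provenance.values()))
    return "Some data from " + ", ".join(sources)


# Hypothetical sources and illustrative field values.
records = [
    ("Source A", {"title": "The Feynman Lectures on Physics",
                  "author": "Feynman, Richard P."}),
    ("Source B", {"author": "Feynman, Richard Phillips",
                  "subject": "Physics"}),
    ("Source C", {"subject": "Physics", "publisher": "Addison-Wesley"}),
]

merged, provenance = merge_records(records)
print(attribution_line(provenance))
```

Even this toy version has to carry a second dictionary alongside every record and keep it correct through every later re-mix. A blanket credit on an About page only needs the set of sources, which is why the licensor's attribution terms matter so much.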
Well, one good start is talking about it online in public, where LC, OCLC, and others can see what we say, and they and others can chime in.
As far as more formal forums go, ALA Annual next month seems a promising place to have discussions. LC’s announcement says they’ll be having discussions there on their bibliographic transformation proposal, and OCLC always has multiple sessions at ALA. I don’t know offhand if any of the OCLC sessions are specifically about opening WorldCat data, but the issue is probably relevant to at least some of the OCLC ALA discussions. So if nothing else, it’d be helpful for people there to let them know that they value open bibliographic data, and will support them if they go that route. (I’m not at this point planning on going to Annual myself, but others reading this might be, and my own plans could still conceivably change.)
Extending the discussion beyond the usual professional cataloging and ALA crowd would also be useful. Off the top of my head, I could see people involved in VIVO, ORCID, CrossRef/DOI, Wikipedia, LibraryThing, and publishing in general all potentially having something to contribute to open bibliographic data, and I’m sure there are plenty of others that could participate as well.
If libraries are only allowed to use copies of OCLC bib records that they created, sans OCLC number, will it be a barrier for smaller/less technical institutions wishing to enter the linked data world? I’ve been pondering this lately. It seems to me like it would be more efficient for OCLC to make WorldCat available as linked data rather than giving people permission to use their contributions to WorldCat as linked data. There’s a world of difference. Why should the multitudes of small less technically blessed libraries do the work to expose their stuff when OCLC could do it on their behalf? We achieved economies of scale with cooperative cataloging. Why not with linked data?
It makes no sense for libraries without a lot of unique holdings and/or all their bib records in WorldCat to host their own linked data. How many linked data bib records of Feynman’s lectures does the world need? Holdings records, on the other hand, would have much more utility.
I get it that it’s a big scary thing for OCLC to have WorldCat as-a-whole be linked data. I get that they can’t leap into doing it as Open Data, although that would be ideal. I think Jay Jordan’s notion of “doing a deal” with anybody wanting to use the full database is a better direction to go. The licensing terms aren’t quite there yet for Open Data, due to the attribution pearling problem Jonathan mentions. A two-party agreement, however, would overcome that issue. Of course, it would probably also limit the mining of the 2nd party’s exposed data. It would be a good start though.
Laura: It’s not yet clear to me whether the proposal only applies to WorldCat records originally created by a library, or WorldCat records for items held by a library. (In any case, the proposal is not for releasing the entire WorldCat database under an open license; I’ve updated my post to make that clear.)
If members want to make their catalogs generally available as open linked data, which sounds like the motivation from Karen’s slides, then it would seem to me to require the latter case: members being allowed to release open data for anything in their holdings. This in itself could be very useful.
Organizations like Hathi Trust, for instance, could release full catalog records for their digital volumes (the records I’m getting from them now are stripped down somewhat to allow open reuse). Or, a few major research libraries could start an open data resource on nearly all significant scholarly journals. There are a number of other interesting and useful things one could do with a subset of WorldCat representing the holdings of even a few key members.
If you or others know more details about the proposal, or can provide pointers to more information, I’d be grateful to hear about it.
Now that I think about it, it seems like OCLC’s efforts to prevent WorldCat as-a-whole from being released as linked data will eventually be futile. If they let people expose their bib records (either original contributions only or the bib of any item they happen to hold) then at some future point there would be critical mass and the bulk of stuff-in-use would be out in the wild.
I’m going to read into it more. I confess to skimming the record use agreement when it was released and I have to take a closer look. I’m recalling from somewhere that OCLC really doesn’t want the OCLC numbers included in any records a member wants to release. That would make any follow-your-nose RDF linking from member-exposed records back to WorldCat more difficult. And that doesn’t seem to be in OCLC’s best interest.
I *love* the idea of research libraries releasing bib information on academic serials. I’d contribute our serials records into that (nobody should have to trace the publication of Comptes Rendus if somebody else has done it!). Of course there’s nothing to stop the journal publishers from exposing their title information themselves (and I believe they will be doing that, but at the more granular article level). If the publishers do it, where does a library fit in?
It seems to me that OCLC could release a modified version of WorldCat records as linked data that wouldn’t result in releasing what is essentially “cataloging data.” Most users of bibliographic data are not interested in much of the esoterica of the library catalog entry: they want authors, main title, dates, publisher. The stuff of citations. This is also purely factual information, generally taken from the piece. Releasing this wouldn’t rival OCLC’s cataloging revenue because it wouldn’t be enough for most libraries to accept in the place of a cataloging record, but it would be enough for folks wanting to create citations. Also, the big deal about WorldCat is not the bibliographic data but the holdings information. By releasing the base citation with an OCLC number as its base URI, OCLC creates better linking back to WorldCat and libraries.
The defect that I see in OCLC’s thinking is that they seem to assume that everyone who is interested in WorldCat data wants it for cataloging, or at least that is the fear. The greatest interest, however, in my opinion is from outside of the library world, or it would be if WorldCat were presented differently.
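As a rough illustration of the citation-level release suggested above, the Python sketch below reduces a fuller record to author, title, date, and publisher and emits them as N-Triples-style statements keyed on an OCLC-number URI. The URI pattern, the use of Dublin Core terms, the placeholder OCLC number, the sample values, and the `citation_triples` helper are all assumptions chosen for illustration; nothing here reflects an actual OCLC format or service.

```python
# Assumed URI pattern for an OCLC-number-based identifier (illustrative only).
BASE = "http://www.worldcat.org/oclc/"
# Dublin Core terms namespace, used here as one possible vocabulary.
DCTERMS = "http://purl.org/dc/terms/"

# Only "the stuff of citations" is released; everything else is dropped.
CITATION_FIELDS = ("creator", "title", "date", "publisher")


def citation_triples(oclc_number, record):
    """Return N-Triples-style lines for the citation fields present in record."""
    subject = f"<{BASE}{oclc_number}>"
    triples = []
    for field in CITATION_FIELDS:
        value = record.get(field)
        if value is None:
            continue
        # Escape backslashes and double quotes inside the literal.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        triples.append(f'{subject} <{DCTERMS}{field}> "{escaped}" .')
    return triples


# Hypothetical record; "0000000" is a placeholder, not a real OCLC number.
record = {
    "creator": "Feynman, Richard P.",
    "title": "The Feynman Lectures on Physics",
    "date": "1963",
    "publisher": "Addison-Wesley",
    "local_note": "cataloging detail that would not be released",
}

for line in citation_triples("0000000", record):
    print(line)
```

The point of the sketch is that the released subset is too thin to substitute for a full cataloging record, while every statement still points back to a WorldCat-style identifier where holdings could be looked up.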