Close readers

There’s been a lot of public fretting lately over the state of reading. People don’t read as much as they once did, we’re told. When it’s pointed out that in fact lots of people are reading online, we’re sometimes told it’s the wrong kind of reading– “inanities” of “blogging and blugging”, to quote a recent Nobel laureate. I get the impression from some essays that on the one hand there’s offline reading, deep, solitary, and contemplative, and on the other hand online reading, shallow, social, and mercurial. The two seem to have little in common, by this sort of account.

Of course, it’s not that simple. And I’ve recently encountered a few initiatives that cut right across that dichotomy, gathering people together to closely read and discuss texts with each other.

At a recent lunchtime get-together, JT Waldman told me about the Jewish Publication Society’s new Yavnet web site, now in alpha. It’s a project to create “a living and breathing commentary” on the Torah, by encouraging readers to look at particular passages, and join online, moderated discussions on them. Drawing on the JPS’s Tanakh translations, its published, scholarly commentaries, and online discussions, readers will be able to read and participate in conversations that help bring out the meaning of Biblical passages. The basic idea isn’t new– the Talmud, after all, is a centuries-old multi-layered commentary on the scriptures– but the Yavnet folks hope to use the Internet to grow and propagate fresh understandings and appreciations of the Torah in online communities.

Another recently announced site is Book Glutton (this one says it’s in beta), which aims to bring groups of people together to discuss a book as they read it. Their reader software is also designed for close reading; instead of just reviewing or discussing a book in general, readers can attach comments to specific passages of a book, and have live chats with people reading the same sections. Their collection appears to be built largely on public domain texts, which lend themselves well to new interfaces and purposes.

Mind you, with openly accessible texts, you don’t have to limit your discussion to a single site. Jon Udell recently wrote about how discussions of scientific articles are often widely distributed over the blog network. An active discussion ensued, and just a couple of days later, he made a followup post showing some of the tools now available to track such discussions online. That was quick!

Close reading, whether of books or other text, may sometimes be solitary, but it doesn’t have to be. And the Net can bring together close discussions of text from far-flung participants, discussions that were not practical to convene offline. As Ursula Le Guin put it in the new (February) issue of Harper’s, “Books are social vectors, but publishers have been slow to see it.” (You’ll have to subscribe online or find a copy at your library or news-stand to read the full article, but it’s already being discussed online in various places. I first saw her quote in this Mediabistro post, which also points to some related discussions elsewhere.)

I look forward to seeing the new directions that the social vectors of books and texts take online, as the sites above and others like them develop further.

More on subject maps

The slides for my ALA Midwinter presentation on subject maps (which I described in a previous post) are now online. (Yes, I’m playing around with BePress’s Selected Works.) You can also find links to the presentation, a white paper, and demos, from our library lab’s subject maps page.

I’m afraid it’s only the slides (no notes) but I’ll be happy to answer questions or take suggestions. I’m hoping to spend a fair bit of time this semester on subject maps applications for our catalog and digital library architecture, and I’m very interested in seeing how much we can do with them.

During the presentation, someone asked “What do you do when you have more than one subject ontology/thesaurus?” I addressed this question in a talk I gave at last spring’s DLF forum (slides here, though some pictures show up dark). In short, I see four basic strategies one can take. (Warning: What follows has more technical abbreviations and shorthand than my usual posts here, but I’ve provided some explanatory links, and can go into more detail and explanation in later posts if there’s interest.) The four strategies are:

  • Throw all the terms in together, and hope for the best. Sounds like a recipe for chaos, but it is cheap, and may work when users search with terms from both vocabularies. Ultimately, you’d like to relate the terms somehow (and automated tools may be able to help you find popular terms that are isolated from the others, so you can connect them up).
  • Normalize to a preferred ontology. This may be a good way to go, for instance, when you have a big, controlled ontology you want to use, and a small, less controlled one that doesn’t get much use. The terms in the less controlled ontology can be added as aliases for terms in the controlled ontology, or can be rewritten to fit in with the main ontology.
  • Make multiple subject maps, link them in appropriate places. This may be appropriate when your maps tend to support different research communities (e.g. MeSH for medical research vs. LCSH for general scholarship), or different kinds of search activities (e.g. folksonomies for current peer awareness vs. LCSH for back-literature searches). In some cases, such as for MeSH and LCSH, existing crosswalks exist that can be used for links between maps.
  • Build a multi-ontology subject map. This can be challenging, but may be appropriate where you have different collections described through different ontologies that people are likely to want to explore in tandem; for example, a set of books on a particular era of history described in LCSH along with a set of photographs of the same era described in TGM. It can get a bit tricky when the two ontologies have different names for the same concept, though, or identical names that refer to two different concepts.
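As a rough sketch of the third strategy above, here’s how two separate term networks might be linked at crosswalk points. All of the terms, mappings, and function names below are invented for illustration; real MeSH/LCSH crosswalks are far larger and messier than this:

```python
# Two toy subject maps, each a dict from a term to its related terms,
# plus a small crosswalk linking a term in one map to its counterpart
# in the other. (All data here is made up for illustration.)
lcsh_related = {
    "Heart--Diseases": ["Cardiovascular system--Diseases"],
    "Cardiovascular system--Diseases": ["Heart--Diseases"],
}
mesh_related = {
    "Heart Diseases": ["Cardiovascular Diseases"],
    "Cardiovascular Diseases": ["Heart Diseases"],
}
crosswalk = {  # LCSH term -> equivalent-or-close MeSH term
    "Heart--Diseases": "Heart Diseases",
}

def neighbors(term, vocab):
    """Related terms within one map, plus any crosswalk link out of it.

    Returns (vocabulary, term) pairs a browser could offer as next steps.
    """
    if vocab == "lcsh":
        links = [("lcsh", t) for t in lcsh_related.get(term, [])]
        if term in crosswalk:
            links.append(("mesh", crosswalk[term]))
    else:
        links = [("mesh", t) for t in mesh_related.get(term, [])]
        reverse = {mesh: lcsh for lcsh, mesh in crosswalk.items()}
        if term in reverse:
            links.append(("lcsh", reverse[term]))
    return links
```

A browsing interface built on something like this could keep each community’s vocabulary intact, while still letting a reader step from, say, a medical heading into the corresponding general-scholarship heading where a crosswalk entry exists.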

I hope to try out some of these strategies as we develop subject maps at Penn. (Our Franklin catalog has many subject entries in LCSH and MeSH, for example; and we also have uncontrolled tags that PennTags users assign to items in our catalog. I’d like to see if we can let users browse in useful ways across all three topical spaces.) I’ll post here about any interesting new visible developments and demos.

New Newbery and Caldecott winners announced; Old Newbery winners go online

One of the highlights of the American Library Association’s Midwinter meeting (which just concluded here in Philadelphia) is the announcement of the winners of the Newbery Medal, the Caldecott Medal, and ALA’s other book prizes. The Newbery is one of the oldest and best-known awards for children’s literature, and winning it can guarantee substantial sales, and keep a book in print for decades.

They’ve made some interesting picks this year. This year’s Newbery Medalist is Good Masters! Sweet Ladies! Voices From a Medieval Village. It’s not just one story; it’s 22: a connected set of vignettes told by characters who inhabit an English village in the year 1255. Author Laura Amy Schlitz (who’s also a librarian) wrote them to be performed by 5th graders at her school. Newbery committee chair Nina Lindsay calls the end result “a pageant that transports readers to a different time and place” through “varied poetic forms and styles offer[ing] humor, pathos and true insight into the human condition.”

The Newbery Honor books this year are:

The Caldecott medal usually goes to short picture books for young readers, but this year’s winner is different. It’s a 500+ page graphic novel by Brian Selznick called The Invention of Hugo Cabret, which tells the story of an orphan in early-20th-century Paris living inside the walls of a train station, trying to finish an invention left by his father. The ALA Caldecott site says “the suspenseful text and wordless double-page spreads narrate the tale… which is filled with cinematic intrigue.” Sounds intriguingly steampunk to me.

The Caldecott Honor books this year are:

I’m also happy to report that some of the earliest Newbery awardees have been put online. The early medal-winners have been up for a while, but the honor books can be just as interesting, though often much harder to find. Mary just posted Cornelia Meigs’ 1922 Newbery Honor Book The Windy Hill today to her Celebration of Women Writers, and the Open Content Alliance has recently scanned most of the other honor books from that year. (I’m still seeking a copy of Cedric the Forester, but all the others are now online.)

While Newbery medalists often stay in print indefinitely, lots of other good children’s books are published every year that quickly fade into obscurity. Even most of the early Newbery Honor Books are now out of print and hard to find. But perhaps not for long. Might some of the rightsholders to the older books, particularly the out of print titles, be willing to let them go online so kids can read them again? Or are some of them already fair game? In some initial investigations, Mary’s found more than a dozen post-1923 Newbery Honor Books, and two post-1923 Newbery medalists, whose copyrights appear not to have been renewed at all, and would therefore now be in the public domain.

You can read the online Newbery winners (and early winners of the Nobel and Pulitzer prizes as well) from the Prize-Winning Books Online exhibit of The Online Books Page. We hope to add more titles to this exhibit in the near future. If you’re interested in clearing copyrights or digitizing any of these books, I’d be very interested in hearing from you.

I hope you’ll enjoy reading these newly honored and newly digitized books!

Don’t shade your eyes

Back in 2006, Paul Collins wrote an article in Slate asking “Will Google Book Search uncover long-buried literary crimes?” Now that we have large corpuses of texts searchable online, he argued, it will become much easier to find words lifted from other writers than it once was. (Collins reported on a few such cases found in early GBS searches.) The Net may make it easier for people to misappropriate other authors’ words, but it also makes such misappropriation easier to detect.

Indeed, it’s starting to finger some well-known contemporary authors. Last week, some bloggers used GBS to expose a prolific, and still publishing, romance author who had repeatedly plagiarized the works of others. (The linked story is the first report of the news last Monday; its sidebar currently points to numerous followups, including more examples uncovered by the same blog and its readers.) Cassie Edwards may be one of the first well-known current authors to be caught out via online-library Googling, but she’s not likely to be the last. Here are some things mentioned in the ensuing discussions that may be worth remembering the next time this sort of thing comes to light:

Plagiarism is not the same as copyright infringement and is therefore not justifiable with a “fair use” excuse. Plagiarism and infringement both involve improper copying, but otherwise they have important differences. Copyright infringement is copying without proper authorization (either from the copyright holder, or from copyright law), and is a legal offense. Plagiarism is copying without proper attribution, and is an ethical offense. While they sometimes go together, it’s perfectly possible to plagiarize without violating copyright (such as by plagiarizing public domain sources), and to violate copyright without plagiarizing (such as by putting the Harry Potter books on your web site, with J. K. Rowling’s name left on them).

Standards of proper attribution may vary by genre, but they exist for all genres. Formal scholarly writing standards are especially strict about attribution, with detailed citations generally required for words or even ideas taken from someone else. Popular fiction, music, preaching, and the like, may not usually include footnotes, and reuse may be more common in those genres (especially for things like standard chord progressions), but generally speaking, if you “quote” extensively from someone else, you’re expected to credit them. This can be in your acknowledgements, your liner notes, or wherever, but if it’s more than just a brief or obvious allusion (like the title of this post, taken from a Tom Lehrer song about plagiarism) you need to credit it.

Plagiarism may be forgivable, but it can’t be excused or justified, at least by anyone who expects professional respect. Being caught early on may ironically be a blessing; had Edwards been caught out by her editor or a reader on one of the first of her 100+ books, perhaps she could have apologized, changed her ways, and gone on to earn lasting plaudits for the original writing in the books that followed. For that matter, had Edwards’ plagiarism been limited to nonfiction “infodumps”, and kept out of her story narratives, as it appeared in the first examples to be uncovered, it might have been easier to let go. Unfortunately, it turns out she also lifted descriptive passages from fiction, which cuts closer to her main line of work as a storyteller.

On the other hand, I find it easier to forgive someone like Martin Luther King, who was (posthumously) found to have plagiarized in many of his academic writings, including his doctoral dissertation. Since Rev. King is best known and honored as a great civil rights leader, it’s easier to forgive this flaw than it would be if he were mainly remembered as an academic or an author. But it’s still sad, as it is when one uncovers the flaws of any of our American heroes (as you will for any of them, if you look closely enough). And now I can’t hear someone call him “Doctor Martin Luther King” without mentally interpolating an asterisk after the title.

More than ever, now that people’s words are increasingly available for searching and scanning in perpetuity, it’s important to take responsibility for what goes out under your name, whether you’re a storyteller, a scholar, or a politician. If you own up to your mistakes and sins early on, and do what you can to fix them, there may be good hope of redemption. Wait, and the damage can worsen both for those you’ve wronged and for yourself. I’ll try to remember that in my own writing, and I hope my readers will help hold me to it.

Subjects are more than just facets (and an ALA talk plug)

The Library of Congress’ Working Group for the Future of Bibliographic Control announced its final report today. I haven’t yet read over the final version, but I read an earlier draft, and was particularly interested in what it had to say about subjects.

“How should we offer searching in library collections?” is a question that lots of libraries are asking. The answer heard a lot nowadays is “Facets!” Facets have been used in databases and e-commerce sites for some years now. Essentially, they define several (ideally independent) attributes for items, and then let users zero in on what they want by selecting and deselecting various attributes. For example, if you go to Amazon to buy shoes, you can select values from facets like brand, size, color, and price range. Try different selections, and you can quickly pick out the few pairs that best meet your needs out of the tens of thousands offered on the site. (Assuming you’re willing to buy shoes without trying them on.)
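The winnowing that facets enable is simple enough to sketch in a few lines of code. Here’s a toy illustration (the shoe data and function names are invented for this example, not drawn from any actual site’s implementation):

```python
# Toy faceted filtering: each item carries a value for each facet,
# and a query is a set of selected facet values that must all match.
shoes = [
    {"brand": "Acme", "color": "brown", "size": 9, "price": 45},
    {"brand": "Acme", "color": "black", "size": 10, "price": 60},
    {"brand": "Zenith", "color": "brown", "size": 9, "price": 80},
]

def facet_filter(items, selections):
    """Keep only the items matching every selected facet value."""
    return [item for item in items
            if all(item.get(facet) == value
                   for facet, value in selections.items())]

def facet_counts(items, facet):
    """Count matching items per facet value, so an interface
    can show choices like 'brown (2)' next to each facet."""
    counts = {}
    for item in items:
        value = item.get(facet)
        counts[value] = counts.get(value, 0) + 1
    return counts

matches = facet_filter(shoes, {"color": "brown", "size": 9})
```

The key property is that facets are independent attributes with flat value sets, which is exactly what makes the counting and filtering above so cheap, and also what makes richly interrelated subject headings an awkward fit, as discussed below.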

The Endeca catalog at NC State applies the same idea to finding books in the library. When it came out two years ago, lots of library folks got excited. And when open source tools like Solr made it easy to code up your own faceted catalog, it came as no surprise that lots of folks set out to try facet-based discovery for their collections. These new catalogs are in many ways big improvements over existing catalogs. Though, as K.G. Schneider and others point out, that’s not a high bar to clear.

We too use facets in some new applications we’re building here at Penn. But they don’t entirely work well with subject headings. Kelley McGrath’s article “Facet-Based Search and Navigation: Problems and Opportunities” in the inaugural issue of the Code4lib Journal describes some of the practical problems involved.

Some have said that subject headings should change to be more facet-oriented. The Calhoun Report, commissioned by the Library of Congress and released in 2006, went so far as to recommend dismantling the Library of Congress Subject Headings (LCSH), now the most common subject headings vocabulary. The more recent report from the Future of Bibliographic Control doesn’t go that far, but it does recommend transforming LCSH, “de-coupling subject strings” and evaluating LCSH’s ability to “support faceted browsing and discovery”. The FAST system, which breaks up subjects into uncoordinated facets, is mentioned as an interesting technology to pursue.

LCSH indeed has several problems associated with it: people have a hard time finding the appropriate subject terms for what they’re looking for; catalogers have a hard time constructing terms that follow all the LCSH rules; terms are used inconsistently across collections; terms are slow to adapt to contemporary usage; and both “traditional” and faceted library catalogs have a hard time connecting related terms together using LCSH.

Should we, then, dismantle LCSH into a simple system of facet sets? Not so fast, I say. Subjects are inherently messy things, neither fully discrete nor hierarchical, and in a large collection it’s important to be able to zero in on specific subjects through relationships. Not only is there a large installed base of materials already described with LCSH, but LCSH and ontologies like it allow books to be described with greater precision, and with richer relationships, than pure facets allow. (See Thomas Mann’s “The Peloponnesian War and the Future of Reference, Cataloging, and Scholarship in Research Libraries” for a spirited argument for the power of LCSH-style subject headings.)

What we really need are better tools that allow readers and catalogers to take full advantage of rich subject headings and relationships, and make it easier for subject heading systems to evolve more quickly to meet the needs of users. A technology I’m experimenting with now, and calling subject maps, involves networks of related subjects, techniques for enriching those networks through automation and user input, and displays that let users and librarians browse large collections by navigating through complex subject areas. Subject maps can play well with facets and user-assigned tags, to produce discovery systems that offer the best features of all of these technologies.
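To make the idea a bit more concrete, here’s a toy sketch of a subject map as a small graph of typed relationships, with a hook for enrichment through user input. The subjects, relationship names, and functions are all invented for illustration; they’re not part of any actual subject maps implementation:

```python
# A subject map as a tiny labeled graph: nodes are subject headings,
# edges are typed relationships (broader/narrower/related).
# All data here is made up for illustration.
subject_map = {
    "Greece--History": {
        "narrower": ["Peloponnesian War, 431-404 B.C."],
        "related": ["Greece--Civilization"],
    },
    "Peloponnesian War, 431-404 B.C.": {
        "broader": ["Greece--History"],
    },
}

def browse(subject):
    """List the (relationship, subject) steps a reader could
    navigate to from a given subject."""
    steps = []
    for relation, targets in subject_map.get(subject, {}).items():
        for target in targets:
            steps.append((relation, target))
    return steps

def add_user_link(subject_map, subject, other, relation="related"):
    """Enrich the map with a user-suggested relationship.
    (In practice this would go through review or moderation.)"""
    subject_map.setdefault(subject, {}).setdefault(relation, []).append(other)
```

Unlike a flat facet value list, the typed edges here can distinguish narrowing a topic from moving sideways to a related one, which is the kind of navigation the richer LCSH-style relationships support.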

Too good to be true? If you want to hear more, see a demo, or ask how this would actually work, come see and/or heckle me on Saturday at ALA. I’ll be presenting at the Catalog Form and Function Interest Group, at 10:30 AM in the Versailles Room of the Sofitel Philadelphia. For more info, and for other ALA forums that may be of interest to metadata librarians, see this post on the ALA blog.

Copyright and Provenance: A paper and an example

I’m happy to announce the publication of my paper “Copyright and Provenance: Some Practical Problems” in the latest issue of the IEEE Data Engineering Bulletin. I’ve also placed a copy in our institutional repository.

[Provenance of the work: Created by John Mark Ockerbloom, 2007. First published in Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, Vol. 30, No. 4, Dec. 2007, pp. 51-58. No previous works included; however, it derives in part from a previous presentation by the author at the 2007 Principles of Provenance Workshop in Philadelphia.

Provenance of the rights: Copyright originally by the author. Copyright assigned to IEEE, with certain rights retained by the author, via an IEEE Copyright Transfer Form (version as of Dec. 3, 2007) modified by a Science Commons addendum (Immediate Access 1.0).

Provenance of the preceding information: Asserted by the author, Jan 4, 2008.]

The bracketed paragraphs above should give you a taste of some of the provenance issues relevant to copyright clearance that I discuss in the paper. I wrote it primarily for computer scientists, essentially to argue that copyright clearance was an interesting and important application domain for research and development of provenance-aware systems, and to describe some of the basic issues involved. But it may also be of interest to librarians and others who are concerned about risk mitigation, efficiency, and value in clearing copyrights. It doesn’t go as deeply into clearance issues as other work in legal and library literature, but I hope that it provides a useful overview, with a minimum of technical jargon.

For what it’s worth, the draft proposal for embedding copyright status information in MARC records that I mentioned in an earlier post has a number of subfields for encoding the basics of the work-provenance I give above, as well as the information-provenance. It doesn’t have structured ways to express the derivation from previous work that I express above, or the rights assignment information, though these could conceivably go in unstructured notes fields.

Still, if it’s useful to use MARC (with the accompanying tradeoff between its installed base in libraries and its structural limitations) for encoding copyright information, the proposal looks to me like a good start (with some slight modifications I’ve suggested to the proposal committee). But to enable large-scale copyright clearance with automated assistance, we’re going to eventually need more sophisticated data structures. Relatively speaking, the copyright of my paper is still a lot less complex than many other important examples.

I’m hoping that efforts like OCLC’s Registry of Copyright Evidence project will eventually provide ways of expressing more complex copyright issues in a structured manner. And if there’s any sort of global persistent identifier in a MARC record for a work (whether an ISBN, a DOI, a copyright registration number, or some other suitable identifier), it could be used as a key for linking bibliographic information in the MARC record with detailed copyright evidence in a registry.
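As a minimal sketch of that kind of linking, consider a bibliographic record and a registry keyed by the same identifier. The field names, registry contents, and identifier below are all invented for illustration; they don’t reflect any actual MARC proposal or OCLC schema:

```python
# Toy sketch: a shared identifier serves as the join key between
# a bibliographic record and detailed copyright evidence held in
# a separate registry. (All data and field names are made up.)
bib_record = {
    "title": "Example Monograph",
    "identifier": "isbn:0000000000",  # hypothetical identifier
}

copyright_registry = {
    "isbn:0000000000": {
        "registered": "1950",
        "renewal_found": False,
        "evidence": "No renewal located in a search of renewal records.",
    },
}

def copyright_evidence(record, registry):
    """Look up a record's copyright evidence by its identifier;
    returns None if the registry has nothing for this item."""
    return registry.get(record.get("identifier"))
```

The point of the join-key approach is that the bibliographic record stays small and stable, while the registry entry can grow arbitrarily detailed evidence over time without requiring every holding library to re-edit its records.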

Registries aren’t the only places this information can go, of course. Detailed, machine-readable copyright information can also be embedded directly in a work, thanks to the standards efforts of projects like Creative Commons. That can be quite useful, especially for folks who want to dedicate their work to public use in a simple manner, and see no need to wait 14 years or more to do it.