When I implemented subject maps for browsing the Online Books Page by subject a while back, I had a big problem to face: I didn’t actually have subject terms for the books. How could I implement subject browsing without subjects?
By bootstrapping. I did have call numbers for the books, which arrange books by discipline. (That’s what lets you see similar books grouped together in a library’s nonfiction shelves.) It’s possible to infer a subject from a call number, though you’ll sometimes get a more general subject than the book’s really about, or miss secondary subjects. The Library of Congress authority records for subjects include call number ranges that apply to many of their authorized terms. My library had some of these authority records in its catalog. They’re often a few years behind the state of LC’s official list, but they’re still recent enough to be useful.
So I downloaded those records and then wrote a program that, given a call number, would try to find the subject with the smallest range that included that call number. Doing that helps you get specific subjects instead of general ones; if your call number is HD1306, you want to match the range HD1301-HD1306 (for “Land, Nationalization of”) rather than the wider range HD101-HD1395 (for the more general subject “Land use”). After filtering out some bad data in a few authority records, and suppressing some terms to break ties, I ran the program, and instantly got subject terms for tens of thousands of books. Some of them were pretty generic (thousands of books were simply labeled “English literature”, for instance), but many were quite specific, and I’d say the assigned terms were useful descriptions of the book over 90% of the time. The maps I built largely based on these assigned subjects worked pretty well from day one.
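A minimal sketch of that narrowest-range lookup, in Python, might look like the following. It is not the original program: the names (`SubjectRange`, `make_range`, `narrowest_subject`) are made up for illustration, the two sample ranges are illustrative data rather than a copy of real authority records, and the parser assumes a simplified call number made of class letters plus a number, ignoring anything after that (cutters, dates).

```python
import re
from typing import NamedTuple, Optional


class SubjectRange(NamedTuple):
    """A subject term and the call number range it covers (hypothetical type)."""
    subject: str
    cls: str      # class letters, e.g. "HD"
    low: float    # first class number in the range
    high: float   # last class number in the range (inclusive)


# Simplifying assumption: only the class letters and class number matter.
CALL_RE = re.compile(r"^([A-Z]{1,3})\s*(\d+(?:\.\d+)?)")


def parse_call_number(text: str) -> tuple[str, float]:
    """Split a call number like 'HD1306 .G7' into ('HD', 1306.0)."""
    match = CALL_RE.match(text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized call number: {text!r}")
    return match.group(1), float(match.group(2))


def make_range(subject: str, start: str, end: str) -> SubjectRange:
    """Build a range from its endpoints, rejecting obviously bad data."""
    cls, low = parse_call_number(start)
    end_cls, high = parse_call_number(end)
    if cls != end_cls or high < low:
        raise ValueError(f"bad range for {subject!r}: {start}-{end}")
    return SubjectRange(subject, cls, low, high)


def narrowest_subject(call_number: str,
                      ranges: list[SubjectRange]) -> Optional[str]:
    """Return the subject whose range contains the call number and is smallest."""
    cls, num = parse_call_number(call_number)
    candidates = [r for r in ranges
                  if r.cls == cls and r.low <= num <= r.high]
    if not candidates:
        return None
    # On a tie, min() keeps the first candidate; a real run would need an
    # explicit tie-breaking rule, such as suppressing some terms.
    return min(candidates, key=lambda r: r.high - r.low).subject


# Sample data for illustration only.
RANGES = [
    make_range("Land use", "HD101", "HD1395"),
    make_range("Land, Nationalization of", "HD1301", "HD1306"),
]

if __name__ == "__main__":
    print(narrowest_subject("HD1306", RANGES))
    print(narrowest_subject("HD500", RANGES))
```

A linear scan like this is fine for a few thousand ranges; with a much larger set of authority records, an index keyed by class letters would presumably be the first thing to add.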
I didn’t stop there, though. Little by little I went back and found more precise and appropriate subjects for books in various parts of my collection. When I did this, I could also assign multiple subjects, instead of having to make do with one. (I’m not trained as a cataloger, but I know the basics of how LCSH subject assignment works, and can look over terms assigned by the Library of Congress or other libraries and choose the ones that seem to make the most sense for a given title.) I also kept track of which books had the automated subject assignments, and which had human-overseen cataloging.
As of today, I’ve assigned these more precise and comprehensive subjects to all the non-literature books in my collection that had call numbers. A lot of the fiction, and some of the more obscure nonfiction without a call number, still lacks subject cataloging. (As far as I can tell from Worldcat, many fiction books have never been subject-cataloged by anyone.) Some of these books will eventually get subject terms as well. But the only automatic subject assignments still left are for a few (mostly large) generic literature categories, which by now mostly get in the way of discovering the other books. So today I’m turning those automated categorizations off.
Now that I’ve completed this phase of subject browsing enhancement, I’m excited to think about what might come next. I know from my usage logs that lots of people are browsing by subject (and that, based on the bad link reports I get, they’re finding books that had largely been overlooked before). Now that I have consistent high-quality subject metadata for nonfiction, I can think of various ways to improve subject-based discovery, both for this collection and for others. I can work on ways to keep the subject map up to date with the latest changes in subject vocabularies. I can implement techniques for establishing more relevant connections between subjects. I can investigate ways to integrate data from less consistent sources (such as most large library catalogs) into subject maps and compensate for (or even automatically correct) their inconsistencies.
For now, though, I’ll stop for a moment, take a breath, and come up to blog, before diving back into this and other projects.