Libraries and bookstores have perennially faced the problem of how to organize books on their shelves. There’s a tension between making certain books easy to find for readers with one set of interests, and making them more difficult to find for other readers. For instance, some libraries and bookstores near me have a section for African American fiction. Readers particularly interested in African American authors can easily find their books in this section. But if novels by African American authors are shelved there instead of in the general fiction section, readers browsing general fiction might not find many African American authors there. Similar issues have arisen with genre fiction sections in libraries. A separate “Science Fiction” section can be a convenient service for fans of that genre. But some readers have objected that such sections push science fiction off into a corner, making it easy for “mainstream” readers to overlook the genre.
In theory, online libraries shouldn’t have as much trouble organizing their books and subjects. Freed from the physical constraints of bound paper and shelves, the same book can be placed in many virtual locations, not just one. But in practice, many of the problems of categorization persist in the online world. Last week, for instance, Amanda Filipacchi noted in the New York Times that Wikipedia’s category listing of American novelists was disproportionately male, in part because some editors had been taking women authors out of this category and moving them to the more specialized “American women novelists” category. As far as I can tell, Wikipedia policy does not call for this sort of marginalization, but it doesn’t prevent it from happening either. It’s not just a matter of editors with an agenda and time on their hands; it also happens because manually filing people under multiple categories takes more effort than filing them under one, and it’s easy to neglect or forget to put someone in a broader category after placing them in a narrower one. So people are classified under women authors but not authors, under chemists but not scientists, under Catholics but not Christians. Readers who look for articles in the more general category listings can easily miss people who are only filed in the more specific ones. (And even if those category listings were not originally intended for browsing, many Wikipedia readers do use them that way.)
In systems that have explicit hierarchies of categories (such as Wikipedia categories, or Library of Congress Subject Headings), there’s a fairly straightforward way to solve this particular problem: When a person is placed in a specific category, the system should automatically also place them in any broader categories of people that encompass the original category. If someone is categorized under “Women chemists”, for instance, they should also get automatically categorized under “Chemists”, “Women scientists”, and “Scientists”. This inclusion can be implemented in various ways, but the important thing is that narrowly classified people should be just as visible to readers browsing the broader categories as people who were explicitly classified under the broader categories.
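As a rough illustration of this kind of promotion (a sketch, not the actual code behind The Online Books Page), the broader categories can be collected by walking the hierarchy upward from each explicitly assigned category. The `BROADER` table and the function name `promoted_categories` are hypothetical, and the sketch assumes the hierarchy contains only categories of people.

```python
from collections import deque

# Hypothetical hierarchy: each category of people maps to the
# categories immediately broader than it.
BROADER = {
    "Women chemists": ["Chemists", "Women scientists"],
    "Chemists": ["Scientists"],
    "Women scientists": ["Scientists"],
    "Scientists": [],
}


def promoted_categories(explicit, broader=BROADER):
    """Return the explicit categories plus every broader category above them."""
    seen = set(explicit)
    queue = deque(explicit)
    while queue:
        category = queue.popleft()
        for parent in broader.get(category, []):
            # The 'seen' check avoids repeated work when two paths
            # lead to the same broader category.
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(promoted_categories(["Women chemists"])))
```

With this table, someone filed only under “Women chemists” also turns up under “Chemists”, “Women scientists”, and “Scientists”.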
I implemented this automatic category promotion yesterday on The Online Books Page, which for a while has classified people that are the subject of listed biographies. Consider, for instance, St. Catherine of Siena. The catalog data for her biography in The Online Books Page categorizes her as a Christian woman saint, and readers have been able to find her under that subject for some time now. Thanks to the new algorithm, readers will now also find her when they browse broader subjects like Christian saints, or saints generally.
We could be doing this sort of thing in other library catalogs, and in Wikipedia, as well. Why aren’t we? I’ve seen a few objections to the idea:
It’s too hard to implement? It doesn’t have to be. It took me just part of a Sunday afternoon to implement the feature on The Online Books Page, and I suspect a good programmer who was familiar with (and could modify) the relevant source code would not have much trouble implementing the feature in a well-designed catalog or Wiki. In my experience, I had to spend more time modifying my data than modifying my code. The Library of Congress Subject Headings, the subject system used by The Online Books Page, is not complete or consistent in its subject hierarchies, and I had also miscoded some topical subjects as people. But it’s possible to clean up and enhance this kind of data, and doing so often benefits both present and future applications of the data.
It defeats the purpose of hierarchical categories? I’ve seen this objection made in some of the Wikipedia discussions around this issue, and it doesn’t make sense to me when I think it through. Far from being useless, the category hierarchy is precisely what makes it possible to automatically promote people in narrow categories into broader categories. It also helps save the time of categorizers; they only have to explicitly place people in precise categories, and if the hierarchy is well-constructed the system will automatically take care of the broader categories. (If the system also keeps track of which category assignments are explicit and which are automatic, it can also update them appropriately when categorizations or hierarchies get edited.) I’m also not flattening hierarchies across the board; I’m only recommending at this point that this sort of promotion be done for people, in categories of people. (More generally, it might be useful for any kind of individual instance that is categorized under abstract classes of those instances. But doing it for people is a good start.)
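One way to keep explicit and automatic assignments apart, sketched here under assumptions, is to store only the explicit assignments and derive the automatic ones from the hierarchy whenever they are needed. Then editing a categorization or the hierarchy changes the derived results with no separate cleanup step. The class `PersonCategories`, its method names, and the sample category names are all hypothetical.

```python
class PersonCategories:
    """Stores explicit assignments; automatic ones are derived on demand."""

    def __init__(self, broader):
        self.broader = broader  # category -> immediately broader categories
        self.explicit = {}      # person -> set of explicitly assigned categories

    def assign(self, person, category):
        self.explicit.setdefault(person, set()).add(category)

    def unassign(self, person, category):
        self.explicit.get(person, set()).discard(category)

    def automatic(self, person):
        """Broader categories implied by the hierarchy, minus the explicit ones."""
        explicit = self.explicit.get(person, set())
        seen = set()
        stack = list(explicit)
        while stack:
            for parent in self.broader.get(stack.pop(), []):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen - explicit

    def members(self, category):
        """Everyone in the category, whether placed explicitly or automatically."""
        return sorted(
            person
            for person, cats in self.explicit.items()
            if category in cats or category in self.automatic(person)
        )


catalog = PersonCategories({
    "Christian women saints": ["Christian saints", "Women saints"],
    "Christian saints": ["Saints"],
    "Women saints": ["Saints"],
})
catalog.assign("Catherine of Siena", "Christian women saints")
print(catalog.members("Saints"))
```

Recomputing on every lookup is the simplest design to reason about. A larger catalog would more likely cache the derived assignments and invalidate them on edits.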
It makes the broader categories too crowded to be useful? In a comprehensive catalog such as Wikipedia, there will be a lot of people in categories like “writers”, once you include all the people in sub-categories. But there still will be a lot of people in that category even if you banish all the women to a “women writers” subcategory. Creating another category for “men writers” doesn’t really solve the problem; all it does is force people to choose which gender they want to browse, instead of letting them browse writers of both genders if that’s what they want to do. And after the split, the broader “writers” category will most likely still be left with a random assortment of writers without gender classification, who might or might not be the people a reader is most interested in.
Well-designed interfaces make it possible to usefully browse large collections of items. Relevance ranking, for instance, can be used to put the most notable examples of a category at the top of a long list of its members. That’s in fact what we routinely expect to happen in good search engines. And mechanisms like faceted navigation (used in many online catalogs) and subject maps (used on The Online Books Page) make it easy to shift focus to more precise or related categories based on a reader’s interests. In systems that implement these features, categories with lots of members are good things to have, not bad things.
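As one simple illustration of relevance ranking for a crowded category, the members could be sorted by some numeric score so that the highest-scoring people come first. The function `rank_members` and the `scores` mapping are hypothetical, and how such a score would actually be computed is left open here.

```python
def rank_members(members, scores, limit=None):
    """Order members by descending score, breaking ties alphabetically.

    'scores' is a hypothetical mapping from a person's name to a number;
    anyone missing from it is treated as having a score of 0.
    """
    ranked = sorted(members, key=lambda name: (-scores.get(name, 0), name))
    return ranked if limit is None else ranked[:limit]


# Placeholder names and made-up scores, purely for illustration.
print(rank_members(["Writer B", "Writer A", "Writer C"],
                   {"Writer A": 1, "Writer B": 5}, limit=2))
```

Breaking ties alphabetically keeps the order stable from one page view to the next, so readers do not see the list reshuffle arbitrarily.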
I haven’t yet implemented relevance ranking in my subject browsing. Right now, The Online Books Page doesn’t actually classify many people to begin with, so most of my categories don’t have a lot of people in them. But I could see a number of ways to implement such ranking in a catalog like The Online Books Page, or in Wikipedia, which I can discuss later if there’s interest.
In summary, then, well-designed catalogs and wikis should be able to categorize people comprehensively without marginalizing them. Three features that make this possible are:
- detailed, well-organized systems of categories and their relationships
- systems that automatically show people in broader categories when they’re classified in narrower ones
- and ranking and navigation mechanisms that make it easy to pick out the people with the most general interest, or the qualities of interest to a particular researcher, from a large overall set of people.
I’ll continue to work on implementing these features on The Online Books Page, and would be very interested in participating in discussions of how they can better work there, in other catalogs, and in systems like Wikipedia.