Making your content findable

The best library collections don’t do much good if people who may be interested in the content don’t find it. That’s why it’s so important to provide good discovery tools for your collection. But even if you do that, lots of folks are going to find your content not through your own discovery interfaces, but through links from the outside. A large proportion of the visits to our institutional repository don’t come through the front door, but from Google and other search engines indexing papers in the repository.

There’s a whole industry set up around search engine optimization, by means both fair and foul. The basics are pretty simple, though: make it easy for search engines to find content that you want indexed, and make it easy for interested people to link to your content. (Not only will people follow those links, but search engines will typically favor content with more links to it.)

Seems pretty simple, but there are still lots of sites in the digital library world that don’t have clear persistent URLs for their content, that use JavaScript, Flash, and other extras for navigation that crawlers (and many users) can’t follow, that require traversing too many links to get to important content, or that make their content findable only via searching (which crawlers won’t do) and not via browsing.

A new article in D-Lib, “Site Design Impact on Robots: An Examination of Search Engine Crawler Behavior at Deep and Wide Websites”, gives some empirical support to some of the common-sense tips for getting sites indexed. Not surprisingly, the authors find that “wide” websites (those that make their content reachable through a relatively small number of clicks) are more readily indexed than “deep” websites (those that require a large number of clicks to get to some of their content). They also report that Google crawled their sites much more thoroughly than MSN and Yahoo (one reason, I’m sure, why folks doing obscure searches tend to prefer it). I was a bit surprised, though, to find that their .com sites were crawled more quickly than the same sites in a .edu domain. (The authors speculate that advertising revenue considerations may have something to do with this.)
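If you want a rough sense of how "wide" or "deep" your own site is, one way is to measure how many clicks it takes to reach each page from the front page. Here's a minimal sketch, assuming you already have your site's links as a simple dictionary mapping each page to the pages it links to. The two tiny sites and their page names are made up for illustration; this is not the method used in the D-Lib article.

```python
from collections import deque

def click_depths(links, start):
    """Return the minimum number of clicks from `start` to each reachable page.

    `links` maps a page name to the list of pages it links to.
    Pages that cannot be reached by following links are left out.
    """
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest path
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Hypothetical "wide" site: the front page links to every item directly.
wide_site = {"home": ["item1", "item2", "item3", "item4"]}

# Hypothetical "deep" site: each page links only to the next one.
deep_site = {
    "home": ["item1"],
    "item1": ["item2"],
    "item2": ["item3"],
    "item3": ["item4"],
}

wide_max = max(click_depths(wide_site, "home").values())
deep_max = max(click_depths(deep_site, "home").values())
print("wide site, deepest page:", wide_max, "click(s)")
print("deep site, deepest page:", deep_max, "click(s)")
```

Any page that's missing from the result can't be reached by browsing at all, which usually means a crawler won't find it either unless something outside your site links to it.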

One kind of useful page that the authors don’t mention is a “new content” page, which I’ve found helps get my Online Books Page indexed very quickly. Based on cache dates I’ve seen in search results, Google seems to check my new books page every day or two, and crawls the new links it finds there. Since every book I list appears there at one time or another, this provides very thorough indexing for virtually the entire site.

I suspect that Google and other crawlers pay special attention to pages like these, particularly if they’re updated frequently. This might also explain why blogs tend to get particularly highly ranked in Google and other engines compared to other sites, since they also are frequently updated, and show all their new content prominently, in reverse-chronological order.
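A "new content" page of the kind described above doesn't need to be fancy. Here's a minimal sketch of the idea, not the code behind my own new books page: take your listings, sort them newest first, and write out a plain list of direct links. The record fields (`title`, `url`, `added`), the example.org URLs, and the dates are all assumptions made up for the example.

```python
import html
from datetime import date

def newest_first(items, limit=50):
    """Return up to `limit` items, most recently added first."""
    return sorted(items, key=lambda item: item["added"], reverse=True)[:limit]

def render_new_content_page(items, limit=50):
    """Render a plain HTML list of direct links, newest first."""
    lines = ["<h1>New listings</h1>", "<ul>"]
    for item in newest_first(items, limit):
        url = html.escape(item["url"], quote=True)
        title = html.escape(item["title"])
        added = item["added"].isoformat()
        lines.append(f'  <li>{added}: <a href="{url}">{title}</a></li>')
    lines.append("</ul>")
    return "\n".join(lines)

# Hypothetical catalog records, for illustration only.
catalog = [
    {"title": "An Older Book", "url": "https://example.org/book/1", "added": date(2007, 1, 5)},
    {"title": "A Newer Book", "url": "https://example.org/book/2", "added": date(2007, 3, 9)},
]

page = render_new_content_page(catalog)
print(page)
```

The links are ordinary `<a href>` links to persistent URLs, with no scripting needed to follow them, so a crawler that revisits the page can pick up each new item in a single click.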

You can find many other useful tips for getting indexed and linked in Cory Doctorow’s article “17 Tips for Getting Bloggers to Write About You”. The title may sound off-topic, but the article is mostly about making your content easy to link to. Most of the tips also work for making your content easy to index.

The more links there are to your content from other interesting places, and the more your content gets indexed by search engines, the more interested readers will find and use your content when you put it online. Making your content link-friendly and crawler-friendly, then, can help your library serve lots of new readers.

About John Mark Ockerbloom

I'm a digital library strategist at the University of Pennsylvania, in Philadelphia.


This entry was posted in discovery, libraries.