Everybody's Libraries
Libraries for everyone, by everyone, shared with everyone, about everything

Concepts in catalogs: Where the data comes from

Posted on January 15, 2010 by John Mark Ockerbloom

I’ve now made a few posts about concept-oriented catalogs, describing the basic idea, showing some examples, and talking about the kinds of context they should provide for users.  As I mentioned in my first post, concepts in such catalogs are “first-class locuses of information to help readers find useful knowledge resources”.  The catalogs I’m describing include a variety of concepts (beyond the bibliographic record) that have data associated with them, and this data gives users a helpful context for finding appropriate knowledge resources.

As I said in my example post, “The concepts come from, and are maintained by, various groups of people…. [They]  may be derived in part from existing MARC bibliographic metadata (sometimes through automated analysis), but often draw from additional data sources.”

If you’ve worked in cataloging lately, you might be thinking, “that’s nice, but we’ve got our hands full just providing MARC catalog records for all the books and other stuff coming through the door now.   Where’s all this other ‘concept’ data going to come from?  And how will it be practical to use and maintain?”

In this post, I’d like to take a stab at answering those questions.  I’ll draw a lot from my experience with subject maps, but a lot of what I say should apply to other kinds of concept data as well.

The conceptual data behind subject maps consists of annotations on different subjects, and links between related subjects. A lot of what I need to build these maps can simply be reused from existing data. In particular, the Library of Congress Subject Headings system (LCSH) provides a large set of subjects with standardized names. We also have a set of authority records associated with those subjects that give alternate names, notes, and links to related subjects.

To make it practical to build a subject map for this data, I bulk-loaded authority records from our local catalog.  While the Library of Congress Authorities are more up to date than our local catalog, I could only look up records there one at a time, through an interface designed for manual browsing.  Fortunately, since then the Library of Congress has provided ways to download subject authority data in bulk.  It’s in a format that omits some details, but it should still help fill out our maps when we start including these records as well.  Because our library and the Library of Congress are both using a common system of identifiers for subjects, as well as compatible formats for expressing subject relationships, I’ll be able to combine our authority information with theirs to provide useful maps.  The identifiers we use are not always in sync; LCSH subject terms do get renamed and discontinued from time to time.  But the cross-references in LCSH authority records, which often include the old terms as aliases of the new terms, help reduce the pain involved in moving from old terms to newer terms.
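The alias idea in the last sentence can be sketched in a few lines. This is only an illustration, not the code behind the subject maps: the `AUTHORITY_ALIASES` table and the function names are hypothetical, and real authority records carry much more than a list of alternate names.

```python
# Hypothetical, hand-made sample of authority data:
# preferred heading -> older or alternate names taken from cross-references.
AUTHORITY_ALIASES = {
    "Cooking": ["Cookery"],
    "World War, 1914-1918": ["European War, 1914-1918"],
}


def build_alias_index(authority_aliases):
    """Map every known name (lowercased) to its current preferred heading."""
    index = {}
    for preferred, aliases in authority_aliases.items():
        # A preferred heading always resolves to itself.
        index[preferred.lower()] = preferred
        for alias in aliases:
            # Do not let an alias displace an entry that is already present.
            index.setdefault(alias.lower(), preferred)
    return index


def resolve_heading(term, index):
    """Return the current heading for a term, or None if it is unknown."""
    return index.get(term.strip().lower())


alias_index = build_alias_index(AUTHORITY_ALIASES)
print(resolve_heading("Cookery", alias_index))
```

A lookup like this lets records that still carry an old term be attached to the subject under its newer name, without editing the records themselves.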

Subject maps built just on authority records turn out to be pretty generic, and not as useful as they could be. To make them more useful, we need more data. As I describe in more detail in this white paper, I also analyze our bibliographic corpus in three ways: I see which subject terms we actually use in our catalog, look at the structure of those terms (which are often coordinated from multiple components), and look for correlations between terms that get used together in the same bibliographic records. This analysis lets me create additional useful relationships between subjects. In short, I use automated analysis of a large data corpus to create new concept data from existing data.
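Here is a rough sketch of two of those analyses, splitting a heading into its components and counting which headings appear together in the same record. It assumes each bibliographic record has already been reduced to a plain list of subject strings with components separated by "--"; the sample records and function names are made up for illustration, and the white paper describes the actual approach.

```python
from collections import Counter
from itertools import combinations


def split_components(heading):
    """Split a coordinated heading such as 'A -- B' into its components."""
    return [part.strip() for part in heading.split("--") if part.strip()]


def cooccurrence_counts(records):
    """Count how often each pair of headings shares a bibliographic record."""
    counts = Counter()
    for subjects in records:
        # Deduplicate and sort so each pair is counted once, in a stable order.
        for first, second in combinations(sorted(set(subjects)), 2):
            counts[(first, second)] += 1
    return counts


# Hypothetical records, each reduced to its list of subject headings.
sample_records = [
    ["Philadelphia (Pa.) -- History", "Pennsylvania -- History"],
    ["Philadelphia (Pa.) -- History", "Pennsylvania -- History", "Quakers"],
    ["Quakers"],
]

pair_counts = cooccurrence_counts(sample_records)
print(split_components("Philadelphia (Pa.) -- History"))
print(pair_counts.most_common(1))
```

Pairs with high counts are candidates for "related subject" links, though a real analysis would probably want to weigh the counts against how common each heading is on its own.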

In order to link together the many subjects that have geographic aspects, I need some extra data that isn’t in authority records.   Once I created a data record that noted that “Pennsylvania” is a US state that gets abbreviated “Pa.” in some subject headings, I was able to build all kinds of relationships between “Philadelphia (Pa.)” and related subjects, none of which are directly stated in the authority records for these subjects, but all of which can be derived by automated analysis.  (It helps that subject terms in LCSH have a fairly well-defined structure that’s amenable to lexical analysis.)  A couple hundred other brief geographic data records are enough to let users zoom in and out of locations all over the globe.  So a small amount of well-designed and curated supplementary data can often enhance lots of concepts, with minimal maintenance cost.
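To make the "Pa." example concrete, here is a small sketch of how a handful of geographic records plus some lexical analysis can yield a chain of broader places. The two lookup tables stand in for the supplementary geographic data records; their shape, the regular expression, and the function name are assumptions made for this example.

```python
import re

# Hypothetical supplementary data: qualifier abbreviation -> full place name.
GEO_ABBREVIATIONS = {"Pa.": "Pennsylvania", "N.J.": "New Jersey"}
# Hypothetical supplementary data: place -> the broader place containing it.
GEO_PARENTS = {"Pennsylvania": "United States", "New Jersey": "United States"}

# Matches a leading name followed by a parenthesized qualifier,
# as in "Philadelphia (Pa.)".
QUALIFIER = re.compile(r"^(?P<name>[^()]+?)\s*\((?P<qualifier>[^()]+)\)")


def broader_places(heading):
    """Return the chain of broader places implied by a heading's qualifier."""
    match = QUALIFIER.match(heading)
    if match is None:
        return []
    region = GEO_ABBREVIATIONS.get(match.group("qualifier").strip())
    chain = []
    while region is not None:
        chain.append(region)
        region = GEO_PARENTS.get(region)
    return chain


print(broader_places("Philadelphia (Pa.)"))
```

Each new entry in the tables benefits every heading that uses that qualifier, which is why a small amount of curated data goes a long way here.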

While I can easily zoom in and out between the US, Pennsylvania, Philadelphia, and locations within Philadelphia, I’d need more data to move side to side.  I don’t have any data, for instance, that tells me that Philadelphia is right next to Camden, New Jersey.  But fortunately, I can mine external data sources to find this information.  I recently read about a source of public domain global map data, for instance, that I (or any other geographic-concept catalog builder) could use to link subjects or other resources to a world map.

Increasing amounts of public data are distributed online.  If the data is public domain, or available with a liberal license, I don’t have to worry about legal roadblocks to downloading it, analyzing it, and using it in my own work.  Sharing data helps everyone build not only smarter catalogs, but smarter applications of all kinds.

Data sharing does not always happen painlessly.  I may have different concepts, or different names for concepts, than someone else whose data I might find useful.  We may have different ideas about how to structure our data.  But there are now systems that provide links between different names, and crosswalks between different structures that can help bridge the gap between my data and that of others.

With large enough corpuses of data to draw on, I can even make use of unstructured information from large groups of ordinary users.  For example, LibraryThing’s tag cloud displays a number of terms that are useful to include in one’s own library catalog.  Not all of them are formally defined subjects, but they’re used enough that we should expect most of them to be used in patron searches.  It should be possible to analyze the cloud and the things tagged in the cloud to associate many informal terms with particular subjects or library resources.
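One simple way to make that association is to count which formal subjects are assigned to the works that carry a given tag. The sketch below assumes the tag data has already been gathered into two plain dictionaries; it does not use any LibraryThing interface, and the names and sample data are hypothetical.

```python
from collections import Counter


def subjects_for_tag(tag, tagged_works, work_subjects, top_n=3):
    """Return the formal subjects most often assigned to works with this tag."""
    counts = Counter()
    for work in tagged_works.get(tag, []):
        # Count each subject at most once per work.
        counts.update(set(work_subjects.get(work, [])))
    return [subject for subject, _ in counts.most_common(top_n)]


# Hypothetical data: tag -> work ids, and work id -> catalog subject headings.
sample_tagged_works = {"steampunk": ["w1", "w2", "w3"]}
sample_work_subjects = {
    "w1": ["Science fiction", "Alternative histories (Fiction)"],
    "w2": ["Science fiction"],
    "w3": ["Science fiction", "Fantasy fiction"],
}

print(subjects_for_tag("steampunk", sample_tagged_works, sample_work_subjects, top_n=1))
```

With enough tagged works, the top few subjects for an informal tag give a patron searching on that tag a reasonable place to land in the catalog.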

To summarize, it becomes much easier to derive the data needed for concept-oriented catalogs if

  • We have stable (or at least smoothly evolving) identifiers for concepts
  • We can use, swipe, and reuse a large domain of [meta]data for concept analysis (including automated analysis)
  • We carefully consider what additional concept data would enhance our services, and use standard, recognized forms to represent it
  • We have correspondences and crosswalks between different concept identifiers and formats
  • We share our concept data (and bibliographic data in general) as openly and broadly as possible
  • And we share information, expertise, and code that supports the innovative, useful catalogs we build.

There’s a non-trivial technical infrastructure implied by these requirements.   But it’s one that we can build.  (Quite a bit of it’s in place already.)  A lot of it depends on a healthy social infrastructure to create, maintain, share, and work with all the data and services that we create and adopt.  I hope to talk more about this social infrastructure in future posts.

About John Mark Ockerbloom

I'm a digital library strategist at the University of Pennsylvania, in Philadelphia.
This entry was posted in architecture, discovery, formats, metadata, sharing, subjects.