<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
		>
<channel>
	<title>Comments on: Making discovery smarter with open data</title>
	<atom:link href="http://everybodyslibraries.com/2010/05/06/making-discovery-smarter-with-open-data/feed/" rel="self" type="application/rss+xml" />
	<link>http://everybodyslibraries.com/2010/05/06/making-discovery-smarter-with-open-data/</link>
	<description>Libraries for everyone, by everyone, shared with everyone, about everything</description>
	<lastBuildDate>Tue, 18 Jun 2013 23:06:13 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
	<item>
		<title>By: John Mark Ockerbloom</title>
		<link>http://everybodyslibraries.com/2010/05/06/making-discovery-smarter-with-open-data/#comment-2135</link>
		<dc:creator><![CDATA[John Mark Ockerbloom]]></dc:creator>
		<pubDate>Fri, 07 May 2010 16:09:38 +0000</pubDate>
		<guid isPermaLink="false">http://everybodyslibraries.com/?p=1375#comment-2135</guid>
		<description><![CDATA[Ed: Thanks for your note, and for telling us about the feeds.  Those sound useful too, though I&#039;m happy to work with the big dumps myself for now.

Jonathan: I&#039;d be happy to consider a code4lib article if there&#039;s interest.  For now, here are my quick answers to your questions:

* It&#039;s not powered by either Solr or a DBMS.  The code is based on a set of Perl modules I started writing back in the 20th century; the data&#039;s kept in something much like Berkeley DB files, but much less space-hungry.  I like the idea of moving it to Lucene, which might include Solr, but there are obstacles: shifting things over to Java would be nontrivial; Perl versions of Lucene had performance issues last I checked; and using an external Solr instance might or might not improve things (I haven&#039;t thought much about this last option, but it would at the very least involve more overhead keeping everything running).

*  The NT relationships are determined from the authority file encodings plus my heuristic rules.  Some of these rules also look at subject assignments in the collection to be indexed.  The exact details are a bit long for a comment but I could detail them elsewhere.

* My current plan is to reprocess when LCSH updates, but that&#039;s not a big deal for me.  New RDFs seem to come out every month or two.  Preliminary processing to prepare the new RDF for my indexes takes less than 5 minutes for the entire RDF.  I can then regenerate not only the subject maps but also all other indexes for my collection in another 5 minutes; this is typically done daily.  The time required scales up somewhat with collection size, but earlier test indexes with Penn&#039;s whole catalog (and a smaller set of authorities and heuristics, granted) took under an hour, IIRC.

* I do plan to bring in local authorities as well; I would have done it by now except for some character encoding issues.  I could see some possible issues with links to obsolete forms, but there should be ways to address this.  (Actually, this sort of analysis could also be tweaked to flag obsolete subject terms in one&#039;s catalog, and possibly change them in bulk as well.)

* Yes, I&#039;d be happy to share code, but see comments above about 20th-century Perl.  Not sure how useful it will be in general (and I&#039;m not in a position to provide support for it), but I can show it to folks who are interested.

Hope this answers your questions; and again, if folks would be interested in learning more about various technical aspects, I&#039;d be happy to consider doing a paper.]]></description>
		<content:encoded><![CDATA[<p>Ed: Thanks for your note, and for telling us about the feeds.  Those sound useful too, though I&#8217;m happy to work with the big dumps myself for now.</p>
<p>Jonathan: I&#8217;d be happy to consider a code4lib article if there&#8217;s interest.  For now, here are my quick answers to your questions:</p>
<p>* It&#8217;s not powered by either Solr or a DBMS.  The code is based on a set of Perl modules I started writing back in the 20th century; the data&#8217;s kept in something much like Berkeley DB files, but much less space-hungry.  I like the idea of moving it to Lucene, which might include Solr, but there are obstacles: shifting things over to Java would be nontrivial; Perl versions of Lucene had performance issues last I checked; and using an external Solr instance might or might not improve things (I haven&#8217;t thought much about this last option, but it would at the very least involve more overhead keeping everything running).</p>
<p>*  The NT relationships are determined from the authority file encodings plus my heuristic rules.  Some of these rules also look at subject assignments in the collection to be indexed.  The exact details are a bit long for a comment but I could detail them elsewhere.</p>
<p>* My current plan is to reprocess when LCSH updates, but that&#8217;s not a big deal for me.  New RDFs seem to come out every month or two.  Preliminary processing to prepare the new RDF for my indexes takes less than 5 minutes for the entire RDF.  I can then regenerate not only the subject maps but also all other indexes for my collection in another 5 minutes; this is typically done daily.  The time required scales up somewhat with collection size, but earlier test indexes with Penn&#8217;s whole catalog (and a smaller set of authorities and heuristics, granted) took under an hour, IIRC.</p>
<p>* I do plan to bring in local authorities as well; I would have done it by now except for some character encoding issues.  I could see some possible issues with links to obsolete forms, but there should be ways to address this.  (Actually, this sort of analysis could also be tweaked to flag obsolete subject terms in one&#8217;s catalog, and possibly change them in bulk as well.)</p>
<p>* Yes, I&#8217;d be happy to share code, but see comments above about 20th-century Perl.  Not sure how useful it will be in general (and I&#8217;m not in a position to provide support for it), but I can show it to folks who are interested.</p>
<p>Hope this answers your questions; and again, if folks would be interested in learning more about various technical aspects, I&#8217;d be happy to consider doing a paper.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jonathan Rochkind</title>
		<link>http://everybodyslibraries.com/2010/05/06/making-discovery-smarter-with-open-data/#comment-2134</link>
		<dc:creator><![CDATA[Jonathan Rochkind]]></dc:creator>
		<pubDate>Fri, 07 May 2010 15:37:07 +0000</pubDate>
		<guid isPermaLink="false">http://everybodyslibraries.com/?p=1375#comment-2134</guid>
		<description><![CDATA[Nice. I&#039;m interested in learning more about the technical details of what you&#039;ve done. 

Is it powered by Solr, an RDBMS, both, or neither?

How do you determine the narrower term relationships (and other relationships): just from the actual authority file encodings, or also using your own heuristic rules? It sounds like you use some of your own heuristics too; I&#039;d be interested in learning more about them.

When you get new updates from LCSH, do you have to &quot;re-index&quot; or &quot;re-process&quot; your entire db, even for just minor updates?  

Did you consider trying to combine the LCSH authorities and your local authorities, to get a superset of both?

Are you interested in sharing your code, and what language is it written in?

And finally, would you like to write a Code4Lib Journal article about this? I think it&#039;d be a good one!]]></description>
		<content:encoded><![CDATA[<p>Nice. I&#8217;m interested in learning more about the technical details of what you&#8217;ve done. </p>
<p>Is it powered by Solr, an RDBMS, both, or neither?</p>
<p>How do you determine the narrower term relationships (and other relationships): just from the actual authority file encodings, or also using your own heuristic rules? It sounds like you use some of your own heuristics too; I&#8217;d be interested in learning more about them.</p>
<p>When you get new updates from LCSH, do you have to &#8220;re-index&#8221; or &#8220;re-process&#8221; your entire db, even for just minor updates?  </p>
<p>Did you consider trying to combine the LCSH authorities and your local authorities, to get a superset of both?</p>
<p>Are you interested in sharing your code, and what language is it written in?</p>
<p>And finally, would you like to write a Code4Lib Journal article about this? I think it&#8217;d be a good one!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ed Summers</title>
		<link>http://everybodyslibraries.com/2010/05/06/making-discovery-smarter-with-open-data/#comment-2132</link>
		<dc:creator><![CDATA[Ed Summers]]></dc:creator>
		<pubDate>Thu, 06 May 2010 14:40:05 +0000</pubDate>
		<guid isPermaLink="false">http://everybodyslibraries.com/?p=1375#comment-2132</guid>
		<description><![CDATA[John, thanks so much for blogging about this. It&#039;s great to see you actually using the data at id.loc.gov and reporting back on how the data can be improved ... and for highlighting how important it is to get the name authority data on id.loc.gov.

One thing that LC hasn&#039;t publicized a whole lot is that the loading history for the data can be found here:

  http://id.loc.gov/authorities/loads/

In addition, there is an Atom feed that makes all the create, update and delete activity available:

  http://id.loc.gov/authorities/feed/

The idea with the Atom feed is to allow you to keep your own local version of the records synchronized with id.loc.gov. You can follow the &quot;next&quot; URLs in the atom:link elements to walk backwards until you reach a change whose record update time you already know about. It&#039;s a somewhat experimental, more mainstream alternative to OAI-PMH, so I&#039;d be interested in any feedback you might have.]]></description>
		<content:encoded><![CDATA[<p>John, thanks so much for blogging about this. It&#8217;s great to see you actually using the data at id.loc.gov and reporting back on how the data can be improved &#8230; and for highlighting how important it is to get the name authority data on id.loc.gov.</p>
<p>One thing that LC hasn&#8217;t publicized a whole lot is that the loading history for the data can be found here:</p>
<p>  <a href="http://id.loc.gov/authorities/loads/" rel="nofollow">http://id.loc.gov/authorities/loads/</a></p>
<p>In addition, there is an Atom feed that makes all the create, update and delete activity available:</p>
<p>  <a href="http://id.loc.gov/authorities/feed/" rel="nofollow">http://id.loc.gov/authorities/feed/</a></p>
<p>The idea with the Atom feed is to allow you to keep your own local version of the records synchronized with id.loc.gov. You can follow the &#8220;next&#8221; URLs in the atom:link elements to walk backwards until you reach a change whose record update time you already know about. It&#8217;s a somewhat experimental, more mainstream alternative to OAI-PMH, so I&#8217;d be interested in any feedback you might have.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
