Google loves blogs. Blogs love Google. But is there trouble in paradise? When items slip off the front page of most blogs, there is, anecdotally, a two- to three-week delay before the archived items are reindexed. As Dylan Tweney points out, this is an artifact of the fact that Google’s basic unit of indexing is the web page URL, whereas blogs are more fine-grained: the basic unit is the post, and there are usually multiple posts on a single page.
Permalinks arose to address this same issue, allowing post-level targeting of links to web posts. This is generally implemented with named anchors within pages, although it’s also possible to assign each entry its own page in the archives, even if several entries are aggregated at any one time on the blog’s index page.
Dylan has a suggestion, though, to help the Googlesphere catch up with the blogosphere:
As it turns out, we do have a couple of data formats that understand the difference between a post and a page, include useful summary data, and even include handy pointers back to the exact archive location of a post. They’re called RSS and RDF.
These syndication formats are used to aggregate news, but they could be useful indexing tools too. What if Google (or Daypop, once they can afford to buy a few new hard drives) collected RSS and RDF feeds — and then archived them in a searchable index?
Instead of news stories scrolling off into oblivion when they get to the bottom of a feed, they’d enter a permanent index where they could be used for information retrieval later.
It seems that the same approach would work when indexing an intranet or enterprise portal. Maybe part of the solution for turning k-logs into a true knowledge-sharing system is to make sure the search implementation indexes the RSS feeds from those k-logs, making knowledge retrieval possible without discontinuities.
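To make the idea concrete, here is a rough sketch in Python of what archiving feed items into a searchable, post-level index might look like. It is only an illustration under some assumptions: the sample feed, the example.com URLs, and the `parse_items` and `PostArchive` names are all made up for this post, and it only handles plain RSS 2.0-style items without XML namespaces (an RDF-based RSS 1.0 feed would need namespace handling on top of this).

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical feed for illustration; the URLs and post text are invented.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Weblog</title>
    <link>http://example.com/</link>
    <item>
      <title>Google and blogs</title>
      <link>http://example.com/archives/2002/10.html#a42</link>
      <description>Why archived posts take weeks to be reindexed.</description>
    </item>
    <item>
      <title>Permalinks</title>
      <link>http://example.com/archives/2002/10.html#a43</link>
      <description>Named anchors give each post its own address.</description>
    </item>
  </channel>
</rss>"""


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def parse_items(feed_xml):
    """Return one dict per <item>, with title, link (the permalink) and description."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
        })
    return items


class PostArchive:
    """Keeps every post ever seen in a feed, keyed by permalink."""

    def __init__(self):
        self.posts = {}   # permalink -> post dict
        self.index = {}   # word -> set of permalinks containing that word

    def add_feed(self, feed_xml):
        """Archive any items not seen before; older items are never dropped."""
        for post in parse_items(feed_xml):
            link = post["link"]
            if not link or link in self.posts:
                continue
            self.posts[link] = post
            for word in tokenize(post["title"] + " " + post["description"]):
                self.index.setdefault(word, set()).add(link)

    def search(self, query):
        """Return the permalinks of posts containing every word in the query."""
        words = tokenize(query)
        if not words:
            return []
        hits = set.intersection(*(self.index.get(w, set()) for w in words))
        return sorted(hits)


archive = PostArchive()
archive.add_feed(SAMPLE_FEED)
for permalink in archive.search("archived posts"):
    print(permalink, "-", archive.posts[permalink]["title"])
```

Because the archive is keyed by permalink, polling the same feed again adds only the new items, and a search result points at the individual post rather than at whatever page it happened to be sitting on when it was crawled.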