May 31, 2012

The Problem

I’m currently working on a content ingestion system for Gravity’s personalization and analytics systems. We already have a powerful and flexible article extraction library, which we open sourced as Goose, that does a great job of identifying just the article text and, optionally, its primary image. The problem is that the rest of an article’s meta data cannot be extracted from every site as easily as we can algorithmically find which text is the article itself and which image is the best candidate to represent it. Goose provides the mechanism to facilitate such extractions, but it depends on the DOM structure each publisher uses for its posts.

Enter RSS

Which brings me (and probably a lot of you) to RSS. Not only does RSS specify an article’s content, it also specifies a lot of the rich meta data we need. I thought I recalled that RSS provided author meta data as well as image meta data, but when I started exploring RSS as either a replacement for or an addition to Goose, I was a little surprised to find this wasn’t exactly the case. As described in the RSS Advisory Board’s Best Practices Profile, there is in fact an `author` element, but it is specifically intended for the author’s email address and nothing else (although you can append a parenthesized name), and there isn’t an `image` element within each item at all.
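To make that limitation concrete, here is a minimal Python sketch of what a consumer can pull out of the RSS `author` element: an email address and, at best, a parenthesized name. The sample item, the regular expression, and the `parse_rss_author` helper are all my own illustrative assumptions, not part of Goose or of any RSS library.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical item, made up for illustration only.
SAMPLE_ITEM = """<item>
  <title>Example post</title>
  <author>jane@example.com (Jane Doe)</author>
</item>"""

# An email-like token, optionally followed by "(Name)".
# Assumption: this loose pattern is good enough for a sketch; it does not
# validate the email address.
AUTHOR_RE = re.compile(r"^\s*(?P<email>[^\s()]+)\s*(?:\((?P<name>[^)]*)\))?\s*$")


def parse_rss_author(text):
    """Split an RSS author string into (email, name); either may be None."""
    match = AUTHOR_RE.match(text or "")
    if match is None:
        return None, None
    name = match.group("name")
    return match.group("email"), (name.strip() if name else None)


item = ET.fromstring(SAMPLE_ITEM)
email, name = parse_rss_author(item.findtext("author"))
```

Even in the best case this yields two strings, with nowhere to put a profile link or an avatar.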

RSS Specification version 2.NoMore

Since RSS has been frozen since March 30, 2009, all extensions to RSS are to be done as the RSS Advisory Board states: “Subsequent work should happen in modules, using namespaces, and in completely new syndication formats, with new names.” I agree that all of the basic elements are well defined and therefore no longer require periodic updates, but it is very disappointing to me that after more than 2 days of research, I have not found a single work that describes even as much author meta data as Atom 1.0 provides.

What I’m Looking For

To be clear, the elements I would like to see additionally defined for any rss->channel->item should be encapsulated within some named author element. Since rss->channel->item->author is already defined, for the purpose of clarity I will use a fictitious XML namespace (xmlns:profile="http://somedomain.tld/rss/2.0/modules/profile") for new elements:

<profile:name>Robbie Coleman</profile:name>
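If such a module existed, consuming it would be straightforward. The sketch below uses Python’s standard `xml.etree.ElementTree` to read the element through the fictitious `profile` namespace above. The sample feed and the `author_names` helper are hypothetical, and for brevity the element sits directly under the item rather than inside a wrapping author element.

```python
import xml.etree.ElementTree as ET

# Fictitious namespace URI from this post; no such module actually exists.
PROFILE_NS = "http://somedomain.tld/rss/2.0/modules/profile"

# Hypothetical feed, made up for illustration only.
SAMPLE_FEED = """<rss version="2.0"
     xmlns:profile="http://somedomain.tld/rss/2.0/modules/profile">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Example post</title>
      <profile:name>Robbie Coleman</profile:name>
    </item>
  </channel>
</rss>"""


def author_names(feed_xml):
    """Return the profile:name text of every item (None where it is absent)."""
    root = ET.fromstring(feed_xml)
    ns = {"profile": PROFILE_NS}
    return [
        item.findtext("profile:name", namespaces=ns)
        for item in root.iterfind("channel/item")
    ]


names = author_names(SAMPLE_FEED)
```

The point is that one shared namespace would let every consumer use the same few lines, instead of a special case per publisher.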

Update (6/1/2012):

A couple of people pointed out (correctly) in the comments that Atom’s author element provides [some of] what I need, but there is no place for an avatar URI within Atom 1.0, so I’m still short of the full solution.
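For reference, an Atom 1.0 author carries a name plus an optional uri and email, and nothing for an avatar. Here is a rough Python sketch of reading those three fields; the sample entry and the `atom_author` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# The Atom 1.0 namespace URI.
ATOM_NS = "http://www.w3.org/2005/Atom"

# Hypothetical entry, made up for illustration only.
SAMPLE_ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Example post</title>
  <author>
    <name>Jane Doe</name>
    <uri>http://example.com/jane</uri>
    <email>jane@example.com</email>
  </author>
</entry>"""


def atom_author(entry_xml):
    """Return the first author's name, uri and email as a dict, or None."""
    entry = ET.fromstring(entry_xml)
    ns = {"atom": ATOM_NS}
    author = entry.find("atom:author", ns)
    if author is None:
        return None
    # Fields missing from the feed come back as None.
    return {
        field: author.findtext(f"atom:{field}", namespaces=ns)
        for field in ("name", "uri", "email")
    }


author = atom_author(SAMPLE_ENTRY)
```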

Another point I would like to state better is that I’m looking for these additional fields to be present in other sites’ RSS so that I can consume them in a standard way. There are some sites that provide everything I’m looking for, but each of them has done it in its own way, which makes my implementation for consuming it rather janky. 😉

I now leave this up for discussion, which actually began on Twitter here: