May 31, 2012
 

The Problem

I’m currently working on a content ingestion system for Gravity’s personalization and analytics systems. We already have a powerful and flexible article extraction library in place, which we open sourced as Goose, and it does a great job at identifying just the article text and optionally even its primary image. The problem is that the rest of an article’s meta data cannot be extracted from every site as easily as we can algorithmically find which text is the article itself and which image is the best candidate to represent it. Goose provides the mechanism to facilitate such extractions, but it depends on the DOM structure each publisher uses for their posts.

Enter RSS

Which brings me (and probably a lot of you) to RSS. Not only does RSS specify an article’s content, it even specifies a lot of that rich meta data we need. I thought I recalled that RSS provided author meta data as well as image meta data, but when I got to work on exploring RSS as either a replacement for or an addition to Goose, I was a little surprised to find this wasn’t exactly the case. As described in the RSS Advisory Board’s Best Practices Profile, there is in fact an `author` element; however, it is specifically intended for the email address of the author and nothing else (although you can append a parenthesized name as well), and there isn’t an `image` element within each item at all.
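To make that limitation concrete, here is a minimal sketch of what a consumer can pull out of an RSS 2.0 `author` value: an email address and, at most, a parenthesized name. The function name `parse_rss_author` and the sample address are my own assumptions for illustration, and the pattern is deliberately loose rather than a full email validator.

```python
import re

# Loose pattern for the convention described above: an email address,
# optionally followed by a name in parentheses. This is NOT a complete
# email validator; it only separates the two parts.
AUTHOR_PATTERN = re.compile(
    r"^\s*(?P<email>[^\s()]+@[^\s()]+)\s*(?:\((?P<name>[^)]*)\))?\s*$"
)


def parse_rss_author(value):
    """Split an RSS 2.0 <author> value into (email, name).

    Returns (None, None) when the value does not start with an email
    address; name is None when no parenthesized name is present.
    """
    match = AUTHOR_PATTERN.match(value)
    if match is None:
        return None, None
    name = match.group("name")
    return match.group("email"), (name.strip() if name else None)


# Hypothetical sample values, not taken from any real feed.
with_name = parse_rss_author("robbie@example.com (Robbie Coleman)")
without_name = parse_rss_author("robbie@example.com")
```

Nothing in that value can carry a profile URI or an avatar, which is the gap the rest of this post is about.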

RSS Specification version 2.NoMore

RSS has been frozen since March 30, 2009, so all extensions to RSS must follow what the RSS Advisory Board states: “Subsequent work should happen in modules, using namespaces, and in completely new syndication formats, with new names.” I agree that all the basic elements are well defined and therefore no longer require periodic updates, but it is very disappointing to me that after over 2 days of research, I have not found a single work that describes even as much author meta data as Atom 1.0 provides.

What I’m Looking For

To be clear, the additional elements I would like to see defined for any rss->channel->item should be encapsulated within some named author element. Since rss->channel->item->author is already defined, for the purpose of clarity I will use a fictitious XML Namespace (xmlns:profile="http://somedomain.tld/rss/2.0/modules/profile") for new elements:

...
<profile:name>Robbie Coleman</profile:name>
<profile:uri>http://robbie.robnrob.com/author/robbie/</profile:uri>
<profile:avatar>http://1.gravatar.com/avatar/dc77b368ec1f077dcc4aca3b9c003d2d</profile:avatar>
...
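If such a module existed, consuming it would be straightforward. The following is only a sketch under stated assumptions: the namespace URI is the fictitious one from above, the sample feed and the `read_profile` helper are invented for illustration, and the lookup searches anywhere below the item so it does not depend on which wrapper element would enclose the fields.

```python
import xml.etree.ElementTree as ET

# Fictitious namespace URI from this post; it is not a published module.
PROFILE_NS = "http://somedomain.tld/rss/2.0/modules/profile"

# Hypothetical feed built around the elements shown above.
SAMPLE_FEED = """<rss version="2.0"
     xmlns:profile="http://somedomain.tld/rss/2.0/modules/profile">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <description>Sample channel for illustration</description>
    <item>
      <title>Example post</title>
      <profile:name>Robbie Coleman</profile:name>
      <profile:uri>http://robbie.robnrob.com/author/robbie/</profile:uri>
      <profile:avatar>http://1.gravatar.com/avatar/dc77b368ec1f077dcc4aca3b9c003d2d</profile:avatar>
    </item>
  </channel>
</rss>"""


def read_profile(item):
    """Collect the profile fields found anywhere below an <item>.

    A field that is missing or empty maps to None.
    """
    fields = {}
    for field in ("name", "uri", "avatar"):
        element = item.find(".//{%s}%s" % (PROFILE_NS, field))
        has_text = element is not None and element.text
        fields[field] = element.text.strip() if has_text else None
    return fields


root = ET.fromstring(SAMPLE_FEED)
profiles = [read_profile(item) for item in root.iter("item")]
```

The point of a shared namespace is that this one function would work against every publisher’s feed, instead of one parser per site.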

Update (6/1/2012):

A couple of people pointed out (correctly) in the comments that Atom’s author element provides [some of] what I need, but there is no place for an avatar uri within Atom 1.0, so I’m still short of the full solution.
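For reference, here is a rough sketch of reading an Atom author element embedded in an RSS 2.0 item. The Atom namespace URI is the real one; the sample feed and the `read_atom_author` helper are assumptions made for illustration. Note that only name, uri, and email can come back, so there is still nowhere to look for an avatar.

```python
import xml.etree.ElementTree as ET

# The Atom 1.0 namespace, mapped to the conventional "atom" prefix.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical RSS 2.0 feed that borrows the Atom author element.
SAMPLE_ATOM_FEED = """<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <description>Sample channel for illustration</description>
    <item>
      <title>Example post</title>
      <atom:author>
        <atom:name>Robbie Coleman</atom:name>
        <atom:uri>http://robbie.robnrob.com/author/robbie/</atom:uri>
      </atom:author>
    </item>
  </channel>
</rss>"""


def read_atom_author(item):
    """Return the name, uri, and email of an item's atom:author, or None.

    Fields that are absent from the feed map to None.
    """
    author = item.find("atom:author", ATOM_NS)
    if author is None:
        return None
    return {
        field: author.findtext("atom:" + field, default=None, namespaces=ATOM_NS)
        for field in ("name", "uri", "email")
    }


atom_root = ET.fromstring(SAMPLE_ATOM_FEED)
atom_author = read_atom_author(atom_root.find("channel/item"))
```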

Another point I would like to state better is that I’m looking for these additional fields to be present within other sites’ RSS so that I may consume it in a standard way. There are some sites that provide everything I’m looking for, but each of them has done it in its own way, which makes my implementation for consuming it rather janky. 😉

I now leave this up for discussion, which actually began on twitter here:

  • danmactough

    If you like the Atom 1.0 author element, you can feel free to use it in your RSS 2.0 feed. Just declare the namespace xmlns:atom="http://www.w3.org/2005/Atom" in your root element (if you haven’t already done so), and wherever you want to plunk your author field, instead use the <atom:author> element. No need to reinvent the wheel if you don’t need to.

    • Not reinventing the wheel is the reason why I am looking for a standard to support *all* of the meta data I need. <atom:author> only provides name, uri, and email, but I need an avatar uri as well.

      • danmactough

         There is nothing preventing you from using <atom:uri> for the gravatar URI.

        • Sure there is. Reason #1 is that is not the author’s URI. Reason #2 is if I did that, where would I store the author’s URI?

          Not trying to be picky, but I am trying to use elements for what they are intended.

          • danmactough

            The Atom spec tells you that you can use the uri for anything you want. But if you insist that it doesn’t work for you, by all means make something that does.

          • Right, it does allow for any URI, but my other issue was that then I would not have a place for what I describe here as the author’s profile uri

            Thanks again for all of the feedback and suggestions.

          • danmactough

            You also can have an arbitrary number of elements — one of those could be the profile uri, although it’s not a semantically ideal solution. But then again, if you’re trying to map your specific ontology into a pre-existing template, trade-offs are inevitable.

  • If you are happy with what Atom provides in the atom:author element, you can use it in RSS feeds.

    It is already a well-adopted best practice in RSS to use atom:link to identify the feed’s URL, a capability that RSS lacks:

    http://www.rssboard.org/rss-profile#namespace-elements-atom-link 

    • Thanks, but as I not-at-all-clearly pointed out in my post, Atom’s author element only provides the person fields: name, uri, email. There is still no avatar field.

  • virtualCableTV

    Robbie, you are expecting RSS and Atom to support robust taxonomies but where might we ask ourselves and how might we ask ourselves to draw a line which once crossed as it must still allow us to achieve the objectives of the diverse description of reality and all of its subsequent iterative nuances?

    That is what namespaces and extensions are used for. Having been “into” RSS day one I happen to agree namespaces remain the best way to respond to extensibility. For example in 1999 nobody had even heard of the term avatar. So how do we respond to not knowing what we do not know? How do we extend the unknowable? Well as I’m sure you have read I remind you and others that the answer is spelled out here…

    “RSS 2.0 adds that capability, following a simple rule. A RSS feed may contain elements and attributes not described on this page, only if those elements and attributes are defined in a namespace.”
    http://www.rssboard.org/rss-specification#extendingRss 

    Now the bias for Atom which in this instance remains indicative of the dishonesty perpetuated by those who day one began to insinuate Atom is somehow better than RSS and it becomes quite obvious how rancid and stinky such inferences are when it is a fact the Atom <author> element does not support “avatars” either. This dishonesty has been the bane of syndicated content since day one.

    I have extended RSS having used namespaces myself. I have not marketed either of the two modules as of yet and therein is the challenge. We are all able to extend RSS as wanted but to do so we have to then market and encourage other developers to support and use the extended modules if we wish our extensions to go viral so to speak. Yahoo Media RSS was quite successful in that regard albeit the Yahoo! brand helped make it possible but that is not to say small fry like most of us do not have any opportunity to succeed.

    The problem then becomes file bloat as each XML file can and would become inundated with who knows how many declared namespaces? We have the means to avoid that dilemma with “socialized” namespaces but that requires collaborative agreement and consensus bringing us “back to the future” where we started day one when the likes of those biased for Atom started undermining RSS when we would all have been much better off working together to learn to use RSS in instances that could one day allow us to respond to what would not be known and needed until some time in the future because if you are an honest man of integrity you are compelled to admit Atom is just as FUBAR as RSS in its inability to make it possible to respond to know what we do not know so we’re left with the same dilemma regardless.

    And think of it, Atom is what it is and there is nothing inherently wrong with Atom or any reason to dislike it at face value however if we had learned to work together back in the day and avoid allowing the Atom poo-poo people who dissed RSS at every opportunity instead of rolling up the sleeves and doing what we could have done with RSS the p.o.s. Mark Zuckerberg and Facebook would never have emerged as RSS would have continued to become as pervasive as HTML and we would not likely be experiencing what has become a persistent dilemma.

    • Although I do not share the same anger towards those that spawned an alternative to RSS, I do agree with the problems we now face (and have faced before). Rather than me continuing the conversation about whodunnit though, I’d love to get down to the lets-solve-it. 😉

      Could you provide me with either a link to information describing the namespace(s) extensions you have done or a link to a feed containing those related to author? Us small fries can do a lot when we work together, and what I’m working on has the potential to be incorporated by some of the biggest content producers on the internet now.

      Thank you for taking the time to clearly state your position here.

      • virtualCableTV

        Robbie, you would have had to have been involved in the Yahoo group discussions back in the day to really appreciate the angst. Furthermore, you would have to care deeply about the backstabbing and dirty tricks that went on and continue to this very day as they pertain to undermining the use of RSS as it’s not about an alternative to RSS at all, it’s about backstabbing and dirty deeds exactly as put.

        I already lost a career as an architect foolishly choosing to avoid rocking the boat and simply “moving on” as it is advocated but that was when I was still younger and naive and did not realize it was and is those doing the dirty deeds themselves that advocate how to respond. Hello?

        Granted, we need not remain overtly concerned with the past but to forget the past and “those who done it” is to invite them to return to pollute what may be achieved in the present for I assure you getting back to the process of “solving it” will draw their attention and their presence as sure as sh!t draws flies if I may put it so bluntly as it has already happened in this page as anybody can read for themselves noticing how quickly Roger Cadenhead showed up here and how some are attempting to shove five pounds of Atom into a one pound bag regardless of what falls out of the bottom.

        I am telling you too many of the Atom advocates simply do not want to “solve” anything Robbie, they want to undermine RSS and if possible destroy RSS. Hopefully reading the following article will put the fundamentals that may still be observed into context once and for all:

        Web Feed Validation Service Developed by Insidious Vandal(s) Sam Ruby et al. — http://bit.ly/IhS9bm

        If they wanted to work collaboratively as colleagues they would not have been nor would they remain insidious vandals for years on end. What is stated in that article is not embellished and one must ask how simple is it for professional software developers as they are to use color or do they indeed intend to purposefully use color and such to hinder, obstruct and undermine RSS?

        It is not a stretch to understand Joe The Manager submits the XML file developed by his employee or contractor Charlie The Programmer and then ignorantly and wrongly concludes Charlie’s work is FUBAR –and worse– Joe will say nothing but Charlie will never work again all because some insidious vandals presumed to speak authoritatively and marked up Charlie’s work to appear as if it were in error and therefore of questionable worth.

        Furthermore, what of polluting the use of RSS by marking it up to appear as if it were in error when it does not include the use of the Atom <link> element? When that Atom element is used validation will no longer allow the use of elements and attributes in the <channel> element itself forcing a one or the other choice that cripples and disallows the declarative use of elements and attributes in the channel used to link to the channel’s origin. Only a fool or somebody who has not yet acquired the experience to know believes undermining RSS in this way was simply accidental.

        I say again then, you would have had to have been involved in the Yahoo group discussions back in the day to really appreciate the angst as what occurred still stares us all in the face.

        So you see, in my book and in accordance with everything I have been taught throughout my entire life one does not simply ignore the fact that there are cockroaches in the pantry, as once seen and observed what the cockroaches leave behind makes it unpalatable to sit down at the dining room table to enjoy a good meal without knowing somehow they may be eating cockroach droppings that were left in the ingredients used to create their meal.

        Hence the first thing that must be done to –git’ er’ done– is to face the facts of what has occurred, make it and keep it a recognized matter of public record and then move forward to make a clean start breaking with the past (and if necessary the vandals themselves as they are no longer needed) so all who may be concerned will know the kitchen has been and will be kept swept clean and the floors must be and will be mopped regularly or the cockroaches will get into the pantry and sh!t all over everything all over again. And I don’t know if you understand yet Robbie but our customers and the Manager Joes have an even longer memory than I. I didn’t get it until I passed 40 and I paid dearly for my liberal youth.

        In large part this I contend is what caused RSS to begin to fall out of favor as your every day good ol’ boys writing code had to sell their choice of using RSS to customers and managers who had to answer to their customers or their executives and I ask who would risk it all when they saw cockroaches in the pantry?

        I hope you and everybody can understand this metaphor as learning from it will I contend determine any chance of success moving forward. There is always room to disagree and argue merits of this or that but insidious vandalism must not be tolerated and the validator(s) must be swept clean of all remnants of what the cockroaches have left behind as it is the validator(s) that stand in between our work and the mind’s eye of the customer. Perhaps some of the younger and most talented developers will agree and write new validators as the old ones developed by Sam Ruby et al. are in fact soiled with cockroach droppings.

        As for actually providing links to my work I am not ready for that at the moment but I do have some ideas how to begin anew.

        • I’m in no way trying to belittle or diminish the issues you’re discussing here. You have obviously been deeply affected by how it all went down and I’m empathetic to your pain. Please do understand though, that it is not the kind of discussion I would like to have here on this post.

          Thank you for your cooperation.