May 31, 2012
 

The Problem

I’m currently working on a content ingestion system for Gravity‘s personalization and analytics systems. We already have in place a powerful and flexible article extraction library that we open sourced as Goose, which does a great job of identifying just the article text and, optionally, even its primary image. The problem is that an article’s meta data cannot be extracted from every site as easily as we can algorithmically find which text is the article itself and which image is the best candidate to represent it. Goose provides the mechanism to facilitate such extractions, but it depends on the DOM structure each publisher uses for its posts.

Enter RSS

Which brings me (and probably a lot of you) to RSS. Not only does RSS specify an article’s content, it also specifies a lot of the rich meta data we need. I thought I recalled that RSS provided author meta data as well as image meta data, but when I got to work exploring RSS as either a replacement for or an addition to Goose, I was a little surprised to find this wasn’t exactly the case. As described in the RSS Advisory Board’s Best Practices Profile, there is in fact an `author` element; however, it is specifically intended for the author’s email address and nothing else (although you can append a parenthesized name as well), and there isn’t an `image` element within each item at all.

RSS Specification version 2.NoMore

RSS has been frozen since March 30, 2009, so all extensions are to happen outside the core specification. As the RSS Advisory Board states: “Subsequent work should happen in modules, using namespaces, and in completely new syndication formats, with new names.” I agree that all of the basic elements are well defined and therefore no longer require periodic updates, but it is very disappointing to me that after more than two days of research, I have not found a single module that describes even as much author meta data as Atom 1.0 provides.

What I’m Looking For

To be clear, the additional elements I am looking to have defined for any rss->channel->item should be encapsulated within some named author element. Since rss->channel->item->author is already defined, for the purpose of clarity I will use a fictitious XML namespace (xmlns:profile="http://somedomain.tld/rss/2.0/modules/profile") for the new elements:

...
<profile:name>Robbie Coleman</profile:name>
<profile:uri>http://robbie.robnrob.com/author/robbie/</profile:uri>
<profile:avatar>http://1.gravatar.com/avatar/dc77b368ec1f077dcc4aca3b9c003d2d</profile:avatar>
...
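In context, an item carrying these elements might look like the following sketch. The profile namespace, its URL, and the element names are all my own invention rather than an existing module, and the email address in the standard author element is just a placeholder:

```xml
<rss version="2.0"
     xmlns:profile="http://somedomain.tld/rss/2.0/modules/profile">
  <channel>
    <item>
      <title>Migrating Community Server to WordPress</title>
      <link>http://robbie.robnrob.com/2011/05/migrating-community-server-to-wordpress</link>
      <!-- RSS 2.0 author: email address, optionally with a parenthesized name -->
      <author>robbie@example.com (Robbie Coleman)</author>
      <!-- the proposed module elements -->
      <profile:name>Robbie Coleman</profile:name>
      <profile:uri>http://robbie.robnrob.com/author/robbie/</profile:uri>
      <profile:avatar>http://1.gravatar.com/avatar/dc77b368ec1f077dcc4aca3b9c003d2d</profile:avatar>
    </item>
  </channel>
</rss>
```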

Update (6/1/2012):

A couple of people pointed out (correctly) in the comments that Atom’s author element provides [some of] what I need, but there is no place for an avatar URI within Atom 1.0, so I’m still short of the full solution.

Another point I would like to state better is that I’m looking for these additional fields to be present within other sites’ RSS so that I may consume them in a standard way. There are some sites that provide everything I’m looking for, but each of them has done it in its own way, which makes my implementation for consuming it rather janky. 😉

I now leave this up for discussion, which actually began over on Twitter.

Goose Wins 2nd Place in Text Extraction

Jun 11, 2011
 

In a recent comparison of different text extraction algorithms, Gravity’s open source project, Goose, tied for second place and was even written up over at Read Write Web! I find this very exciting because our project is still quite young and actively in development, whereas the algorithms in close standing are mostly well established and semi-finalized. Another interesting point is that most of the competition was built by teams of researchers, you know… doctors in their fields!

The graph below from Tomaž Kovačič‘s study shows only a small amount of the data he collected in his analysis. If you are curious about how he compared these algorithms, I highly recommend heading over to his post. He does a great job exposing the details behind his analysis.

Goose's standing among the other algorithms tested

So what is Goose used for at Gravity and why have we open sourced it?

Goose’s wiki provides a very detailed explanation of what Goose is and how it works, and also touches on the original need at Gravity behind its creation. Jim Plush wrote the first version from the ground up on his own and only recently gave me commit access to the repository. By the time I got into the project, it had all the bells and whistles required to compete in the analysis Kovačič completed. My contributions to Goose have been to extend it to allow for more specific extractions of additional meta data outside of the primary content, and they have no effect on its standing above.

Such a utility can be applied to a wide variety of web content analysis problems, and I’m really glad Plush decided to share it with the rest of the open source community. At Gravity, we have been building a lot of exciting (to me, at least) technology, and most of it is held dearly by us and must remain a company secret, since it makes up a large part of our company’s overall value. When it comes to analyzing content out here on the web, Goose can be looked at as our trusty messenger, delivering our system plenty of content to analyze without a lot of the noise that comes along with it on the pages the content is sourced from.

If you are looking to mine some of the golden nuggets of information that are buried under a ton of ads, peripheral links, site menu structures, and other distracting noise, then why not take a look at what Goose has to offer? If you find anything you think Goose may be lacking, or have ideas on anything else that could be improved, let us know on our Github repository: https://github.com/jiminoc/goose

May 31, 2011
 
the plight of my klout score

I tend to focus less on my social networks when I have my head down coding. This time, however, its effect on my klout score is pretty dramatic. LOL

I guess this is just a consequence of the typical life:work balance, but I wonder how other tech professionals maintain such a high score while also paving new roads in their field.

A couple of examples of the type of tech peeps I’m talking about:

  1. Jeff Atwood – 73: klout.com/codinghorror
  2. Matt Cutts – 73: klout.com/mattcutts

Anyway, I’m not complaining here people. I am just confused about how others manage to keep up their tweeting/blogging while “deep in the cut” of some tech project.

Do any of you have any tips? Please speak up here and let me know.

May 27, 2011
 
1,317 of 4,947 songs added... OH HELL YEAH!

I just gained access to Google’s Music Beta and, for the first time, I think my personal music library may be smaller than the amount of cloud storage available for free!

It is truly amazing just how far we have come from the early days of cramming mp3’s into a JPEG image to store on a free image hosting site back in the 90’s. Napster not only broadened what was possible for music online, but also inadvertently set us all back a decade of fighting to truly OWN the music we legally purchase from the big music labels.

Yes, I know there are a lot of you that have second thoughts about giving so much to the Google Collective, and I have no beef with you and your own convictions.

I for one welcome our new online music overlords!

Migrating Community Server to WordPress

May 26, 2011
 

This was not an easy task! For one thing, my Community Server (CS) site was not functional, so using the RSS / MetaWeblog endpoints was not an option for me. Secondly, I no longer have a Windows development machine, and since CS is built entirely on Microsoft technologies, I needed to fire up a virtual instance of Windows in order to extract any of the data. If my previous hosting service had been able to keep my database online for longer than minutes at a time, I could have run things remotely, but… that was not the case.

The actual SQL code for extracting all of my blog posts looks surprisingly simple:
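It was essentially a single SELECT over the CS posts table. A sketch of it (the cs_Posts table name is an assumption, and the column names are inferred from the WordPress import query later in this post):

```sql
SELECT p.PostDate,
       p.FormattedBody,
       p.Subject,
       dbo.make_slug(p.Subject)           AS slug,
       dbo.old_url(p.PostDate, p.Subject) AS old_url
FROM cs_Posts AS p
```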

But if you look closely at it, you’ll see that there are two scalar functions in there: ‘dbo.old_url‘ & ‘dbo.make_slug‘. I was surprised to not find any slugs in the CS DB tables. I assume all of that logic is handled by the compiled ASP.NET application itself, because there was nothing in the tables, stored procedures, or even functions that did anything related to calculating/parsing URL slugs from post titles. To make matters worse, since my site was not in a running state (due to hosting shenanigans), I had basically just my memory, along with the 404 logs on the new WordPress site, to help me reverse engineer the rules for converting titles to slugs. The rules are best represented in my ‘dbo.make_slug’ snippet below:

CREATE FUNCTION [dbo].[make_slug]
(
	@post_title nvarchar(256)
)
RETURNS nvarchar(500)
AS
BEGIN
	-- Declare the return variable here
	DECLARE @slug nvarchar(500)
	DECLARE @clean_title nvarchar(500)
	
	
	SET @clean_title = LOWER(dbo.deDupeSpaces(dbo.removePunctuation(@post_title)))
	SET @slug = REPLACE(@clean_title, ' ', '-')
	
	RETURN @slug

END

And that is used by ‘dbo.old_url‘ here:

CREATE FUNCTION [dbo].[old_url]
(
	@post_date datetime,
	@post_title nvarchar(256)
)
RETURNS nvarchar(500)
AS
BEGIN
	-- Declare the return variable here
	DECLARE @url nvarchar(500)
	DECLARE @y_m_d nvarchar(10)
	DECLARE @clean_title nvarchar(500)
	
	SET @y_m_d = CONVERT(nvarchar, @post_date, 111)
	SET @url = '/archive/' + @y_m_d + '/' + dbo.make_slug(@post_title) + '.aspx'
	
	RETURN @url

END

There are still two more functions remaining (if you have been paying attention) that are used by ‘dbo.make_slug‘, and that is where the real fun comes in. The first of these is the simpler ‘dbo.deDupeSpaces‘, which cuts all repeating space characters down to a single space:

CREATE FUNCTION [dbo].[deDupeSpaces] 
(
	@input nvarchar(500)
)
RETURNS nvarchar(500)
AS
BEGIN
    /**
    *  Based on Nigel Rivett's SQL script found: 
    *    http://www.nigelrivett.net/SQLTsql/RemoveNonNumericCharacters.html 
    */
	DECLARE @i int

	set @i = patindex('%[ ][ ]%', @input)
	while @i > 0
	begin
		set @input = replace(@input, '  ', ' ')
		set @i = patindex('%[ ][ ]%', @input)
	end

	RETURN @input

END

And then there is the more impressive ‘dbo.removePunctuation‘, which is pretty much identical to the script originally written by Nigel Rivett:

CREATE FUNCTION [dbo].[removePunctuation] 
(
	@input nvarchar(500)
)
RETURNS nvarchar(500)
AS
BEGIN
	/**
	 *  Based on Nigel Rivett's SQL script found: 
	 *    http://www.nigelrivett.net/SQLTsql/RemoveNonNumericCharacters.html 
	 */
	DECLARE @i int

	set @i = patindex('%[^a-zA-Z0-9 ]%', @input)
	while @i > 0
	begin
		set @input = replace(@input, substring(@input, @i, 1), '')
		set @i = patindex('%[^a-zA-Z0-9 ]%', @input)
	end

	-- Return the result of the function
	RETURN @input

END

So all of this, so far, is just to get my posts out of the CS DB in a format close enough to what I’ll need to stuff into my WordPress DB. To continue, I just ran the simple query (snippet at the top) and exported the results to an XML file. Now I could finally shut down the virtual instance of Windows 7 that was eating up my MacBook’s resources and burning my lap from the CPU pegging. 😉

The rest is pretty straightforward. I was unable to find any WordPress plugins to assist me in this completely custom hackery, so I thought a brute force insert directly into my WordPress MySQL DB was a great idea. I first imported the XML file into a new table that I called cs_posts. This table’s structure is identical to that of the original query used to export it. Once this was done, I built a basic INSERT INTO … SELECT query to import these CS posts directly into my WordPress posts table:

INSERT INTO wp_xxxxx_posts 
	(post_author, 
	post_date, 
	post_date_gmt, 
	post_content, 
	post_title, 
	post_status, 
	post_name, 
	post_modified, 
	post_modified_gmt, 
	guid) 
SELECT 2 AS post_author, 
	cs_posts.PostDate AS post_date, 
	cs_posts.PostDate AS post_date_gmt, 
	cs_posts.FormattedBody AS post_content, 
	cs_posts.Subject AS post_title, 
	'draft' AS post_status, 
	cs_posts.slug AS post_name, 
	cs_posts.PostDate AS post_modified, 
	cs_posts.PostDate AS post_modified_gmt, 
	cs_posts.old_url AS guid
FROM cs_posts

From this point, all that was required was for me to correct any permalinks that did not match the slug I had calculated. But I also wanted 301 redirects in place so that all incoming requests looking for /archive/YYYY/MM/DD/some-post-title-slug.aspx would find their way to the new URL /YYYY/MM/some-post-title-slug. This was much easier than I anticipated thanks to the luxury of John Godley‘s Redirection plugin. This gem of a plugin made my introduction to the WordPress ecosystem a dream come true. In fact, after I set it up on both this site and my root robnrob.com site, I was able to populate the redirection table his plugin uses, skipping the need to enter each post’s redirection individually. The plugin also has an option for regex-ish pattern matching, though a lot of the permalinks I ended up with on WordPress would not directly transpose from the basic:

url pattern: /archive/(\d+)/(\d+)/(\d+)/([a-zA-Z0-9_-]+).aspx
redirect to: http://robbie.robnrob.com/$1/$2/$4
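The mapping I needed the plugin to perform is roughly this (a Python sketch of the pattern above, mine for illustration; note the day capture group is deliberately dropped, since my WordPress permalinks only use year and month):

```python
import re

# Old Community Server permalink pattern; $1/$2/$4 in the plugin config
# correspond to groups 1 (year), 2 (month), and 4 (slug) below.
OLD_URL = re.compile(r"^/archive/(\d+)/(\d+)/(\d+)/([a-zA-Z0-9_-]+)\.aspx$")

def redirect(path):
    """Return the new WordPress URL for an old CS permalink, or None."""
    m = OLD_URL.match(path)
    if m is None:
        return None
    year, month, _day, slug = m.groups()
    return "http://robbie.robnrob.com/{}/{}/{}".format(year, month, slug)
```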

In the end, I lost out on previous comments, categories, and tags, but what I gained was a much more reliable hosting environment and a much more enjoyable platform to hack on. Also, to be honest, I had only a handful of comments anyway. 😉

The Most Exciting Email In My Nerd History

May 19, 2011
 
 

This is the first time the creator of my programming language of choice has sent me an email.

Scala 2.9 was just released this week, and the development team at Gravity is working to migrate our code base onto it. Just after our first attempt to run our unit tests, we hit a bug that we could not code around, so we took to the forums for answers. I found an already-reported bug that matched our case and jumped on the ticket to receive updates. Later that day I saw a comment on it from Martin Odersky (the original author of Scala) himself. That was exciting enough for me, but the email…. WOW.

…okay, I can go back to my day now.

Top Google Result in One Day!

May 18, 2011
 

 

 

Google Top Search Result for Robbie Coleman

Amazingly, just one day after installing/configuring this new WordPress blog, a search on Google for my name, Robbie Coleman, returns this site as the very first result!

xPollinate did not do so well…

Jun 9, 2009
 

In my last post I found that the URL used as my blog’s permalink was posted without the hostname (http://blogs/robbie/archive/2009/06/09/trying-out-wlw-xpollinate.aspx). Hmmm… Well, I never base things on a single attempt, so… here goes my second try.

Ping.fm + twitterfeed = sweet blog love

Feb 4, 2009
 

I have been trying to find and/or create a solution for automatically posting links to new blog posts I write as well as cross-posting an excerpt to my other blogs that link back to my primary source blog.

Meet my two new friends: Ping.fm and twitterfeed. Well, Ping.fm has been a close friend for some time now, but he is quite the social butterfly and tends to find some pretty cool friends of his own. The interface for twitterfeed is a wee bit clunky, but it is also exactly what I need. I am not one to complain when interfaces are not polished, as my own tend to be more than rough around the edges.

The two most powerful features offered by twitterfeed are its tight integration with Ping.fm and its support for bit.ly link shortening/tracking. Another nicety is that you can create many feeds (the term twitterfeed uses to describe the linked configuration of an RSS feed to one of their supported endpoints). With all of these combined, I was able to create two feeds that do everything I wanted for each new post I make to my personal blog here. The first pulls new posts detected in the RSS from my FeedBurner feed of this blog and posts a “status” update to Ping.fm with the text “New Post: “, the title of the RSS feed item (the blog post title), and a bit.ly shortened link to my post. The second consumes the same new post from the same RSS and posts a “blog” post to Ping.fm. To do this, I simply repeated what I did for the first feed but changed the Ping.fm method from “status” to “blog” and changed what to include from “title only” to “title & description.”

Since my FeedBurner RSS also splices in my Flickr photo posts (which happen to mostly be from Ping.fm MMS uploads), those flow through twitterfeed as well.

Well… this should be the first actual blog post that goes through twitterfeed in the two ways I just described above.