Kicking the tires with OpenCalais


[I deleted this post by mistake – this is a re-post]

OpenCalais is software that takes a piece of textual content in plain text or HTML format, extracts entities from it, and generates an RDF graph of them in XML.

OpenCalais was brought to my attention by a post by Bob Jonkam on the microformats discuss list. I was intrigued and decided to take a look.

After the OpenCalais team helped me quickly resolve an HTTP 403 issue, I decided to give it a try with a text from the relatively recent, heated “calendar (and other) items aren’t always tidy” discussion on uf-discuss, which I will refer to as the “football example”:

“Bobby and Billy are on the same football team and on Sunday they’re playing against the Falcons, whose coach is Ron Smith. Ron Smith is Bobby and Billy’s father. The brothers are also the star quarterback and star fullback at Pittsfield High.”

The result in RDF/XML can be found here. Below, in plain English, are the significant things that OpenCalais successfully identified:

  • Billy is a Person
  • Billy is mentioned at 2 places in the text (offsets 92 and 229)
  • Ron Smith is a Person
  • Ron Smith is mentioned at 2 places in the text (offsets 194 and 206)
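To make the idea of mention offsets concrete, here is a minimal Python sketch that reports the character position of each occurrence of a name in a string. It is only an illustration: the function name mention_offsets and the sample sentence are my own, this is not how OpenCalais works internally, and the offsets it prints will not match the ones listed above, which presumably refer to positions in the document as it was submitted.

```python
import re


def mention_offsets(text, name):
    """Return the character offset of every whole-word occurrence of name in text.

    Hypothetical helper for illustration only; OpenCalais does its own
    entity detection and reports offsets in its RDF output.
    """
    # \b keeps "Billy" from matching inside a longer word.
    pattern = r"\b" + re.escape(name) + r"\b"
    return [match.start() for match in re.finditer(pattern, text)]


# Made-up sample sentence, shorter than the football example.
sample = "Ron Smith coaches the Falcons. Ron Smith is Billy's father."

print(mention_offsets(sample, "Ron Smith"))
print(mention_offsets(sample, "Billy"))
```

A plain string search like this only finds names you already know to look for. The harder part, which OpenCalais takes care of, is deciding that a span of text is a Person in the first place.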

Interestingly, Bobby was not identified as a Person, and there are clearly many other entities and relationships that were not identified.

Admittedly, the football example is probably an edge case for OpenCalais. With Reuters as a sponsor, this technology seems much more geared towards business news analysis. This is even more obvious when you look at their semantic metadata, where you find things like Person, PersonProfessional, PersonPolitical, Bankruptcy, Alliance, Acquisition, etc. But the roadmap mentions that later this year, third-party developers will be able to extend the extraction capabilities of OpenCalais.

This is an interesting open initiative nonetheless, and a smart move by Reuters towards becoming a platform. Its limitations with an edge case also show me that the semantic Web, just like the software market, probably won’t be “owned” by one player but shared among a multitude of players of varying breadth and depth, because there are as many representations of the world as there are cultures, communities and people, and they evolve all the time.

[Note: as a Frenchman, when I saw the name of this service, I immediately thought of Calais, the town in northern France where Le Tunnel sous La Manche (the “Chunnel”) starts/ends. I still can’t figure out why they picked that name and can’t wait to know!]