Kicking the tires with OpenCalais

OpenCalais logo

[I deleted this post by mistake – this is a re-post]

OpenCalais is software that takes in textual content in plain text or HTML format, extracts entities from it and generates an RDF graph of them in XML.

OpenCalais was brought to my attention by a post by Bob Jonkam on the microformats discuss list. I was intrigued and decided to take a look.

After the OpenCalais team helped me quickly resolve the HTTP 403 issue, I decided to give it a try with a text from the relatively recent heated “calendar (and other) items aren’t always tidy” discussion on uf-discuss, which I will refer to as the “football example”:

“Bobby and Billy are on the same football team and on Sunday they’re playing against the Falcons, whose coach is Ron Smith. Ron Smith is Bobby and Billy’s father. The brothers are also the star quarterback and star fullback at Pittsfield High.”
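For reference, calling the extractor boils down to an HTTP POST of the text, with RDF/XML coming back in the response body. Here is a minimal Python sketch; the endpoint URL and form-field names are assumptions based on my reading of the REST documentation, so double-check them against the current docs, and you will need your own API key:

import urllib.parse
import urllib.request

CALAIS_ENDPOINT = "http://api.opencalais.com/enlighten/rest/"  # assumed endpoint, check the docs

def call_opencalais(text, api_key):
    """POST the text to the extractor and return the RDF/XML response body."""
    form = urllib.parse.urlencode({
        "licenseID": api_key,  # form-field names assumed from the REST docs
        "content": text,
        "paramsXML": "",       # optional processing directives go here
    }).encode("utf-8")
    with urllib.request.urlopen(CALAIS_ENDPOINT, data=form) as response:
        return response.read().decode("utf-8")

football_example = (
    "Bobby and Billy are on the same football team and on Sunday they're "
    "playing against the Falcons, whose coach is Ron Smith. Ron Smith is "
    "Bobby and Billy's father. The brothers are also the star quarterback "
    "and star fullback at Pittsfield High."
)
print(call_opencalais(football_example, api_key="YOUR-API-KEY"))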

The result in RDF XML can be found here. Below are, in plain English, the significant things that OpenCalais successfully identified:

  • Billy is a Person
  • Billy is mentioned in 2 places in the text (offsets 92 and 229)
  • Ron Smith is a Person
  • Ron Smith is mentioned in 2 places in the text (offsets 194 and 206)

Interestingly, Bobby was not identified as a Person, and obviously there are a whole lot of entities and relationships that haven’t been identified.
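For those who want to poke at the RDF themselves, here is a quick Python sketch using rdflib (the local filename is hypothetical) that tallies the returned resources by rdf:type and prints the names attached to them; predicates are matched by URI suffix rather than by hardcoding the OpenCalais namespace:

from collections import Counter
from rdflib import Graph, RDF

g = Graph()
g.parse("football-example.rdf", format="xml")  # hypothetical local copy of the response

# Tally resources by rdf:type, keeping only the last path segment,
# e.g. ".../type/em/e/Person" -> "Person".
types = Counter(str(o).rsplit("/", 1)[-1] for _, _, o in g.triples((None, RDF.type, None)))
print(types.most_common())

# Print the human-readable names attached to entities; the predicate is matched
# by URI suffix so the exact OpenCalais namespace is not hardcoded here.
for s, p, o in g:
    if str(p).endswith("/name"):
        print(o, "->", s)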

Admittedly, the football example is probably an edge case for OpenCalais. With Reuters as a sponsor, this technology seems much more geared towards business news analysis. This is even more apparent when you look at their semantic metadata, where you find things like Person, PersonProfessional, PersonPolitical, Bankruptcy, Alliance, Acquisition, etc. But the roadmap mentions that later this year, third-party developers will be able to extend the extraction capabilities of OpenCalais.

This is an interesting open initiative nonetheless, and a smart move from Reuters towards becoming a platform. Its limitations with an edge case also show me how the semantic Web, just like the software market, probably won’t be “owned” by one player, but rather served by a multitude of players with a variety of breadth and depth, because there are as many representations of the world as there are cultures, communities and people, and they evolve all the time.

[Note: as a Frenchman, when I saw the name of this service, I immediately thought of the city of Calais, the town in northern France where Le Tunnel sous la Manche (the “chunnel”) starts/ends. I still can’t figure out why they picked that name and can’t wait to find out!]

Trying out Yahoo! Shortcuts

Yahoo! Shortcuts is a new service launched by Yahoo! on or around December 13th, 2007. Yahoo! Shortcuts makes it easy for bloggers to link the content of their blog posts to Yahoo! resources, such as maps, products for sale or stock quotes.

The interesting part of Yahoo! Shortcuts is in the WordPress plugin provided to insert these links. The plugin detects things as you type your post and allows you to link them to Yahoo! resources. On this page, all the dotted-underline links are Yahoo! Shortcuts that you can try out by passing your mouse pointer over them. Examples include addresses (1600 Pennsylvania Avenue NW, Washington, DC 20500), products (Garmin Nuvi 660), and companies (Apple).

Whenever a “Shortcut” is found, the Y! Powered Shortcuts widget is updated with the number of Shortcuts found. Also, the text of the shortcut is marked up like this: <span id="lw_1202511969_3" class="yshortcuts">Apple</span>. The last digit of the id seems to be the number of the shortcut, and the number “1202511969” seems to be an id that is unique to the post. It’s worth noting the cryptic nature of these tags: there seems to be no way to tell whether a particular shortcut is an address, a product or a company name.

The Yahoo! Shortcuts WordPress widget
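To illustrate the point, here is a tiny Python sketch that pulls out the only machine-readable structure in that markup, the post-specific id and the shortcut counter embedded in the id attribute; note that nothing in there distinguishes an address from a product or a company:

import re

html = '<span id="lw_1202511969_3" class="yshortcuts">Apple</span>'

# id pattern: "lw_<post-specific id>_<shortcut counter>"; the class is always
# "yshortcuts", so the type of the shortcut is not recoverable from the markup.
for post_id, counter, text in re.findall(
        r'<span id="lw_(\d+)_(\d+)" class="yshortcuts">(.*?)</span>', html):
    print(f"shortcut #{counter} in post {post_id}: {text}")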

I also noticed that when an entity like Apple is mentioned several times, it seems to be detected only once. I guess this is to avoid cluttering the post with tons of shortcuts.

The next step consists of reviewing the post. For each Shortcut detected, it is possible to remove it, convert it to a badge or keep it as a link (the default). A badge means that the Yahoo! content will be embedded in the page itself. A link means that the Yahoo! content will appear when the link is hovered.

The Yahoo! Shortcuts WordPress post reviewing page

I have to say I’m quite impressed with the annotation technology. “999 Mission, San Francisco” is not detected, but “999 Mission Street, San Francisco” is. “123 Mission St, San Francisco” is as well. There seems to be support for detection in other languages too: “150 rue saint-jacques, paris” was detected correctly, for instance. I imagine that this is the same technology I noticed Yahoo! uses in Yahoo! Mail to detect email addresses, phone numbers, addresses, events, etc. The FAQ also mentions that there are ways to improve the chances of the service detecting some objects, which gives credit to my “plain old English formats” theory (more on this hopefully in a coming post).

One current limitation of this technology is the detection of the relations between individual pieces of data. For instance, in Yahoo! Mail, if I have a phone number next to a name next to an email address, the email address and the phone number will be detected as individual pieces of data, and I will be given the possibility to create a new contact for the phone number or to add it to an existing contact. This would not be an issue with a microformatted hCard, but writing an hCard today requires more skills and time than writing plain English.

On the usability side, from a post writer’s standpoint, I think the whole thing is pretty well designed, although it would be nice to have the post reviewing step integrated into the WordPress editor (TinyMCE).

The main issue I see is from a reader’s standpoint: they have no choice as to what to do with the detected content. The only thing you can do with an address is to look it up on Yahoo! Maps or search related Yahoo! News. For a company, the only thing you can do is to look up its stock performance on Yahoo! Finance or search for it, etc. But of course, that is the whole point for Yahoo!: driving more traffic towards Yahoo! properties. I also don’t know whether the licensing terms allow the style to be changed (technically, it seems possible since all the style-related files are part of the plugin), but I think that would be a necessity, as these Yahoo! badges may not satisfy everyone’s taste and may repel some users and change their perception of the quality of the blog.

Microsoft Offers to Buy Yahoo: Semantic analysis by OpenCalais

My first try of the OpenCalais semantic extractor made me realize that OpenCalais is currently better suited for business/financial news analysis. So I wanted to give it a try with such a piece of news, and what better and more relevant example than the recent announcement by Microsoft of its offer to buy Yahoo? I picked the news story as reported by Bloomberg.com and submitted it to the OpenCalais semantic extractor Web service released a couple of days ago.

Here is the (huge) RDF XML document returned for this piece of news. According to my HTTP Analyzer, from beginning to end, the request/response took 6.711 seconds, the lion’s share of which (~80%) was spent in transport. The actual Web server processing time was reported as being almost insignificant:

HTTPAnalyzer report
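If you want a rough client-side measurement without an HTTP analyzer, something like the following Python sketch will do, reusing the hypothetical call_opencalais() helper sketched in the earlier post:

import time

with open("bloomberg-article.txt", encoding="utf-8") as f:  # hypothetical local copy of the story
    article = f.read()

start = time.perf_counter()
rdf_xml = call_opencalais(article, api_key="YOUR-API-KEY")
elapsed = time.perf_counter() - start
print(f"round trip: {elapsed:.3f} s, response size: {len(rdf_xml)} bytes")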

In terms of resources identified, OpenCalais did a very impressive job at identifying all the persons, companies, positions of persons at companies, relationships between companies, etc. Here is the list of entities identified, as viewed in the MIT SIMILE Welkin application, and the Circle RDF graph visualization:

List of entities identified

RDF graph visualization

Here is the first issue though. This looks to me like… a haystack, and finding the needle, i.e. what is actually relevant, would seem to be a big challenge. This is why I think it would be valuable for a semantic extractor like OpenCalais to weight RDF statements, for instance according to their position in the text. Typically, the title and first few paragraphs would contain the most relevant information. I believe it would be great (maybe I missed it, though) for the RDF document to return the most relevant RDF statements first.
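To make the idea concrete, here is a naive Python sketch of the kind of weighting I have in mind: rank each extracted subject by the offset of its earliest mention, using the c:subject/c:offset pairs carried by the InstanceInfo nodes (an example appears further down); as before, predicates are matched by URI suffix rather than by hardcoding the namespace, and the filename is hypothetical:

from rdflib import Graph

def rank_by_first_mention(path):
    g = Graph()
    g.parse(path, format="xml")

    subject_of = {}  # InstanceInfo node -> the entity/relation it describes (c:subject)
    offset_of = {}   # InstanceInfo node -> character offset in the source text (c:offset)
    for s, p, o in g:
        if str(p).endswith("/subject"):
            subject_of[s] = o
        elif str(p).endswith("/offset"):
            offset_of[s] = int(o)

    # Keep the earliest mention of each subject; a smaller offset means the
    # subject shows up closer to the title and lead paragraphs.
    earliest = {}
    for instance, subject in subject_of.items():
        if instance in offset_of:
            earliest[subject] = min(earliest.get(subject, float("inf")), offset_of[instance])

    return sorted(earliest.items(), key=lambda item: item[1])

for subject, offset in rank_by_first_mention("microsoft-yahoo.rdf")[:10]:
    print(offset, subject)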

Here are the other issues I noticed with regards to the facts extracted:

First, the name of the company that offered to buy Yahoo is “Microsoft Offers”:

<rdf:Description
    rdf:about="http://d.opencalais.com/genericHasher-1/2fafdca3-9c94-3cec-a757-ab4b5d4ace74">
  <rdf:type rdf:resource="http://s.opencalais.com/1/type/em/r/Acquisition"/>
  <!--Microsoft Offers-->
  <c:company_acquirer
      rdf:resource="http://d.opencalais.com/comphash-1/a58f842b-c683-36e8-bc8f-458fd86ca664"/>
  <!--Yahoo-->
  <c:company_beingacquired
      rdf:resource="http://d.opencalais.com/comphash-1/a7244973-6b02-36b0-8a6c-18cf4b59ee4b"/>
  <c:status>planned</c:status>
</rdf:Description>

<rdf:Description
    rdf:about="http://d.opencalais.com/comphash-1/a58f842b-c683-36e8-bc8f-458fd86ca664">
  <rdf:type rdf:resource="http://s.opencalais.com/1/type/em/e/Company"/>
  <c:name>Microsoft Offers</c:name>
</rdf:Description>

Second, Jerry Yang is reported as Co-founder of Facebook:

<rdf:Description
    rdf:about="http://d.opencalais.com/genericHasher-1/d53dfdcd-ed76-3722-82bc-8b55cf65cc2f">
  <rdf:type rdf:resource="http://s.opencalais.com/1/type/em/r/PersonProfessional"/>
  <!--Jerry Yang-->
  <c:person
      rdf:resource="http://d.opencalais.com/pershash-1/f9e2dce6-da09-3734-a86e-e4cd8e7b6df8"/>
  <c:position>Co-founder </c:position>
  <!--Facebook Inc.-->
  <c:company
      rdf:resource="http://d.opencalais.com/comphash-1/6e117dae-5f83-3641-a394-b626053412cb"/>
</rdf:Description>

Last, it seems that Google is reported as an investor in Microsoft…

<rdf:Description
    rdf:about="http://d.opencalais.com/genericHasher-1/d63adca9-71c2-3268-a47a-86aa2b35408f">
  <rdf:type rdf:resource="http://s.opencalais.com/1/type/em/r/CompanyInvestment"/>
  <!--Microsoft-->
  <c:company
      rdf:resource="http://d.opencalais.com/comphash-1/49bf454b-3fed-3244-94fc-b3d5115f7df4"/>
  <!--Google Inc.-->
  <c:company_investor
      rdf:resource="http://d.opencalais.com/comphash-1/6a5c8712-9c0f-39bd-8655-9a100c09ecba"/>
  <c:status>known</c:status>
</rdf:Description>

<rdf:Description
    rdf:about="http://d.opencalais.com/dochash-1/990c3a0a-89f3-31cc-901c-d563fdcc9aa6/Instance/125">
  <rdf:type rdf:resource="http://s.opencalais.com/1/type/sys/InstanceInfo"/>
  <c:docId
      rdf:resource="http://d.opencalais.com/dochash-1/990c3a0a-89f3-31cc-901c-d563fdcc9aa6"/>
  <c:subject
      rdf:resource="http://d.opencalais.com/genericHasher-1/d63adca9-71c2-3268-a47a-86aa2b35408f"/>
  <!--CompanyInvestment: company: Microsoft; company_investor: Google Inc.; status: known; -->
  <c:detection>[in the past 12 months after years of ]investments in Microsoft's own business[ failed to help the company gain share. ]</c:detection>
  <c:offset>3852</c:offset>
  <c:length>39</c:length>
</rdf:Description>

Obviously the service has access to little to no context besides the document itself.

My conclusion is that OpenCalais would benefit from a statement weighting engine that would take into account known facts, document structure and the level of confidence in the analysis, in order to return the most relevant information to the developer. I wonder whether OpenCalais sees this on their roadmap, whether they have it already (and I missed it, but I don’t think so), or whether they don’t consider this part of the functionality they provide and expect the application developer to do the work.

Another question is why the software is offered as a service when most of the time is spent on transport and when there seems to be little context used outside of the submitted document.

Punctuation as markup

Like many, I spend most of my days in markup, sometimes to the point that I forget what it was invented for and what problem it was meant to address. During these times of doubt and confusion, I like to go back in time and read the works of the pioneers.

Last night, I read this great article, Markup Systems and the Future of Scholarly Text Processing, which dates from before XML and even HTML!

There is a section on the different kinds of markup, in particular punctuational markup and descriptive markup. After reading this, it occurred to me that the following three representations are strictly equivalent (markup is highlighted in bold):

  • Plain text markup using the period “markup” to signify the end of a sentence that is a statement (see this section of HyperGrammar for a grammar refresher): The teacher asked who was chewing gum.
  • XML markup that specifies that a piece of text is a sentence and a statement (notice no end punctuation here): <sentence><statement>The teacher asked who was chewing gum</statement></sentence>
  • Plain old semantic HTML that specifies that a piece of text is a sentence and a statement (notice no end punctuation here): <span class="sentence statement">The teacher asked who was chewing gum</span>

In the last case, you can use CSS code to add a period at the end of each sentence that is a statement:

.sentence.statement:after {
  content: '.';
}
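To push the equivalence a bit further, here is a toy Python sketch of my own (not from the article) that converts the punctuational form into the descriptive one: it classifies each sentence by its terminating punctuation and emits the corresponding semantic HTML, leaving the period itself to the CSS rule above:

import re

KIND = {".": "statement", "?": "question", "!": "exclamation"}

def to_descriptive_markup(text):
    # Split on terminating punctuation and re-express each sentence with
    # descriptive markup instead of punctuational markup.
    spans = []
    for body, mark in re.findall(r"([^.?!]+)([.?!])", text):
        spans.append('<span class="sentence %s">%s</span>' % (KIND[mark], body.strip()))
    return "\n".join(spans)

print(to_descriptive_markup("The teacher asked who was chewing gum. Was it Billy?"))
# -> <span class="sentence statement">The teacher asked who was chewing gum</span>
#    <span class="sentence question">Was it Billy</span>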

This also means that, strictly speaking, combining punctuation with HTML or XML markup when descriptive markup and CSS styling suffice is a bad practice, since it is semantically redundant; in other words, one of the two is useless.

There are plenty of other examples to explore: for instance, quotation marks (“ ”) as markup that is an alternative to the <q></q> or <blockquote></blockquote> HTML markup. This can be pushed to the extreme where each space in plain text is viewed as markup that distinguishes words from one another.

This is probably an epiphany just for me, but I thought I’d post it anyway!