My first try of OpenCalais semantic extractor made me realize that OpenCalais is currently better suited for business/financial news analysis. So, I wanted to give it a try with such a piece of news, and what better and more relevant example than the recent announcement by Microsoft of their offer to buy Yahoo. I picked the news story as reported by Bloomberg.com and submitted it to the OpenCalais semantic extractor Web service released a couple days ago.
Here is the (huge) RDF XML document returned from this piece of news. According to my HTTP Analyzer, from begin to end, the request/response took 6.711 seconds, the lion’s share of which was spent in transport (~80%). The actual Web server processing time was reported as being almost insignificant:
In terms of resources identified, OpenCalais did a very impressive job at identifying all the persons, companies, positions of persons at companies, relationship between companies, etc. Here is the list of entities identified as viewed in MIT SMILE Welkin application and the Circle RDF graph visualization:
Here is the first issue though. This looks to me like… a haystack, and finding out the needle, i.e. what is actually relevant, would seem to be a big challenge. This is why I think it would be valuable for a semantic extractor like OpenCalais to weigh RDF statements, for instance according to their position in the text. Typically for instance, the title and first few paragraph would contain the most relevant information. I believe it would be great (maybe I missed it though) for the RDF document to return the most relevant RDF statement first.
Here are the other issues I noticed with regards to the facts extracted:
First, the name of the company who offered to buy Yahoo is “Microsoft offers”:
Second, Jerry Yang is reported as Co-founder of Facebook:
Last, it seems that Google is reported as an investor in Microsoft…
<!--CompanyInvestment: company: Microsoft; company_investor: Google Inc.; status: known; -->
<c:detection>[in the past 12 months after years of ]investments in Microsoft's own business[
failed to help the company gain share. ]</c:detection>
Obviously the service has access to little to no context besides the document itself.
My conclusion is that OpenCalais would benefit from a statement weighting computation engine that would take into account both known facts, document structure and level of confidence in the analysis to return the developer with the most relevant information. I wonder if OpenCalais sees this on their roadmap, if they have it already (and missed it, but I don’t think so), or if they don’t consider this as part of the functionality they provide and expect the application developer to do the work.
Another question is why the software is offered as a service when most of the time is spent on transport and when there seems to be little context used outside of the submitted document.