February | 2008 | Guillaume's blog

Trying out Yahoo! Shortcuts

Yahoo! Shortcuts is a new service launched by Yahoo! on or around December 13th, 2007. Yahoo! Shortcuts makes it easy for bloggers to link the content in their blog posts to Yahoo! resources, such as maps, products for sale or stock quotes.

The interesting part of Yahoo! Shortcuts is the WordPress plugin provided to insert these links. The plugin detects entities as you type your post, and lets you link them to Yahoo! resources. On this page, all the links with a dotted underline are Yahoo! Shortcuts that you can try out by hovering your mouse pointer over them. Examples include addresses (1600 Pennsylvania Avenue NW Washington, DC 20500), products (Garmin Nuvi 660), and companies (Apple).

Whenever a “Shortcut” is found, the Y! Powered Shortcuts widget is updated with the number of Shortcuts found. The text of the shortcut is also marked up like this: <span id="lw_1202511969_3" class="yshortcuts">Apple</span>. The trailing number of the id seems to be the index of the shortcut within the post, and the number “1202511969” seems to be an id that is unique to the post. It’s worth noting the cryptic nature of these tags: there seems to be no way to tell whether a particular shortcut is an address, a product or a company name.
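To make my reading of the id concrete, here is a minimal Python sketch that splits such an id into its two numeric parts. The `lw_<number>_<number>` pattern is only inferred from the one example above, and the names `ShortcutId`, `post_token` and `index` are mine, not Yahoo!'s.

```python
import re
from typing import NamedTuple, Optional


class ShortcutId(NamedTuple):
    post_token: int  # assumed to be unique to the post
    index: int       # assumed to be the position of the shortcut in the post


# Pattern inferred from the single observed id "lw_1202511969_3".
_ID_PATTERN = re.compile(r"^lw_(\d+)_(\d+)$")


def parse_shortcut_id(span_id: str) -> Optional[ShortcutId]:
    """Return the two numeric parts of a shortcut id, or None if it does not match."""
    match = _ID_PATTERN.match(span_id)
    if match is None:
        return None
    return ShortcutId(int(match.group(1)), int(match.group(2)))
```

Note that nothing in the id says what kind of entity was detected, which is exactly the limitation mentioned above.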

I also noticed that when an entity like Apple is mentioned several times, it seems to be detected only once. I guess that this is to avoid cluttering the post with tons of shortcuts.

The next step consists of reviewing the post. For each Shortcut detected, it is possible to remove it, convert it to a badge or keep it as a link (the default). A badge means that the Yahoo! content will be embedded in the page itself. A link means that the Yahoo! content will appear when the reader hovers over the link.

I have to say I’m quite impressed with the annotation technology. “999 Mission, San Francisco” is not detected, but “999 Mission Street, San Francisco” is. “123 Mission St, San Francisco” is as well. There seems to be support for detection in other languages too: 150 rue saint-jacques, paris was detected correctly, for instance. I imagine that this technology is the same one I noticed Yahoo! uses in Yahoo! Mail to detect emails, phone numbers, addresses, events, etc. The FAQ also mentions that there are ways to improve the chances of the service detecting some objects, which lends credence to my “plain old English formats” theory (more on this hopefully in a coming post).

One current limitation of this technology is the detection of the relations between individual pieces of data. For instance, in Yahoo! Mail, if I have a phone number next to a name next to an email, the email and the phone number will be detected as individual pieces of data, and I will be given the possibility to create a new contact for the phone number or to add it to an existing contact. This would not be an issue with a microformatted hCard, but writing an hCard today requires more skills and time than writing plain English.

On the usability side, from a post writer’s standpoint, I think the whole thing is pretty well designed, although it would be nice to have the post reviewing step integrated in the WordPress editor (TinyMCE).

The main issue I see is from a reader’s standpoint: readers have no choice as to what to do with the detected content. The only thing you can do with an address is to look it up on Yahoo! Maps or search for related Yahoo! news. For a company, the only thing you can do is to look up its stock performance on Yahoo! Finance or search for it, etc. But of course, that is the whole point for Yahoo!: to drive more traffic towards Yahoo! properties. I also don’t know if the licensing terms allow the style to be changed (technically, it seems possible since all the style-related files are part of the plugin), but I think that would be a necessity, as these Yahoo! badges may not satisfy everyone’s taste and may repel some users and change their perception of the quality of the blog.

Microsoft Offers to Buy Yahoo: Semantic analysis by OpenCalais

My first try of the OpenCalais semantic extractor made me realize that OpenCalais is currently better suited for business/financial news analysis. So, I wanted to give it a try with such a piece of news, and what better and more relevant example than the recent announcement by Microsoft of their offer to buy Yahoo. I picked the news story as reported by Bloomberg.com and submitted it to the OpenCalais semantic extractor Web service released a couple of days ago.

Here is the (huge) RDF XML document returned from this piece of news. According to my HTTP Analyzer, from beginning to end, the request/response took 6.711 seconds, the lion’s share of which was spent in transport (~80%). The actual Web server processing time was reported as being almost insignificant:

HTTPAnalyzer report

In terms of resources identified, OpenCalais did a very impressive job at identifying all the persons, companies, positions of persons at companies, relationships between companies, etc. Here is the list of entities identified, as viewed in the MIT SIMILE Welkin application with the Circle RDF graph visualization:

List of entities identified

Here is the first issue though. This looks to me like… a haystack, and finding the needle, i.e. what is actually relevant, would seem to be a big challenge. This is why I think it would be valuable for a semantic extractor like OpenCalais to weigh RDF statements, for instance according to their position in the text. Typically, the title and first few paragraphs would contain the most relevant information. I believe it would be great (maybe I missed it though) for the RDF document to list the most relevant RDF statements first.

Here are the other issues I noticed with regards to the facts extracted:

First, the name of the company that offered to buy Yahoo is “Microsoft Offers”:

<rdf:Description rdf:about="http://d.opencalais.com/genericHasher-1/2fafdca3-9c94-3cec-a757-ab4b5d4ace74">
  <rdf:type rdf:resource="http://s.opencalais.com/1/type/em/r/Acquisition"/>
  <c:company_acquirer rdf:resource="http://d.opencalais.com/comphash-1/a58f842b-c683-36e8-bc8f-458fd86ca664"/>
  <c:company_beingacquired rdf:resource="http://d.opencalais.com/comphash-1/a7244973-6b02-36b0-8a6c-18cf4b59ee4b"/>
  <c:status>planned</c:status>
</rdf:Description>
<rdf:Description rdf:about="http://d.opencalais.com/comphash-1/a58f842b-c683-36e8-bc8f-458fd86ca664">
  <rdf:type rdf:resource="http://s.opencalais.com/1/type/em/e/Company"/>
  <c:name>Microsoft Offers</c:name>
</rdf:Description>
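For readers who want to check this kind of statement themselves, here is a rough Python sketch of how the acquirer's hash URI can be resolved to a company name with the standard library. The excerpt above does not show the namespace declarations, so the URI I bind to the `c:` prefix is an assumption, and the function names are mine.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
C = "http://s.opencalais.com/1/pred/"  # assumption: not visible in the excerpt

# The two descriptions quoted above, wrapped in a root element.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:c="{C}">
<rdf:Description rdf:about="http://d.opencalais.com/genericHasher-1/2fafdca3-9c94-3cec-a757-ab4b5d4ace74">
  <rdf:type rdf:resource="http://s.opencalais.com/1/type/em/r/Acquisition"/>
  <c:company_acquirer rdf:resource="http://d.opencalais.com/comphash-1/a58f842b-c683-36e8-bc8f-458fd86ca664"/>
  <c:company_beingacquired rdf:resource="http://d.opencalais.com/comphash-1/a7244973-6b02-36b0-8a6c-18cf4b59ee4b"/>
  <c:status>planned</c:status>
</rdf:Description>
<rdf:Description rdf:about="http://d.opencalais.com/comphash-1/a58f842b-c683-36e8-bc8f-458fd86ca664">
  <rdf:type rdf:resource="http://s.opencalais.com/1/type/em/e/Company"/>
  <c:name>Microsoft Offers</c:name>
</rdf:Description>
</rdf:RDF>"""


def q(namespace, tag):
    """Build the '{namespace}tag' form that ElementTree expects."""
    return f"{{{namespace}}}{tag}"


def resource(description, namespace, tag):
    """Return the rdf:resource of a child element, or None if the child is absent."""
    child = description.find(q(namespace, tag))
    return None if child is None else child.get(q(RDF, "resource"))


def company_names(root):
    """Map each Company description URI to its c:name."""
    names = {}
    for description in root.findall(q(RDF, "Description")):
        rdf_type = resource(description, RDF, "type") or ""
        name = description.findtext(q(C, "name"))
        if rdf_type.endswith("/Company") and name is not None:
            names[description.get(q(RDF, "about"))] = name
    return names


def acquisitions(root):
    """List (acquirer, target, status), using names when a description is available."""
    names = company_names(root)
    found = []
    for description in root.findall(q(RDF, "Description")):
        rdf_type = resource(description, RDF, "type") or ""
        if not rdf_type.endswith("/Acquisition"):
            continue
        acquirer = resource(description, C, "company_acquirer")
        target = resource(description, C, "company_beingacquired")
        status = description.findtext(q(C, "status"))
        found.append((names.get(acquirer, acquirer), names.get(target, target), status))
    return found
```

Calling `acquisitions(ET.fromstring(SAMPLE))` resolves the acquirer to the name given in the second description; the acquired company stays a bare URI because its description is not part of the excerpt.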

Second, Jerry Yang is reported as Co-founder of Facebook:

<rdf:Description rdf:about="http://d.opencalais.com/genericHasher-1/d53dfdcd-ed76-3722-82bc-8b55cf65cc2f">
  <rdf:type rdf:resource="http://s.opencalais.com/1/type/em/r/PersonProfessional"/>
  <c:person rdf:resource="http://d.opencalais.com/pershash-1/f9e2dce6-da09-3734-a86e-e4cd8e7b6df8"/>
  <c:position>Co-founder </c:position>
  <c:company rdf:resource="http://d.opencalais.com/comphash-1/6e117dae-5f83-3641-a394-b626053412cb"/>
</rdf:Description>

Last, it seems that Google is reported as an investor in Microsoft…

<rdf:Description rdf:about="http://d.opencalais.com/genericHasher-1/d63adca9-71c2-3268-a47a-86aa2b35408f">
  <rdf:type rdf:resource="http://s.opencalais.com/1/type/em/r/CompanyInvestment"/>
  <c:company rdf:resource="http://d.opencalais.com/comphash-1/49bf454b-3fed-3244-94fc-b3d5115f7df4"/>
  <c:company_investor rdf:resource="http://d.opencalais.com/comphash-1/6a5c8712-9c0f-39bd-8655-9a100c09ecba"/>
  <c:status>known</c:status>
</rdf:Description>
<rdf:Description rdf:about="http://d.opencalais.com/dochash-1/990c3a0a-89f3-31cc-901c-d563fdcc9aa6/Instance/125">
  <rdf:type rdf:resource="http://s.opencalais.com/1/type/sys/InstanceInfo"/>
  <c:docId rdf:resource="http://d.opencalais.com/dochash-1/990c3a0a-89f3-31cc-901c-d563fdcc9aa6"/>
  <c:subject rdf:resource="http://d.opencalais.com/genericHasher-1/d63adca9-71c2-3268-a47a-86aa2b35408f"/>
  <c:detection>[in the past 12 months after years of ]investments in Microsoft's own business[ failed to help the company gain share. ]</c:detection>
  <c:offset>3852</c:offset>
  <c:length>39</c:length>
</rdf:Description>

Obviously the service has access to little to no context besides the document itself.

My conclusion is that OpenCalais would benefit from a statement weighting engine that would take into account known facts, document structure and the level of confidence in the analysis, in order to return the most relevant information to the developer. I wonder if OpenCalais sees this on their roadmap, if they have it already (and I missed it, but I don’t think so), or if they don’t consider this as part of the functionality they provide and expect the application developer to do the work.
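To illustrate what I mean by weighting, here is a toy Python sketch. It is in no way what OpenCalais does: the only input taken from the actual response is the character offset of a detection (the `c:offset` value); the `confidence` and `contradicts_known_fact` fields, the penalty value and the document length in the example are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Statement:
    label: str
    offset: int                           # character offset of the detection (c:offset)
    confidence: float = 1.0               # hypothetical: not seen in the response
    contradicts_known_fact: bool = False  # hypothetical: would need a fact base


def weight(statement, doc_length, known_fact_penalty=0.1):
    """Score a statement: earlier in the text and more confident means more relevant."""
    if doc_length <= 0:
        raise ValueError("doc_length must be positive")
    # 1.0 for a detection at the very start of the document, 0.0 at the very end.
    position = 1.0 - min(statement.offset, doc_length) / doc_length
    score = position * statement.confidence
    if statement.contradicts_known_fact:
        score *= known_fact_penalty
    return score


def rank(statements, doc_length):
    """Return the statements sorted from most to least relevant."""
    return sorted(statements, key=lambda s: weight(s, doc_length), reverse=True)


# Example with made-up values, except for the offset 3852 quoted above.
statements = [
    Statement("Google invests in Microsoft", offset=3852,
              confidence=0.4, contradicts_known_fact=True),
    Statement("Microsoft acquires Yahoo", offset=20),
]
ranked = rank(statements, doc_length=5000)
```

With such a scheme, a statement detected in the title or the first paragraph would come out ahead of one detected near the end of the article, and a statement that clashes with known facts would sink to the bottom.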

Another question is why the software is offered as a service when most of the time is spent on transport and when there seems to be little context used outside of the submitted document.

Alexander Payne’s 14e Arrondissement

This short from the movie Paris, je t’aime is the one that moved me the most. I don’t know why it is on YouTube, but I’m so happy I found it. Enjoy!

The meaning of vCard’s “fn”

Martin McEvoy recently resurrected a thread on the replacement of “fn” by “title” in the hAudio microformat. The main point is that “fn” (formatted name) is a bit cumbersome for a song’s name/title. This offered me the opportunity to give my interpretation of the meaning of “formatted name”, which I will summarize here.

A formatted name is a locale-specific (typically of the locale the name is from) serialization of a structured representation of a person’s name. It is useful for display and print, for instance on the label of an envelope, where conformance to local name ordering practices is desired for politeness reasons.

Now some explanation of why formatted names are important for people’s names.

For those who don’t know, there are different name ordering conventions in different parts of the world. Just as an example, given name first, family name second is common in Western countries, whereas family name first, given name second is common in Eastern countries.

So, computer people who want to store names of persons for different places in the world have to deal with the following problem: they want to be able to distinguish family names from given names and other names (middle, mother’s, etc.) since it helps for searching, for identification and for avoiding duplicate entries, but they don’t want to be impolite either and send a letter with a name formatting that does not comply with the locale of the person.

One solution to this problem could be to identify all the different types of name ordering conventions, for instance, by locale and locale region, then code these rules in some programming language, then keep the information about the locale of the person, or infer it from the country they were born in, or the place they live, or something else, and then compute the formatted name from the database or structured or tokenized representation.

That is obviously a lot of work, and also not completely fail-proof. For instance, a Japanese person living in the U.S. might still want their name to be printed on letters with the family name first. If you add honorific titles, prefixes, suffixes, abbreviated forms, etc. to the problem mix, it is even more work. Usually at this point, what the computer people do is go back to the problem they were addressing (usually not an international name storage problem, but something else like a customer data storage problem for a U.S. bank or an electronic business card problem) and realize that they are spending too much time on this issue (“Why are we doing this again?”) and that not much will come out of their work if they continue on this path (at least not fast enough for the next quarter). This is the exact situation that my IFX colleagues and I faced a couple of years ago, and I’m hypothesizing that this is the same problem that the vCard people faced.

The only easy solution is to store the name of the person in a structured format, but keep a copy of the preferred formatting in a separate field. This is what we did at IFX, and this is what I’m guessing the vCard people did.
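As an illustration only (this is neither the IFX nor the vCard data model, and the class and field names are mine), the idea can be sketched in Python as a structured name that carries an optional copy of the preferred formatting:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PersonName:
    given: str
    family: str
    # The "formatted name": the person's preferred serialization, stored as-is
    # instead of being recomputed from locale rules.
    formatted: Optional[str] = None

    def display(self) -> str:
        """Use the stored formatting when present, else a naive given-family order."""
        if self.formatted:
            return self.formatted
        return f"{self.given} {self.family}"


# The structured fields stay available for searching and de-duplication,
# while the envelope label uses whatever the person prefers.
western = PersonName(given="Jerry", family="Yang")
eastern = PersonName(given="Taro", family="Yamada", formatted="Yamada Taro")
```

The fallback in `display` is deliberately naive: the whole point of keeping the formatted copy is to avoid having to code the ordering rules at all.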

All this to say that the meaning of “formatted name” is to me very specific to those names for which there is value in maintaining two representations, one structured and one serialized, because reconstructing one from the other is difficult. To go back to the original thread raised by Martin, and given the above, I don’t think that “fn” should be used for a song’s name.

Punctuation as markup

Like many, I spend most of my days in markup, sometimes to the point that I forget what it was invented for and what problem it addresses. During these times of doubt and confusion, I like to go back in time and read the works of the pioneers.

Last night, I read this great article, Markup Systems and the Future of Scholarly Text Processing, which dates back to before XML, HTML, SGML, or GML even!

There is a section on the different kinds of markup, in particular punctuational markup and descriptive markup. After reading this, it occurred to me that the following three representations are strictly equivalent (markup is highlighted in bold):

Plain text markup using the period “markup” to signify the end of a sentence that is a statement (see this section of HyperGrammar for a grammar refresher): The teacher asked who was chewing gum.
XML markup that specifies that a piece of text is a sentence and a statement (notice no end punctuation here): <sentence><statement>The teacher asked who was chewing gum</statement></sentence>
Plain old semantic HTML that specifies that a piece of text is a sentence and a statement (notice no end punctuation here): <span class="sentence statement">The teacher asked who was chewing gum</span>

In the last case, you can use CSS code to add a period at the end of each sentence that is a statement:
.sentence.statement:after { content: '.' }

This means also that if we are being strict, combining punctuation with HTML or XML markup, when descriptive markup and CSS styling suffice, is a bad practice since it is semantically redundant, or in other words, one of the two is useless.
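The same equivalence can be shown outside of CSS. Here is a small Python sketch that derives the punctuational form from the descriptive one; the `question` and `exclamation` class names are my own extension of the example and appear nowhere above.

```python
import xml.etree.ElementTree as ET

# Closing punctuation for each kind of sentence. Only "statement" comes from
# the example above; the other two class names are hypothetical.
END_MARKS = {"statement": ".", "question": "?", "exclamation": "!"}


def render(markup: str) -> str:
    """Turn a descriptively marked-up sentence into its punctuated plain-text form."""
    element = ET.fromstring(markup)
    classes = (element.get("class") or "").split()
    text = "".join(element.itertext()).strip()
    if "sentence" in classes:
        for name in classes:
            if name in END_MARKS:
                return text + END_MARKS[name]
    return text


plain = render('<span class="sentence statement">The teacher asked who was chewing gum</span>')
```

Here `plain` ends with a period that was never typed: the punctuation is computed entirely from the descriptive markup, just as the CSS rule does at display time.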

There are plenty of other examples to explore: for instance, quotes (“) as markup that is an alternative to the <q></q> or <blockquote></blockquote> HTML markup. This may be pushed to the extreme where each space in plain text is viewed as markup that distinguishes words from one another.

This is probably an epiphany just for me, but I thought I’d post it anyway!