Johannes Hoffart & Gerhard Weikum & Mohamed Amir Yosef
AIDA: Resolving the Name Ambiguity
Have you ever googled your own name to find out what the Web knows about you? Chances are that you are not the only person with your name, and the right Web pages are buried among others, unless you are very famous indeed! This is of course not the only scenario where ambiguity makes life difficult. When we read our daily news, most of the names mentioned are ambiguous. As human beings, we deal with ambiguity without thinking; the right meaning seems obvious to us. We notice it only in the most difficult cases. Take, for example, the sentence “Bush was a US president”. Without further information we cannot know whether “Bush” means George H. W. Bush (the 41st president) or his son George W. Bush (the 43rd president).
Knowing which person (or organization, place, film, song, etc.) is mentioned where on the Web, or indeed in any given text, is very useful for a multitude of applications. Where search engines could previously only look for a string of characters, they can now understand what exactly the user is looking for and give much more precise results. Knowledge about the real meaning allows you to specify that you are looking for the rock group called “Bush” and not a US president. Or imagine a researcher studying differences in the media reception of Bush Sr. and Bush Jr. She can easily retrieve all relevant articles, together with how often each of the two is mentioned in each article, without having to read a single one herself.
Our AIDA disambiguation system resolves the ambiguity by linking names in text to canonical entity representations in a knowledge base, for example YAGO. YAGO contains nearly 10 million unique entities, among them nearly 1 million persons, as well as locations, organizations, products and events (see the YAGO article in this report for more details). The disambiguation process draws on several pieces of data and rules, each of which gives additional clues as to which entity the text actually refers to. Combining all of them in the right manner identifies the correct entity.
The most important insights for a correct disambiguation are the following. First, a name probably refers to the most prominent entity that bears it. When the name “Paris” is found in a text, it generally refers to the French capital, and there has to be really good contextual evidence to suggest otherwise. Take for example the sentence “Paris had to steal Helen from her husband, the king of Sparta”. The words in the sentence suggest that this Paris is a person from Greek mythology. This kind of contextual evidence is the second feature used by our disambiguation mechanism. Every entity in our knowledge base is associated with a textual description in keyword form, which is compared to the context surrounding the name. The more the context and the description overlap, the stronger the indication for the entity. In some cases, however, the words alone are not enough, especially when the contextual evidence is very limited. To deal with these cases, our methodology enforces coherence among the resolved entities, preferring candidates that go well together. In the example sentence “Paris met Helen”, Paris and Helen of Troy are a better fit than Paris Hilton and Helen, Georgia, a small US city.
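The first two signals, prominence and keyword overlap, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the candidate list, the prior values, the keyword sets, the two weights, and the weighted sum itself are invented here and are not taken from AIDA or YAGO. Coherence between several names is left out of this sketch.

```python
import re

# Hypothetical toy knowledge base for the name "Paris".
# Priors and keyword descriptions are invented for this example.
CANDIDATES = {
    "Paris (France)": {
        "prior": 0.90,
        "keywords": {"france", "capital", "seine", "city"},
    },
    "Paris (mythology)": {
        "prior": 0.05,
        "keywords": {"troy", "helen", "sparta", "king", "steal"},
    },
    "Paris Hilton": {
        "prior": 0.05,
        "keywords": {"hotel", "heiress", "celebrity", "television"},
    },
}

# Assumed weights; a real system would have to tune or learn them.
PRIOR_WEIGHT = 0.3
CONTEXT_WEIGHT = 0.7


def context_overlap(context, keywords):
    """Fraction of an entity's keywords that occur in the context."""
    tokens = set(re.findall(r"[a-z]+", context.lower()))
    return len(tokens & keywords) / len(keywords)


def score(entity, context):
    """Weighted sum of the prominence prior and the keyword overlap."""
    info = CANDIDATES[entity]
    overlap = context_overlap(context, info["keywords"])
    return PRIOR_WEIGHT * info["prior"] + CONTEXT_WEIGHT * overlap


def resolve(context):
    """Return the candidate entity with the highest combined score."""
    return max(CANDIDATES, key=lambda entity: score(entity, context))


if __name__ == "__main__":
    print(resolve("Paris is lovely in spring"))
    print(resolve("Paris had to steal Helen from her husband, the king of Sparta"))
```

With no matching keywords in the context, the prior decides and the French capital wins; the mythological Paris is chosen only when the surrounding words overlap strongly with its description.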
The interplay of the described features can be seen very well in the sentence “Bush did not handle the aftermath of Katrina in New Orleans very well”. “New Orleans” is easy to resolve, as the city is very prominently associated with the name. The first name “Katrina” is highly ambiguous, but its placement in the context of “aftermath” gives a strong indication that it refers not to a person but to a natural disaster. Once “Katrina” and “New Orleans” are resolved to the hurricane and the city, it becomes clear that “Bush” refers to Bush Jr., as the hurricane hit during his presidency.
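The effect of coherence in this sentence can be sketched as a small joint optimization. The sketch below is not a description of AIDA's actual algorithm: the candidate lists, the local scores (standing in for prominence plus context), the pairwise coherence values, and the exhaustive search over all combinations are assumptions chosen only to make the idea concrete.

```python
from itertools import combinations, product

# Hypothetical local scores per name (prominence and context combined).
# All numbers are invented; "Bush" is deliberately left almost undecided.
LOCAL_SCORES = {
    "Bush": {
        "George H. W. Bush": 0.42,
        "George W. Bush": 0.40,
        "Bush (band)": 0.18,
    },
    "Katrina": {
        "Hurricane Katrina": 0.60,
        "Katrina and the Waves": 0.40,
    },
    "New Orleans": {
        "New Orleans": 0.95,
        "New Orleans Saints": 0.05,
    },
}

# Hypothetical pairwise coherence between entities; missing pairs count as 0.
COHERENCE = {
    frozenset({"George W. Bush", "Hurricane Katrina"}): 0.8,
    frozenset({"George W. Bush", "New Orleans"}): 0.3,
    frozenset({"George H. W. Bush", "Hurricane Katrina"}): 0.1,
    frozenset({"Hurricane Katrina", "New Orleans"}): 0.9,
    frozenset({"Bush (band)", "Katrina and the Waves"}): 0.2,
}


def coherence(a, b):
    """Symmetric coherence lookup with a default of 0."""
    return COHERENCE.get(frozenset({a, b}), 0.0)


def resolve_locally(local_scores):
    """Pick the best candidate for each name independently."""
    return {name: max(cands, key=cands.get) for name, cands in local_scores.items()}


def resolve_jointly(local_scores):
    """Pick the combination maximizing local scores plus pairwise coherence."""
    names = list(local_scores)
    best_combo, best_score = None, float("-inf")
    # Exhaustive search: fine for a toy example, not for long texts.
    for combo in product(*(local_scores[name] for name in names)):
        total = sum(local_scores[n][e] for n, e in zip(names, combo))
        total += sum(coherence(a, b) for a, b in combinations(combo, 2))
        if total > best_score:
            best_combo, best_score = combo, total
    return dict(zip(names, best_combo))


if __name__ == "__main__":
    print(resolve_locally(LOCAL_SCORES))
    print(resolve_jointly(LOCAL_SCORES))
```

Under these invented numbers, resolving each name on its own leaves “Bush” with the marginally higher-scoring Bush Sr., while the joint search switches to Bush Jr. because he fits the hurricane and the city.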
The disambiguation quality of AIDA was tested on a collection of newswire articles, where AIDA achieved better results than any other existing disambiguation method. Knowing which entities occur in a text allows a more powerful search over such texts and additionally serves as a basis for extracting further knowledge about the entities, for example how they relate to each other.
DEPT. 5 Databases and Information Systems
Phone +49 681 9325-5028
Phone +49 681 9325-5000