STICS – Search and Analysis with Strings, Things, and Cats

Johannes Hoffart & Dragan Milchevski & Gerhard Weikum

“Things, not Strings” was Google’s motto when it introduced the Knowledge Graph and made its search engine entity-aware. When you type the keyword “Klitschko” as a query, Google still returns Web and news pages, but also explicit entities like Wladimir Klitschko and his brother Vitali (including structured attributes from the Knowledge Graph, such as date of birth, profession, and relations to other entities). Moreover, while you type, query auto-completion suggests the two brothers in entity form, with the additional hints that one is an active boxer and the other a politician.

However, the Google approach still has limitations. First, recognizing entities in a keyword query and returning entity results seems to be limited to prominent entities. Unlike the Klitschko example, a query for the Ukrainian pop singer “Iryna Bilyk” does not show any entity suggestions, either in auto-completion or in the search results. Second, Google seems to understand only individual entities and cannot handle sets of entities that are described by a type name or category phrase. For example, queries like “Ukrainian celebrities” or “East European politicians” return only the usual ten blue links: Web pages that match these phrases. The search engine does not understand the user’s intention to obtain lists of people in these categories.

STICS, short for “Searching with Strings, Things, and Cats”, is a novel search engine that extends entity awareness in Web and news searches by tapping into long-tail entities and by understanding and expanding phrases that refer to semantic categories. STICS supports users in searching for strings, things, and cats (short for categories) in a seamless and convenient manner. For example, when posing the query “Merkel Ukrainian opposition”, the user is automatically guided, through auto-completion, to the entity ‘Angela_Merkel’ and the category ‘Ukrainian_politicians’, which is automatically expanded into ‘Vitali_Klitschko’, ‘Arseniy_Yatsenyuk’, etc. The search results include texts like “the German chancellor met the Ukrainian opposition leader and former heavyweight champion”, even if these texts never mention the strings “Angela Merkel” and “Vitali Klitschko”. STICS achieves this by using the named entity recognition and disambiguation system AIDA, which links ambiguous words to entities in YAGO, which in turn contains the entities’ categories. The inner workings of AIDA are detailed in “AIDA – Resolving the Name Ambiguity”.
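The category expansion described above can be illustrated with a minimal sketch. It is not the STICS implementation: the dictionaries, the document ids, and the functions `expand` and `search` are hypothetical, and the per-document entity sets stand in for the output of a disambiguation step such as AIDA.

```python
# Hypothetical toy data: category -> member entities (as a knowledge base might provide).
CATEGORY_MEMBERS = {
    "Ukrainian_politicians": {"Vitali_Klitschko", "Arseniy_Yatsenyuk"},
}

# Hypothetical toy index: document id -> entities that a disambiguation
# step has linked in that document, regardless of the surface strings used.
DOC_ENTITIES = {
    "doc1": {"Angela_Merkel", "Vitali_Klitschko"},
    "doc2": {"Angela_Merkel"},
    "doc3": {"Arseniy_Yatsenyuk"},
}


def expand(term):
    """Return the set of entities a query term stands for.

    A category is replaced by its members; an entity stands for itself.
    """
    return set(CATEGORY_MEMBERS.get(term, {term}))


def search(query_terms):
    """Return ids of documents that contain, for every query term,
    at least one of the entities the term expands to."""
    groups = [expand(term) for term in query_terms]
    return sorted(
        doc
        for doc, entities in DOC_ENTITIES.items()
        if all(entities & group for group in groups)
    )


print(search(["Angela_Merkel", "Ukrainian_politicians"]))  # ['doc1']
```

Because matching happens on entity ids rather than on strings, "doc1" would be found even if its text only said “the German chancellor” and “the former heavyweight champion”.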

The same technology can also be used to improve the analysis of large archives. Consider the task of visualizing trends around the recent Ukrainian crisis, which originated from the Maidan, the square in Kiev where thousands of Ukrainians protested in early 2014. A search for “Maidan” quickly reveals that the name is highly ambiguous, as it means ‘square’ not only in Ukrainian, but also in Hindi and Arabic. Thus, simply counting the string “Maidan” will result in a large number of false positives, leading to an imprecise analysis. By specifying the canonicalized entity ‘Maidan_Nezalezhnosti’, not only do we get rid of spurious mentions of other Maidans, but we also find articles where the square is mentioned only by its English name “Independence Square”. Thus, entity-level analytics, as supported by STICS, is the only way to get accurate numbers.
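The difference between string counting and entity counting can be made concrete with a small sketch. The mention list is made-up sample data, the id ‘Maidan_(Kolkata)’ is a hypothetical identifier for some other square, and both functions are illustrative only.

```python
# Made-up sample data: (surface string in the text, entity id assigned by disambiguation).
MENTIONS = [
    ("Maidan", "Maidan_Nezalezhnosti"),
    ("Independence Square", "Maidan_Nezalezhnosti"),
    ("Maidan", "Maidan_(Kolkata)"),  # hypothetical id for a different square
    ("Maidan", "Maidan_Nezalezhnosti"),
    ("Maidan", "Maidan_(Kolkata)"),
]


def count_string(mentions, surface):
    """Count mentions whose surface string equals the given string."""
    return sum(1 for s, _ in mentions if s == surface)


def count_entity(mentions, entity):
    """Count mentions that were linked to the given canonical entity."""
    return sum(1 for _, e in mentions if e == entity)


print(count_string(MENTIONS, "Maidan"))                # 4
print(count_entity(MENTIONS, "Maidan_Nezalezhnosti"))  # 3
```

In this toy data, the string count includes two mentions of a different square and misses the mention by the English name, whereas the entity count covers exactly the three mentions of the square in Kiev.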

Additionally, with the full potential of a structured knowledge base in the background, further opportunities open up. In all semantic knowledge bases, entities are organized in a category hierarchy, e.g., ‘Greenpeace’ is an ‘environmental_organization’, which is in turn a subclass of the general ‘organization’. Using this category hierarchy, we can conduct analyses for entire groups of entities, for example, comparing the presence of ‘environmental_organizations’ and ‘power_companies’ in news from different parts of the world, and deriving a picture of how their importance changes over time.
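A category-level analysis of this kind can be sketched as a roll-up of entity mentions along the hierarchy. Everything below is an assumption made for illustration: the two-level hierarchy, the placeholder entity ‘Example_Power_Company’, the sample mentions, and the function names; none of it reflects the actual YAGO data or the STICS code.

```python
from collections import Counter

# Hypothetical fragment of a category hierarchy: category -> direct superclass.
SUBCLASS_OF = {
    "environmental_organization": "organization",
    "power_company": "organization",
}

# Hypothetical entity typing: entity -> its most specific category.
ENTITY_CATEGORY = {
    "Greenpeace": "environmental_organization",
    "Example_Power_Company": "power_company",  # placeholder entity
}


def ancestors(category):
    """Yield the category itself and then all of its superclasses."""
    while category is not None:
        yield category
        category = SUBCLASS_OF.get(category)


def category_counts(entity_mentions):
    """Count mentions per (year, category), crediting every superclass too."""
    counts = Counter()
    for year, entity in entity_mentions:
        for category in ancestors(ENTITY_CATEGORY[entity]):
            counts[(year, category)] += 1
    return counts


# Made-up sample data: (year of the article, linked entity).
mentions = [
    (2013, "Greenpeace"),
    (2014, "Greenpeace"),
    (2014, "Example_Power_Company"),
    (2014, "Greenpeace"),
]
counts = category_counts(mentions)
print(counts[(2014, "environmental_organization")])  # 2
print(counts[(2014, "organization")])                # 3
```

Grouping the counts by year (or by region, if each mention carried one) would give the kind of trend comparison between categories that the paragraph describes.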

Johannes Hoffart

DEPT. 5 Databases and Information Systems
Phone +49 681 9325-5004
Email jhoffart (at) mpi-inf.mpg.de

Dragan Milchevski

DEPT. 5 Databases and Information Systems
Phone +49 681 9325-5013
Email dmilchev (at) mpi-inf.mpg.de

Gerhard Weikum

DEPT. 5 Databases and Information Systems
Phone +49 681 9325-5000
Email weikum (at) mpi-inf.mpg.de