Association Rule Mining under Incomplete Evidence in Ontological Knowledge Bases
AMIE is a system that extracts supported and confident logical rules from a knowledge base (KB). Logical rules encode frequent correlations in the data. For example, the rule
?x <hasChild> ?c ?y <hasChild> ?c => ?x <isMarriedTo> ?y
states that people having children in common are frequently married. Logical rules have potential in a broad range of applications such as data prediction, detection of irregularities, automatic schema generation, and ontology reconciliation. AMIE can mine these patterns in medium-sized KBs several orders of magnitude faster than state-of-the-art approaches for mining logical rules from KBs. The first application of AMIE uses logical rules to address the problem of incompleteness in KBs (particularly web-extracted KBs).
AMIE can extract closed Horn rules from medium-sized ontologies in a few minutes. We report the runtimes for AMIE+, the latest version of AMIE, which includes a set of runtime enhancements. AMIE and AMIE+ can sort and threshold on support, head coverage, standard confidence and PCA confidence. By default, AMIE+ uses a head coverage threshold of 0.01 and a minimum PCA confidence of 0.1, and disables the instantiation operator (atoms do not contain constants). Any deviations from these settings are explicitly mentioned.
|Dataset||# of facts||Settings||Latest runtime||Rules|
|YAGO2||948048||Default||28.19s||Sorted by: Std. Conf, PCA Conf, All rules|
|YAGO2||948048||Support 2 facts||3.76 min||All rules|
|YAGO2 sample||46654||Support 2 facts||2.90s||Sorted by PCA conf, All rules|
|YAGO2||948048||Default + constants||9.93 min||Some interesting examples, All rules|
|DBpedia 2.0||6704524||Default||46.88 min||Rules|
|DBpedia 3.8||11024066||Default||7h 6 min||Rules|
|Wikidata (Dec 2014)||11296834||Default||25.50 min||Rules|
YAGO is a semantic knowledge base derived from Wikipedia, WordNet and GeoNames. The latest version, YAGO2s, contains 120M facts describing properties of 10M different entities. Since the rules output by AMIE are used for prediction, we used the previous version, YAGO2 (released in 2010), to predict facts in YAGO2s. YAGO contains 120M facts about 2.6M entities. For both versions of the ontology we considered neither facts with literal objects nor any type of schema information (rdf:type statements, relation signatures and descriptions). For YAGO2s, this is equivalent to using the file yagoFacts with 4.12M triples. For YAGO2 we use the file yago core, which contains 948K facts after cleaning. The clean testing versions of [YAGO2] and [YAGO2s] are available for download.
Our experiments included comparisons against state-of-the-art systems which could not handle even our clean version of YAGO2. For this reason, we built a sample of this KB by randomly picking 10K entities and collecting their 3-hop subgraphs. In contrast to a random sample of facts, this method preserves the original graph topology. This procedure resulted in a [47K facts sample].
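The sampling procedure described above can be sketched roughly as follows. This is a minimal illustration and not the script used to build the published sample: the function name `sample_k_hop`, the triple representation, and the choice to follow edges in both directions are assumptions made for the example.

```python
import random
from collections import defaultdict

def sample_k_hop(facts, seeds, hops=3):
    """Collect all facts within `hops` edges of the given seed entities.

    `facts` is an iterable of (subject, relation, object) triples.
    Assumption: edges are traversed in both directions.
    """
    # Index: entity -> facts in which it appears as subject or object.
    by_entity = defaultdict(list)
    for s, r, o in facts:
        by_entity[s].append((s, r, o))
        by_entity[o].append((s, r, o))

    frontier = set(seeds)
    visited = set(seeds)
    sampled = set()
    for _ in range(hops):
        next_frontier = set()
        for entity in frontier:
            for s, r, o in by_entity[entity]:
                sampled.add((s, r, o))
                for neighbour in (s, o):
                    if neighbour not in visited:
                        visited.add(neighbour)
                        next_frontier.add(neighbour)
        frontier = next_frontier
    return sampled

def pick_random_seeds(facts, num_seeds, rng_seed=0):
    """Pick `num_seeds` distinct entities uniformly at random."""
    entities = sorted({e for s, _, o in facts for e in (s, o)})
    rng = random.Random(rng_seed)
    return rng.sample(entities, min(num_seeds, len(entities)))
```

With a list of triples `kb`, a call such as `sample_k_hop(kb, pick_random_seeds(kb, 10_000))` would mirror the 10K-entity, 3-hop setting mentioned in the text.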
DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia. The English version of DBpedia contains 1.89 billion facts about 2.45M entities. In the spirit of our data prediction endeavours, we mined rules from DBpedia 2.0 to predict facts in DBpedia 3.8 (in English). In both cases we used the person data and infoboxes datasets and removed facts with literal objects and rdf:type statements. We also removed relations with fewer than 100 facts. This produced a clean subset of 6.7M facts for [DBpedia 2.0] and 11.02M for [DBpedia 3.8].
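As a rough illustration of the cleaning steps just described, the sketch below filters a list of triples. It is not the original preprocessing code: the names `clean_kb` and `is_literal`, the convention that literals start with a double quote, and the decision to count relation sizes after the first two filters are all assumptions.

```python
from collections import Counter

RDF_TYPE = "rdf:type"

def is_literal(obj):
    # Assumption: literal objects are written as quoted strings, e.g. "1975".
    return obj.startswith('"')

def clean_kb(facts, min_relation_size=100):
    """Drop literal facts, rdf:type statements, and small relations."""
    kept = [
        (s, r, o)
        for s, r, o in facts
        if r != RDF_TYPE and not is_literal(o)
    ]
    # Count facts per relation among the surviving triples.
    sizes = Counter(r for _, r, _ in kept)
    return [(s, r, o) for s, r, o in kept if sizes[r] >= min_relation_size]
```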
Wikidata is a free, community-based knowledge base maintained by the Wikimedia Foundation. The goal of the Wikidata project is to provide the same information as Wikipedia but in a computer-readable format; that is, Wikidata can be seen as the structured sibling of Wikipedia. For our experiments we used a dump of Wikidata from December 2014. As with the other datasets, we removed literal facts and type information, leading to a clean set of 8.4M facts on which we ran AMIE.
Standard vs PCA confidence
To support the suitability of the PCA confidence metric for predicting new facts, we carried out an experiment that uses the rules mined by AMIE on YAGO2 (training KB) to predict facts in the newer YAGO2s (testing KB). We took all rules mined by AMIE with head coverage threshold 0.01 and ranked them by standard and PCA confidence. Then we took every rule and generated new facts by taking all bindings of the head variables in the body of the rule which are not in the head (sets B, C and D in our mining model). For instance, for the rule ?s <livesIn> ?o => ?s <isCitizenOf> ?o, we produce predictions of type A <isCitizenOf> B where A and B correspond to bindings of people and places in the body of the rule (<livesIn> relation). Some of those citizenship predictions are already in YAGO2 and constitute the positive examples, i.e., the support of the rule. However, in this experiment we are interested in the precision of those predictions that are beyond YAGO2.
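To make the procedure concrete, here is a small sketch for a single-atom rule such as ?s <livesIn> ?o => ?s <isCitizenOf> ?o on a toy KB. It follows the metric definitions as we understand them from the AMIE papers (standard confidence divides the support by all body bindings; PCA confidence only counts body bindings whose subject has at least one fact for the head relation). The function name `evaluate_rule` and the toy facts are invented for illustration, and this is not AMIE's implementation.

```python
def evaluate_rule(kb, body_rel, head_rel):
    """Evaluate the rule  ?s body_rel ?o => ?s head_rel ?o  on a list of triples."""
    body = {(s, o) for s, r, o in kb if r == body_rel}
    head = {(s, o) for s, r, o in kb if r == head_rel}
    subjects_with_head = {s for s, _ in head}

    # Support: bindings of the head variables satisfying both body and head.
    support = len(body & head)
    # Standard confidence: every body binding missing from the head
    # counts as a counter-example.
    std_conf = support / len(body) if body else 0.0
    # PCA confidence: only subjects that already have some head-relation
    # fact can contribute counter-examples.
    pca_body = {(s, o) for s, o in body if s in subjects_with_head}
    pca_conf = support / len(pca_body) if pca_body else 0.0
    # Predictions: body bindings that are not yet in the head relation.
    predictions = body - head
    return support, std_conf, pca_conf, predictions

toy_kb = [
    ("Ann", "<livesIn>", "France"),
    ("Bob", "<livesIn>", "Italy"),
    ("Cid", "<livesIn>", "Norway"),
    ("Dee", "<livesIn>", "Peru"),
    ("Ann", "<isCitizenOf>", "France"),
    ("Bob", "<isCitizenOf>", "Germany"),
]
support, std_conf, pca_conf, predictions = evaluate_rule(
    toy_kb, "<livesIn>", "<isCitizenOf>"
)
print(support, std_conf, pca_conf)  # 1 0.25 0.5
print(sorted(predictions))
```

In the toy KB, Cid and Dee have no known citizenship, so the PCA does not treat their predictions as counter-examples, whereas the standard confidence does.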
Since a fact can be predicted by multiple rules with the same head relation, we have to handle duplicate predictions. In the original version of this experiment, we removed duplicates naively by ignoring subsequent occurrences of the same prediction. This is equivalent to ranking predictions by the score of the most confident rule that concluded them. This, however, disregards the fact that predictions made by multiple rules should be prioritized because they rely on multiple signals of evidence (we make no claim here about the independence of such signals). We also had type checking problems due to granularity problems in the signatures of some relations, e.g., the rule ?s <livesIn> ?o => ?s <isCitizenOf> ?o could sometimes predict that people were citizens of cities. We overcame these issues by (a) devising an aggregated score for predictions that considers the confidence of all the rules that made a prediction and (b) enhancing the rules with type constraints in order to avoid spurious predictions. The type information (rdf:type statements) was taken from YAGO3 because this version of the ontology solves the aforementioned granularity issues.
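The difference between the two ways of handling duplicates can be sketched as follows. The naive variant keeps the confidence of the best rule for each prediction. For the aggregated variant, this page does not spell out the formula, so the example uses a noisy-or combination purely as an assumption; it treats the rules as independent signals, which is a simplification. Both function names are hypothetical.

```python
from collections import defaultdict
from math import prod

def naive_scores(rule_predictions):
    """Score each fact by its most confident rule (duplicates ignored).

    `rule_predictions` is an iterable of (fact, rule_confidence) pairs,
    with one pair per rule that predicted the fact.
    """
    scores = {}
    for fact, conf in rule_predictions:
        scores[fact] = max(conf, scores.get(fact, 0.0))
    return scores

def aggregated_scores(rule_predictions):
    """Combine the confidences of all rules predicting a fact.

    Assumption: noisy-or aggregation, 1 - prod(1 - conf_i).
    """
    confs = defaultdict(list)
    for fact, conf in rule_predictions:
        confs[fact].append(conf)
    return {
        fact: 1.0 - prod(1.0 - c for c in cs)
        for fact, cs in confs.items()
    }
```

Under this assumption, a fact predicted by two rules of confidence 0.5 scores 0.75 and outranks a fact predicted by a single rule of confidence 0.6, whereas the naive scheme would rank them the other way around.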
We report the cumulative precision of our predictions using the naive and the enhanced experimental setup. For the first case, we use both the PCA and standard confidence as ranking metrics. For the enhanced experimental setup we use only the PCA confidence as ranking metric. We verified the correctness of the predictions by automatically looking them up in YAGO2s or by manual evaluation in Wikipedia and the Web. We produced around 400K predictions, so we estimated the real precision using a sample.
- Predictiveness comparison between standard and PCA confidence
- Sample and evaluation of predictions used to calculate the precision.
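One simple way to compute a cumulative precision curve over a ranked list of evaluated predictions is sketched below. Whether the reported curves are computed per prediction or per rule is not stated on this page, so the per-prediction granularity here is an assumption, and `cumulative_precision` is a hypothetical helper name.

```python
def cumulative_precision(ranked_labels):
    """Precision after each position of a ranking.

    `ranked_labels` lists, from highest to lowest score, whether each
    evaluated prediction turned out to be correct (True/False).
    """
    curve = []
    correct = 0
    for position, is_correct in enumerate(ranked_labels, start=1):
        correct += bool(is_correct)
        curve.append(correct / position)
    return curve
```

For example, `cumulative_precision([True, True, False, True])` yields 1.0, 1.0, 2/3 and 0.75 at the four positions.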
AMIE vs ALEPH
We also conducted a series of experiments to compare the predictive power of AMIE and the PCA confidence against ALEPH and its positives-only confidence metric. ALEPH is an ILP (Inductive Logic Programming) rule miner written in Prolog that implements multiple confidence metrics. The positives-only confidence metric is the only one suitable for our setup, as it does not require explicit counter-examples, unlike traditional ILP methods.
- Predictiveness comparison between AMIE and ALEPH
- All input and output files can be downloaded here
- Luis Galárraga, Christina Teflioudi, Fabian Suchanek, Katja Hose
Fast Rule Mining in Ontological Knowledge Bases with AMIE+.
VLDB Journal 2015. [pdf]
- Luis Galárraga
Interactive Rule Mining in Knowledge Bases.
31ème Conférence sur la Gestion de Données (BDA 2015), Île de Porquerolles, 2015.
- Luis Galárraga, Danai Symeonidou, Jean-Claude Moissinac.
Rule Mining for Semantifying Wikilinks.
Linked Open Data Workshop (LODW 2015). [pdf]
- Luis Galárraga
Applications of Rule Mining in Knowledge Bases.
Invited paper at the 7th Workshop for Ph.D. Students at CIKM 2014. [pdf]
- Luis Galárraga, Geremy Heitz, Kevin Murphy, Fabian Suchanek
Canonicalizing Open Knowledge Bases.
Conference on Information and Knowledge Management (CIKM 2014). [pdf]
- Luis Galárraga, Nicoleta Preda, Fabian M. Suchanek
Mining Rules to Align Knowledge Bases.
Workshop on Automated Knowledge Base Construction (AKBC 2013) at CIKM 2013 [pdf]
- Luis Galárraga, Christina Teflioudi, Fabian Suchanek, Katja Hose
AMIE: Association Rule Mining under Incomplete Evidence in Ontological Knowledge Bases.
20th International World Wide Web Conference (WWW 2013). Best student paper award of the conference [pdf]
AMIE accepts RDF files in TSV format (like this). To run it, type the following on your command line:
java -jar amie+.jar [TSV file]
In case of memory issues, try increasing the virtual machine's memory resources using the arguments -XX:-UseGCOverheadLimit -Xmx[MAX_HEAP_SPACE], e.g.:
java -XX:-UseGCOverheadLimit -Xmx2G -jar amie+.jar [TSV file]
MAX_HEAP_SPACE depends on your input size and the system's available memory. The package also contains the utilities to generate and evaluate predictions from the rules mined by AMIE. Without additional arguments, AMIE+ applies a PCA confidence threshold of 0.1 and a head coverage threshold of 0.01. You can change these default settings. Run java -jar amie+.jar (without an input file) to see a detailed description of the available options.
|AMIE+ Source code and documentation||2015-08-26||Package containing AMIE's source code as well as its library dependencies and documentation.|
* This program is released under the terms of the Creative Commons Attribution-NonCommercial license v3.0.