Watermarking and Provenance for Ontologies

This project is developed jointly with the DBWeb team of Télécom ParisTech.

Motivation

A great number of ontologies are available on the Internet, including numerous large RDF ontologies. These ontologies are often available for free. In most cases, however, their use is governed by a license: if users re-publish the data or part of it, they have to give credit to the creators of the original ontology. If they do not, this constitutes ontology plagiarism. In some cases, re-publication may even be prohibited completely (e.g., for commercially licensed ontologies).

This raises the question of how we can prove that someone re-published our data. Since ontological statements are usually world knowledge, there is no way to show that someone took the data from us: the other party might just as well have taken the data from a different source. They might even claim that we took the data from them. In short, there is no established means of proving data provenance in ontologies.

We propose to address this problem through watermarking, with two approaches: Additive Watermarking and Subtractive Watermarking. This work was done in the Webdam Project at INRIA Saclay in France.

Additive Watermarking

Additive Watermarking works by adding a small number of wrong statements ("fake facts") to the ontology. If these fake facts appear in another ontology, then that ontology has most likely taken its data from ours. The fake facts have to be plausible enough not to be spotted by a machine or by a human. At the same time, they must not be so plausible that they are actually true. We provide a theoretical analysis of how many facts have to be added to ensure plausibility and security at the same time. A minimal sketch of the basic idea is given below.
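The following Python sketch (using rdflib) only illustrates the general idea of embedding and later detecting fake facts; it is not the procedure from our paper. The namespace, the fake triples, and the file names are made-up placeholders.

```python
# Illustrative sketch only: embedding "fake facts" into an RDF graph with rdflib.
# The namespace, the fake triples, and the file names are hypothetical placeholders.
from rdflib import Graph, URIRef

EX = "http://example.org/"  # hypothetical namespace

# Hypothetical fake facts: plausible-looking but deliberately wrong statements.
FAKE_FACTS = [
    (URIRef(EX + "Jean_Dupont"), URIRef(EX + "bornIn"), URIRef(EX + "Lyon")),
    (URIRef(EX + "Maria_Rossi"), URIRef(EX + "graduatedFrom"), URIRef(EX + "Uppsala_University")),
]

def watermark_additive(in_path: str, out_path: str) -> None:
    """Add the fake facts to the ontology and write out the watermarked copy."""
    g = Graph()
    g.parse(in_path, format="nt")      # load the original ontology
    for triple in FAKE_FACTS:
        g.add(triple)                  # insert the watermark statements
    g.serialize(destination=out_path, format="nt")

def check_additive(suspect_path: str) -> int:
    """Count how many of our fake facts occur in a suspect ontology."""
    g = Graph()
    g.parse(suspect_path, format="nt")
    return sum(1 for triple in FAKE_FACTS if triple in g)

if __name__ == "__main__":
    watermark_additive("ontology.nt", "ontology_watermarked.nt")
    hits = check_additive("suspect.nt")
    print(f"{hits}/{len(FAKE_FACTS)} fake facts found in the suspect ontology")
```

The more of these fake facts reappear in a suspect ontology, the less likely it is that the overlap is coincidental.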

The main objection to this approach is that it compromises the data quality of the ontology. It is true that watermarking is always a trade-off between data quality and the ability to prove provenance. However, our technique needs to add only very few fake facts, usually a handful or a dozen. Large, automatically constructed RDF ontologies contain thousands of wrong facts anyway. YAGO, for example, one of the ontologies with a particularly rigorous quality assessment, has a guaranteed correctness of 95%. Since YAGO contains millions of facts, this still means that thousands of them are wrong. Adding a few more may be a worthwhile trade-off.

For more information, see our paper:

Subtractive Watermarking

Subtractive Watermarking works by removing a small number of statements from the ontology and publishing it without them. This creates a pattern of "holes", like in a cheese. If this pattern of holes appears in another ontology, then its data has likely been taken from the source ontology. A minimal sketch of this idea follows below.
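As a rough, hedged illustration (again not the exact procedure from our paper), the statements to withhold could be selected deterministically with a keyed hash, so that the same hole pattern can be recomputed later when checking a suspect ontology. The secret key, the hole rate, and the file names below are illustrative assumptions.

```python
# Illustrative sketch only: withholding a keyed pseudo-random set of statements
# ("holes") and later checking whether a suspect ontology also lacks them.
import hashlib
from rdflib import Graph

SECRET_KEY = b"my-secret-watermark-key"   # hypothetical secret known only to the publisher
HOLE_RATE = 1 / 10_000                    # illustrative fraction of statements to withhold

def is_hole(triple, key: bytes = SECRET_KEY, rate: float = HOLE_RATE) -> bool:
    """Decide deterministically, via a keyed hash, whether a triple is withheld."""
    digest = hashlib.sha256(key + " ".join(map(str, triple)).encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big") / 2**64   # map the hash to [0, 1)
    return value < rate

def watermark_subtractive(in_path: str, out_path: str, holes_path: str) -> None:
    """Publish the ontology without the hole statements; keep the holes for later checks."""
    original, published, holes = Graph(), Graph(), Graph()
    original.parse(in_path, format="nt")
    for triple in original:
        (holes if is_hole(triple) else published).add(triple)
    published.serialize(destination=out_path, format="nt")
    holes.serialize(destination=holes_path, format="nt")

def hole_overlap(suspect_path: str, holes_path: str) -> tuple[int, int]:
    """Count how many of the withheld statements a suspect ontology contains.
    A suspect that copied our published data should contain almost none of them."""
    suspect, holes = Graph(), Graph()
    suspect.parse(suspect_path, format="nt")
    holes.parse(holes_path, format="nt")
    contained = sum(1 for triple in holes if triple in suspect)
    return contained, len(holes)

if __name__ == "__main__":
    watermark_subtractive("ontology.nt", "ontology_published.nt", "holes.nt")
    contained, total = hole_overlap("suspect.nt", "holes.nt")
    print(f"{contained} of {total} withheld statements appear in the suspect ontology")
```

The keyed hash is just one way to obtain a reproducible hole pattern; the point is that the publisher can recompute which statements were withheld and check whether a suspect ontology is missing exactly those.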

The main advantage of this approach is that it does not compromise the precision of the data: it only removes statements. The Semantic Web is governed by the Open World Assumption, under which the absence of a statement implies neither its truth nor its falsehood. Thus, removing a statement does not affect the correctness of the data, although it does affect its completeness. As always, watermarking remains a trade-off between the quality of the data and the ability to prove provenance.

For more information, see our paper: