
Citation Influence Project


Get your bearings in the world of research!


Code and Data


Java source

The code is a bit messy but you should have a look at the directory
de\hu_berlin\wm\torel\topicextraction\citetopic\sampler\citinf\
which contains a data structure, the sampler, and a wrapper that
translates the words into integers.

Labels

We asked authors to rate the influence of the papers they cited on a ++/+/-/-- scale (see instructions).

Citeseer Publication Data

We crawled the authors' papers and citations from CiteSeer. Since CiteSeer's coverage is incomplete, some citations are missing.

Format:
citeseer_title: <pubid> <title of the publication> 
citeseer_text: <pubid> <abstract>
citeseer_pub2authorid: <pubid> <authorid> (multiple authors are reflected in multiple lines starting with the same pubid)
citeseer_id2author: <authorid> <author's name>
citeseer_cites: <citing pubid> <cited pubid> (multiple citations are reflected in multiple lines starting with the same citing pubid)
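
The file formats above can be read with a few lines of code. The following is a minimal sketch, assuming that fields are separated by whitespace and that each record sits on one line; the page does not state the delimiter, so adjust the splitting if the files use tabs or another separator. The function names and the sample records are made up for illustration.

```python
from collections import defaultdict


def parse_cites(lines):
    """Map each citing pubid to the list of pubids it cites."""
    cites = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        citing, cited = parts
        cites[citing].append(cited)
    return dict(cites)


def parse_id_text(lines):
    """Map an id to the free text after it (title, abstract, or author name)."""
    table = {}
    for line in lines:
        parts = line.strip().split(None, 1)  # split off the leading id only
        if len(parts) == 2:
            table[parts[0]] = parts[1]
    return table


# Made-up records in the assumed layout, not taken from the real data set.
sample_cites = ["101 7", "101 12", "205 7"]
sample_titles = ["7 A Hypothetical Title", "12 Another Hypothetical Title"]
print(parse_cites(sample_cites))
print(parse_id_text(sample_titles))
```

With real data, pass an open file handle instead of the sample lists, for example `parse_cites(open("citeseer_cites"))`.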



Motivation


Will Rogers (1879-1935) used to say: "You know everybody is ignorant, only on different subjects." Obviously, even experts cannot know all domains perfectly well. Today's work demands that experts read up on new topics frequently and quickly. The citation influence project supports this task by visualizing concise and meaningful evolution charts of scientific ideas: a useful tool for students, researchers, and engineers in industry.

When reading up on a new topic, the setting is as follows. One starts with a set of initial publications, usually provided by an adviser or found through a search engine. Soon questions arise, such as: Which of the cited publications inspired the ideas in the initial publications? Which publications indirectly influence others? What other ideas followed from the initial literature? Which related ideas have been worked out? What are the connections to high-impact work? How is the scientific landscape structured?

To satisfy such information needs, many people have suggested drawing citation graphs. But since publications contain many citations, even a radius of two citations around a pivotal paper covers hundreds of publications. Such a graph is too big to give any insight.

An even larger issue with citation graphs is that cited literature is often only loosely related. Papers are cited for different reasons, which results in different strengths of influence on the citing paper. For example, cited work that is extended by the citing publication has a stronger influence than work that is cited as background reading. When exploring a new research community in order to read up on a new domain, one is primarily interested in work with a strong influence.

In order to prevent information glut, we propose to visualize only a subgraph of the full citation graph. The subgraph we are interested in contains publications that are linked to the initial publications via highly influential citations. To find the strong citation links, we developed an approach for unsupervised prediction of citation influences.

The approach devises a new probabilistic model that explains the generation of documents and their cited publications. The intuition behind the model is that parts of the content of a citing publication are taken from the publications it cites. It is assumed that the more text is associated with one citation, the stronger its influence on the citing publication. The model allows some passages to remain unassociated with any cited publication; such passages represent innovative aspects. The model incorporates an evolutionary process in which words represent topics, and it is topics, rather than literal text, that are copied from the cited publications. Topic model, citation links, strength of influence, and innovation are estimated jointly, so that each estimate benefits from the others.

In contrast to "Authority and Hubs" methods, which predict global influences (i.e. authorities), our approach assumes local influence patterns, i.e. authorities may have a low influence on some publications.

Our initial work on this subject has been published at the 24th International Conference on Machine Learning, 2007 (see below).


Citation Influence Browser


Screenshot

The Citation Influence Browser is an application front end for browsing precalculated citation graphs.

The application is designed around the information needs that arise when reading up on a new topic, such as: Which of the cited publications inspired the initial publications? Which publications inspired the initial literature indirectly? What other ideas followed from the initial literature? Which related ideas have been worked out? What are the connections to high-impact work?

After the user selects a precalculated citation graph centered on some initial publications, the application visualizes the citation vicinity. The strength of influence of a citation link is depicted by the thickness of the link in the citation graph. The main topics of each publication are represented by a spectrum of colors. The user is invited to explore the web of publications at varying levels of detail, filtering out links with low influence. The full citation vicinity of selected publications can be displayed on demand. Further information about a publication is provided, such as its full title, authors, abstract, and the link to the corresponding web page on CiteSeer. A fine-grained view of a single citing publication indicates which parts of its abstract match one of its cited publications.


Download, Installation and Usage


Installation

  1. You must have Java 1.5 (or later) installed and on your path.
  2. You must have Graphviz installed, with its dot tool on your path. Your Graphviz version must support HTML labels.
  3. Download CitInfViewer.jar
  4. Download any of the computed citation graphs (*.result files)
  5. Put jar and result files in the same directory (the process must have write access to this directory)

Usage

Execute java -jar CitInfViewer.jar or double-click CitInfViewer.jar.

Use File -> Load Citation Graph to load one of the *.result files.

In the center you see the citation graph. The boxes represent publications and an arc represents an influence: the thicker the arc, the stronger the predicted influence. Note that the influence arc is the reverse of the cites relation: a citing publication is potentially influenced by the publications it cites. You may notice that some cited publications are missing. This is probably because the CiteSeer corpus, on which we rely here, does not contain those papers.

The colored bars of the publications represent the topic spectrum. Each color represents a common topic (such as 'football' or 'knitting'), and the more of a topic's color is shown in the bar, the more the publication is about that topic. Each publication is typically about several topics.

In the bottom panel, you choose the zoom factor (by setting the width/height of the image). If the graph contains too many publications, you may want to increase the filter threshold. All influence arcs with an influence strength below the threshold are removed from the picture. If you enter some publication IDs (the numbers inside the boxes) into the Seed PubIds field, the view is restricted to the clusters containing these seed publications. You confirm any changes to these settings by clicking update.
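
The effect of the filter threshold and the Seed PubIds field can be sketched as follows. This is an illustration only, not the viewer's actual code: it assumes that a "cluster" is a connected component of the graph that remains after low-influence arcs are removed, and the function name, argument layout, and sample arcs are hypothetical.

```python
from collections import defaultdict, deque


def filter_view(arcs, threshold, seeds=None):
    """Return the arcs that stay visible.

    arcs: list of (cited_id, citing_id, strength) tuples.
    threshold: arcs with strength below this value are dropped.
    seeds: optional publication ids; if given, keep only the clusters
           (assumed here: connected components) that contain a seed.
    """
    kept = [arc for arc in arcs if arc[2] >= threshold]
    if not seeds:
        return kept

    # Undirected adjacency over the surviving arcs.
    neighbours = defaultdict(set)
    for cited, citing, _ in kept:
        neighbours[cited].add(citing)
        neighbours[citing].add(cited)

    # Breadth-first search outwards from every seed.
    visible = set(seeds)
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in visible:
                visible.add(nxt)
                queue.append(nxt)

    return [arc for arc in kept if arc[0] in visible and arc[1] in visible]


# Made-up arcs: (cited, citing, predicted influence strength).
arcs = [(1, 2, 0.6), (2, 3, 0.1), (4, 5, 0.7)]
print(filter_view(arcs, 0.3))             # the weak arc 2 -> 3 is dropped
print(filter_view(arcs, 0.3, seeds=[1]))  # only the cluster around publication 1
```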

You can export the citation graph as currently viewed to a GIF image by using File -> Export Image.

The right panel displays further information about the currently selected publication (to select a publication, click on it). You can view the publication's web page on CiteSeer in your standard browser (View in CiteSeer); highlight and scroll back to this publication if you have lost your bearings in the citation graph (Scroll into View); or remove the selection (Clear Selection).

For the selected node, you may view the full neighborhood (even those cited publications whose influence is below the given threshold) by clicking Expand Node Vicinity. The expanded node is colored in the graph.

Furthermore, by clicking on Analyze Abstract you see titles, authors, and abstracts for the selected node and its cited publications, as well as the numeric values of the relative influence strength. The bottom line shows which words led to a strong citation influence being inferred (to be precise: it lists the words in the abstract of the citing publication that were associated with one of the cited publications). If you think the predicted influence is wrong, or if you want to contribute to my study, you can rate the influence according to your opinion (HTML version only). Use the radio buttons for that and submit your rating. It would be great if you left your name and email address so I can get in contact with you, but this is optional. The Abstract Analysis is written either in an MS Excel readable format (CSV) or in HTML (in which case it is automatically opened in your default web browser). The files will be stored on your hard drive.

Your own Citation Graphs

To get a citation graph about the publications you are interested in, do the following:

  1. Go to the CiteSeer web site
  2. Collect the URLs of the papers you are interested in (up to 30 is okay)
  3. Send me those URLs in an email, along with a note on whether you are more interested in publications that are cited by your papers or in publications that cite your papers.
  4. Wait. I will send the *.result file back in a few days.

Approach


To estimate the strength of a citation link, we devise a probabilistic model that explains the generation of documents; Gibbs sampling is then used to estimate latent parameters such as the influence strength from abstracts of publications in a citation graph.

Roughly speaking, the influence strength is measured as follows. Each word in the abstract of a citing publication D is associated with one of the cited publications C1, C2, ... . If more words are associated with C2 than with C1, then C2 is said to have a stronger influence on D than C1.
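
The counting idea can be sketched in a few lines. This is a simplification for illustration, not the sampler itself: it assumes that a single, fixed assignment of words to cited publications is already given, whereas the model estimates these associations by Gibbs sampling and also lets words stay unassociated. The function name and the sample assignment are hypothetical.

```python
from collections import Counter


def influence_strengths(word_assignments):
    """Share of the citing abstract's words associated with each cited publication.

    word_assignments: one label per word of D's abstract, naming the cited
    publication that the word is associated with.
    """
    counts = Counter(word_assignments)
    total = sum(counts.values())
    return {cited: n / total for cited, n in counts.items()}


# Made-up assignment for a five-word abstract of D.
assignments = ["C1", "C2", "C2", "C1", "C2"]
strengths = influence_strengths(assignments)
print(strengths)
print(max(strengths, key=strengths.get))  # the cited publication with most words
```

Here three of the five words are associated with C2 and two with C1, so C2 comes out as the stronger influence on D.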

The association of words is done using an underlying probabilistic topic model. For this purpose, a number of word clusters is extracted from all publications, so that the words within one cluster co-occur frequently (such as foot and ball). Since such a cluster contains words used in a similar topical context, it is referred to as "a topic". For each cited publication C, its mixture of topics is estimated, such as being 20% about education and 80% about soccer.

A word is then associated with a cited publication if it fits that publication's topic mixture, that is, if it is likely under the word clusters of the topics the publication is about. After the association, the word is treated as if it occurred in the cited publication.
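
One way to make "fits the topic mixture" concrete is to score a word against each cited publication by summing, over topics, the publication's share of the topic times the word's probability under that topic. The sketch below does this and normalizes the scores across the cited publications. It is an assumption made for illustration and not the exact update used by the sampler; the function name, data layout, and all numbers except the 20%/80% mixture are made up.

```python
def association_probabilities(word, topic_word, cite_topics):
    """Score how well `word` fits each cited publication.

    topic_word:  {topic: {word: p(word | topic)}}
    cite_topics: {cited: {topic: p(topic | cited)}}
    Returns {cited: score}, normalized to sum to 1 when any score is non-zero.
    """
    scores = {}
    for cited, mixture in cite_topics.items():
        scores[cited] = sum(
            share * topic_word[topic].get(word, 0.0)
            for topic, share in mixture.items()
        )
    total = sum(scores.values())
    if total == 0:
        return {cited: 0.0 for cited in scores}  # the word fits no publication
    return {cited: s / total for cited, s in scores.items()}


# Toy topics and mixtures; C1 is 20% education and 80% soccer.
topic_word = {
    "soccer": {"ball": 0.5, "foot": 0.5},
    "education": {"school": 1.0},
}
cite_topics = {
    "C1": {"education": 0.2, "soccer": 0.8},
    "C2": {"education": 1.0},
}
print(association_probabilities("ball", topic_word, cite_topics))
print(association_probabilities("school", topic_word, cite_topics))
```

In this toy setting "ball" can only be associated with C1, while "school" leans towards C2 because C2 is entirely about education.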

Topic mixtures, word clusters, and the association of words in citing publications to cites are all estimated together, and thus influence each other.

For more details on the approach see our paper or watch the talk.

Dietz, Laura; Bickel, Steffen; Scheffer, Tobias: Unsupervised Prediction of Citation Influences. In: Proceedings of the 24th International Conference on Machine Learning. Corvallis, Oregon, USA, June 2007.

Publication repositories contain an abundance of information about the evolution of scientific research areas. We address the problem of creating a visualization of a research area that describes the flow of topics between papers, quantifies the impact that papers have on each other, and helps to identify key contributions. To this end, we devise a probabilistic topic model that explains the generation of documents; the model incorporates the aspects of topical innovation and topical inheritance via citations. We evaluate the model's ability to predict the strength of influence of citations against manually rated citations.
