
Class Consistency

Class Consistency used as a Measure of Goodness of 2-D Orthogonal Projections

Many visualization techniques map high-dimensional data spaces to lower-dimensional views. Unfortunately, mapping a high-dimensional data space into a scatterplot loses information or, even worse, can give a misleading picture of valuable structure in higher dimensions. We propose class consistency as a measure of the quality of the mapping. Class consistency enforces the constraint that classes of n-D data are shown clearly in 2-D scatterplots. We propose two quantitative measures of class consistency, one based on the distance to the class's center of gravity, and another based on the entropies of the spatial distributions of classes. We performed an experiment in which users chose good views, and show that class consistency has good precision and recall. We also evaluate both consistency measures over a range of data sets and show that these measures are efficient and robust.

Figure 1: Class consistency scores the utility of a 2-D view to faithfully convey a class structure to the user

The data set contains three clusters representing three classes of wine, and 13 attributes describing chemical properties of the wine. The left figure shows the scatterplot for the dimensions alcohol and flavanoids. The classes are separated in this view and most data points are located close to their class centers, resulting in a consistent view (consistency score 90). In the right figure, by contrast, the scatterplot of the dimensions ash and magnesium shows classes that are cluttered and not separated, resulting in a poor consistency rating (consistency score 49).
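The distance-based measure can be illustrated with a minimal sketch. It assumes one plausible reading of the description above: a point counts as consistent when the nearest class center of gravity in the 2-D view belongs to its own class, and the score is the percentage of such points. The function name distance_consistency and the toy coordinates are hypothetical and are not taken from the paper or the tool.

```python
import math
from collections import defaultdict


def distance_consistency(points, labels):
    """Percentage of 2-D points whose nearest class centroid is their own.

    Assumption: this is one reading of "distance to the class's center of
    gravity"; the exact definition used by the authors may differ.
    """
    if not points or len(points) != len(labels):
        raise ValueError("points and labels must be non-empty and equal in length")

    # Group the projected points by class label.
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)

    # Center of gravity (mean x, mean y) of every class in this view.
    centroids = {
        label: (
            sum(x for x, _ in members) / len(members),
            sum(y for _, y in members) / len(members),
        )
        for label, members in groups.items()
    }

    # Count points that lie closest to the centroid of their own class.
    consistent = 0
    for point, label in zip(points, labels):
        nearest = min(centroids, key=lambda c: math.dist(point, centroids[c]))
        if nearest == label:
            consistent += 1

    return 100.0 * consistent / len(points)


# Hypothetical toy view: two well-separated classes along the x axis.
separated = distance_consistency([(0, 0), (1, 0), (10, 0), (11, 0)], ["a", "a", "b", "b"])

# Same points, but one "a" point sits inside the "b" cluster.
mixed = distance_consistency([(0, 0), (1, 0), (10, 0), (11, 0)], ["a", "a", "b", "a"])
```

In this toy data the separated view scores higher than the mixed one, which mirrors the contrast between the left and right scatterplots in Figure 1.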


Application Scenario: Exploring Large Scatterplot Matrices

Figure 2 shows a typical analysis scenario. Mapping the 159 dimensions of the WHO data space into 2-D scatterplots results in over 12,000 unique views of the 6 HIV risk groups. An analyst typically inspects views until interesting patterns are found. Views that are cluttered or in which the clusters mix provide little insight and are often considered uninteresting. Clearly, a human analyst with limited attention cannot afford to look at every scatterplot in such a huge scatterplot matrix (SPLOM) to explore the mutual relationships of the HIV risk groups. After the consistency threshold is set to 80, nearly 97% of the scatterplots are faded out. Figure 2 shows a small part of the SPLOM of the 159-dimensional WHO data set. Scatterplots with low consistency scores are faded out, and even the distribution of highlighted views across the SPLOM can reveal relations. In the WHO SPLOM, many rows exclusively contain views with high consistency scores. A closer look at the dimension of one of these rows shows, surprisingly, that total expenditure on health as a percentage of gross domestic product separates the high-risk and low-risk clusters well. Besides this filtering step, our method allows views to be ranked from high to low consistency values, as shown in this figure. Our system also allows the user to identify regions of interest in the SPLOM. The user may then zoom into these regions for detailed analysis of individual scatterplots.
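The filtering and ranking steps can be sketched as follows. This is only an illustration under stated assumptions: the helper names unique_views and filter_and_rank are hypothetical, and treating a score equal to the threshold as highlighted is a guess rather than the documented behavior of the tool. The two example scores are the ones reported for the wine data in Figure 1, not WHO data.

```python
from itertools import combinations
from math import comb


def unique_views(dimensions):
    """One scatterplot per unordered pair of dimensions."""
    return list(combinations(dimensions, 2))


def filter_and_rank(scores, threshold):
    """Split views into highlighted (ranked high to low) and faded-out.

    Assumption: views scoring at or above the threshold stay highlighted.
    """
    highlighted = sorted(
        (view for view, score in scores.items() if score >= threshold),
        key=lambda view: scores[view],
        reverse=True,
    )
    faded = [view for view, score in scores.items() if score < threshold]
    return highlighted, faded


# 159 dimensions give 159 * 158 / 2 unordered pairs, i.e. "over 12,000" views.
n_views = comb(159, 2)

# Consistency scores of the two wine views from Figure 1.
scores = {("alcohol", "flavanoids"): 90, ("ash", "magnesium"): 49}
highlighted, faded = filter_and_rank(scores, threshold=80)
```

With the threshold of 80 used in the scenario, the alcohol/flavanoids view stays highlighted and the ash/magnesium view is faded out.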

Figure 2: Class consistency is used to fade out poor scatterplots (visual interface)




Downloads



Paper

 


EuroVis'09 Slides

 
Tool (coming soon)