Many graphical methods for displaying multivariate data consist of arrangements of multiple displays of one or two variables: scatterplot matrices and parallel coordinates plots are two such methods. In principle these methods generalize to arbitrary numbers of variables but become difficult to interpret for even moderate numbers of variables. This article demonstrates that the impact of high dimensions is much less severe when the component displays are clustered together according to some index of merit. Effectively, this clustering reduces the dimensionality and makes interpretation easier. For scatterplot matrices and parallel coordinates plots clustering of component displays is achieved by finding suitable permutations of the variables. I discuss algorithms based on cluster analysis for finding permutations, and present examples using various indices of merit.
Key Words: Parallel coordinates; Permutation of variables; Projection pursuit; Scatterplot matrices.
1. INTRODUCTION
Datasets of three or more dimensions are notoriously difficult to display on a two-dimensional screen or on a piece of paper. Many graphical methods for displaying multivariate data consist of arrangements of multiple displays of one or two variables--for example, a scatterplot matrix consists of all pairwise scatterplots of two variables arranged in a square matrix, and a parallel coordinates display is a sequence of one-dimensional dotplots where line segments are drawn to connect the dots pertaining to a particular case. While in principle these methods generalize to arbitrary numbers of variables, in practice as the dimensions increase, they become less effective, presenting us with an overwhelming amount of information that is difficult to absorb. Usually, the ordering of the variables in these displays is arbitrary and corresponds to the order in which the variables were listed in the data file. However, the interpretability and effectiveness of visualizations often improve dramatically when the variables are reordered in some systematic way.
A scatterplot matrix shows all pairwise scatterplots of p variables, while a parallel coordinate display shows p - 1 of the p(p - 1)/2 pairwise line plots. Some of these pairwise plots are more interesting or informative than others, and an effective visualization should help us to focus on these. Our basic idea is that each pairwise display (a panel) is awarded a merit score measuring its "interestingness." Then the variables are reordered so that the viewer's attention will be focused on the most interesting panels, which are placed in prominent positions. For the scatterplot matrix, we consider positions close to the diagonal to be the most prominent, while for the parallel coordinate display interesting panels should be among the p - 1 visible panels. Suitable merit measures will depend on the context of the data and the type of display, but correlation is often a good starting point. Then the visualizations will help us identify clusters of similar (highly correlated) variables, effectively reducing the dimensionality of the visualization problem.
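The panel-merit idea can be sketched in a few lines of Python. This is a minimal illustration, not the article's implementation: it assumes absolute Pearson correlation as the merit index (the text names correlation only as a starting point), and the function names and toy data are hypothetical. It scores all p(p - 1)/2 pairs of variables and lists the p - 1 panels that a parallel coordinate display shows for a given variable order.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def panel_merits(columns):
    """Merit of every panel (i, j), i < j.

    Assumption: merit = |correlation|; other indices could be swapped in.
    """
    return {
        (i, j): abs(pearson(columns[i], columns[j]))
        for i, j in combinations(range(len(columns)), 2)
    }


def visible_panels(order):
    """Panels shown by a parallel coordinate display: adjacent variables only."""
    return [tuple(sorted(pair)) for pair in zip(order, order[1:])]


# Toy data (made up for illustration): 4 variables, 5 cases each.
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [5.0, 3.0, 4.0, 1.0, 2.0],
    [2.0, 4.1, 5.9, 8.0, 10.1],  # nearly linear in variable 0
    [3.0, 1.0, 4.0, 1.0, 5.0],
]

merits = panel_merits(columns)          # one score per pair of variables
shown = visible_panels([0, 1, 2, 3])    # default (data file) order

for pair in sorted(merits, key=merits.get, reverse=True):
    flag = "visible" if pair in shown else "hidden"
    print(pair, round(merits[pair], 3), flag)
```

In this toy example the highest-merit panel, the pair (0, 2), is not adjacent in the default order, so a parallel coordinate display would not show it until the variables are reordered.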
Ideally, the panel merit scores are combined into an overall merit score for the entire display. We could then find the permutation of the variables maximizing this overall score. A brute-force approach to solving this problem evaluates the criterion on all p! possible permutations of the variables, but this is slow except for small numbers of variables. Because our goal is effective data visualization, it is probably better to find a good display quickly than to wait for a slightly better, optimal one. Therefore, we use a fast ad-hoc algorithm based on cluster analysis (Gruvaeus and Wainer 1972) to come up with suitable permutations of the variables. In our experience the resulting visualizations are often far more effective than those using standard variable order.
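To make the trade-off concrete, the following Python sketch contrasts an exhaustive search with a cluster-based heuristic for a parallel coordinate display. Several things here are assumptions rather than the article's method: the overall score is taken to be the sum of the merits of adjacent variables, clusters are merged by average linkage, and the orientation step (flip each cluster so that the endpoints with the highest merit meet) is only in the spirit of Gruvaeus and Wainer (1972), not a faithful reproduction. All names and the merit matrix are hypothetical.

```python
from itertools import permutations


def path_merit(order, merit):
    """Assumed overall score: sum of merits of adjacent variables."""
    return sum(merit[a][b] for a, b in zip(order, order[1:]))


def brute_force_order(merit):
    """Evaluate all p! permutations; feasible only for small p."""
    return max(permutations(range(len(merit))),
               key=lambda order: path_merit(order, merit))


def cluster_order(merit):
    """Agglomerative ordering heuristic (simplified sketch).

    Each cluster is kept as an ordered list of variables. At every step
    the two clusters with the highest average pairwise merit are merged,
    trying all four orientations and keeping the one whose touching
    endpoints have the highest merit.
    """
    clusters = [[i] for i in range(len(merit))]

    def linkage(c1, c2):
        return sum(merit[a][b] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = max(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        left, right = clusters[i], clusters[j]
        # l + r places l[-1] next to r[0]; score that junction.
        candidates = [(merit[l[-1]][r[0]], l + r)
                      for l in (left, left[::-1])
                      for r in (right, right[::-1])]
        merged = max(candidates, key=lambda c: c[0])[1]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0]


# Hypothetical symmetric merit matrix for 4 variables.
merit = [
    [0.0,   0.125, 0.875, 0.25],
    [0.125, 0.0,   0.375, 0.75],
    [0.875, 0.375, 0.0,   0.5],
    [0.25,  0.75,  0.5,   0.0],
]

exact = brute_force_order(merit)
fast = cluster_order(merit)
print(exact, path_merit(exact, merit))
print(fast, path_merit(fast, merit))
```

On this small matrix the heuristic reaches the same overall score as the exhaustive search. That is not guaranteed in general; the heuristic trades optimality for speed, since it avoids the p! evaluations of the brute-force search.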
The problem of choosing an ordering of variables for displays of multivariate data has received surprisingly little attention in the literature. The work of Bertin is an exception in this regard; ordering variables, cases, and categories in so-called "matrix displays" is a major theme of his work (Bertin 1983).
Source: HighBeam Research, Clustering visualizations of multidimensional data.