Visual comparison of datasets using mixture decompositions.

Publication: Journal of Computational & Graphical Statistics

Publication Date: 01-MAR-04

Author: Gous, Alan ; Buja, Andreas

COPYRIGHT 2004 American Statistical Association

This article describes how a mixture of two densities, $f_0$ and $f_1$, may be decomposed into a different mixture consisting of three densities. These new densities, $f_+$, $f_-$, and $f_=$, summarize differences between $f_0$ and $f_1$: $f_+$ is high in areas of excess of $f_1$ compared to $f_0$; $f_-$ represents deficiency of $f_1$ compared to $f_0$ in the same way; $f_=$ represents commonality between $f_1$ and $f_0$. The supports of $f_+$ and $f_-$ are disjoint. This decomposition of the mixture of $f_0$ and $f_1$ is similar to the set-theoretic decomposition of the union of two sets $A$ and $B$ into the disjoint sets $A \setminus B$, $B \setminus A$, and $A \cap B$. Sample points from $f_0$ and $f_1$ can be assigned to one of these three densities, allowing the differences between $f_0$ and $f_1$ to be visualized in a single plot, which serves as a visual hypothesis test of whether $f_0$ is equal to $f_1$. We describe two similar such decompositions and contrast their behavior under the null hypothesis $f_0 = f_1$, giving some insight into how such plots may be interpreted.
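
The abstract gives no formulas, so the following is only a sketch of one decomposition that matches its description. It assumes an equal-weight mixture of the two densities and writes $(g)_+ = \max(g, 0)$. The symbol $\delta$ is introduced here for illustration and is assumed to satisfy $0 < \delta < 1$. The article describes two decompositions, and either of them may differ from this one.

```latex
\delta = \int (f_1 - f_0)_+ \, dx = \int (f_0 - f_1)_+ \, dx ,
\qquad
f_+ = \frac{(f_1 - f_0)_+}{\delta}, \quad
f_- = \frac{(f_0 - f_1)_+}{\delta}, \quad
f_= = \frac{\min(f_0, f_1)}{1 - \delta} ,

\tfrac{1}{2} f_0 + \tfrac{1}{2} f_1
  = \min(f_0, f_1) + \tfrac{1}{2} \lvert f_1 - f_0 \rvert
  = (1 - \delta)\, f_= + \tfrac{\delta}{2}\, f_+ + \tfrac{\delta}{2}\, f_- .
```

Under these assumptions the excess and deficiency densities cannot both be positive at the same point, so their supports are disjoint, which parallels the disjointness of the two set differences in the set-theoretic analogy.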

We present two examples of uses of these methods: visualization of departures from independence, and of a two-class classification problem. Other potential applications are discussed.

Key Words: Classification; Data visualization; Density estimation; Exploratory data analysis; Mixture decomposition.

Editor's note: A color version of this article is available to JCGS subscribers at http://www.ingentaselect.com. It is publicly available at www-stat.wharton.upenn.edu/~buja/.

1. INTRODUCTION

Figure 1 is a plot of n = 329 metropolitan areas in the United States. A score measuring housing cost in the area is plotted on the y-axis, and a score for the quality of the transportation infrastructure is plotted on the x-axis.

[FIGURE 1 OMITTED]

Are these two scores independent of one another? A standard test, for example of zero correlation, confirms that they are not. This is also clear purely from visual evidence, if we compare this plot to Figure 2. The latter plot is of the same data, but with the x-values of all the points randomly permuted, while keeping the y-values fixed. This then is a sample of size n from the permutation distribution defined by the data, that is, the product distribution of the data margins, for which the x- and y-values are independent. There seems to be visual evidence that the two datasets creating the two plots are from different distributions, particularly when comparing the lower center and lower right-hand sides of the plots.
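
The permutation step described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library. The function name `permute_x`, the `seed` parameter, and the toy data are assumptions made for this example and do not come from the article.

```python
import random


def permute_x(points, seed=None):
    """Return a copy of `points` with the x-values randomly permuted.

    The y-values keep their original order, so each margin is unchanged
    while any dependence between x and y is broken.
    """
    rng = random.Random(seed)
    xs = [x for x, _ in points]
    rng.shuffle(xs)  # in-place random permutation of the x-values
    return [(x, y) for x, (_, y) in zip(xs, points)]


# Toy data, invented for illustration (not the 329 metropolitan areas).
data = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0), (4.0, 40.0)]
permuted = permute_x(data, seed=0)
```

Each call with a different seed gives a different draw from the permutation distribution, so a plot built from one draw carries the sampling variation of that draw.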

[FIGURE 2 OMITTED]

Figure 3 plots both sets of data on the same axes, and colors the points according to a scheme which provides immediate visual information about the differences between the two distributions. The combined data are regarded as having been drawn from a mixture of three distributions, and are colored accordingly. The distribution from which the large light points are drawn has been defined so that it has high density where the original density is high compared to the permutation density. The large dark points, on the other hand, are drawn from a distribution defined so that it has high density where the original density is low compared to the (draw from the) permutation density. Areas of large dark or light points in the plot then provide evidence for differences between the two distributions. Points classified as being drawn from the third density in this mixture are drawn small and lightest. This density is defined to be high where the two distributions are similar, and so reflects a sort of consensus between the two.
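
One way such a three-way coloring could be produced is sketched below. The sketch assumes that estimates of the two densities are available at each pooled point and that the pooled data come from an equal-weight mixture of the two. It then assigns each point at random to a component with the corresponding conditional probabilities. The function names, the labels `"="`, `"+"`, `"-"`, and the probability formulas are assumptions for illustration. The article's own algorithm is in its Section 3, which is not part of this excerpt, and may differ.

```python
import random


def component_probs(f0_z, f1_z):
    """Conditional probabilities of the three components at one point.

    f0_z and f1_z are the (estimated) values of the two densities at the
    point. Assumes the point was drawn from the equal-weight mixture.
    """
    total = f0_z + f1_z
    if total <= 0:
        raise ValueError("point has zero density under both distributions")
    return {
        "=": 2.0 * min(f0_z, f1_z) / total,  # commonality
        "+": max(f1_z - f0_z, 0.0) / total,  # excess of f1 over f0
        "-": max(f0_z - f1_z, 0.0) / total,  # deficiency of f1 versus f0
    }


def assign_component(f0_z, f1_z, rng):
    """Randomly assign a point to one of the three components."""
    probs = component_probs(f0_z, f1_z)
    labels = list(probs)
    weights = [probs[label] for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]


# Where f1 is three times f0, half the mass is "common" and half "excess".
example = component_probs(1.0, 3.0)
label = assign_component(1.0, 3.0, random.Random(0))
```

At any point, at most one of the "+" and "-" probabilities is nonzero, so a point is never a candidate for both the light and the dark group.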

[FIGURE 3 OMITTED]

With this interpretation, Figure 3 shows clearly the dependence between the x and y variables: The large light points are funneled from the lower left to the upper right between two groups of dark points in the upper left and lower right. Keeping in mind that the large points indicate preponderance or deficiency of actual compared to permuted data, the plot lets us perceive the nature of the dependence in more detail than the plot of the raw data in Figure 1.

While interpreting the data, we must bear in mind the sampling variation that has been introduced into Figure 3 by the single draw from the permutation distribution used in its creation. The patterns described above do, however, persist through multiple draws. Section 5 presents a second application of this visualization scheme in which such permutation sampling variation does not appear. The reader should keep in mind that the comparison of real and permuted data is just an illustration, albeit a useful one we think, of the proposed method for decomposing and visually comparing two distributions.

There exists a second source of sampling variation, given the original data, which is always present in these visualizations, and which will be described with the algorithm itself in Section 3.

A remark on plotting: Graphs such as Figure 3 are prone to overplotting. The order in which points are drawn is therefore important. It is recommended that points be drawn in reverse order of importance, such that more important points are drawn later to allow them to overplot the previously drawn less important points. In the present situation this means plotting large points after the small ones because agreement between two distributions is less informative than their disagreement, represented here by the large points.
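
The drawing-order recommendation amounts to sorting points by importance before plotting. The sketch below is a minimal illustration. The labels `"="`, `"+"`, `"-"` and the function name `draw_order` are assumptions made for this example.

```python
def draw_order(labels):
    """Return point indices in the order they should be drawn.

    Points labeled "=" (agreement, drawn small) come first. Points labeled
    "+" or "-" (disagreement, drawn large) come last so that they overplot
    the less informative points. The sort is stable, so points with equal
    importance keep their original relative order.
    """
    importance = {"=": 0, "+": 1, "-": 1}
    return sorted(range(len(labels)), key=lambda i: importance[labels[i]])


# The two "=" points (indices 1 and 3) are drawn before the large points.
order = draw_order(["+", "=", "-", "="])
```

A plotting routine would then iterate over `order` and draw each point in turn, or equivalently plot the small points in one call followed by the large points in a second call.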

This...
