AccessMyLibrary : Search Information that Libraries Trust AccessMyLibrary | News, Research, and Information that Libraries Trust

AccessMyLibrary    Browse    G    Geographical Analysis    A genetic approach to detecting clusters in point data sets.

A genetic approach to detecting clusters in point data sets.

Publication: Geographical Analysis

Publication Date: 01-JUL-05

Author: Conley, Jamison ; Gahegan, Mark ; Macgill, James
How to access the full article: Free access to all articles is available courtesy of your local library. To access the full article click the "See the full article" button below. You will need your US library barcode or password.

Bookmark this article

Print this article

Link to this article

Email this article

Digg It!

Add to del.icio.us

RSS

COPYRIGHT 2005 Ohio State University Press

Spatial analysis techniques are widely used throughout geography. However, as the size of geographic data sets increases exponentially, limitations to the traditional methods of spatial analysis become apparent. To overcome some of these limitations, many algorithms for exploratory spatial analysis have been developed. This article presents both a new cluster detection method based on a genetic algorithm, and Programs for Cluster Detection, a toolkit application containing the new method as well as implementations of three established methods: Openshaw's Geographical Analysis Machine (GAM), case point-centered searching (proposed by Besag and Newell), and randomized GAM (proposed by Fotheringham and Zhan). We compare the effectiveness of cluster detection and the runtime performance of these four methods and Kulldorf's spatial scan statistic on a synthetic point data set simulating incidence of a rare disease among a spatially variable background population. The proposed method has faster average running times than the other methods and significantly reduces overreporting of the underlying clusters, thus reducing the user's postprocessing burden. Therefore, the proposed method improves upon previous methods for automated cluster detection. The results of our method are also compared with those of Map Explorer (MAPEX), a previous attempt to develop a genetic algorithm for cluster detection. The results of these comparisons indicate that our method overcomes many of the problems faced by MAPEX, thus, we believe, establishing that genetic algorithms can indeed offer a viable approach to cluster detection.

**********

Introduction

As geographic data sets grow both in complexity and size, our current spatial analysis toolboxes must expand to include methods that are more computationally efficient, and methods able to uncover previously unknown patterns or relationships. We concentrate here on methods for detecting and analyzing geographical clusters. Many societal problems, such as understanding urban crime patterns and detecting clusters of rare diseases, require methods for cluster detection and analysis that are able to address very large and complex data sets (Openshaw 1995). It is the purpose of this article to explore the utility of a new cluster detection method based around a genetic algorithm, and to compare its performance with several established methods, with the aim of improving computational efficiency and flexibility, without sacrificing the basic ability to detect and report clusters.

Cluster analysis of point data has been used within geography for many years and for many different purposes. One application that has received much attention is the detection and analysis of clusters of disease (Pinkel and Nefzger 1959; Ederer, Myers, and Mantel 1964; Mantel 1967; Besag and Newell 1991; Ahrens et al. 2001). In this application, it is not sufficient to simply find where the individual cases of the disease are clustered, but to find regions that have high rates of the disease. This requires that the disease rate be compared with the background population that is susceptible to the disease, so that regions with higher-than-expected disease rates can be found. As such, cluster detection methods that fail to account for a spatially variable background population are less useful. Many currently available techniques are also computationally slow and tend to produce many positive solutions (i.e., circles or ellipses) for each cluster in the data set. This overreporting of clusters can lead to a prohibitively large workload for postprocessing. Finally, many methods make assumptions about the distribution and shape of clusters that may not apply to the data set. Therefore, techniques that can account for a spatially variable background population, scale easily to large data sets, minimize or generalize assumptions, and minimize the number of solutions returned per cluster are needed.

This article develops an approach to detecting spatial point clusters in a background population that is based on a genetic algorithm, which is a heuristic search process. Heuristic search is often used in machine learning to approximate a deterministic solution that is too costly or too complex to compute (Mitchell 1997; Gahegan 2000, 2003). We begin by reviewing existing cluster detection methods, focusing on four methods that are particularly relevant in the next section. In the subsequent section, we introduce a new genetic method for cluster detection and PROgrams for CLUster DEtection (PROCLUDE), a toolkit application that contains four cluster detection methods, including our new method. In the penultimate section, we contrast this new method with the four established methods reviewed in the next section in terms of both the solutions obtained and the runtime performance. Although genetic algorithms have been successfully used to address many types of problems, they have shown some promise in cluster detection (Openshaw 1995; Openshaw and Perree 1996; Estivill-Castro 2000) but have been limited due to the difficulties in configuring the various parameters that affect their learning and heuristic search characteristics. Our contribution in this regard is to provide configuration settings and a number of enhancements that allows genetic algorithms to function successfully as a cluster detection tool in spatial settings.

What has been done before?

Statistical methods to detect clustering have been used for many years. For example, quadrat counts have been employed to study plant distributions (Gleason 1920; Skellam 1952; Clark and Evans 1954), land use patterns (Getis 1964), and the distribution of towns (Dacey 1964) to determine whether or not a point pattern is clustered. In recent years, this approach has been made more exploratory to find clusters on a grid of small quadrats, known as grid-based cluster detection (Morphet 1997; Schikuta and Erhart 1997; Wang, Yang, and Muntz 1997). Another traditional approach is the use of nearest-neighbor statistics (Clark and Evans 1954), and this has since been extended to use any number of nearest neighbors (Thompson 1956) and to detect clusters in both space and time (Ederer, Myers, and Mantel 1964; Knox and Bartlett 1964; David and Barton 1966). Except for grid-based cluster detection, these summary statistics typically report a single number indicating whether or not the data set is clustered, but are unable to indicate where the clusters are.

More recent statistical approaches include the [G*.sub.i] statistic (Getis and Ord 1992; Ord and Getis 1995), geographically weighted regression (Brunsdon, Fotheringham, and Charlton 1998), the Moran's / (Moran 1948; Anselin 1995), and model-based approaches (Banfield and Raftery 1993; Celeux and Govaert 1995; Rogerson 2001). As mentioned in the introduction, many cluster detection methods are difficult to use in geographic cluster detection. This is the case for four reasons: (1) the need to account for a spatially variable background population, (2) the exponential increase in the size of data sets, (3) assumptions made by statistical methods, and (4) the need to reduce overreporting of clusters.

In several geographic applications, particularly in the detection of disease clustering, it is important to account for a spatially variable background population. Instead of finding regions with large numbers of cases, analyses should find regions with a greater-than-expected ratio of cases to the background population. To address this concern, Levison and Hadden (1965) alter the standard practice of plotting disease points on a map by instead plotting them on an area-by-population cartogram. Several other methods have been developed to address this problem of an unevenly distributed background population in both traditional statistics (Whittemore et al. 1987; Turnbull et al. 1990; Kulldorff 1997) and more computational approaches (Openshaw et al. 1987; Besag and Newell 1991; Fotheringham and Zhan 1996). Some researchers ignore the background population, claiming that instead of treating a high concentration of disease being "not a cluster when it overlays on a densely populated city," it is instead "a cluster explained by the presence of a city" (Estivill-Castro and Lee 2002, p. 316). However, this treatment disregards a tenet of exploratory data mining: that the algorithm tells the researcher something that the researcher does not already know (Fayyad, Piatetsky-Shapiro, and Smyth 1996). Epidemiologists already expect to find increased numbers of disease in urban areas because of the higher population. A cluster detection method that can account for this expected increase is of greater use than one that consistently reports high incident counts where the explanation is already known.

Also, the size and dimensionality of data sets have rapidly increased in recent years (Gahegan 2003) and these increases can cause many algorithms to become virtually unusable because of the computation time they require. For example, any function that constructs a matrix of the distances between all possible points, such as the K function (Ripley 1976), will have a time complexity of O([n.sup.2]), that is, proportional to the square of the number of points. While improvements have certainly been made, running times for many spatial analysis functions remain large, sometimes prohibitively so, on very large data sets.

Another serious limitation is that statistical assumptions about the data may not be true or applicable. Many analytical functions, even those as simple as Z-scores, assume that the distribution of the data is Gaussian. However, this is not necessarily the case within geography (Gould 1970; Gahegan 2003), where skewed distributions are common. Although some methods accommodate non-Gaussian distributions (Banfield and Raftery 1993), they still require that the researcher know what distribution to apply to the data. In exploratory settings, this distribution, whether Gaussian or not, is often unknown. Also, many methods assume a particular cluster shape, such as a circle, whereas many geographical processes might be reasonably expected to give rise to less regular shapes. While any phenomenon that exhibits distance decay will give an approximately circular distribution around its source, other factors can distort this circle. For example, an air pollution plume may be more triangular with a point near its source instead of a circle centered on the source due to a prevailing wind in the area. Alternatively, water pollution following a stream would be more elongated along the stream instead of circular. In epidemiology, an uneven background population can skew the shape of clusters as well.

The final limitation of many algorithms for cluster detection is that they can return dozens, if not hundreds of potential solutions for...

Read the full article for free courtesy of your local library.


More Articles from Geographical Analysis
An analytical model to assess the visibility of landmarks.
July 01, 2005

What's on AccessMyLibrary?

32,394,273 articles
in the following categories:

Arts, Business, Consumer News, Culture & Society, Education, Government, Personal Interest, Health, News, Science & Technology


© 2008 Gale, a part of Cengage Learning  | All Rights Reserved | About this Service | About The Gale Group, a part of Cengage Learning
                                            Privacy Policy | Site Map | Content Licensing | Contact Us | Link to us
      Other Gale sites: Books & Authors | Goliath | MovieRetriever.com | WiseTo Social Issues