COPYRIGHT 2005 Ohio State University Press
In the analysis of spatially referenced data, interest often focuses not on prediction of the spatially indexed variable itself, but on boundary analysis, that is, the determination of boundaries on the map that separate areas of higher and lower values. Existing boundary analysis methods are sometimes generically referred to as wombling, after a foundational article by Womble (1951). When data are available at point level (e.g., exact latitude and longitude of disease cases), such boundaries are most naturally obtained by locating the points of steepest ascent or descent on the fitted spatial surface (Banerjee, Gelfand, and Sirmans 2003). In this article, we propose related methods for areal data (i.e., data which consist only of sums or averages over geopolitical regions). Such methods are valuable in determining boundaries for data sets that, perhaps due to confidentiality concerns, are available only in ecological (aggregated) format, or are only collected this way (e.g., delivery of health-care or cost information). After a brief review of existing algorithmic techniques (including that implemented in the commercial software BoundarySeer), we propose a fully model-based framework for areal wombling, using Bayesian hierarchical models with posterior summaries computed using Markov chain Monte Carlo methods. We explore the suitability of various existing hierarchical and spatial software packages (notably S-plus and WinBUGS) to the task, and show the approach's superiority over existing nonstochastic alternatives, both in terms of utility and average mean square error behavior. We also illustrate our methods (as well as the solution of advanced modeling issues such as simultaneous inference) using colorectal cancer late detection data collected at the county level in the state of Minnesota.
**********
Background
The analysis and modeling of spatially referenced data sets have occupied statisticians and geographers for decades. Typically, such data consist of variables that have been observed at different spatial locations, and one seeks models that capture possible spatial associations among them. Depending upon the nature of spatial referencing, spatial data are classified as either point-referenced (often called geostatistical) or areal (often called lattice). In the former, the spatial locations are points with known coordinates, such as latitude-longitude or easting-northing pairs. For the latter, locations are usually geographic regions (such as counties, census tracts, or ZIP codes) along with information on the neighborhood structure (contiguity) of these regions.
Statistical models for spatial data depend upon the nature of the spatial referencing. The role of spatial models for point-referenced data has historically been the making of predictions and spatial interpolations over the entire domain, while models for areal data have been used primarily for smoothing raw rate maps to reveal broad spatial trends. Recently, however, spatial analysts have shown a growing interest in detecting zones or boundaries that reveal sharp changes in the values of spatially oriented variables. For example, in contour maps of spatial surfaces, regions with highly compact contour lines indicate zones of abrupt change; significant changes in gradients are observed as one cuts across these contours. Similarly, for areal data, the boundary separating two regions with drastically different measurements or fitted values is a boundary of abrupt change.
The general problem of identifying zones of abrupt change is known as wombling, after a foundational article by Womble (1951) that discusses the scientific importance of this problem. Since then, wombling has become a popular technique among geneticists, demographers, linguists, ecologists, environmental scientists, and many others in analyzing spatial relationships. Notable articles in this area include Barbujani, Jacquez, and Ligi (1990), Barbujani, Oden, and Sokal (1989), Oden et al. (1993), Bocquet-Appel and Bacro (1994), Fortin (1994, 1997), and Fortin and Drapeau (1995), with Jacquez, Maruca, and Fortin (2000) offering an excellent recent review. This literature deals primarily with point-referenced data (regularly or irregularly spaced), often investigating suitable tessellations of the spatial domain and the appropriate interpolators for approximating the surface.
The field we broadly refer to as "wombling" is also known as barrier analysis or edge detection in fields such as landscape topography, systematic biology, sociology, ecology, and public health. In all of these fields, the research goal is to identify regional differences across shared boundaries, in order to identify homogeneous regions or discover important "barriers." Ultimately, the underlying influences responsible for these barriers are typically of greatest scientific interest. For instance, the genetic study by Sokal and Thompson (1998) locates barriers (areas) over which genetic flow (population movement, through changing allele frequencies) is reduced or stopped.
As with spatial statistical models, wombling approaches must depend upon whether the data are point-referenced or areal. A significant amount of both methodological and applied research on wombling with point- (or raster-) level data exists. For example, Bocquet-Appel and Jakobi (1996) used point wombling analysis to identify the barriers to the spatial diffusion of the demographic transition in western Europe. Barbujani, Oden, and Sokal (1989) detected a zone of main discontinuity using an algorithm operating on a local spatial variance statistic.
In studies of human populations, patient confidentiality laws often restrict available data to counts or rates over geopolitical regions. As such, in this article we restrict our attention to the areal data case. Areal wombling (also known as polygonal wombling) is not as well developed in the literature as point or raster wombling, but some notable articles exist. Oden et al. (1993) provide a wombling algorithm for multivariate categorical data defined on a lattice. The statistic chosen is the average proportion of category mismatches at each pair of neighboring sites, with significance relative to an independence or particular spatial null distribution judged by a randomization test. Csillag et al. (2001) developed a procedure for characterizing the strength of boundaries examined at neighborhood level. In this method, a topological or a metric distance δ defines a neighborhood of the candidate set of polygons (say, p_i). A weighted local statistic is attached to each p_i. The difference statistic, calculated as the squared difference between the local statistics of any two sets of polygons, and its quantile measure are used as a relative measure of the distinctiveness of the boundary at the scale of neighborhood size δ. Jacquez and Greiling (2003) estimate boundaries of rapid change for colorectal, lung, and breast cancer incidence in Nassau, Suffolk, and Queens counties in New York.
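The mismatch statistic attributed above to Oden et al. (1993) can be illustrated with a minimal sketch. It is not their implementation: the function names, the toy site profiles, and the choice of an independence null that simply shuffles profiles among sites are all assumptions made for illustration.

```python
import random

def mismatch_proportion(a, b):
    """Proportion of categorical variables on which two sites disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def edge_p_values(sites, edges, n_perm=999, seed=0):
    """Observed mismatch proportion and a randomization p-value per edge.

    Assumed null: site profiles are exchangeable across locations
    (independence), so profiles are shuffled among the site labels.
    """
    rng = random.Random(seed)
    observed = {e: mismatch_proportion(sites[e[0]], sites[e[1]]) for e in edges}
    exceed = {e: 0 for e in edges}
    labels = list(sites)
    profiles = [sites[k] for k in labels]
    for _ in range(n_perm):
        shuffled = profiles[:]
        rng.shuffle(shuffled)
        perm = dict(zip(labels, shuffled))
        for e in edges:
            # Count permutations at least as extreme as the observed value
            if mismatch_proportion(perm[e[0]], perm[e[1]]) >= observed[e]:
                exceed[e] += 1
    # The +1 terms count the observed arrangement as one permutation
    return {e: (observed[e], (exceed[e] + 1) / (n_perm + 1)) for e in edges}

# Hypothetical lattice: four sites, three categorical variables each
sites = {
    "s1": ("a", "x", "m"),
    "s2": ("a", "x", "m"),
    "s3": ("b", "y", "n"),
    "s4": ("b", "y", "m"),
}
edges = [("s1", "s2"), ("s2", "s3"), ("s3", "s4")]
result = edge_p_values(sites, edges)
```

Here `result` maps each pair of neighboring sites to its observed mismatch proportion and a permutation p-value. A spatially constrained null would require a different shuffling scheme than the one assumed here.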
Several criteria for determining what constitutes a "barrier" have been suggested in previous boundary analysis research. Most researchers select a dissimilarity metric (Euclidean distance, squared Euclidean distance, Manhattan distance, Steinhaus statistics, etc.) to measure the difference in response between the values at (say) adjacent polygon centroids. An absolute (dissimilarity metrics greater than C) or relative (dissimilarity metrics in the top k%) threshold then determines which borders are considered actual barriers, or parts of the boundary.
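The two thresholding rules just described can be sketched as follows. This is a minimal illustration rather than any published algorithm: the function name `boundary_elements`, the toy rates, and the adjacency list are hypothetical, and the dissimilarity is taken to be the absolute difference (Euclidean distance for a single response).

```python
import math

def boundary_elements(values, adjacency, absolute=None, top_percent=None):
    """Flag borders between adjacent regions as boundary elements.

    values: region -> response; adjacency: list of (region, region) pairs.
    Supply either an absolute cutoff C or a relative cutoff k (in percent).
    """
    # Dissimilarity on each shared border: absolute difference in response
    diffs = {(i, j): abs(values[i] - values[j]) for i, j in adjacency}
    if absolute is not None:
        # Absolute rule: keep borders whose dissimilarity exceeds C
        return {edge for edge, d in diffs.items() if d > absolute}
    # Relative rule: keep the top k% of borders ranked by dissimilarity
    n_keep = math.ceil(len(diffs) * top_percent / 100)
    ranked = sorted(diffs, key=diffs.get, reverse=True)
    return set(ranked[:n_keep])

# Hypothetical rates (in percent) for four regions and their shared borders
rates = {"A": 10, "B": 12, "C": 30, "D": 31}
borders = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "C")]

by_absolute = boundary_elements(rates, borders, absolute=15)
by_relative = boundary_elements(rates, borders, top_percent=25)
```

With these toy values, the absolute rule flags the two borders adjoining region C from the low-rate side, while the relative rule keeps only the single most dissimilar border.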
The relative (top k%) thresholding method for determining boundary elements (BEs) is easily criticized, because for a given threshold, a fixed number of BEs will always be found regardless of whether or not the responses separated by the boundary are statistically different. Jacquez and Maruca (1998) suggest use of both local and global statistics to determine where statistically significant BEs are, and a randomization test (with or without spatial constraints) for whether the boundaries for the entire surface are statistically unusual.
All of the aforementioned areal wombling approaches (like traditional, point-level wombling itself) appear to be algorithmic, rather than model-based. That is, they do not involve a probability distribution for the data, and therefore permit statements...