|
COPYRIGHT 2001 Institute for Operations Research and the Management Sciences
Kohonen's self-organizing map (SOM) network is one of the most important network architectures developed during the 1980s. The main function of SOM networks is to map the input data from an n-dimensional space to a lower dimensional (usually one- or two-dimensional) plot while maintaining the original topological relations. Therefore, it can be viewed as an analog of factor analysis. In this research, we evaluate the feasibility of using SOM networks as a robust alternative to factor analysis and clustering for data mining applications. Specifically, we compare SOM network solutions to factor analytic and K-Means clustering solutions on simulated data sets with known underlying factor and cluster structures. The comparisons indicate that the SOM networks provide solutions superior to unrotated factor solutions in general and provide more accurate recovery of underlying cluster structures when the input data are skewed. Our findings suggest that SOM networks can provide robust alternatives to traditional factor analysis and clustering techniques in data mining applications.
(Data Mining; Kohonen Networks; Factor Analysis; Data Reductive; Clustering Analysis)
1. Introduction
With the increased availability of data collected from the Internet and other sources and the implementation of enterprise-wide databases, the amount of data that companies possess is growing at a phenomenal rate. Hence, it becomes increasingly important for the companies to be able to better manage their databases. Data mining is concerned with identifying interesting patterns and presenting them in a concise and meaningful manner (Piatetsky-Shapiro and Frawley 1991). Data mining tools and techniques that facilitate automated and intelligent database analysis and interpretation have been proposed, and some have been successfully implemented (Fayyad et al. 1996, Westphal & Blaxton 1998, Balachandran et al. 1999).
The widespread availability of data mining software has given practitioners a variety of new alternatives to traditional, statistical data analytic techniques. These alternatives include several techniques based on concepts from machine learning, pattern recognition, and neural networks (Chen et al. 2000, Vanecko and Russo 1999, Spangler et al. 1999, Cooper and Giuffrida 2000). Many of these newer techniques typically serve to achieve the same set of data analytic objectives as those sought to be accomplished by traditional statistical analysis: regression, data reduction, clustering, etc. Often, results obtained using newer data mining techniques are interpreted and utilized in the same manner as those obtained with statistical modeling. For example, the problem of market segmentation involves partitioning a population (of consumers) into relatively homogeneous subsets, so that each subset (segment) can be targeted using a marketing program tailored specifically to the needs of consumers in that subset. In practice, data from a sample of customers (drawn from the relevant population) are analyzed to estimate the segments (number and relative sizes) using a clustering procedure such as K-Means clustering; frequently, the data are preprocessed using factor analysis to reduce dimensionality and facilitate managerial interpretability, and the clustering is done using factor scores (e.g., Dillon et al. 1985, Doyle and Saunders 1985). Now the preprocessing for data reduction and the clustering task can be accomplished using algorithms based on neural networks.
The substitution of neural network-based techniques in the place of statistical modeling techniques needs justification on grounds other than that of novelty. A general a priori justification for preferring neural network-based approaches to statistical ones is that they do not require the invocation of assumptions about the underlying data generating mechanisms (e.g., the distributional assumption of multivariate normality that is invoked to justify the use of several multivariate statistical modeling procedures). On the other hand, statistical techniques provide a wealth of diagnostics that can be used to rigorously evaluate alternative solutions (e.g., error bounds and confidence intervals for parameter estimates, hypothesis testing, etc.). In this paper, we attempt to provide additional justification by presenting preliminary evidence that the SOM network is a robust alternative to factor analysis.
While Kohonen's self-organizing networks have been successfully applied as a classification tool to various problem domains, including speech recognition (Zhao and Rowden 1992, Leinonen et al. 1993), image data compression (Manikopoulos 1993), image or character recognition (Bimbo et al. 1993, Sabourin and Mitiche 1993), robot control (Walter and Schulen 1993, Ritter et al. 1989), and medical diagnosis (Vercauteren et al. 1990), its potential as a robust substitute for factor analysis and clustering tool remains relatively unresearched. Murtagh and Hernandez-Pajares (1995) examined a number of properties of SOM networks and compared them with various methods of data analysis including principal components and K-Means clustering. Clustering technique is considered an important data mining algorithm that can be applied to various problem domains. However, when the dimensionality of the problem is high--there is very large number of attributes (variables) involved--the size of the search space for model induction grows in a combinatorially explosive manner. Moreover, it increases the chances that a data mining algorithm will find spurious patterns that are not valid. Approaches to this problem include methods to reduce the effective dimensionality of the problem and the use of prior knowledge to identify irrelevant variables (Fayyad et al. 1996). The application of SOM networks as an alternative to factor analysis can reduce the problem space from several to few dimensions.
Factor analytic techniques are typically used to capture information in a set of k variables using a smaller set of new variables (called factors). While several methods are available in the statistical literature for extracting factors, these fall into one of two categories, depending on certain underlying assumptions about the relationships between the (high-dimensional) set of input variables and the (low-dimensional) set of output variables (i.e., the factors). Some methods (e.g., the principal components approach to factor estimation) assume that the statistical information in the input data (i.e., variation) can be adequately captured by a smaller set of variables that are hypothesized to be linear combinations of the input variables. Other methods (e.g., maximum likelihood estimation-based procedures) are based on the assumption that certain statistical relationships found in the input data (e.g., correlations among the variables) occur because the input variables are linear functions of a smaller set of unobserved variables (i.e., common factors). The common theme underlying both sets of methods is one of reducing the dimensionality of the data with minimal loss of information. This provides one basis for suggesting that SOM networks may be used to extract factors. In essence, SOM networks map a set of, say, N p-variate data points (i.e., cases) into a smaller (typically one or two) dimensional array. This mapping is usually interpreted as providing a classification or clustering of the N data points, because the number of elements in the array tends to be much smaller than N. On the other hand, the coordinates of the array elements can be viewed as analogs of factor scores because each k-dimensional input data point is associated with an element in the array characterized by a reduced (i.e., smaller than k) set of coordinates. Van Hulle (2000) provides an empirical basis for considering SOM networks as an alternative to factor analysis. Using a single sample of simulated data, he demonstrates that the SOM algorithm can be used for principal curve extraction. A principal curve is a one-dimensional nonlinear generalization of a principal component (Hastie and Stuetzle 1989). Van Hulle's (2000) demonstration, however, is limited to showing that the empirical principal curve generated by the SOM algorithm closely approximates the theoretical principal curve.
In this paper we explore three issues using simulated data. First, we address the question of whether SOM-generated coordinate scores are related to the input variables in the same fashion as factor scores are in traditional factor analysis--an important issue because factors need to be interpreted appropriately in business applications if they are to be useful to managers. Second, we examine the consequences of substituting SOM output for factor scores as input into the K-Means clustering procedure. Third, we cluster SOM output using a contiguity-constrained clustering approach suggested by Murtagh (1995) as a preferable way for interpreting the SOM feature map. The balance of the paper is organized as follows: 2 reviews data mining tasks and techniques. Section 3 presents the basic concepts of the SOM network and illustrates its use as a data-reduction tool (analogous to factor analysis). Section 4 describes the experimental procedure that was used to generate the data sets and determine the network configurations. In 5 we compare the performance of SOM and factor analysis using the simulated data sets. The paper concludes with a summary of our findings.
2. Data Mining Tasks and Techniques
Data mining is the process of automated discovery of interesting patterns, trends, and correlations hidden in a corporate database by sifting through large amount of data. It provides a means of extracting new nonobvious information from the growing base of data warehouses to create competitive advantages for organizations. A survey by both Bose and Sugumaran (1999) has shown that data mining is being used for broad decision making instead of tactical solution with the emphasis on macro decisions.
Data mining tasks can be broadly categorized into four dimensions: classification, estimation, segmentation/ clustering, and description/summarization (Westphal and Blaxton 1998). Depending on the primary goal of the task, the outcome of the data mining process can be predictive models or descriptive information. The predictive models produced by data mining process are good for classification or estimation tasks, while the descriptive information is used for segmentation/ clustering and summarization type of problems. Predictive data mining is a learning process that finds trends, patterns, and subtle relationships in data that allow the prediction of future results. Descriptive data mining process focuses on exploring and visualizing the data to find interesting patterns or relationships (Linoff 1999). Each data mining application has different goals and circumstances and, hence, requires different sets of data mining techniques. The most widely used data mining techniques come from a variety of fields, including query tools; statistical methods such as regression, discriminant analysis, logistic models, multidimensional analysis, factor analysis, and data visualization; also machine learning techniques...
Read the full article for free courtesy of your local library.
|