Kohonen's self-organizing map (SOM) network is one of the most important network architectures developed during the 1980s. The main function of SOM networks is to map the input data from an n-dimensional space to a lower-dimensional (usually one- or two-dimensional) plot while maintaining the original topological relations. A SOM network can therefore be viewed as an analog of factor analysis. In this research, we evaluate the feasibility of using SOM networks as a robust alternative to factor analysis and clustering for data mining applications. Specifically, we compare SOM network solutions to factor analytic and K-Means clustering solutions on simulated data sets with known underlying factor and cluster structures. The comparisons indicate that the SOM networks provide solutions superior to unrotated factor solutions in general and provide more accurate recovery of underlying cluster structures when the input data are skewed. Our findings suggest that SOM networks can provide robust alternatives to traditional factor analysis and clustering techniques in data mining applications.
(Data Mining; Kohonen Networks; Factor Analysis; Data Reduction; Clustering Analysis)
1. Introduction
With the increased availability of data collected from the Internet and other sources and the implementation of enterprise-wide databases, the amount of data that companies possess is growing at a phenomenal rate. Hence, it becomes increasingly important for companies to manage their databases well. Data mining is concerned with identifying interesting patterns and presenting them in a concise and meaningful manner (Piatetsky-Shapiro and Frawley 1991). Data mining tools and techniques that facilitate automated and intelligent database analysis and interpretation have been proposed, and some have been successfully implemented (Fayyad et al. 1996, Westphal and Blaxton 1998, Balachandran et al. 1999).
The widespread availability of data mining software has given practitioners a variety of new alternatives to traditional statistical data analytic techniques. These alternatives include several techniques based on concepts from machine learning, pattern recognition, and neural networks (Chen et al. 2000, Vanecko and Russo 1999, Spangler et al. 1999, Cooper and Giuffrida 2000). Many of these newer techniques serve the same data analytic objectives as traditional statistical analysis: regression, data reduction, clustering, etc. Often, results obtained using newer data mining techniques are interpreted and utilized in the same manner as those obtained with statistical modeling. For example, the problem of market segmentation involves partitioning a population (of consumers) into relatively homogeneous subsets, so that each subset (segment) can be targeted using a marketing program tailored specifically to the needs of consumers in that subset. In practice, data from a sample of customers (drawn from the relevant population) are analyzed to estimate the segments (number and relative sizes) using a clustering procedure such as K-Means clustering; frequently, the data are preprocessed using factor analysis to reduce dimensionality and facilitate managerial interpretability, and the clustering is done using factor scores (e.g., Dillon et al. 1985, Doyle and Saunders 1985). Both the preprocessing for data reduction and the clustering task can now be accomplished using algorithms based on neural networks.
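The clustering step of the segmentation workflow described above can be sketched as follows. This is a minimal illustration, not the procedure used in the studies cited. It assumes the factor scores have already been computed and are supplied as a list of numeric vectors, one per customer. The function names (`sq_dist`, `kmeans`, `segment_shares`) are hypothetical, and the farthest-point seeding of the initial centroids is an assumption chosen only to keep the example deterministic.

```python
import random
from collections import Counter


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (e.g. factor scores) into k segments with Lloyd's algorithm."""
    rng = random.Random(seed)
    # Seeding (an assumption for this sketch): one random point, then
    # repeatedly the point farthest from every centroid chosen so far.
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))

    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def segment_shares(labels):
    """Relative size of each segment as a fraction of the sample."""
    counts = Counter(labels)
    return {seg: n / len(labels) for seg, n in counts.items()}
```

Calling `kmeans(scores, k)` returns one segment label per customer together with the segment centroids, and `segment_shares(labels)` gives the relative segment sizes mentioned in the text. The number of segments `k` still has to be chosen by the analyst.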
The substitution of neural network-based techniques in place of statistical modeling techniques needs justification on grounds other than that of novelty. A general a priori justification for preferring neural network-based approaches to statistical ones is that the former do not require the invocation of assumptions about the underlying data generating mechanisms (e.g., the distributional assumption of multivariate normality that is invoked to justify the use of several multivariate statistical modeling procedures). On the other hand, statistical techniques provide a wealth of diagnostics that can be used to rigorously evaluate alternative solutions (e.g., error bounds and confidence intervals for parameter estimates, hypothesis testing, etc.). In this paper, we attempt to provide additional justification by presenting preliminary evidence that the SOM network is a robust alternative to factor analysis.
While Kohonen's self-organizing networks have been successfully applied as a classification tool to various problem domains, including speech recognition (Zhao and Rowden 1992, Leinonen et al. 1993), image data compression (Manikopoulos 1993), image or character recognition (Bimbo et al. 1993, Sabourin and Mitiche 1993), robot control (Walter and Schulen 1993, Ritter et al. 1989), and medical diagnosis (Vercauteren et al. 1990), their potential as a robust substitute for factor analysis and as a clustering tool remains relatively unresearched. Murtagh and Hernandez-Pajares (1995) examined a number of properties of SOM networks and compared them with various methods of data analysis, including principal components and K-Means clustering. Clustering is considered an important data mining technique that can be applied to various problem domains. However, when the dimensionality of the problem is high (that is, when a very large number of attributes or variables is involved), the size of the search space for model induction grows in a combinatorially explosive manner. Moreover, high dimensionality increases the chances that a data mining algorithm will find spurious patterns that are not valid. Approaches to this problem include methods to reduce the effective dimensionality of the problem and the use of prior knowledge to identify irrelevant variables (Fayyad et al. 1996). The application of SOM networks as an alternative to factor analysis can reduce the problem space from many dimensions to a few.
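To make the idea of mapping n-dimensional inputs onto a two-dimensional grid concrete, the following is a minimal sketch of a Kohonen-style SOM. It is an illustration under stated assumptions, not the configuration evaluated in this paper. The grid size, the linearly decaying learning rate and neighborhood width, the Gaussian neighborhood function, and the function names (`train_som`, `best_matching_unit`, `project`) are all hypothetical choices, and the inputs are assumed to be scaled to roughly the unit interval.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid coordinate whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda node: sum((w - v) ** 2 for w, v in zip(weights[node], x)),
    )


def train_som(data, rows=4, cols=4, epochs=100, lr0=0.5, sigma0=2.0, seed=0):
    """Fit a rows x cols map to `data`, a list of equal-length numeric vectors.

    All defaults are illustrative assumptions, not settings from the paper.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    # One weight vector per grid node, initialised at random in [0, 1).
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # neighborhood shrinks to a floor
        for x in rng.sample(data, len(data)):    # present inputs in random order
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                # Nodes near the winner on the grid are pulled toward x too,
                # which is what preserves the topological relations.
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights


def project(weights, data):
    """Map each input vector to the (row, col) of its best matching unit."""
    return [best_matching_unit(weights, x) for x in data]
```

After training, `project(weights, data)` replaces each n-dimensional observation with a pair of grid coordinates, which is the sense in which the map reduces the problem space to two dimensions.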