We present CARTscans, a graphical tool that displays predicted values across a four-dimensional subspace. We show how these plots are useful for understanding the structure and relationships between variables in a wide variety of models, including (but not limited to) regression trees, ensembles of trees, and linear regressions with varying degrees of interactions. In addition, the common visualization framework allows diverse complex models to be visually compared in a way that illuminates the similarities and differences in the underlying methods, facilitates the choice of a particular model structure, and provides a useful check for implausible predictions of future observations in regions with little or no data.
Key Words: Bagging; Boosting; Classification and regression trees; Color coding; Graphics; Linear regression; Random forests; Visualization.
1. INTRODUCTION
This article describes graphical techniques inspired by the use of CT scans in medical imaging. In a CT scan, a series of two-dimensional images, taken in successive slices, each one directly above or below the one before, is used to depict the three-dimensional structure being scanned, such as a head or body. The images are then printed in a line or a grid, and the viewer must mentally restack them to form a representation of the three dimensions. With practice, physicians can become quite adept at visualizing the size, shape, and density of a wide range of three-dimensional structures.
The tools we present take an approach similar to that of medical CT scans by presenting a series of "slices" of the predictor space. These tools can help visualize any space of between three and six dimensions in many contexts, linear and nonlinear. Our focus is on how these tools can aid in understanding the structure of diverse models, and facilitate the comparison of different types of models, in a way that parallels what can be seen in a simple linear regression context. We begin by showing how these tools can be used to visualize a relatively simple single regression tree, as popularized in the statistical literature by Breiman, Friedman, Olshen, and Stone (1984). Regression and classification trees are flexible, partition-based models that are well suited to describing complex interactions. At their most basic, these models recursively partition the predictor space into disjoint rectangles that are successively more homogeneous with respect to an outcome variable. As the trees get more complicated, however, so do their description and interpretation. We therefore broaden our scope to demonstrate the use of these tools in understanding, visualizing, and comparing a range of models, including linear regression models with various degrees of interactions, and models composed of aggregates of trees, such as bagged (bootstrap aggregated; Breiman 1996) or boosted (Friedman 2001) trees. The tools discussed here can be applied to a wide variety of models because they work only on the predictor space and the predicted values. We discuss several issues that arise in the specification of these plots, such as measures of variable importance and their function in choosing which variables to display and in what roles, and how these tools can help in assessing the variability behind the predictions in these models.
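Because the approach needs only the predictor space and the predicted values, the "slicing" idea can be illustrated with a short, model-agnostic sketch. The code below is not the authors' implementation: the function name `scan_grid`, its arguments, and the piecewise-constant `toy_tree` model are hypothetical, chosen only to show how predictions might be tabulated over a four-dimensional subspace. Two predictors select the panel and two vary within each panel, while any remaining predictors are held at fixed baseline values.

```python
import numpy as np


def scan_grid(predict, baseline, row, col, x, y):
    """Tabulate predictions over a four-dimensional slice of predictor space.

    predict  : callable mapping an (n, p) array of predictors to n predictions
    baseline : length-p sequence; predictors that are not varied keep these values
    row, col : (column_index, values) pairs that select the panel
    x, y     : (column_index, values) pairs that vary within each panel
    Returns an array of shape (n_row, n_col, n_y, n_x).
    """
    (ri, rv), (ci, cv), (xi, xv), (yi, yv) = row, col, x, y
    # One coordinate array per varied predictor, all with the output shape.
    grids = np.meshgrid(rv, cv, yv, xv, indexing="ij")
    n_points = grids[0].size
    # Start every grid point at the baseline, then overwrite the varied columns.
    X = np.tile(np.asarray(baseline, dtype=float), (n_points, 1))
    for idx, grid in zip((ri, ci, yi, xi), grids):
        X[:, idx] = grid.ravel()
    return np.asarray(predict(X)).reshape(grids[0].shape)


def toy_tree(X):
    """Hypothetical tree: constant on rectangles of predictors 0 and 1,
    shifted by 10 when predictor 3 is positive. Predictor 2 is unused."""
    leaf = np.where(X[:, 0] <= 0.5, 1.0, np.where(X[:, 1] <= 0.5, 2.0, 3.0))
    return leaf + 10.0 * (X[:, 3] > 0)


scans = scan_grid(
    toy_tree,
    baseline=[0.0, 0.0, 0.0, 0.0, 0.0],
    row=(3, [0, 1]),
    col=(2, [0, 1, 2]),
    x=(0, np.linspace(0, 1, 5)),
    y=(1, np.linspace(0, 1, 4)),
)
```

Each `scans[i, j]` is one two-dimensional panel that could be drawn as a color-coded image. Since `predict` is an arbitrary callable, the same grid could be filled from a single tree, an ensemble, or a linear regression, which is what makes side-by-side comparison possible.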
1.1 CURRENT GRAPHICAL TOOLS
There are many tools for visualizing data, some of which were described by Tukey and Tukey (1981). Although many of the concepts and ideas behind visualizations have not changed since the publication of that article, the rapid growth of computing power has led to a corresponding growth of new visualization methods. Some of these are general, like XGobi's well-known "Grand Tour" (Swayne, Cook, and Buja 1998), and some are specific to certain types of models. For tree-based models, for instance, there exists a range of tools, both commercial and free. Many of these focus on the hierarchical structure, ranging from traditional flowchart-like plots to interactive "drill-down" tools such as those popular in commercial data mining software. Other efforts to display hierarchical structure include "Treemaps" (http://www.cs.umd.edu/hcil/treemap/), which recursively partition a viewing screen into color-coded rectangles based on the relative importance of a node. These displays, which have proven useful in visualizing file directory structure and financial databases, depend on an explicit hierarchy in the organization of the data. They have been used for statistical trees, and are useful for showing the organization and size of the leaves, but none of these plots is geared towards representing main effects and overall effects of certain variables. Furthermore, none of these plots truly achieves our goal: tools that allow an understanding of the structure of a tree, or of an ensemble of trees, that parallels the understanding of the equivalent linear model. Few visualizations give equal insight into a single regression tree, an ensemble of trees, and a linear model, in terms of the structure of the model and the corresponding insights about relationships between variables. Indeed, visualizing linear regression models has not been a main goal of visualization packages, as these parametric models are often considered well summarized by parameter estimates and standard errors.
Source: HighBeam Research, CARTscans: a tool for visualizing complex models.