|
COPYRIGHT 2004 American Statistical Association
The need to explore model uncertainty in linear regression models with many predictors has motivated improvements in Markov chain Monte Carlo sampling algorithms for Bayesian variable selection. Currently used sampling algorithms for Bayesian variable selection may perform poorly when there are severe multicollinearities among the predictors. This article describes a new sampling method based on an analogy with the Swendsen-Wang algorithm for the Ising model, and which can give substantial improvements over alternative sampling schemes in the presence of multicollinearity. In linear regression with a given set of potential predictors we can index possible models by a binary parameter vector that indicates which of the predictors are included or excluded. By thinking of the posterior distribution of this parameter as a binary spatial field, we can use auxiliary variable methods inspired by the Swendsen-Wang algorithm for the Ising model to sample from the posterior where dependence among parameters is reduced by conditioning on auxiliary variables. Performance of the method is described for both simulated and real data.
Key Words: Auxiliary variables; Data augmentation; Ising model; Markov chain Monte Carlo.
1. INTRODUCTION
Let y = [([y.sub.1], ..., [y.sub.n]).sup.T] be a vector of responses, X be an n x p design matrix, and consider a linear model
y = X[beta] + [epsilon],
where [beta] = [([[beta].sub.1], ..., [[beta].sub.p]).sup.T] is a vector of parameters and [epsilon] ~ N(0, [[sigma].sup.2]I) is a vector of zero mean errors. We consider Bayesian inference in this model with a hierarchical prior on [beta] which allows some components of [beta] to be zero. If [[beta].sub.i] = 0, this excludes the ith predictor from the model. The problem of variable selection is to decide which predictors should be included in a model for the mean of the responses; the Bayesian approach does this coherently, integrating out all uncertainties.
Markov chain Monte Carlo sampling methods for exploring model uncertainty in Bayesian variable selection problems have received a lot of recent attention: see George and McCulloch (1997), Denison, Mallick, and Smith (1998) and Kohn, Smith, and Chan (2001) for a discussion of different approaches and recent developments.
Currently used sampling schemes for exploring the posterior distribution may mix slowly when there are severe multicollinearities among the predictors. We describe an algorithm which offers improvements over alternative sampling schemes in this situation, and which is based on an analogy with the Swendsen-Wang algorithm for the Ising model (Swendsen and Wang 1987). We formulate our hierarchical prior for [beta] in terms of a vector of binary variables in which the components indicate whether a predictor is included in the model or not. By thinking of this binary parameter vector as a spatial process, we are motivated to use a sampling algorithm inspired by the Swendsen-Wang algorithm for the Ising model, where dependence between parameters is reduced by conditioning on some auxiliary variables. For a review of the Swendsen-Wang algorithm and some extensions to general Bayesian inference see Higdon (1998).
We will be concerned with Bayesian methods for variable selection and accounting for model uncertainty in linear models. However, similar ideas find application in many other areas, such as generalized linear models (Raftery 1996), survival analysis (Volinsky, Madigan, Raftery, and Kronmal 1997) and graphical models (Madigan and York 1995).
The structure of this article is as follows. Section 2 specifies the model and the priors. Section 3 reviews the Swendsen-Wang algorithm, and Section 4 extends the algorithm to the problem of Bayesian variable selection. Section 5 describes our method for defining the auxiliary variables in our algorithm. Section 6 describes performance of the method for real and simulated data, and Section 7 gives some discussion and conclusions.
2. BAYESIAN VARIABLE SELECTION
We consider the model and the prior specification given by Kohn et al. (2001) for Bayesian variable selection problems, although the methods we describe are applicable with other prior specifications. Following their notation, let [gamma] = [([[gamma].sub.1], ..., [[gamma].sub.p]).sup.T] be a binary vector, and write [q.sub.[gamma]] = [[summation of].sub.i][[gamma].sub.i] for the number of nonzero elements of [gamma]. Let [X.sub.[gamma]] be the n x [q.sub.[gamma]] design matrix obtained by removing those columns i from X for which [[gamma].sub.i] = 0. Similarly let [[beta].sub.[gamma]] be the subvector of [beta] obtained by removing components [[beta].sub.i] of [beta] for which [[gamma].sub.i] = 0.
We assume that
y|[gamma], [X.sub.[gamma]], [[beta].sub.[gamma]], [[sigma].sup.2] ~ N([X.sub.[gamma]], [[beta].sub.[gamma]], [[sigma].sup.2]I).
For Bayesian inference on the model parameters we use a hierarchical prior. The prior for [[beta].sub.[gamma]] given [gamma] and [[sigma].sup.2] is normal
(2.1) p([[beta].sub.[gamma]]|[gamma], [[sigma].sup.2]) ~ [N(0, n[[sigma].sup.2][([X.sup.T.sub.[gamma]][X.sub.[gamma]]).sup.-1]).
This prior was used by Smith and Kohn (1996), is related to the g-prior of Zellner (1986), and has some attractive invariance properties under rescaling of X and y (see Kohn et al. 2001 for further discussion).
The prior on [[sigma].sup.2] is p([[sigma].sup.2]) [varies] [[sigma].sup.2], and for our prior on [gamma] we use
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII.],
where [pi] is a hyperparameter with a beta prior, [pi] ~ Beta(a, b). The prior on [gamma] (integrating out [pi]) is
p([gamma]) = B([q.sub.[gamma]] + a, p - [q.sub.[gamma]] + b) / B(a, b),
where B(.,.) denotes the beta function.
We are interested in the posterior distribution on [gamma] with [[beta].sub.[gamma]] and [[sigma].sup.2] integrated out,
(2.2) p([gamma]|y) [varies] p([gamma])p(y|[gamma]).
We have that
p(y|[gamma]) = [integral][integral]p(y|[[beta].sub.[gamma]], [gamma], [[sigma].sup.2])p([[beta].sub.[gamma]]|[gamma], [[sigma].sup.2])p ([[sigma].sup.2])d[[beta].sub.[gamma]]d[[sigma].sup.2]
and Smith and Kohn (1996) observed that [[beta].sub.[gamma]] can be integrated out as a normal integral, and [[sigma].sup.2] as an inverse gamma integral, to give
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII.].
For alternative prior specifications it may not be possible to integrate out [[beta].sub.[gamma]] and [[sigma].sup.2] analytically: if this is the case, then we can still approximate p(y|[gamma]) via a Laplace approximation and apply the methods described later. This may also be useful in developing sampling schemes for other Bayesian model selection problems.
For p relatively small, we can compute the posterior p([gamma]|y) exactly, obtaining the normalizing constant in (2.2) by summing over all possible values of [gamma]. For large p, this is not feasible due to the number of terms in the sum, and we use Markov chain Monte Carlo algorithms to identify high posterior probability models.
When there are high posterior correlations between components of [gamma], the usual Markov chain Monte Carlo methods for exploring the posterior, which update one component of [gamma] at a time, can mix slowly. High posterior correlations can occur, for instance, in the situation where there is multicollinearity. Updating components of [gamma] in blocks rather than one at a time can alleviate problems of slow convergence, but it may be difficult to decide how to choose blocks.
Although the focus of this article is on variable selection, also of interest is estimation of the vector of regression coefficients [beta] without conditioning on any single model, but by averaging over different models (so-called model averaging, see Hoeting, Madigan, Raftery, and Volinsky 1999). The sampling schemes we discuss are relevant for this objective also.
3. SWENDSEN-WANG ALGORITHM
Let [eta] = ([[eta].sub.2], ..., [[eta].sub.p]) be a binary spatial process with joint distribution p([eta]) specified by
(3.1) p([eta]) [varies] exp ([summation over i][[alpha].sub.i] ([[eta].sub.i]) + [summation over i where [[psi].sub.ij] [greater than or equal to] 0, i < j, and I(A) is the indicator function which is one when A occurs and zero otherwise. Ratios of the expression (3.1) are easy to compute for [eta] vectors which differ at a single site, and this allows a single site Metropolis-Hastings algorithm such as the Gibbs sampler to be easily implemented. However, single-site updating...
Read the full article for free courtesy of your local library.
|