1. INTRODUCTION
Until recently, "simultaneous inference" meant considering two or five or perhaps even 10 hypothesis tests at the same time, as in Miller's classic text (Miller 1981). Rapid progress in technology, particularly in genomics and imaging, has vastly upped the ante for simultaneous inference problems. Now 500 or 5,000 or even 50,000 tests may need to be evaluated simultaneously, raising new problems for the statistician, but also opening new analytic opportunities. This article explores choosing an appropriate null hypothesis in large-scale testing situations, and how this choice affects well-known inference methods, such as the false discovery rate (FDR).
Simultaneous hypothesis testing begins with a collection of null hypotheses,
$H_1, H_2, \ldots, H_N$; (1)
corresponding test statistics, possibly not independent,
$Y_1, Y_2, \ldots, Y_N$; (2)
and their p values, $P_1, P_2, \ldots, P_N$, with $P_i$ measuring how strongly $y_i$, the observed value of $Y_i$, contradicts $H_i$; for instance, $P_i = \Pr_{H_i}\{|Y_i| > |y_i|\}$. "Large-scale" means that $N$ is a big number, say at least $N > 100$.
It is convenient, although not necessary, to work with z-values instead of the $Y_i$'s or $P_i$'s,
$z_i = \Phi^{-1}(P_i), \quad i = 1, 2, \ldots, N$, (3)
with $\Phi$ indicating the standard normal cumulative distribution function (cdf), for example, $\Phi^{-1}(.95) = 1.645$. If $H_i$ is exactly true, then $z_i$ will have a standard normal distribution,
$z_i \mid H_i \sim N(0, 1)$. (4)
I call (4) the theoretical null hypothesis.
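As a minimal sketch of transformation (3), the following Python converts p values to z-values using only the standard library. The function name `p_to_z` and the three input values are assumptions introduced here for illustration; they are not from the study.

```python
from statistics import NormalDist

def p_to_z(p_values):
    """Map p values to z-values via z = Phi^{-1}(P), as in (3)."""
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    # inv_cdf is the standard normal quantile function Phi^{-1};
    # it requires 0 < p < 1 and raises StatisticsError otherwise.
    return [std_normal.inv_cdf(p) for p in p_values]

# Illustrative inputs only (hypothetical, not study data).
z_values = p_to_z([0.95, 0.5, 0.025])
print([round(z, 3) for z in z_values])
```

The first value should agree with $\Phi^{-1}(.95) = 1.645$ quoted above. If the $P_i$ are uniformly distributed under $H_i$, the resulting $z_i$ follow the theoretical null (4).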
Our motivating example concerns a study of 1,391 patients with human immunodeficiency virus (HIV) infection, investigating which of 6 protease inhibitor (PI) drugs cause mutations at which of 74 sites on the viral genome. Each patient provided a vector of predictors,
$x = (x_1, x_2, \ldots, x_6)$, (5)
with $x_j = 1$ or 0 indicating whether or not the patient used $\mathrm{PI}_j$, $1 \le \sum_{j=1}^{6} x_j \le 6$; and a vector of responses,
$v = (v_1, v_2, \ldots, v_{74})$, (6)
$v_k = 1$ or 0 indicating whether or not a mutation occurred at site $k$. Remark A of Section 7 describes the study in more detail.
For each of the 74 genomic sites, a separate logistic regression analysis was run using all 1,391 cases, with that site's mutation indicators as responses and the PI indicators as predictors. Together these yielded $444 = 6 \times 74$ z-values, one for testing each null hypothesis that drug $j$ does not cause mutations at site $k$, $j = 1, 2, \ldots, 6$ and $k = 1, 2, \ldots, 74$. The z-values were based on the usual approximation
$z_i = y_i / \mathrm{se}_i, \quad i = 1, 2, \ldots, 444$, (7)
[using a single subscript $i$ in place of $(j, k)$] where $y_i$ is the maximum likelihood estimate (MLE) of the logistic regression coefficient and $\mathrm{se}_i$ is its approximate large-sample standard error.
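The computation behind (7) for a single genomic site can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the data are synthetic stand-ins, the function name `logistic_z_values` and the effect sizes in `true_beta` are hypothetical, the logistic model includes an intercept (which the text does not specify), and the standard errors are taken from the inverse Fisher information at the MLE. Sign conventions for the z-values are not addressed here.

```python
import numpy as np

def logistic_z_values(X, v, n_iter=25):
    """Fit a logistic regression by Newton-Raphson and return z = y / se
    for each predictor coefficient (intercept excluded), as in (7)."""
    design = np.column_stack([np.ones(len(X)), X])  # add intercept column
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-design @ beta))
        weights = prob * (1.0 - prob)
        info = design.T @ (design * weights[:, None])  # Fisher information
        beta = beta + np.linalg.solve(info, design.T @ (v - prob))
    # Recompute the information at the final estimate for the standard errors.
    prob = 1.0 / (1.0 + np.exp(-design @ beta))
    info = design.T @ (design * (prob * (1.0 - prob))[:, None])
    se = np.sqrt(np.diag(np.linalg.inv(info)))
    return (beta / se)[1:]

# Synthetic stand-in data (NOT the HIV study): 1,391 patients, 6 drug indicators.
# Unlike the study, rows with no drug used are not excluded here.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1391, 6)).astype(float)
true_beta = np.array([1.5, 0.0, 0.0, 0.0, 0.0, 0.0])  # hypothetical effects
prob_mutation = 1.0 / (1.0 + np.exp(-(-1.0 + X @ true_beta)))
v = (rng.random(1391) < prob_mutation).astype(float)

z = logistic_z_values(X, v)  # six z-values for this one site
print(np.round(z, 2))
```

Repeating the fit for each of the 74 sites would give the full set of 444 z-values.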
Figure 1 shows a histogram of the 444 z-values, with negative $z_i$'s indicating greater mutational effects. The smooth curve, $f(z)$, is a natural spline with 7 df, fit to the histogram counts by Poisson regression. It emphasizes the central peak near $z = 0$, presumably the large majority of uninteresting drug-site combinations that have negligible mutation effects. Near its center, the peak is well described by a normal density with mean $-.35$ and standard deviation 1.20, which will be called the empirical null hypothesis,
$z_i \mid H_i \sim N(-.35, 1.20^2)$. (8)
Section 3 describes the estimation methodology for (8), with a brief discussion of the normality assumption in Remark D of Section 7.
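To make the idea of an empirical null concrete, here is a rough sketch of one way to fit a normal curve to the central peak of a z-value histogram. It is not presented as the Section 3 methodology: it swaps in a simple quadratic least-squares fit to the log bin counts near the center, using the fact that the log of a $N(\delta, \sigma^2)$ density is quadratic in $z$. The function name `central_normal_fit`, the window and bin widths, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def central_normal_fit(z, half_width=1.5, bin_width=0.2):
    """Fit a normal curve to the central peak of a z-value histogram.

    A quadratic fit log f(z) = a + b z + c z^2 corresponds to
    N(delta, sigma^2) with sigma^2 = -1 / (2c) and delta = b * sigma^2.
    Returns nan for sigma if the fitted curvature c is not negative.
    """
    center = np.median(z)
    edges = np.arange(center - half_width,
                      center + half_width + bin_width / 2, bin_width)
    counts, edges = np.histogram(z, bins=edges)
    mids = 0.5 * (edges[:-1] + edges[1:])
    keep = counts > 0  # log is undefined for empty bins
    c, b, a = np.polyfit(mids[keep], np.log(counts[keep]), deg=2)
    sigma = np.sqrt(-1.0 / (2.0 * c))
    delta = b * sigma ** 2
    return delta, sigma

# Synthetic z-values (hypothetical): mostly N(-.35, 1.20^2), plus a small
# group shifted far to the left to mimic a few interesting cases.
rng = np.random.default_rng(1)
z_sim = np.concatenate([rng.normal(-0.35, 1.20, 19000),
                        rng.normal(-4.0, 1.0, 1000)])
delta_hat, sigma_hat = central_normal_fit(z_sim)
print(round(delta_hat, 2), round(sigma_hat, 2))
```

Because only bins near the center enter the fit, the small shifted group has little influence on the estimates. With only 444 z-values, as in the HIV example, such a fit would be considerably noisier than in this large synthetic sample.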
The difference between the theoretical null $N(0, 1)$ and empirical null $N(-.35, 1.20^2)$ may not seem worrisome here, but it will be shown that it substantially affects any simultaneous inference procedure. A more dramatic example is given in Section 6, for a microarray analysis in which going from the theoretical to empirical null totally negates any findings of significance. Situations going in the reverse direction can also occur.
[FIGURE 1 OMITTED]
In classic situations involving only a single hypothesis test, one must, out of necessity, use the theoretical null hypothesis, $z \sim N(0, 1)$. The main point of this article is that large-scale testing situations permit empirical estimation of the null distribution. Sections 3-5 explore reasons why the empirical and theoretical null might differ, and which might be preferable in different situations.
There are scientific as well as statistical differences between small-scale and large-scale hypothesis testing situations. A single hypothesis test is most often run with the expectation and hope of rejecting the null, "with 80% power" in a typical clinical trial. Nobody wants to reject 80% of N = 5,000 null hypotheses. The usual point of large-scale testing is to identify a small percentage of interesting cases that deserve further investigation. Although we are not exactly looking for a needle in a haystack, we do not want the whole haystack either. An important assumption of what follows is that the proportion of interesting cases is small, perhaps 1% or 5% of N, but not more than 10%. This is made explicit in Section 2, in the description of the local false discovery rate as an analytic tool for large-scale testing. There are situations in which the 10% limit is irrelevant (e.g., in constructing prediction models), but these lie outside our purpose here.
The terminology "Interesting/Uninteresting" used in this article in preference to "Significant/Nonsignificant" is discussed near the end of Section 5. We conclude in Sections 7 and 8 with remarks, including most of the technical details, and a summary.
2. THE LOCAL FALSE DISCOVERY RATE
It is convenient to discuss large-scale testing problems in terms of the local false discovery rate (fdr), an empirical Bayes version of Benjamini and Hochberg's (1995) methodology focusing on densities rather than tail areas (see Efron et al. 2001; Efron and Tibshirani 2002; Storey 2002, 2003).
We begin with a simple Bayes model. Suppose that each of the $N$ z-values falls into one of two classes, "Uninteresting" or "Interesting," corresponding to whether or not $z_i$ is generated according to the null hypothesis, with prior probabilities $p_0$ and $p_1 = 1 - p_0$ for the classes. Assume that $z_i$ has density either $f_0(z)$ or $f_1(z)$, depending on its class,
$p_0 = \Pr\{\text{Uninteresting}\}, \quad f_0(z)$ density if Uninteresting (Null), (9)
$p_1 = \Pr\{\text{Interesting}\}, \quad f_1(z)$ density if Interesting (Nonnull).
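The two-class model (9) can be illustrated with a small simulation. Everything specific in this sketch is an assumption chosen for illustration: the function name `simulate_two_class`, the value $p_0 = .95$, the choice of normal densities for both classes, and the nonnull mean of $-3$. The model itself leaves $f_1(z)$ unspecified.

```python
import random

def simulate_two_class(n, p0, null_params, nonnull_params, seed=0):
    """Draw n z-values from the two-class model (9).

    Each case is Uninteresting with probability p0 and Interesting with
    probability p1 = 1 - p0; z is then drawn from a normal density with
    the (mean, sd) given for its class (normality is assumed here).
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        interesting = rng.random() >= p0
        mu, sigma = nonnull_params if interesting else null_params
        draws.append((rng.gauss(mu, sigma), interesting))
    return draws

# Hypothetical settings: 5% interesting cases, shifted to the left.
draws = simulate_two_class(n=5000, p0=0.95,
                           null_params=(0.0, 1.0),
                           nonnull_params=(-3.0, 1.0))
share_interesting = sum(flag for _, flag in draws) / len(draws)
print(round(share_interesting, 3))
```

In practice the class labels are unobserved; only the pooled z-values are available, which is why the analysis works with their overall density.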
The smooth curve in Figure 1 estimates the mixture density, f(z),
$f(z) = p_0 f_0(z) + …$
Source: HighBeam Research, Large-scale simultaneous hypothesis testing: the choice of a null...