
Chapter 14

MULTIPLE HYPOTHESIS TESTING
N . E. SAVIN*
Trinity College, Cambridge

Contents

1. Introduction
2. t and F tests
   2.1. The model
   2.2. Tests
   2.3. Critical values - finite induced test
   2.4. Acceptance regions
3. Induced tests and simultaneous confidence intervals
   3.1. Separate hypotheses
   3.2. Finite induced test - $\psi$ of primary interest
   3.3. Infinite induced test - Scheffé test
   3.4. Finite induced test - $\psi$ of secondary interest
   3.5. Simultaneous confidence intervals
4. The power of the Bonferroni and Scheffé tests
   4.1. Background
   4.2. Power contours
   4.3. Average powers
   4.4. The problem of multicollinearity
5. Large sample induced tests
6. Empirical examples
   6.1. Textile example
   6.2. Klein's Model I example
References

*This work was supported by National Science Foundation Grant SES 79-12965 at the Institute for
Mathematical Studies in the Social Sciences, Stanford University. The assistance of G. B. A. Evans is
gratefully acknowledged. I am also indebted to the following people for valuable comments: T. W.
Anderson, K. J. Arrow, R. W. Farebrother, P. J. Hammond, D. F. Hendry, D. W. Jorgenson, L. J.
Lau, B. J. McCormick, and J. Richmond.
Handbook of Econometrics, Volume II, Edited by Z. Griliches and M. D. Intriligator
© Elsevier Science Publishers BV, 1984


1. Introduction

The t and F tests are the most frequently used tests in econometrics. In regression analysis there are two different procedures which can be used to test the hypothesis that all the coefficients are zero. One procedure is to test each coefficient separately with a t test and the other is to test all coefficients jointly using an F test. The investigator usually performs both procedures when analyzing the sample data. The obvious questions are what is the relation between the two procedures and which procedure is better. Scheffé (1953) provided the key to the answers when he proved that the F test is equivalent to carrying out a set of simultaneous t tests. More than 25 years have passed since this result was published and yet the full implications have barely penetrated the econometric literature. Aside from a brief mention in Theil (1971) the Scheffé result has not been discussed in the econometric textbooks; the exceptions appear to be Seber (1977) and Dhrymes (1978). Hence, it is perhaps no surprise that there are so few applications of multiple hypothesis testing procedures in empirical econometric research.
This chapter presents a survey of multiple hypothesis testing procedures with an emphasis on those procedures which can be applied in the context of the classical linear regression model. Multiple hypothesis testing is the testing of two or more separate hypotheses simultaneously. For example, suppose we wish to test the hypothesis $H\colon \beta_1 = \beta_2 = 0$ where $\beta_1$ and $\beta_2$ are coefficients in a multiple regression. In situations in which we only wish to test whether H is true or not we can use the F test. It is more usual that when H is rejected we want to know whether $\beta_1$ or $\beta_2$ or both are nonzero. In this situation we have a multiple decision problem and the natural solution is to test the separate hypotheses $H_1\colon \beta_1 = 0$ and $H_2\colon \beta_2 = 0$ with a t test. Since H is true if and only if the separate hypotheses $H_1\colon \beta_1 = 0$ and $H_2\colon \beta_2 = 0$ are both true, this suggests accepting H if and only if we accept $H_1$ and $H_2$. Testing the two hypotheses $H_1$ and $H_2$ when we are interested in whether $\beta_1$ or $\beta_2$ or both are different from zero induces a multiple decision problem in which the four possible decisions are:

$d_{00}$: $H_1$ and $H_2$ are both true,
$d_{01}$: $H_1$ is true, $H_2$ is false,
$d_{10}$: $H_1$ is false, $H_2$ is true,
$d_{11}$: $H_1$ and $H_2$ are both false.

Now suppose that a test of $H_1$ is defined by the acceptance region $A_1$ and the rejection region $R_1$, and similarly for $H_2$. These two separate tests induce a decision procedure for the four decision problem, this induced procedure being defined by assigning the decision $d_{00}$ to the intersection of $A_1$ and $A_2$, $d_{01}$ to the intersection of $A_1$ and $R_2$, and so on. This induced procedure accepts $H\colon \beta_1 = \beta_2 = 0$ if and only if $H_1$ and $H_2$ are accepted.
More generally, suppose that the hypothesis H is true if and only if the separate hypotheses $H_1, H_2, \ldots$ are true. The induced test accepts H if and only if all the separate hypotheses are accepted. An induced test is either finite or infinite depending on whether there are a finite or infinite number of separate hypotheses. In the case of finite induced tests the exact sampling distributions of the test statistics can be complicated, so that in practice the critical regions of the tests are based on probability inequalities. On the other hand, infinite induced tests are commonly constructed such that the correct critical value can be readily calculated.

Induced tests were developed by Roy (1953), Roy and Bose (1953), Scheffé (1953) and Tukey (1953). Roy referred to induced tests as union-intersection tests. Procedures for constructing simultaneous confidence intervals are closely associated with induced tests and such procedures are often called multiple comparison procedures. Induced tests and their properties are discussed in two papers by Lehmann (1957a, 1957b) and subsequently by Darroch and Silvey (1963) and Seber (1964). A lucid presentation of the union-intersection principle of test construction is given in Morrison (1976). I recommend Scheffé (1959) for a discussion of the contributions of Scheffé and Tukey. A good reference for finite induced tests is Krishnaiah (1979). Miller (1966, 1977) presents an excellent survey of induced tests and simultaneous confidence interval procedures.
The induced tests I will discuss in detail are the Bonferroni test and the Scheffé test. These two induced tests employ the usual t statistics and can always be applied to the classical linear regression model. The Bonferroni test is a finite induced test where the critical value is computed using the well known Bonferroni inequality. While there are inequalities which give a slightly more accurate approximation, the Bonferroni inequality has the advantage that it is very simple to apply. In addition, the Bonferroni test behaves very similarly to finite induced tests based on more accurate approximations. I refer to the F test as the Scheffé test when the F test is used as an infinite induced test. Associated with the Bonferroni and Scheffé tests are the B and S simultaneous confidence intervals, respectively. The Bonferroni test and the B intervals are discussed in Miller (1966) and applications in econometrics are found in Jorgenson and Lau (1975), Christensen, Jorgenson and Lau (1975) and Sargan (1976). The Scheffé test and the S intervals are explained in Scheffé (1959) and the S method is reformulated as the S procedure in Scheffé (1977a). Applications of the Scheffé test and the S intervals in econometrics are given in Jorgenson (1971, 1974) and Jorgenson and Lau (1982). Both the Bonferroni and Scheffé tests are also discussed in Savin (1980).

The organization of the chapter is the following. The relationship between t and F tests is discussed in Section 2. In this section I present a detailed comparison of the acceptance regions of the Bonferroni test and the F test for a special situation. In Section 3 the notion of linear combinations of parameters of primary and secondary interest is introduced. The Bonferroni test is first developed for linear combinations of primary interest and then for linear combinations of secondary interest. The Scheffé test is discussed and the lengths of the B and S intervals are compared. The powers of the Bonferroni test and the Scheffé test are compared in Section 4. The effect of multicollinearity on the power of the tests is also examined. Large sample analogues of the Bonferroni and Scheffé tests can be developed for more complicated models. In Section 5 large sample analogues are derived for a nonlinear regression model. Section 6 presents two empirical applications of the Bonferroni and Scheffé tests.
2. t and F tests

2.1. The model

Consider the regression model:

$$y = X\beta + u, \tag{2.1}$$
where y is a $T \times 1$ vector of observations on the dependent variable, X is a $T \times k$ nonstochastic matrix of rank k, $\beta$ is an unknown $k \times 1$ parameter vector and u is a $T \times 1$ vector of random disturbances which is distributed as multivariate normal with mean vector zero and covariance matrix $\sigma^2 I$ where $\sigma^2 > 0$ is unknown. Suppose we wish to test the hypothesis:

$$H\colon C\beta - c = \theta = 0, \tag{2.2}$$

where C is a known $q \times k$ matrix of rank $q \le k$ and c is a known $q \times 1$ vector. The minimum variance linear unbiased estimator of $\theta$ is:

$$z = Cb - c, \tag{2.3}$$

where $b = (X'X)^{-1}X'y$ is the least squares estimator of $\beta$. This estimator is distributed as multivariate normal with mean vector $\theta$ and covariance matrix $\sigma^2 V$, where $V = C(X'X)^{-1}C'$. An unbiased estimator of $\sigma^2$ is $s^2$ where $(T - k)s^2 = (y - Xb)'(y - Xb)$.
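To fix ideas, here is a minimal numerical sketch of these quantities. The data are simulated and the choices of T, k, C and c are illustrative assumptions, not values from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
T, k = 50, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])  # nonstochastic in the model; simulated here
y = X @ np.array([1.0, 0.0, 0.0]) + rng.normal(size=T)

C = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])           # H: beta_1 = beta_2 = 0
c = np.zeros(2)
q = C.shape[0]

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                     # least squares estimator of beta
z = C @ b - c                             # estimator of theta, z ~ N(theta, sigma^2 V)
V = C @ XtX_inv @ C.T
s2 = (y - X @ b) @ (y - X @ b) / (T - k)  # unbiased estimator of sigma^2
```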
I will compare the acceptance regions of two tests of H. One test is the F test and the other is a finite induced test based on t tests of the separate hypotheses. When H is rejected we usually want to know which individual restrictions are responsible for rejection. Hence, I assume that the separate hypotheses are $H_i\colon \theta_i = 0$, $i = 1, \ldots, q$. It is well known that the F test and the separate t tests can produce conflicting inferences; for example, see Maddala (1977, pp. 122-124). The purpose of comparing the acceptance regions of the two testing procedures is to explain these conflicts.

I first introduce the F test and the finite induced test. Next, I briefly review the distributions and probability inequalities involved in calculating the critical value and significance level of a finite induced test. Then the acceptance regions of the two tests are compared for the case of two restrictions; the exact and Bonferroni critical values are used to perform the finite induced test. Finally, I discuss the effect of a nonsingular linear transformation of the hypothesis H on the acceptance regions of the F test and the finite induced test.

2.2. Tests

2.2.1. F test

The familiar F statistic is

$$F = \frac{z'V^{-1}z}{qs^2}. \tag{2.4}$$

For an $\alpha$ level F test of H the acceptance region is:

$$F \le F_\alpha(q, T - k), \tag{2.5}$$

where $F_\alpha(q, T - k)$ is the upper $\alpha$ significance point of an F distribution with q and $T - k$ degrees of freedom. The F test of H is equivalent to one derived from the confidence region:

$$(z - \theta)'V^{-1}(z - \theta) \le S^2s^2, \tag{2.6}$$

where $S^2 = qF_\alpha(q, T - k)$. The inequality determines an ellipsoid in the $\theta_1, \ldots, \theta_q$ space with center at z. The probability that this random ellipsoid covers $\theta$ is $1 - \alpha$. The F test of H accepts H if and only if the ellipsoid covers the origin.

The F test has power against alternatives in all directions. Accordingly, I consider a finite induced test with the same property. It will become apparent that the acceptance region of the finite induced test is not the same as the acceptance region of the F test.
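As a sketch, the F test of H can be carried out directly from the quantities in the previous code block (z, V, s2, q, T and k are reused from that illustrative setup):

```python
from scipy import stats

alpha = 0.05
F = z @ np.linalg.solve(V, z) / (q * s2)        # F statistic (2.4): z'V^{-1}z / (q s^2)
reject = F > stats.f.ppf(1 - alpha, q, T - k)   # reject H when F > F_alpha(q, T - k), cf. (2.5)
```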

2.2.2. Finite induced test

Assume the finite induced test of H accepts H if and only if all the separate hypotheses $H_1, \ldots, H_q$ are accepted. The t statistic for testing the separate hypothesis $H_i\colon \theta_i = 0$ is:

$$t_i = \frac{z_i}{s\sqrt{v_{ii}}}, \tag{2.7}$$

where $v_{ii}$ is the ith diagonal element of V. The acceptance region of a $\delta$ level equal-tailed test of $H_i$ against the two-sided alternative $H_i^*\colon \theta_i \neq 0$ is:

$$|t_i| \le t_{\delta/2}(T - k), \tag{2.8}$$

where $t_{\delta/2}(T - k)$ is the upper $\delta/2$ significance point of a t distribution with $T - k$ degrees of freedom.

When all the equal-tailed t tests have the same significance level the acceptance region for an $\alpha$ level finite induced test of H is:

$$|t_i| \le M, \qquad i = 1, \ldots, q, \tag{2.9}$$

where the critical value M is such that:

$$P\left[\max(|t_1|, \ldots, |t_q|) \le M \mid H\right] = 1 - \alpha. \tag{2.10}$$

In words, this finite induced test rejects H if the largest squared t statistic is greater than the square of the critical value M. The significance level $\delta$ of each equal-tailed t test is given by:

$$t_{\delta/2}(T - k) = M. \tag{2.11}$$

The acceptance region of the $\alpha$ level finite induced test is the intersection of the separate acceptance regions (2.8). For this reason Krishnaiah (1979) refers to the above test as the finite intersection test. The acceptance region of the finite induced test is a cube in the $z_1, \ldots, z_q$ space with center at the origin and similarly in the $t_1, \ldots, t_q$ space.

The finite induced test of H is equivalent to one based on a confidence region. The simultaneous confidence intervals associated with the finite induced test are given by:

$$z_i - Ms\sqrt{v_{ii}} \le \theta_i \le z_i + Ms\sqrt{v_{ii}}, \qquad i = 1, \ldots, q. \tag{2.12}$$


I call these intervals M intervals. The intersection of the M intervals is the finite induced confidence region. This region is a cube in the $\theta_1, \ldots, \theta_q$ space with center $(z_1, \ldots, z_q)$. The probability that this random cube covers the true parameter point $\theta$ is $1 - \alpha$. The $\alpha$ level finite induced test accepts H if and only if all the M intervals cover zero, i.e. if and only if the finite induced confidence region covers the origin.
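A sketch of the finite induced test, again reusing z, V, s2, q, T and k from the earlier illustrative setup. Since the exact M requires the multivariate t distribution, the Bonferroni approximation of Section 2.3 is used here in its place; this is an assumption of the sketch, not part of the exact test.

```python
from scipy import stats

alpha = 0.05
t_stats = z / np.sqrt(s2 * np.diag(V))        # t_i = z_i / (s sqrt(v_ii)), eq. (2.7)
M = stats.t.ppf(1 - alpha / (2 * q), T - k)   # Bonferroni approximation to the exact M
reject = np.abs(t_stats).max() > M            # reject H if the largest |t_i| exceeds M

# M intervals (2.12): cover each theta_i simultaneously with probability >= 1 - alpha
half = M * np.sqrt(s2 * np.diag(V))
M_intervals = np.column_stack([z - half, z + half])
```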
2.3. Critical values - finite induced test

To perform an $\alpha$ level finite induced test we need to know the upper $\alpha$ percentage point of the $\max(|t_1|, \ldots, |t_p|)$ distribution. The multivariate t and F distributions are briefly reviewed since these distributions are used in the calculation of the exact percentage points. The exact percentage points are difficult to compute except in special cases. In practice inequalities are used to obtain a bound on the probability integral of $\max(|t_1|, \ldots, |t_p|)$ when $t_1, \ldots, t_p$ have a central multivariate t distribution. Three such inequalities are discussed.
2.3.1. Multivariate t and F distributions

Let $x = (x_1, \ldots, x_p)'$ be distributed as multivariate normal with mean vector $\mu$ and covariance matrix $\Sigma = \sigma^2\Omega$ where $\Omega = (\rho_{ij})$ is the correlation matrix. Also, let $s^2/\sigma^2$ be distributed independently of x as chi-square with n degrees of freedom. In addition, let $t_i = x_i\sqrt{n}/s$, $i = 1, \ldots, p$. Then the joint distribution of $t_1, \ldots, t_p$ is a central or noncentral multivariate t distribution with n degrees of freedom according as $\mu = 0$ or $\mu \neq 0$. The matrix $\Omega$ is referred to as the correlation matrix of the "accompanying" multivariate normal. In the central case, the above distribution was derived by Cornish (1954) and by Dunnett and Sobel (1954) independently. Krishnaiah and Armitage (1965a, 1966) gave the percentage points of the central multivariate t distribution in the equicorrelated case $\rho_{ij} = \rho$ $(i \neq j)$. Tables of $P[\max(t_1, t_2) \le a]$ were computed by Krishnaiah, Armitage and Breiter (1969a). The tables are used for a finite induced test against one-sided alternatives. Such a test is discussed in Section 3.
Krishnaiah (1963, 1964, 1965) has investigated the multivariate F distribution. Let $x_u = (x_{1u}, \ldots, x_{pu})'$, $u = 1, \ldots, m$, be m independent random vectors which are distributed as multivariate normal with mean vector $\mu$ and covariance matrix $\Sigma = (\sigma_{ij})$. Also let:

$$w_i = \sum_{u=1}^{m} x_{iu}^2, \qquad i = 1, \ldots, p.$$

The joint distribution of $w_1, \ldots, w_p$ is a central or noncentral multivariate chi-square distribution with m degrees of freedom and with $\Sigma$ as the covariance matrix of the "accompanying" multivariate normal according as $\mu = 0$ or $\mu \neq 0$. Let $F_i = n\sigma_0^2 w_i/(m\sigma_{ii}w_0)$, and let $w_0/\sigma_0^2$ be distributed independently of $(w_1, \ldots, w_p)$ as chi-square with n degrees of freedom. Then the joint distribution of $F_1, \ldots, F_p$ is a multivariate F distribution with m and n degrees of freedom with $\Omega$ as the correlation matrix of the "accompanying" multivariate normal. When m = 1, the multivariate F distribution is equivalent to the multivariate $t^2$ distribution. Krishnaiah (1964) gave an exact expression for the density of the central multivariate F distribution when $\Sigma$ is nonsingular. Krishnaiah and Armitage (1965b, 1970) computed the percentage points of the central multivariate F distribution in the equicorrelated case when m = 1. Extensive tables of $P[\max(|t_1|, |t_2|) \le c]$ have been prepared by Krishnaiah, Armitage and Breiter (1969b). Hahn and Hendrickson (1971) gave the square roots of the percentage points of the central multivariate F distribution with 1 and n degrees of freedom in the equicorrelated case. For further details on the multivariate t and F distributions see Johnson and Kotz (1972).

2.3.2. Probability inequalities

The well known Bonferroni inequality states that:

$$P\left(\bigcap_{i=1}^{p} A_i\right) \ge 1 - \sum_{i=1}^{p} P(A_i^c),$$

where $A_i$ is an event and $A_i^c$ its complement. Letting $A_i$ be the event $|t_i| \le t_{\delta/2}(n)$, $i = 1, \ldots, p$, the Bonferroni inequality gives:

$$P\left[\max(|t_1|, \ldots, |t_p|) \le t_{\delta/2}(n)\right] \ge 1 - \delta p, \tag{2.13}$$

i.e. the probability that the point $(t_1, \ldots, t_p)$ falls in the cube is $\ge 1 - \delta p$. The probability is $\ge 1 - \alpha$ when the significance level $\delta$ is $\alpha/p$. Tables of the percentage points of the Bonferroni t statistic have been prepared by Dunn (1961) and are reproduced in Miller (1966). A more extensive set of tables has been calculated by Bailey (1977).
Sidak (1967) has proved a general inequality which can be specialized to give a slight improvement over the Bonferroni inequality when both are applicable. The Sidak inequality gives:

$$P\left[|t_1| \le c_1, \ldots, |t_p| \le c_p\right] \ge \prod_{i=1}^{p} P\left[|t_i| \le c_i\right]. \tag{2.14}$$

In words, the probability that a multivariate t vector $(t_1, \ldots, t_p)$ with arbitrary correlations falls inside a p-dimensional cube centered at the origin is always at least as large as the corresponding probability for the case where the correlations are zero, i.e. where $x_1, \ldots, x_p$ are independent. When the critical value c is $t_{\delta/2}(n)$ the Sidak inequality gives:

$$P\left[\max(|t_1|, \ldots, |t_p|) \le t_{\delta/2}(n)\right] \ge (1 - \delta)^p. \tag{2.15}$$

The probability is $\ge 1 - \alpha$ when the significance level $\delta$ is $1 - (1 - \alpha)^{1/p}$. The Sidak inequality produces slightly sharper tests or intervals than the Bonferroni inequality because $(1 - \delta)^p \ge 1 - \delta p$. Games (1977) has prepared tables of the percentage points of the Sidak t statistic. Charts by Moses (1976) may be used to find the appropriate t critical value with either the Bonferroni or Sidak inequality.
In the special case where the correlations are zero, i.e. $\Omega = I$, $\max(|t_1|, \ldots, |t_p|)$ has the studentized maximum modulus distribution with parameter p and n degrees of freedom. The upper $\alpha$ percentage point of this distribution is denoted $m_\alpha(p, n)$. Using a result by Sidak (1967), Hochberg (1974) has proved that:

$$P\left[\max(|t_1|, \ldots, |t_p|) \le m_\alpha(p, n)\right] \ge 1 - \alpha, \tag{2.16}$$

where $\Omega$ is an arbitrary correlation matrix, i.e. $\Omega \neq I$. Stoline and Ury (1979) have shown that if $\delta = 1 - (1 - \alpha)^{1/p}$, then $m_\alpha(p, n) \le t_{\delta/2}(n)$, with a strict inequality holding when $n = \infty$. This inequality produces a slight improvement over the Sidak inequality. Hahn and Hendrickson (1971) gave tables of the upper percentage points of the studentized maximum modulus distribution. More extensive tables have been prepared by Stoline and Ury (1979).
A finite induced test with significance level exactly equal to $\alpha$ is called an exact finite induced test and the corresponding critical value is called the exact critical value. For a nominal $\alpha$ level test of p separate hypotheses the Bonferroni critical value is $t_{\delta/2}(T - k)$ with $\delta = \alpha/p$, the Sidak critical value is $t_{\delta/2}(T - k)$ with $\delta = 1 - (1 - \alpha)^{1/p}$, and the studentized maximum modulus critical value is $m_\alpha(p, T - k)$. When the exact critical value is approximated by the Bonferroni critical value the finite induced test is called the Bonferroni test. The Sidak test and the studentized maximum modulus test are defined similarly. For the purpose of this paper we use the Bonferroni test since the Bonferroni inequality is familiar and simple to apply. However, the exposition would be essentially unchanged if the Sidak test or the studentized maximum modulus test were used instead of the Bonferroni test.
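The three approximate critical values are easy to compare numerically. This sketch assumes scipy for the t quantiles; since the studentized maximum modulus point is not built into scipy, it is approximated here by simulation from its definition (independent normals with a common chi-square denominator). The values of alpha, p and n are illustrative.

```python
import numpy as np
from scipy import stats

alpha, p, n = 0.05, 2, 47
B = stats.t.ppf(1 - alpha / (2 * p), n)               # Bonferroni: delta = alpha/p
delta = 1 - (1 - alpha) ** (1 / p)
Sidak = stats.t.ppf(1 - delta / 2, n)                 # Sidak: delta = 1 - (1-alpha)^(1/p)

rng = np.random.default_rng(0)
x = np.abs(rng.standard_normal((200_000, p)))         # independent N(0, 1) numerators
s = np.sqrt(rng.chisquare(n, size=(200_000, 1)) / n)  # common studentizing denominator
m_sim = np.quantile((x / s).max(axis=1), 1 - alpha)   # simulated m_alpha(p, n)
print(B, Sidak, m_sim)  # B >= Sidak >= m_alpha(p, n), as in the text
```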
2.4. Acceptance regions

2.4.1. Case of two restrictions

The acceptance regions of the F test, the Bonferroni test and the exact finite induced test are now compared for the case of q = 2 restrictions. It is assumed that $\sigma^2$ is known and that

$$V = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}^{-1} = \frac{1}{1 - r^2}\begin{pmatrix} 1 & -r \\ -r & 1 \end{pmatrix}. \tag{2.17}$$

Christensen (1973) compared the powers of the F test and the Bonferroni test for this case. I will discuss the power comparisons in Section 4.

Since $\sigma^2$ is assumed known the t statistics are distributed N(0, 1) under the null hypothesis and the F statistic is replaced by the $\chi^2$ statistic. These changes do not change any important features of the tests, at least for the purpose of comparison.

The covariance matrix $\sigma^2 V$ where V is given by (2.17) has a simple interpretation. Consider a model with K = 3 regressors:

$$y = \begin{pmatrix} e & X_1 \end{pmatrix}\beta + u, \tag{2.18}$$

where e is a $T \times 1$ vector of ones, $X_1$ is $T \times 2$ and $\beta = (\beta_0, \beta_1, \beta_2)'$. Suppose the hypothesis is $H\colon \beta_1 = \beta_2 = 0$. If both of the columns of $X_1$ have mean zero and length one, then $\sigma^2 V = \sigma^2(X_1'X_1)^{-1}$, where

$$X_1'X_1 = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}, \tag{2.19}$$

and where r is the correlation between the columns of $X_1$. In a model with K > 3 regressors (including an intercept) the covariance matrix of the least squares estimates of the last two regression coefficients is given by $\sigma^2 V$ with V as in (2.17) provided that the last two regressors have mean zero, length one and are orthogonal to the remaining regressors.
Consider the acceptance regions of the tests in the $z_1$ and $z_2$ space. The acceptance region of an $\alpha$ level $\chi^2$ test is the elliptical region:

$$z'V^{-1}z \le S^2\sigma^2, \tag{2.20}$$

where $S^2 = \chi^2_\alpha(2)$ is the upper $\alpha$ significance point of the $\chi^2$ distribution with two degrees of freedom. The acceptance region of a nominal $\alpha$ level Bonferroni test is the square region:

$$|z_i| \le B\sigma/\sqrt{1 - r^2}, \qquad i = 1, 2, \tag{2.21}$$

where B is the upper $\delta/2$ significance point of the N(0, 1) distribution with $\delta = \alpha/2$. This region is a square with sides $2B\sigma/\sqrt{1 - r^2}$ and center at the origin. The length of the major axis of the elliptical region (2.20) and the length of the sides of the square become infinite as the absolute value of r tends to one.


It will prove to be more convenient to study the acceptance regions of the tests in the $t_1$ and $t_2$ space. The t statistic for testing the separate hypothesis $H_i\colon \theta_i = 0$ is:

$$t_i = \frac{z_i\sqrt{1 - r^2}}{\sigma}, \qquad i = 1, 2, \tag{2.22}$$

where $\sigma/\sqrt{1 - r^2}$ is the standard deviation of $z_i$ and where $t_1$ and $t_2$ are N(0, 1) variates since $\sigma^2$ is known. Dividing both sides of (2.20) by the standard deviation of $z_i$, the acceptance region of the $\chi^2$ test becomes:

$$t_1^2 + 2rt_1t_2 + t_2^2 \le (1 - r^2)S^2, \tag{2.23}$$

which is an elliptical region in the $t_1$ and $t_2$ space. Rewriting the boundary of the elliptical region (2.23) as:

$$t_1 = -rt_2 \pm \sqrt{(1 - r^2)(S^2 - t_2^2)}, \tag{2.24}$$

we see that the maximum absolute value of $t_1$ satisfying the equation of the ellipse is S. By symmetry the same is true for the maximum absolute value of $t_2$. Hence the elliptical region (2.23) is bounded by a square region with sides 2S and center at the origin. I refer to this region as the $\chi^2$ box. Dividing (2.21) by the standard deviation of $z_i$, the acceptance region of the Bonferroni test becomes:

$$|t_i| \le B, \qquad i = 1, 2, \tag{2.25}$$

which is a square region in the $t_1$ and $t_2$ space with sides 2B and center at the origin. I call this region the Bonferroni box. In this special case B < S so that the Bonferroni box is inside the $\chi^2$ box. The acceptance region of the exact $\alpha$ level finite induced test is a square region which I refer to as the exact box. The exact box is inside the Bonferroni box. The dimensions of the ellipse and the exact box are conditional on r. Since the dimensions of the $\chi^2$ box and the Bonferroni box are independent of r, the dimensions of the ellipse and the exact box remain bounded as the absolute value of r tends to one.
Savin (1980) gives an example of a 0.05 level test of H when r = 0. The acceptance region of a 0.05 level $\chi^2$ test of H is:

$$t_1^2 + t_2^2 \le (2.448)^2. \tag{2.26}$$

This region is a circle with radius S = 2.448 and center at the origin. The acceptance region of a nominal 0.05 level Bonferroni test in the $t_1$ and $t_2$ space is a square with sides 2B = 4.482 since $\delta = 0.05/2$ gives B = 2.241. Both the circle and the Bonferroni box are shown in Figure 2.1. When V = I (and $\sigma^2$ is known) the t statistics are independent, so the probability that both t tests accept when H is true is $(1 - \delta)^2$. If $(1 - \delta)^2 = 0.95$, then $\delta = 0.0253$. Hence, for an exact 0.05 level finite induced test the critical value is M = 2.236 and the exact box has sides 2M = 4.472. The difference between the sides of the Bonferroni and the exact box is 0.005. The true significance level of the Bonferroni box is $1 - (0.975)^2 = 0.0494$, which is quite close to 0.05.
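These numbers can be reproduced directly from the standard normal and chi-square distributions; a small sketch using scipy:

```python
import numpy as np
from scipy import stats

alpha = 0.05
B = stats.norm.ppf(1 - alpha / 4)                # delta = alpha/2 gives B = 2.241
delta = 1 - (1 - alpha) ** 0.5                   # (1 - delta)^2 = 0.95 gives delta = 0.0253
M = stats.norm.ppf(1 - delta / 2)                # exact critical value M = 2.236
S = np.sqrt(stats.chi2.ppf(1 - alpha, 2))        # radius of the chi-square circle, 2.448
true_size = 1 - (1 - 2 * stats.norm.sf(B)) ** 2  # 1 - (0.975)^2 = 0.0494
```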
A comparison of the acceptance regions of the $\chi^2$ test and the finite induced test shows that there are six possible situations:

(1) The $\chi^2$ test and both t tests reject.
(2) The $\chi^2$ test and one but not both t tests reject.
(3) The $\chi^2$ test rejects but not the t tests.
(4) Both t tests reject but not the $\chi^2$ test.
(5) One, but not both, t tests reject, but not the $\chi^2$ test.
(6) Neither the t tests nor the $\chi^2$ test reject.

Cases 1 and 6 are cases of agreement while the remaining are cases of disagreement. The $\chi^2$ test and the finite induced test can produce conflicting inferences since they use different acceptance regions. These six cases are discussed in the context of the F test and the finite induced test by Geary and Leser (1968) and Maddala (1977, pp. 122-124).

Figure 2.1 The acceptance regions of the Bonferroni and $\chi^2$ tests where the correlation r = 0 and the nominal size is $\alpha = 0.05$.
From Figure 2.1 we see that H is accepted by the Bonferroni test and rejected by the $\chi^2$ test when A is the point $(t_1, t_2)$, and vice versa when B is the point $(t_1, t_2)$. Case 3 is illustrated by point A and Case 5 by point B. Maddala (1977) remarks that Case 3 occurs often in econometric applications while Case 4 is not commonly observed. Maddala refers to Case 3 as multicollinearity. Figure 2.1 illustrates that Case 3 can occur when r = 0, i.e. when the regressors are orthogonal.

Next consider the acceptance regions of the tests when $r \neq 0$. The following discussion is based on the work of Evans and Savin (1980). When r is different from zero the acceptance region of the $\chi^2$ test is an ellipse. The acceptance regions of a 0.05 level $\chi^2$ test in the $t_1$ and $t_2$ space are shown in Figure 2.2 for r = 0.0(0.2)1.0. In Figure 2.2 the inner box is the nominal 0.05 level Bonferroni box and the outer box is the $\chi^2$ box. The ellipse collapses to a line as r increases from zero to one.

Observe that the case where both t tests reject and the $\chi^2$ test accepts (Case 4) cannot be illustrated in Figure 2.1. From Figure 2.2 we see that Case 4 can be illustrated by point C. Clearly, $r^2$ must be high for Case 4 to occur. Maddala notes that this case is not commonly observed in econometric work.
The true level of significance of the Bonferroni box decreases as r increases in absolute value. The true significance level of a nominal $\alpha$ level Bonferroni box for selected values of $\alpha$ and r is given in Table 2.1. When $\alpha = 0.05$ the true levels are roughly constant for r < 0.6. For r > 0.6, there is a noticeable decrease in the true level. This suggests that the nominal 0.05 level Bonferroni box is a satisfactory approximation to the exact box for r < 0.6. The results are similar when the nominal sizes are $\alpha = 0.10$ and $\alpha = 0.01$.

As noted earlier, the $\chi^2$ test and the Bonferroni test can produce conflicting inferences because the tests do not have the same acceptance regions. The probability of conflict is one minus the probability that the tests agree. When H is true the probabilities that the tests agree and that they conflict are given in Table 2.1 for selected values of $\alpha$ and r. For the case where the nominal size is $\alpha = 0.05$, although the probability of conflict increases as r increases (for r > 0), this probability remains quite small, i.e. less than the significance level. This result appears to be at variance with the widely held belief that conflict between the Bonferroni and F tests is a common occurrence. Of course, this belief may simply be due to a biased memory, i.e. agreement is easily forgotten, but conflict is remembered. On the other hand, the small probability of conflict may be a special feature of the two parameter case.

Figure 2.2 The acceptance regions of the Bonferroni and $\chi^2$ tests in the t-ratio space for various correlations r and nominal size $\alpha = 0.05$.
Figure 2.2 shows a big decrease in the area of intersection of the two acceptance regions as r increases and hence gives a misleading impression that there is a big decrease in the probability that both tests accept as r increases. In fact, the probability that both tests accept is remarkably constant. The results are similar when the nominal sizes are $\alpha = 0.10$ and $\alpha = 0.01$. As can be seen from Table 2.1, the results are also similar when the Bonferroni box is replaced by the exact box.
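The agreement and conflict probabilities of Table 2.1 can be checked by simulation. A minimal sketch, assuming known $\sigma^2 = 1$, the V of (2.17), and an arbitrary number of Monte Carlo draws:

```python
import numpy as np
from scipy import stats

def conflict_prob(r, alpha=0.05, n_draws=500_000, seed=0):
    """Estimate P(chi-square test and Bonferroni box disagree) under H."""
    rng = np.random.default_rng(seed)
    X1tX1 = np.array([[1.0, r], [r, 1.0]])
    V = np.linalg.inv(X1tX1)                          # V of (2.17), sigma^2 = 1
    z = rng.multivariate_normal(np.zeros(2), V, size=n_draws)
    chi2_stat = np.einsum('ij,jk,ik->i', z, X1tX1, z) # z'V^{-1}z
    chi2_rej = chi2_stat > stats.chi2.ppf(1 - alpha, 2)
    t = z * np.sqrt(1 - r**2)                         # t_i = z_i sqrt(1 - r^2)/sigma
    box_rej = np.abs(t).max(axis=1) > stats.norm.ppf(1 - alpha / 4)
    return np.mean(chi2_rej != box_rej)

print(conflict_prob(0.0), conflict_prob(0.8))         # small in both cases
```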

Table 2.1
The probability of conflict between the chi-square and finite induced tests and between
the chi-square and Bonferroni tests, for nominal sizes 0.10, 0.05 and 0.01 and increasing
values of the correlation r.

Nominal          Finite induced test          Bonferroni test
size             Agree       Conflict         Agree       Conflict     True size

0.10             0.964       0.036            0.965       0.035        0.098
                 0.964       0.036            0.965       0.035        0.097
                 0.963       0.037            0.964       0.036        0.096
                 0.961       0.039            0.962       0.038        0.095
                 0.958       0.042            0.961       0.039        0.093
                 0.954       0.046            0.958       0.042        0.091
                 0.948       0.052            0.955       0.045        0.088
                 0.941       0.059            0.951       0.049        0.083
                 0.934       0.066            0.947       0.053        0.078
                 0.926       0.074            0.942       0.058        0.070
                 0.920       0.080            0.939       0.061        0.065
                 0.913       0.087            0.936       0.064        0.057
                 0.909       0.091            0.934       0.066        0.051

0.05             0.978       0.022            0.978       0.022        0.049
                 0.978       0.022            0.978       0.022        0.049
                 0.977       0.023            0.978       0.022        0.049
                 0.976       0.024            0.977       0.023        0.048
                 0.975       0.025            0.976       0.024        0.048
                 0.973       0.027            0.975       0.025        0.046
                 0.971       0.029            0.974       0.026        0.045
                 0.967       0.033            0.972       0.028        0.043
                 0.963       0.037            0.970       0.030        0.040
                 0.959       0.041            0.968       0.032        0.036
                 0.956       0.044            0.966       0.034        0.033
                 0.952       0.048            0.965       0.035        0.029
                 0.950       0.050            0.964       0.036        0.025

0.01             0.994       0.006            0.994       0.006        0.010
                 0.994       0.006            0.994       0.006        0.010
                 0.994       0.006            0.994       0.006        0.010
                 0.994       0.006            0.994       0.006        0.010
                 0.994       0.006            0.994       0.006        0.010
                 0.993       0.007            0.994       0.006        0.010
                 0.993       0.007            0.993       0.007        0.009
                 0.992       0.008            0.993       0.007        0.009
                 0.992       0.008            0.993       0.007        0.008
                 0.991       0.009            0.992       0.008        0.008
                 0.990       0.010            0.992       0.008        0.007
                 0.989       0.011            0.992       0.008        0.006
                 0.989       0.011            0.992       0.008        0.005
2.4.2. Equivalent hypotheses and invariance

In this section I discuss the effect of a nonsingular linear transformation of the hypothesis H on the acceptance regions of the F test and the Bonferroni test. Consider the hypothesis:

$$H^*\colon C^*\beta - c^* = \theta^* = 0, \tag{2.27}$$

where $C^*$ is a known $q^* \times k$ matrix of rank $q^* \le k$ and $c^*$ is a known $q^* \times 1$ vector, so that $\theta_1^*, \theta_2^*, \ldots, \theta_{q^*}^*$ are a set of $q^*$ linearly independent functions. The hypotheses $H^*$ and H are equivalent when $H^*$ is true if and only if H is true. Hence, $H^*$ and H are equivalent if the set of $\beta$ for which $\theta = 0$ is the same as the set for which $\theta^* = 0$.

We now show that H and $H^*$ are equivalent if and only if there exists a nonsingular $q \times q$ matrix A such that $[C^*\ c^*] = A[C\ c]$, and hence $q^* = q$. Our proof follows Scheffé (1959, pp. 31-32). Suppose first that a $q \times q$ nonsingular matrix A exists such that $[C^*\ c^*] = A[C\ c]$. Then $H^*$ is true implies that $\theta^* = C^*\beta - c^* = A(C\beta - c) = 0$. Thus, since A is nonsingular, $C\beta - c = \theta = 0$, which implies that H is true. Similarly, if H is true, then $H^*$ is true.

Suppose next that the equations $C^*\beta = c^*$ have the same solution space as the equations $C\beta = c$. Then the rows of $[C^*\ c^*]$ span the same space as the rows of $[C\ c]$. The $q^*$ rows of $C^*$ are linearly independent and so constitute a basis for this space. Similarly, the q rows of C constitute a basis for the same space. Hence $q^* = q$ and the q rows of $C^*$ must be linear combinations of the q rows of C. Therefore $[C^*\ c^*] = A[C\ c]$, where A is nonsingular since rank $C^*$ = rank C = q.
If the hypotheses $H^*$ and H are equivalent, the F statistic for testing $H^*$ is the same as the F statistic for testing H. Assume that $H^*$ and H are equivalent. The numerator of the F statistic for testing $H^*$ is

$$[C^*b - c^*]'[C^*(X'X)^{-1}C^{*\prime}]^{-1}[C^*b - c^*] = [Cb - c]'A'(A')^{-1}[C(X'X)^{-1}C']^{-1}A^{-1}A[Cb - c] = [Cb - c]'[C(X'X)^{-1}C']^{-1}[Cb - c]. \tag{2.28}$$

This is the same as the numerator of the F statistic for testing H, the denominator of the two test statistics being $qs^2$. Hence the F tests of $H^*$ and H employ the same acceptance region, with the result that we accept $H^*$ if and only if we accept H. This can be summarized by saying that the F test has the property that it is invariant to a nonsingular transformation of the hypothesis.
The finite induced test, and hence the Bonferroni test, does not possess this invariance property. As an example consider the case where q = 2 and $\sigma^2 V = I$ which is known. First suppose the hypothesis is $H\colon \theta_1 = \theta_2 = 0$. Then the acceptance region of the nominal 0.05 level Bonferroni test of H is the intersection of the separate acceptance regions $|z_1| \le 2.24$ and $|z_2| \le 2.24$. Now suppose the hypothesis $H^*$ is $\theta_1^* = \theta_1 + \theta_2 = 0$ and $\theta_2^* = \theta_1 - \theta_2 = 0$. The acceptance region of the nominal 0.05 level Bonferroni test of $H^*$ is the intersection of the separate regions $|z_1 + z_2| \le \sqrt{2}\,(2.24)$ and $|z_1 - z_2| \le \sqrt{2}\,(2.24)$. The hypotheses $H^*$ and H are equivalent, but the acceptance region for testing $H^*$ is not the same as the region for testing H. Therefore, if the same sample is used to test both hypotheses, $H^*$ may be accepted and H rejected, and vice versa.
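A two-line check of this non-invariance; the point z is an illustrative choice:

```python
import numpy as np

z = np.array([2.0, -2.0])   # hypothetical estimate with q = 2 and sigma^2 V = I
B = 2.24                    # upper 0.0125 point of N(0, 1)

accept_H = np.all(np.abs(z) <= B)                        # True: |z_i| = 2 <= 2.24
accept_Hstar = (abs(z[0] + z[1]) <= np.sqrt(2) * B and   # |z1 + z2| = 0 <= 3.17
                abs(z[0] - z[1]) <= np.sqrt(2) * B)      # False: |z1 - z2| = 4 > 3.17
```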


If all hypotheses equivalent to H are of equal interest we want to accept all
these hypotheses if and only if we accept H. In this situation the F test is the
natural test. However, hypotheses which are equivalent may not be of equal
interest. When this is the case the F test may no longer be an intuitively appealing
procedure. Testing linear combinations of the restrictions is discussed in detail in
the next section.

3. Induced tests and simultaneous confidence intervals
3.1. Separate hypotheses
An important step in the construction of an induced test is the choice of the separate hypotheses. So far, I have only considered separate hypotheses about individual restrictions. In general, the separate hypotheses can be about linear combinations of the restrictions as well as the individual restrictions. This means that there can be many induced tests of H, each test being conditional on a different set of separate hypotheses. The set of separate hypotheses chosen should include those hypotheses which are of economic interest. Economic theory may not be sufficient to determine a unique set of separate hypotheses and hence a unique induced test of H.

Let L be the set of linear combinations $\psi$ such that every $\psi$ in L is of the form $\psi = a'\theta$ where a is any known $q \times 1$ non-null vector. In other words, L is the set of all linear combinations of $\theta_1, \ldots, \theta_q$ (excluding the case of a = 0). The set L is called a q-dimensional space of functions if the functions $\theta_1, \ldots, \theta_q$ are linearly independent, i.e. if rank C = q where C is defined in (2.2).
The investigator may not have an equal interest in all the $\psi$ in L. For example, in economic studies the individual regression coefficients are commonly of most interest. Let G be the set of $\psi$ of primary interest and the complement of G relative to L, denoted by L - G, be the set of $\psi$ in L of secondary interest. It is assumed that this twofold partition is fine enough that all $\psi$ in G are of equal interest and similarly for all $\psi$ in L - G. Furthermore, it is assumed that G contains q linearly independent combinations $\psi$.

The set G is either a finite or an infinite set. If G is infinite, then G is either a proper subset of L or equal to L. In the latter case all the $\psi$ in L are of primary interest. All told there are three possible situations: (i) G finite, L - G infinite; (ii) G infinite, L - G infinite; (iii) G infinite, L - G finite. The induced test is referred to as a finite or infinite induced test accordingly as G is finite or infinite.
Let G be a finite set and let $\psi_i$, $i = 1, \ldots, m$, be the linear combinations in G. The finite induced test of

$$H(G)\colon \psi_1 = \cdots = \psi_m = 0 \tag{3.1}$$

accepts H(G) if and only if all the separate hypotheses,

$$H_i\colon \psi_i = 0, \qquad i = 1, \ldots, m, \tag{3.2}$$

are accepted, and rejects H(G) otherwise. Since there are q linearly independent combinations $\psi_i$, $i = 1, \ldots, q$, in G, the hypotheses H(G) and $H\colon \theta = 0$ are equivalent and H(G) is true if and only if H is true. Hence, we accept H if all the separate hypotheses $H_i$, $i = 1, \ldots, m$, are accepted and reject H otherwise. This test procedure is also referred to as the finite induced test of H. Similar remarks apply when G is an infinite set. Since the induced test of H is conditional on the choice of G, it is important that G be selected before analyzing the data.
The set G may be thought of as the set of eligible voters. A linear combination of primary interest votes for (against) H if the corresponding separate hypothesis H(a) is accepted (rejected). A unanimous decision is required for H to be accepted, i.e. all $\psi$ in G must vote for H. Conversely, each $\psi$ in G has the power to veto H. If all $\psi$ in L are of equal interest, then all $\psi$ in L are also in G so there is universal suffrage. On the other hand, the set of eligible voters may have as few as q members. The reason for restricting the right to vote is to prevent the veto power from being exercised by $\psi$ in which we have only a secondary interest.

Instead of having only one class of eligible voters it may be more desirable to have several classes of eligible voters where the weight of each vote depends on the class of the voter. Then the hypothesis H is accepted or rejected depending on the size of the vote. However, such voting schemes have not been developed in the statistical literature. In this paper I only discuss the simple voting scheme indicated above.
It is worth remarking that when the number of $\psi$ in G is greater than q the induced test produces decisions which at first sight may appear puzzling. As an example suppose q = 2 and that the $\psi$ in G are $\psi_1 = \theta_1$, $\psi_2 = \theta_2$, and $\psi_3 = \theta_1 + \theta_2$. Testing the three separate hypotheses $H_i\colon \psi_i = 0$, $i = 1, 2, 3$, induces a decision problem in which one of the eight possible decisions is:

$$H_1 \text{ and } H_2 \text{ are both true, } H_3 \text{ is false.} \tag{3.3}$$

Clearly, when $H_1$ and $H_2$ are both known to be true, then $H_3$ is necessarily true. On the other hand, when testing these three hypotheses it may be quite reasonable to accept that $H_1$ and $H_2$ are both true and that $H_3$ is false. In other words, there is a difference between logical and statistical inference.
3.2. Finite induced test - $\psi$ of primary interest

3.2.1. Exact test

Suppose that a finite number m of $\psi$ in L are of primary interest. In this case G is a finite set. Let the $\psi$ in G be $\psi_i = a_i'\theta$, $i = 1, \ldots, m$. The t statistic for testing the separate hypothesis $H(a_i)\colon \psi_i = a_i'\theta = 0$ is:

$$t_0(a_i) = \frac{\hat\psi_i}{\hat\sigma_{\hat\psi_i}}, \tag{3.4}$$

where $\hat\psi_i = a_i'z$ is the minimum variance unbiased estimator of $\psi_i$ and $\hat\sigma^2_{\hat\psi_i} = s^2a_i'Va_i$ is an unbiased estimator of its variance, where z and V are defined in Section 2.1. For an equal-tailed $\delta$ level test of $H(a_i)$ the acceptance region is:

$$|t_0(a_i)| \le t_{\delta/2}(T - k). \tag{3.5}$$

The finite induced test of H accepts H if and only if all the separate hypotheses $H(a_1), \ldots, H(a_m)$ are accepted. When all the equal-tailed tests have the same significance level the acceptance region for an $\alpha$ level finite induced test of H is:

$$|t_0(a_i)| \le M, \qquad i = 1, \ldots, m, \tag{3.6}$$

where

$$P\left[\max(|t_0(a_1)|, \ldots, |t_0(a_m)|) \le M \mid H\right] = 1 - \alpha. \tag{3.7}$$

The significance level of the separate tests is $\delta$, where $t_{\delta/2}(T - k) = M$. The acceptance region of the finite induced test is the intersection of the separate acceptance regions (3.6). This region is a polyhedron in the $z_1, \ldots, z_q$ space and a cube in the $t_0(a_1), \ldots, t_0(a_m)$ space.

Simultaneous confidence intervals can be constructed for all $\psi$ in G. The finite induced procedure is based on the following result. The probability is $1 - \alpha$ that simultaneously

$$\hat\psi_i - M\hat\sigma_{\hat\psi_i} \le \psi_i \le \hat\psi_i + M\hat\sigma_{\hat\psi_i}, \qquad i = 1, \ldots, m. \tag{3.8}$$

I call these intervals M intervals. The intersection of the M intervals is a polyhedron in the $\theta$ space with center at z. The $\alpha$ level finite induced test accepts H if and only if all the M intervals (3.8) cover zero, i.e. if and only if the finite induced confidence region covers the origin.

An estimate $\hat\psi_i$ of $\psi_i$ is said to be significantly different from zero (sdfz) according to the M criterion if the M interval does not cover $\psi_i = 0$, i.e. if $|\hat\psi_i| \ge M\hat\sigma_{\hat\psi_i}$. Hence, H is rejected if and only if the estimate of at least one $\psi_i$ in G is sdfz according to the M criterion.

The finite induced test can be tailored to provide high power against certain alternatives. This can be achieved by using t tests which have unequal tails and
different significance levels. For example, a finite induced test can be used to test against the one-sided alternative $H^{**}\colon \theta > 0$. The acceptance region of a $\delta$ level one-tailed t test against $H_i^{**}\colon \theta_i > 0$ is:

$$t_i \le t_\delta(T - k). \tag{3.9}$$

When all the one-tailed t tests have the same significance level the acceptance region for an $\alpha$ level finite induced test of H is

$$t_i \le M, \qquad i = 1, \ldots, q, \tag{3.10}$$

where

$$P\left[\max(t_1, \ldots, t_q) \le M \mid H\right] = 1 - \alpha. \tag{3.11}$$

The simultaneous confidence intervals associated with the above test procedure are given by:

$$\theta_i \ge z_i - Ms\sqrt{v_{ii}}, \qquad i = 1, \ldots, q. \tag{3.12}$$

A finite induced test against the one-sided alternatives $H_i^{**}\colon \theta_i < 0$, $i = 1, \ldots, q$, can also be developed. In the remainder of this chapter I only consider two-sided alternatives.

3.2.2. Bonferroni test

The Bonferroni test is obtained from the exact test by replacing the exact critical value M by the critical value B given by the Bonferroni inequality. For a nominal $\alpha$ level Bonferroni induced test of H the acceptance region is:

$$|t_0(a_i)| \le B, \qquad i = 1, \ldots, m, \tag{3.13}$$

where

$$B = t_{\alpha/2m}(T - k). \tag{3.14}$$

The significance level of the separate tests is $\delta = \alpha/m$ and the significance level of the Bonferroni test is $\le \alpha$. The Bonferroni test consists of testing the separate hypotheses using the acceptance region (3.13) where the critical value B is given by (3.14). The acceptance region of the Bonferroni test in the $z_1, \ldots, z_q$ space is referred to as the Bonferroni polyhedron and in the $t_0(a_1), \ldots, t_0(a_m)$ space as the Bonferroni box. The Bonferroni polyhedron contains the polyhedron of the exact finite induced test and similarly for the Bonferroni box.

Ch. 14: Multiple Hypothesis Testing

The probability is

r 1- a that simultaneously

where these intervals are called B intervals. The B procedure consists in using
these B intervals. The Bonferroni test accepts H if and only if all the B intervals
cover zero, i.e. if and only if the Bonferroni confidence region covers the origin.
An estimate of
of J,, is said to be sdfi according to the B criterion if the B
interval does not cover zero, i.e. l$,( 2 B6+,.
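A sketch of the B procedure for m linear combinations $\psi_i = a_i'\theta$; all the numerical inputs below are illustrative assumptions, standing in for the regression output z, V, $s^2$, T and k:

```python
import numpy as np
from scipy import stats

z = np.array([1.5, -2.6])                # illustrative estimate of theta, q = 2
V = np.array([[1.0, 0.3], [0.3, 1.0]])
s2, T, k = 1.2, 50, 3

A = np.array([[1.0, 0.0],                # psi_1 = theta_1
              [0.0, 1.0],                # psi_2 = theta_2
              [1.0, 1.0]])               # psi_3 = theta_1 + theta_2
m, alpha = A.shape[0], 0.05
B = stats.t.ppf(1 - alpha / (2 * m), T - k)           # critical value (3.14)

psi_hat = A @ z                                       # estimates a_i'z
se = np.sqrt(s2 * np.einsum('ij,jk,ik->i', A, V, A))  # s sqrt(a_i'V a_i)
B_intervals = np.column_stack([psi_hat - B * se, psi_hat + B * se])  # (3.15)
sdfz = np.abs(psi_hat) >= B * se                      # B criterion
reject_H = sdfz.any()                                 # reject H if any psi_hat is sdfz
```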
The Bonferroni test can be used to illustrate a finite induced test when m > q, i.e. when the number of separate hypotheses is greater than the number of linear restrictions specified by H. Consider the case where m = 3, q = 2, and $\sigma^2 V = I$ which is known. Suppose that the three $\psi$ in G are $\psi_1 = \theta_1$, $\psi_2 = \theta_2$, and $\psi_3 = \theta_1 + \theta_2$, and that tests of the three separate hypotheses $H_i\colon \psi_i = 0$, $i = 1, 2, 3$, are defined by the three separate acceptance regions:

$$|z_1| \le 2.39, \qquad |z_2| \le 2.39, \qquad |z_1 + z_2| \le \sqrt{2}\,(2.39),$$

respectively, where 2.39 is the upper 0.05/2(3) = 0.00833 significance point of a N(0, 1) distribution. The probability is $\ge 0.95$ that the Bonferroni test accepts H when H is true.

The acceptance region of the Bonferroni test of H, which is the intersection of the three separate acceptance regions, is shown in Figure 3.1. When A is the point $(z_1, z_2)$ the hypothesis H is rejected and the decision is that $H_1$ and $H_2$ are both true and $H_3$ is false.

For comparison consider the case where m = q = 2. The tests of the two separate hypotheses $\psi_1 = \theta_1 = 0$ and $\psi_2 = \theta_2 = 0$ are now defined by the two acceptance regions:

$$|z_1| \le 2.24 \qquad \text{and} \qquad |z_2| \le 2.24,$$

respectively, where 2.24 is the upper 0.05/2(2) = 0.0125 significance point of a N(0, 1) distribution. The acceptance region of this Bonferroni test of H is the inner square region shown in Figure 3.1. With this region we accept H when A is the point $(z_1, z_2)$. When B is the point $(z_1, z_2)$ the hypothesis H is accepted if $\psi_3$ is of primary interest and rejected if $\psi_3$ is of secondary interest. This comparison shows that the Bonferroni test can accept H for one set of $\psi$ of primary interest and reject H for another set.

Figure 3.1 Acceptance regions of the Bonferroni test for the cases m = 2 and m = 3 when q = 2 and $\sigma^2 V = I$ which is known. The nominal size is $\alpha = 0.05$.

3.3. Infinite induced test - Scheffé test

3.3.1. Scheffé test

The Scheffé test is an infinite induced test where all $\psi$ in L are of primary interest. This induced test accepts H if and only if the separate hypothesis,

$$H(a)\colon \psi = a'\theta = 0, \tag{3.19}$$
is accepted for all non-null a. For a $\delta$ level equal-tailed test of H(a) the acceptance region is:

$$|t_0(a)| \le t_{\delta/2}(T - k), \tag{3.20}$$

where

$$t_0(a) = \frac{a'z}{s\sqrt{a'Va}}.$$

When all the equal-tailed tests have the same significance level the acceptance region for an $\alpha$ level infinite induced test of H is:

$$|t_0(a)| \le S, \qquad \text{all non-null } a, \tag{3.21}$$

where

$$P\left[|t_0(a)| \le S, \text{ all non-null } a \mid H\right] = 1 - \alpha. \tag{3.22}$$

What is surprising is that the critical value S is given by the relatively simple expression:

$$S^2 = qF_\alpha(q, T - k). \tag{3.23}$$

The significance level $\delta$ of the separate tests is given by $t_{\delta/2}(T - k) = S$.

The acceptance region is the intersection of the separate acceptance regions (3.21) for all non-null a. A remarkable fact is that the acceptance region of an $\alpha$ level Scheffé test of H is the same as the acceptance region of an $\alpha$ level F test of H. As a consequence we start the Scheffé test with an F test of H. If the F test rejects H the next step is to find the separate hypotheses responsible for rejection. The test procedure consists of testing the separate hypotheses using the acceptance region (3.21) where the critical value S is given by (3.23).
The Scheffé test assumes that all $\psi$ in L are of equal interest, i.e. every $\psi$ in L has the power to veto H. When the Scheffé test is used in empirical econometrics we are implicitly assuming that all $\psi$ in L are of equal economic interest. In practice, this assumption is seldom satisfied. As a consequence, if the Scheffé test rejects, the linear combinations which are responsible for rejection may have no economically meaningful interpretation. A solution to the interpretation problem is to use the appropriate finite induced test.

Simultaneous confidence intervals can be constructed for all $\psi$ in L. The probability is $1 - \alpha$ that simultaneously for all $\psi$ in L:

$$\hat\psi - S\hat\sigma_{\hat\psi} \le \psi \le \hat\psi + S\hat\sigma_{\hat\psi}, \tag{3.24}$$

where S is given by (3.23). These intervals are called S intervals. In words, the probability is $1 - \alpha$ that simultaneously for all $\psi$ in L the S intervals cover $\psi$. The intersection of the S intervals for all $\psi$ in L is the confidence region (2.6). This is an ellipsoidal region in $\theta$ space with center at z.

An estimate $\hat\psi$ of $\psi$ is said to be sdfz if the S interval does not cover $\psi = 0$, i.e. if $|\hat\psi| > S\hat\sigma_{\hat\psi}$. Hence, H is rejected if and only if the estimate of at least one $\psi$ in L is sdfz according to the S criterion.
The Scheffé test and the S intervals are based on the following result:

$$P\left[\max_a t^2(a) \le S^2\right] = 1 - \alpha, \tag{3.25}$$

where $t^2(a)$ is the squared t ratio:

$$t^2(a) = \frac{[a'(z - \theta)]^2}{s^2a'Va}, \tag{3.26}$$

and where S is given by (3.23). The result is proved in Scheffé (1959, pp. 69-70). I will now give a simple proof.

Observe that the result is proved by showing that the maximum squared t ratio is distributed as $qF(q, T - k)$. There is no loss in generality in maximizing $t^2(a)$ subject to the normalization $a'Va = 1$ since $t^2(a)$ is not affected by a change of scale of the elements of a. Form the Lagrangian:

$$L(a, \lambda) = a'(z - \theta)(z - \theta)'a - \lambda s^2(a'Va - 1), \tag{3.27}$$

where $\lambda$ is the Lagrange multiplier. Setting the derivative of $L(a, \lambda)$ with respect to a equal to zero gives:

$$[(z - \theta)(z - \theta)' - \lambda s^2V]a = 0. \tag{3.28}$$

Premultiplying (3.28) by $a'$ and dividing by $s^2a'Va$ shows that $\lambda = t^2(a)$. Hence, the determinantal equation:

$$|(z - \theta)(z - \theta)' - \lambda s^2V| = 0 \tag{3.29}$$

is solved for the greatest characteristic root $\lambda^*$. Since (3.29) has only one non-zero root (the matrix $(z - \theta)(z - \theta)'$ has rank one) the greatest root is:

$$\lambda^* = \frac{(z - \theta)'V^{-1}(z - \theta)}{s^2}, \tag{3.30}$$
which is distributed as $qF(q, T - k)$. The solutions to (3.28), i.e. the characteristic vectors corresponding to $\lambda^*$, are proportional to $(s^2V)^{-1}(z - \theta)$, and the characteristic vector satisfying the normalization $a'Va = 1$ is $a^* = V^{-1}(z - \theta)/\sqrt{(z - \theta)'V^{-1}(z - \theta)}$.

The Scheffé induced test accepts H if and only if:

$$t_0^2(a_0^*) \le S^2, \tag{3.31}$$

where $t_0^2(a)$ is $t^2(a)$ with $\theta = 0$. It follows from (3.30) that:

$$t_0^2(a_0^*) = \max_a t_0^2(a) = \frac{z'V^{-1}z}{s^2}, \tag{3.32}$$

where $a_0^*$ is the vector which maximizes $t_0^2(a)$. Since this maximum squared t ratio is distributed as $qF(q, T - k)$ when H is true, the $\alpha$ level Scheffé test accepts H if and only if the $\alpha$ level F test accepts H.

When the F test rejects H we want to find which $\psi$ are sdfz. Since $a_0^*$ can be calculated from (3.30) we can always find at least one $\psi$ which is sdfz, namely $\hat\psi_0 = a_0^{*\prime}z$. Unfortunately, computer programs for regression analysis calculate the F statistic, but do not calculate $a_0^*$.
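The computation is, however, elementary given z, V and $s^2$. A sketch with illustrative inputs:

```python
import numpy as np

z = np.array([1.5, -2.6])               # illustrative regression output, q = 2
V = np.array([[1.0, 0.3], [0.3, 1.0]])
s2, q = 1.2, 2

Vinv_z = np.linalg.solve(V, z)          # V^{-1} z
a0 = Vinv_z / np.sqrt(z @ Vinv_z)       # a_0^*, normalized so that a'Va = 1
qF = z @ Vinv_z / s2                    # max_a t_0^2(a) = z'V^{-1}z / s^2, eq. (3.32)
psi0_hat = a0 @ z                       # a psi that is sdfz whenever the F test rejects
```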
When the hypothesis H is that all the slope coefficients are zero the components of the $a_0^*$ vector have a simple statistical interpretation. Suppose that the first column of X is a column of ones and let D be the $T \times (k - 1)$ matrix of deviations of the regressors (excluding unity) from their means. Since z is simply the least squares estimator of the slope coefficients, $z = (D'D)^{-1}D'y$. Hence $a_0^* = (D'D)z(s^2qF)^{-1/2} = D'y(s^2qF)^{-1/2}$, so that the components of $a_0^*$ are proportional to the sample covariances between the dependent variable and the regressors. If the columns of D are orthogonal, then the components of $a_0^*$ are proportional to the least squares estimates of the slope coefficients, i.e. z. Thus, in the orthogonal case $\hat\psi_0$ is proportional to the sum of the squares of the estimates of the slope coefficients.
For an example of the Scheffé test I again turn to the case where q = 2 and $\sigma^2 V = I$ which is known. When $\alpha = 0.05$ the test of the separate hypothesis H(a) is defined by the acceptance region:

$$|a_1z_1 + a_2z_2| \le 2.448, \tag{3.33}$$

where $a'Va = a'a = 1$. Thus each separate hypothesis H(a) is tested at the 0.014 level to achieve a 0.05 level separate induced test of H. Geometrically the acceptance region (3.33) is a strip in the $z_1$ and $z_2$ space between two parallel lines orthogonal to the vector a, the origin being midway between the lines. The acceptance region or strip for testing the separate hypothesis H(a) is shown in Figure 3.2. The intersection of the separate acceptance regions or strips for all
non-null a is the circular region in Figure 3.2. Recall that this circular region is the acceptance region of a 0.05 level $\chi^2$ test of H, i.e. the region shown in Figure 2.1. The square region in Figure 3.2 is the acceptance region of a 0.05 level Bonferroni separate induced test of H when the only $\psi$ in L of primary interest are $\psi_1 = \theta_1$ and $\psi_2 = \theta_2$. As noted earlier these two acceptance regions can produce conflicting inferences and hence the same is true for the Bonferroni and Scheffé separate induced tests of H.

Figure 3.2 Separate acceptance regions or confidence intervals when q = 2 and $\sigma^2 V = I$ which is known. The nominal size is $\alpha = 0.05$.
The S interval for $\psi = a'\theta$ is defined by the confidence region:

$$|a'(z - \theta)| \le 2.448, \tag{3.34}$$

which says that the point $\theta$ lies in a strip of the $\theta_1$ and $\theta_2$ space between two parallel lines orthogonal to the vector a, the point $(z_1, z_2)$ being midway between the lines. The intersection of the S intervals for all $\psi$ in L is the circular region in Figure 3.2 when it is centered at $(z_1, z_2)$ in the $\theta_1$ and $\theta_2$ space. The S procedure accepts H if and only if all the S intervals cover zero, i.e. if and only if the circular region in Figure 3.2 (interpreted as a 95% confidence region) covers the origin.
3.3.2. An extension

When the F test rejects, one or more t ratios for individual parameters may be large enough to explain this rejection. As an extension of this result we want to look at F statistics for subsets of the linear restrictions specified by H. If any of these are sufficiently large then we would have found subsets of the restrictions responsible for rejection. To carry out this extension of the S procedure we now present a result due to Gabriel (1964, 1969).

Consider testing the hypothesis:

$$H_1\colon C_1\beta - c_1 = \theta_1 = 0, \tag{3.35}$$

where $[C_1\ c_1]$ consists of any $q_1$ rows of $[C\ c]$ defined in (2.2). Let $F_1$ be the F statistic for testing $H_1$ and let $t_0^2(a_1)$ be the squared t ratio for testing:

$$H(a_1)\colon \psi = a_1'\theta_1 = 0, \tag{3.36}$$

where $a_1$ is $q_1 \times 1$. With no loss of generality we may let $[C_1\ c_1]$ consist of the last $q_1$ rows of $[C\ c]$. From the development (3.23) to (3.26) we find that

$$q_1F_1 = \max_{a \in \Gamma} t_0^2(a), \tag{3.37}$$

where $\Gamma$ is the set of all non-null a vectors such that the first $q - q_1$ elements are zero. Hence:

$$q_1F_1 = \max_{a \in \Gamma} t_0^2(a) \le \max_a t_0^2(a) = qF, \tag{3.38}$$

since the constrained maximum of $t_0^2(a)$ is less than or equal to the unconstrained maximum. This establishes that when H is true the probability is $1 - \alpha$ that the inequality,

$$q_1F_1 \le S^2 = qF_\alpha(q, T - k), \tag{3.39}$$

is simultaneously satisfied for all hypotheses $H_1$ defined in (3.35), where $F_1$ is the F statistic for testing $H_1$.
The implication is that using acceptance region (3.39) we can test any number of multivariate hypotheses $H_1$ with the assurance that all will be simultaneously accepted with probability $1 - \alpha$ when the hypothesis H is true. The hypotheses $H_1$ may be suggested by the data. When we begin the procedure with an $\alpha$ level F test of H, this is a special case of $H_1$ when $q_1 = q$. For further discussion see Scheffé (1977a).
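A sketch of this extension: any sub-hypothesis $H_1$ is tested by comparing $q_1F_1$ with $S^2 = qF_\alpha(q, T - k)$, and all such tests hold simultaneously at level $\alpha$. The $F_1$ statistic is assumed already computed as in Section 2.2, and the numerical values are illustrative.

```python
from scipy import stats

def gabriel_accepts(F1, q1, q, T, k, alpha=0.05):
    """Acceptance region (3.39): q1 * F1 <= S^2 = q F_alpha(q, T - k)."""
    S2 = q * stats.f.ppf(1 - alpha, q, T - k)
    return q1 * F1 <= S2

# e.g. a subset of q1 = 2 of q = 4 restrictions with F statistic F1 = 3.1:
print(gabriel_accepts(F1=3.1, q1=2, q=4, T=50, k=5))
```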
3.3.3. Conditional probability of coverage

The S intervals are usually not calculated when the F test accepts H since none is considered interesting. In light of this practice Olshen (1973, 1977) has argued that we should consider the conditional probability that the S intervals cover the true values given rejection of H. Olshen (1973) has proved that:

$$P\left[\text{all S intervals cover} \mid F \text{ test rejects } H\right] < 1 - \alpha, \tag{3.40}$$

for all $\beta$ and $\sigma^2$ provided $S^2 > 3(T - k)$ and $T - k > 2$. This means that under certain mild conditions the conditional probability of coverage is always less than the unconditional probability. Monte Carlo studies show that the conditional probability can be substantially less than the unconditional probability.

A simple example will serve to illustrate the difference between the conditional and unconditional probability of coverage. Let x be an observation from $N(\mu, 1)$. The probability that the nominal 95% confidence interval for $\mu$ covers $\mu$ given rejection of the hypothesis $\mu = 0$ by a 0.05 level standard normal test is $P(|x - \mu| < 1.96 \mid |x| > 1.96)$. For $\mu = 1$ we have $P(|x| > 1.96) = 0.1700$ and $P(|x - \mu| < 1.96,\ |x| > 1.96) = 0.1435$, so that the conditional probability of coverage is $0.1435/0.1700 = 0.8441$. For $\mu = 4$ the conditional probability is $0.95/0.9793 = 0.9701$. In this example the conditional probability is < 0.95 when $\mu < 3.92$ and > 0.95 when $\mu > 3.92$.
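The example can be verified from the normal cdf; a sketch:

```python
from scipy.stats import norm

def conditional_coverage(mu, c=1.96):
    """P(|x - mu| < c given |x| > c) for x ~ N(mu, 1)."""
    p_reject = norm.sf(c - mu) + norm.cdf(-c - mu)  # P(|x| > c)
    p_joint = norm.cdf(c) - norm.cdf(-c)            # start from P(|x - mu| < c) = 0.95
    lo, hi = max(mu - c, -c), min(mu + c, c)        # overlap of (mu-c, mu+c) with (-c, c)
    if lo < hi:
        p_joint -= norm.cdf(hi - mu) - norm.cdf(lo - mu)  # remove the acceptance region
    return p_joint / p_reject

print(conditional_coverage(1.0))   # about 0.8441
print(conditional_coverage(4.0))   # about 0.9701
```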
In general the S procedure is not satisfactory if one wants to control the conditional probability of coverage since there is no guarantee that the conditional probability is greater than or equal to the unconditional probability, the latter being the only probability subject to control with the S procedure. Olshen's theorem shows that the unconditional probability can be a very misleading guide to the conditional probability. The S intervals are often criticized for being too wide, but they are too narrow if we want to make the conditional probability at least as great as the unconditional. Thus, if like Olshen we are interested in controlling the conditional probability, then we would want to replace the S procedure with one which controls this probability; see Olshen (1973) for a discussion of some developments along these lines.

Suppose we decide before analyzing the data that we have a multiple decision problem. Then the unconditional probability of coverage is of interest. In this situation the F test is simply the first step in the S procedure. If the F test accepts

H it is not customary to calculate the S intervals since it is known that they all
cover zero and if the F test rejects we do not actually calculate all the S intervals
since this is not feasible. On the other hand, suppose we do not decide before
conducting the F test that we have a multiple decision problem, but decide after
the F test rejects that we have such a problem. In this case the conditional
probability of coverage is relevant. Of course, we may be interested in both the
conditional and the unconditional probabilities. In this paper it has been assumed
that we decided to treat the testing problem as a multiple decision problem prior
to looking at the data, i.e. that the unconditional probabilities are the focus of
attention.

3.4. Finite induced test -J, of secondary interest
Suppose after inspecting the data we wish to make inferences about linear
combinations of secondary interest. I now discuss how the finite induced test
can be generalized so that inferences can be made about all ψ in L. For this
purpose I adopt the general approach of Scheffe (1959, pp. 81-83). Following
Scheffe the discussion is in terms of simultaneous confidence intervals.

Let G be a set of ψ in L of primary interest and suppose we have a multiple
comparison procedure which gives for each ψ in G an interval:

    a'z - h_a ≤ a'θ ≤ a'z + h_a,    (3.41)

where h_a is a constant depending on the vector a but not on the unknown θ. The
inequality (3.41), which may be written

    |a'(θ - z)| ≤ h_a,    (3.42)

can be interpreted geometrically to mean that the point θ lies in a strip of the
q-dimensional space between two parallel planes orthogonal to the vector a, the
point z being midway between the planes. The intersection of these strips for
all ψ in G determines a certain convex set C, and (3.41) holds for all ψ in G if
and only if the point θ lies in C. Thus, the problem of simultaneous confidence
interval construction can be approached by starting with a convex set C instead
of a set G of ψ in L. For any convex set C we can derive simultaneous confidence
intervals for the infinite set of all ψ in L by starting with the relation that
the point θ lies in C if and only if it lies between every pair of parallel
supporting planes of C.
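In symbols, the supporting-plane relation just described says (this restatement
is mine, stated for a closed and bounded convex set C):

    θ lies in C  if and only if  min{a'φ : φ in C} ≤ a'θ ≤ max{a'φ : φ in C}
                                 for every a ≠ 0,

so the simultaneous confidence interval for ψ = a'θ derived from C runs from the
minimum to the maximum of a'φ over φ in C.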

Let L* be the set of ψ in L for which a'Va = 1 and G* be a set of m linear
combinations ψ in L* of primary interest. This normalization is convenient since
the M intervals for all ψ in G* have length 2Ms and the S intervals for all ψ in
L* have length 2Ss. We now define the confidence set C of the M procedure to be
the intersection of the M intervals for all ψ in G* and the set C of the S
procedure to be the intersection of the S intervals for all ψ in L*. In the M
procedure C is a polyhedron and in the S procedure C is the confidence ellipsoid
defined by (2.6). When q = 2 the region C is a polygonal region in the B
procedure and an elliptical region in the S procedure. In addition, if m = 2 and
if σ²V = I, then C is a square region in the M and B procedures and a circular
region in the S procedure, as depicted in Figure 2.1.
Consider the case where the confidence region C is a square with sides 2Ms.
Starting with the square we can derive simultaneous confidence intervals for all
ψ in L*, not just for θ₁ and θ₂. The square has four extreme points, which are
the four corner points. There are only two pairs of parallel lines of support
where each supporting line contains two extreme points. These two pairs of lines
define the M intervals for the ψ of primary interest, i.e. θ₁ and θ₂,
respectively, and contain all the boundary points of the square. In addition to
these two pairs of parallel lines of support, there are an infinite number of
pairs of parallel lines of support where each line contains only one extreme
point. One such pair is shown in Figure 3.2. This pair defines a simultaneous
confidence interval for some ψ of secondary interest. We can derive a
simultaneous confidence interval for every ψ of secondary interest by taking
into account pairs of supporting lines where each line contains only one extreme
point.
A general method for calculating simultaneous confidence intervals is given by
Richmond (1982). This method can be used to calculate M intervals for linear
combinations of secondary interest. I briefly review this method and present two
examples for the case of B intervals.

Let G be a set of a finite number m of linear combinations of primary interest
and as before denote the linear combinations in G by ψ_i = a_i'θ, i = 1,2,...,m.
Any linear combination in L can be written as ψ = c₁ψ₁ + c₂ψ₂ + ... + c_mψ_m,
where c = (c₁,...,c_m)', i.e. any ψ in L is a linear combination of the ψ in G.
The method is based on the following result. The probability is 1 - α that
simultaneously for all ψ in L:

    ψ̂ - Ms Σ_i |c_i|(a_i'Va_i)^{1/2} ≤ ψ ≤ ψ̂ + Ms Σ_i |c_i|(a_i'Va_i)^{1/2},    (3.43)

where ψ̂ = c₁ψ̂₁ + ... + c_mψ̂_m and ψ̂_i = a_i'z. I also call these intervals M
intervals. When c = (0,...,0,1,0,...,0)', the 1 occurring in the ith place, the
M interval is for ψ_i, a ψ of primary interest.

This result is a special case of Theorem 2 in Richmond (1982). The result (3.43)
is proved by showing that (3.43) holds for all ψ in L if and only if:

    |ψ̂_i - ψ_i| ≤ Ms(a_i'Va_i)^{1/2},    i = 1,...,m.    (3.44)

I will give a sketch of the proof. Suppose (3.44) holds. Multiply both sides of
(3.44) by |c_i| and sum over i = 1,...,m. Then:

    |ψ̂ - ψ| = |Σ_i c_i(ψ̂_i - ψ_i)| ≤ Σ_i |c_i||ψ̂_i - ψ_i| ≤ Ms Σ_i |c_i|(a_i'Va_i)^{1/2},

which is equivalent to (3.43). Conversely, suppose (3.43) holds for all ψ in L.
Take c_i = 1 and c_j = 0, j = 1,...,m, j ≠ i. Then (3.43) reduces to (3.44),
which completes the proof.
For both examples I assume that q = 2 and σ²V = I, which is known. In the first
example suppose the m = 2 linear combinations in G are ψ₁ = θ₁ and ψ₂ = θ₂.
Consider the B interval for ψ = √(1/2)(ψ₁ + ψ₂) = √(1/2)(θ₁ + θ₂). When
δ = 0.05/2 the Bonferroni critical value is B = 2.24, so that the length of the
B interval is 2(c₁ + c₂)B = 2(2)√(1/2)(2.24) = 6.336. This is the case shown in
Figure 3.2 when the square region is centered at (z₁, z₂) in the θ₁ and θ₂
space, i.e. when the square region is interpreted as a nominal 95% confidence
region. In the second example suppose m = 3 and ψ is of primary interest. When
δ = 0.05/3 the Bonferroni critical value is B = 2.39, so that the length of the
B interval for ψ is 2(2.39) = 4.78, which is considerably less than when ψ is of
secondary interest. This shows that the length of a B interval for a ψ in L can
vary considerably depending on whether ψ is of primary or secondary interest. In
particular, the length of a B interval for a ψ depends critically on the values
of the c_i's.
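The interval arithmetic in (3.43) is easily mechanized. The following sketch is
my own illustration of the computation (the function name and inputs are ad hoc;
the σ²V = I setup and the critical value B = 2.24 come from the first example
above):

    import numpy as np

    def m_interval(psi_hat, c, A, V, M, s=1.0):
        # Interval (3.43) for psi = sum_i c_i * (a_i' theta).
        # c: weights; A: m x q array whose rows are the a_i';
        # V: covariance factor; M: critical value (e.g. a Bonferroni B);
        # s: estimate of sigma (1.0 when sigma^2 V is known).
        half = M * s * sum(abs(ci) * np.sqrt(a @ V @ a) for ci, a in zip(c, A))
        return psi_hat - half, psi_hat + half

    # First example above: q = 2, sigma^2 V = I, G = {theta_1, theta_2},
    # psi = sqrt(1/2)(theta_1 + theta_2), B = 2.24 for delta = 0.05/2.
    A = np.eye(2)
    c = np.sqrt(0.5) * np.ones(2)
    lo, hi = m_interval(0.0, c, A, np.eye(2), M=2.24)
    print(hi - lo)    # about 6.34, the length computed in the text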

3.5. Simultaneous confidence intervals
In this section I compare the lengths of the finite induced intervals and the S
intervals. The lengths are compared for the linear combinations of primary
interest and secondary interest. In many cases the B intervals are shorter for
the ψ of primary interest. On the other hand, the S intervals are always shorter
for at least some ψ of secondary interest.

3.5.1. ψ of primary interest

Consider the set G of linear combinations of primary interest in the finite
induced test. The ratio of the length of the M intervals to the length of the S
intervals for ψ in G is simply the ratio of M to S. For fixed q the values M and
S satisfy the relation:

    P( max over a in I of |t(a)| ≤ M ) = 1 - α = P( max over a in L of |t(a)| ≤ S ),

where I is a set of m vectors. Since the restricted maximum is equal to or less
than the unrestricted maximum, it follows that M ≤ S. Hence, the M intervals are
shorter than the S intervals for all q and m (m ≥ q).
The B intervals can be longer than the S intervals for all ψ in G. Suppose G is
fixed. Then S is fixed and from the Bonferroni inequality (2.13) we see that B
increases without limit as m increases. Hence, for sufficiently large m the B
intervals are longer than the S intervals for all ψ in G. On the other hand,
numerical computations show that for sufficiently small m the B intervals are
shorter than the S intervals for all ψ in G. The above also holds for intervals
based on the Sidak or the studentized maximum modulus inequality. Games (1977)
has calculated the maximum number of ψ of primary interest (the number m) such
that the intervals based on the Sidak inequality are shorter than the S
intervals for all ψ in G.
The effect of varying m (the number of ψ of primary interest) is illustrated by
the following examples. Suppose q = 2 and σ²V = I, which is known. If G consists
of m = 4 linear combinations and if nominally α = 0.05, then applying the
Bonferroni inequality gives B = 2.50. Since S = 2.448 the S intervals are
shorter than the B intervals for all ψ in G; the ratio of B to S is 1.02. The
ratio of the length of the exact finite induced intervals to the S intervals
when m = 2 and α = 0.05 is 0.913 since M = 2.236. If instead of calculating the
exact 95% finite induced confidence region we use the Bonferroni inequality,
then B = 2.241, which is also less than S. See Figures 4 and 5 in Miller (1966,
pp. 15-16).
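The critical values quoted in these examples can be reproduced directly (a
sketch under the known-σ² assumption used above, so that B is a standard normal
point and S a chi-square point):

    import numpy as np
    from scipy.stats import norm, chi2

    q, alpha = 2, 0.05
    S = np.sqrt(chi2.ppf(1 - alpha, q))       # 2.448
    for m in (2, 3, 4):
        B = norm.ppf(1 - alpha / (2 * m))     # Bonferroni critical value
        print(m, round(B, 3), round(B / S, 3))
    # m = 2 gives B = 2.241 and m = 4 gives B = 2.498, so the B intervals
    # overtake the S intervals in length between m = 3 and m = 4.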
In the case where m = q and α = 0.05, calculations by Christensen (1973) show
that the B intervals are shorter than the S intervals regardless of the size of
q. Similar results are reported by Morrison (1976, p. 136) for 95% Bonferroni
and Roy-Bose simultaneous confidence intervals on means. The Roy-Bose
simultaneous confidence intervals are the same as the S intervals in the case of
the classical linear normal regression model.
Investigators object to the length of the S intervals. When the Scheffe test
rejects, the linear combinations responsible for rejection may be of no economic
interest. This may account for the fact that the Scheffe test and the S
intervals are not widely used. In theory the solution is to use a procedure
where the set G is suitably restricted. In practice it is difficult to construct
such a procedure. One approach is to use a finite induced test. The drawback is
that to be operational we have to apply approximations based on probability
inequalities. As already noted, when m is large relative to q the B intervals
are longer than the S intervals, and similar results hold for intervals based on
the Sidak or studentized maximum modulus inequality. Another approach is to
construct an infinite induced test where G is a proper subset of L. No procedure
analogous to the S procedure has been developed for this case. It seems that
there is no very satisfactory alternative to the S intervals when m is
sufficiently large.

3.5.2. ψ of secondary interest
When the B intervals are shorter for the ψ of primary interest and the S
intervals are shorter for some ψ of secondary interest there is a trade-off
between the B procedure and the S procedure. It is instructive to compare the
length of the simultaneous confidence intervals derived from the square region
with sides 2B = 4.482 with the intervals derived from the circular region with
diameter 2S = 4.895. The B procedure is the procedure which gives for each ψ in
L* an interval derived from the square region. The B intervals for ψ in L*
include the B intervals for θ₁ and θ₂, which are the ψ of primary interest. The
length of the shortest B interval is equal to the length of the side of the
square region and the length of the longest B interval is equal to the length of
the diagonal, which is 6.336. Since the length of the S intervals for all ψ in
L* is 4.895, the S intervals are shorter than the B intervals for some ψ in L*;
in particular, the S interval is shorter for ψ = √(1/2)(θ₁ + θ₂), the B interval
for this ψ being the one shown in Figure 3.2.
When G is finite there are a few cases in the one-way layout of the analysis of
variance where the exact significance level of the induced test of H can be
easily calculated. In these cases it is also easy to calculate the probability
that simultaneously for all ψ in L the confidence intervals cover the true
values. These cases include the generalized Tukey procedure [see Scheffe (1959,
theorem 2, p. 74)], where the ψ of primary interest are the pairwise comparisons
(θ_i - θ_j), i, j = 1,...,q, i ≠ j, and the "extended Dunnett procedure"
developed by Schaffer (1977), where the ψ of primary interest are the
differences (θ_i - θ₁), i = 2,...,q. Schaffer (1977) found that the Tukey
intervals are shorter than the S intervals for the ψ of primary interest in the
generalized Tukey procedure and likewise that the Dunnett intervals are shorter
than the S intervals for the ψ of primary interest in the extended Dunnett
procedure. On the other hand, the S procedure generally gives shorter intervals
for the ψ of secondary interest.

Richmond (1982) obtained similar results when extending the Schaffer study to
include the case where the ψ of primary interest are taken to be the same as in
the extended Dunnett procedure and the intervals are calculated by applying the
Sidak inequality.

For further comparisons between Tukey and S intervals see Scheffe (1959, pp.
75-77) and Hochberg and Rodriquez (1977).
4. The power of the Bonferroni and Scheffe tests

4.1. Background

Since the power of the Scheffe test is the same as the power of the F test, it is
uniformly most powerful in certain situations. However, it is not uniformly more
powerful than the Bonferroni test. An attractive feature of the Bonferroni test is
that when it rejects, the linear combinations responsible for rejection are of
economic interest. This feature has to be weighed against the power of the test, i.e.
the probability that the test rejects H when H is false.
Christensen (1973) and Evans and Savin (1980) have compared the power of the χ²
and Bonferroni tests for the case where q = 2, σ² is known and V is defined as
in (2.17). The acceptance regions of both of these tests have been discussed in
Section 2.4. In this section I review the power of the F test and the results of
the Christensen study.

The power of the F test is a function of four parameters: the level of
significance α, the numerator and denominator degrees of freedom q and T - k,
and the noncentrality parameter λ, which is given by:

    λ = θ'(σ²V)⁻¹θ,    (4.1)

when θ is the true parameter vector. The power of the F test depends on θ and
σ²V only through this single parameter. Therefore it has been feasible to table
the power of the F test; for selected cases it can be found from the Pearson and
Hartley (1972) charts or the Fox (1956) charts. In addition, the power can be
calculated for cases of interest using the procedures due to Imhof (1961) and
Tiku (1965). By contrast, little is known about the power of the Bonferroni test
and it has proved impracticable to construct tables of the power of the test.
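Although tables of the Bonferroni test's power remain impractical, the power of
the F test itself is now a routine computation from the noncentral F
distribution. A minimal sketch (the particular q, T - k and λ values are
illustrative only):

    from scipy.stats import f, ncf

    def f_test_power(alpha, q, df2, lam):
        # lam is the noncentrality parameter of (4.1)
        crit = f.ppf(1 - alpha, q, df2)      # alpha-level critical value
        return ncf.sf(crit, q, df2, lam)     # P(noncentral F > crit)

    print(f_test_power(0.05, 2, 14, 5.0))    # illustrative values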
Christensen studied the powers of the 0.05 level χ² test and the nominal 0.05
level Bonferroni test along rays in the parameter space. Power calculations by
Christensen show that neither test is more powerful against all alternatives.
For example, when r = 0 the Bonferroni test is more powerful against the
alternative θ₁ = θ₂ = 1.5850. This is not surprising since neither of the
acceptance regions contains the other. Despite this, Christensen found that when
the absolute value of r was small the power of the two tests was approximately
the same regardless of the alternative. However, when the absolute value of r
was high the Bonferroni test had very little power against any alternatives
considered by Christensen. If only θ₁ or θ₂ is different from zero then the χ²
test has good power regardless of the value of r. When both θ₁ and θ₂ are
different from zero the power of the χ² test is mixed. Against some alternatives
the power is extremely good, increasing with the absolute value of r. On the
other hand, the power against other alternatives decreases badly with increasing
absolute value of r. One potential explanation for the low power of the
Bonferroni test is that the actual level of significance of the Bonferroni box
decreases as the absolute value of r increases. As noted earlier, for r = 0 the
actual level is 0.0494 and as the absolute value of r approaches one the actual
level approaches 0.025.

4.2. Power contours

The behavior of the power function is described by its contours in the parameter
space. A power contour is the set of all parameter points θ at which the power
is constant. The power contours of the F test can be obtained from the
expression for the noncentrality parameter (4.1). This is because the power of
the F test is the same at all parameter points θ with a given value of the
noncentrality parameter. The power of the F test is constant on the surfaces of
ellipsoids in the θ space, but the general properties of the power contours of
the Bonferroni test are unknown.

Evans and Savin calculate the power contours of the 0.05 level χ² test and the
nominal 0.05 level Bonferroni test in the transformed parameter space. The power
contours for correlations r = 0.0, 0.9, 0.99 at power levels 0.90, 0.95, 0.99
are shown in Figure 4.1(a-c). When r = 0.0 [Figure 4.1(a)] the power contours of
the χ² test are circles with center at the origin while the contours of the
Bonferroni test are nearly circular. At a given power level the χ² and the
Bonferroni power contours are close together. Thus, both tests have similar
powers, which confirms the results of Christensen. We also see that the contours
for a given power level cross, so that neither test is uniformly more powerful.

When the correlation is r = 0.90 [Figure 4.1(b)] the power contours of the
Bonferroni test are not much changed whereas those of the χ² test have become
narrow ellipses. Hence for a given power level the contours of the two tests are
no longer close together. The χ² test is more powerful at parameter points in
the upper right-hand and lower left-hand parts of the space and the Bonferroni
test at points in the extreme upper left-hand and lower right-hand corners of
the space. For r = 0.99 [Figure 4.1(c)] we see that the power contours of the
Bonferroni test continue to remain much fatter than those of the χ² test even
when the power is quite close to one. In short, when the correlation r is
different from zero the χ² test has higher power than the Bonferroni test at
most alternatives.
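The contrast between the two tests at any alternative can be reproduced by
simulation. This sketch is my own illustration, not the Evans-Savin program; it
uses the known-σ², q = 2 setup of this section, and the alternative chosen is
the one mentioned in the Christensen discussion above:

    import numpy as np
    from scipy.stats import chi2, norm

    def powers(theta, r, alpha=0.05, reps=200_000, seed=0):
        rng = np.random.default_rng(seed)
        V = np.array([[1.0, r], [r, 1.0]])
        z = rng.multivariate_normal(theta, V, size=reps)
        # chi-square test: reject when z' V^{-1} z exceeds its critical value
        w = np.einsum('ij,jk,ik->i', z, np.linalg.inv(V), z)
        p_chi2 = np.mean(w > chi2.ppf(1 - alpha, 2))
        # Bonferroni box: reject when some |z_i| exceeds the alpha/(2m) point
        b = norm.ppf(1 - alpha / 4)          # m = 2 separate tests
        p_bonf = np.mean(np.abs(z).max(axis=1) > b)
        return p_chi2, p_bonf

    print(powers(np.array([1.5850, 1.5850]), r=0.0))  # Bonferroni edge here
    print(powers(np.array([1.5850, 1.5850]), r=0.9))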

Figure 4.1(a) The 90, 95 and 99% power contours (in the transformed parameter
space) of the Bonferroni and χ² tests for r = 0.0 and nominal size α = 0.05.

4.3. Average powers

When neither test is uniformly more powerful the performance of the tests can be
compared on the basis of average power. Since V is a positive definite matrix
there exists a nonsingular matrix P such that P'VP = I; let θ* = P'θ. Then the
noncentrality parameter can be written as:

    λ = θ*'θ*/σ².

Figure 4.1(b) The 90, 95 and 99% power contours (in the transformed parameter
space) of the Bonferroni and χ² tests for r = 0.90 and nominal size α = 0.05.

Thus, the power of the F test is constant on the surface of spheres with center
at the origin in the θ* space. In other words, in the transformed space the
power of the F test is the same at all alternatives which are the same distance
from the null hypothesis (the origin). The F test maximizes the average power on
every sphere in the transformed space, where the average power is defined with
respect to a uniform measure over spheres in this space; see Scheffe (1959, pp.
47-49). Hence the F test is best when we have the same interest in all
alternatives which are the same distance from the null in the transformed
parameter space.

It may be more natural to suppose that we have an equal interest in all
alternatives which are equally distant from the null in the θ parameter space.
On this assumption the best test is the one which maximizes the average power on
every sphere in the θ parameter space.

Figure 4.1(c) The 90, 95 and 99% power contours (in the transformed parameter
space) of the Bonferroni and χ² tests for r = 0.99 and nominal size α = 0.05.

Evans and Savin (1980) define the average power with respect to a uniform
measure over the sphere in the θ space. Using this definition Evans and Savin
calculate the average power of an α level χ² test, a nominal α level Bonferroni
test and an exact α level finite induced test. The results are reported in Table
4.1 for selected values of the radius R of the circle, the correlation r and the
significance level α.

When r = 0 the average power of both tests is very similar. This is because both
tests have very similar power contours in this case, namely circles for the χ²
test and nearly circular contours for the Bonferroni test. On the other hand,
when r is near one and the radius R of the circle is small the average power of
the χ² test is markedly

Ch. 14: Multiple Hvpothesis Testing

Table 4.1
Average

Powers o f the Bonferroni(B),Chi-Square(CS)
and Exact Finite Induced(E) Tests.

lugher than the average power of the Bonferroni test. This is because over a circle
of a given radius the average power of the χ² test increases as r increases
while the average power of the Bonferroni test is virtually constant for all r.
As the radius R of the circle increases the average power of the Bonferroni test
approaches that of the χ² test.
The average power of the exact finite induced test is similar to the average
power of the Bonferroni test. For α = 0.05 the maximum difference between the
average power of the exact test and the Bonferroni test occurs at r = 0.90 for a
circle of given radius. The average power of the exact test is about 0.065
(11.5%) higher than the average power of the Bonferroni test when the radius is
R = 0.25 and 0.027 (3%) higher when the radius is R = 3.5. The corresponding
figures are somewhat higher if α = 0.10 and lower if α = 0.01.

Figure 4.2 The power of the Bonferroni (broken lines) and the χ² (full lines)
tests at radii R = 2(0.5)5 as a function of the direction in degrees. The
correlation is r = 0.9 and the nominal sizes are α = 0.10, 0.05 and 0.01.

As a consequence, when the
correlation r is near one the exact test is also a poor competitor of the χ²
test over smaller radius circles.

Evans and Savin have plotted the behavior of the power over the circle for an α
level χ² test and a nominal α level Bonferroni test. The power over various
circles is shown in Figure 4.2 for the case r = 0.90 and α = 0.10, 0.05 and
0.01. The χ² test has excellent power at most points on each circle. The power
dips sharply only in the neighborhood of 135 and 315 degrees. The Bonferroni
test has better power than the χ² test only in the neighborhood of 135 and 315
degrees, and even here the power of the Bonferroni test is only marginally
better than that of the χ² test. The Bonferroni test has more uniform, but
substantially lower, power over the smaller radius circles. For larger radius
circles the power of the Bonferroni test is higher and hence compares more
favorably to the χ² test. The picture for the exact finite induced test is
similar, with slightly higher power than the Bonferroni test at all points on
the circle.
When the finite induced intervals are shorter than the S intervals for the ψ of
primary interest it is common practice to conclude that the finite induced
procedure (test) is superior to the S procedure (Scheffe test); for example, see
Stoline and Ury (1979). Of course, if the finite induced intervals are shorter
for all ψ in L, then the finite induced test is uniformly more powerful.
However, the S intervals are generally shorter for some ψ of secondary interest.
When the S intervals are shorter for some ψ of secondary interest the Scheffe
test may have higher average power. This is clearly demonstrated by the
comparison of the average powers of the χ² test and the Bonferroni test for the
case of q = 2 parameters. Hence, it is misleading to conclude that the finite
induced test is superior simply because the finite induced intervals are shorter
for the ψ of primary interest. To our knowledge there is no evidence that any of
the well-known competitors of the Scheffe test have higher average power.
4.4. The problem of multicollinearity

The problem of multicollinearity arises when the explanatory variables are
correlated, i.e. the columns of the regressor matrix X are not orthogonal. In
discussions of the collinearity problem the individual regression coefficients
are taken to be the parameters of primary interest. This is a point of crucial
importance. A full rank regressor matrix can always be transformed so as to
eliminate multicollinearity, but the regression coefficients in the transformed
problem may no longer be of primary interest.

Leamer (1979) provides an excellent discussion of the collinearity problem from
a Bayesian point of view. He observes (pp. 71-72):

...that there is a special problem caused by collinearity. This is the problem
of interpreting multi-dimensional evidence. Briefly, collinear data provide
relatively good information about linear combinations of coefficients. The
interpretation problem is the problem of deciding how to allocate that
information to individual coefficients. This depends on prior information. A
solution to the interpretation problem thus involves formalizing and utilizing
effectively all prior information. The weak-evidence problem however remains,
even when the interpretation problem is solved. The solution to the
weak-evidence problem is more and better data. Within the confines of the given
data set there is nothing that can be done about weak evidence.
The interpretation problem can be interpreted as a multiple decision problem
where there are q separate hypotheses, each specifying that an individual
regression coefficient is equal to zero. In classical inference the finite and
infinite induced tests are two approaches to solving the interpretation problem.
The finite induced test provides a guaranteed solution to the interpretation
problem whereas the infinite induced test has a probability of less than one of
providing a solution. Multicollinearity plays an important role because of its
effect on the power of the tests. Consider the Christensen two parameter case
where the null hypothesis is H: β₁ = β₂ = 0. The correlation r = 0 if the
relevant two regressors are orthogonal. The Bonferroni and Scheffe tests have
similar average power for orthogonal or nearly orthogonal data. As the
correlation r increases the average power of the Bonferroni test decreases
compared with that of the Scheffe test. This means that for multicollinear data
the Bonferroni test solves the interpretation problem at a cost; the cost is
lower average power than for the Scheffe test. Hence there is a trade-off
between the probability of solving the interpretation problem and the power of
the test. The advantage of orthogonal data is that we can always decide which
individual regression coefficients are responsible for rejection at a very small
sacrifice of average power.
What we want to know is the conditional probability that the Scheffe test solves
the interpretation problem given that it has rejected the null hypothesis. The
conditional probability that the Scheffe test rejects H₁: β₁ = 0 or H₂: β₂ = 0
or both, given that it has rejected H, is the probability that the point
(t₁, t₂) is outside the χ² box divided by the probability that it is outside the
χ² ellipse. This conditional probability is calculated for significance levels
α = 0.10, 0.05, 0.01, correlations r = 0.0, 0.5, 0.9 and radii R = 0.0 (0.5)
4.50. In the transformed parameter space a point can be described by the angle
of a ray from the origin to the point and the distance of the point along this
ray. Because of the symmetry of the problem the calculations were done for
angles between 45 and 135 degrees inclusive. Selected results are reported in
Tables 4.2 and 4.3.
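The conditional probability just defined can also be estimated by simulation. In
this sketch (my own illustration) the χ² box is the square with half-width
S = [χ²_α(2)]^{1/2}, so that lying outside the box means at least one separate
hypothesis is rejected by the S criterion:

    import numpy as np
    from scipy.stats import chi2

    def cond_prob_interpret(theta, r, alpha=0.05, reps=200_000, seed=0):
        rng = np.random.default_rng(seed)
        V = np.array([[1.0, r], [r, 1.0]])
        t = rng.multivariate_normal(theta, V, size=reps)
        crit = chi2.ppf(1 - alpha, 2)
        ellipse = np.einsum('ij,jk,ik->i', t, np.linalg.inv(V), t) > crit
        box = np.abs(t).max(axis=1) > np.sqrt(crit)   # some |t_i| > S
        # P(outside box given outside ellipse)
        return (box & ellipse).sum() / ellipse.sum()

    print(cond_prob_interpret(np.array([1.0, 1.0]), r=0.9))  # illustrative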
Table 4.2. Average conditional probabilities of rejecting β₁ = 0 or β₂ = 0 (or
both) given that the chi-square test rejects, by radius R for correlations
r = 0.0, 0.5 and 0.9. ACP denotes the average conditional probability of
rejecting β₁ = 0 or β₂ = 0 (or both) given that the chi-square test rejects; AP
the average probability of rejecting β₁ = 0 or β₂ = 0 (or both); and ACS the
average probability that the chi-square test rejects.

The results in Table 4.2 show that on small radius circles the average
conditional probability can decrease as the correlation r increases. For
example, at α = 0.05 and R = 1.0 the average conditional probability is 0.637
when r = 0.0 and only 0.234 when r = 0.9, the decrease being 63%. The decrease
is 58.1% when

α = 0.10 and 69.4% when α = 0.01. On large radius circles the average
conditional probability increases as r increases from r = 0, eventually
decreasing. Holding the average power of the Scheffe test constant, the average
conditional probability decreases as the correlation r increases. For instance,
when α = 0.05 and the average power is roughly 0.45 the average conditional
probability falls from about 0.75 to 0.24 as r moves from r = 0.0 to r = 0.9.
For higher power this fall is less dramatic, and for sufficiently high power it
can reverse.
Table 4.3. Conditional probability (CP) of rejecting β₁ = 0 or β₂ = 0 (or both),
given that the chi-square (CS) test rejects, for correlations r = 0.0, 0.5 and
0.9 and angles of 45, 60, 75, 90, 105, 120 and 135 degrees.

The more detailed results
in Table 4.3 show that high power at a given alternative does not ensure a high
conditional probability at that alternative. When the correlation is fixed at
r = 0.9 there is an inverse relation between the power and the conditional
probability even on large radius circles; namely, the higher the power, the
lower the conditional probability.
The Bonferroni test solves the interpretation problem whatever the power of the
test. But the test is unsatisfactory when the power is low, since in this case
the test is likely to be misleading. This suggests that we may want to trade off
some probability of solving the interpretation problem for some extra power.
When the average power of the Bonferroni test is high the average power of the
Scheffe test will also be high. In this case the Scheffe test will have a high
average conditional probability of solving the interpretation problem. When the
Scheffe test has high power but the Bonferroni test has low power, then the
sacrifice of power due to using the Bonferroni test may be difficult to justify.
Therefore the Scheffe test may be more attractive than the Bonferroni test in
the presence of multicollinear data. When the average power of the Scheffe test
is low then what is needed is more and better data. The weak-evidence problem
and the low power problem are two sides of the same coin.

5. Large sample induced tests
Large sample analogues of the finite induced tests and the Scheffe test can be
constructed for a variety of models. These include single equation and
multivariate nonlinear models, linear and nonlinear simultaneous equations
models, time series models, and qualitative response models. As an illustration
I will briefly discuss large sample analogues of the tests in the context of the
standard nonlinear regression model:

    y_t = f(x_t, β₀) + u_t,    t = 1,...,T,    (5.1)

where y_t is a scalar endogenous variable, x_t is a vector of exogenous
variables, β₀ is a k×1 vector of unknown parameters and the u_t's are
unobservable scalar independently identically distributed random variables with
mean zero and variance σ².
The nonlinear least squares estimator, denoted by β̂, is defined as the value of
β that minimizes the sum of squared residuals:

    S_T(β) = Σ_{t=1}^T [y_t - f(x_t, β)]²,    (5.2)

where the β that appears in (5.2) is the argument of the function f(x_t, ·). In
contrast, β₀ is the true fixed value. The consistency and asymptotic normality
of the nonlinear least squares estimator are rigorously proved in Jennrich
(1969). Therefore, we have:

    √T(β̂ - β₀) → N(0, σ²A⁻¹)    in distribution,    (5.3)

where

    A = lim_{T→∞} T⁻¹ Σ_{t=1}^T [∂f(x_t, β)/∂β][∂f(x_t, β)/∂β']|_{β=β*}    (5.4)

is a k×k matrix and β* lies between β̂ and β₀. For a discussion of the
assumptions needed to prove (5.3), see Chapter 6 by Amemiya in this Handbook.

Amemiya points out that in the process of proving (5.3) we have in effect shown
that, asymptotically,

    β̂ ≈ β₀ + (G'G)⁻¹G'u,    (5.5)

where G = (∂f/∂β')_{β₀}, a T×k matrix. The practical consequence of the
approximation (5.5) is that all the results for the linear regression model are
asymptotically valid for the nonlinear regression model if G is treated as the
regressor matrix. In particular, the usual t and F statistics can be used
asymptotically. Note that (5.5) holds exactly in the linear case.
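A small numerical sketch may help fix ideas. It is my own illustration with a
made-up exponential regression function (scipy's least-squares routine stands in
for the nonlinear least squares estimator); the last lines use G at β̂ exactly
as a regressor matrix would be used in the linear model:

    import numpy as np
    from scipy.optimize import least_squares

    rng = np.random.default_rng(0)
    T = 200
    x = rng.uniform(0, 2, T)
    beta0 = np.array([1.0, 0.5])
    y = beta0[0] * np.exp(beta0[1] * x) + 0.1 * rng.standard_normal(T)

    f = lambda b: b[0] * np.exp(b[1] * x)
    fit = least_squares(lambda b: y - f(b), x0=np.ones(2))
    b_hat = fit.x

    # G = df/dbeta' evaluated at b_hat, treated as the regressor matrix
    G = np.column_stack([np.exp(b_hat[1] * x),
                         b_hat[0] * x * np.exp(b_hat[1] * x)])
    s2 = np.sum((y - f(b_hat))**2) / (T - 2)
    cov = s2 * np.linalg.inv(G.T @ G)         # usual linear-model formula
    t_ratios = b_hat / np.sqrt(np.diag(cov))  # asymptotic t statistics
    print(t_ratios)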
As an example consider testing the linear hypothesis:

    H: Cβ₀ = c,    (5.6)

where C and c are defined as in (2.2). Let:

    θ̂ = Cβ̂ - c    (5.7)

and

    V̂ = C(Ĝ'Ĝ)⁻¹C',    (5.8)

where Ĝ = (∂f/∂β')_{β̂}. Then we have asymptotically under the null hypothesis:

    θ̂'V̂⁻¹θ̂/(qs²) ~ F(q, T - k)    (5.9)

and

    t_i = θ̂_i/(s v̂_ii^{1/2}) ~ t(T - k),    (5.10)

where s² = S_T(β̂)/(T - k) and v̂_ii is the ith diagonal element of V̂.

Suppose that a finite number m of ψ in L are of primary interest. Let the ψ in G
be ψ_i = a_i'θ, i = 1,...,m. The usual t statistic for testing the separate
hypothesis H(a_i): ψ_i = a_i'θ = 0 is:

    t(a_i) = a_i'θ̂/[s(a_i'V̂a_i)^{1/2}].    (5.11)

The acceptance region of a δ level equal-tailed test of H(a_i) is approximately:

    |t(a_i)| ≤ t_{δ/2}(T - k),    i = 1,...,m.    (5.12)

The finite induced test accepts H if and only if all the separate hypotheses
H(a₁),...,H(a_m) are accepted. When all the equal-tailed t tests have the same
significance level, the acceptance region for an α level Bonferroni test of H is
approximately:

    |t(a_i)| ≤ B,    i = 1,...,m,    (5.13)

where B = t_{α/2m}(T - k). The Sidak or studentized maximum modulus critical
value can also be used in large samples.
A large sample analogue of the Scheffe test can be developed by using the fact
that the maximum of the squared t ratio:

    max_a [t(a)]² = θ̂'V̂⁻¹θ̂/s²    (5.14)

is asymptotically distributed as qF(q, T - k). The proof is essentially the same
as the one presented in Section 3.3.1.
Next, consider testing the nonlinear hypothesis:

    H: h(β₀) = 0,    (5.15)

where h(β) is a q×1 vector valued nonlinear function such that q < k. If β are
the parameters that characterize a concentrated likelihood function L(β), where
L may or may not be derived from the normal distribution, then the hypothesis
(5.15) can be tested using the Wald (W), likelihood ratio (LR), or Lagrange
multiplier (LM) test. For a discussion of these tests, see Chapter 13 by Engle
in this Handbook.

When the error vector u is assumed to be normal in the nonlinear regression
model (5.1), the three test statistics can be written as:

    W = h(β̂)'[Ĥ(Ĝ'Ĝ)⁻¹Ĥ']⁻¹h(β̂)/σ̂²,    (5.16)

    LR = T[log S_T(β̃) - log S_T(β̂)],    (5.17)

    LM = ũ'G̃(G̃'G̃)⁻¹G̃'ũ/σ̃²,    (5.18)

where β̃ is the constrained maximum likelihood estimator obtained by maximizing
L(β) subject to (5.15), Ĥ = (∂h/∂β')_{β̂}, G̃ = (∂f/∂β')_{β̃}, ũ is the vector of
constrained residuals, σ̂² = S_T(β̂)/T and σ̃² = S_T(β̃)/T. When the hypothesis
(5.15) is true all three statistics (5.16), (5.17) and (5.18) are asymptotically
distributed as χ²(q) if u is normal. In fact, it can be shown that these
statistics are asymptotically distributed as χ²(q) even if u is not normal.
Thus, these statistics can be used to test a nonlinear hypothesis when u is
non-normal.
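As a sketch, the three statistics can be computed from quantities that any
nonlinear least squares routine supplies; this is my own arrangement of
(5.16)-(5.18), with the inputs assumed to be already available:

    import numpy as np

    def wald(h_hat, H_hat, G_hat, sse_u, T):
        # W of (5.16): h_hat = h(b_hat), H_hat = dh/db' at b_hat
        sig2 = sse_u / T
        mid = H_hat @ np.linalg.inv(G_hat.T @ G_hat) @ H_hat.T
        return float(h_hat @ np.linalg.solve(mid, h_hat) / sig2)

    def lik_ratio(sse_r, sse_u, T):
        # LR of (5.17) from restricted and unrestricted sums of squares
        return T * (np.log(sse_r) - np.log(sse_u))

    def lagrange_mult(u_tilde, G_tilde, sse_r, T):
        # LM of (5.18) from restricted residuals and derivatives
        sig2 = sse_r / T
        proj = G_tilde @ np.linalg.solve(G_tilde.T @ G_tilde,
                                         G_tilde.T @ u_tilde)
        return float(u_tilde @ proj / sig2)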


Recall that from any convex set we can derive simultaneous confidence intervals
for all ψ in L. This convex set can be the acceptance region of the W, LR or LM
test in large samples. Starting with a finite set G of ψ in L of primary
interest, the convex set can be defined as the intersection of large sample t
intervals for all ψ in G. The t statistics can be based on either the W or the
LM principle of test construction. A large sample analogue of the S intervals
can be based on the W test of H.
6. Empirical examples

6.1. Textile example

Our first empirical illustration is based on the textile example of Theil (1971,
p. 103). This example considers an equation for the consumption of textiles in
the Netherlands, 1923-1939:

    y = β₀ + β₁x₁ + β₂x₂ + u,

where y = logarithm of textile consumption per capita, x₁ = logarithm of real
per capita income and x₂ = logarithm of the relative price of textile goods. The
estimated equation is reported by Theil (p. 116) as:

    ŷ = 1.3735 + 1.1430x₁ - 0.8289x₂,
        (0.3061)  (0.1560)   (0.0361)

where the numbers in parentheses are standard errors.

Theil tests the hypothesis that the income elasticity (β₁) is unity and that the
price elasticity (β₂) is minus unity. This hypothesis is:

    H: θ₁ = β₁ - 1 = 0,    θ₂ = β₂ + 1 = 0.

The 0.01 level F test rejects H since the value of the F ratio is 11.2 and the
upper 1% significance point of an F(2,14) distribution is 6.51.

Consider the Bonferroni test of H where the linear combinations of primary
interest are θ₁ and θ₂. The t statistics for testing θ₁ = 0 and θ₂ = 0 are:

    t₁ = 0.1430/0.1560 = 0.92

and

    t₂ = 0.1711/0.0361 = 4.74,

respectively. The nominal 0.01 level Bonferroni test rejects H since
B = t_{δ/2}(14) = 3.33 when δ = 0.01/2 = 0.005. Clearly, the separate hypothesis
β₂ = -1 is responsible for the rejection of the Bonferroni test of H. The 0.01
level Scheffe test of H also rejects H since the 0.01 level F test rejects H. In
this example the Bonferroni test has roughly the same power contours as the
Scheffe test since the correlation between the income and price variables is
low, namely about 0.22.
The next step is to calculate simultaneous confidence intervals for θ₁ and θ₂.
The B interval for θ₁ is 0.1430 ± 0.16(3.33) and for θ₂ is 0.1711 ± 0.04(3.33),
so that the B intervals are -0.39 ≤ θ₁ ≤ 0.68 and 0.04 ≤ θ₂ ≤ 0.30,
respectively. The S interval for θ₁ is 0.1430 ± 0.16(3.61) and for θ₂ is
0.1711 ± 0.04(3.61) since S = [2F₀.₀₁(2,14)]^{1/2} = 3.61. Hence the S intervals
are -0.43 ≤ θ₁ ≤ 0.72 and 0.03 ≤ θ₂ ≤ 0.32, respectively. Note that the S
intervals are longer than the B intervals, but not much longer. Both intervals
for θ₁ cover zero and both intervals for θ₂ cover only positive values. This
suggests that the income elasticity β₁ is unity and that the price elasticity β₂
is greater than minus one. In this example the hypothesis β₂ = -1 is responsible
for the rejection of the Scheffe as well as the Bonferroni test of H. This
result also follows from the fact that the absolute value of the t statistic for
θ₂ is larger than either B or S, i.e. |t₂| > B and |t₂| > S.
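The interval arithmetic above is easily checked (the point estimates, standard
errors and critical values are those quoted in the text):

    from scipy.stats import t as tdist, f as fdist

    est = {'theta1': (0.1430, 0.156), 'theta2': (0.1711, 0.0361)}
    alpha, m, q, df = 0.01, 2, 2, 14
    B = tdist.ppf(1 - alpha / (2 * m), df)        # 3.33
    S = (q * fdist.ppf(1 - alpha, q, df)) ** 0.5  # 3.61

    for name, (point, se) in est.items():
        for label, crit in (('B', B), ('S', S)):
            lo, hi = point - crit * se, point + crit * se
            print(name, label, round(lo, 2), round(hi, 2))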
The final step is to calculate the normalized vector a₀, the coefficient vector
of the linear combination which maximizes the squared t ratio; a₀ is
proportional to V⁻¹z and satisfies a₀'Va₀ = 1. Using the estimated covariance
matrix reported by Theil, for which s² = 0.0001833, the computed a₀ gives
positive weight to both components, with the larger weight on θ₂. This confirms
Theil's conclusion (p. 145) that the specification β₂ = -1 for the price
elasticity is responsible for the F test (Scheffe test) rejecting H, i.e. any
linear combination with positive weights and a sufficiently large weight on θ₂
is responsible for rejection.

Suppose in the B procedure that ψ = θ₁ + θ₂ is of secondary interest. The B
interval for ψ is 0.3141 ± 0.20(3.33), or -0.35 ≤ ψ ≤ 0.98. The S interval for ψ
is 0.3141 ± 0.023(3.61), or 0.23 ≤ ψ ≤ 0.40, so that the S interval is shorter
than the B interval. Also notice that ψ̂ = z₁ + z₂ is significantly different
from zero according to the S criterion, but not the B criterion. Hence the
Scheffe induced test of H is rejected by the separate hypothesis that the income
and price elasticities are the same except for sign: β₁ = -β₂. Theil (p. 134)
objects to the length of the S intervals for the ψ of primary interest. In fact,
in the textile example the S intervals give interesting results for both the ψ
of primary and secondary interest.

6.2. Klein's Model I example

Our second example is based on the unrestricted reduced form equation for
consumption expenditures from Klein's Model I of the United States economy,
1921-1941:

    y = β₀ + β₁x₁ + β₂x₂ + ... + β₇x₇ + u,

where y = consumption, x₁ = government wage bill, x₂ = indirect taxes, x₃ =
government expenditures, x₄ = time (measured as year minus 1931), x₅ = profits
lagged one year, x₆ = end of year capital stock lagged one year, and x₇ =
private product lagged one year. For the purpose of this example all regressors
are treated as nonstochastic. The data are taken from Theil (1971, p. 456). The
estimated equation is reported with t ratios in parentheses; our estimates of
the β's agree with those reported in Goldberger (1964, p. 325). (Note that
Goldberger uses x₃ - x₁ in place of x₃, so that his estimate of β₁ is
0.19327 - 0.20501 = -0.01174.)

Consider testing the hypothesis that all the slope coefficients are zero:

    H: β₁ = β₂ = ... = β₇ = 0.

The slope coefficients are multipliers, so we are testing the hypothesis that
all the multipliers in the reduced form equation for consumption are zero. The
0.05 level Scheffe test rejects H since the 0.05 level F test overwhelmingly
rejects H. The F ratio is 28.2, which is much larger than 2.83, the upper 0.05
significance point of the F(7,13) distribution. Suppose that the linear
combinations of primary interest

in the Bonferroni test are the slope coefficients: ψ_i = β_i, i = 1,2,...,7.
Then the critical t value for a nominal 0.05 level Bonferroni separate induced
test of H is B = t_{δ/2}(13) = 3.19, where δ = 0.05/7 = 0.00714. The t ratio
with the largest absolute value is the one for lagged profits (β₅). Since this
is only 1.49, the Bonferroni test overwhelmingly accepts H. Thus in this example
the Scheffe and Bonferroni tests of H produce conflicting inferences.
We now apply the S procedure to find which linear combination of the multipliers
led to rejection of the Scheffe test of H. In this example none of the
individual multipliers is responsible for rejection since none of the t ratios
has an absolute value greater than S. The largest t ratio is 1.49 and
S = [7F₀.₀₅(7,13)]^{1/2} = 4.45. To find linear combinations of the multipliers
which are responsible for rejection I began by calculating the normalized vector
a₀. Its components are proportional to the sample covariances between the
dependent variable and the regressors. This linear combination gives some
positive weight to all the multipliers and especially to the multiplier β₇ for
lagged private product. Since it does not seem to have an interesting economic
interpretation, I examined a number of other linear combinations. I could not
find a linear combination responsible for rejection which was also of economic
interest.
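The search for the responsible linear combination can be mechanized: the squared
t ratio is maximized at a vector proportional to V̂⁻¹θ̂, the relation used to
compute a₀ above. A sketch with placeholder inputs, since the full estimated
covariance matrix is not reproduced here:

    import numpy as np

    def max_t_direction(theta_hat, V_hat):
        # Direction a0 (normalized so a0' V a0 = 1) maximizing the t ratio;
        # V_hat is the estimated covariance matrix of theta_hat. The maximum
        # squared t equals theta' V^{-1} theta.
        a = np.linalg.solve(V_hat, theta_hat)   # a proportional to V^{-1} theta
        a0 = a / np.sqrt(a @ V_hat @ a)         # impose a0' V a0 = 1
        max_t2 = float(theta_hat @ np.linalg.solve(V_hat, theta_hat))
        return a0, max_t2

    # Scheffe criterion: the combination a0 is responsible for rejection
    # when max_t2 exceeds S^2 = q * F_alpha(q, T - k).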
In this example the explanatory variables are highly correlated. As a
consequence the Bonferroni test can have low average power compared to the
Scheffe test. Hence the Bonferroni test may be very misleading. The Scheffe test
gives what appears to be a sensible result, but provides little help in deciding
which multipliers are nonzero. What is needed is more and better data for a
satisfactory solution to the interpretation problem.

References

Bailey, J. R. (1977), "Tables of the Bonferroni t Statistics", Journal of the American Statistical Association, 72:469-478.
Christensen, L. R. (1973), "Simultaneous Statistical Inference in the Normal Multiple Linear Regression Model", Journal of the American Statistical Association, 68:457-461.
Christensen, L. R., D. W. Jorgenson and L. J. Lau (1975), "Transcendental Logarithmic Utility Functions", American Economic Review, 65:367-383.
Cornish, E. A. (1954), "The Multivariate Small Sample t-Distribution Associated with a Set of Normal Sample Deviates", Australian Journal of Physics, 7:531-542.
Darroch, J. N. and S. D. Silvey (1963), "On Testing More Than One Hypothesis", Annals of Mathematical Statistics, 34:555-567.
Dhrymes, P. J. (1978), Introductory Econometrics. New York: Springer-Verlag.
Dunn, O. J. (1961), "Multiple Comparisons Among Means", Journal of the American Statistical Association, 56:52-64.
Dunnett, C. W. (1955), "A Multiple Comparisons Procedure for Comparing Several Treatments with a Control", Journal of the American Statistical Association, 50:1096-1121.
Dunnett, C. W. and M. Sobel (1954), "A Bivariate Generalization of Student's t-Distribution with Tables for Certain Cases", Biometrika, 41:153-169.
Evans, G. B. A. and N. E. Savin (1980), "The Powers of the Bonferroni and Scheffe Tests in the Two Parameter Case", manuscript, Faculty of Economics and Politics, Cambridge.
Fox, M. (1956), "Charts of the Power of the F-Test", Annals of Mathematical Statistics, 27:485-494.
Gabriel, K. R. (1964), "A Procedure for Testing the Homogeneity of All Sets of Means in the Analysis of Variance", Biometrics, 20:459-477.
Gabriel, K. R. (1969), "Simultaneous Test Procedures - Some Theory of Multiple Comparisons", Annals of Mathematical Statistics, 40:224-250.
Games, P. A. (1977), "An Improved Table for Simultaneous Control on g Contrasts", Journal of the American Statistical Association, 72:531-534.
Geary, R. C. and C. E. V. Leser (1968), "Significance Tests in Multiple Regression", The American Statistician, 22:20-21.
Goldberger, A. S. (1964), Econometric Theory. New York: John Wiley & Sons.
Hahn, G. J. and R. W. Hendrickson (1971), "A Table of Percentage Points of the Distribution of the Largest Absolute Value of k Student t Variates and Its Applications", Biometrika, 58:323-332.
Hochberg, Y. (1974), "Some Generalizations of the T-Method in Simultaneous Inference", Journal of Multivariate Analysis, 4:224-234.
Hochberg, Y. and C. Rodriquez (1977), "Intermediate Simultaneous Inference Procedures", Journal of the American Statistical Association, 72:220-225.
Imhof, P. (1961), "Computing the Distribution of Quadratic Forms in Normal Variates", Biometrika, 48:419-426.
Jennrich, R. I. (1969), "Asymptotic Properties of Non-linear Least Squares Estimation", Annals of Mathematical Statistics, 40:633-643.
Johnson, N. L. and S. Kotz (1972), Distributions in Statistics: Continuous Multivariate Distributions. New York: John Wiley & Sons.
Jorgenson, D. W. (1971), "Econometric Studies of Investment Behavior: A Survey", Journal of Economic Literature, 9:1111-1147.
Jorgenson, D. W. (1974), "Investment and Production: A Review", in: M. D. Intriligator and D. A. Kendrick, eds., Frontiers of Quantitative Economics, II. Amsterdam: North-Holland.
Jorgenson, D. W. and L. J. Lau (1975), "The Structure of Consumer Preferences", Annals of Economic and Social Measurement, 4:49-101.
Jorgenson, D. W. and L. J. Lau (1982), Transcendental Logarithmic Production Functions. Amsterdam: North-Holland.
Krishnaiah, P. R. (1963), Simultaneous Tests and the Efficiency of Generalized Incomplete Block Designs, ARL 63-174. Wright-Patterson Air Force Base, Ohio.
Krishnaiah, P. R. (1964), Multiple Comparisons Tests in the Multivariate Case, ARL 64-124. Wright-Patterson Air Force Base, Ohio.
Krishnaiah, P. R. (1965), "On the Simultaneous ANOVA and MANOVA Tests", Annals of the Institute of Statistical Mathematics, 17:35-53.
Krishnaiah, P. R. (1979), "Some Developments on Simultaneous Test Procedures", in: P. R. Krishnaiah, ed., Developments in Statistics, Vol. 2. New York: Academic Press.
Krishnaiah, P. R. and J. V. Armitage (1965a), Percentage Points of the Multivariate t Distribution, ARL 65-199. Wright-Patterson Air Force Base, Ohio.
Krishnaiah, P. R. and J. V. Armitage (1965b), Probability Integrals of the Multivariate F Distribution, with Tables and Applications, ARL 65-236. Wright-Patterson Air Force Base, Ohio.
Krishnaiah, P. R. and J. V. Armitage (1966), "Tables for the Multivariate t Distribution", Sankhya, Ser. B, 28:31-56.
Krishnaiah, P. R. and J. V. Armitage (1970), "On a Multivariate F Distribution", in: R. C. Bose et al., eds., Essays in Probability and Statistics. Chapel Hill: University of North Carolina Press.
Krishnaiah, P. R., J. V. Armitage and M. C. Breiter (1969a), Tables for the Probability Integrals of the Bivariate t Distribution, ARL 69-060. Wright-Patterson Air Force Base, Ohio.
Krishnaiah, P. R., J. V. Armitage and M. C. Breiter (1969b), Tables for the Bivariate |t| Distribution, ARL 69-0210. Wright-Patterson Air Force Base, Ohio.
Leamer, E. E. (1979), Specification Searches. New York: John Wiley & Sons.
Lehmann, E. L. (1957a), "A Theory of Some Multiple Decision Problems, I", Annals of Mathematical Statistics, 28:1-25.
Lehmann, E. L. (1957b), "A Theory of Some Multiple Decision Problems, II", Annals of Mathematical Statistics, 28:547-572.
Miller, R. G., Jr. (1966), Simultaneous Statistical Inference. New York: McGraw-Hill.
Miller, R. G., Jr. (1977), "Developments in Multiple Comparisons, 1966-1976", Journal of the American Statistical Association, 72:779-788.
Morrison, D. F. (1976), Multivariate Statistical Methods, 2nd ed. New York: McGraw-Hill.
Moses, L. E. (1976), "Charts for Finding Upper Percentage Points of Student's t in the Range .01 to .0001", Technical Report No. 24, Stanford University.
Olshen, R. A. (1973), "The Conditional Level of the F Test", Journal of the American Statistical Association, 68:692-698.
Olshen, R. A. (1977), "A Note on a Reformulation of the S Method of Multiple Comparison - Comment", Journal of the American Statistical Association, 72:144-146.
Pearson, E. S. and H. O. Hartley (1972), Biometrika Tables for Statisticians. Cambridge: Cambridge University Press.
Richmond, J. (1982), "A General Method for Constructing Simultaneous Confidence Intervals", Journal of the American Statistical Association, 77:455-460.
Roy, S. N. (1953), "On a Heuristic Method of Test Construction and Its Uses in Multivariate Analysis", Annals of Mathematical Statistics, 24:220-239.
Roy, S. N. and R. C. Bose (1953), "Simultaneous Confidence Interval Estimation", Annals of Mathematical Statistics, 24:513-536.
Sargan, J. D. (1976), "The Consumer Price Equation in the Post War British Economy: An Exercise in Equation Specification Testing", mimeo, London School of Economics.
Savin, N. E. (1980), "The Bonferroni and Scheffe Multiple Comparison Procedures", Review of Economic Studies, 47:255-273.
Schaffer, J. P. (1977), "Multiple Comparisons Emphasizing Selected Contrasts: An Extension and Generalization of Dunnett's Procedure", Biometrics, 33:293-303.
Scheffe, H. (1953), "A Method of Judging All Contrasts in the Analysis of Variance", Biometrika, 40:87-104.
Scheffe, H. (1959), The Analysis of Variance. New York: John Wiley & Sons.
Scheffe, H. (1977a), "A Note on a Reformulation of the S-Method of Multiple Comparison", Journal of the American Statistical Association, 72:143-144.
Scheffe, H. (1977b), "A Note on a Reformulation of the S-Method of Multiple Comparison - Rejoinder", Journal of the American Statistical Association, 72:146.
Seber, G. A. F. (1964), "Linear Hypotheses and Induced Tests", Biometrika, 51:41-47.
Seber, G. A. F. (1977), Linear Regression Analysis. New York: John Wiley & Sons.
Sidak, Z. (1967), "Rectangular Confidence Regions for the Means of Multivariate Normal Distributions", Journal of the American Statistical Association, 62:626-633.
Stoline, M. R. and N. K. Ury (1979), "Tables of the Studentized Maximum Modulus Distribution and an Application to Multiple Comparisons Among Means", Technometrics, 21:87-93.
Theil, H. (1971), Principles of Econometrics. New York: John Wiley & Sons.
Tiku, M. (1965), "Laguerre Series Forms of Non-Central Chi-Squared and F Distributions", Biometrika, 52:415-427.
Tukey, J. W. (1953), "The Problem of Multiple Comparisons", mimeo, Princeton University.
