Objectives
Introduction
Chi-Square Distribution
Chi-Square Test for Independence of Attributes
Chi-Square Test for Goodness of Fit
Conditions for Applying Chi-Square Test
Cells Pooling
Yates Correction
Limitations of Chi-Square Test
Let Us Sum Up
Key Words
Answers to Self Assessment Exercises
Terminal Questions/Exercises
Further Reading
Appendix Tables
17.0 OBJECTIVES
After studying this unit, you should be able to:
• explain and interpret interaction among attributes,
• use the chi-square distribution to see if two classifications of the same data
  are independent of each other,
• use the chi-square statistic in developing and conducting tests of goodness-of-fit, and
• analyse the independence of attributes by using the chi-square test.
17.1 INTRODUCTION
In the previous two units, you have studied the procedure of testing hypothesis
and using some of the tests like Z-test and t-test. In one sample test you have
learned tests to determine whether a sample mean or proportion was
significantly different from the respective population mean or proportion. But in
practice the requirement in your research may not be confined to only testing
of one mean/proportion of a population. As a researcher you may be interested
in dealing with more than two populations. For example, you may be interested
in knowing the differences in consumer preferences of a new product among
people in the north, the south, and the north-east of India. In such situations the
tests you have learned in the previous units do not apply. Instead you have to
use the chi-square test.
Chi-square tests enable us to test whether more than two population proportions
are equal. Also, if we classify a consumer population into several categories
(say high/medium/low income groups and strongly prefer/moderately prefer/
indifferent/do not prefer a product) with respect to two attributes (say consumer
income and consumer product preference), we can then use chi-square test to
test whether the two attributes are independent of each other. In this unit you will
learn the chi-square test, its applications and the conditions under which the
chi-square test is applicable.
Chi-Square Test
17.2 CHI-SQUARE DISTRIBUTION
The chi-square distribution is a probability distribution. Under some proper
conditions the chi-square distribution can be used as a sampling distribution of
chi-square. You will learn about these conditions in section 17.5 of this unit.
The chi-square distribution is characterised by its only parameter, the number of
degrees of freedom. The meaning of degrees of freedom is the same as the one you
have used in Student's t-distribution. Figure 17.1 shows three different
chi-square distributions for three different degrees of freedom.
[Figure 17.1. Chi-Square Sampling Distributions for df = 2, 3 and 4. Horizontal axis: χ2 (0 to 16); vertical axis: probability.]
It is to be noted that when the number of degrees of freedom is very small, the
chi-square distribution is heavily skewed to the right. As the number of degrees
of freedom increases, the curve rapidly becomes more symmetric. Therefore, when
the degrees of freedom increase sufficiently, the chi-square distribution can be
approximated by the normal distribution. This is illustrated in Figure 17.2.
[Figure 17.2. Chi-Square Sampling Distributions for df = 2, 4, 10, and 20. Horizontal axis: χ2 (0 to 40); vertical axis: probability.]
Like Student's t-distribution, there is a separate chi-square distribution for each
number of degrees of freedom. Appendix Table-4 gives the most commonly
used tail areas that are used in tests of hypothesis using the chi-square distribution.
We will explain how to use this table to test the hypothesis when we deal with
examples in the subsequent sections of this unit.
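The tail areas tabulated for the chi-square distribution can also be computed directly. The following is a minimal sketch, assuming only Python's standard library; the function name chi2_upper_tail is hypothetical and is introduced here for illustration. It uses the closed-form expressions for the upper tail area that hold for whole-number degrees of freedom, and checks them against four table values quoted later in this unit.

```python
import math

def chi2_upper_tail(x, df):
    """Area to the right of x under the chi-square curve with df degrees of freedom."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^k / k!
        term = 1.0
        total = 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: complementary error function plus a finite sum
    total = 0.0
    for k in range(1, (df - 1) // 2 + 1):
        total += half ** (k - 0.5) / math.gamma(k + 0.5)
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total

# Table values used in the illustrations of this unit: (df, table value)
for df, table_value in [(1, 2.706), (2, 4.605), (3, 7.815), (5, 11.071)]:
    print(df, table_value, round(chi2_upper_tail(table_value, df), 3))
```

Each printed tail area should come out close to the significance level at which the table value is used in the text (0.10 for the first two pairs, 0.05 for the last two).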
Probability and Hypothesis
Testing
17.3 CHI-SQUARE TEST FOR INDEPENDENCE OF
ATTRIBUTES
Many times, researchers may like to know whether the differences they
observe among several sample proportions are significant or only due to chance.
Suppose a sales manager wants to know the preferences of consumers located in
different geographic regions of a country for a particular brand of a product.
If the manager finds that the difference in product preference among the people
located in different regions is significant, he/she may like to change the brand
name according to the consumer preferences. But if the difference is not
significant, the manager may conclude that the difference, if any, is only due
to chance and may decide to sell the product with the same name. Therefore, we
are trying to determine whether the two attributes (geographical region and
brand preference) are independent or dependent. It should be noted that the
chi-square test only tells us whether two principles of classification are
significantly related or not; it does not measure the degree or form of the
relationship. We will discuss the procedure of testing the independence of
attributes with illustrations. Study them carefully to understand the concept
of the χ2 test.
Illustration 1
Suppose in our example of consumer preference explained above, we divide
India into 6 geographical regions (south, north, east, west, central and
north-east). We also have two brands of a product, brand A and brand B.
The survey results can be classified according to the region and brand
preference as shown in the following table.
                 Consumer preference
Region        Brand A   Brand B   Total
South            64        16       80
North            24         6       30
East             23         7       30
West             56        44      100
Central          12        18       30
North-east       12        18       30
Total           191       109      300
In the above table the attribute on consumer preference is represented by a
column for each brand of the product. Similarly, the attribute of region is
represented by a row for each region. The value in each cell represents the
responses of the consumers located in a particular region and their preference
for a particular brand. These cell numbers are referred to as observed (actual)
frequencies. The arrangement of data according to the attributes in cells is
called a contingency table. We describe the dimensions of a contingency table
by first stating the number of rows and then the number of columns. The table
stated above showing geographical region in rows (6) and brand preference in
columns (2) is a 6 × 2 contingency table.
In the 6 × 2 contingency table stated above (the example of brand preference)
each cell value represents a frequency of consumers classified as having the
corresponding attributes. We also stated that these cell values are referred to as
observed frequencies. Using this data we have to determine whether or not
the consumer's geographical location (region) matters for brand preference. Here
the null hypothesis (H0) is that the brand preference is not related to the
geographical region. In other words, the null hypothesis is that the two
attributes, namely, brand preference and geographical location of the consumer,
are independent. As a basis of comparison, we use the sample results that
would be obtained on the average if the null hypothesis of independence were
true. These hypothetical data are referred to as the expected frequencies.
We use the following formula for calculation of expected frequencies (E).
$$E = \frac{\text{Row total} \times \text{Column total}}{\text{Grand total}}$$
For example, the expected frequency for the cell in row-1 and column-1 (South,
Brand A) of the 6 × 2 brand preference contingency table referred to earlier is:
$$E = \frac{80 \times 191}{300} = \frac{15280}{300} = 50.93$$
Accordingly, the following table gives the calculated expected frequencies for
the rest of the cells of the 6x2 contingency table.
Calculation of the Expected Frequencies
                            Consumer Preference
Region        Brand A                   Brand B                   Total
South         (80×191)/300 = 50.93      (80×109)/300 = 29.07        80
North         (30×191)/300 = 19.10      (30×109)/300 = 10.90        30
East          (30×191)/300 = 19.10      (30×109)/300 = 10.90        30
West          (100×191)/300 = 63.67     (100×109)/300 = 36.33      100
Central       (30×191)/300 = 19.10      (30×109)/300 = 10.90        30
North-east    (30×191)/300 = 19.10      (30×109)/300 = 10.90        30
Total         191                       109                        300
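The expected-frequency formula can be applied to every cell at once. The sketch below is illustrative only: the function name expected_frequencies is hypothetical, and the data are the observed frequencies of Illustration 1 with regions as rows and Brand A, Brand B as columns.

```python
def expected_frequencies(observed):
    """Expected count for each cell: row total x column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Observed frequencies from Illustration 1 (columns: Brand A, Brand B)
regions = ["South", "North", "East", "West", "Central", "North-east"]
observed = [[64, 16], [24, 6], [23, 7], [56, 44], [12, 18], [12, 18]]

expected = expected_frequencies(observed)
for region, row in zip(regions, expected):
    print(region, [round(e, 2) for e in row])
```

Rounded to two decimals, the printed values should agree with the expected frequencies worked out by hand in the table.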
We use the following formula for calculating the chi-square value.
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
Where, χ2 = chi-square; Oi = observed frequency; Ei = expected frequency;
and Σ = sum of.
To ascertain the value of chi-square, the following steps are followed.
1) Subtract Ei from Oi for each of the 12 cells and square each of these differences,
giving (Oi–Ei)².
2) Divide each squared difference by Ei and obtain the total, i.e.,
$$\sum \frac{(O_i - E_i)^2}{E_i}$$
This gives the value of chi-square, which may range from zero to infinity.
Thus, the value of χ2 is never negative.
Now we rearrange the data given in the above two tables for comparing the
observed and expected frequencies. The rearranged observed frequencies,
expected frequencies and the calculated χ2 value are given in the following
Table.
(Row, Column)   Observed      Expected      (Oi–Ei)   (Oi–Ei)²   (Oi–Ei)²/Ei
                frequencies   frequencies
                (Oi)          (Ei)
(1,1)           64            50.93          13.07     170.74     3.35
(2,1)           24            19.10           4.90      24.01     1.26
(3,1)           23            19.10           3.90      15.21     0.80
(4,1)           56            63.67          –7.67      58.78     0.92
(5,1)           12            19.10          –7.10      50.41     2.64
(6,1)           12            19.10          –7.10      50.41     2.64
(1,2)           16            29.07         –13.07     170.74     5.87
(2,2)            6            10.90          –4.90      24.01     2.20
(3,2)            7            10.90          –3.90      15.21     1.40
(4,2)           44            36.33           7.67      58.78     1.62
(5,2)           18            10.90           7.10      50.41     4.62
(6,2)           18            10.90           7.10      50.41     4.62
Total           300           300                                 χ2 = 31.94
With an r × c contingency table (i.e. r rows and c columns), the degrees of
freedom are found by (r–1) × (c–1). In our example, we have a 6 × 2
contingency table. Therefore, we have (6–1) × (2–1) = 5 × 1 = 5 degrees of
freedom. Suppose we take 0.05 as the significance level (α). Then at 5 degrees
of freedom and α = 0.05 significance level the table value (from Appendix
Table-4) is 11.071. Since the calculated χ2 value (31.94) is greater than the
table value (11.071), we reject the null hypothesis and conclude that the
brand preference is not independent of the geographical location of the
customer. Therefore, the sales manager may need to change the brand name across
the regions.
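The whole procedure of Illustration 1 (expected frequencies, χ2 value, degrees of freedom and the comparison with the table value) can be sketched in a few lines of Python. This is an illustrative sketch, not a prescribed method: the function name chi_square_independence is hypothetical, and the table value 11.071 is simply copied from the text rather than computed. Because the code does not round the expected frequencies, its χ2 value may differ from the hand calculation in the second decimal place.

```python
def chi_square_independence(observed):
    """Return (chi-square statistic, degrees of freedom) for an r x c table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total  # expected frequency
            statistic += (o - e) ** 2 / e
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df

# Illustration 1: rows are the six regions, columns are Brand A and Brand B
observed = [[64, 16], [24, 6], [23, 7], [56, 44], [12, 18], [12, 18]]
TABLE_VALUE = 11.071  # from the text: df = 5, significance level 0.05

statistic, df = chi_square_independence(observed)
reject_h0 = statistic > TABLE_VALUE
print(f"chi-square = {statistic:.2f}, df = {df}, reject H0: {reject_h0}")
```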
Illustration 2
A TV channel programme manager wants to know whether there are any
significant differences between male and female viewers in the type of
programmes they watch. A survey conducted for the purpose gives the
following results.
Type of TV          Viewers' Sex
programme       Male   Female   Total
News             30      10       40
Serials          20      40       60
Total            50      50      100
Calculate the χ2 statistic and determine whether the type of TV programme is
independent of the viewers' sex. Take 0.10 as the significance level.
Solution: In this example, the null and alternate hypotheses are:
H0: The viewers' sex is independent of the type of TV programme (there is no
association between the viewers' sex and the type of programme watched).
H1: The viewers' sex is not independent of the type of TV programme.
We are given the observed frequencies in the problem. The expected
frequencies are calculated in the same way as we have explained in Illustration
1. The following table gives the calculated expected frequencies.
Type of TV                    Viewers' Sex
programme       Male                 Female               Total
News            (40×50)/100 = 20     (40×50)/100 = 20       40
Serials         (60×50)/100 = 30     (60×50)/100 = 30       60
Total           50                   50                    100
Now we rearrange the data on observed and expected frequencies and
calculate the χ2 value. The following table gives the calculated χ2 value.
(Row, Column)   Observed      Expected      (Oi–Ei)   (Oi–Ei)²   (Oi–Ei)²/Ei
                frequencies   frequencies
                (Oi)          (Ei)
(1,1)           30            20             10        100        5.00
(2,1)           20            30            –10        100        3.33
(1,2)           10            20            –10        100        5.00
(2,2)           40            30             10        100        3.33
                                                                  χ2 = 16.66
Since we have a 2 × 2 contingency table, the degrees of freedom will be (r–1) ×
(c–1) = (2–1) × (2–1) = 1 × 1 = 1. At 1 degree of freedom and 0.10
significance level the table value (from Appendix Table-4) is 2.706. Since the
calculated χ2 value (16.66) is greater than the table value of χ2 (2.706) we reject
the null hypothesis and conclude that the type of TV programme is dependent
on viewers' sex. It should, therefore, be noted that when the calculated value
of χ2 is greater than the table value of χ2, the difference between the theory
and observation is significant.
Self Assessment Exercise A
1) The following are independent testing situations, with calculated
chi-square values and significance levels. For each, (i) state the null
hypothesis, (ii) determine the number of degrees of freedom, (iii) find the
corresponding table value, and (iv) state whether you accept or reject
the null hypothesis.
a) Type of the car (small, family, luxury) versus attitude by sex
(preferred, not preferred). χ2 = 10.25 and α = 0.05.
b) Income distribution per month (below Rs 10000, Rs 10000-20000,
Rs 20000-30000, Rs 30000 and above) and preference for type of
house with number of bed rooms (1, 2, 3, 4 and above). χ2 = 28.50
and α = 0.01.
c) Attitude towards going to a movie or for shopping versus sex (male,
female). χ2 = 8.50 and α = 0.01.
d) Educational level (illiterate, literate, high school, graduate) versus
political affiliation (CPI, Congress, BJP, BSP). χ2 = 12.65 and α = 0.10.
2)
The following are the number of columns and rows of a contingency
table. Determine the number of degrees of freedom that the chi-square
will have
a) 6 rows, 6 columns
3) A company has introduced a new brand product. The marketing
manager wants to know whether the preference for the brand is
distributed independent of the consumer’s education level. The survey of
a sample of 400 consumers gave the following results.
                        Illiterates   Literates   High School   Graduate   Total
Bought new brand             50           55           45           60       210
Did not buy new brand        50           45           55           40       190
Total                       100          100          100          100       400
a) Calculate the expected frequencies and the chi-square value.
b) State the null hypothesis.
c) State whether you accept or reject the null hypothesis at α = 0.05.
17.4 CHI-SQUARE TEST FOR GOODNESS OF FIT
In unit 14, you have studied some probability distributions such as binomial,
Poisson and normal distributions. When we consider a sample data from a
population we try to assume the type of distribution the sample data follows.
The chi-square test is useful in deciding whether a particular probability
distribution such as the binomial, Poisson or normal distribution is the appropriate
probability distribution. This allows us to validate our assumption about the
probability distribution of the sample data. The chi-square test procedure used
for this purpose is called goodness-of-fit test. The test also indicates whether
or not the frequency distribution for the sample population has a particular
shape, such as the normal curve (symmetric distribution). This can be done by
testing whether there is a significant difference between an observed frequency
distribution and an assumed theoretical frequency distribution. Thus by applying
chi-square test for goodness of fit, we can determine whether the observed
data constitutes a sample drawn from the population with assumed theoretical
distribution. In this section we use chi-square test for goodness-of-fit to make
inferences about the type of distribution.
The logic inherent in the chi-square test allows us to compare the observed
frequencies (Oi) with the expected frequencies (Ei). The expected frequencies
are calculated on the basis of our theoretical assumptions about the population
distribution. Let us explain the procedure of testing by going through some
illustrations.
Illustration 3
A salesman has 3 products to sell and there is a 40% chance of selling each
product when he meets a customer. The following is the frequency distribution
of sales.
No. of products sold per sale:        0    1    2    3
Frequency of the number of sales:    10   40   60   20
At the 0.05 level of significance, do these sales of products follow a binomial
distribution?
Solution: In this illustration, the sales process is approximated by a binomial
distribution with p = 0.40 (with a 40% chance of selling each product).
H0: The sales of three products have a binomial distribution with p = 0.40.
H1: The sales of three products do not have a binomial distribution with p = 0.40.
Before we proceed further we must calculate the expected frequencies in order
to determine whether the discrepancies between the observed frequencies and
the expected frequencies (based on the binomial distribution) should be ascribed to
chance. We begin by determining the binomial probability in each situation of
sales (0, 1, 2, 3 products sold per sale). For three products, we find the
probabilities by consulting the binomial probabilities in Appendix Table-1.
By looking at the column labelled n = 3 and p = 0.40 we obtain the
following binomial probabilities of the sales.
No. of products      Binomial probabilities
sold per sale (r)    of the sales
0                    0.216
1                    0.432
2                    0.288
3                    0.064
Total                1.000
We now calculate the expected frequency of sales for each situation. There are
130 customers visited by the salesman. We multiply each probability by 130 (no.
of customers visited) to arrive at the respective expected frequency. For
example, 0.216 × 130 = 28.08.
The following table shows the observed frequencies and the expected
frequencies.
No. of products   Observed    Binomial      Number of           Expected frequency
sold per sale     frequency   probability   customers visited   (5) = (3) × (4)
(1)               (2)         (3)           (4)
0                  10         0.216         130                 28.08
1                  40         0.432         130                 56.16
2                  60         0.288         130                 37.44
3                  20         0.064         130                  8.32
Total             130
Now we use the chi-square test to examine the significance of differences
between observed frequencies and expected frequencies. The formula for
calculating chi- square is
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
The following table gives the calculation of chi-square.
Observed           Expected           (Oi–Ei)   (Oi–Ei)²   (Oi–Ei)²/Ei
frequencies (Oi)   frequencies (Ei)
 10                 28.08             –18.08     326.89     11.64
 40                 56.16             –16.16     261.15      4.65
 60                 37.44              22.56     508.95     13.59
 20                  8.32              11.68     136.42     16.40
130                130                                      χ2 = 46.28
In order to draw inferences about this calculated value of χ2 we are required
to compare it with the table value of χ2. For this we need: (i) the degrees of
freedom (n–1), and (ii) the level of significance. In the problem we are given that
the level of significance is 0.05. The number of expected situations is 4
(0, 1, 2, 3 products sold per sale), that is, n = 4. Therefore, the degrees of
freedom will be 3 (i.e., n–1 = 4–1 = 3). The table value from Appendix Table-4
is 7.815 at 3 degrees of freedom and 0.05 level of significance. Since the
calculated value (χ2 = 46.28) is greater than the table value (7.815), we reject
the null hypothesis and accept the alternative hypothesis. We conclude that the
observed frequencies do not follow the binomial distribution with p = 0.40.
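The goodness-of-fit calculation of Illustration 3 can be reproduced as follows. This is a minimal sketch under stated assumptions: the helper binomial_pmf is a hypothetical name, the binomial probabilities are computed from the formula instead of being read from Appendix Table-1, and the table value 7.815 is copied from the text.

```python
from math import comb

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n trials with success probability p."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

# Illustration 3: number of sales in which 0, 1, 2, 3 products were sold
observed = [10, 40, 60, 20]
n_products, p = 3, 0.40
total = sum(observed)  # 130 customers visited

expected = [total * binomial_pmf(r, n_products, p) for r in range(n_products + 1)]
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
TABLE_VALUE = 7.815  # from the text: df = 3, significance level 0.05

print([round(e, 2) for e in expected])
print(f"chi-square = {statistic:.2f}, df = {df}, reject H0: {statistic > TABLE_VALUE}")
```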
Let us take another illustration which relates to the normal distribution.
Illustration 4
In order to plan how much cash to keep on hand, a bank manager is interested
in seeing whether the average deposit of a customer is normally distributed with
mean Rs. 15000 and standard deviation Rs. 6000. The following information is
available with the bank.
Deposit (Rs)            Less than 10000   10000-20000   More than 20000
Number of depositors          30               80              40
Calculate the χ2 statistic and test whether the data follows a normal distribution
with mean Rs.15000 and standard deviation Rs.6000 (take the level of
significance (α) as 0.10).
Solution: In this illustration, the assumption made by the bank manager is
that the pattern of deposits follows a normal distribution with mean Rs.15000
and standard deviation Rs.6000. Therefore, in testing the goodness-of-fit you
may like to state the following hypothesis.
H0: The sample data of deposits is from a population having normal distribution
with mean Rs.15000 and standard deviation Rs.6000.
H1: The sample data of deposits is not from a population having normal
distribution with mean Rs.15000 and standard deviation Rs.6000.
In order to calculate the χ2 value we must have expected frequencies. The
expected frequencies are determined by multiplying the proportion of population
values within each class interval by the total sample size of observed
frequencies. Since we have assumed a normal distribution for our population,
the expected frequencies are calculated by multiplying the area under the
normal curve for the respective class interval by the total sample size (n = 150).
For example, to obtain the area for deposits less than Rs.10000, we calculate
the normal deviate as follows:
$$z = \frac{x - \mu}{\sigma} = \frac{10000 - 15000}{6000} = -0.83$$
From Appendix Table-3 (given at the end of this unit), this value (–0.83)
corresponds to a lower tail area of 0.5000–0.2967 = 0.2033. Multiplying 0.2033
by the sample size (150), we obtain the expected frequency 0.2033 × 150 =
30.50 depositors.
The calculations of the remaining expected frequencies are shown in the
following table.
Upper limit of the   Normal deviate        Area left   Area of         Expected frequency
deposit range (x)    z = (x–15000)/6000    to x        deposit range   (depositors)
(1)                  (2)                   (3)         (4)             (5) = (4) × 150
10000                –0.83                 0.2033      0.2033           30.50
20000                 0.83                 0.7967      0.5934           89.01
>20000                ∞                    1.0000      0.2033           30.50
Total                                                  1.0000          150
We should note that from Appendix Table-3 for 0.83 the area left to x is
0.5000 + 0.2967 = 0.7967 and for ∞ the area left to x is 0.5000 + 0.5000 =
1.0000. Similarly, the area of deposit range for normal deviate 0.83 = 0.7967–
0.2033 = 0.5934 and for ∞ = 1.0000–0.7967 = 0.2033.
Once the expected frequencies are calculated, the procedure for calculating χ2
statistic will be the same as we have seen in illustration 3.
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
The following table gives the calculation of chi-square.
Observed           Expected           (Oi–Ei)   (Oi–Ei)²   (Oi–Ei)²/Ei
frequencies (Oi)   frequencies (Ei)
 30                 30.50             –0.50      0.2450     0.0080
 80                 89.01             –9.01     81.1801     0.9120
 40                 30.50              9.51     90.3450     2.9626
150                150                                      χ2 = 3.8827
Since n = 3 (the number of deposit classes), the number of degrees of freedom
will be n–1 = 3–1 = 2 and we are given 0.10 as the level of significance. From
Appendix Table-4 the table value of χ2 for df = 2 and α = 0.10 is 4.605. Since
the calculated value of χ2 (3.8827) is less than the table value, we accept the
null hypothesis and conclude that the data are well described by the normal
distribution with mean = Rs.15000 and standard deviation = Rs.6000.
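The steps of Illustration 4 can be sketched with the NormalDist class of Python's standard statistics module. Two choices here are assumptions made to stay close to the hand calculation: the normal deviate z is rounded to two decimals, as it would be when reading a printed normal table, and the table value 4.605 is copied from the text. The resulting χ2 value agrees with the hand calculation only approximately, because the code does not round the areas to four decimals.

```python
from statistics import NormalDist

mean, sd = 15000, 6000
upper_limits = [10000, 20000]   # class boundaries; the last class is open-ended
observed = [30, 80, 40]         # less than 10000, 10000-20000, more than 20000
n = sum(observed)               # 150 depositors

std_normal = NormalDist()       # standard normal distribution (mean 0, sd 1)
# Area to the left of each boundary, with z rounded as in a table lookup
cumulative = [std_normal.cdf(round((x - mean) / sd, 2)) for x in upper_limits] + [1.0]
# Area of each deposit range = difference between successive cumulative areas
areas = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]
expected = [n * a for a in areas]

statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
TABLE_VALUE = 4.605             # from the text: df = 2, significance level 0.10

print([round(e, 2) for e in expected])
print(f"chi-square = {statistic:.2f}, df = {df}, reject H0: {statistic > TABLE_VALUE}")
```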
Let us consider an illustration which relates to Poisson Distribution.
Illustration 5
A small car company wishes to determine the frequency distribution of
warranty financed repairs per car for its new model car. On the basis of past
experience the company believes that the pattern of repairs follows a Poisson
distribution with mean number of repairs (λ) as 3. A sample data of 400
observations is provided below:
No. of repairs per car:    0    1    2    3    4   5 or more
No. of cars:              20   57   98   85   78   62
i) Construct a table of expected frequencies using Poisson probabilities with λ = 3.
ii) Calculate the χ2 statistic and give your conclusions about the null hypothesis
(take the level of significance as 0.05).
Solution: For the above problem we formulate the following hypothesis.
H0: The number of repairs per car during warranty period follows a Poisson
probability distribution.
H1: The number of repairs per car during warranty period does not follow a Poisson
probability distribution.
As usual the expected frequencies are determined by multiplying the probability
values (in this case Poisson probability) by the total sample size of observed
frequencies. Appendix Table-2 provides the Poisson probability values. For
λ = 3.0 and for different x values we can directly read the probability values.
For example for λ = 3.0 and x = 0 the Poisson probability value is 0.0498, for
λ = 3.0 and x = 1 the Poisson probability value is 0.1494 and so on … .
The following table gives the calculated expected frequencies.
No. of repairs   Poisson probability   Expected frequency
per car (x)                            Ei = (2) × 400
(1)              (2)                   (3)
0                0.0498                 19.92
1                0.1494                 59.76
2                0.2240                 89.60
3                0.2240                 89.60
4                0.1680                 67.20
5 or more        0.1848                 73.92
Total            1.0000                400
It is to be noted that from Appendix Table-2 for λ = 3.0 we have taken the
Poisson probability values directly for x = 0,1,2,3 and 4. For x = 5 or more we
added the rest of the probability values (for x = 5 to x = 12) so that the sum
of all the probability for x = 0 to x = 5 or more will be 1.000.
As usual we use the following formula for calculating the chi-square (χ2) value.
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
The following table gives the calculated χ2 value
Observed           Expected           (Oi–Ei)   (Oi–Ei)²    (Oi–Ei)²/Ei
frequencies (Oi)   frequencies (Ei)
 20                 19.92               0.08      0.0064     0.0003
 57                 59.76              –2.76      7.6176     0.1275
 98                 89.60               8.40     70.5600     0.7875
 85                 89.60              –4.60     21.1600     0.2362
 78                 67.20              10.80    116.6400     1.7357
 62                 73.92             –11.92    142.0864     1.9222
400                400                                       χ2 = 4.8094
Since n = 6, the number of degrees of freedom will be n–1 = 6–1 = 5 and we
are given α = 0.05 as the level of significance. From Appendix Table-4, the table
value of χ2 for 5 degrees of freedom and α = 0.05 is 11.071. Since the calculated
value of χ2 (4.8094) is less than the table value of χ2 (11.071), we
accept the null hypothesis (H0) and conclude that the data follows a Poisson
probability distribution with λ = 3.0.
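The Poisson goodness-of-fit calculation of Illustration 5 can be sketched in the same way. In this illustrative sketch the Poisson probabilities are computed from the formula e^(–λ) λ^x / x! instead of being read from Appendix Table-2, so the χ2 value differs slightly from the hand calculation, which uses probabilities rounded to four decimals. The table value 11.071 is copied from the text.

```python
from math import exp, factorial

lam = 3.0
observed = [20, 57, 98, 85, 78, 62]   # cars with 0, 1, 2, 3, 4 and 5-or-more repairs
n = sum(observed)                     # 400 cars

# Poisson probabilities for x = 0..4; the open class "5 or more" takes the remainder
probabilities = [exp(-lam) * lam ** x / factorial(x) for x in range(5)]
probabilities.append(1.0 - sum(probabilities))

expected = [n * p for p in probabilities]
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
TABLE_VALUE = 11.071                  # from the text: df = 5, significance level 0.05

print([round(e, 2) for e in expected])
print(f"chi-square = {statistic:.2f}, df = {df}, reject H0: {statistic > TABLE_VALUE}")
```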
Illustration 6
In order to know the brand preference of two washing detergents, a sample of
1000 consumers were surveyed. 56% of the consumers preferred Brand X
and 44% of the consumers preferred Brand Y. Do these data conform to the
idea that consumers have no special preference for either brand? Take
significance level as 0.05.
Solution: In this illustration, we assume that brand preference follows a
uniform distribution. That is, ½ of the consumers prefer Brand X and the other ½
of the consumers prefer Brand Y.
Therefore, we have the following hypotheses.
H0: Brand name has no special significance for consumer preference.
H1: Brand name has special significance for consumer preference.
Since the consumer preference data is given in proportions we will convert it
into frequencies. The number of consumers who preferred Brand X is 0.56 ×
1000 = 560 and the number who preferred Brand Y is 0.44 × 1000 = 440. The
corresponding expected frequencies are ½ × 1000 = 500 for each brand.
The following table gives calculated χ2 value.
Observed           Expected           (Oi–Ei)   (Oi–Ei)²   (Oi–Ei)²/Ei
frequencies (Oi)   frequencies (Ei)
 560                500                 60       3600       7.2
 440                500                –60       3600       7.2
1000               1000                                     χ2 = 14.4
The table value (by consulting the Appendix Table-4) at 5% significance level
and n–1 = 2–1 = 1 degree of freedom is 3.841. Since the value of calculated
χ2 is 14.4 which is greater than table value, we reject the null hypothesis and
conclude that the brand names have special significance for consumer
preference.
17.5 CONDITIONS FOR APPLYING CHI-SQUARE
TEST
To validate the chi-square test, the data set available needs to fulfil certain
conditions. Sometimes these conditions are also called precautions about using
the chi-square test. Therefore, whenever you use the chi-square test the
following conditions must be satisfied:
a) Random Sample: In the chi-square test the data set used is assumed to be a random
sample that represents the population. As with all significance tests, only when the
sample is random can a difference between the calculated value and the table value
be attributed to a real effect rather than to the way the sample was drawn. If the
sample is non-random, significance cannot be established, though the tests are
nonetheless sometimes used as crude “rules of thumb”. For example, we reject the
null hypothesis if the difference between observed and expected frequencies is too
large. But if the chi-square value is zero, we should be careful in interpreting this
as meaning that absolutely no difference exists between observed and expected
frequencies. We should then verify the quality of the data collected and whether
the sample data represents the population or not.
b) Large Sample Size: To use the chi-square test you must have a sample
large enough for the chi-square distribution to be a good approximation to
the sampling distribution of the test statistic. Applying the chi-square test
to small samples exposes the researcher to an unacceptable rate
of type-II errors. However, there is no accepted cutoff sample size. Many
researchers set the minimum sample size at 50. Remember that the chi-square
test statistic must be calculated on actual count data (frequencies of nominal,
ordinal or interval data), not on percentages, which would have the effect
of projecting the sample size as 100.
c) Adequate Cell Sizes: You have seen above that a small sample size makes
the test unreliable. In the same way, when the expected cell frequencies are
too small, the value of chi-square will be overestimated. This in turn will
result in too many rejections of the null hypothesis. To avoid making incorrect
inferences from chi-square tests we follow a general rule that the expected
frequency in any cell should be a minimum of 5.
d) Independence: The sample observations must be independent.
e) Final values: Observations must be grouped in categories.
17.6 CELLS POOLING
In the previous section we have seen that the expected frequency in each cell
should be at least 5. When a contingency table contains one or more cells
with an expected frequency of less than 5, this requirement may be met by
combining two rows or columns before calculating χ2. We must combine these
cells in order to get an expected frequency of 5 or more in each cell. This
practice is also known as grouping the frequencies together. But in doing this,
we reduce the number of categories of data and gain less information from the
contingency table. In addition, we also lose 1 or more degrees of freedom due
to pooling. It should be noted that the number of degrees of freedom is
determined by the number of classes after the regrouping. In the special case
of a 2 × 2 contingency table, the number of degrees of freedom is 1. If the
frequency in any cell is less than 5, we may be tempted to apply the pooling
method, but this would result in 0 degrees of freedom (due to the loss of 1 df),
which is meaningless. When the requirement of a minimum cell frequency of 5 is
not met in the case of a 2 × 2 contingency table, we apply Yates correction. You
will learn about Yates correction in section 17.7. Let us take an illustration to
understand the cell pooling method.
Illustration 7
A company marketing manager wishes to determine whether there are any
significant differences between regions in terms of a new product acceptance.
The following is the data obtained from interviewing a sample of 190
consumers.
Degree of                  Region
acceptance    South   North   East   West   Total
Strong          30      25     20     30     105
Moderate        15      15     20     20      70
Poor             5      10      0      0      15
Total           50      50     40     50     190
Calculate the chi-square statistic. Test the independence of the two attributes at
the 0.05 level of significance.
Solution: In this illustration, the null and alternate hypotheses are:
H0: The product acceptance is independent of the region of the consumer.
H1: The product acceptance is not independent of the region of the consumer.
We are given the observed frequencies in the problem. The following table
gives the calculated expected frequencies.
Degree of                  Region
acceptance    South   North   East    West    Total
Strong        27.63   27.63   22.11   27.63    105
Moderate      18.42   18.42   14.74   18.42     70
Poor           3.95    3.95    3.16    3.95     15
Total         50.00   50.00   40.00   50.00    190
Since the expected frequencies (cell values) in the third row are less than 5 we
pool the third row with the second row of both observed frequencies and
expected frequencies. The revised observed frequency and expected frequency
tables are given below.
Revised observed frequencies:
Degree of                        Region
acceptance           South    North    East    West    Total
Strong                 30       25      20      30      105
Moderate and poor      20       25      20      20       85
Total                  50       50      40      50      190

Revised expected frequencies:
Degree of                        Region
acceptance           South    North    East     West     Total
Strong               27.63    27.63    22.11    27.63     105
Moderate and poor    22.37    22.37    17.89    22.37      85
Total                50       50       40       50        190
Now we rearrange the data on observed and expected frequencies and
calculate the χ2 value. The following table gives the calculated χ2 value.
(Row, Column)   Observed           Expected           (Oi–Ei)   (Oi–Ei)2   (Oi–Ei)2/Ei
                frequencies (Oi)   frequencies (Ei)
(1,1)                30                27.63            2.37      5.6169      0.2033
(2,1)                20                22.37           –2.37      5.6169      0.2511
(1,2)                25                27.63           –2.63      6.9169      0.2503
(2,2)                25                22.37            2.63      6.9169      0.3092
(1,3)                20                22.11           –2.11      4.4521      0.2014
(2,3)                20                17.89            2.11      4.4521      0.2489
(1,4)                30                27.63            2.37      5.6169      0.2033
(2,4)                20                22.37           –2.37      5.6169      0.2511
                                                                         χ2 = 1.9185
Since we have a 2 × 4 contingency table after pooling, the degrees of freedom will be (r–1)
× (c–1) = (2–1) × (4–1) = 1 × 3 = 3. At 3 degrees of freedom and 0.05
significance level the table value (from Appendix Table-4) is 7.815. Since the
calculated χ2 value (1.9185) is less than the table value of χ2 (7.815), we accept
the null hypothesis and conclude that the product acceptance is independent of
the region of the consumer.
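The whole of Illustration 7 can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the helper names pool_rows and chi_square_independence are hypothetical, and the critical value 7.815 is typed in from Appendix Table-4 rather than computed. Because the code keeps the expected frequencies unrounded, it gives a χ2 of about 1.92 rather than exactly 1.9185.

```python
# Observed frequencies from Illustration 7.
observed = [
    [30, 25, 20, 30],  # Strong
    [15, 15, 20, 20],  # Moderate
    [5, 10, 0, 0],     # Poor
]


def pool_rows(table, keep, merge):
    """Return a copy of the table with row `merge` added into row `keep`."""
    pooled = [list(row) for row in table]
    pooled[keep] = [a + b for a, b in zip(pooled[keep], pooled[merge])]
    del pooled[merge]
    return pooled


def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected frequency
            chi2 += (o - e) ** 2 / e
    df = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, df


pooled = pool_rows(observed, keep=1, merge=2)  # Moderate and Poor combined
chi2, df = chi_square_independence(pooled)

TABLE_VALUE = 7.815  # Appendix Table-4, 3 df, 0.05 level
print(round(chi2, 2), df)
print("reject H0" if chi2 > TABLE_VALUE else "accept H0")
```

The degrees of freedom are computed from the pooled table (2 rows), not from the original 3 rows, which matches the rule stated in section 17.6.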
Illustration 8
The following table gives the number of typing errors per page in a 40-page
report. Test whether the number of typing errors per page follows a Poisson
distribution with mean (λ) 3.0.
No. of typing
0
errors per page
1
2
3
4
5
6
7
8
9
10 or
more
No. of pages
9
6
8
4
3
2
1
1
0
1
5
i) Construct a table of expected frequencies using Poisson probabilities with λ = 3.
ii) Calculate the χ2 statistic and give your conclusions about the null hypothesis (take
the level of significance as 0.01).
Solution: For the above problem we formulate the following hypotheses.
H0: The number of typing errors per page follows a Poisson probability distribution.
H1: The number of typing errors per page does not follow a Poisson probability
distribution.
As usual the expected frequencies are determined by multiplying the probability
values (in this case Poisson probability) by the total sample size of observed
frequencies. Table 17.3 provides the Poisson probability values. For λ = 3.0 and
for different x values we can directly read the probability values. For example
for λ = 3.0 and x = 0 the Poisson probability value is 0.0498. The following
table gives the calculated expected frequencies.
No. of typing          Poisson probability   Expected frequency
errors per page (x)                          Ei = (2) × 40
(1)                    (2)                   (3)
0                      0.0498                 1.99
1                      0.1494                 5.98
2                      0.2240                 8.96
3                      0.2240                 8.96
4                      0.1680                 6.72
5                      0.1008                 4.03
6                      0.0504                 2.02
7                      0.0216                 0.86
8                      0.0081                 0.32
9                      0.0027                 0.11
10 or more             0.0012                 0.05
Total                  1.0000                40
Pooled expected frequencies: 7.97 for 0 and 1 errors (1.99 + 5.98), and 14.11
for 4 or more errors (6.72 + 4.03 + 2.02 + 0.86 + 0.32 + 0.11 + 0.05).
Since the expected frequency of the first row (0 errors) is less than 5, we pool the first
and second rows of observed and expected frequencies. Similarly, the expected
frequencies of the last 6 rows (with 5, 6, 7, 8, 9, and 10 or more errors) are less
than 5. Therefore we pool these rows with the row for 4 errors, forming a single
class of 4 or more errors.
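The expected frequencies and the pooling step can be sketched in Python as follows. The probabilities are computed directly from the Poisson formula instead of being read from the table, so individual values may differ from the tabulated ones in the second decimal place; the names poisson_pmf, LAMBDA and N_PAGES are illustrative assumptions only.

```python
import math

LAMBDA = 3.0   # hypothesised mean number of errors per page
N_PAGES = 40   # total number of pages (sample size)


def poisson_pmf(x, lam):
    """Poisson probability of exactly x events when the mean is lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)


# Classes 0, 1, ..., 9 plus the open-ended class "10 or more".
probs = [poisson_pmf(x, LAMBDA) for x in range(10)]
probs.append(1.0 - sum(probs))
expected = [p * N_PAGES for p in probs]

# Pool as described in the text: classes 0 and 1 together,
# and classes 4, 5, ..., "10 or more" together.
pooled = [expected[0] + expected[1], expected[2], expected[3], sum(expected[4:])]
print([round(e, 2) for e in pooled])
# [7.97, 8.96, 8.96, 14.11]
```

After pooling, four classes remain and every expected frequency is at least 5.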
As usual we use the following formula for calculating the chi-square (χ2) value.
\[ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} \]
The following table gives the calculated χ2 value after pooling cells
Since n = 4, the number of degrees of freedom will be n–1 = 4–1 = 3 and we
are given α = 0.01 as the level of significance. From Appendix Table-4 the table value of
χ2 for 3 degrees of freedom and α = 0.01 is 11.345. Since the calculated
value of χ2 = 5.9632 is less than the table value of χ2 = 11.345, we
accept the null hypothesis (H0) and conclude that the typing errors follow a
Poisson probability distribution with λ = 3.0.
17.7 YATES CORRECTION
Yates correction is also called Yates correction for continuity. In a 2 × 2
contingency table the number of degrees of freedom is 1. If any one of the expected cell
frequencies is less than 5, then use of the pooling method (explained in section 17.6)
may result in 0 degrees of freedom due to the loss of 1 degree of freedom in
pooling, which is meaningless. Moreover, it is not valid to perform the chi
square test if any one or more of the expected frequencies is less than 5 (as
explained in section 17.5). Therefore, if any one or more of the expected
frequencies in a 2 × 2 contingency table is less than 5, the Yates correction is
applied. This was proposed by F. Yates, who was an English mathematician.
Suppose for a 2 × 2 contingency table, the four cell values a, b, c and d are
arranged in the following order.
a    b
c    d
The Yates formula for corrected chi square is given by
\[ \chi^2 = \frac{n\left(|ad - bc| - \frac{n}{2}\right)^2}{(a+b)(c+d)(a+c)(b+d)} \]
where n = a + b + c + d is the sample size.
Illustration 9
Suppose we have the following data on the consumer preference of a new
product collected from the people living in north and south India.
                                                South India   North India   Row total
Number of consumers who prefer present product       4             51           55
Number of consumers who prefer new product          14             38           52
Column total                                        18             89          107
Do the data suggest that the new product is preferred by the people
independent of their region? Use α = 0.05.
Solution: Suppose we symbolise the true proportions of people who prefer
the new product as:
PS = proportion of south Indians who prefer the new product
PN = proportion of north Indians who prefer the new product
We state the null hypothesis (H0) and alternative hypothesis (H1) as:
H0: PS = PN (the proportions of people who prefer the new product in south and north
India are the same).
H1: PS ≠ PN (the proportions of people who prefer the new product in south and north
India are not the same).
In this illustration, (i) the sample size (n) = 107 (ii) the cell values are: a = 4,
b = 51, c = 14, d = 38, (iii) The corresponding row totals are: (a + b) = 55 and
(c + d) = 52, and column totals are (a + c) = 18 and (b + d) = 89.
Since one of the cell frequencies is less than 5 (a = 4), we apply Yates
correction to the chi-square test.
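\[ \chi^2 = \frac{n\left(|ad - bc| - \frac{n}{2}\right)^2}{(a+b)(c+d)(a+c)(b+d)} = \frac{107\left(|4 \times 38 - 51 \times 14| - \frac{107}{2}\right)^2}{55 \times 52 \times 18 \times 89} = \frac{107 \times (508.5)^2}{4581720} \]
∴ χ2 = 6.0386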
The table value for degrees of freedom (2–1) (2–1) = 1 and significance level
α = 0.05 is 3.841. Since the calculated value of chi-square is 6.0386, which is
greater than the table value, we reject H0, accept H1, and conclude that
the preference for the new product is not independent of the geographical
region.
It may be observed that when n is large, Yates correction will not make much
difference in the chi square value. However, if n is small, applying
Yates correction may overstate the probability.
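As a check on Illustration 9, the Yates formula can be written as a small Python function. This is only a sketch: the function names are hypothetical, and the uncorrected statistic is included purely to show how much the correction changes the value for this sample.

```python
def yates_chi_square(a, b, c, d):
    """Chi-square for a 2 x 2 table (a b / c d) with Yates correction."""
    n = a + b + c + d
    numerator = n * (abs(a * d - b * c) - n / 2) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def uncorrected_chi_square(a, b, c, d):
    """Same statistic without the n/2 continuity correction."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Cell values from Illustration 9.
corrected = yates_chi_square(4, 51, 14, 38)
uncorrected = uncorrected_chi_square(4, 51, 14, 38)
print(round(corrected, 4))    # 6.0386
print(round(uncorrected, 4))  # 7.3761
```

The corrected value is always the smaller of the two, so the correction makes it harder to reject H0; here both values exceed the table value of 3.841.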
17.8 LIMITATIONS OF CHI-SQUARE TEST
In order to prevent the misapplication of the χ2 test, one has to keep the
following limitations of the test in mind:
a) As explained in section 17.5 (conditions for applying chi square test), the chi square
test is highly sensitive to the sample size. As sample size increases, absolute
differences become a smaller and smaller proportion of the expected value. This means
that a reasonably strong association may not come up as significant if the sample
size is small. Conversely, in a large sample, we may find statistical significance
when the differences are small and of little practical importance. That is, the findings are not substantively
significant, although they are statistically significant.
b) Chi-square test is also sensitive to small frequencies in the cells of a contingency
table. Generally, when the expected frequency in a cell of a table is less than 5,
chi-square can lead to erroneous conclusions as explained in section 17.5. The
rule of thumb here is that if either (i) an expected value in a cell of a 2 × 2 contingency
table is less than 5 or (ii) the expected values of more than 20% of the cells in a
contingency table larger than 2 × 2 are less than 5, then the chi square test should not
be applied as it stands. If a chi-square test is applied at all, then either Yates
correction or cell pooling, as appropriate, should also be applied.
c) No directional hypothesis is assumed in chi-square test. Chi-square tests the
hypothesis that two attributes/variables are related only by chance. That is if a
significant relationship is found, this is not equivalent to establishing the researchers’
hypothesis that attribute A causes attribute B or attribute B causes attribute A.
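The rule of thumb in limitation (b) can be expressed as a simple check on a table of expected frequencies. The sketch below is illustrative only: the function name and the returned labels are assumptions, and the test table is the expected-frequency table of Illustration 7 before pooling.

```python
def small_expected_check(expected):
    """Apply the rule of thumb of limitation (b) to expected frequencies.

    Returns one of three illustrative labels:
    "apply Yates correction", "pool cells" or "ok".
    """
    cells = [e for row in expected for e in row]
    small = sum(1 for e in cells if e < 5)
    is_2x2 = len(expected) == 2 and all(len(row) == 2 for row in expected)
    if is_2x2:
        # (i) any expected value below 5 in a 2 x 2 table
        return "apply Yates correction" if small > 0 else "ok"
    # (ii) more than 20% of the cells below 5 in a larger table
    if small / len(cells) > 0.20:
        return "pool cells"
    return "ok"


# Expected frequencies of Illustration 7 before pooling (3 x 4 table).
illustration_7 = [
    [27.63, 27.63, 22.11, 27.63],
    [18.42, 18.42, 14.74, 18.42],
    [3.95, 3.95, 3.16, 3.95],
]
print(small_expected_check(illustration_7))  # pool cells
```

In Illustration 7, 4 of the 12 cells (about 33%) have expected values below 5, so the check points to cell pooling.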
Self Assessment Exercise B
1) While calculating the expected frequencies of a chi-square distribution it was found
that some of the cells of expected frequencies have value below 5. Therefore,
some of the cells are pooled. The following statements tell you the size of the
contingency table before pooling and the rows/columns pooled. Determine the
number of degrees of freedom.
a) 5 × 4 contingency table. First two and last two rows are pooled.
b) 4 × 6 contingency table. First two and last two columns are pooled.
c) 6 × 3 contingency table. First two rows are pooled. 4th, 5th, and 6th rows
are pooled.
..................................................................................................................
..................................................................................................................
..................................................................................................................
2) What is the table value of chi-square for goodness-of-fit if there are:
a) 8 degrees of freedom and the significance level is 1%.
b) 13 degrees of freedom and the significance level is 5%.
c) 16 degrees of freedom and the significance level is 10%.
d) 6 degrees of freedom and the significance level is 20%.
..................................................................................................................
3) a) The following data is an observed frequency distribution. Assume that
the data follow a Poisson distribution with λ = 2.5.
i) Calculate Poisson probabilities and expected values, ii) calculate the
chi-square value, and iii) at the 0.05 level of significance, can we
conclude that the data follow a Poisson distribution with λ = 2.5?
No. of Telephone
calls per minute
17.9 LET US SUM UP
There are several applications of chi-square distribution, some of which we
have studied in this Unit. These are (i) to test the goodness-of-fit, and (ii) to
test the independence of attributes. The chi-square distribution is known by its
only parameter, the number of degrees of freedom. As with the Student's t distribution,
there is a separate chi-square distribution for each number of degrees of
freedom.
The chi-square test for testing the goodness-of-fit establishes whether the
sample data supports the assumption that a particular distribution applies to the
parent population. It should be noted that the statistical procedures are based on
some assumptions such as normal distribution of population. A chi-square
procedure allows for testing the null hypothesis that a particular distribution
applies. We also use the chi-square test to determine whether the classification
criteria are independent or not.
When performing chi-square test using contingency tables, it is assumed that all
cell frequencies are a minimum of 5. If this assumption is not met we may use
the pooling method but then there is a loss of information when we use this
method. In a 2 × 2 contingency table if one or more cell frequencies are less
than 5 we should apply Yates correction for computing the chi-square value.
In a chi-square test for goodness-of-fit, the degrees of freedom are the number of
categories – 1, that is, (n–1). In a chi-square test for independence of attributes, the
degrees of freedom are (number of rows–1) × (number of columns–1). That is,
(r–1) × (c–1).
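The two degrees-of-freedom rules above are simple enough to write as one-line Python helpers. The function names are illustrative; the point to note is that the counts passed in must be the numbers of categories, rows and columns that remain after any pooling.

```python
def df_goodness_of_fit(n_categories):
    """Degrees of freedom for a goodness-of-fit test: n - 1."""
    return n_categories - 1


def df_independence(n_rows, n_cols):
    """Degrees of freedom for a test of independence: (r - 1) * (c - 1)."""
    return (n_rows - 1) * (n_cols - 1)


print(df_goodness_of_fit(4))   # 3  (Illustration 8: 4 classes after pooling)
print(df_independence(2, 4))   # 3  (Illustration 7: 2 x 4 table after pooling)
print(df_independence(2, 2))   # 1  (any 2 x 2 table)
print(df_independence(1, 2))   # 0  (a 2 x 2 table after pooling two rows)
```

The last line shows why pooling is not used for a 2 × 2 table: it leaves 0 degrees of freedom, so Yates correction is applied instead.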
17.10 KEY WORDS
Adequate Cell Sizes: To avoid making incorrect inferences from chi-square
tests we follow a general rule that the expected frequency in any cell should be
a minimum of 5.
Cells Pooling: When a contingency table contains one or more cells with
expected frequency less than 5, we combine two rows or columns before
calculating χ2. We combine these cells in order to get an expected frequency of
5 or more in each cell.
Chi-Square Distribution: A family of probability distributions, differentiated by
their degrees of freedom, used to test a number of different hypotheses about
variances, proportions and distributional goodness of fit.
Expected Frequencies: The hypothetical data in the cells are called
expected frequencies.
Goodness of Fit: The chi-square test procedure used for the validation of our
assumption about the probability distribution is called goodness of fit.
Observed Frequencies: The actual cell frequencies are called observed
frequencies.
Yates Correction: If any one or more of the expected frequencies in a 2 × 2
contingency table is less than 5, the Yates correction is applied.
17.11 ANSWERS TO SELF ASSESSMENT EXERCISES
A) 1. a
i) H0: The preference for the type of car among people is independent
of their sex.
ii) degrees of freedom: 6
iii) χ2 (table value): 12.592
iv) Conclusion: Accept H0.
1. b
i) H0: The income distribution and preference for type of house are
independently distributed.
ii) degrees of freedom: 9
iii) χ2 (table value): 21.666
iv) Conclusion: Reject H0.
1. c
i) H0: The attitude towards going to a movie or for shopping is
independent of the sex.
ii) degrees of freedom: 1
iii) χ2 (table value): 6.635
iv) Conclusion: Reject H0.
1. d
i) H0: The voters' educational level and their political affiliation are
independent of each other.
ii) degrees of freedom: 9
iii) χ2 (table value): 14.684
iv) Conclusion: Accept H0.
2. a) 25, b) 6, c) 8, d) 21.
3. a.
(Row,      Observed         Expected         (Oi - Ei)   (Oi - Ei)2   (Oi - Ei)2/Ei
Column)    frequency (Oi)   frequency (Ei)
(1,1)          50              52.5            –2.5         6.25         0.1190
(1,2)          55              52.5             2.5         6.25         0.1190
(1,3)          45              52.5            –7.5        56.25         1.0714
(1,4)          60              52.5             7.5        56.25         1.0714
(2,1)          50              47.5             2.5         6.25         0.1316
(2,2)          45              47.5            –2.5         6.25         0.1316
(2,3)          55              47.5             7.5        56.25         1.1842
(2,4)          40              47.5            –7.5        56.25         1.1842
Total         400             400                                    χ2 = 5.0124
3. b. H0: The preference for the brand is distributed independent of the consumers'
education level.
3. c. Table value of χ2 at 3 d.f. and α = 0.05 is 7.815. Since the calculated value (5.0124)
is less than the table value of χ2 (7.815), we accept H0.
B) 1. a) 6, b) 9, c) 4
2. a) 20.090, b) 22.362, c) 23.542, d) 8.558
3. i) Poisson probabilities and expected values
No. of repairs    Poisson probability   Expected frequency
per car (x)                             Ei = (2) × 150
(1)               (2)                   (3)
0                 0.0498                  7.47
1                 0.1494                 22.41
2                 0.2240                 33.60
3                 0.2240                 33.60
4                 0.1680                 25.20
5 or more         0.1848                 27.72
3. ii) chi-square value
No. of Telephone   Observed         Expected         (Oi-Ei)   (Oi-Ei)2   (Oi-Ei)2/Ei
calls per minute   frequency (Oi)   frequency (Ei)
0                       6               7.47          –1.47       2.16       0.2893
1                      30              22.41           7.59      57.61       2.5706
2                      41              33.60           7.40      54.76       1.6298
3                      52              33.60          18.40     338.56      10.0762
4                      12              25.20         –13.20     174.24       6.9143
5 or more               9              27.72         –18.72     350.44      12.6421
Total                 150             150                                χ2 = 34.1222
3. iii) With 6 classes there are 6–1 = 5 degrees of freedom. At 0.05 significance level
and 5 degrees of freedom the table value is 11.070.
Since the calculated chi-square value is greater than the table value we reject
the null hypothesis that the frequency of telephone calls follows a Poisson
distribution.
17.12 TERMINAL QUESTIONS/EXERCISES
1) Why do we use chi-square test?
2) What do you mean by expected frequencies in (a) chi-square test for testing
independence of attributes, and (b) chi-square test for testing goodness-of-fit?
Briefly explain the procedure you follow in calculating the expected values in
each of the above situations.
3) Explain the conditions for applying chi-square test.
4) What are the limitations for applying chi-square test?
5) When do you use Yates correction?
6) When do you pool rows or columns while applying chi-square test? What are its
limitations?
7) The following data provides information for 30 days on fatal accidents in a metro
city. Do the data suggest that the distribution of fatal accidents follow a Poisson
distribution? Take the level of significance as 0.05.
Fatal accidents per day    0    1    2    3    4 or more
Frequency                  4    8   10    6    2
8) Below is an observed frequency distribution.
Marks range          No. of students
Under 40                    9
40 and under 50            20
50 and under 60            65
60 and under 75            34
75 and under 90            14
90 and above                8
At 0.01 significance level, the null hypothesis is that the data is from normal
distribution with a mean of 10 and a standard deviation of 2. What are your
conclusions?
9) The following table gives the number of telephone calls attended by a credit card
information attendant.
Day
Test whether the telephone calls are uniformly distributed. Use the 0.10
significance level.
10) The following data gives preference of car makes by type of customer.
Type of                              Car make
customer         Maruti 800   Maruti Zen   Honda   Tata Indica   Total
Single man          350          200        150        50         750
Single woman        100          150        100        80         430
Married man         300          150        120       120         690
Married woman       150          100         80        50         380
Total               900          600        450       300        2250
(a) Test the independence of the two attributes. Use 0.05 level of significance.
(b) Draw your conclusions.
11) A bath soap manufacturer introduced a new brand of soap in 4 colours. The
following data gives information on the consumer preference of the brand.
Consumer               Bath soap colour
rating        Red    Green    Brown    Yellow    Total
Excellent      30      20       20       30       100
Good           20      10       20       30        80
Fair           20      10       10       30        70
Poor           10      45       35       10       100
Total          80      85       85      100       350
From the above data:
a) Compute the χ2 value,
b) State the null hypothesis, and
c) Draw your inferences.
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
17.13 FURTHER READING
A number of good text books are available on the topics dealt with in this unit. The
following books may be used for more in-depth study.
1) Kothari, C.R.(1985) Research Methodology Methods and Techniques, Wiley
Eastern, New Delhi.
2) Levin, R.I. and D.S. Rubin. (1999) Statistics for Management, Prentice-Hall
of India, New Delhi
3) Mustafi, C.K.(1981) Statistical Methods in Managerial Decisions,
Macmillan, New Delhi.
4) Chandan, J.S., Statistics for Business and Economics, Vikas Publishing
House Pvt Ltd New Delhi.
5) Zikmund, William G. (1988) Business Research Methods, The Dryden
Press, New York.
Appendix Table-2 Direct Values for Determining Poisson Probabilities
For a given value of λ, entry indicates the probability of obtaining a specified value of X.
µ
x
Source: From Table IV of Fisher and Yates, Statistical Tables for Biological,
Agricultural and Medical Research, published by Longman Group Ltd
(previously published by Oliver and Boyd, Edinburgh, 1963).