11.4 Chi-Square Independence Test
The chi-square independence test is used to test for an association between two categorical variables of a population.
11.4.1 Terminologies Used for a Contingency Table
Recall that a contingency table summarizes the counts of two categorical variables. For example, the following contingency table groups 200 females according to their breast cancer status and smoking status:
Table 11.6: Contingency Table of Cancer Status (row) and Smoking Status (column)
| | Smoker | Non-smoker | Total |
|---|---|---|---|
| Breast Cancer | **10** | **30** | 40 |
| Cancer-free | **20** | **140** | 160 |
| Total | 30 | 170 | 200 |
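Suppose we randomly select an individual from this sample. Define the events:
- $C_1$: the selected individual has breast cancer; $C_2$: the selected individual is cancer-free.
- $S_1$: the selected individual is a smoker; $S_2$: the selected individual is a non-smoker.

The joint events are: $(C_1 \,\&\, S_1)$, $(C_1 \,\&\, S_2)$, $(C_2 \,\&\, S_1)$, and $(C_2 \,\&\, S_2)$.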
The variable “Cancer Status” is called the row variable, and it has two possible values—cancer or cancer-free. The variable “Smoking Status” is the column variable, and it has two values—smoker and non-smoker. The two numbers in the last column (40 and 160) are the row totals and the two in the last row (30, 170) are the column totals. The sample size is also called the grand total. The four numbers in bold are the joint frequencies. The boxes that contain the joint frequencies are referred to as cells.
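Based on the contingency table, we can obtain the joint and marginal probability distributions by dividing each count by the grand total, 200.
Table 11.7: Marginal and Joint Probability Distributions of Cancer Status and Smoking Status

| | Smoker | Non-smoker | Total |
|---|---|---|---|
| Breast Cancer | 10/200 = 0.05 | 30/200 = 0.15 | 40/200 = 0.20 |
| Cancer-free | 20/200 = 0.10 | 140/200 = 0.70 | 160/200 = 0.80 |
| Total | 30/200 = 0.15 | 170/200 = 0.85 | 1 |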
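The joint and marginal probabilities can be reproduced from the counts in Table 11.6. The following is a minimal sketch in plain Python; the dictionary layout and the variable names (`counts`, `joint`, `row_marginal`, `col_marginal`) are illustrative choices, not part of the text.

```python
# Observed counts from Table 11.6 (rows: cancer status, columns: smoking status).
counts = {
    "Breast Cancer": {"Smoker": 10, "Non-smoker": 30},
    "Cancer-free": {"Smoker": 20, "Non-smoker": 140},
}

# Grand total: the sum of all joint frequencies.
grand_total = sum(sum(row.values()) for row in counts.values())

# Joint probabilities: each cell count divided by the grand total.
joint = {
    r: {c: n / grand_total for c, n in row.items()}
    for r, row in counts.items()
}

# Marginal probabilities: row and column totals divided by the grand total.
row_marginal = {r: sum(row.values()) / grand_total for r, row in counts.items()}
columns = list(next(iter(counts.values())))
col_marginal = {
    c: sum(counts[r][c] for r in counts) / grand_total for c in columns
}

print(grand_total)
print(joint)
print(row_marginal)
print(col_marginal)
```

The row marginals sum to 1, as do the column marginals and the four joint probabilities.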
We want to test for an association between the two variables in a contingency table. Two variables are said to be associated if they are NOT independent. If two variables are associated, then differences exist among the conditional distributions of one variable, given different values of the other variable. For example, the conditional distributions of “Cancer Status” given “Smoking Status” are given in the following table. Notice that the conditional distributions are simply the relative frequencies of “Cancer” within smoker and non-smoker groups.
Table 11.8: Conditional Probability Distribution of Cancer Status Given Smoking Status
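
| | Smoker | Non-smoker | Marginal Distribution of Cancer Status |
|---|---|---|---|
| Breast Cancer | 10/30 = 0.333 | 30/170 = 0.176 | 40/200 = 0.200 |
| Cancer-free | 20/30 = 0.667 | 140/170 = 0.824 | 160/200 = 0.800 |
| Total | 1 | 1 | 1 |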
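The conditional distributions are the within-column relative frequencies, so they can be computed directly from the counts in Table 11.6. A short sketch, with illustrative variable names:

```python
# Observed counts from Table 11.6 (rows: cancer status, columns: smoking status).
counts = {
    "Breast Cancer": {"Smoker": 10, "Non-smoker": 30},
    "Cancer-free": {"Smoker": 20, "Non-smoker": 140},
}
columns = ["Smoker", "Non-smoker"]

# Conditional distribution of cancer status given each smoking status:
# divide every count in a column by that column's total.
conditional = {}
for c in columns:
    col_total = sum(counts[r][c] for r in counts)
    conditional[c] = {r: counts[r][c] / col_total for r in counts}

# Marginal distribution of cancer status: row totals over the grand total.
grand_total = sum(sum(row.values()) for row in counts.values())
marginal = {r: sum(row.values()) / grand_total for r, row in counts.items()}

for c in columns:
    print(c, {r: round(p, 3) for r, p in conditional[c].items()})
print("Marginal", {r: round(p, 3) for r, p in marginal.items()})
```

If the two variables were independent, the two conditional distributions and the marginal distribution would all agree; here the proportion with breast cancer differs between smokers and non-smokers.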
A segmented bar graph helps us visualize conditional distributions and the concept of association. The figure below is the segmented bar graph that displays the conditional distributions of “Cancer Status” for smokers and non-smokers and the marginal distribution of “Cancer Status”. The three bars should be identical if “Cancer Status” and “Smoking Status” are independent. That is, the conditional probabilities should equal the unconditional probabilities:

$$P(\text{Cancer} \mid \text{Smoker}) = P(\text{Cancer} \mid \text{Non-smoker}) = P(\text{Cancer}).$$
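[Figure: segmented bar graph of the conditional distributions of “Cancer Status” for smokers and non-smokers and the marginal distribution of “Cancer Status”]

Interpretation:
The proportion or percentage of females with breast cancer (the green bar) is higher among the smokers than the non-smokers. Therefore, “Cancer Status” and “Smoking Status” might be associated; we can test this by a chi-square independence test.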
11.4.2 Main Idea Behind Chi-Square Independence Test
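The null hypothesis is that the two variables are independent; the alternative is that they are associated. The test statistic is the same as that from the chi-square goodness-of-fit test; for each cell, compute the difference between the observed frequency ($O$) and the expected frequency ($E$), square it, and divide by the expected frequency. The test statistic is the sum of these quantities over all cells:

$$\chi^2 = \sum_{\text{all cells}} \frac{(O - E)^2}{E}.$$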
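The test procedure is straightforward; the key is calculating each cell’s expected frequency. Recall that two events, $A$ and $B$, are independent if

$$P(A \,\&\, B) = P(A) \times P(B).$$

If “Cancer Status” and “Smoking Status” are independent, then, for example,

$$P(\text{Cancer} \,\&\, \text{Smoker}) = P(\text{Cancer}) \times P(\text{Smoker}) = \frac{40}{200} \times \frac{30}{200},$$

so the expected frequency of the “Cancer” & “Smoker” cell is

$$E = 200 \times \frac{40}{200} \times \frac{30}{200} = \frac{40 \times 30}{200} = 6.$$

In general,

$$E = \frac{\text{row total} \times \text{column total}}{\text{grand total}}.$$

Applying the above formula to each cell yields the following expected frequencies:
- “Cancer” & “Smoker”: $E = \frac{40 \times 30}{200} = 6$
- “Cancer” & “Non-smoker”: $E = \frac{40 \times 170}{200} = 34$
- “Cancer free” & “Smoker”: $E = \frac{160 \times 30}{200} = 24$
- “Cancer free” & “Non-smoker”: $E = \frac{160 \times 170}{200} = 136$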
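The expected-frequency rule (row total times column total, divided by the grand total) is easy to apply mechanically. A minimal sketch, where `expected_frequencies` is a hypothetical helper name and the table is stored as a list of rows:

```python
def expected_frequencies(observed):
    """Expected cell counts under independence for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # E = (row total * column total) / grand total, for every cell.
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Observed counts from Table 11.6: rows are Breast Cancer / Cancer-free,
# columns are Smoker / Non-smoker.
observed = [[10, 30], [20, 140]]
expected = expected_frequencies(observed)
print(expected)  # [[6.0, 34.0], [24.0, 136.0]]
```

Note that the expected frequencies have the same row and column totals as the observed frequencies.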
To compute the test statistic, it is helpful to write each expected frequency in the same cell as the corresponding observed frequency. The following table gives both the observed and expected frequencies for each cell (the expected frequencies are displayed in brackets):
Table 11.9: Observed and Expected Frequency (in Brackets) of Chi-Square Independence Test

| | Smoker | Non-smoker | Total |
|---|---|---|---|
| Breast Cancer | 10 (6) | 30 (34) | 40 |
| Cancer-free | 20 (24) | 140 (136) | 160 |
| Total | 30 | 170 | 200 |
Chi-Square Independence Test
The assumptions and steps of conducting a chi-square independence test are as follows.
Assumptions:
1. All expected frequencies are at least 1.
2. At most 20% of the expected frequencies are less than 5.
3. Simple random sample (required only if you need to generalize the conclusion to a larger population).
Note: If either assumption 1 or 2 is violated, one can consider combining the cells to make the counts in those cells larger.
Steps to perform a chi-square independence test:
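First, check the assumptions. Calculate the expected frequency for each cell using

$$E = \frac{\text{row total} \times \text{column total}}{n},$$

where $n$ is the grand total.
- Set up the hypotheses:
  - $H_0$: The two variables are independent.
  - $H_a$: The two variables are associated.
- State the significance level $\alpha$.
- Compute the value of the test statistic: $\chi^2_o = \sum_{\text{all cells}} \frac{(O - E)^2}{E}$ with $df = (r - 1) \times (c - 1)$, where $r$ is the number of rows and $c$ is the number of columns of the cells.
- Find the P-value or rejection region based on the $\chi^2$ curve with $df = (r - 1) \times (c - 1)$:
  - P-value: the area to the right of $\chi^2_o$ under the curve.
  - Rejection region: the region to the right of $\chi^2_\alpha$.
- Reject the null $H_0$ if the P-value $\le \alpha$ or $\chi^2_o$ falls in the rejection region.
- Conclusion.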
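The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not a reference implementation: the function names `chi2_right_tail` and `chi_square_independence` are hypothetical, and the right-tail area uses the standard closed-form series for integer degrees of freedom rather than a statistical library.

```python
import math


def chi2_right_tail(x, df):
    """Area to the right of x under the chi-square curve with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2) / k
            total += term
        return math.exp(-x / 2) * total
    # Odd df: erfc(sqrt(x/2)) + sqrt(2/pi) * exp(-x/2)
    #         * sum_{r=1}^{(df-1)/2} x^(r - 1/2) / (1 * 3 * ... * (2r - 1))
    chi = math.sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    tail = math.erfc(chi / math.sqrt(2))
    return tail + math.sqrt(2 / math.pi) * math.exp(-x / 2) * total


def chi_square_independence(observed, alpha):
    """Return (statistic, df, p_value, reject_null) for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected frequency of each cell under independence.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Sum of (O - E)^2 / E over all cells.
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    p_value = chi2_right_tail(statistic, df)
    return statistic, df, p_value, p_value <= alpha


# Usage with the counts from Table 11.6 at the 10% significance level.
statistic, df, p_value, reject = chi_square_independence([[10, 30], [20, 140]], 0.10)
print(round(statistic, 3), df, reject)
```

The sketch does not check the assumptions on the expected frequencies; those should be verified before the result is interpreted.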
Example: Chi-Square Independence Test
Test at the 10% significance level whether the variables “Cancer Status” and “Smoking Status” are associated.

| | Smoker (S1) | Non-smoker (S2) | Total |
|---|---|---|---|
| Breast Cancer | 10 (6) | 30 (34) | 40 |
| Cancer-free | 20 (24) | 140 (136) | 160 |
| Total | 30 | 170 | 200 |
Check the assumptions: The expected frequencies are the values given in brackets, all greater than 5. We must assume this is a simple random sample of females.
Steps:
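- Set up the hypotheses:
  - $H_0$: “Cancer Status” and “Smoking Status” are independent.
  - $H_a$: “Cancer Status” and “Smoking Status” are associated.
- The significance level is $\alpha = 0.10$.
- Compute the value of the test statistic: $\chi^2_o = \frac{(10 - 6)^2}{6} + \frac{(30 - 34)^2}{34} + \frac{(20 - 24)^2}{24} + \frac{(140 - 136)^2}{136} = 2.667 + 0.471 + 0.667 + 0.118 = 3.922$ with $df = (2 - 1) \times (2 - 1) = 1$.
- Find the P-value: $0.025 <$ P-value $< 0.05$, since $\chi^2_{0.05} = 3.841 < \chi^2_o = 3.922 < \chi^2_{0.025} = 5.024$.
- Decision: Reject the null $H_0$ since P-value $< 0.05 < 0.10 = \alpha$.
- Conclusion: At the 10% significance level, we have sufficient evidence of an association between the variables “Cancer Status” and “Smoking Status”.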
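As a cross-check of this example, the sketch below recomputes each cell's contribution to the test statistic. It also computes the P-value using a shortcut that holds only for 1 degree of freedom: a chi-square variable with $df = 1$ is the square of a standard normal variable, so the right-tail area equals `erfc(sqrt(x / 2))`. Variable names are illustrative.

```python
import math

# Cells of the 2x2 table, listed row by row.
observed = [10, 30, 20, 140]
expected = [6, 34, 24, 136]  # expected frequencies under independence

# Contribution of each cell: (O - E)^2 / E.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
statistic = sum(contributions)

# Valid for df = 1 only: P(chi2 >= x) = P(|Z| >= sqrt(x)) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(statistic / 2))

alpha = 0.10
print([round(c, 3) for c in contributions])
print(round(statistic, 3), round(p_value, 4), p_value <= alpha)
```

The “Breast Cancer” & “Smoker” cell contributes the most to the statistic because its expected frequency is the smallest while every cell has the same absolute deviation of 4.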

Exercise: Chi-Square Independence Test
A random sample of 230 adults yields the following data regarding age and Internet usage. At the 1% significance level, do the data provide sufficient evidence of an association between age and Internet usage?
Table 11.10: Contingency Table of Internet Usage (row) and Age (column)

| | 18–24 | 25–64 | 65+ | Total |
|---|---|---|---|---|
| Never | 6 | 38 | 31 | 75 |
| Sometimes | 14 | 31 | 5 | 50 |
| Every day | 50 | 50 | 5 | 105 |
| Total | 70 | 119 | 41 | 230 |
Answers:
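Check the assumptions.
Applying the formula $E = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$ to each cell yields the following expected frequencies:
- “Never” & “18-24”: $E = \frac{75 \times 70}{230} = 22.826$
- “Never” & “25-64”: $E = \frac{75 \times 119}{230} = 38.804$
- “Never” & “65+”: $E = \frac{75 \times 41}{230} = 13.370$
- “Sometimes” & “18-24”: $E = \frac{50 \times 70}{230} = 15.217$
- “Sometimes” & “25-64”: $E = \frac{50 \times 119}{230} = 25.870$
- “Sometimes” & “65+”: $E = \frac{50 \times 41}{230} = 8.913$
- “Every day” & “18-24”: $E = \frac{105 \times 70}{230} = 31.957$
- “Every day” & “25-64”: $E = \frac{105 \times 119}{230} = 54.326$
- “Every day” & “65+”: $E = \frac{105 \times 41}{230} = 18.717$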
The expected frequencies are given in brackets; they are all greater than 5. We are told this is a random sample. Therefore, assumptions for the chi-square independence test are satisfied.
Table 11.11: Observed and Expected Frequency of Internet Usage (row) and Age (column)

| | 18–24 | 25–64 | 65+ | Total |
|---|---|---|---|---|
| Never | 6 (22.826) | 38 (38.804) | 31 (13.370) | 75 |
| Sometimes | 14 (15.217) | 31 (25.870) | 5 (8.913) | 50 |
| Every day | 50 (31.957) | 50 (54.326) | 5 (18.717) | 105 |
| Total | 70 | 119 | 41 | 230 |
Steps:
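- Set up the hypotheses:
  - $H_0$: Age and Internet usage are independent.
  - $H_a$: Age and Internet usage are associated.
- The significance level is $\alpha = 0.01$.
- Compute the value of the test statistic: $\chi^2_o = \frac{(6 - 22.826)^2}{22.826} + \frac{(38 - 38.804)^2}{38.804} + \cdots + \frac{(5 - 18.717)^2}{18.717} \approx 59.087$ with $df = (3 - 1) \times (3 - 1) = 4$.
- Find the P-value: P-value $= P(\chi^2 \ge 59.087) < 0.005$, since $\chi^2_o = 59.087 > \chi^2_{0.005} = 14.860$.
- Decision: Reject the null $H_0$ since P-value $< 0.005 < 0.01 = \alpha$.
- Conclusion: At the 1% significance level, we have sufficient evidence that there is an association between age and Internet usage.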
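The exercise can be checked numerically with a short sketch. The variable names are illustrative, and the P-value line relies on a closed form that holds only for 4 degrees of freedom: the area to the right of $x$ under the chi-square curve is $e^{-x/2}(1 + x/2)$.

```python
import math

# Observed counts from Table 11.10: rows are Never / Sometimes / Every day,
# columns are the age groups 18-24 / 25-64 / 65+.
observed = [[6, 38, 31], [14, 31, 5], [50, 50, 5]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequencies under independence.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Test statistic: sum of (O - E)^2 / E over all nine cells.
statistic = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Closed form valid for df = 4 only: exp(-x/2) * (1 + x/2).
p_value = math.exp(-statistic / 2) * (1 + statistic / 2)

for row in expected:
    print([round(e, 3) for e in row])
print(round(statistic, 2), df, p_value < 0.01)
```

The three cells with the largest contributions are “Never” & “65+”, “Never” & “18-24”, and “Every day” & “18-24”, which is where the observed counts depart most from what independence would predict.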