"

11.4 Chi-Square Independence Test

The chi-square independence test is used to test for an association between two categorical variables of a population.

11.4.1 Terminologies Used for a Contingency Table

Recall that a contingency table summarizes the counts of two categorical variables. For example, the following contingency table groups 200 females according to their breast cancer status and smoking status:

Table 11.6: Contingency Table of Cancer Status (row) and Smoking Status (column)

| | Smoker (S1) | Non-smoker (S2) | Total |
|---|---|---|---|
| Breast Cancer (C1) | 10 (C1&S1) | 30 (C1&S2) | 40 |
| Cancer-free (C2) | 20 (C2&S1) | 140 (C2&S2) | 160 |
| Total | 30 | 170 | 200 |

Suppose we randomly select an individual from this sample. Define the events:

  • S1 = the subject is a smoker;
  • S2 = the subject is a non-smoker;
  • C1 = the subject has breast cancer;
  • C2 = the subject does not have breast cancer.

The joint events are:

  • C1&S1 = the subject has cancer and is a smoker;
  • C1&S2 = the subject has cancer and is a non-smoker;
  • C2&S1 = the subject does not have cancer and is a smoker;
  • C2&S2 = the subject does not have cancer and is a non-smoker.

The variable “Cancer Status” is called the row variable, and it has two possible values: cancer or cancer-free. The variable “Smoking Status” is the column variable, and it has two values: smoker and non-smoker. The two numbers in the last column (40 and 160) are the row totals, and the two in the last row (30 and 170) are the column totals. The sample size (200) is also called the grand total. The four counts inside the table (10, 30, 20, and 140) are the joint frequencies. The boxes that contain the joint frequencies are referred to as cells.

Based on the f/N rule, the marginal distribution of the row (column) variable equals the row (column) totals divided by n. The joint distribution is given by the joint frequencies divided by n. The following table shows the marginal distribution of “Cancer Status” in the last column, the marginal distribution of “Smoking Status” in the last row, and the joint distribution of the four cells.

Table 11.7: Marginal and Joint Probability Distributions of Cancer Status and Smoking Status

| | Smoker (S1) | Non-smoker (S2) | Total |
|---|---|---|---|
| Breast Cancer (C1) | P(C1&S1) = 10/200 = 0.05 | P(C1&S2) = 30/200 = 0.15 | P(C1) = 40/200 = 0.2 |
| Cancer-free (C2) | P(C2&S1) = 20/200 = 0.1 | P(C2&S2) = 140/200 = 0.7 | P(C2) = 160/200 = 0.8 |
| Total | P(S1) = 30/200 = 0.15 | P(S2) = 170/200 = 0.85 | 1 |
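As a rough illustration of the f/N rule, the joint and marginal distributions in Table 11.7 can be reproduced from the counts in Table 11.6. This is a minimal sketch in plain Python; the variable names (`counts`, `joint`, `row_marginal`, `col_marginal`) are chosen here for illustration and do not come from the text.

```python
# Observed counts from Table 11.6: rows = cancer status, columns = smoking status.
counts = [[10, 30],    # breast cancer: smoker, non-smoker
          [20, 140]]   # cancer-free:   smoker, non-smoker

n = sum(sum(row) for row in counts)                      # grand total

# Joint distribution: each joint frequency divided by n.
joint = [[cell / n for cell in row] for row in counts]

# Marginal distributions: row totals and column totals divided by n.
row_marginal = [sum(row) / n for row in counts]          # P(C1), P(C2)
col_marginal = [sum(col) / n for col in zip(*counts)]    # P(S1), P(S2)

print("joint:", joint)
print("cancer status marginal:", row_marginal)
print("smoking status marginal:", col_marginal)
```

Each marginal distribution sums to 1, as does the joint distribution over all four cells.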

We want to test for an association between the two variables in a contingency table. Two variables are said to be associated if they are NOT independent. If two variables are associated, then differences exist among the conditional distributions of one variable, given different values of the other variable. For example, the conditional distributions of “Cancer Status” given “Smoking Status” are given in the following table. Notice that the conditional distributions are simply the relative frequencies of “Cancer” within smoker and non-smoker groups.

Table 11.8: Conditional Probability Distribution of Cancer Status Given Smoking Status

| | Smoker (S1) | Non-smoker (S2) | Marginal Distribution of Cancer Status |
|---|---|---|---|
| Breast Cancer (C1) | P(C1\|S1) = 10/30 = 0.333 | P(C1\|S2) = 30/170 = 0.176 | P(C1) = 40/200 = 0.2 |
| Cancer-free (C2) | P(C2\|S1) = 20/30 = 0.667 | P(C2\|S2) = 140/170 = 0.824 | P(C2) = 160/200 = 0.8 |
| Total | 1 | 1 | 1 |
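The conditional distributions in Table 11.8 are column-wise relative frequencies, which the following sketch recomputes from the observed counts. It assumes the same 2×2 layout as Table 11.6; the names `counts`, `col_totals`, and `conditional` are illustrative only.

```python
# Observed counts from Table 11.6: rows = cancer status, columns = smoking status.
counts = [[10, 30],    # breast cancer: smoker, non-smoker
          [20, 140]]   # cancer-free:   smoker, non-smoker

# Column totals: number of smokers and number of non-smokers.
col_totals = [sum(col) for col in zip(*counts)]

# Conditional distribution of cancer status given smoking status:
# divide each cell by the total of its column.
conditional = [[cell / total for cell, total in zip(row, col_totals)]
               for row in counts]

for label, row in zip(["P(C1 | S1), P(C1 | S2)", "P(C2 | S1), P(C2 | S2)"], conditional):
    print(label, [round(p, 3) for p in row])
```

If the two variables were independent, both columns of `conditional` would equal the marginal distribution of cancer status (0.2 and 0.8).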

A segmented bar graph helps us visualize conditional distributions and the concept of association. The figure below is the segmented bar graph that displays the conditional distributions of “Cancer Status” for smokers and non-smokers and the marginal distribution of “Cancer Status”. The three bars should be identical if “Cancer Status” and “Smoking Status” are independent. That is, the conditional probabilities should equal the unconditional probabilities:

P(C1|S1) = P(C1|S2) = P(C1); P(C2|S1) = P(C2|S2) = P(C2).

Figure 11.2: Segmented bar chart showing the relative proportions of cancer (green) and non-cancer (red) given smoking status. [Image Description (See Appendix D Figure 11.2)]

Interpretation:

The proportion or percentage of females with breast cancer (the green bar) is higher among the smokers than the non-smokers. Therefore, “Cancer Status” and “Smoking Status” might be associated; we can test this by a chi-square independence test.

11.4.2 Main Idea Behind Chi-Square Independence Test

The null hypothesis is that the two variables are independent; the alternative is that they are associated. The test statistic is the same as that from the chi-square goodness-of-fit test; for each cell, compute the difference between the observed frequency (O) and the expected frequency (E), square it, and divide by the expected frequency. The expected frequency is the number we expect to observe if the null is true. A large chi-square statistic means the observed and the expected frequencies are significantly different, which provides evidence against the null hypothesis. Therefore, we should reject the null if the observed chi-square statistic is sufficiently large. More specifically, given the significance level α, reject H0 if the P-value ≤ α, where the P-value is the area to the right of the observed test statistic under the chi-square curve.

The test procedure is straightforward; the key is calculating each cell’s expected frequency. Recall that two events, A and B, are independent if P(A&B)=P(A)×P(B). For example, if the events “Breast Cancer” and “Smoker” are independent, then P(Breast Cancer & Smoker)=P(Breast Cancer)×P(Smoker), where P(Breast Cancer) and P(Smoker) are given by the marginal distributions of “Cancer Status” and “Smoking Status” respectively. That is,
$$P(\text{Breast Cancer}) = \frac{40}{200} = 0.2; \quad P(\text{Smoker}) = \frac{30}{200} = 0.15.$$
If H0 (the two variables are independent) is true, the expected frequency for the cell “Cancer and Smoker” is
$$E = n P(\text{Cancer and Smoker}) = n P(\text{Cancer}) P(\text{Smoker}) = 200 \times \frac{40}{200} \times \frac{30}{200} = \frac{40 \times 30}{200} = 6.$$
In general,
$$\text{Expected frequency of the cell in the } r\text{th row and } c\text{th column} = \frac{(r\text{th row total}) \times (c\text{th column total})}{n}.$$
Applying the above formula to each cell yields the following expected frequencies:
Applying the above formula to each cell yields the following expected frequencies:

  • “Cancer” & “Smoker”: $E = \frac{40 \times 30}{200} = 6$.
  • “Cancer” & “Non-smoker”: $E = \frac{40 \times 170}{200} = 34$.
  • “Cancer free” & “Smoker”: $E = \frac{160 \times 30}{200} = 24$.
  • “Cancer free” & “Non-smoker”: $E = \frac{160 \times 170}{200} = 136$.
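The expected-frequency formula above translates directly into code. The sketch below is one possible implementation in plain Python; the function name `expected_frequencies` is hypothetical and introduced here only for illustration.

```python
def expected_frequencies(observed):
    """Expected counts under independence: (row total * column total) / n for each cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)  # grand total
    return [[r * c / n for c in col_totals] for r in row_totals]


# Observed counts from Table 11.6: rows = cancer status, columns = smoking status.
observed = [[10, 30],
            [20, 140]]

expected = expected_frequencies(observed)
print(expected)
```

The row and column totals of the expected frequencies match those of the observed table, which is a quick way to check a hand calculation.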

To compute the test statistic, it is helpful to write each expected frequency in the same cell as the corresponding observed frequency. The following table gives both the observed and expected frequencies for each cell (the expected frequencies are displayed in brackets):

Table 11.9: Observed and Expected Frequency (in Brackets) of Chi-Square Independence Test

| | Smoker (S1) | Non-smoker (S2) | Total |
|---|---|---|---|
| Breast Cancer (C1) | 10 (6) | 30 (34) | 40 |
| Cancer-free (C2) | 20 (24) | 140 (136) | 160 |
| Total | 30 | 170 | 200 |

Chi-Square Independence Test

The assumptions and steps of conducting a chi-square independence test are as follows.

Assumptions:

  1. All expected frequencies are at least 1.
  2. At most 20% of the expected frequencies are less than 5.
  3. Simple random sample (required only if you need to generalize the conclusion to a larger population).

Note: If either assumption 1 or 2 is violated, one can consider combining the cells to make the counts in those cells larger.
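Assumptions 1 and 2 concern only the expected frequencies, so they can be checked mechanically. The following is a minimal sketch; the function name `frequency_assumptions_hold` is hypothetical, and assumption 3 (simple random sample) cannot be verified from the table itself.

```python
def frequency_assumptions_hold(expected):
    """Check assumptions 1 and 2 on a table (list of rows) of expected frequencies."""
    cells = [e for row in expected for e in row]
    # Assumption 1: all expected frequencies are at least 1.
    all_at_least_one = all(e >= 1 for e in cells)
    # Assumption 2: at most 20% of the expected frequencies are less than 5.
    share_below_five = sum(e < 5 for e in cells) / len(cells)
    return all_at_least_one and share_below_five <= 0.20


# Expected frequencies from Table 11.9.
print(frequency_assumptions_hold([[6, 34], [24, 136]]))
```

When the check fails, the note above suggests combining cells and recomputing the expected frequencies before proceeding.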

Steps to perform a chi-square independence test:

First, check the assumptions. Calculate the expected frequency for each cell using $E = \frac{(r\text{th row total}) \times (c\text{th column total})}{n}$, where n is the total number of observations. Check whether the expected frequencies satisfy assumptions 1 and 2. If not, consider combining some cells.

  1. Set up the hypotheses:
    H0: The two variables are independent
    Ha: The two variables are associated.
  2. State the significance level α.
  3. Compute the value of the test statistic: $\chi^2_o = \sum_{\text{all cells}} \frac{(O-E)^2}{E}$ with $df = (r-1) \times (c-1)$, where $E = \frac{(r\text{th row total}) \times (c\text{th column total})}{n}$, r is the number of rows, and c is the number of columns of the cells.
  4. Find the P-value or rejection region based on the $\chi^2$ curve with $df = (r-1) \times (c-1)$:
    P-value: $P(\chi^2 \ge \chi^2_o)$, the area to the right of $\chi^2_o$ under the curve
    Rejection region: $\chi^2 \ge \chi^2_\alpha$, the region to the right of $\chi^2_\alpha$
  5. Reject the null H0 if the P-value ≤ α or $\chi^2_o$ falls in the rejection region.
  6. Conclusion.
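The steps above can be sketched end to end in plain Python. This is an illustrative implementation under stated assumptions, not a reference one: the names `chi2_right_tail` and `chi_square_independence_test` are hypothetical, and the right-tail area uses the standard closed-form series for a chi-square distribution with a positive integer df rather than a table lookup. In practice a statistics library such as SciPy (`scipy.stats.chi2.sf`) would normally supply the tail area.

```python
import math


def chi2_right_tail(x, df):
    """P(chi-square >= x) for a positive integer df, via closed-form series."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2) * sum_j x^(j-1) / (1*3*...*(2j-1))
    term, total = 1.0, 0.0
    for j in range(1, (df - 1) // 2 + 1):
        if j > 1:
            term *= x / (2 * j - 1)
        total += term
    return math.erfc(math.sqrt(half)) + math.sqrt(2 * x / math.pi) * math.exp(-half) * total


def chi_square_independence_test(observed, alpha):
    """Steps 3 to 5: test statistic, df, P-value, and the reject/do-not-reject decision."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum((o - e) ** 2 / e
                    for obs_row, exp_row in zip(observed, expected)
                    for o, e in zip(obs_row, exp_row))
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    p_value = chi2_right_tail(statistic, df)
    return {"statistic": statistic, "df": df, "p_value": p_value,
            "reject_h0": p_value <= alpha}


# Usage with the counts from Table 11.6 at the 10% significance level.
result = chi_square_independence_test([[10, 30], [20, 140]], alpha=0.10)
print(result)
```

The sketch does not check the assumptions on the expected frequencies; that check should still be done first, as described above.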

 

Example: Chi-Square Independence Test

Test at the 10% significance level whether the variables “Cancer Status” and “Smoking Status” are associated.

| | Smoker (S1) | Non-smoker (S2) | Total |
|---|---|---|---|
| Breast Cancer (C1) | 10 (6) | 30 (34) | 40 |
| Cancer-free (C2) | 20 (24) | 140 (136) | 160 |
| Total | 30 | 170 | 200 |

Check the assumptions: The expected frequencies are the values given in brackets, all greater than 5. We must assume this is a simple random sample of females.

Steps:

  1. Set up the hypotheses:
    H0:The variables "Cancer Status" and "Smoking Status" are independent
    Ha:The variables "Cancer Status" and "Smoking Status" are associated.
  2. The significance level is α=0.1.
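  3. Compute the value of the test statistic:
    $\chi^2_o = \sum_{\text{all cells}} \frac{(O-E)^2}{E} = \frac{(10-6)^2}{6} + \frac{(30-34)^2}{34} + \frac{(20-24)^2}{24} + \frac{(140-136)^2}{136} = 3.922$
    with $df = (r-1) \times (c-1) = (2-1) \times (2-1) = 1$.
  4. Find the P-value:
    P-value $= P(\chi^2 \ge \chi^2_o) = P(\chi^2 \ge 3.922)$, so 0.025 < P-value < 0.05, since $3.841\ (\chi^2_{0.05}) < \chi^2_o = 3.922 < 5.024\ (\chi^2_{0.025})$.
  5. Decision: Reject the null H0 since P-value < 0.05 < 0.1 (α).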
  6. Conclusion: At the 10% significance level, we have sufficient evidence of an association between the variables “Cancer Status” and “Smoking Status”.
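The arithmetic in this example can be checked with a few lines of Python. The sketch below relies on a shortcut that holds only when df = 1: a chi-square variable with one degree of freedom is the square of a standard normal variable, so the right-tail area equals `erfc(sqrt(x / 2))`. The variable names are illustrative.

```python
import math

# Observed and expected frequencies from the example table, cell by cell.
observed = [10, 30, 20, 140]
expected = [6, 34, 24, 136]

# Test statistic: sum over all cells of (O - E)^2 / E.
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Right-tail area under the chi-square curve; this identity is valid for df = 1 only.
p_value = math.erfc(math.sqrt(statistic / 2))

print(round(statistic, 3))   # 3.922
print(0.025 < p_value < 0.05)
```

The computed P-value falls inside the interval obtained from the chi-square table, which is consistent with rejecting H0 at the 10% level.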

 

Exercise: Chi-Square Independence Test

A random sample of 230 adults yields the following data regarding age and Internet usage. At the 1% significance level, do the data provide sufficient evidence of an association between age and Internet usage?

Table 11.10: Contingency Table of Internet Usage (row) and Age (column)

| | 18–24 | 25–64 | 65+ | Total |
|---|---|---|---|---|
| Never | 6 | 38 | 31 | 75 |
| Sometimes | 14 | 31 | 5 | 50 |
| Every day | 50 | 50 | 5 | 105 |
| Total | 70 | 119 | 41 | 230 |

Answers:

Check the assumptions.

Applying the formula $E = \frac{(r\text{th row total}) \times (c\text{th column total})}{n}$ for the cell in the rth row and cth column to each cell, the expected frequencies are given by:

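  • “Never” & “18-24”: $E = \frac{75 \times 70}{230} = 22.826$.
  • “Never” & “25-64”: $E = \frac{75 \times 119}{230} = 38.804$.
  • “Never” & “65+”: $E = \frac{75 \times 41}{230} = 13.370$.
  • “Sometimes” & “18-24”: $E = \frac{50 \times 70}{230} = 15.217$.
  • “Sometimes” & “25-64”: $E = \frac{50 \times 119}{230} = 25.870$.
  • “Sometimes” & “65+”: $E = \frac{50 \times 41}{230} = 8.913$.
  • “Every day” & “18-24”: $E = \frac{105 \times 70}{230} = 31.957$.
  • “Every day” & “25-64”: $E = \frac{105 \times 119}{230} = 54.326$.
  • “Every day” & “65+”: $E = \frac{105 \times 41}{230} = 18.717$.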

The expected frequencies are given in brackets in Table 11.11; they are all greater than 5. We are told this is a random sample. Therefore, the assumptions for the chi-square independence test are satisfied.

Table 11.11: Observed and Expected Frequency of Internet Usage (row) and Age (column)

| | 18–24 | 25–64 | 65+ | Total |
|---|---|---|---|---|
| Never | 6 (22.826) | 38 (38.804) | 31 (13.370) | 75 |
| Sometimes | 14 (15.217) | 31 (25.870) | 5 (8.913) | 50 |
| Every day | 50 (31.957) | 50 (54.326) | 5 (18.717) | 105 |
| Total | 70 | 119 | 41 | 230 |

Steps:

  1. Set up the hypotheses:
    H0: The variables "Age" and "Internet usage" are independent
    Ha: The variables "Age" and "Internet usage" are associated.
  2. The significance level is α=0.01.
  3. Compute the value of the test statistic:

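    $\chi^2_o = \sum_{\text{all cells}} \frac{(O-E)^2}{E} = \frac{(6-22.826)^2}{22.826} + \frac{(38-38.804)^2}{38.804} + \frac{(31-13.370)^2}{13.370} + \frac{(14-15.217)^2}{15.217} + \frac{(31-25.870)^2}{25.870} + \frac{(5-8.913)^2}{8.913} + \frac{(50-31.957)^2}{31.957} + \frac{(50-54.326)^2}{54.326} + \frac{(5-18.717)^2}{18.717} = 59.084$

    with

    $df = (r-1) \times (c-1) = (3-1) \times (3-1) = 4$.

  4. Find the P-value: P-value $= P(\chi^2 \ge \chi^2_o) = P(\chi^2 \ge 59.084) < 0.005$.
  5. Decision: Reject the null H0 since P-value < 0.005 < 0.01 (α).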
  6. Conclusion: At the 1% significance level, we have sufficient evidence that there is an association between age and Internet usage.
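The exercise can be verified with a short script. This sketch keeps the expected frequencies at full precision, so the statistic may differ from 59.084 in the third decimal place, because the hand calculation uses expected frequencies rounded to three decimals. The tail-area formula used here, exp(-x/2)(1 + x/2), is the closed form for df = 4 only; the variable names are illustrative.

```python
import math

# Observed counts from Table 11.10: rows = Never, Sometimes, Every day;
# columns = 18-24, 25-64, 65+.
observed = [[6, 38, 31],
            [14, 31, 5],
            [50, 50, 5]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequencies under independence, kept at full precision (not rounded).
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Test statistic: sum over all cells of (O - E)^2 / E.
statistic = sum((o - e) ** 2 / e
                for obs_row, exp_row in zip(observed, expected)
                for o, e in zip(obs_row, exp_row))
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Closed-form right-tail area of the chi-square curve, valid for df = 4 only.
p_value = math.exp(-statistic / 2) * (1 + statistic / 2)

print(round(statistic, 2), df, p_value < 0.01)
```

The P-value is far below 0.005, in agreement with the decision to reject H0 at the 1% level.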