10.6 Inferences for Two Population Proportions

Wanhua Su

Previous studies suggest that more women than men have arthritis. The Centers for Disease Control and Prevention reported a survey of randomly selected Americans aged 65 and older, in which 411 of 1,012 men and 535 of 1,062 women had arthritis. Is there any evidence that women are more likely to suffer from arthritis than men? Let [latex]p_1[/latex] be the proportion of men who have arthritis and [latex]p_2[/latex] be the proportion of women who have arthritis. We want to test [latex]H_0: p_1 \geq p_2[/latex] versus [latex]H_a: p_1 < p_2[/latex] or, equivalently, [latex]H_0: p_1-p_2\geq 0[/latex] versus [latex]H_a: p_1 - p_2<0[/latex].

Inference on the population mean [latex]\mu[/latex] is based on the distribution of the sample mean [latex]\bar X[/latex]; inference on the difference of two population means [latex]\mu_1-\mu_2[/latex] is based on the distribution of the difference between the sample means [latex]\bar X_1-\bar X_2[/latex]; and inference on the population proportion [latex]p[/latex] is based on the distribution of the sample proportion [latex]\hat p[/latex]. Similarly, inference on the difference of two population proportions [latex]p_1-p_2[/latex] is based on the distribution of the difference between the sample proportions [latex]\hat p_1-\hat p_2[/latex].

10.6.1 Sampling Distribution of Difference Between Two Sample Proportions [latex]\hat{p}_1 - \hat{p}_2[/latex]

Key Facts: Sampling Distribution of Difference Between Two Sample Proportions

For independent samples of size [latex]n_1[/latex] and [latex]n_2[/latex] from the two populations:

The mean of [latex]\hat{p}_1 - \hat{p}_2[/latex] equals the difference of the population proportions, i.e., [latex]\mu_{\scriptsize \hat{p}_1 - \hat{p}_2} = \mu_{\scriptsize \hat{p}_1} - \mu_{\scriptsize \hat{p}_2} = p_1 - p_2[/latex].
The standard deviation of [latex]\hat{p}_1 - \hat{p}_2[/latex]: [latex]\sigma_{\scriptsize \hat{p}_1 - \hat{p}_2} = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}[/latex].
These two conclusions are always true regardless of the sample sizes [latex]n_1[/latex] and [latex]n_2[/latex].
The shape of the distribution of [latex]\hat{p}_1 - \hat{p}_2[/latex]: by the central limit theorem, when the sample sizes [latex]n_1[/latex] and [latex]n_2[/latex] are large enough, [latex]\hat{p}_1 - \hat{p}_2[/latex] is approximately normally distributed. The rule of thumb is [latex]n_1 p_1 \geq 5 , n_1 (1 - p_1) \geq 5[/latex] and [latex]n_2 p_2 \geq 5 , n_2 (1 - p_2) \geq 5[/latex].

To summarize, when [latex]n_1 p_1 \geq 5 , n_1 (1 - p_1) \geq 5[/latex] and [latex]n_2 p_2 \geq 5 , n_2 (1 - p_2) \geq 5[/latex],

[latex]\hat{p}_1 - \hat{p}_2 \sim N \left( p_1 - p_2 , \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \right).[/latex]

The standardized version is

[latex]Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{ \frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 -p_2)}{n_2}}} \sim N(0, 1).[/latex]
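The key facts above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original text: the function name `diff_prop_sampling_dist` and the population values passed to it are hypothetical and chosen only to show the arithmetic.

```python
import math

def diff_prop_sampling_dist(p1, n1, p2, n2):
    """Mean, standard deviation, and rule-of-thumb check for p1_hat - p2_hat."""
    mean = p1 - p2
    sd = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    # Rule of thumb from the text: at least 5 expected successes
    # and 5 expected failures in each sample.
    normal_ok = min(n1 * p1, n1 * (1 - p1), n2 * p2, n2 * (1 - p2)) >= 5
    return mean, sd, normal_ok

# Hypothetical population proportions and sample sizes (illustration only).
mean, sd, normal_ok = diff_prop_sampling_dist(p1=0.40, n1=100, p2=0.50, n2=120)
print(f"mean = {mean:.3f}, sd = {sd:.4f}, normal approximation reasonable: {normal_ok}")
```

In practice [latex]p_1[/latex] and [latex]p_2[/latex] are unknown, so this calculation describes the theoretical sampling distribution rather than something computed from data.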

10.6.3 Two-Proportion z Interval for the Difference Between Two Proportions [latex]p_1 - p_2[/latex]

A point estimate for the difference between two population proportions [latex](p_1 - p_2)[/latex] is the difference between the sample proportions [latex](\hat{p}_1 - \hat{p}_2)[/latex].

Assumptions:

Both samples are simple random samples from their respective populations.
The two samples are independent.
Large samples: the numbers of successes and failures, [latex]x_1, n_1 -x_1, x_2[/latex], and [latex]n_2 - x_2[/latex], are all at least 5.

Note: As was the case with one-proportion inferences, [latex]p_1[/latex] and [latex]p_2[/latex] are generally unknown and estimated with [latex]\hat{p}_1 = \frac{x_1}{n_1}[/latex] and [latex]\hat{p}_2 = \frac{x_2}{n_2}[/latex]. Thus, since [latex]n_i\hat{p}_i = n_i \frac{x_i}{n_i} = x_i[/latex] and [latex]n_i(1 - \hat{p}_i) = n_i \left( 1 - \frac{x_i}{n_i} \right) = n_i \left( \frac{n_i - x_i}{n_i} \right) = n_i - x_i[/latex], the sample is deemed sufficiently large if [latex]n_i \hat{p}_i = x_i \geq 5[/latex] and [latex]n_i (1 - \hat{p}_i) = n_i - x_i \geq 5[/latex] for [latex]i = 1, 2[/latex].

A [latex](1 - \alpha) \times 100\%[/latex] confidence interval for the difference between the population proportions [latex](p_1 - p_2)[/latex] is

[latex](\hat{p}_1 - \hat{p}_2) \pm z_{\alpha / 2 } \sqrt{ \frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}[/latex]

where [latex]z_{\alpha / 2}[/latex] is the z score such that the area under the standard normal curve to its right is [latex]\frac{\alpha}{2}[/latex]. This is a two-tailed interval.

A [latex](1 - \alpha) \times 100\%[/latex] upper-tailed confidence interval is

[latex]\left( (\hat{p}_1 - \hat{p}_2) - z_{\alpha } \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}, 1 \right),[/latex]

and a [latex](1 - \alpha) \times 100\%[/latex] lower-tailed confidence interval is

[latex]\left( -1 , (\hat{p}_1 - \hat{p}_2) + z_{\alpha } \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \right).[/latex]

Note that the largest possible value of [latex]p_1-p_2[/latex] is 1 when [latex]p_1=1, p_2=0[/latex], and the smallest possible value of [latex]p_1-p_2[/latex] is -1 when [latex]p_1=0, p_2=1.[/latex]
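The three interval formulas above can be sketched as one Python function. This is an illustrative sketch under stated assumptions: the function name `two_prop_z_interval`, its `kind` argument, and the sample counts in the usage lines are hypothetical, and the critical value comes from `statistics.NormalDist` in the Python standard library rather than from a z table.

```python
import math
from statistics import NormalDist

def two_prop_z_interval(x1, n1, x2, n2, conf_level=0.95, kind="two-sided"):
    """Two-proportion z interval for p1 - p2 (unpooled standard error)."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    diff = p1_hat - p2_hat
    se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    alpha = 1 - conf_level
    if kind == "two-sided":
        z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
        return diff - z * se, diff + z * se
    z = NormalDist().inv_cdf(1 - alpha)          # z_{alpha}
    if kind == "upper":
        return diff - z * se, 1.0                # p1 - p2 cannot exceed 1
    if kind == "lower":
        return -1.0, diff + z * se               # p1 - p2 cannot be below -1
    raise ValueError("kind must be 'two-sided', 'upper', or 'lower'")

# Hypothetical counts (illustration only): 45 of 100 versus 30 of 100.
two_sided = two_prop_z_interval(45, 100, 30, 100)
upper = two_prop_z_interval(45, 100, 30, 100, kind="upper")
lower = two_prop_z_interval(45, 100, 30, 100, kind="lower")
print(two_sided, upper, lower)
```

Because the function uses the exact normal quantile, its endpoints can differ in the third decimal place from hand calculations that use rounded table values such as 1.96 or 2.33.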

10.6.2 Two-Proportion z Test for the Difference Between Two Proportions [latex]p_1 - p_2[/latex]

Recall that a population proportion can be viewed as the mean of the indicator random variable [latex]X = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1-p \end{cases}[/latex], which has mean [latex]\mu = p[/latex] and standard deviation [latex]\sigma = \sqrt{p(1-p)}[/latex]. Note that the standard deviation is a function of [latex]p[/latex]. For a two-tailed test, the null hypothesis is that the two population proportions are equal, that is, [latex]H_0: p_1 = p_2[/latex]; consequently, if the null hypothesis is true, the two populations also have the same standard deviation. Therefore, as in a pooled two-sample t-test, we can pool the two samples to obtain a better estimate of the common standard deviation. If [latex]H_0: p_1 = p_2[/latex] is true, let [latex]p_1 = p_2 = p_p[/latex], where [latex]p_p[/latex] is the common proportion. Then, the test statistic becomes

[latex]Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{ \frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 -p_2)}{n_2}}} = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{ \frac{p_{\scriptsize p}(1 - p_{\scriptsize p})}{n_1} + \frac{p_{\scriptsize p}(1 -p_{\scriptsize p})}{n_2}}} = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{p_{\scriptsize p} (1-p_{\scriptsize p})} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}.[/latex]

The common proportion [latex]p_{\scriptsize p}[/latex] is estimated by

[latex]\hat{p}_{\scriptsize p} = \frac{x_1 + x_2}{n_1 + n_2}[/latex].

Assumptions:

Both samples are simple random samples from their respective populations.
The two samples are independent.
Large samples: the numbers of successes and failures, [latex]x_1, n_1 - x_1, x_2[/latex], and [latex]n_2 - x_2[/latex], are all at least 5.

Steps to perform a two-proportion z test:

Set up the hypotheses:

Two-tailed	Right-tailed	Left-tailed
[latex]H_0: p_1 = p_2[/latex]	[latex]H_0: p_1 \leq p_2[/latex]	[latex]H_0: p_1 \geq p_2[/latex]
[latex]H_a: p_1 \neq p_2[/latex]	[latex]H_a: p_1 \: \gt \: p_2[/latex]	[latex]H_a: p_1 < p_2[/latex]

State the significance level [latex]\alpha[/latex].
Compute the value of the test statistic:
[latex]z_o = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}_{\scriptsize p} (1 - \hat{p}_{\scriptsize p})} \sqrt{ \frac{1}{n_1} + \frac{1}{n_2}}}[/latex] with [latex]\hat{p}_{\scriptsize p} = \frac{x_1 + x_2}{n_1 + n_2} , \hat{p}_1 = \frac{x_1}{n_1}, \hat{p}_2 = \frac{x_2}{n_2}[/latex].

Find the P-value or rejection region.

	Two-tailed	Right-tailed	Left-tailed
Null	[latex]H_0: p_1 = p_2[/latex]	[latex]H_0: p_1 \leq p_2[/latex]	[latex]H_0: p_1 \geq p_2[/latex]
Alternative	[latex]H_a: p_1 \neq p_2[/latex]	[latex]H_a: p_1 \: \gt \: p_2[/latex]	[latex]H_a: p_1 < p_2[/latex]
P-value	[latex]2P(Z \geq |z_o|)[/latex]	[latex]P(Z \geq z_o)[/latex]	[latex]P(Z \leq z_o)[/latex]
Rejection region	[latex]Z \geq z_{\alpha / 2}[/latex] or [latex]Z \leq - z_{\alpha / 2}[/latex]	[latex]Z \geq z_{\alpha }[/latex]	[latex]Z \leq - z_{\alpha }[/latex]

Reject the null [latex]H_0[/latex] if the P-value [latex]\leq \alpha[/latex] or [latex]z_o[/latex] falls in the rejection region.
State the conclusion in the context of the problem.
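The steps above can be sketched as a short Python function. This is a minimal illustration and not a reference implementation: the function name `two_prop_z_test`, the `alternative` labels, and the counts in the usage lines are assumptions made for the example, and P-values are computed with `statistics.NormalDist` from the standard library instead of a z table.

```python
import math
from statistics import NormalDist

def two_prop_z_test(x1, n1, x2, n2, alternative="two-sided"):
    """Two-proportion z test using the pooled estimate of the common proportion."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z_obs = (p1_hat - p2_hat) / se
    std_normal = NormalDist()
    if alternative == "two-sided":      # H_a: p1 != p2
        p_value = 2 * (1 - std_normal.cdf(abs(z_obs)))
    elif alternative == "greater":      # H_a: p1 > p2 (right-tailed)
        p_value = 1 - std_normal.cdf(z_obs)
    elif alternative == "less":         # H_a: p1 < p2 (left-tailed)
        p_value = std_normal.cdf(z_obs)
    else:
        raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")
    return z_obs, p_value

# Hypothetical counts (illustration only): 45 of 100 versus 30 of 100.
z_obs, p_value = two_prop_z_test(45, 100, 30, 100, alternative="two-sided")
alpha = 0.05
print(f"z_o = {z_obs:.3f}, P-value = {p_value:.4f}, reject H_0: {p_value <= alpha}")
```

The decision rule in the last line is the P-value approach from the table; comparing `z_obs` with the critical value gives the same decision.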

Example: Two-Proportion z Test and z Interval

The Centers for Disease Control and Prevention reported a survey of randomly selected Americans aged 65 and older. They found 411 of 1,012 men and 535 of 1,062 women had arthritis.

a) Is there any evidence that women are more likely to suffer from arthritis than men? Test at the 1% significance level.
Let [latex]p_1[/latex] be the proportion of men who have arthritis and [latex]p_2[/latex] be the proportion of women who have arthritis.
Check the assumptions:
1. We have simple random samples.
2. The two samples are independent.
3. The numbers of successes and failures [latex]x_1 = 411, n_1 - x_1 = 601, x_2= 535[/latex], and [latex]n_2 - x_2 = 572[/latex] are all at least 5.
Steps:
1. Set up the hypotheses: [latex]H_0: p_1 \geq p_2[/latex] versus [latex]H_a: p_1 < p_2[/latex].
2. State the significance level [latex]\alpha = 0.01[/latex].
3. The test statistic:
  [latex]z_o = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}_p (1 - \hat{p}_p)} \sqrt{ \frac{1}{n_1} + \frac{1}{n_2}}} = \frac{0.406 - 0.504}{\sqrt{0.456 (1 - 0.456)} \sqrt{ \frac{1}{1012} + \frac{1}{1062}}} = -4.479[/latex]
  
  where
  
  [latex]\hat{p}_p = \frac{x_1 + x_2}{n_1 + n_2} = \frac{411 + 535}{1012 + 1062} = 0.456 ,[/latex] [latex]\hat{p}_1 = \frac{x_1}{n_1} = \frac{411}{1012} = 0.406 , \hat{p}_2 = \frac{x_2}{n_2} = \frac{535}{1062} = 0.504.[/latex]
4. Find the P-value. For a left-tailed test, the P-value is the area to the left of the observed test statistic [latex]z_o[/latex]:
  P-value = [latex]P(Z \leq z_o) = P(Z \leq - 4.479) \approx 0[/latex].
5. Decision: Since the P-value [latex]\approx 0 < 0.01 (\alpha)[/latex], we should reject the null [latex]H_0[/latex].
6. Conclusion: At the 1% significance level, we have sufficient evidence that women are more likely to suffer from arthritis than men.
b) Obtain a confidence interval for [latex]p_1 - p_2[/latex], corresponding to the test in part a).
For a left-tailed test at the 1% significance level, we should obtain a 99% lower-tailed interval. [latex]1 - \alpha = 0.99 \Longrightarrow \alpha = 0.01 \Longrightarrow z_{\alpha } = z_{0.01} = 2.33[/latex].
A 99% lower-tailed confidence interval for [latex]p_1 - p_2[/latex] is

[latex]\left( -1 , (\hat{p}_1 - \hat{p}_2) + z_{\alpha } \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \right)[/latex]

[latex]= \left( - 1 , (0.406 - 0.504) + 2.33 \sqrt{\frac{0.406 (1 - 0.406)}{1012} + \frac{0.504(1 - 0.504)}{1062}} \right) = ( - 1 , - 0.047)[/latex].

Interpretation: We are 99% confident that [latex](p_1 - p_2)[/latex] is below -0.047. That is, we are 99% confident that the proportion of women who have arthritis is at least 0.047 higher than the proportion of men.
c) Does the interval in part b) support the conclusion of the test in part a)?
Yes. In part a), we reject [latex]H_0[/latex] and claim [latex]H_a: p_1 < p_2[/latex] (suggesting men have a smaller proportion than women). In part b), the entire interval is below 0, so we are 99% confident that [latex]p_1 - p_2 < 0[/latex].
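As a check on the arithmetic in this example, the calculation can be reproduced in Python with unrounded sample proportions. This is a sketch rather than part of the original solution; the variable names are illustrative. Because the hand calculation rounds the sample proportions to three decimals, the code gives a test statistic of roughly -4.46 instead of -4.479, while the decision and the interval endpoint of about -0.047 are unchanged.

```python
import math
from statistics import NormalDist

# Counts from the survey: men are sample 1, women are sample 2.
x1, n1 = 411, 1012
x2, n2 = 535, 1062

p1_hat, p2_hat = x1 / n1, x2 / n2
p_pooled = (x1 + x2) / (n1 + n2)

# Part a): left-tailed test, pooled standard error.
se_pooled = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z_obs = (p1_hat - p2_hat) / se_pooled
p_value = NormalDist().cdf(z_obs)  # area to the left of z_obs

# Part b): 99% lower-tailed interval, unpooled standard error.
alpha = 0.01
z_alpha = NormalDist().inv_cdf(1 - alpha)
se_unpooled = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
upper_bound = (p1_hat - p2_hat) + z_alpha * se_unpooled

print(f"z_o = {z_obs:.3f}, P-value = {p_value:.2e}, reject H_0: {p_value <= alpha}")
print(f"99% lower-tailed interval: (-1, {upper_bound:.3f})")
```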

Exercises: Inference on Proportions

It is believed that there is an association between breast cancer and smoking. The following table summarizes the results of an observational study of 200 females classified by their disease and smoking status.

	Smoker	Non-smoker	Total
Breast Cancer	10	30	40
Cancer Free	20	140	160
Total	30	170	200

a) Obtain a 99% confidence interval for the proportion of females with breast cancer.
b) Obtain the minimum sample size n needed so that we are 95% confident that the error is at most 0.02 when [latex]\hat{p}[/latex] is used to estimate p. Use the conservative estimate [latex]\hat{p} = 0.5[/latex].
c) Test at the 5% significance level whether the proportion of females with breast cancer is higher among smokers than non-smokers.
d) Obtain a confidence interval corresponding to the test in part c).

Answers

a) Obtain a 99% confidence interval for the proportion of females with breast cancer.
The point estimate for the proportion of females with breast cancer is [latex]\hat p = \frac{x}{n} = \frac{40}{200} = 0.2[/latex].

[latex]1 - \alpha = 0.99 \Longrightarrow \alpha = 0.01 \Longrightarrow z_{\alpha / 2} = z_{0.005} = 2.575[/latex].

The 99% confidence interval for the proportion of females with breast cancer is

[latex]\hat{p} \pm z_{\alpha / 2} \sqrt{\frac{\hat{p} (1 - \hat{p})}{n}} = 0.2 \pm 2.575 \times \sqrt{\frac{0.2(1-0.2)}{200}} = (0.127, 0.273)[/latex].

Interpretation: We are 99% confident that the proportion of females with breast cancer is somewhere between 0.127 and 0.273.
b) Obtain the minimum sample size n needed so that we are 95% confident that the error is at most 0.02 when [latex]\hat{p}[/latex] is used to estimate p. Use the conservative estimate [latex]\hat{p} = 0.5[/latex].
[latex]1 - \alpha = 0.95 \Longrightarrow \alpha = 0.05 \Longrightarrow z_{\alpha / 2} = z_{0.025} = 1.96[/latex], so [latex]n = 0.25 \left( \frac{z_{\alpha /2 }}{E} \right)^2 = 0.25 \left( \frac{1.96}{0.02} \right)^2 = 2401[/latex].
c) Test at the 5% significance level whether the proportion of females with breast cancer is higher among smokers than non-smokers.
Let [latex]p_1[/latex] be the proportion of females with breast cancer among smokers and [latex]p_2[/latex] be the proportion of females with breast cancer among non-smokers.
Check the assumptions:
1. We have simple random samples.
2. The two samples are independent.
3. The numbers of successes and failures [latex]x_1 = 10, n_1 - x_1 = 20, x_2 = 30[/latex], and [latex]n_2 - x_2 = 140[/latex] are all at least 5.
Steps:
1. Set up the hypotheses: [latex]H_0: p_1 \leq p_2[/latex] versus [latex]H_a: p_1 \: \gt \: p_2[/latex].
2. The significance level [latex]\alpha = 0.05[/latex].
3. Compute the test statistic:
  [latex]z_o = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}_p (1 - \hat{p}_p)} \sqrt{\frac{1}{n_1} + \frac{1}{n_2} }} = \frac{0.333 - 0.176}{\sqrt{0.2 (1-0.2)} \sqrt{ \frac{1}{30} + \frac{1}{170}}} = 1.982[/latex],
  
  where
  
  [latex]\hat{p}_p = \frac{x_1 + x_2}{n_1 + n_2} = \frac{10 + 30}{30 + 170} = 0.2,[/latex] [latex]\hat{p}_1 = \frac{x_1}{n_1} = \frac{10}{30} = 0.333, \hat{p}_2 = \frac{x_2}{n_2} = \frac{30}{170} = 0.176[/latex].
4. Find the P-value. For a right-tailed test, the P-value is the area to the right of the observed test statistic [latex]z_o[/latex].
  P-value = [latex]P(Z \geq z_o) = P(Z \geq 1.982) = P( Z \leq -1.982) = 0.0239[/latex].
5. Decision: Since the P-value [latex]=0.0239 < 0.05(\alpha)[/latex], we should reject the null [latex]H_0[/latex].
6. Conclusion: At the 5% significance level, we have sufficient evidence that the proportion of females with breast cancer is higher among smokers than non-smokers.
d) Obtain a confidence interval corresponding to the test in part c).
For a right-tailed test at the 5% significance level, we should obtain a 95% upper-tailed confidence interval. [latex]1 - \alpha = 0.95 \Longrightarrow \alpha = 0.05 \Longrightarrow z_{\alpha } = z_{0.05} = 1.645[/latex]. The interval is

[latex]\left( (\hat{p}_1 - \hat{p}_2) - z_\alpha \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} , 1\right)[/latex]

[latex]= \left( (0.333 - 0.176) - 1.645 \sqrt{\frac{0.333 (1 - 0.333)}{30} + \frac{0.176(1 - 0.176)}{170}}, 1 \right)= ( 0.0075 , 1)[/latex].

Interpretation: We are 95% confident that the proportion of females with breast cancer is at least 0.0075 higher for smokers than non-smokers.
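The two-proportion calculations for the smoking exercise can be checked with a short Python sketch that reads the counts directly from the 2×2 table. The dictionary layout and variable names are illustrative assumptions. The code uses unrounded proportions and the exact normal quantile, so its results (a test statistic near 1.98 and a lower endpoint a little above 0.007) differ slightly in the last digit from the hand calculation.

```python
import math
from statistics import NormalDist

# The 2x2 table from the exercise, keyed by smoking status.
table = {
    "smoker": {"cancer": 10, "cancer_free": 20},
    "non_smoker": {"cancer": 30, "cancer_free": 140},
}

# Successes are breast cancer cases; sample sizes are the column totals.
x1, n1 = table["smoker"]["cancer"], sum(table["smoker"].values())
x2, n2 = table["non_smoker"]["cancer"], sum(table["non_smoker"].values())
p1_hat, p2_hat = x1 / n1, x2 / n2

# Right-tailed test with the pooled standard error.
p_pooled = (x1 + x2) / (n1 + n2)
se_pooled = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z_obs = (p1_hat - p2_hat) / se_pooled
p_value = 1 - NormalDist().cdf(z_obs)  # area to the right of z_obs

# 95% upper-tailed interval with the unpooled standard error.
alpha = 0.05
z_alpha = NormalDist().inv_cdf(1 - alpha)
se_unpooled = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
lower_bound = (p1_hat - p2_hat) - z_alpha * se_unpooled

print(f"z_o = {z_obs:.3f}, P-value = {p_value:.4f}, reject H_0: {p_value <= alpha}")
print(f"95% upper-tailed interval: ({lower_bound:.4f}, 1)")
```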

License

Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License