10.2 Distribution of the Sample Proportion

Inferences about the population mean [latex]\mu[/latex] are based on the distribution of the sample mean [latex]\bar{X}[/latex]. Similarly, inferences about the population proportion [latex]p[/latex] are based on the distribution of the sample proportion [latex]\hat{p}[/latex].

The population proportion is defined as

[latex]p = \frac{\text{# of individuals having a certain attribute}}{\text{# of individuals in the population}} = \frac{\text{# of successes}}{N}.[/latex]

The population proportion can be regarded as a special type of population mean if we let the variable of interest be an indicator variable as follows:

[latex]x_i = \begin{cases} 1 & \text{if the ith individual has the attribute (a success)}, \\ 0 & \text{if the ith individual does not have the attribute (a failure)}. \end{cases}[/latex]

Then, the population proportion can be rewritten as

[latex]p=\frac{\text{# of individuals having a certain attribute}}{\text{# of individuals in the population}} = \frac{\text{# of successes}}{N} = \frac{\sum x_i}{N}.[/latex]

The variable of interest [latex]X[/latex] has only two possible values: 1 if the individual has the attribute and 0 if not. If one individual is selected at random from the population, the probability that this individual has the attribute is [latex]p[/latex]. As a result, the probability distribution of [latex]X[/latex] is

Table 10.1: Probability Distribution of an Indicator Variable

[latex]\begin{array}{c|cc} x & 1 & 0 \\ \hline P(X=x) & p & 1-p \end{array}[/latex]

with a population mean and population standard deviation:

[latex]\mu = \sum x P(X=x) = 1 \times p + 0 \times (1-p) = p,[/latex]

[latex]\sigma = \sqrt{\sum x^2 P(X=x) - \mu^2} = \sqrt{1^2 \times p + 0^2 \times (1-p) - \mu^2} = \sqrt{p - p^2} = \sqrt{p(1-p)}.[/latex]
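The two formulas above can be checked numerically. The following is a minimal sketch, not part of the original text: the function name `indicator_mean_sd` and the value p = 0.3 are chosen only for illustration.

```python
import math

def indicator_mean_sd(p):
    """Mean and standard deviation of a 0/1 indicator variable with P(X = 1) = p."""
    # Probability distribution of the indicator variable: value -> probability.
    pmf = {1: p, 0: 1 - p}
    mu = sum(x * prob for x, prob in pmf.items())                # sum of x * P(X = x)
    second_moment = sum(x**2 * prob for x, prob in pmf.items())  # sum of x^2 * P(X = x)
    sigma = math.sqrt(second_moment - mu**2)
    return mu, sigma

p = 0.3  # hypothetical value, chosen only for illustration
mu, sigma = indicator_mean_sd(p)
print(mu, sigma)
# The general formula and the shortcut sqrt(p(1 - p)) should agree.
print(math.isclose(sigma, math.sqrt(p * (1 - p))))
```

Any other value of p strictly between 0 and 1 should give the same agreement between the two expressions.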

The sample proportion can be viewed as a special type of sample mean (in the same way that the population proportion can be viewed as a special type of population mean). That is, in a simple random sample of size n, the proportion of individuals with the specific attribute is the sample proportion:

[latex]\begin{align*} \hat{p} &= \frac{\text{# of individuals having a certain attribute in the sample}}{\text{sample size}}\\& = \frac{\text{# of successes in the sample}}{n} = \frac{\sum x_i}{n} = \bar{x} \end{align*}[/latex]

with [latex]x_i = 1[/latex] if the individual has the attribute and [latex]x_i=0[/latex] if not.
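To see that the sample proportion is just the sample mean of the 0/1 values, consider the short sketch below. The ten observations are made up for illustration and do not come from any real data set.

```python
from statistics import mean

# Hypothetical sample of n = 10 individuals: 1 = has the attribute, 0 = does not.
sample = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

n = len(sample)
successes = sum(sample)   # number of successes in the sample
p_hat = successes / n     # sample proportion
x_bar = mean(sample)      # ordinary sample mean of the 0/1 values

print(p_hat, x_bar)       # the two numbers coincide
```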

Recall from Chapter 6 the properties of the sampling distribution of the sample mean [latex]\bar{X}[/latex]:

  • Centre: the mean of the sample mean [latex]\bar{X}[/latex] equals the population mean [latex]\mu[/latex]. That is,

    [latex]\mu_{\scriptsize \bar{X}} = \mu.[/latex]

  • Spread: the standard deviation of the sample mean equals the population standard deviation divided by the square root of the sample size. That is,

    [latex]\sigma_{\scriptsize \bar{X}} = \frac{\sigma}{\sqrt{n}}.[/latex]

These two results hold for any population distribution and any sample size n.

  • Shape:
    • When the population distribution is normal, [latex]\bar{X}[/latex] is also normal regardless of n.
    • When the population distribution is non-normal but the sample size n is large, [latex]\bar{X}[/latex] is approximately normally distributed. This is guaranteed by the central limit theorem (CLT).

The same conclusions can be applied to the sampling distribution of the sample proportion [latex]\hat{p}[/latex], where the variable of interest is

[latex]X = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1-p \end{cases}[/latex]

with the population mean [latex]\mu = p[/latex] and standard deviation [latex]\sigma = \sqrt{p(1-p)}[/latex]. Therefore, the sampling distribution of the sample proportion [latex]\hat{p}[/latex] is summarized as follows.

Key Facts: Sampling Distribution of the Sample Proportion

  • Centre: the mean of the sample proportion [latex]\hat{p}[/latex] equals the population mean [latex]\mu[/latex] of the indicator variable, which is the population proportion [latex]p[/latex]. That is,

    [latex]\mu_{\scriptsize \hat{p}} = \mu = p[/latex].

  • Spread: the standard deviation of the sample proportion [latex]\hat{p}[/latex] equals the population standard deviation [latex]\sigma[/latex] divided by the square root of the sample size. That is,

    [latex]\sigma_{\scriptsize \hat{p}} = \frac{\sigma}{\sqrt{n}} = \frac{\sqrt{p(1-p)}}{\sqrt{n}} = \sqrt{ \frac{p(1-p)}{n}}[/latex].

These two results hold for any population proportion [latex]p[/latex] and any sample size n.

  • Shape: The population distribution is non-normal, since the indicator variable takes only the values 0 and 1. By the central limit theorem (CLT), however, [latex]\hat{p}[/latex] is approximately normal if n is large enough. A common rule of thumb requires both [latex]np \geq 5[/latex] and [latex]n(1-p) \geq 5[/latex], i.e., [latex]n \geq \max \left\{ \frac{5}{p}, \frac{5}{1-p} \right\}[/latex]. Some textbooks require both [latex]np \geq 10[/latex] and [latex]n(1-p) \geq 10[/latex].

Central limit theorem for the sample proportion:

If the sample size n is large enough ([latex]np \geq 5[/latex] and [latex]n(1-p) \geq 5[/latex]), the sampling distribution of the sample proportion [latex]\hat{p}[/latex] is approximately normally distributed.
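The centre and spread results can be checked with a small simulation. The sketch below is illustrative only: the function `simulate_p_hats`, the seed, and the values p = 0.3, n = 100, and 5000 repetitions are assumptions made for the example, and the draws are assumed to be independent.

```python
import math
import random
from statistics import mean, pstdev

def simulate_p_hats(p, n, reps, seed=0):
    """Draw `reps` samples of size n and return the sample proportion of each."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    p_hats = []
    for _ in range(reps):
        # Each individual is a success with probability p, independently.
        successes = sum(1 for _ in range(n) if rng.random() < p)
        p_hats.append(successes / n)
    return p_hats

p, n, reps = 0.3, 100, 5000  # hypothetical values for illustration
p_hats = simulate_p_hats(p, n, reps)

theoretical_sd = math.sqrt(p * (1 - p) / n)
print("mean of p-hat:", mean(p_hats), "vs p =", p)
print("sd of p-hat:  ", pstdev(p_hats), "vs", theoretical_sd)
print("rule of thumb met:", n * p >= 5 and n * (1 - p) >= 5)
```

The simulated mean and standard deviation should be close to, but not exactly equal to, the theoretical values, because only a finite number of samples is drawn.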

For example, suppose the population proportion is [latex]p=0.05[/latex]. Then the sampling distribution of the sample proportion [latex]\hat{p}[/latex] is approximately normally distributed if the sample size is at least

[latex]n = \max \left\{ \frac{5}{p}, \frac{5}{1-p} \right\} = \max \left\{ \frac{5}{0.05}, \frac{5}{1-0.05} \right\} = \max \{ 100, 5.26 \} = 100.[/latex]
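The same calculation can be written as a small helper. This is a sketch under stated assumptions: the name `min_sample_size` is hypothetical, and p is passed as a string and converted to an exact fraction so that floating-point rounding cannot shift the result by one.

```python
import math
from fractions import Fraction

def min_sample_size(p, threshold=5):
    """Smallest n with n*p >= threshold and n*(1 - p) >= threshold."""
    p = Fraction(p)  # e.g. "0.05" becomes exactly 1/20
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    return math.ceil(max(threshold / p, threshold / (1 - p)))

print(min_sample_size("0.05"))      # 100, matching the calculation above
print(min_sample_size("0.05", 10))  # 200 under the stricter rule (threshold 10)
```

Rounding up with `math.ceil` matters when the bound is not a whole number: for p = 0.3, the bound 5/0.3 is about 16.67, so the smallest admissible sample size is 17.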

The following figures show the sampling distribution of the sample proportion with [latex]p=0.05[/latex] and sample sizes n = 50, 100, 200, and 1000.

Figure 10.1: Histograms of Sample Proportions with Different Sample Sizes (four panels: n = 50, 100, 200, and 1000). [Image Description (See Appendix D Figure 10.1)]

There are several findings:

  • The sampling distribution of the sample proportion becomes increasingly normal as the sample size n increases. When n = 50, the sampling distribution of the sample proportion is clearly skewed. When n = 100, the distribution is still slightly right-skewed. For n = 200 and n = 1000, the sampling distribution appears bell-shaped and symmetric (indicative of a normal distribution).
  • The mean of the sample proportion (blue dashed line) is always identical to the population proportion p = 0.05 (red solid line) regardless of the sample size n.
  • The standard deviation of the sample proportion decreases as n increases.
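These findings can also be checked without simulation. The sketch below assumes independent draws, so that the number of successes in the sample follows a binomial distribution, and it uses the skewness coefficient as a numerical measure of asymmetry. Both choices, along with the function name `p_hat_summary`, are assumptions made for this example and are not the method used to produce the figure.

```python
import math

def p_hat_summary(n, p):
    """Exact mean, standard deviation, and skewness of the sample proportion."""
    # Probability of k successes in n independent draws, for k = 0, ..., n.
    pmf = [math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    values = [k / n for k in range(n + 1)]  # possible values of p-hat
    mean = sum(v * w for v, w in zip(values, pmf))
    var = sum((v - mean)**2 * w for v, w in zip(values, pmf))
    sd = math.sqrt(var)
    # Skewness is 0 for a symmetric distribution and positive for a right-skewed one.
    skew = sum(((v - mean) / sd)**3 * w for v, w in zip(values, pmf))
    return mean, sd, skew

p = 0.05
results = {n: p_hat_summary(n, p) for n in (50, 100, 200, 1000)}
for n, (mean, sd, skew) in results.items():
    print(f"n = {n:4d}: mean = {mean:.4f}, sd = {sd:.4f}, skewness = {skew:.3f}")
```

In the output, the mean should stay at 0.05 for every n, while the standard deviation and the skewness should both shrink as n grows, in line with the three findings.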

To summarize, for [latex]np \geq 5[/latex] and [latex]n(1-p) \geq 5[/latex], the sample proportion is approximately normally distributed: [latex]\hat{p} \sim N \left( p , \sqrt{\frac{p(1-p)}{n}} \right)[/latex]. The standardized version of [latex]\hat{p}[/latex] is [latex]Z = \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}} \sim N(0,1)[/latex], again approximately. As a result, inferences about the population proportion are based on the standard normal distribution.
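As an illustration of how the standardized version is used, the sketch below approximates the probability that the sample proportion is at most 0.07 when p = 0.05 and n = 200. The observed value 0.07 and the helper name `z_score` are hypothetical, chosen only for this example.

```python
import math
from statistics import NormalDist

def z_score(p_hat, p, n):
    """Standardize a sample proportion using its mean p and sd sqrt(p(1 - p)/n)."""
    return (p_hat - p) / math.sqrt(p * (1 - p) / n)

p, n = 0.05, 200   # np = 10 and n(1 - p) = 190, so the rule of thumb is met
p_hat_obs = 0.07   # hypothetical observed sample proportion

z = z_score(p_hat_obs, p, n)
prob = NormalDist().cdf(z)  # normal approximation to P(p-hat <= 0.07)
print(round(z, 3), round(prob, 3))
```

The probability is only an approximation, since the exact distribution of the sample proportion is discrete.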

License

Introduction to Applied Statistics Copyright © 2024 by Wanhua Su is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.