"

10.2 Distribution of the Sample Proportion

Inferences about the population mean μ are based on the distribution of the sample mean $\bar{X}$. Similarly, inferences about the population proportion p are based on the distribution of the sample proportion $\hat{p}$.

The population proportion is defined as

$$p = \frac{\text{number of individuals having a certain attribute}}{\text{number of individuals in the population}} = \frac{\text{number of successes}}{N}.$$

The population proportion can be regarded as a special type of population mean if we let the variable of interest be an indicator variable as follows:

$$x_i = \begin{cases} 1 & \text{if the $i$th individual has the attribute (a success)}, \\ 0 & \text{if the $i$th individual does not have the attribute (a failure)}. \end{cases}$$

Then, the population proportion can be rewritten as

$$p = \frac{\text{number of individuals having a certain attribute}}{\text{number of individuals in the population}} = \frac{\text{number of successes}}{N} = \frac{\sum x_i}{N}.$$

The variable of interest X has only two possible values: 1 if the individual has the attribute and 0 if not. If one individual is selected at random from the population, the probability that this individual has the attribute is p. As a result, the probability distribution of X is

Table 10.1: Probability Distribution of an Indicator Variable

x           1     0
P(X = x)    p     1 − p

with a population mean and population standard deviation:

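$$\mu = \sum x P(X = x) = 1 \times p + 0 \times (1 - p) = p,$$

$$\sigma = \sqrt{\sum x^2 P(X = x) - \mu^2} = \sqrt{1^2 \times p + 0^2 \times (1 - p) - \mu^2} = \sqrt{p - p^2} = \sqrt{p(1 - p)}.$$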
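As a quick numerical check of these two formulas, the following Python sketch computes the mean and standard deviation directly from the probability distribution in Table 10.1. The function name `indicator_mean_sd` and the value p = 0.3 are illustrative assumptions, not part of the text.

```python
import math

def indicator_mean_sd(p):
    """Mean and standard deviation of an indicator variable with P(X = 1) = p."""
    # Probability distribution from Table 10.1: value -> probability
    pmf = {1: p, 0: 1 - p}
    # mu = sum of x * P(X = x)
    mu = sum(x * prob for x, prob in pmf.items())
    # sigma = sqrt(sum of x^2 * P(X = x) - mu^2)
    variance = sum(x**2 * prob for x, prob in pmf.items()) - mu**2
    return mu, math.sqrt(variance)

p = 0.3  # illustrative value, assumed for this example
mu, sigma = indicator_mean_sd(p)
print(mu, sigma)
print(math.sqrt(p * (1 - p)))  # closed form; should agree with sigma
```

Up to floating-point rounding, the first printed value equals p and the other two agree with each other.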

The sample proportion can be viewed as a special type of sample mean (in the same way that the population proportion can be viewed as a special type of population mean). That is, in a simple random sample of size n, the proportion of individuals with the specific attribute is the sample proportion:

$$\hat{p} = \frac{\text{number of individuals having a certain attribute in the sample}}{\text{sample size}} = \frac{\text{number of successes in the sample}}{n} = \frac{\sum x_i}{n} = \bar{x}$$

with $x_i = 1$ if the individual has the attribute and $x_i = 0$ if not.
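The identity between the sample proportion and the sample mean of the indicator values can be seen in a small Python sketch. The ten data values below are made up purely for illustration.

```python
import statistics

# Hypothetical sample of n = 10 individuals:
# 1 = has the attribute (success), 0 = does not (failure).
sample = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]

n = len(sample)
successes = sum(sample)           # number of successes in the sample
p_hat = successes / n             # sample proportion
x_bar = statistics.mean(sample)   # sample mean of the 0/1 values

print(p_hat, x_bar)
```

Because every success contributes 1 and every failure contributes 0, summing the values counts the successes, so the two printed numbers coincide.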

Recall from Chapter 6 the sampling distribution of the sample mean $\bar{X}$:

  • Centre: the mean of the sample mean $\bar{X}$ equals the population mean μ. That is,

    $$\mu_{\bar{X}} = \mu.$$

  • Spread: the standard deviation of the sample mean equals the population standard deviation divided by the square root of the sample size. That is,

    $$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}.$$

These two facts hold for any population distribution and any sample size n.

  • Shape:
    • When the population distribution is normal, $\bar{X}$ is also normal regardless of n.
    • When the population distribution is non-normal but the sample size n is large, $\bar{X}$ is approximately normally distributed. This is guaranteed by the central limit theorem (CLT).

The same conclusions apply to the sampling distribution of the sample proportion $\hat{p}$, where the variable of interest is

$$X = \begin{cases} 1 & \text{with probability } p, \\ 0 & \text{with probability } 1 - p, \end{cases}$$

with population mean $\mu = p$ and population standard deviation $\sigma = \sqrt{p(1 - p)}$. Therefore, the sampling distribution of the sample proportion $\hat{p}$ is summarized as follows.

Key Facts: Sampling Distribution of the Sample Proportion

  • Centre: the mean of the sample proportion $\hat{p}$ equals the population mean μ, which here is the population proportion p. That is,

    $$\mu_{\hat{p}} = \mu = p.$$

  • Spread: the standard deviation of the sample proportion $\hat{p}$ equals the population standard deviation σ divided by the square root of the sample size. That is,

    $$\sigma_{\hat{p}} = \frac{\sigma}{\sqrt{n}} = \frac{\sqrt{p(1 - p)}}{\sqrt{n}} = \sqrt{\frac{p(1 - p)}{n}}.$$

These two facts hold for any population proportion p and any sample size n.

  • Shape: The population distribution is non-normal. By the central limit theorem (CLT), however, $\hat{p}$ is approximately normal if n is large enough. The rule of thumb is to require both $np \ge 5$ and $n(1 - p) \ge 5$, i.e., $n \ge \max\left\{\frac{5}{p}, \frac{5}{1 - p}\right\}$. Some textbooks require both $np \ge 10$ and $n(1 - p) \ge 10$.
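The three key facts can be collected into one small helper. The following is a minimal Python sketch; the function name `sample_proportion_distribution` and its `threshold` parameter are hypothetical, with the default of 5 taken from the rule of thumb stated above.

```python
import math

def sample_proportion_distribution(p, n, threshold=5):
    """Summarize the sampling distribution of the sample proportion."""
    mean = p                              # centre: the mean of p-hat is p
    sd = math.sqrt(p * (1 - p) / n)       # spread: sqrt(p(1 - p) / n)
    # shape: rule of thumb, np >= threshold and n(1 - p) >= threshold
    approx_normal = n * p >= threshold and n * (1 - p) >= threshold
    return mean, sd, approx_normal

# Illustrative values (assumed): p = 0.05 with two different sample sizes.
print(sample_proportion_distribution(0.05, 50))
print(sample_proportion_distribution(0.05, 200))
```

Passing `threshold=10` applies the stricter condition that some textbooks use.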

Central limit theorem for the sample proportion:

If the sample size n is large enough ($np \ge 5$ and $n(1 - p) \ge 5$), the sampling distribution of the sample proportion $\hat{p}$ is approximately normally distributed.

For example, suppose the population proportion is $p = 0.05$. Then the sampling distribution of the sample proportion $\hat{p}$ is approximately normally distributed if the sample size is at least

$$n = \max\left\{\frac{5}{p}, \frac{5}{1 - p}\right\} = \max\left\{\frac{5}{0.05}, \frac{5}{1 - 0.05}\right\} = \max\{100, 5.26\} = 100.$$
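The same calculation can be written as a short Python function. This is only a sketch: the name `min_sample_size` is hypothetical, and the result is rounded up because a sample size must be a whole number.

```python
import math

def min_sample_size(p, threshold=5):
    """Smallest whole n with n*p >= threshold and n*(1 - p) >= threshold."""
    # Both conditions must hold, so take the larger of the two bounds.
    # Note: floating-point rounding could shift the result by one for some p.
    return math.ceil(max(threshold / p, threshold / (1 - p)))

print(min_sample_size(0.05))                 # rule of thumb with 5
print(min_sample_size(0.05, threshold=10))   # stricter rule with 10
```

For p = 0.05 the binding condition is the one involving p itself, since successes are much rarer than failures.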

The following figures show the sampling distribution of the sample proportion with p=0.05 and sample sizes n = 50, 100, 200, and 1000.

[Figure: four histograms of the sample proportion, for sample sizes n = 50, n = 100, n = 200, and n = 1000.]
Figure 10.1: Histograms of Sample Proportions with Different Sample Sizes. [Image Description (See Appendix D Figure 10.1)]

There are several findings:

  • The sampling distribution of the sample proportion becomes increasingly normal as the sample size n increases. When n = 50, the sampling distribution of sample proportion is skewed. When n = 100, the distribution is still slightly right skewed. For n = 200 and n = 1000, the sampling distribution appears bell-shaped and symmetric (indicative of a normal distribution).
  • The mean of the sample proportion (blue dashed line) is always identical to the population proportion p = 0.05 (red solid line) regardless of the sample size n.
  • The standard deviation of the sample proportion decreases as n increases.
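These findings can be explored with a small simulation. The Python sketch below is not the code used to produce Figure 10.1; the number of replications (2000), the seed, and the function name `simulate_sample_proportions` are assumptions made for illustration. Because the samples are random, the simulated mean will be close to, rather than exactly equal to, 0.05.

```python
import random
import statistics

def simulate_sample_proportions(p, n, reps, rng):
    """Draw `reps` samples of size n and return the sample proportion of each."""
    props = []
    for _ in range(reps):
        # Each individual is a success with probability p.
        successes = sum(1 for _ in range(n) if rng.random() < p)
        props.append(successes / n)
    return props

rng = random.Random(0)  # fixed seed so the run is repeatable
p = 0.05
for n in (50, 100, 200, 1000):
    props = simulate_sample_proportions(p, n, reps=2000, rng=rng)
    mean = statistics.mean(props)
    sd = statistics.stdev(props)
    theory_sd = (p * (1 - p) / n) ** 0.5
    print(f"n={n:5d}  mean={mean:.4f}  sd={sd:.4f}  theory sd={theory_sd:.4f}")
```

Plotting a histogram of `props` for each n would give pictures comparable to the four panels of the figure.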

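To summarize, for $np \ge 5$ and $n(1 - p) \ge 5$, $\hat{p}$ is approximately normally distributed with mean p and standard deviation $\sqrt{p(1 - p)/n}$, written $\hat{p} \sim N\left(p, \sqrt{\frac{p(1 - p)}{n}}\right)$. The standardized version of $\hat{p}$ is $Z = \frac{\hat{p} - p}{\sqrt{p(1 - p)/n}}$, which approximately follows $N(0, 1)$. As a result, inferences about the population proportion are based on the standard normal distribution.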
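As an illustration of how the standardized version is used, the following Python sketch approximates a probability for the sample proportion with the standard normal distribution. The function name `prob_p_hat_at_most` and the numbers p = 0.05, n = 200, and 0.07 are assumed for the example; no continuity correction is applied.

```python
import math
from statistics import NormalDist

def prob_p_hat_at_most(value, p, n):
    """Approximate P(p-hat <= value) using the normal approximation."""
    sd = math.sqrt(p * (1 - p) / n)   # standard deviation of p-hat
    z = (value - p) / sd              # standardized value
    return NormalDist().cdf(z)        # standard normal CDF at z

# Illustrative numbers (assumed): np = 10 and n(1 - p) = 190, so the
# rule of thumb for the normal approximation is satisfied.
print(prob_p_hat_at_most(0.07, p=0.05, n=200))
```

The approximation should only be relied on when both conditions on n hold; for small n the exact binomial distribution is the safer choice.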