2.3 Spread (Variation) of a Distribution

Besides the centre, we need another descriptive measure to describe how spread out the data are. This is called the spread or variability of the distribution. The measures of variation covered here are the range, the interquartile range (IQR), and the standard deviation.

2.3.1 Range and Interquartile Range (IQR)

One intuitive measure of the spread is the range of the data, which is defined as the difference between the largest and the smallest observations,

[latex]\text{range} = \text{maximum} - \text{minimum} = \text{largest} - \text{smallest}.[/latex]

Like the mean, the range is sensitive to outliers, because it depends only on the two most extreme observations.

If the distribution is extremely skewed or outliers exist, we can instead describe the spread with the interquartile range

[latex]IQR = Q_3 - Q_1,[/latex]

which is the difference between the third quartile [latex]Q_3[/latex] and the first quartile [latex]Q_1[/latex]. The median is often paired with the IQR to describe the centre and the spread of a distribution, respectively.
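As a rough illustration, the two measures can be sketched in Python. The helper names (`data_range`, `quartiles`, `iqr`) are hypothetical, and the sketch assumes the quartile convention used in this section's worked example: [latex]Q_1[/latex] and [latex]Q_3[/latex] are the medians of the lower and upper halves of the sorted data, with the median included in both halves when [latex]n[/latex] is odd. Other textbooks and software may use a different convention. The value 70 is a made-up outlier added only to show its effect.

```python
from statistics import median


def data_range(values):
    """Range = largest observation - smallest observation."""
    return max(values) - min(values)


def quartiles(values):
    """Return (Q1, Q2, Q3) using medians of the lower and upper halves.

    Assumption: when n is odd, the median is included in both halves.
    """
    xs = sorted(values)
    n = len(xs)
    half = (n + 1) // 2          # size of each half
    lower = xs[:half]            # first half of the sorted data
    upper = xs[n - half:]        # second half of the sorted data
    return median(lower), median(xs), median(upper)


def iqr(values):
    """Interquartile range, IQR = Q3 - Q1."""
    q1, _, q3 = quartiles(values)
    return q3 - q1


data = [3, 5, 3, 7, 7]
with_outlier = data + [70]       # 70 is a hypothetical outlier

print("range:", data_range(data), "->", data_range(with_outlier))
print("IQR:  ", iqr(data), "->", iqr(with_outlier))
```

In this sketch the single extreme value changes the range a great deal, while the IQR stays the same, which is why the IQR is preferred when outliers exist.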

2.3.2 Standard Deviation

Like the mean, the standard deviation takes into account all the observations. It measures variation by indicating how far, on average, the observations are from the mean. For a data set with a large amount of variation, i.e., the observations are very different from one another, the standard deviation will be large. For a data set with a small amount of variation, the observations are, on average, close to the mean, so the standard deviation will be small.

Steps to calculate the sample standard deviation are:

  1. Calculate the sample mean of the data set, [latex]\bar{x}[/latex].
  2. For each observation [latex]x_i[/latex], find its deviation from the mean [latex]\bar{x}[/latex], denoted as [latex](x_i - \bar{x})[/latex]. The sum of the deviations always equals zero, i.e.,  [latex]\sum _{i=1} ^n (x_i - \bar{x}) = 0[/latex].
  3. In order to obtain quantities that do not sum to zero, square the deviations. The sum of squared deviations, [latex]\sum _{i=1} ^n (x_i - \bar{x})^2[/latex], gives a measure of the total variation of all the observations.
  4. Finally, divide the sum of squared deviations by [latex]n-1[/latex] and take the square root. The result is the sample standard deviation, denoted as [latex]s[/latex]: [latex]s = \sqrt{\frac{\sum _{i=1} ^n (x_i - \bar{x})^2}{n-1}}.[/latex]

The formula in step 4 is referred to as the defining formula of the sample standard deviation.
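The four steps above can be sketched in Python as follows. This is only a minimal illustration: the function name `sample_standard_deviation` is hypothetical, and the data are the five observations used in this section's worked example.

```python
import math


def sample_standard_deviation(values):
    """Sample standard deviation via the four steps of the defining formula."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations to divide by n - 1")

    # Step 1: sample mean.
    mean = sum(values) / n

    # Step 2: deviation of each observation from the mean.
    # These deviations sum to zero (up to rounding error).
    deviations = [x - mean for x in values]

    # Step 3: sum of squared deviations.
    sum_of_squares = sum(d ** 2 for d in deviations)

    # Step 4: divide by n - 1 and take the square root.
    return math.sqrt(sum_of_squares / (n - 1))


s = sample_standard_deviation([3, 5, 3, 7, 7])
print(s)
```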

The term

[latex]\begin{equation} s^2 = \frac{\sum _{i=1} ^n (x_i - \bar{x})^2}{n-1} \end{equation}[/latex]

is defined as the sample variance of the data. Roughly speaking, it gives the average squared distance from each observation [latex]x_i[/latex] to the sample mean [latex]\bar{x}[/latex]. The square root of the sample variance [latex]s^2[/latex] gives the sample standard deviation [latex]s[/latex], which can be interpreted, again roughly, as the average distance from each observation [latex]x_i[/latex] to the sample mean [latex]\bar{x}[/latex]. Just as the sample mean [latex]\bar{x}[/latex] is used to estimate the population mean [latex]\mu[/latex], the sample variance [latex]s^2[/latex] can be used to estimate the population variance [latex]\sigma^2[/latex], and the sample standard deviation [latex]s[/latex] can be used to estimate the population standard deviation [latex]\sigma[/latex].

It can be shown that

[latex]\sum _{i=1} ^n (x_i - \bar{x})^2 = \sum_{i=1}^n x_i^2 - \frac{(\sum_{i=1}^n x_i)^2}{n}.[/latex]
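One way to see why this identity holds is to expand the square and use [latex]\sum_{i=1}^n x_i = n\bar{x}[/latex], which follows from the definition of the sample mean. A short sketch of the algebra:

[latex]\begin{align*} \sum _{i=1} ^n (x_i - \bar{x})^2 & = \sum_{i=1}^n x_i^2 - 2\bar{x}\sum_{i=1}^n x_i + n\bar{x}^2 \\ & = \sum_{i=1}^n x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 \\ & = \sum_{i=1}^n x_i^2 - n\bar{x}^2 = \sum_{i=1}^n x_i^2 - \frac{(\sum_{i=1}^n x_i)^2}{n}. \end{align*}[/latex]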

Substituting this identity into the defining formula, the sample standard deviation becomes

[latex]s= \sqrt{\frac{(\sum x_i^2) - \frac{(\sum x_i)^2}{n}}{n-1}}.[/latex]

This is referred to as the computing formula of the sample standard deviation. The defining formula [latex]s= \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}[/latex] is helpful in understanding the meaning of the sample standard deviation, while the computing formula [latex]s= \sqrt{\frac{(\sum x_i^2) - \frac{(\sum x_i)^2}{n}}{n-1}}[/latex] is useful in calculating the sample standard deviation by hand, since it involves far fewer calculations.

The mean is often paired with the standard deviation to describe the centre and the spread of a distribution, respectively.

Example: Measures of Spread (Variation)

Find the range, IQR, and sample standard deviation for 3, 5, 3, 7, 7.

  1. For the range
    • Sort into 3, 3, 5, 7, 7. The minimum (smallest observation) is 3, and the maximum (largest observation) is 7.
    • [latex]\text{range} = \text{maximum} - \text{minimum} = 7-3=4[/latex]
  2. For the IQR
    • Sort into 3, 3, 5, 7, 7.
    • Since [latex]n=5[/latex] is odd, the median is the middle observation, [latex]Q_2=5[/latex].
    • Including the median in both halves, the first half is 3, 3, 5. The median of the first half is [latex]Q_1=3[/latex]. The second half is 5, 7, 7. The median of the second half is [latex]Q_3=7[/latex].
    • [latex]IQR=Q_3-Q_1=7-3=4[/latex]
  3. For the sample standard deviation, the following table shows the calculation using the computing formula.

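Table 2.1: Calculating the Sample Standard Deviation Using the Computing Formula

| [latex]x_i[/latex] | [latex]x_i^2[/latex] |
|---|---|
| 3 | [latex]3^2=9[/latex] |
| 5 | [latex]5^2=25[/latex] |
| 3 | [latex]3^2=9[/latex] |
| 7 | [latex]7^2=49[/latex] |
| 7 | [latex]7^2=49[/latex] |
| [latex]\sum x_i = 25[/latex] | [latex]\sum x_i^2 = 141[/latex] |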

[latex]\begin{align*} s & = \sqrt{\frac{(\sum x_i^2) - \frac{(\sum x_i)^2}{n}}{n-1}} \\ & = \sqrt{\frac{141 - \frac{25^2}{5}}{5-1}} = \sqrt{\frac{141-125}{4}} = \sqrt{\frac{16}{4}} = \sqrt{4} = 2. \end{align*}[/latex]

Interpretation: Roughly speaking, the average distance between the observations and the sample mean is 2.

If you would like to use the defining formula, it is helpful to construct the following table:

Table 2.2: Calculating the Sample Standard Deviation Using the Defining Formula

| [latex]x_i[/latex] | Deviation: [latex](x_i - \bar{x})[/latex] | [latex](x_i -\bar{x})^2[/latex] |
|---|---|---|
| [latex]3[/latex] | [latex]3-5=-2[/latex] | [latex](-2)^2=4[/latex] |
| [latex]5[/latex] | [latex]5-5=0[/latex] | [latex]0^2=0[/latex] |
| [latex]3[/latex] | [latex]3-5=-2[/latex] | [latex](-2)^2=4[/latex] |
| [latex]7[/latex] | [latex]7-5=2[/latex] | [latex]2^2=4[/latex] |
| [latex]7[/latex] | [latex]7-5=2[/latex] | [latex]2^2=4[/latex] |
| [latex]\sum x_i =25[/latex] | [latex]\sum(x_i -\bar{x})=0[/latex] | [latex]\sum(x_i - \bar{x})^2 =16[/latex] |

The sample standard deviation calculated by the defining formula is

[latex]s = \sqrt{\frac{\sum _{i=1} ^n (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{16}{5-1}} = 2[/latex]

which is the same as the value obtained by the computing formula.
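The hand calculation in this example can be cross-checked with a short Python sketch. It recomputes the column totals [latex]\sum x_i[/latex] and [latex]\sum x_i^2[/latex], applies the computing formula, and compares the result with `statistics.stdev` and `statistics.variance` from the Python standard library, which use the [latex]n-1[/latex] denominator. The variable names are illustrative only.

```python
import math
import statistics

data = [3, 5, 3, 7, 7]
n = len(data)

# Column totals, as in the computing-formula table.
sum_x = sum(data)                       # 25
sum_x_sq = sum(x * x for x in data)     # 141

# Computing formula: s = sqrt((sum x^2 - (sum x)^2 / n) / (n - 1)).
s_computing = math.sqrt((sum_x_sq - sum_x ** 2 / n) / (n - 1))

# Standard-library versions of the sample standard deviation and variance.
s_library = statistics.stdev(data)
var_library = statistics.variance(data)

print(s_computing, s_library, var_library)
```

The computing formula is convenient by hand, but in floating-point arithmetic it subtracts two potentially large, nearly equal numbers, so software generally favours approaches based on deviations from the mean.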

2.3.3 Summary: Choose Proper Measures

Here are some guidelines for choosing proper measures to describe the centre and spread (variation) of a distribution:

  • Use the median and the IQR for the centre and spread respectively when the distribution is skewed or outliers exist.
  • Use the mean and the standard deviation for the centre and spread respectively when the distribution is roughly symmetric and there are no outliers.
  • Although the mode may also be used as a measure of centre for numerical data, it is usually not as informative as the median or the mean.
  • For qualitative/categorical data, the mode is the only measure covered here that can describe the centre. None of the measures of spread covered in this chapter (i.e., range, IQR, standard deviation) can be applied to qualitative/categorical data.
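The first two guidelines can be illustrated with a small Python sketch that compares both pairs of measures before and after adding one extreme value. The helper `summarize` and the outlier value 70 are hypothetical. Note also an assumption about quartiles: `statistics.quantiles` with `method="inclusive"` interpolates between observations, so its quartiles can differ slightly from those found with the halves method used in this section's example.

```python
import statistics


def summarize(values):
    """Return the mean/SD pair and the median/IQR pair for a data set."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }


original = [3, 5, 3, 7, 7]
with_outlier = original + [70]    # 70 is a hypothetical outlier

before = summarize(original)
after = summarize(with_outlier)

# Print each measure without and with the outlier.
for name in before:
    print(f"{name:>6}: {before[name]:8.2f} -> {after[name]:8.2f}")
```

In this sketch the mean and standard deviation move substantially when the outlier is added, whereas the median and IQR change only a little.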

License

Introduction to Applied Statistics Copyright © 2024 by Wanhua Su is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.