"

2.3 Spread (Variation) of a Distribution

Besides the centre, we need another descriptive measure to describe how the data are spread out. This is called the spread, or variability, of the distribution. The measures of variation covered here are the range, the interquartile range (IQR), and the standard deviation.

2.3.1 Range and Interquartile Range (IQR)

One intuitive measure of the spread is the range of the data, which is defined as the difference between the largest and the smallest observations,

$$\text{range} = \text{maximum} - \text{minimum} = \text{largest} - \text{smallest}.$$

Like the mean, the range is sensitive to outliers, because it depends only on the two most extreme observations.
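A minimal Python sketch of the range follows. The helper name `data_range` and the outlier value 70 are illustrative assumptions, not part of the text; the five observations are the ones used in the worked example later in this section.

```python
def data_range(values):
    """Return the range: largest observation minus smallest observation."""
    return max(values) - min(values)


observations = [3, 5, 3, 7, 7]

# Range of the original data: 7 - 3
print(data_range(observations))          # 4

# Appending one hypothetical outlier (70) changes the range drastically,
# which illustrates the sensitivity to outliers: 70 - 3
print(data_range(observations + [70]))   # 67
```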

If the distribution is extremely skewed or outliers exist, we can describe the spread with the interquartile range

$$\text{IQR} = Q_3 - Q_1,$$

which is the difference between the third quartile $Q_3$ and the first quartile $Q_1$. The IQR is often paired with the median to describe the spread and the centre of a distribution, respectively.
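The sketch below computes the IQR in Python. It assumes the quartile convention used in the worked example later in this section: $Q_1$ and $Q_3$ are the medians of the lower and upper halves of the sorted data, and when $n$ is odd the overall median is included in both halves. Other textbooks and software packages use different conventions and may give slightly different quartiles. The function names `quartiles` and `iqr` are illustrative.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using medians of the lower and upper halves.

    Assumed convention: when n is odd, the middle observation is included
    in both halves.
    """
    data = sorted(values)
    n = len(data)
    lower = data[:(n + 1) // 2]   # first half (includes the middle if n is odd)
    upper = data[n // 2:]         # second half (includes the middle if n is odd)
    return median(lower), median(data), median(upper)


def iqr(values):
    """Return the interquartile range Q3 - Q1."""
    q1, _, q3 = quartiles(values)
    return q3 - q1


observations = [3, 5, 3, 7, 7]
print(quartiles(observations))   # (3, 5, 7)
print(iqr(observations))         # 4
```

With a hypothetical outlier appended, `iqr([3, 5, 3, 7, 7, 70])` is still 4 under this convention, which illustrates why the IQR is preferred when outliers exist.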

2.3.2 Standard Deviation

Like the mean, the standard deviation takes into account all the observations. It measures variation by indicating how far, on average, the observations are from the mean. For a data set with a large amount of variation, i.e., the observations are very different from one another, the standard deviation will be large. For a data set with a small amount of variation, the observations are, on average, close to the mean, so the standard deviation will be small.

Steps to calculate the sample standard deviation are:

  1. Calculate the sample mean of the data set, $\bar{x}$.
  2. For each observation $x_i$, find its deviation from the mean $\bar{x}$, denoted as $(x_i - \bar{x})$. The sum of the deviations always equals zero, i.e., $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$.
  3. In order to obtain quantities that do not sum to zero, take the square of the deviations. The sum of squared deviations, $\sum_{i=1}^{n}(x_i - \bar{x})^2$, gives a measure of the total variation of all the observations.
  4. Finally, the sample standard deviation, denoted as $s$, is calculated as $s = \sqrt{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$.

This is referred to as the defining formula of the sample standard deviation.
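The four steps above can be sketched in Python as follows. The function name `sample_sd_defining` is an illustrative choice, and the sketch assumes at least two observations so that $n - 1$ is not zero.

```python
from math import sqrt


def sample_sd_defining(values):
    """Sample standard deviation via the defining formula (assumes n >= 2)."""
    n = len(values)

    # Step 1: sample mean
    mean = sum(values) / n

    # Step 2: deviation of each observation from the mean (these sum to zero)
    deviations = [x - mean for x in values]

    # Step 3: squared deviations, whose sum measures the total variation
    squared_deviations = [d ** 2 for d in deviations]

    # Step 4: divide by n - 1 and take the square root
    return sqrt(sum(squared_deviations) / (n - 1))


print(sample_sd_defining([3, 5, 3, 7, 7]))   # 2.0
```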

The term

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} \tag{1}$$

is defined as the sample variance of the data. Roughly speaking, it gives the average squared distance from each observation $x_i$ to the sample mean $\bar{x}$. The square root of the sample variance $s^2$ gives the sample standard deviation $s$. Roughly speaking, the sample standard deviation $s$ can be interpreted as the average distance from each observation $x_i$ to the sample mean $\bar{x}$. Just as the sample mean $\bar{x}$ is used to estimate the population mean $\mu$, the sample variance $s^2$ can be used to estimate the population variance $\sigma^2$, and the sample standard deviation $s$ can be used to estimate the population standard deviation $\sigma$.

It can be shown that

$$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}.$$
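One way to see this is to expand the square and use the fact that $\sum_{i=1}^{n} x_i = n\bar{x}$, which follows from the definition of the sample mean:

$$
\begin{aligned}
\sum_{i=1}^{n}(x_i - \bar{x})^2
  &= \sum_{i=1}^{n}\left(x_i^2 - 2\bar{x}\,x_i + \bar{x}^2\right) \\
  &= \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}.
\end{aligned}
$$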

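Substituting this identity into the defining formula, the sample standard deviation becomes

$$s = \sqrt{\frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}{n-1}}.$$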

This is referred to as the computing formula of the sample standard deviation. The defining formula $s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n-1}}$ is helpful in understanding the meaning of the sample standard deviation, while the computing formula $s = \sqrt{\dfrac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}{n-1}}$ is useful in calculating the sample standard deviation by hand, since it involves far fewer calculations.
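A minimal Python sketch of the computing formula is given below. The function name `sample_sd_computing` is an illustrative choice, and the sketch assumes at least two observations. Only two running totals are needed, $\sum x_i$ and $\sum x_i^2$, which is what makes the formula convenient for hand calculation.

```python
from math import sqrt


def sample_sd_computing(values):
    """Sample standard deviation via the computing formula (assumes n >= 2)."""
    n = len(values)
    total = sum(values)                            # sum of x_i
    total_of_squares = sum(x * x for x in values)  # sum of x_i^2

    # Sum of squared deviations, written without computing any deviation
    sum_sq_dev = total_of_squares - total ** 2 / n
    return sqrt(sum_sq_dev / (n - 1))


print(sample_sd_computing([3, 5, 3, 7, 7]))   # 2.0
```

In floating-point arithmetic, subtracting two large, nearly equal totals can lose precision, so for data whose mean is very large relative to its spread, the defining formula is usually the safer choice in software.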

The standard deviation is often paired with the mean to describe the spread and the centre of a distribution respectively.

Example: Measures of Spread (Variation)

Find the range, IQR, and sample standard deviation for 3, 5, 3, 7, 7.

  1. For range
    • Sort into 3, 3, 5, 7, 7. The minimum (smallest observation) is 3, and the maximum (largest observation) is 7.
    • $\text{range} = \text{maximum} - \text{minimum} = 7 - 3 = 4$
  2. For IQR
    • Sort into 3, 3, 5, 7, 7.
    • $n = 5$ is odd, so the median is $Q_2 = 5$.
    • The first half is 3, 3, 5. The median of the first half is $Q_1 = 3$. The second half is 5, 7, 7. The median of the second half is $Q_3 = 7$.
    • $\text{IQR} = Q_3 - Q_1 = 7 - 3 = 4$
  3. For sample standard deviation, the following table shows the calculation of the sample standard deviation.

Table 2.1: Calculating the Sample Standard Deviation Using the Computing Formula

$x_i$              $x_i^2$
3                  $3^2 = 9$
5                  $5^2 = 25$
3                  $3^2 = 9$
7                  $7^2 = 49$
7                  $7^2 = 49$
$\sum x_i = 25$    $\sum x_i^2 = 141$

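$$s = \sqrt{\frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}{n-1}} = \sqrt{\frac{141 - \frac{25^2}{5}}{5-1}} = \sqrt{\frac{141 - 125}{4}} = \sqrt{\frac{16}{4}} = \sqrt{4} = 2.$$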

Interpretation: Roughly speaking, the average distance between the observations and the sample mean is 2.

If you would like to use the defining formula, it is helpful to construct the following table:

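Table 2.2: Calculating the Sample Standard Deviation Using the Defining Formula

$x_i$              Deviation: $(x_i - \bar{x})$     $(x_i - \bar{x})^2$
3                  $3 - 5 = -2$                     $(-2)^2 = 4$
5                  $5 - 5 = 0$                      $0^2 = 0$
3                  $3 - 5 = -2$                     $(-2)^2 = 4$
7                  $7 - 5 = 2$                      $2^2 = 4$
7                  $7 - 5 = 2$                      $2^2 = 4$
$\sum x_i = 25$    $\sum (x_i - \bar{x}) = 0$       $\sum (x_i - \bar{x})^2 = 16$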

The sample standard deviation calculated by the defining formula is

$$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{16}{5-1}} = 2,$$

which is the same as the value obtained by the computing formula.
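As a cross-check on the hand calculation, the sketch below uses Python's standard-library `statistics` module, whose `variance` and `stdev` functions compute the sample versions (dividing by $n - 1$). The variable name `observations` is an illustrative choice.

```python
from statistics import mean, stdev, variance

observations = [3, 5, 3, 7, 7]

sample_mean = mean(observations)          # 25 / 5 = 5
sample_variance = variance(observations)  # 16 / (5 - 1) = 4
sample_sd = stdev(observations)           # square root of the sample variance

print(sample_mean, sample_variance, sample_sd)
```

Both values agree with the tables: the sum of squared deviations is 16, so the sample variance is 4 and the sample standard deviation is 2.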

2.3.3 Summary: Choose Proper Measures

Here are some guidelines for choosing proper measures to describe the centre and spread (variation) of a distribution:

  • Use the median and the IQR for the centre and spread respectively when the distribution is skewed or outliers exist.
  • Use the mean and the standard deviation for the centre and spread respectively when the distribution is roughly symmetric and there are no outliers.
  • Although the mode may also be used as a measure of centre for numerical data, it is usually not as informative as the median or the mean.
  • For qualitative/categorical data, the mode is the only descriptive measure we can use to describe the centre. None of the measures of spread covered in this chapter (i.e., range, IQR, standard deviation) can be applied to qualitative/categorical data.
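To illustrate the first two guidelines, the sketch below compares both pairs of measures on the example data with and without one hypothetical outlier (the value 70 is an assumption made for illustration). The helper name `summarize` is also illustrative. Quartiles here come from `statistics.quantiles` with `method="inclusive"`, a software convention that can differ slightly from the hand method used in the example above.

```python
from statistics import mean, median, quantiles, stdev


def summarize(values):
    """Return the centre and spread measures discussed in this section."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2, Q3
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "mean": mean(values),
        "sd": stdev(values),
        "median": median(values),
        "iqr": q3 - q1,
    }


clean = [3, 5, 3, 7, 7]
with_outlier = clean + [70]   # hypothetical outlier

for label, data in [("clean", clean), ("with outlier", with_outlier)]:
    print(label, summarize(data))
```

The mean and standard deviation move substantially once the outlier is added, while the median and IQR change only slightly, which is why the latter pair is recommended for skewed data or data with outliers.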