2.3 Spread (Variation) of a Distribution
Besides the centre, we need another descriptive measure to describe how the data spread out. This is called the spread or variability of the distribution. The measures of variation covered here are the range, the interquartile range (IQR), and the standard deviation.
2.3.1 Range and Interquartile Range (IQR)
One intuitive measure of the spread is the range of the data, which is defined as the difference between the largest and the smallest observations,
$$\text{Range} = \text{Max} - \text{Min}.$$
Like the mean, the range is sensitive to outliers.
We can use the interquartile range (IQR) instead, which is the difference between the third quartile $Q_3$ and the first quartile $Q_1$,
$$\text{IQR} = Q_3 - Q_1.$$
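The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names (`data_range`, `quartiles_inclusive`, `iqr`) are hypothetical, the quartile rule assumes that $Q_1$ and $Q_3$ are the medians of the lower and upper halves with the overall median included in both halves when $n$ is odd, and the data set `with_outlier` is made up to show the effect of an extreme value.

```python
from statistics import median


def data_range(data):
    """Range = largest observation minus smallest observation."""
    return max(data) - min(data)


def quartiles_inclusive(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: when n is odd, the overall median belongs to both halves.
    Other textbooks and software packages use different quartile rules.
    """
    xs = sorted(data)
    n = len(xs)
    half = (n + 1) // 2  # size of each half; includes the median when n is odd
    lower, upper = xs[:half], xs[n - half:]
    return median(lower), median(upper)


def iqr(data):
    """IQR = Q3 - Q1."""
    q1, q3 = quartiles_inclusive(data)
    return q3 - q1


clean = [3, 3, 5, 7, 7]
with_outlier = [3, 3, 5, 7, 70]  # hypothetical data: one extreme value

print(data_range(clean), iqr(clean))                # 4 4
print(data_range(with_outlier), iqr(with_outlier))  # 67 4
```

In this sketch the single extreme value changes the range from 4 to 67 while the IQR stays at 4, which is why the IQR is usually preferred when outliers are present.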
2.3.2 Standard Deviation
Like the mean, the standard deviation takes all the observations into account. It measures variation by indicating how far, on average, the observations are from the mean. For a data set with a large amount of variation, i.e., the observations are very different from one another, the standard deviation will be large. For a data set with a small amount of variation, the observations are, on average, close to the mean, so the standard deviation will be small.
Steps to calculate the sample standard deviation are:
- Calculate the sample mean of the data set, $\bar{x} = \frac{\sum x_i}{n}$.
- For each observation $x_i$, find its deviation from the mean $\bar{x}$, denoted as $x_i - \bar{x}$. The sum of the deviations always equals zero, i.e., $\sum (x_i - \bar{x}) = 0$.
- In order to obtain quantities that do not sum to zero, take the square of the deviations. The sum of squared deviations, $\sum (x_i - \bar{x})^2$, gives a measure of total variation of all the observations.
- Finally, the sample standard deviation, denoted as $s$, is calculated as
$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}.$$
This is referred to as the defining formula of the sample standard deviation.
The term
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
is defined as the sample variance of the data. Roughly speaking, it gives the average squared distance from each observation to the sample mean.
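It can be shown that
$$\sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}.$$
And the sample standard deviation becomes
$$s = \sqrt{\frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}{n - 1}}.$$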
This is referred to as the computing formula of the sample standard deviation. The defining formula and the computing formula always give the same value.
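For readers who want to see why the sum of squared deviations can be rewritten in terms of $\sum x_i$ and $\sum x_i^2$, here is a short sketch of the algebra. It only uses the definition $\bar{x} = \frac{\sum x_i}{n}$, so that $\sum x_i = n\bar{x}$.

```latex
\begin{align*}
\sum (x_i - \bar{x})^2
  &= \sum \left( x_i^2 - 2\bar{x}x_i + \bar{x}^2 \right) \\
  &= \sum x_i^2 - 2\bar{x}\sum x_i + n\bar{x}^2 \\
  &= \sum x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 \\
  &= \sum x_i^2 - n\bar{x}^2 \\
  &= \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}.
\end{align*}
```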
The standard deviation is often paired with the mean to describe the spread and the centre of a distribution respectively.
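As a rough illustration of the two formulas for the sample standard deviation, the sketch below implements each one directly. The function names `sd_defining` and `sd_computing` and the data set `sample` are hypothetical and chosen only for this example.

```python
import math


def sd_defining(data):
    """Sample standard deviation from the sum of squared deviations."""
    n = len(data)
    mean = sum(data) / n
    sum_sq_dev = sum((x - mean) ** 2 for x in data)
    return math.sqrt(sum_sq_dev / (n - 1))


def sd_computing(data):
    """Sample standard deviation from the sums of x and x squared."""
    n = len(data)
    sum_x = sum(data)
    sum_x2 = sum(x * x for x in data)
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data set
mean = sum(sample) / len(sample)

# The deviations from the mean sum to zero.
print(sum(x - mean for x in sample))

# Both formulas agree up to floating-point rounding.
print(sd_defining(sample), sd_computing(sample))
```

The computing formula is convenient for hand calculation because it needs only two running totals. In floating-point arithmetic it can lose accuracy when the mean is very large relative to the spread, so the defining formula is often the safer choice in software.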
Example: Measures of Spread (Variation)
Find the range, IQR, and sample standard deviation for 3, 5, 3, 7, 7.
- For range
  - Sort into 3, 3, 5, 7, 7. The minimum (smallest observation) is 3, and the maximum (largest observation) is 7. The range is $7 - 3 = 4$.
- For IQR
  - Sort into 3, 3, 5, 7, 7. Since $n = 5$ is odd, the median is the middle observation, 5.
  - The first half is 3, 3, 5. The median of the first half is $Q_1 = 3$. The second half is 5, 7, 7. The median of the second half is $Q_3 = 7$. The IQR is $Q_3 - Q_1 = 7 - 3 = 4$.
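- For the sample standard deviation, the following table shows the calculation using the computing formula.
Table 2.1: Calculate the Sample Standard Deviation Using the Computing Formula
$x_i$ | $x_i^2$ |
3 | $3^2=9$ |
5 | $5^2=25$ |
3 | $3^2=9$ |
7 | $7^2=49$ |
7 | $7^2=49$ |
$\sum x_i = 25$ | $\sum x_i^2 = 141$ |
With $n = 5$, the computing formula gives
$$s = \sqrt{\frac{141 - \frac{25^2}{5}}{5 - 1}} = \sqrt{\frac{141 - 125}{4}} = \sqrt{4} = 2.$$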
Interpretation: Roughly speaking, the average distance between the observations and the sample mean is 2.
If you would like to use the defining formula, it is helpful to construct the following table:
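Table 2.2: Calculate the Sample Standard Deviation Using the Defining Formula
$x_i$ | Deviation: $x_i - \bar{x}$ | Squared deviation: $(x_i - \bar{x})^2$ |
3 | $3 - 5 = -2$ | $(-2)^2 = 4$ |
5 | $5 - 5 = 0$ | $0^2 = 0$ |
3 | $3 - 5 = -2$ | $(-2)^2 = 4$ |
7 | $7 - 5 = 2$ | $2^2 = 4$ |
7 | $7 - 5 = 2$ | $2^2 = 4$ |
$\sum x_i = 25$ | $\sum (x_i - \bar{x}) = 0$ | $\sum (x_i - \bar{x})^2 = 16$ |
Here the sample mean is $\bar{x} = 25/5 = 5$. The sample standard deviation calculated by the defining formula is
$$s = \sqrt{\frac{16}{5 - 1}} = \sqrt{4} = 2,$$
which is the same as the value obtained by the computing formula.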
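The hand calculation in this example can be cross-checked with a few lines of Python. This is only a sketch: the variable names are hypothetical, and `statistics.stdev` from the standard library is used as an independent check because it computes the sample standard deviation with the $n - 1$ divisor.

```python
import math
import statistics

data = [3, 5, 3, 7, 7]
n = len(data)

# Range: largest observation minus smallest observation.
data_range = max(data) - min(data)

# Computing formula: needs only the sum of x and the sum of x squared.
sum_x = sum(data)
sum_x2 = sum(x * x for x in data)
s_computing = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

# Defining formula: needs the deviations from the sample mean.
mean = sum_x / n
deviations = [x - mean for x in data]
s_defining = math.sqrt(sum(d ** 2 for d in deviations) / (n - 1))

print(data_range, sum_x, sum_x2)
print(s_computing, s_defining, statistics.stdev(data))
```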
2.3.3 Summary: Choose Proper Measures
Here are some guidelines for choosing proper measures to describe the centre and spread (variation) of a distribution:
- Use the median and the IQR for the centre and spread respectively when the distribution is skewed or outliers exist.
- Use the mean and the standard deviation for the centre and spread respectively when the distribution is roughly symmetric and there are no outliers.
- Although the mode may also be used as a measure of centre for numerical data, it is usually not as informative as the median or the mean.
- For qualitative/categorical data, the mode is the only measure of centre covered here that can be used. None of the measures of spread covered in this chapter (i.e., range, IQR, standard deviation) can be applied to qualitative/categorical data.
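The guidelines above can be illustrated with a small experiment using Python's standard `statistics` module. The data sets `symmetric`, `with_outlier`, and `colours` are made up for this sketch and do not come from the text.

```python
import statistics

symmetric = [3, 3, 5, 7, 7]
with_outlier = [3, 3, 5, 7, 70]  # hypothetical: one extreme observation

for label, values in (("symmetric", symmetric), ("with outlier", with_outlier)):
    print(
        label,
        "mean:", statistics.mean(values),
        "sd:", round(statistics.stdev(values), 2),
        "median:", statistics.median(values),
    )

# For categorical data only the mode is meaningful.
colours = ["red", "blue", "red", "green", "red"]  # hypothetical categories
print("mode:", statistics.mode(colours))
```

In this sketch the outlier pulls the mean and the standard deviation far from their original values while the median stays at 5, which matches the advice to report the median and IQR when outliers exist.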