2.4 Five-Number Summary and Boxplot

The five-number summary of a data set consists of the minimum (the smallest observation), the three quartiles [latex]Q_1, Q_2, Q_3[/latex], and the maximum (the largest observation).

These five numbers together give a brief picture of the distribution of the data: [latex]Q_2[/latex] (the median) is the centre of the distribution, while the range (the difference between the maximum and the minimum) and the IQR (the difference between [latex]Q_3[/latex] and [latex]Q_1[/latex]) describe the spread (variation) of the data. The differences between [latex]Q_1[/latex] and the minimum, between [latex]Q_2[/latex] and [latex]Q_1[/latex], between [latex]Q_3[/latex] and [latex]Q_2[/latex], and between the maximum and [latex]Q_3[/latex] give the ranges of the first, second, third and fourth 25% of the data, respectively. Moreover, the five-number summary helps us identify outliers, that is, observations that are far away from the bulk of the data.
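As a rough illustration, the five-number summary, the range and the IQR can be computed in a few lines of Python. This is only a sketch: the function name `five_number_summary` is made up for this example, and the quartile rule (split the sorted data at the median, keeping the median in both halves when the number of observations is odd) is an assumption chosen because it reproduces the quartiles quoted in this section. Other textbooks and software packages use slightly different rules.

```python
from statistics import median


def five_number_summary(data):
    """Return (min, Q1, Q2, Q3, max) for a non-empty list of numbers."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    # Assumed quartile rule: for odd n the median is kept in both halves.
    lower_half = xs[: half + n % 2]
    upper_half = xs[half:]
    return xs[0], median(lower_half), median(xs), median(upper_half), xs[-1]


# Data set used in the examples of this section.
minimum, q1, q2, q3, maximum = five_number_summary([3, 1, 9, 7, 5, 11, 21])
data_range = maximum - minimum  # spread of all the data
iqr = q3 - q1                   # spread of the middle 50% of the data
print(minimum, q1, q2, q3, maximum)  # 1 4.0 7 10.0 21
print(data_range, iqr)               # 20 6.0
```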

2.4.1 Identify Outliers

Outliers are observations far away from the majority of the data. Quantitatively, any observation that falls outside the interval (lower limit, upper limit) is considered an outlier. The lower and upper limits are defined as:

[latex]\text{lower limit} = Q_1 - 1.5 \times IQR; \quad \text{upper limit} = Q_3 + 1.5 \times IQR.[/latex]

Example: Identify Outliers

Identify the outliers, if any, in the data 3, 1, 9, 7, 5, 11, 21.

Steps:

  1. Find the quartiles. Refer to Example 4, part (a), [latex]Q_1 = 4, Q_2=7, Q_3=10[/latex].
  2. [latex]IQR = Q_3 - Q_1 = 10-4=6[/latex]
  3. [latex]\text{lower limit}=Q_1 -1.5 \times IQR=4-1.5 \times 6=-5[/latex]
  4. [latex]\text{upper limit}=Q_3+1.5 \times IQR=10+1.5 \times 6=19[/latex]

Since 21 > 19, the observation 21 lies outside the interval (-5, 19) and is therefore an outlier. All other observations lie inside the interval.
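The same check can be sketched in Python. The helper names `outlier_limits` and `find_outliers` are hypothetical, and the quartiles are passed in by hand (taken from the example above) rather than computed, so the code only illustrates the 1.5 × IQR rule itself.

```python
def outlier_limits(q1, q3, k=1.5):
    """Return (lower limit, upper limit) from the quartiles, using k * IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(data, q1, q3):
    """Return the observations that fall outside (lower limit, upper limit)."""
    lower, upper = outlier_limits(q1, q3)
    return [x for x in data if x < lower or x > upper]


data = [3, 1, 9, 7, 5, 11, 21]
limits = outlier_limits(q1=4, q3=10)        # quartiles from the example
outliers = find_outliers(data, q1=4, q3=10)
print(limits, outliers)  # (-5.0, 19.0) [21]
```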

Exercise: Choose Proper Measures

Based on the histogram and five-number summary of the data, answer the following questions.

Table 2.3: Five-Number Summary of the Data

Summary Min Q1 Median Q3 Max
Value 0.1 2 3.5 5 32
Histogram of the data, the same as the one in review question 1.7. The y-axis is the frequency and the x-axis is survival time in years.
Figure 2.2: Histogram of the Data [Image Description (See Appendix D Figure 2.2)]
  1. Comment on the distribution (shape, centre, spread).
  2. Are there any outliers in the data?
  3. Provide proper measures of the centre and spread of the data. Explain why.
Answer:
  1. Comment on the distribution (shape, centre, spread).
    The distribution is unimodal and skewed to the right, with median 3.5 and [latex]IQR = 5-2=3[/latex].
  2. Are there any outliers in the data?
    Yes. [latex]\text{Upper limit} = Q_3 + 1.5 \times IQR = 5 + 1.5 \times 3 = 9.5[/latex].
    Any observation greater than 9.5 is an outlier, and the maximum, 32, exceeds 9.5.
  3. Provide proper measures of the centre and spread of the data. Explain why.
    Use the median for the centre and the IQR for the spread, because both are resistant to the outliers and the strong skewness in these data.
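When only the five-number summary is available, we cannot list the outliers, but we can check whether the minimum or the maximum lies beyond the limits. A minimal sketch, assuming a made-up helper name `outliers_implied_by_summary` and the values of Table 2.3:

```python
def outliers_implied_by_summary(minimum, q1, q3, maximum):
    """Check whether the extremes of a five-number summary lie beyond the limits."""
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    return {
        "lower_limit": lower_limit,
        "upper_limit": upper_limit,
        "low_outlier": minimum < lower_limit,   # at least one outlier on the low end
        "high_outlier": maximum > upper_limit,  # at least one outlier on the high end
    }


check = outliers_implied_by_summary(minimum=0.1, q1=2, q3=5, maximum=32)
print(check)
```

For these values the upper limit is 9.5 and the maximum of 32 lies beyond it, while the minimum of 0.1 is well above the lower limit of -2.5, so any outliers are on the high end only.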

2.4.2 Boxplot

A boxplot, also called a box-and-whisker plot, is a useful tool to display the centre and spread of a data set by providing a graphical representation of the five-number summary as well as potential outliers. Steps to draw a boxplot:

  1. Calculate the five-number summary: minimum, [latex]Q_1, Q_2, Q_3[/latex], and maximum.
  2. Calculate the lower and upper limits: [latex]\text{lower limit}=Q_1 -1.5 \times IQR[/latex], and [latex]\text{upper limit} = Q_3 + 1.5 \times IQR.[/latex]
  3. Find the adjacent values: the smallest and the largest observations that lie within the lower and upper limits. Identify the potential outliers (observations beyond the lower and upper limits), if any exist.
  4. Draw short horizontal lines at [latex]Q_1, Q_2, Q_3[/latex], and connect them with vertical lines to form a box.
  5. Draw very short horizontal lines at the adjacent values and then draw the whiskers by connecting the adjacent values and the box with vertical lines.
  6. Plot each potential outlier with an asterisk.
  7. Put labels and the title.

  • A boxplot can be drawn vertically or horizontally.
  • Symbols such as circles or asterisks are often used to plot potential outliers.

Example: Draw a Boxplot

Construct a boxplot for the data 3, 1, 9, 7, 5, 11, 21.

Steps:

  1. Calculate the five-number summary:
    sort: 1, 3, 5, 7, 9, 11, 21
    [latex]min = 1, Q_1=4, Q_2=7, Q_3=10, max = 21[/latex]
  2. Calculate the lower and upper limits
    [latex]IQR = Q_3 - Q_1 = 10 - 4 =6[/latex]
    [latex]\text{lower limit} = Q_1 -1.5 \times IQR = 4 - 1.5 \times 6 = -5[/latex]
    [latex]\text{upper limit} = Q_3 + 1.5 \times IQR = 10 + 1.5 \times 6 = 19.[/latex]
  3. Adjacent values are 1 and 11. The maximum, 21, lies above the upper limit of 19, so it is a potential outlier.
  4. Form a box based on [latex]Q_1 = 4, Q_2 = 7, Q_3 = 10.[/latex]
  5. Mark the adjacent values 1 and 11, then “grow the whiskers”: draw the dashed lines connecting the box and the adjacent values.
  6. Plot the potential outlier 21 with an asterisk.
  7. Title and label the boxplot.
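Step 3 is the one most easily confused with simply taking the minimum and maximum, so here is a small Python sketch of it. The function name `adjacent_values` is hypothetical, and the quartiles are supplied from step 1 rather than computed.

```python
def adjacent_values(data, q1, q3):
    """Return the smallest and largest observations inside the 1.5 * IQR limits."""
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    # Observations beyond the limits are potential outliers and are excluded,
    # so the whiskers stop at the most extreme observations that remain.
    inside = [x for x in data if lower_limit <= x <= upper_limit]
    return min(inside), max(inside)


lower_adjacent, upper_adjacent = adjacent_values([3, 1, 9, 7, 5, 11, 21], q1=4, q3=10)
print(lower_adjacent, upper_adjacent)  # 1 11
```

Note that the upper whisker ends at 11, the largest observation within the limits, and not at the upper limit 19 itself.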

Example Boxplot

A boxplot with an upper whisker ending at 11, third quartile at 10, second quartile (median) at 7, first quartile at 4, and a lower whisker ending at 1.
Figure 2.3: Resulting Boxplot of the Example [Image Description (See Appendix D Figure 2.3)]

We can describe the distribution of the data in the following aspects based on a boxplot:

  • The centre: the median [latex]Q_2[/latex].
  • The spread (variation): the range and the IQR. Note, however, that the range is sensitive to outliers, whereas the IQR is not.
  • The shape of the distribution:
    • Left skewed if the distance between the lower adjacent value and [latex]Q_1[/latex] is larger than the distance between the upper adjacent value and [latex]Q_3[/latex], and the distance between [latex]Q_1[/latex] and the median is larger than the distance between [latex]Q_3[/latex] and the median.
    • Right skewed if the distance between the lower adjacent value and [latex]Q_1[/latex] is smaller than the distance between the upper adjacent value and [latex]Q_3[/latex], and the distance between [latex]Q_1[/latex] and the median is smaller than the distance between [latex]Q_3[/latex] and the median.
    • Symmetric if the distance between the lower adjacent value and [latex]Q_1[/latex] is approximately equal to the distance between the upper adjacent value and [latex]Q_3[/latex], and the distance between [latex]Q_1[/latex] and the median is approximately equal to the distance between [latex]Q_3[/latex] and the median.
    • Note that it is sometimes the case that the whiskers show skewness in one direction while the box shows skewness in the opposite direction. In such cases, it is not always possible to clearly determine skewness or symmetry.
  • The outliers: the observations plotted individually beyond the whiskers.
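The shape rules above can be written as a short Python function. This is a sketch under stated assumptions: the name `boxplot_shape` is made up, and "approximately equal" is taken to mean that two distances differ by at most 10% of the larger one (the `tol` parameter), which is an arbitrary choice and not part of the rules. Any mixed case, including whiskers and box pointing in opposite directions, is reported as "unclear".

```python
def boxplot_shape(lower_adj, q1, q2, q3, upper_adj, tol=0.1):
    """Classify the shape of a distribution from the positions in its boxplot."""

    def compare(lower_part, upper_part):
        # 0: roughly equal, 1: lower part longer, -1: upper part longer.
        if abs(lower_part - upper_part) <= tol * max(lower_part, upper_part):
            return 0
        return 1 if lower_part > upper_part else -1

    whiskers = compare(q1 - lower_adj, upper_adj - q3)
    box = compare(q2 - q1, q3 - q2)
    if whiskers == 0 and box == 0:
        return "symmetric"
    if whiskers > 0 and box > 0:
        return "left skewed"
    if whiskers < 0 and box < 0:
        return "right skewed"
    return "unclear"


# Made-up positions, chosen only to exercise each branch.
print(boxplot_shape(0, 1, 2, 5, 10))   # right skewed
print(boxplot_shape(0, 5, 8, 9, 10))   # left skewed
print(boxplot_shape(0, 2, 4, 6, 8))    # symmetric
# Boxplot of the data 3, 1, 9, 7, 5, 11, 21: longer lower whisker, balanced box.
print(boxplot_shape(1, 4, 7, 10, 11))  # unclear
```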

The following are three boxplots that show right skewed, symmetric, and left skewed distributions respectively.

Three boxplots representing three distributions. The first is lower than the second and the second is lower than the third.

Figure 2.4: Boxplots of Skewed and Symmetric Distributions. [Image Description (See Appendix D Figure 2.4)]

Similar to side-by-side histograms, we can use side-by-side boxplots to compare different groups.

Example: Side-by-Side Boxplots

Suppose we want to compare the grades of students who attend lectures with the grades of those who do not. Both Table 2.4 and the side-by-side boxplots in Figure 2.5 tell us that:

  • Attendees have a larger median score.
  • Non-attendees have a slightly larger variation. Both the IQR (height of the box) and the standard deviation of non-attendees are larger than those of attendees.
  • Grades of both groups are slightly left skewed with a longer tail on the lower end.

Table 2.4: Numerical Summaries of Grades of Non-Attendees and Attendees

Summary Min Q1 Median Q3 Max Mean SD
Non-attendees 35.62 52.70 64.76 77.78 87.30 63.23 15.48
Attendees 47.77 69.80 77.83 85.15 96.51 76.92 11.83

 

A pair of boxplots comparing the final grades of non-attendees to attendees. The plot of attendees is overall higher.

Figure 2.5: Side-by-Side Boxplots of Non-Attendees and Attendees. [Image Description (See Appendix D Figure 2.5)]
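The comparison of centre and spread can be reproduced from the numbers in Table 2.4. The sketch below copies the quartiles, medians and standard deviations from the table into a dictionary (the variable names are made up for this example) and computes each group's IQR.

```python
# Values copied from Table 2.4.
table_2_4 = {
    "Non-attendees": {"q1": 52.70, "median": 64.76, "q3": 77.78, "sd": 15.48},
    "Attendees": {"q1": 69.80, "median": 77.83, "q3": 85.15, "sd": 11.83},
}

for group, s in table_2_4.items():
    s["iqr"] = round(s["q3"] - s["q1"], 2)  # height of the box in the boxplot
    print(f"{group}: median = {s['median']}, IQR = {s['iqr']}, SD = {s['sd']}")

# Which group has the larger centre, and which the larger spread?
larger_median = max(table_2_4, key=lambda g: table_2_4[g]["median"])
larger_iqr = max(table_2_4, key=lambda g: table_2_4[g]["iqr"])
larger_sd = max(table_2_4, key=lambda g: table_2_4[g]["sd"])
print(larger_median, larger_iqr, larger_sd)  # Attendees Non-attendees Non-attendees
```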

 

Exercise: Draw a Boxplot

Draw a boxplot for the data: -5, 0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95.

Answer

Boxplot for Sample Data

A boxplot of the data above. Lower adjacent value = 0.05, Q1 = 0.2, Q2 = 0.45, Q3 = 0.7, and upper adjacent value = 0.95. There is a potential outlier at -5.
Figure 2.6: Boxplot for the Sampled Data. [Image Description (See Appendix D Figure 2.6)]

Steps:

  1. Calculate the five-number summary:
    sort: -5, 0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95
    [latex]min=-5, Q_1=0.2, Q_2=0.45, Q_3=0.7, max=0.95[/latex]
  2. Calculate the lower and upper limits
    [latex]IQR=Q_3-Q_1=0.7-0.2=0.5[/latex]
    [latex]\text{lower limit} =Q_1-1.5 \times IQR=0.2-1.5 \times 0.5=-0.55[/latex]
    [latex]\text{upper limit}=Q_3+1.5 \times IQR=0.7+1.5 \times 0.5=1.45[/latex]
  3. Adjacent values are 0.05 and 0.95. The minimum, -5, lies below the lower limit of -0.55, so it is a potential outlier.
  4. Form a box based on [latex]Q_1=0.2,Q_2=0.45,Q_3=0.7.[/latex]
  5. Mark the adjacent values 0.05 and 0.95, and then draw the whiskers, the dashed lines connecting the box and the two adjacent values.
  6. Plot the outlier -5.
  7. Title and label the boxplot.
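The numbers in this answer can be checked with Python's standard library. The sketch below uses `statistics.quantiles` with `method="inclusive"`; for this data set that method gives the same quartiles as the hand calculation, but this agreement is an observation about this example, not a guarantee that the library follows the textbook's rule for every sample size.

```python
from statistics import quantiles

data = [-5, 0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]

# Steps 1 and 2: quartiles, IQR and the two limits.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Step 3: adjacent values and potential outliers.
inside = [x for x in data if lower_limit <= x <= upper_limit]
lower_adjacent, upper_adjacent = min(inside), max(inside)
outliers = [x for x in data if x < lower_limit or x > upper_limit]

# Rounded for display, because floating-point arithmetic is not exact.
print(round(q1, 2), round(q2, 2), round(q3, 2))       # 0.2 0.45 0.7
print(round(lower_limit, 2), round(upper_limit, 2))   # -0.55 1.45
print(lower_adjacent, upper_adjacent, outliers)       # 0.05 0.95 [-5]
```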

License

Introduction to Applied Statistics Copyright © 2024 by Wanhua Su is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.