1.4 Organizing Data

Wanhua Su

1.4 Organizing Data

Next, we focus on presenting and summarizing data using different tables and figures.

Given a set of data, how can you present the data? It is essential to plot the data before conducting any data analysis. The definition of descriptive statistics tells us we can use tables and graphs to present the data. Different tables and graphs are used to describe the two different types of data—qualitative and quantitative. Let us start with qualitative variables, then continuous and then discrete variables.

1.4.1 Organizing Qualitative Data

Numerically, we can use frequency or relative frequency tables to summarize qualitative/categorical data. Graphically, we can use pie charts or bar charts.

The distribution of a qualitative variable is given in a frequency (relative frequency) table. For example, the students were asked, “How you came to school today?” Fifty-three students answered by car, 136 by public transportation, nine by bicycle, 74 by walking, and one by other means. The results are summarized in the following table.

Table 1.3: Frequency and Relative Frequency Table of “Transportation”

Transportation	Frequency	Relative Frequency	Percentage
Car	53	53/273 = 0.1941	19.41%
Public	136	136/273 = 0.4982	49.82%
Bicycle	9	9/273 = 0.0330	3.30%
Walking	74	74/273 = 0.2710	27.10%
Other	1	1/273 = 0.0037	0.36%
Total	273	1.000	100%

The first column gives all possible outcomes of the variable, which are called categories.
The second column gives the number of observations falling into each category; we call this number the frequency of that category. For example, the frequency of taking public transit is 136.
The third column gives the relative frequency, which is calculated as:
[latex]\text{relative frequency} = \frac{\text{frequency}}{\text{total}}.[/latex]

For example, the relative frequency of taking public transit is 0.4982, which means 49.82% of the students came to school by public transit. Note that the relative frequencies always add up to 1 across all the categories.
The fourth column gives the percentage, which is calculated as [latex]\text{percentage=relative frequency}\times 100[/latex].

Based on the (relative) frequency table, we can draw a bar chart or a pie chart to summarize the data.

A bar chart represents each category with a bar whose height equals each category’s relative frequency or frequency. The bars are plotted next to each other without touching each other. One bar for one category.
A pie chart is a disc divided into wedge-shaped pieces whose areas are proportional to the relative frequencies. One slice for one category, the angle of each slice = relative frequency x 360°.

A bar graph on the left panel showing relative frequency of how students came to school. A pie chart on the right panel showing percentages of how students came to school. Image description available. — **Figure 1.2**: Bar Chart (left panel) and Pie Chart (right panel) of “Transportation” [Image Description (See Appendix D Figure 1.2)]

If the bar chart and pie chart are generated based on counts, the charts won’t change except for the scale—the relative frequency in the bar chart and the percentage in the pie chart will be replaced by counts or frequency.

Suppose that another qualitative variable recorded in the study was “gender.”, we can also present the data characterized by two qualitative variables in what is referred to as a contingency table. Below is the contingency table with the two qualitative variables, gender and transport:

Table 1.4: Contingency Table of “Gender” and “Transportation”

	Car	Public	Bicycle	Walking	Others	Total
Female	25	80	5	38	0	148
Male	28	56	4	36	1	125
Total	53	136	9	74	1	273

The variable “Gender” is called the row variable (shown in red font in the table) and “Transportation“ is the column variable (shown in blue font in the table). The totals “148” and “125” (in red) are called row totals (sum across transportation for each category of “gender”). The totals “53,” “136,” “9,” “74,” “1” are called column totals (sum across gender for each category of “Transportation”), and “273” is the grand total. Those 10 numbers in bold are called cells.

One interesting question is whether the pattern of transportation among females is the same as that among males. We can compare the relative frequencies of all the categories for females with their counterparts among males. There are 148 female students and 125 male students in total; therefore, the relative frequencies of the five categories for females are 25/148, 80/148, 5/148, 38/148, 0/148 as compared to 28/125, 56/125, 4/125, 36/125, 1/125 for males.

The distributions of “transportation” for females and males can be compared graphically using a side-by-side pie chart and a side-by-side bar chart.

A pie chart for female on the left panel showing percentages of how female students came to school. Image description available. — **Figure 1.3**: Side-by-side pie Chart of “Transportation” for Females and Males. [Image Description (See Appendix D Figure 1.3)]

The side-by-side pie chart shows that the patterns in transportation among females and male are very similar, since the two pie charts are almost identical. The side-by-side bar chart based on the relative frequency gives the same conclusion: the distributions of “transportation” for female and male are very similar, which implies there is no significant difference between female and male in the way they come to school.

A side-by-side bar graph comparing relative frequency of how female and male students came to school. Image description available. — **Figure 1.4**: Side-by-Side Bar Chart of “Transportation” for Female and Male. [Image Description (See Appendix D Figure 1.4)]

When we compare the distributions of two different groups using a side-by-side bar chart, we should use the relative frequency as the y-axis. Using frequency as the y-axis and comparing the frequencies alone, without taking into account the total of each group, can be misleading.

1.4.2 Organizing Quantitative Discrete Data

We are able to list all possible values for a quantitative discrete variable; therefore, for a quantitative discrete variable with only a few different values, we can describe it using tools similar to those for qualitative variables, i.e., a (relative) frequency table and histogram.

A histogram is somewhat similar to a bar chart. The x-axis shows the value of the variable of interest and the y-axis displays either frequencies or relative frequencies. Histograms can be used to describe both quantitative discrete and quantitative continuous variables. For a continuous variable, we cut the range of the variable into subintervals of equal width and draw one rectangle for each subinterval; the height of the rectangle is the number of observations falling into the corresponding subinterval. For a discrete variable with a small number of possible values, we can draw a rectangle with equal width for each value, the height of each rectangle is either the frequency or relative frequency.

Example: Organizing Quantitative Discrete Variables

There are 100 students in a class; ten have no siblings, thirty have one sibling, thirty-five have two siblings, fifteen have three siblings, and ten have more than three siblings.

We can use a (relative) frequency table and a histogram to summarize the data.

Table 1.5: Frequency and Relative Frequency Table of “# of Siblings”

Total	100	1.00
# of Siblings	Frequency	Relative Frequency
0	10	0.10
1	30	0.30
2	35	0.35
3	15	0.15
>3	10	0.10

Histogram of number of siblings, the y-axis is frequency and x-axis is the number of siblings. Image description available — **Figure 1.5**: Histogram of “# of Siblings” [Image Description (See Appendix D Figure 1.5)] Click on image to enlarge.

Difference between a bar chart and a histogram:

The bars of a bar chart do not touch one another. Since there is often no inherent ordering among the categories, the order among the bars is usually irrelevant (i.e., bars can be switched without affecting the usefulness of the graph).
The adjacent bars of a histogram do touch one another. Since there is ordering among numbers, that ordering is to be preserved among the bars of a histogram. That is, the first bar corresponds to the smallest value (or the interval of the smallest values), the second bar corresponds to the second smallest value (or the interval of the second smallest values), and so on.

1.4.3 Organizing Quantitative Continuous Data

Example: Organizing Quantitative Continuous Variables

Here are the 50 grades for an exam:

68	72	59	56	60	40	55	68	76	75
46	59	37	54	83	85	29	55	56	42
50	49	65	68	61	53	55	92	68	48
79	51	24	57	48	71	90	81	34	60
47	39	65	74	49	52	59	9	62	37

How to present and summarize these data?

Grouping Table and Histogram

Recall that all values of a discrete variable can be listed. However, this is not the case for a continuous variable: we cannot list all possible values for a continuous variable. For example, even though the above 50 grades are all reported as whole numbers, there is no reason why a grade couldn’t contain a decimal, such as [latex]46.5[/latex], or [latex]66. \bar{6}[/latex]. For this reason, it is most appropriate to view the grade variable as a continuous variable. Even though we cannot list all possible values of a continuous variable, we can cut the range of a continuous variable into subintervals of equal width and use histograms to summarize quantitative continuous data. The range of grade is [0, 100], a convenient and neat cut is by intervals with width of 10 or 20. If we cut by intervals of 10, the resulting grouping data and histogram are as follows:

Table 1.6: Grouping Table of Grade for Histogram

Total	50	1.00
Interval	Frequency	Relative Frequency
[0, 10)	1	1/50=0.02
[10, 20)	0	0/50=0.00
[20, 30)	2	2/50=0.04
[30, 40)	4	4/50=0.08
[40, 50)	8	8/50=0.16
[50, 60)	14	14/50=0.28
[60, 70)	10	10/50=0.20
[70, 80)	6	6/50=0.12
[80, 90)	3	3/50=0.06
[90, 100]	2	2/50=0.04

Please note that we still need to keep those intervals that have no observations. For example, the interval [10, 20) includes 10 but excludes 20, and has no observations. We need to keep this interval when we draw a histogram for the data.

Histogram of grade, the y-axis is frequency and x-axis is grades. Image description available — **Figure 1.6**: Histogram of Grade [Image Description (See Appendix D Figure 1.6)]

A common question when drawing histograms is whether to use [, ) or (, ] intervals. Please note that different software may follow different rules. It is important to consistently follow the same rule for all intervals in your histogram.
Another common question is how many bins is proper. A rule of thumb is the square root of the number of observations. For the grade example, since [latex]n=50[/latex] and [latex]\sqrt{n}=\sqrt{50}=7.07[/latex]. The range of grade is [0, 100], to create convenient cuts, we can divide the range either into 10 subintervals with equal length, i.e., [latex][0, 10), [10, 20), \cdots, [90, 100][/latex] or 5 subintervals with equal width, i.e., [latex][0, 20), [20, 40), \cdots, [80, 100][/latex].
Note that histograms with different number of bins might appear very different. When investigating the shape of the distribution of a variable using a histogram, it is always better to draw a boxplot and normal Q-Q plot as well. Boxplot and normal Q-Q plot will be covered in sections 2.4 and 5.6 respectively.

Stem-and-Leaf Diagram

Another way to present quantitative data is a stem-and-leaf diagram. To construct a stem-and-leaf diagram:

Think of each observation consisting of a stem (all but the rightmost digit) and a leaf (the rightmost digit, a single digit).
Draw a vertical line, write the stems from the smallest to the largest in a vertical column to the left of the vertical line.
Write each leaf to the right of the vertical line in the same row as its corresponding stem.
Arrange the leaves in each row from the smallest to the largest.
Indicate the decimal place of the data if applicable.

Let’s return to the grades data:

68	72	59	56	60	40	55	68	76	75
46	59	37	54	83	85	29	55	56	42
50	49	65	68	61	53	55	92	68	48
79	51	24	57	48	71	90	81	34	60
47	39	65	74	49	52	59	9	62	37

We can group the grades by the first digits (in intervals of 10) as follows,

Table 1.7: Working Table for Stem-and-Leaf Diagram

Interval	Data
[0, 10)	9
[10, 20)
[20, 30)	24, 29
[30, 40)	34, 37, 37, 39
[40, 50)	40, 42, 46, 47, 48, 48, 49, 49,
[50, 60)	50, 51, 52, 53, 54, 55, 55, 55,56, 56, 57, 59, 59, 59
[60, 70)	60, 60, 61, 62, 65, 65, 68, 68, 68, 68, 68
[70, 80)	71, 72, 74, 75, 76, 79
[80, 90)	81, 83, 85
[90, 100]	90, 92

If we take apart the grades and mark the first digit at left side of the line and the second digit at the right side of the line, it becomes a stem-leaf diagram:

Stem and leaf diagram of grade. The stems are listed vertically to the left and the leaves of each stem are listed horizontally to the right. Image description available

Decimal place: 9|0 = 90

Figure 1.7: Stem-and-Leaf Diagram of Grade [Image Description (See Appendix D Figure 1.7)]

The part “Decimal place: 9|0 = 90” indicates that the decimal point is one digit to the right of the vertical line.

Some other useful guidelines of the stem-and-leaf diagram are as follows:

Keep the stems within the range of the data even though they have no leaf.
If there are too many leaves, break down each stem into two lines. Leaves from 0 to 4 are placed in the first line and 5 to 9 in the second. (See in plot on the right for the grade example. Because the interval [50, 60) has too many leaves, we break the leaves into 2 lines: the first line lists those leaves ranging from 0 to 4, the second line lists leaves ranging from 5 to 9. Do the same to all the other stems.
Divide or multiply the numbers by 10, 100, etc., and then round if necessary to create integers that have at most three digits, and indicate the decimal point if applicable.

Example

Let’s consider two data sets. Data set I: 3600, 1500, 6900 and Data set II: 0.36, 0.15, 0.69. It is not a good idea to draw the stem-and-leaf diagram based on the original data sets. Take Data set I for example, all three numbers have a leaf of 0 (the right most digit) and there are many stems without leaves between 150 and 360. Therefore, we divide the numbers by 100 and transform the numbers to 36, 15 69 to draw a stem-and-leaf diagram. Finally, we indicate the decimal point by putting 6|9=6900 at the bottom of the graph. Similarly, we multiply all three numbers 0.36, 0.15, 0.69 by 100 to create a new data set: 36, 15, 69 and then draw a stem-and-leaf diagram.

These two data sets have the same resulting stem-and-leaf diagram as the data set 36, 15, and 69. However, the decimal point is 3 digits to the right of the vertical line for Date set I, i.e., we should indicate 6|9=6900; the decimal point is one digit to the left of the vertical line for Date set II, i.e., 6|9=0.69.

Stem-and-Leaf Diagram for Data set 1: 3600, 1500, 6900

Stem	Leaf
1	5
2
3	6
4

6	9

Decimal 6|9=6900

Stem-and-Leaf Diagram for Data set 2: 0.36, 0.15, 0.69

Stem	Leaf
1	5
2
3	6
4

6	9

Decimal 6|9=0.69

License

Icon for the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License

68	72	59	56	60	40	55	68	76	75
46	59	37	54	83	85	29	55	56	42
50	49	65	68	61	53	55	92	68	48
79	51	24	57	48	71	90	81	34	60
47	39	65	74	49	52	59	9	62	37

68	72	59	56	60	40	55	68	76	75
46	59	37	54	83	85	29	55	56	42
50	49	65	68	61	53	55	92	68	48
79	51	24	57	48	71	90	81	34	60
47	39	65	74	49	52	59	9	62	37

68	72	59	56	60	40	55	68	76	75
46	59	37	54	83	85	29	55	56	42
50	49	65	68	61	53	55	92	68	48
79	51	24	57	48	71	90	81	34	60
47	39	65	74	49	52	59	9	62	37

68	72	59	56	60	40	55	68	76	75
46	59	37	54	83	85	29	55	56	42
50	49	65	68	61	53	55	92	68	48
79	51	24	57	48	71	90	81	34	60
47	39	65	74	49	52	59	9	62	37