13.7 Simple Linear Regression Model (SLRM)

So far, we have focused on a sample of 15 used cars. If we would like to draw conclusions about the population (all used cars), we need some inferential methods. First, we introduce some concepts from conditional probability. Consider, for example,  all five-year-old used cars; their prices follow a certain distribution. We call this distribution the conditional distribution of price given age is equal to 5, and the corresponding mean and standard deviation are referred to as the conditional mean and conditional standard deviation. For variables [latex]Y[/latex] and [latex]X[/latex], the conditional mean and standard deviation of [latex]Y[/latex] given [latex]X=x[/latex] are denoted with [latex]\mu_{Y|x}[/latex] and [latex]\sigma_{Y|x}[/latex]. Now, our objective is to model [latex]\mu_{Y|x}[/latex] and the predictor variable [latex]x[/latex] using a straight line (see the figure below). That is

[latex]\mu_{Y|x} = \beta_0 + \beta_1 x[/latex],

where [latex]\beta_0[/latex] and [latex]\beta_1[/latex] are the population intercept and population slope, respectively, with interpretation as follows:

  • [latex]\beta_0[/latex]: the average of the response variable [latex]y[/latex] when [latex]x = 0.[/latex]
  • [latex]\beta_1[/latex]: the change in the mean value of [latex]y[/latex] when the predictor variable [latex]x[/latex] increases by 1 unit.

For each given value of the predictor variable [latex]x[/latex], the conditional distribution of the response variable [latex]Y[/latex] given [latex]x[/latex] is assumed to follow a normal distribution with a mean [latex]\mu_{\scriptsize Y|x} = \beta_0 + \beta_1 x[/latex] and a standard deviation [latex]\sigma_{y|x} = \sigma[/latex] (that is, the conditional standard deviation of [latex]Y[/latex] is assumed constant for all values of [latex]x[/latex]). For example, most five-year-old cars are not equal in price to the conditional mean [latex]\mu_{y|x=5}[/latex]; [latex]\sigma[/latex] measures the typical difference between the conditional mean  [latex]\mu_{y|x=5}[/latex] and the actual value of any given five-year-old car. To capture this variation (sampling error), we introduce an error term [latex]\epsilon[/latex] into the model. The regression model becomes

[latex]Y = \beta_0 + \beta_1 x + \epsilon , \epsilon \sim N(0, \sigma)[/latex].

A demonstration of the use of the error term to account for the variation in data. Image description available.
Figure 13.9: Simple Linear Regression Model. [Image Description (See Appendix D Figure 13.9)] Click on the image to enlarge it.
 

  • This figure models the relationship between the conditional mean  [latex]\mu_{\scriptsize Y|x}[/latex] (the mean of the response variable [latex]Y[/latex]) and the predictor variable [latex]X[/latex] using a straight line, i.e., [latex]\mu_{\scriptsize Y|x} = \beta_0 + \beta_1 x[/latex].
  • In general, a single [latex]y[/latex] value is not equal to the conditional mean [latex]\mu_{\scriptsize Y|x}[/latex], so we use an error term [latex]\epsilon[/latex] to quantify the deviation of the [latex]y[/latex] values from the conditional mean [latex]\mu_{\scriptsize Y|x}[/latex]. The model assumes that [latex]\epsilon[/latex] has a mean of 0 since the conditional mean of [latex]Y[/latex] given [latex]x[/latex] is assumed to be [latex]\mu_{\scriptsize Y|x}[/latex].
  • Therefore, the simple linear regression model becomes [latex]Y = \mu_{\scriptsize Y | x} + \epsilon = \beta_0 + \beta_1 x + \epsilon, \epsilon \sim N(0, \sigma).[/latex]

Key Facts: Assumptions for Inferential Methods in Simple Linear Regression

In general, we need the following assumptions to apply the inferential methods in simple linear regression:

  • Linearity assumption: There is a linear association between the conditional mean and the predictor variable. This assumption can be checked by plotting a scatter plot ([latex]Y[/latex] against [latex]X[/latex]). If this assumption is met, all the points should roughly lie on a straight line.
  • Normal population: Given each value of the predictor variable [latex]x[/latex], the conditional distribution of the response variable is normally distributed. That is, [latex]Y | x \sim N(\beta_0 + \beta_1 x, \sigma)[/latex] or equivalently, the error term [latex]\epsilon[/latex] follows a normal distribution with mean 0 and standard deviation [latex]\sigma[/latex] for all values of [latex]x[/latex].
  • Equal standard deviation: The conditional standard deviation of the response variable is the same for all values of the predictor variable [latex]X[/latex]. This common standard deviation is denoted as [latex]\sigma[/latex]. This is equivalent to assuming that the error term [latex]\epsilon[/latex] has the same standard deviation for all values of [latex]x[/latex]
  • Independent observations: The observations of the response variable are independent of one another. This is hard to check unless we know how the data were collected.

Checking whether the model assumptions are satisfied or not can be performed by residual analysis. Recall that the residual of the ith observation, [latex]e_i[/latex], is defined as the difference between the observed value ([latex]y_i[/latex]) and the fitted value ([latex]\hat{y}[/latex]), i.e., [latex]e_i = y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i)[/latex]. If the model assumptions are satisfied, the residuals [latex]e_i, i = 1, 2, \dots, n[/latex] can be regarded as a simple random sample from a normal distribution with mean 0 and standard deviation [latex]\sigma[/latex]; this means the residuals should follow an approximate normal distribution with standard deviations that are similar for each value of [latex]x[/latex]. To check the assumptions, we draw two graphs of the residuals:

  • Normal population assumption: Draw a Q-Q (normal probability) plot on the residuals. If the normality assumption is met, the points should roughly fall on a straight line.
  • Equal standard deviation assumption and linearity assumption: Plot the residuals [latex]e_i[/latex] (y-axis) versus fitted values [latex]\hat{y}_i = b_0 + b_1 x_i[/latex] (x-axis). We can also plot the residuals [latex]e_i[/latex] ([latex]y[/latex]-axis) versus the [latex]x_i[/latex] values ([latex]x[/latex]-axis). If the equal standard deviation assumption is met, we will see a horizontal band centred and symmetric about the line [latex]y[/latex] = 0. If the points form some curvature, the linearity assumption is violated.

Example: Residual Analysis

Comment on the following residual plots (first row) and normal probability plots (second row) of three sets of residuals.

a) A scatter plot of standardized residuals. The data is scattered in a wide but even band around y = 0. Image description available. b) A scatter plot of standardized residuals. the data falls in a u-shape that intersects the line y=0 twice. Image description available. c) A scatter plot of standardized residuals. The data falls in a wide scatter pattern not centred on y = 0. Image description available.
a) A Q-Q plot of the residual plot a. The data falls in a tight band with an upward trend. Image description available. b) The Q-Q plot of residuals shown in b. The data curves toward then away from the normal line. Image description available. c) The Q-Q plot of the residuals shown in c. The data forms an s-shaped curve around the normal line. Image description available.
(a) (b) (c)

Figure 13.10: Residual Plots and Normal Probability Plots of Three Sets of Residuals. [Image Description (See Appendix D Figure 13.10)] Click on the image to enlarge it.

  1. Based on the residuals plot, both the equal standard deviation and the linearity assumptions appear met since the points form a horizontal band centred at 0. According to the normal probability (Q-Q) plot, points are roughly on a straight line and hence the normality assumption is satisfied.
  2. Based on the residual plot, the linearity assumption appears violated since the residual plot shows “U” shaped curvature. The normal probability plot provides evidence against the normality assumption.
  3. Based on the residuals plot, the equal standard deviation assumption appears violated since the residual plot does not show a horizontal band. The variation of the residuals gets larger when [latex]x[/latex] increases. The normal probability plot shows that the normality assumption is violated.

Exercise: Residual Analysis for the Used Cars

The residuals for the 15 used cars are given in the following table.

  1. Given that the least–squares regression line is [latex]\widehat{\text{price}} = 14.118 - 0.9432 \times \text{age}[/latex]. Verify the residual for the first car.
    A table showing the data on used cars. Image description available.
    Table 13.2: Fitted Value and Residuals of 15 Used Cars. [Image Description (See Appendix D Table 13.2)]
  2. Comment on the following graphs based on the residuals of the 15 used cars. Explain whether the assumptions for regression inference are met or not.
A Q-Q plot of residuals. There is a slight curve to the data points. Image description available. A scatter plot of standardized residuals. The data is widely but fairly evenly scattered around the line y = 0. Image description available.

Figure 13.11: Normal Probability Plot of Residuals (Left) and Scatter Plot of Residuals V.S. Predictor Variable (Right) [Image Description (See Appendix D Figure 13.11)]

Show/Hide Answer

Answers:

  1. Given that the least–squares regression line is [latex]\widehat{\text{price}} = 14.118 - 0.9432 \times \text{age}[/latex], the fitted value for the first used car with age = 1 is [latex]\hat{y}_1 = 14.118 - 0.9432 \times 1 = 13.1748[/latex], and the residual is [latex]e_1 = y_1 - \hat{y}_1 = 14 - 13.1748 = 0.8252[/latex] ($1000), which is the same as the residual given in the table.
  2. The plot in the left panel is the normal probability plot (normal QQ plot) for the residuals. The points in the QQ plot of the residuals are roughly on a straight line. Therefore, the normal population assumption appears to be met. The plot in the right panel is the scatterplot of residuals (y-axis) versus the predictor variable (x-axis). The points are roughly within a horizontal band centred at 0, without any obvious curvature. Therefore, both the linearity and the equal standard deviation assumptions appear to be met.

 

License

Icon for the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License

Introduction to Applied Statistics Copyright © 2024 by Wanhua Su is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.