by Evan Hubener, Sr. Consultant
Linear regression is a powerful statistical tool, used in nearly every field that works with data, including cost estimation. It has two primary applications: predictive and descriptive. In predictive applications, a model is typically built with a subset of training data and evaluated using test data, with the goal of building a model that closely predicts a response (dependent) variable. In descriptive applications, the model is built to describe the relationship between variables.
In the cost estimation field, including in defense acquisition, the response would typically be cost or price. Predictors vary by program and project but could include variables that measure workforce levels and experience, facilities available, input costs, project-specific investments or complexity, and many other factors.
In either case, if only one predictor (independent) variable is used, the method is known as simple linear regression. If more than one predictor is used, the method is known as multiple linear regression. In this tutorial, we will walk through an example of simple linear regression, covering the assumptions, building a model, and evaluating the results. We will use the cars dataset from R’s datasets package, which contains two variables, measuring the speed and stopping distance of cars in the 1920s.
We will need three packages for this exercise: dplyr, ggplot2, and datasets.
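A minimal sketch of loading the packages named above. The commented-out install line is an assumption about your setup; datasets ships with base R and normally needs no installation.

```r
# Install once if needed (assumption: packages are not yet installed).
# install.packages(c("dplyr", "ggplot2"))

library(dplyr)     # data manipulation
library(ggplot2)   # plotting
library(datasets)  # built-in example datasets, including cars
```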
Let’s read in the cars dataset, which lists speed (measured in miles per hour) and stopping distance (measured in feet). Once we load the data, we can examine the first few rows and visually inspect each variable with a histogram.
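One way this step might look is sketched below. The bin count of 10 and the object names are illustrative assumptions, not values taken from the original post.

```r
library(dplyr)
library(ggplot2)
library(datasets)

data(cars)       # columns: speed (mph) and dist (ft)
head(cars)       # first six rows
glimpse(cars)    # column types and a preview of the values

# Histograms of each variable (bins = 10 is an arbitrary choice).
hist_speed <- ggplot(cars, aes(x = speed)) +
  geom_histogram(bins = 10) +
  labs(x = "Speed (mph)", y = "Count")

hist_dist <- ggplot(cars, aes(x = dist)) +
  geom_histogram(bins = 10) +
  labs(x = "Stopping distance (ft)", y = "Count")

print(hist_speed)
print(hist_dist)
```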
The histograms look pretty good, though distance is slightly right-tailed. We may need to address this, but we will press on for now and create a scatter plot. We are looking for a linear relationship between our variables, a key assumption of linear regression.
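The scatter plot described here could be drawn roughly as follows. The axis labels and the object name `scatter` are assumptions made for illustration.

```r
library(ggplot2)
library(datasets)

# Speed on the x-axis (predictor), stopping distance on the y-axis (response).
scatter <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  labs(x = "Speed (mph)", y = "Stopping distance (ft)")

print(scatter)

# A quick numeric companion to the visual check: the Pearson correlation.
speed_dist_cor <- cor(cars$speed, cars$dist)
speed_dist_cor
```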
There does appear to be a linear relationship between distance and speed, so let's fit a model.
Testing our hypothesis
We are interested in the distance it takes a car in the 1920s to stop, and whether speed is associated with that distance. Our null hypothesis is that there is no relationship between speed and distance.
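A sketch of fitting the simple linear regression with base R's `lm()`, with distance as the response and speed as the predictor. The object name `model1` is an assumption used only for illustration.

```r
library(datasets)

# Fit dist as a linear function of speed.
model1 <- lm(dist ~ speed, data = cars)

# Coefficients, standard errors, t-statistics, p-values,
# R-squared, and the overall F-statistic.
summary(model1)
```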
We’ll start our analysis of the model by checking the F-statistic. The F-statistic tells us if the overall model is significant, which becomes more important as we move on to multiple regression, which may involve many more predictors. If the model is significant, we can move on to the coefficients, p-values, and the R-squared. Our F-statistic is definitely significant, with a p-value of 1.49e-12, or as-near-as-makes-no-difference 0.
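The p-value for the F-statistic is printed by `summary()`, but it can also be recomputed from the stored statistic. This sketch assumes the model is refit under the illustrative name `model1`.

```r
library(datasets)

model1 <- lm(dist ~ speed, data = cars)

# summary()$fstatistic holds the F value and its two degrees of freedom.
fstat <- summary(model1)$fstatistic

# Upper-tail probability of the F distribution gives the p-value.
f_p_value <- pf(fstat[["value"]], fstat[["numdf"]], fstat[["dendf"]],
                lower.tail = FALSE)
f_p_value
```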
Next, we can check the speed coefficient. We have significance at the 99.9% level, an excellent result! Now we can interpret. Our coefficient for speed, 3.9324, means that for every 1 mph increase in speed, the estimated stopping distance increases by 3.9324 feet. If we want to estimate the stopping distance for a car traveling at 15 mph, we multiply the speed by the coefficient and add the intercept (-17.58 + 3.93*15 = 41.4 ft). Speed and distance are indeed related.
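The same estimate can be obtained with `predict()` rather than by hand. This is a sketch; `model1` and `new_car` are illustrative names.

```r
library(datasets)

model1 <- lm(dist ~ speed, data = cars)

# Estimated stopping distance (ft) for a car traveling at 15 mph.
new_car <- data.frame(speed = 15)
pred_15 <- predict(model1, newdata = new_car)
pred_15

# Equivalent manual calculation: intercept + slope * speed.
manual_15 <- coef(model1)[["(Intercept)"]] + coef(model1)[["speed"]] * 15
```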
Additional regression assumptions
There are other assumptions of simple linear regression, and they deal with the distribution of the errors, including their normality. Let’s graph the model results and run a couple of checks on the assumptions. We can visualize the model results using geom_smooth(), specifying that the line we fit uses the “lm” (linear model) method.
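A sketch of the plot described here. The explicit `formula = y ~ x` and `level = 0.95` arguments restate ggplot2's defaults for clarity; the object name is an assumption.

```r
library(ggplot2)
library(datasets)

# Scatter plot with the least-squares line and its 95% confidence band.
fit_plot <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, level = 0.95) +
  labs(x = "Speed (mph)", y = "Stopping distance (ft)")

print(fit_plot)
```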
The bands around the least-squares line are a 95% confidence interval for the fitted line itself, not a prediction interval for individual cars, so some points will naturally fall outside them. Even so, the amount of scatter around the line suggests that there is more going on with stopping distance than just speed. An obvious answer is that some cars probably had better brakes than others, but there could be numerous factors at play: weight of the car, tire quality, test surface variability, driver reaction times, or measurement error. Remember, the data is from the 1920s, or the dark ages for automobiles. Wait, no… the dark ages were definitely the 1970s.
Distribution of errors along the x-variable
Look at the prior plot. The distances between the points and the least-squares line are known as the residual errors. Do they vary along the x-values? If speed increases, do the errors increase or decrease? If so, the data are heteroscedastic, and this would violate an assumption of linear regression. If they do not change much, then we have homoscedasticity, and our assumption is met.
We can also check this by graphing the residuals. We should see no particular pattern in these errors (distances between y-hat and y-actual), and we can quickly check these with the plot() function. A cone shape would indicate heteroscedasticity, but in this case, the residuals look pretty good.
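One way to draw the residual plot with base R's `plot()` method for `lm` objects is sketched below; `which = 1` selects the residuals-versus-fitted panel. The name `model1` is illustrative.

```r
library(datasets)

model1 <- lm(dist ~ speed, data = cars)

# Residuals vs. fitted values: look for a cone or curve,
# either of which would suggest a violated assumption.
plot(model1, which = 1)

# The residuals themselves, one per observation.
model1_resid <- residuals(model1)
```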
Distribution of errors around the y-hat
The final assumptions of simple linear regression are that the residual errors are independent (not correlated with one another) and that they are normally distributed around the y-hat (least-squares line). That is, more errors are clustered close to the line, and fewer errors are farther away from the line. Another way to think about the distribution is by drawing vertical bell curves, centered on the least-squares line. Most of the observations would be captured under the fat part of the bell, with fewer observations under the tails. The R-squared does not test normality directly, but it is a useful companion measure of how tightly the points cluster around the line: it reports the percent of total variation in Y that is explained by the least-squares line.
If the assumption does not hold, then there are consequences for both predictive and descriptive applications. For a descriptive application, the model will leave the user with a large percentage of unexplained variation in the response. If the model is deployed in a predictive application, then many predictions may be wide of the mark.
We can check the normality of the residuals using a Q-Q plot. If the points more or less follow a straight line, then our errors meet the normality assumption. In our case, the points are pretty straight.
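A sketch of two equivalent ways to draw a Q-Q plot of the residuals in base R. The name `model1` is an illustrative assumption.

```r
library(datasets)

model1 <- lm(dist ~ speed, data = cars)

# Option 1: the built-in diagnostic panel (standardized residuals).
plot(model1, which = 2)

# Option 2: build the plot by hand from the raw residuals.
model1_resid <- residuals(model1)
qqnorm(model1_resid)
qqline(model1_resid)   # reference line through the quartiles
```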
Circling back to the R-squared (always between 0 and 1), our hunch about other factors at play is confirmed. The R-squared of 0.65 tells us that our model explains 65% of the variability in distance, leaving about 35% unexplained.
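The R-squared can be pulled directly from the model summary, as in this short sketch (object names are illustrative).

```r
library(datasets)

model1 <- lm(dist ~ speed, data = cars)

# Share of the variance in dist explained by speed.
r2_model1 <- summary(model1)$r.squared
round(r2_model1, 2)
```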
The power of transformations
Remember those histograms? The histogram for speed looked great, but the histogram for distance was slightly right-tailed, which can be common with many types of data. When this occurs, the data can be transformed so that they are more normally distributed, potentially improving model performance. A variable can be transformed by taking a log, square root, or Box-Cox transformation; transformations can be performed on either the predictor variables or the response. Let’s see what happens if we transform distance by taking its square root.
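A sketch of the square-root transformation using dplyr's `mutate()`. The column name `sqrt_dist`, the data frame name `cars_sqrt`, and the bin count are assumptions for illustration.

```r
library(dplyr)
library(ggplot2)
library(datasets)

# Add a transformed response column alongside the original.
cars_sqrt <- cars %>%
  mutate(sqrt_dist = sqrt(dist))

# Histogram of the transformed response (bins = 10 is arbitrary).
hist_sqrt <- ggplot(cars_sqrt, aes(x = sqrt_dist)) +
  geom_histogram(bins = 10) +
  labs(x = "Square root of stopping distance", y = "Count")

print(hist_sqrt)
```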
This histogram is definitely improved by the transformation. Let's refit the model and check the results.
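The refit might look like the sketch below. Here the transformation is applied inside the formula, which is one of several equivalent options; `model2` is an illustrative name.

```r
library(datasets)

# Same predictor, but the response is now the square root of distance.
model2 <- lm(sqrt(dist) ~ speed, data = cars)

summary(model2)
```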
The model results are improved. The p-value of our F-statistic is again significant; our coefficient for speed is highly significant, with an even higher t-stat (estimate/std error); and our R-squared has improved by about 6 percentage points, from roughly 0.65 to 0.71.
If we use the second model to estimate distance, the result is the square root of the distance, but we can square this to get the units back in feet. For example, if we want to estimate the stopping distance for a car traveling at 20 mph, we get (1.27705 + 0.32241*20)^2, or 59.7 ft. A log transformation is handled similarly. If only the response is log-transformed, we can interpret a coefficient as a percent change by exponentiating it, subtracting 1, and multiplying by 100. For example, if our coefficient is 0.25, a one-unit increase in the independent variable corresponds with a 28% increase ((exp(0.25)-1) x 100) in our response.
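Both calculations can be sketched in a few lines of R. The 0.25 coefficient is the hypothetical value from the text, not a result from the cars data, and the object names are illustrative.

```r
library(datasets)

model2 <- lm(sqrt(dist) ~ speed, data = cars)

# Prediction is on the square-root scale; square it to get feet.
pred_sqrt_20 <- predict(model2, newdata = data.frame(speed = 20))
pred_ft_20 <- unname(pred_sqrt_20)^2
pred_ft_20

# Percent change implied by a coefficient on a log-transformed response.
log_coef <- 0.25                          # hypothetical coefficient
pct_change <- (exp(log_coef) - 1) * 100
pct_change
```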
Let's run our cross-checks, looking at the scatter plot with the least-squares line, the residuals, and the Q-Q plot.
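A sketch of the three diagnostics for the transformed model. The side-by-side layout and the object names are assumptions for illustration.

```r
library(ggplot2)
library(datasets)

model2 <- lm(sqrt(dist) ~ speed, data = cars)

# Scatter plot on the transformed scale with the least-squares line.
fit_plot2 <- ggplot(cars, aes(x = speed, y = sqrt(dist))) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(x = "Speed (mph)", y = "Square root of stopping distance")
print(fit_plot2)

# Residuals vs. fitted and the Q-Q plot, side by side.
old_par <- par(mfrow = c(1, 2))
plot(model2, which = 1)
plot(model2, which = 2)
par(old_par)   # restore the previous plotting layout
```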
The differences are not enormous, but all plots show more normally distributed errors for the second model, with model 2 especially improving at the tails.
Getting the right data and verifying the assumptions are the biggest hurdles to linear regression. Once these hurdles are cleared, the regression is as simple as filling in the formula for the linear model function. In cost estimating, the right data means data for comparable projects. How much do comparable projects cost, and what are the predictors of those costs? Once these are known, linear regression can be deployed for estimation.