Residuals
Another topic related to regression is residuals. Although we have not used the name, we have been using the concept ever since we started working with regression. We start with some original data points, the observed values. We generate a regression equation that has the smallest sum of the squares of the differences between the observed y values and the expected y values for all of the observed x values. The difference between an observed y value and its corresponding expected y value is the residual for the associated x value. Consider the data in Table 1.
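As a tiny numeric illustration (with made-up numbers, not the Table 1 data): if the regression line predicts an expected y of 8.4 at some x, and the observed y there is 9.0, then the residual at that x is 9.0 - 8.4 = 0.6. In R that is just:
observed_y <- 9.0         # a made-up observed value
expected_y <- 8.4         # a made-up expected value from the regression line
observed_y - expected_y   # the residual, here 0.6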
We can generate and display the data, and then create and display the linear regression model, via the following commands:
gnrnd4( key1=967170506, key2=6121050612, key3=8600004 )  # generate the Table 1 data
L1                    # display the x values
L2                    # display the y values
lm_L2L1 <- lm(L2~L1)  # create the linear model
lm_L2L1               # display its coefficients
cor(L1,L2)            # display the correlation coefficient
These produce the console text in Figure 1.
Figure 1
We use the information in Figure 1 to form the
regression equation, namely,
y = 0.853 + 0.6971 * x.
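For example (our own spot check, at an x value we simply picked), the equation says that when x is 40 the expected y is about 28.74:
0.853 + 0.6971 * 40   # expected y at x = 40; about 28.74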
We can even generate a plot of the observed points
and the regression line via
plot(L1,L2,xlim=c(0,80),xaxp=c(0,80,16),
     ylim=c(0,60),yaxp=c(0,60,12),
     main="Graph of Table 1 and Regression",
     las=1, cex.axis=0.7, pch=18, col="darkgreen")
abline(v=seq(0,80,5), col="darkgray", lty=3)  # vertical grid lines
abline(h=seq(0,60,5), col="darkgray", lty=3)  # horizontal grid lines
abline(lm_L2L1, col="red", lwd=2)             # the regression line
That plot is shown in Figure 2.
Figure 2
Now we want the residuals. To find them we need to know, for each observed x value, both the associated observed y value and the associated expected y value.
The commands
c_vals <- coefficients(lm_L2L1)
c_vals
y_vals <- c_vals[1]+c_vals[2]*L1
y_vals
first retrieve the intercept and coefficient of x from
our model and store them in the variable c_vals.
Then we display the values in c_vals, noting that they are slightly more precise versions of the values we saw in Figure 1. The statement y_vals <- c_vals[1]+c_vals[2]*L1 uses those values, along with the observed x values stored in L1, to implement the regression equation and, therefore, to compute all of the expected y values. Those are stored in the variable y_vals. The last line, y_vals, then displays those values, as shown in Figure 3.
Figure 3
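As a side note (a standard R detail, not something shown in the figures): coefficients() returns a named vector, so c_vals[1] carries the name "(Intercept)" and c_vals[2] the name "L1". That means we could have written the same computation using the names:
y_vals <- c_vals["(Intercept)"] + c_vals["L1"]*L1   # same result as using c_vals[1] and c_vals[2]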
Those expected y values are the y coordinates
for the associated x values of points on the
regression line.
We add these points to our graph via the command
points(L1,y_vals,pch=17,col="blue")
with the result shown in Figure 4.
Figure 4
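As an aside, R can produce these same expected values directly: the standard fitted() function returns the model's predicted y value at each observed x. A minimal sketch (the name y_vals_alt is just ours):
y_vals_alt <- fitted(lm_L2L1)  # expected y values, straight from the model
y_vals_alt                     # should match y_vals
We will keep using the y_vals that we computed by hand, since the point here is to see each step.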
On the graph we can
represent the difference between the observed and expected y values
by drawing a line segment between those points.
The code
for(i in 1:length(L1)) {
  lines(c(L1[i],L1[i]), c(L2[i],y_vals[i]), lwd=2)  # vertical segment at x = L1[i]
}
accomplishes this as we see in Figure 5.
Figure 5
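For what it is worth, the same segments can be drawn without an explicit loop, since R's standard segments() function is vectorized:
segments(L1, L2, L1, y_vals, lwd=2)   # one vertical segment per observed point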
Although the graphic representation is nice, we really want to get the
numeric values for the residuals.
The observed y values are in L2.
The expected y values are in y_vals.
The statements
r_vals <- L2-y_vals
r_vals
compute those differences, store them in r_vals, and then display those residual values.
The console result is shown in Figure 6.
[When matching the values shown in Figure 6 with the line segments
shown in Figure 5 remember that the values in Figure 6
are associated, in order, with the values in L1. Looking back at
Table 1 we see that those values in L1 were not given in ascending order.
Therefore, we need to check the x value in Table 1 to
determine which of the line segments in Figure 5 matches which of the
values shown in Figure 6.]
Figure 6
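One quick sanity check we can add here (a general fact about least-squares lines that include an intercept, not something shown in Figure 6): the residuals always sum to essentially zero.
sum(r_vals)   # should be 0, up to tiny floating-point round-off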
A careful reader of these pages may recall that we actually saw a display of residuals on a much earlier page. There we used the summary() command to get more information about our linear model. We can use that command here, as summary(lm_L2L1), to see a great deal of information about the linear model that we stored in the variable lm_L2L1.
Figure 7 shows the console output from that command.
Figure 7
Sure enough, there in Figure 7, are the residuals,
though given with fewer decimal places.
Recall that we used the coefficients() function to
pull the intercept and coefficient of x out of the
linear model.
In the same way, instead of computing the residuals as we did, by finding the expected values (in Figure 3) and then the observed minus the expected values (in Figure 6), we could have just used the function residuals() to extract the values from the linear model. The command would be lm_resid <- residuals(lm_L2L1), as shown in Figure 8.
Figure 8
The values shown in Figure 8 match the values we worked so hard to produce
earlier.
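If we would rather have R confirm the match for us, a one-line check (our own addition, not part of the original session) is:
all.equal(as.numeric(lm_resid), as.numeric(r_vals))  # TRUE when the two sets agree
# as.numeric() strips the position names that residuals() attaches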
Residuals are important because we want to be sure that there is no apparent pattern to them. That is, if we plot the residuals (matching each residual value to its associated x value), then the plotted points should appear randomly placed on the graph. For our problem a quick and simple statement such as plot(L1,lm_resid) will produce such a graph, in this case the graph in Figure 9.
Figure 9
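One optional refinement, not used in Figure 9: adding a horizontal reference line at zero makes it easier to judge how the residuals scatter above and below the regression line.
abline(h=0, col="red", lty=2)   # reference line at residual = 0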
Table 1 had only 6 pairs of values in it. With so few points it is hard either to spot a pattern in the residual plot or to rule one out. It would be better to look at a slightly larger problem.
Table 2 provides the data for a new problem.
This example has 59 pairs of values.
However, the analysis process is the same.
As before, we use
gnrnd4( key1=967175806, key2=6121050612, key3=8600004 )
L1
L2
lm_L2L1 <- lm(L2~L1)
lm_L2L1
cor(L1,L2)
to generate and display (just for verification) the data, as
well as to create and display a new linear model.
Console output is shown in Figure 10.
Figure 10
The commands
plot(L1,L2,xlim=c(0,85),xaxp=c(0,85,17),
     ylim=c(0,65),yaxp=c(0,65,13),
     main="Graph of Table 2 and Regression",
     las=1, cex.axis=0.7, pch=18, col="darkgreen")
abline(v=seq(0,85,5), col="darkgray", lty=3)
abline(h=seq(0,65,5), col="darkgray", lty=3)
abline(lm_L2L1, col="red", lwd=2)
produce a graph, as shown in Figure 11.
Figure 11
Although we learned above that, if we wanted just the residuals, we could jump ahead to the commands following Figure 15, here we will walk through our full process to illustrate what R has already done for us.
We generate the expected y values via
c_vals <- coefficients(lm_L2L1)
c_vals
y_vals <- c_vals[1]+c_vals[2]*L1
y_vals
with the console output in Figure 12.
Figure 12
We graph the expected points and draw the line segments via
points(L1,y_vals,pch=17,col="blue")
for(i in 1:length(L1)) {
  lines(c(L1[i],L1[i]), c(L2[i],y_vals[i]), lwd=2)
}
changing the graph to that in Figure 13.
Figure 13
We compute and display the residuals via
r_vals <- L2-y_vals
r_vals
as shown in Figure 14.
Figure 14
Then, just so that we can see the values, we use the summary(lm_L2L1) command to produce the more detailed display of the values in our linear model.
That display is shown in Figure 15.
Figure 15
Note that there is a subtle change to the format of the
output in Figure 15 compared to the format that we
saw back in Figure 7.
In the earlier display, R actually gave us the
6 residual values.
Here, in Figure 15, rather than display all 59 values,
R shows us a summary of those values,
namely the Min, 1Q, Median, 3Q, and Max values.
Note that our linear model still has the residual values in it; the summary() command simply reports these five summary values rather than displaying all of the individual residual values.
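Those five values are just a five-number summary of the residuals, so we can reproduce them ourselves. A quick sketch (our own check, applying the standard summary() function to the extracted residuals):
summary(residuals(lm_L2L1))  # Min, 1st Qu., Median, Mean, 3rd Qu., Max
# this version also reports the Mean, which for least-squares residuals is essentially 0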
We use the commands
lm_resid <- residuals(lm_L2L1)
lm_resid
to retrieve the residual values from our model and to
display those values, as shown in Figure 16.
Figure 16
The display in Figure 16 takes up a bit more room than did our earlier display, in Figure 14, of the r_vals that we had computed. This longer display is just a change in formatting style: the Figure 16 version also gives the position number of each residual value.
Finally, we use the command plot(L1,lm_resid)
to plot
the residuals so that we can see that there is no real pattern to them.
The plot appears in Figure 17.
Figure 17
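As a recap (our own condensation of the steps on this page), the entire residual check reduces to a few lines:
lm_L2L1 <- lm(L2~L1)            # build the linear model
lm_resid <- residuals(lm_L2L1)  # extract the residuals
plot(L1, lm_resid)              # look for any pattern
abline(h=0, lty=2)              # optional zero reference line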
©Roger M. Palay
Saline, MI 48176 November, 2015