Residuals
Another topic related to regression is residuals. Although we have not used the name, we have been using the concept ever since we started working with regression. We start with some original data points, the observed values. We generate a regression equation that has the smallest sum of the squares of the differences between the observed y values and the expected y values for all of the observed x values. The difference between an observed y value and its corresponding expected y value is the residual for the associated x value. Consider the data in Table 1.
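As a tiny numeric illustration (with made-up numbers, not the Table 1 data): if the regression line predicts an expected y of 8.4 at some x, and the observed y there is 9.0, then the residual at that x is 9.0 - 8.4 = 0.6. In R that is just:
observed_y <- 9.0         # a made-up observed value
expected_y <- 8.4         # a made-up expected value from the regression line
observed_y - expected_y   # the residual, here 0.6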
We can generate and display the data, and then create and display the linear regression model, via the following commands:
gnrnd4( key1=967170506, key2=6121050612, key3=8600004 )  # generate the Table 1 data
L1                    # display the x values
L2                    # display the y values
lm_L2L1 <- lm(L2~L1)  # create the linear model
lm_L2L1               # display its coefficients
cor(L1,L2)            # display the correlation coefficient
These produce the console text in Figure 1.
Figure 1
We use the information in Figure 1 to form the
regression equation, namely,
y = 0.853 + 0.6971 * x.
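For example (our own spot check, at an x value we simply picked), the equation says that when x is 40 the expected y is about 28.74:
0.853 + 0.6971 * 40   # expected y at x = 40; about 28.74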
We can even generate a plot of the observed points
and the regression line via
plot(L1,L2,xlim=c(0,80),xaxp=c(0,80,16),
     ylim=c(0,60),yaxp=c(0,60,12),
     main="Graph of Table 1 and Regression",
     las=1, cex.axis=0.7, pch=18, col="darkgreen")
abline(v=seq(0,80,5), col="darkgray", lty=3)  # vertical grid lines
abline(h=seq(0,60,5), col="darkgray", lty=3)  # horizontal grid lines
abline(lm_L2L1, col="red", lwd=2)             # the regression line
That plot is shown in Figure 2.
Figure 2
Now we want the residuals. To find them we need to know, for each observed x value, both the associated observed y value and the associated expected y value.
The commands
c_vals <- coefficients(lm_L2L1)
c_vals
y_vals <- c_vals[1]+c_vals[2]*L1
y_vals
first retrieve the intercept and coefficient of x from
our model and store them in the variable c_vals.
Then we display the values in c_vals, noting that they are slightly more precise versions of the values we saw in Figure 1. The statement y_vals <- c_vals[1]+c_vals[2]*L1 uses those values, along with the observed x values stored in L1, to implement the regression equation and, therefore, to compute all of the expected y values. Those are stored in the variable y_vals. The last line, y_vals, then displays those values, as shown in Figure 3.
Figure 3
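As a side note (a standard R detail, not something shown in the figures): coefficients() returns a named vector, so c_vals[1] carries the name "(Intercept)" and c_vals[2] the name "L1". That means we could have written the same computation using the names:
y_vals <- c_vals["(Intercept)"] + c_vals["L1"]*L1   # same result as using c_vals[1] and c_vals[2]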
Those expected y values are the y coordinates
for the associated x values of points on the
regression line.
We add these points to our graph via the command
points(L1,y_vals,pch=17,col="blue")
with the result shown in Figure 4.
Figure 4
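As an aside, R can produce these same expected values directly: the standard fitted() function returns the model's predicted y value at each observed x. A minimal sketch (the name y_vals_alt is just ours):
y_vals_alt <- fitted(lm_L2L1)  # expected y values, straight from the model
y_vals_alt                     # should match y_vals
We will keep using the y_vals that we computed by hand, since the point here is to see each step.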
On the graph we can
represent the difference between the observed and expected y values
by drawing a line segment between those points.
The code
for(i in 1:length(L1)) {
  lines(c(L1[i],L1[i]), c(L2[i],y_vals[i]), lwd=2)  # vertical segment at x = L1[i]
}
accomplishes this as we see in Figure 5.
Figure 5
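For what it is worth, the same segments can be drawn without an explicit loop, since R's standard segments() function is vectorized:
segments(L1, L2, L1, y_vals, lwd=2)   # one vertical segment per observed point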
Although the graphic representation is nice, we really want to get the
numeric values for the residuals.
The observed y values are in L2.
The expected y values are in y_vals.
The statements
r_vals <- L2-y_vals
r_vals
compute those differences, store them in r_vals, and then display those residual values.
The console result is shown in Figure 6.
[When matching the values shown in Figure 6 with the line segments
shown in Figure 5 remember that the values in Figure 6
are associated, in order, with the values in L1. Looking back at
Table 1 we see that those values in L1 were not given in ascending order.
Therefore, we need to check the x value in Table 1 to
determine which of the line segments in Figure 5 matches which of the
values shown in Figure 6.]
Figure 6
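One quick sanity check we can add here (a general fact about least-squares lines that include an intercept, not something shown in Figure 6): the residuals always sum to essentially zero.
sum(r_vals)   # should be 0, up to tiny floating-point round-off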
A careful reader of these pages may recall that we actually saw a display of residuals on a much earlier page. There we used the summary() command to get more information about our linear model. We can use that command here, as summary(lm_L2L1), to see a great deal of information about the linear model that we stored in the variable lm_L2L1.
Figure 7 shows the console output from that command.
Figure 7
Sure enough, there in Figure 7, are the residuals,
though given with fewer decimal places.
Recall that we used the coefficients() function to
pull the intercept and coefficient of x out of the
linear model.
In the same way, instead of computing the residuals as we did, by finding the expected values (in Figure 3) and then the observed minus the expected values (in Figure 6), we could have just used the function residuals() to extract the values from the linear model. The command would be lm_resid <- residuals(lm_L2L1), as shown in Figure 8.
Figure 8
The values shown in Figure 8 match the values we worked so hard to produce
earlier.
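If we would rather have R confirm the match for us, a one-line check (our own addition, not part of the original session) is:
all.equal(as.numeric(lm_resid), as.numeric(r_vals))  # TRUE when the two sets agree
# as.numeric() strips the position names that residuals() attaches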
Residuals are important because we want to be sure that there is no apparent pattern to them. That is, if we plot the residuals (matching each residual value to its associated x value), then the plotted points should appear randomly placed on the graph. For our problem a quick and simple statement such as plot(L1,lm_resid) will produce such a graph, in this case the graph in Figure 9.
Figure 9
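One optional refinement, not used in Figure 9: adding a horizontal reference line at zero makes it easier to judge how the residuals scatter above and below the regression line.
abline(h=0, col="red", lty=2)   # reference line at residual = 0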
Table 1 had only 6 pairs of values in it. With so few points it is hard either to spot a pattern in the residual plot or to rule one out. It would be better to look at a slightly larger problem.
Table 2 provides the data for a new problem.
This example has 59 pairs of values.
However, the analysis process is the same.
As before, we use
gnrnd4( key1=967175806, key2=6121050612, key3=8600004 )
L1
L2
lm_L2L1 <- lm(L2~L1)
lm_L2L1
cor(L1,L2)
to generate and display (just for verification) the data, as
well as to create and display a new linear model.
Console output is shown in Figure 10.
Figure 10
The commands
plot(L1,L2,xlim=c(0,85),xaxp=c(0,85,17),
     ylim=c(0,65),yaxp=c(0,65,13),
     main="Graph of Table 2 and Regression",
     las=1, cex.axis=0.7, pch=18, col="darkgreen")
abline(v=seq(0,85,5), col="darkgray", lty=3)
abline(h=seq(0,65,5), col="darkgray", lty=3)
abline(lm_L2L1, col="red", lwd=2)
produce a graph, as shown in Figure 11.
Figure 11
Although we learned above that, if we wanted just the residuals, we could jump ahead to the commands following Figure 15, here we will walk through our full process to illustrate what R has already done for us.
We generate the expected y values via
c_vals <- coefficients(lm_L2L1)
c_vals
y_vals <- c_vals[1]+c_vals[2]*L1
y_vals
with the console output in Figure 12.
Figure 12
We graph the expected points and draw the line segments via
points(L1,y_vals,pch=17,col="blue")
for(i in 1:length(L1)) {
  lines(c(L1[i],L1[i]), c(L2[i],y_vals[i]), lwd=2)
}
changing the graph to that in Figure 13.
Figure 13
We compute and display the residuals via
r_vals <- L2-y_vals
r_vals
as shown in Figure 14.
Figure 14
Then, just so that we can see the values, we use the summary(lm_L2L1) command to produce the more detailed display of the values in our linear model.
That display is shown in Figure 15.
Figure 15
Note that there is a subtle change to the format of the
output in Figure 15 compared to the format that we
saw back in Figure 7.
In the earlier display, R actually gave us the
6 residual values.
Here, in Figure 15, rather than display all 59 values,
R shows us a summary of those values,
namely the Min, 1Q, Median, 3Q, and Max values.
Note that our linear model still has the residual values in it; the summary() command simply reports these five summary values rather than displaying all of the individual residual values.
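Those five values are just a five-number summary of the residuals, so we can reproduce them ourselves. A quick sketch (our own check, applying the standard summary() function to the extracted residuals):
summary(residuals(lm_L2L1))  # Min, 1st Qu., Median, Mean, 3rd Qu., Max
# this version also reports the Mean, which for least-squares residuals is essentially 0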
We use the commands
lm_resid <- residuals(lm_L2L1)
lm_resid
to retrieve the residual values from our model and to
display those values, as shown in Figure 16.
Figure 16
The display in Figure 16 takes up a bit more room than did our earlier display, in Figure 14, of the r_vals that we had computed. This longer display is just a change in formatting style: the Figure 16 version also gives the position number of each residual value.
Finally, we use the command plot(L1,lm_resid)
to plot
the residuals so that we can see that there is no real pattern to them.
The plot appears in Figure 17.
Figure 17
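As a recap (our own condensation of the steps on this page), the entire residual check reduces to a few lines:
lm_L2L1 <- lm(L2~L1)            # build the linear model
lm_resid <- residuals(lm_L2L1)  # extract the residuals
plot(L1, lm_resid)              # look for any pattern
abline(h=0, lty=2)              # optional zero reference line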
©Roger M. Palay
Saline, MI 48176 November, 2015