## More Linear Regression

Here is another typical textbook linear regression problem. Table 1 gives 30 pairs of values, an X value and a Y value in each pair. Thus, the first pair of values is the point (9.6,13.4) and the second pair is the point (15.5,24.4) and so on.

We will follow the usual procedure for doing our work on the USB drive. In particular, we have
1. inserted our USB drive,
2. created a directory called `worksheet06` on that drive
3. have copied `model.R` from our root folder into our new folder,
4. have renamed that new copy of the file to the name `ws06.R`, and
5. have double clicked on that file to open RStudio.
The result should be a window pretty much identical to the one shown in Figure 1. {Recall that the images shown here may have been reduced to make a printed version of this page a bit shorter than it woud otherwise appear. In most cases your browser should allow you to right click on an image and then select the option to View Image in order to see the image in its original form.}

Figure 1 Our next step is to generate and verify the data that we were given in Table 1.

Figure 2 The Console, in Figure 3, shows the result of running the commands given in Figure 2.

Figure 3 Next we will get a quick plot of the points.

Figure 4 There is not much to see in the Console.

Figure 5 But the Plot pane, in Figure 6, shows the scatter plot of the data.

Figure 6 That plot certainly suggests a strong linear relation between the values in `L1` and the "paired" values in `L2`. However, before we look at that relationship, we can draw the graph again but this time clean it up a bit with a few more commands. Those are shown, in the Editor pane in Figure 7.

Figure 7 Running those commands produces the much nicer looking plot shown in Figure 8.

Figure 8 Then we can look at the linear model of the relationship by using the function `lm(L2~L1)` shown in Figure 9.

Figure 9 Running those commands produces the rudimentary output shown in the Console view in Figure 10.

Figure 10 The information in Figure 10 is enough for us to fomulate the linear regression equation
`y = 8.205 + (-2.008)x`
We have storeed the linear model in the variable `lm_hold`. Therefore, the command shown in Figure 11, when run, will produce the linear regression line in red on our graph.

Figure 11 Of course, this does not produce much at all in the Console pane.

Figure 12 But the graph, in the Plot pane, now shows that line.

Figure 13 Where Figure 10 showed only the most essential information about our linear model, we can formulate the command `summary(lm_hold)` to give us more information about the model.

Figure 14 The result of running that command is shown in Figure 15. Here we get information on the residual values, more information about each of the two Coefficients, and even more data that includes the value for r2, namely 0.9849.

Figure 15 The commands shown in Figure 17 allow us to look at an, eventually use, the coefficient values without retyping them.

Figure 16 Looking at Figure 17 and refering back to Figures 10 and 15, we can see that we have the coefficient values in variables that we can use.

Figure 17 One such use would be to find the values that we would expect, the values that the equation ould predict for given values of x. That is, we can answer the question "What are the exected values from our model if x has the value -4? Or if x has the value 5, or 6, or 16?"

To do this we can put those x-values into a variable, in this example it is called `x_vals`, and then we can evaluate our linear regression equation for each of those values. Figure 18 holds the commands to do this.

Figure 18 Running those commands produces the output shown in Figure 19.

Figure 19 The interpretation of those values is
• if x = -4, then the regression equation expects y to be 16.236643,
• if x = 5, then the regression equation expects y to be -1.835156,
• if x = 6, then the regression equation expects y to be -3.843134, and
• if x = 16, then the regression equation expects y to be -23.922911.
It is interesting to note that raising the x value by 1 changes the y value by the exat amount of the slope, the second coefficient. This is exactly what we would expect from the equation
`y = 8.205 + (-2.008)x`
Also, please note that the four values that we used for x, namely, -4, 5, 6, and 16, are all within the range of the x-values; they go from -8.3 to 25.5. Therefore, the four examples are all interpolations. There is nothing stopping us from using the regression equation to find the expeted value when x is outside of that range, say for x=-12 or x=30, but those would be extrapolations, and as such they are muh more suspect.

Rather than just find the four points given above, we could use the same tehnique to find the expected values for all of the original x-values. The commands to do this are given in Figure 20.

Figure 20 Figure 21 show the results of running those commands.

Figure 21 But, since we have the observed values, the y-values in Table 1, and we now have the expected values, we can find the residual values by finding the observed - expeted values via the command `L2-all_pred_y`, as shown in Figure 22.

Figure 22 Figure 23 gives the Console view of running those commands.

Figure 23 Of course, we did not have to do any of that computation. The linear model, `lm_hold`, actually contains the residuals. We use the command `residuas(lm_hold)` to extract those values from the model. See Figure 24.

Figure 24 The values are seen in Figure 25. There each value is shown with its sequential number, but the residual values displayed are the same as those that we computed and saw in Figure 23.

Figure 25 The reason that we want the residual values is that we want to look at the plot of the `L1` values agains the residual values. The command is shown in Figure 26.

Figure 26 The desired plot is shown in Figure 27. The important part of this is that we should find the plotted points spread out, without a pattern, aross the graph. In this case that is true. If it were not true then we would have some concerns about the regression equation. [See worksheet06a for an example where the residuals do not scatter all over the place.]

Figure 27 Recall that in Figure 15 we saw that the r2 value was 0.9849, indicating that the linear model explains about 98.5% of the variability of the data. This is actually the square of the correlation coefficient that we call r. To compute r directly we use the `cor(L1,L2)` command shown in Figure 28.

Figure 28 The result, shown in Figure 29, gives r= = -0.9924282. Note that the value is negative, which we expect sine the slope of our linear regression line is negative.

Figure 29 That is enough for this example. We note that we have yet to save our `ws06.R` file, since its name is in red letters (Figure 30).

Figure 30 We click on the icon to save the file, turning the file name use black letters (Figure 31).

Figure 31 In the Console we give the command `q()`, press ENTER, then respond to the prompt with `y` (Figure 32).

Figure 32 Then press ENTER to terminate the RStudio session. Figure 33 shows the then current status of our directory (folder), showing our `ws06.R` file along with, since this was done on a Windows machine, the two hidden files `.RData` and `.Rhistory`.

Figure 33 Here is a listing of the complete contents of the `descriptive.R` file:
```#This is used for worksheet 06

# load the gnrnd4 function and then run it to get our values
source("../gnrnd4.R")
gnrnd4( key1=6479962906, key2=3170200704, key3=34700085 )
# and, of course, look at the generated values
L1
L2

# the values look good.
# generate a quick and dirty scatter plot
plot(L1,L2)

# we can improve that a bit and draw in some grid lines
# along with the two axes via
plot(L1, L2, main="For Worksheet 6", xlab="x values",
ylab="y values", las=1)
abline(h=0,lwd=2, col="black")
abline(v=0,lwd=2, col="black")
abline(h=seq(-40, 20,10),col="dark grey", lty="dotted")
abline(v=seq(-5, 25, 5), col="dark grey", lty="dotted")

# the graph certainly suggests a linear relation
# We will have R do the computation
lm_hold <- lm(L2~L1)
lm_hold

# we can even add that line to our previous plot
abline(lm_hold, col="red", lty="solid", lwd=2)

# we can get a slightly more in-depth look at the linear model
summary( lm_hold )

# rather than type in the displayed valeus of the
# intercept and the slope, we can pull them out of the model
coeffs <- coefficients( lm_hold )
# and we can look at them together
coeffs
# or separately
coeffs
coeffs

# That means that we can answer questions such as "What is the
# predicted y value when x is -4, 5, 6, and 16?"
x_vals <- c(-4, 5, 6, 16 )
pred_y <- coeffs + coeffs*x_vals
pred_y

# We could even get all of the predicted (expected) values
# from each of the x-values that we had in the table.
# They are in L1
all_pred_y <- coeffs + coeffs*L1
all_pred_y

# Then, we could compute all of the residual values, that is the
# observed-expected values
all_resid <- L2-all_pred_y
all_resid

# Of course, we really did not need to do all that work to get
# the residual values because we can extract them from the
# linear model
residuals( lm_hold )

# However we obtain the residuals, we want to look at the plot
# of the x-values and their associated residual values to
# be sure that they are scattered around.

plot(L1, all_resid )

# one other thing that we may want to do it to find the
# correlation coefficient for this model
cor(L1,L2)
```