Here is a typical textbook linear regression problem. Table 1 gives 22 pairs of values, an X value and a Y value in each pair. Thus the first pair of values is the point (81,254) and the second pair is the point (54,182) and so on. If we were to make a scatter plot of the 22 points it would look like the plot shown in Image 1.

There are a few special things to point out about the scatter plot shown in Image 1. First, this is a 0-based plot. Second, before doing the plot, I computed the mean of both the X-values (60.18182) and the Y-values (200.13636). Then I could graph, in red, a vertical line and a horizontal line at those two means. The point where the two lines cross, that is, the point (xbar,ybar), will be on the final regression line. Third, the orange line on the plot is a graph of the equation

`y=(35/13)x+(546/13)`


The points plotted in Image 1 do appear to be in some general linear relation. Our task is to determine the coefficients of the linear regression equation computed as the line that best fits the data of Table 1. We will do this in R.
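Before walking through the session, it is worth verifying the claim above that the point (xbar,ybar) always lies on the least-squares regression line. Here is a minimal sketch using made-up data, since the actual Table 1 values are produced by `gnrnd4`:

```r
# Hypothetical (x, y) pairs -- NOT the Table 1 data
x <- c(30, 42, 55, 61, 74, 88)
y <- c(121, 158, 183, 204, 239, 268)
fit <- lm(y ~ x)              # least-squares line y = a + b*x
a <- coefficients(fit)[1]
b <- coefficients(fit)[2]
# evaluating the fitted line at mean(x) recovers mean(y)
isTRUE(all.equal(unname(a + b*mean(x)), mean(y)))   # TRUE
```

This holds for any data set, which is why drawing the red mean lines first gives us one guaranteed point on the eventual regression line.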

As usual, our first step is to insert our USB drive, create a new folder (directory), here called `worksheet05`, copy our `model.R` file into that folder, and rename the copy as `more_regression.R`. Double clicking on the `more_regression.R` file starts our R session and opens the editor on `more_regression.R`, which still holds the old contents of `model.R`.

This is clearly not what we need. We can replace this with the following comments and statements:

```
# This session will be used to demonstrate setting up a
# textbook problem.  That is, we will just be given two
# lists with values paired across the lists and we will be
# asked to find the linear regression equation for that data.
# First we generate the lists, and then look at them
# just to confirm the values.  [Since this is a new
# workspace we need to load gnrnd4 before we can use it.]
source( "../gnrnd4.R")
gnrnd4( key1=441852106, key2=5125463513, key3=8400026 )
L1
L2
```

This is shown in Figure 3 where we have highlighted the actual comments.

Running the statements of Figure 3 gives us the generated lists, L1 and L2, displayed in the console.

After we have verified that we have the correct values by comparing them to the ones in Table 1 we might want to get at least a quick look at the scatter plot of those pairs of values. To do this we can add the new lines

```
# We might as well have a quick peek at the scatter
# plot so that we can get a visual feel for the
# possible relationship
plot(L1,L2)   # just an unadorned scatter plot
```

These new lines are shown highlighted in Figure 5.

Running those highlighted lines produces the plot shown in Figure 6.

The plot in Figure 6 is not as detailed as was our original image, Image 1. There is no need, here, to duplicate that detail, but we can take a minute just to add a few grid lines to make the plot easier to read. The new lines to do this are

```
# but once we see it we can do a few quick steps
# to make it easier to read
abline( h=seq(150,300,50), lty=3, col="darkgreen")
abline( v=seq( 30,90,10), lty=3, col="darkgreen")
```

Running those statements changes the graph to that shown in Figure 8.

The plot shown in Figure 8 serves most of our purposes. We can easily identify each pair of (X,Y) values of Table 1 with a plotted point in the graph. Also, we see that the points, though not on a line, are at least "sort of along a line". Back in algebra we had learned that the equation of a line is

`y=mx+b`

In statistics it is traditional to write that same equation as

`y=a+bx`

where `a` is the intercept and `b` is the slope. Our job is to find the values of `a` and `b` that determine the line of best fit. To do that we add the lines

```
# Then getting the linear model, i.e., doing the
# linear regression is an easy step
best_line <- lm(L2~L1)  # notice that we will store the result
                        # for subsequent use
```

These lines are shown highlighted in Figure 9.

Running that command gives us the output of Figure 10.

In particular, there was no output because we assigned the value produced by the

`lm(L2~L1)`

function to the variable `best_line`. To see the stored result we can simply give the name

`best_line`

as a command.

The output of running the highlighted lines of Figure 11 is shown in Figure 12.

This output really gives us just what we wanted, the values for the coefficients `a` and `b` of the equation `y=a+bx`. In particular, `a` is `43.223` and `b` is `2.607`, so our regression equation is

y = 43.223 + 2.607*x

We can use the command `summary(best_line)`, shown in Figure 13, to get much more information.

Running that command produces the output shown in Figure 14.

Our same two values, though with a few more displayed digits, appear in Figure 14, in the middle of the display, under the title `Estimate`.
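As an aside, the table that `summary()` prints, including the `Estimate` column, can also be pulled out programmatically as a matrix. A sketch with stand-in data, since the real `L1` and `L2` come from `gnrnd4`:

```r
# Stand-in data -- not the generated L1/L2 lists
L1 <- c(30, 42, 55, 61, 74, 88)
L2 <- c(121, 158, 183, 204, 239, 268)
ct <- summary(lm(L2 ~ L1))$coefficients
colnames(ct)   # "Estimate" "Std. Error" "t value" "Pr(>|t|)"
ct["(Intercept)", "Estimate"]   # the intercept, a
ct["L1", "Estimate"]            # the slope, b
```

Indexing the matrix by row and column name is handy when a later computation needs just one of the displayed values.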

We can add the regression line held in `best_line` to our scatter plot with the lines

```
# And just to get a view of this we can plot it
abline( best_line, col="blue")
```

where the choice of the blue color was just an arbitrary one.

The result of running the highlighted command of Figure 15 is the change in the plot shown in Figure 16.

We could use that line to "read" values predicted by the equation. For example, following the grid lines, if X has the value 90 we expect the Y value to be around 275. It is a bit harder to estimate the expected value if we start with X being 33. Again, trying to read the graph, we expect that if X is 33 then Y should be about 130.
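We can check those graph readings directly by plugging the X values into the fitted equation:

```r
# coefficients as reported above (rounded to three decimals)
a <- 43.223
b <- 2.607
a + b*90   # 277.853, consistent with "around 275" read from the grid
a + b*33   # 129.254, consistent with "about 130"
```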

Reading values from the graph is just not good enough. We want to be able to calculate such values. Using the equation

y = 43.223 + 2.607*x

we could just type the expression

43.223 + 2.607*33

but it is better to have R pull the coefficients out of the stored linear model for us. The lines to do this are

```
# We want to aim for being able to get the value predicted
# by the linear model for a given value of X, say 33.
# First we can just pull off the coefficients in a
# separate variable.
coeff <- coefficients( best_line )
coeff
```

Running those highlighted lines produces the output shown in Figure 18.

In Figure 18 we get even more significant digits for the two coefficient values.

Then we can create a command to evaluate

a + b*33

namely

`coeff[1]+coeff[2]*33`

The result, shown in Figure 20, confirms that our approximation from looking at the graph was not all that bad.

We can follow that same pattern to get the predicted (expected) values for a whole collection of X-values. Figure 21 shows the two statements to generate 11 values from 30 through 80 and then display those values. We follow that with a command, almost identical to the one we used to generate Figure 20, namely,

`coeff[1]+coeff[2]*x`

which computes the predicted value for every element of `x` at once.

Running the highlighted lines of Figure 22 generates the output shown in Figure 23.
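An alternative to the `coeff[1]+coeff[2]*x` arithmetic is R's built-in `predict()` function, which evaluates a stored model at new X values. A sketch with stand-in data, since the real `L1` and `L2` come from `gnrnd4`:

```r
# Stand-in data -- not the generated lists
L1 <- c(30, 42, 55, 61, 74, 88)
L2 <- c(121, 158, 183, 204, 239, 268)
best_line <- lm(L2 ~ L1)
coeff <- coefficients(best_line)
x <- seq( 30, 80, 5 )
# predict() wants the new values under the same name used in the model
preds <- predict(best_line, newdata = data.frame(L1 = x))
# the result matches the hand-rolled computation
isTRUE(all.equal(unname(preds), unname(coeff[1] + coeff[2]*x)))   # TRUE
```

The hand-rolled version makes the arithmetic visible, which suits a textbook walkthrough; `predict()` scales better once models have more than one term.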

Another thing that we should do is to look at the scatter plot of the residual values against the

`x`

values (held in `L1`). The statement

`off_by<-residuals(best_line)`

pulls the residual values out of the linear model. Then we can plot them against the known

`x`

values with the command `plot( L1, off_by )`.

The generated plot is shown in Figure 25.

As before, grid lines would be helpful in that plot. Therefore, we can construct such helpful grid lines by using statements such as those shown in Figure 26.

Once those are run the plot changes to that shown in Figure 27.

What we wanted to see in Figure 27 was that there is no particular pattern to those residual values. That is indeed the case.
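A quick numeric companion to that visual check: whenever the model includes an intercept, the least-squares residuals sum to (essentially) zero. A sketch with stand-in data:

```r
# Stand-in data -- not the generated lists
L1 <- c(30, 42, 55, 61, 74, 88)
L2 <- c(121, 158, 183, 204, 239, 268)
off_by <- residuals( lm(L2 ~ L1) )
sum(off_by)   # essentially 0, up to floating-point rounding
```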

The last thing we want to see is the correlation coefficient, which we get with the command

`cor(L1,L2)`

The computed value is

`0.9244026`

A value that close to 1 indicates a strong positive linear relation. Note that squaring this value gives

`0.8545`

which is the "Multiple R-squared" value that appeared in the summary output of Figure 14.
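That connection between the correlation coefficient and the R-squared value from `summary()` can be confirmed in a couple of lines; again a sketch with stand-in data:

```r
# Stand-in data -- not the generated lists
L1 <- c(30, 42, 55, 61, 74, 88)
L2 <- c(121, 158, 183, 204, 239, 268)
r <- cor(L1, L2)
rsq <- summary(lm(L2 ~ L1))$r.squared
isTRUE(all.equal(r^2, rsq))   # TRUE
```

This identity holds for simple (one-predictor) regression with an intercept, which is exactly our situation.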

At this point we can start to close down our systems. First, we click on the icon to save the file that is listed in our editor pane. Then, in the console, we give the

`q()`

command and answer

`y`

when R asks if we want to save the workspace image. Once we press the ENTER key the session ends and R writes the

`.RData`

and

`.Rhistory`

files into our working directory.

Here is a listing of the complete contents of the `more_regression.R` file:

```
# This session will be used to demonstrate setting up a
# textbook problem.  That is, we will just be given two
# lists with values paired across the lists and we will be
# asked to find the linear regression equation for that data.
# First we generate the lists, and then look at them
# just to confirm the values.  [Since this is a new
# workspace we need to load gnrnd4 before we can use it.]
source( "../gnrnd4.R")
gnrnd4( key1=441852106, key2=5125463513, key3=8400026 )
L1
L2
# We might as well have a quick peek at the scatter
# plot so that we can get a visual feel for the
# possible relationship
plot(L1,L2)   # just an unadorned scatter plot
# but once we see it we can do a few quick steps
# to make it easier to read
abline( h=seq(150,300,50), lty=3, col="darkgreen")
abline( v=seq( 30,90,10), lty=3, col="darkgreen")
# Then getting the linear model, i.e., doing the
# linear regression is an easy step
best_line <- lm(L2~L1)  # notice that we will store the result
                        # for subsequent use
# the first such use is just to get a quick look at
# the result
best_line
# Then we can get a more detailed look at the values
summary( best_line )
# And just to get a view of this we can plot it
abline( best_line, col="blue")
# We want to aim for being able to get the value predicted
# by the linear model for a given value of X, say 33.
# First we can just pull off the coefficients in a
# separate variable.
coeff <- coefficients( best_line )
coeff
# Then we can use those coefficients to express the
# right side of our linear model y = a + b*x
# thus giving us the predicted (the expected) value.
coeff[1]+coeff[2]*33
# We could get a whole pile of expected values, say
# for all the x values that are multiples of 5 and between
# 30 and 80, inclusive.
# first generate those x values
x <- seq( 30, 80, 5 )
x
# then do the computation on all of them
coeff[1] + coeff[2]*x
# Then, just to be more confident in our model
# we might want to look at a plot of the residual values.
# First pull those values out of the linear model
off_by <- residuals( best_line )
off_by
# Then plot them against the known x values
plot( L1, off_by )
# and, again, we can make the plot a bit easier
# to read
abline( h=seq(-20,40,20), lty=3, col="darkgreen")
abline( v=seq( 30,90,10), lty=3, col="darkgreen")
# Finally, we can get the correlation coefficient
cor( L1, L2 )
```

©Roger M. Palay Saline, MI 48176 September, 2016