Worksheet 04: Linear Regression and Correlation

Worksheet 04: Linear Regression and Correlationt

Return to Topics page
Years ago, back in the early and mid 90's to be more precise, I would go to an exerccise program. They had a stationary bike that had various sensors on it and that provided the users with data related to the workout session. In particular, every 30 seconds the device would print out a new line giving the

Elaspsed time,
Current cadence (i.e., number of pedal rotations per minute),
The watts generated (i.e., the power of the pedalling),
The users pulse (i.e., heart rate, beats per minute),
The resistence (i.e., a measure of the load that the bike was putting on the pedals), and
Calories expended (i.e., the number of calories brned during the previous 30 seconds).

Image 1 shows a table of such values for the first seven (7) minutes of my Sept. 10, 1996 session.

Image 1

Just looking at the figures in Image 1 it looks like my pulse just rises steadily over the seven minutes. We can use R to take a look at that. To do this we will use our USB drive and we will do the usual, create a new folder (in this case called worksheet04, copy model.R into that folder, and rename the file (in this case rogerbike.R). This is shown in Figure 1.

Figure 1

Then, we double click on the rogerbike.R file to open a new RStudio session. What we want to do is to get the data that was in Image 1 into our RStudio session. Figure 2 shows some comments and the four statements that not only specify the data but also display those values so that we can verify our action.

Figure 2

Once we run the highlighted lines of Figure 2 we get the output shown in Figure 3 in the Console pane.

Figure 3

It is important to verify that the values shown in Figure 3 match the values we had in Image 1 and they do. In addition, in the Environment pane, shown in Figure 4, we can see the two newly created variables and at least a few of the values that are in those variables.

Figure 4

The next thing to do is to at least look at a scatter plot of the pulse values versus the time (the minutes). Figure 5 shows the command as written in the Editor pane.

Figure 5

Figure 6 shows that nothing much happens in the Console pane when we run the highlighted lines of Figure 5.

Figure 6

But we do get, in the Plots tab, a nice scatter plot of the values.

Figure 7

Figure 7 is really quite accurate, but without grid lines it is a bit hard to see just where each of the dots is on the graph. Figure 8 shows the commands, in the Editor pane, needed to add grid lines to our graph.

Figure 8

Once those have been run we see the updated graph in Figure 9.

Figure 9

Clearly the dots in Figure 9 do not lie on a straight line, but they are not far from being on one. Our chalenge is to find the equation of the line that best fits this data. To do this we will use the function lm(pulse~minutes). That function finds the best inear model to fit the data. Note the formation of the command, especially the use of the ~ character between the two variables that hold the data values. Also, note the order of the variables. We want pulse to be considered as a function of minutes. Therefore, we put the pulse variable first.

When we use the function in Figure 10 we assign the result of the fnction to a new variable, lin_model. We did this so that we can make reference to this result later. In fact the next command, again shown in Figure 10, just uses the name of the variable. That way R will display the contents of the variable, i.e., the result of using the lm() function.

Figure 10

When the highlighted lines of Figure 10 are run the result appears in the Console pane, shown in Figure 11.

Figure 11

The very brief output shown in Figure 11 is meant to tell us that the linear equation that best fits the data is given by pulse = 82.593 + 8.127*minutes Tht is really all that we need to know. With that equation, if you specify a certain number of minutes, say 3.2 minutes, then we can "predict" that the pulse, the heart rate, at that time in the exercise session will be 82.593+8.127*3.2 or 108.5994, or, in practical terms, either 108 or 109.

The linear model actually has much more in it than just the values shown in Figure 11. The summary(line_model) command shown in Figure 12 will display much more information.

Figure 12

Running the highlighted lines of Figure 12 gives Figure 13 in the Console pane.

Figure 13

Notice that in the expanded output of Figure 13 the same two values, 82.5934 and 8.7125 are given again, though in a completely different place (highlighted by the green box that has been placed in Figure 13).

Of course, the first thing that we want to do is to see the result of our work. Therefore, we formulate the abline() function shown in Figure 14 to add the computed line, in dar red, to our existing graph.

Figure 14

Sure enough, when we run that command we get the graph shown in Figure 15.

Figure 15

As another example of using the linear model to predict the pulse based onthe heart rate, we could form the command shown in Figure 16, using the values for the coefficients given back in Figure 13.

Figure 16

Running that command produces the output shown in Figure 17 which tells us that the predicted, the expected, value is 121.199.

Figure 17

In the two instances above where we have used the values from the linear model we have relied on our being able to accurately transcribe those values. We can just extract them from the model. This is demonstrated by the commands shown in Figure 18.

Figure 18

Once those commands are executed we can see that the variable lm_coeff holds our two values, as shown in Figure 19.

Figure 19

In order to use the values we just refer to them as lm_coeff[1] and lm_coeff[2]. Therefore, we can redo the problem from Figure 16 as the commands shown in Figure 20.

Figure 20

The result, shown in Figure 21, is actually a bit more accurate since it is not using the rounded values that we used previously for the coefficients.

Figure 21

An interesting feature of the design of R is that we can get multiple values computed for a problem. For example, if we want to know the expected (prediced ) value for minutes equal to 1.75, 2.25, 3.75, 4.25, and 8.3333 then we can use the statements shown in Figure 22

Figure 22

Running those highighted lines produces the output of Figure 23.

Figure 23

Using that same strategy allows us to compute the residual values, that is, the observed minus the expectd value for each item in the list of minutes. The commands of Figure 24 will, when run, show the observed values, then show the expected values, and finally show the (observed-expected) values.

Figure 24

The results can be seen in Figure 25.

Figure 25

Looking at the residual values is important. It is so important that the linear model that we created, lin_model, actually holds all of those values. In fact, if we look back at Figure 13, shown again below, we see that the output gives us a summary of the residual values (outlined in red here).

Figure 13 (repeated)

We can easily scan the list of our computed residuals in Figure 25 and verify that the minimum value was -3.9121 and the maximum was 4.3429, as shown in that summary area of Figure 13.

In order to pull out the residuals from the linear model, rather than compute them as we did above, we use the function residuals(). In Figure 26 we have the statement that uses that function and assigns the results to the variable resid_values. Then we display the values in that variable.

Figure 26

Figure 27 shows those values. When we compare these to the values that we computed and displayed in Figure 25 we see that they are the same.

Figure 27

Now that we have the values stored in resid_values we can take the next step, create of a plot of the residula values against the minutes associated with them. The command in Figure 28 will create such a plot.

Figure 28

Once we have run the commands of Figure 28 we get the plot shown in Figure 29.

Figure 29

What we want to see here is that there is no pattern to the plotted points. They should seem like they were just randomly placed on the graph. If they formed a pattern then we would be suspect of our linear model.

What remains is for us to find the correlation coefficient for the linear model. The function cor() will do this. The command as formulated in the Editor pane is shown in Figure 30.

Figure 30

The result of running the command is shown in Figure 31.

Figure 31

Recall that a correlation coefficient that is close to either 1 or -1 is an indication of a strong linear relationship between the variables. In this case, the value 0.9900357 is a really high correlation.

We can take a minute to redo that computation but store the result in the variable hold_cor. Then we find the square of that value.

Figure 32

THe resut of the computation is shown in Figure 33.

Figure 33

We have seen the square of the correlation coefficient before! Again, back in Figure 13 it was given toward the bottom of the display as Multiple R-squared: 0.9802. That portion of Figure 13 is repeated below with the specific area outlined in red.

Figure 13 (just a portion)

The correlation coefficient is often symbolized by the name r. Thus the square of the correlation coefficient is known as r², or r-squared.

All that remains to be done is to end our session. We start by saving the file shown in the Editor, clicking on the

icon, thus changing the file name to display in black, as shown in Figure 34.

Figure 34

THen, in the Console pane we enter the q() function and then we respond to the question with y as shown in Figure 35.

Figure 35

That ends our RStudio session and leaves us looking at the file explorer display, shown in Figure 36. [As usual, on a Mac in Finder we would not see the .RData and .Rhistory files even though they are in the directory.]

Figure 36

Here is a listing of the complete contents of the rogerbike.R file:

#  We will look at real data from a 9/10/1996 exercise
#  session that Roger had working on a special stationary
#  bike that recorded various values during the exercise
#  session.  In particular, we will check out a linear
#  model that tries to predice PULSe from the TIME elapsesd
#  in the session.

#  first we need to enter the data points, TIME and PULSE
#  from the session record
minutes <- seq( from=0.5, to=7.0, by=0.5 )
minutes
pulse <- c(91, 93, 96, 97, 99, 105, 109, 112,
           119, 123, 128, 132, 139, 140)
pulse
#  now, how about a quick look at a scatter plot of the
#  data points, nothing fancy.
plot( minutes, pulse )
  # it might be a bit easier to read the graph if we
  # have some grid lines
abline( h=seq(90,140,5), col="darkgrey", lty="dotted")
abline( v=seq(0.5,7,0.5), col="darkgrey", lty="dotted")

  # then we can ask for the linear model for these
  # data points
lin_model <- lm( pulse~minutes )
  # and we can look at that model
lin_model
  # that gave us a little of the information
  # but we could get
  # much more by using the summary function
summary( lin_model )
  # based on the results we have seen we can plot
  # the linear
  # regression line by using the abline function
abline( lin_model, col="darkred" )
  # if we want to find the expected value of the pulse
  # when the minutes is 4.75 then we could just type
  # the following command
82.5934+8.1275*4.75
  # where the 82.5934 and 8.1275 come from the
  # linear model output

  # we can actually extract those values,
  # 82.5934 and 8.1275,
  # from the linear model so that we can use them
  # without typing them
lm_coeff <- coefficients( lin_model )
lm_coeff
  # then our previous question,
  # find the expected value when
  # minutes is 4.75 becomes
lm_coeff[1]+lm_coeff[2]*4.75
  # the slight difference is due to the fact that this
  # latter method is not rounding the values as with
  # only 4 decimal places.

  # The beauty of this approach is that
  # we can get expected
  # values for lots of possible values for minutes.
  # For example, for minutes having the values
  # 1.75, 2.25, 3.75, 4.25, and 8.3333 we could say
use_minutes <- c(1.75, 2.25, 3.75, 4.25,  8.3333)
lm_coeff[1]+lm_coeff[2]*use_minutes

  # remember, we know all of the observed values
pulse
  # and now we can get all the expected valeus
lm_coeff[1]+lm_coeff[2]*minutes
  # the residual values are the observed - expected values
  # and we could compute these as
pulse - (lm_coeff[1]+lm_coeff[2]*minutes)

  # However, we have a way to get these directly.
  # We can get the residual values for all of our
  # original data points
  # (that is the Observed values - the
  # expected value) by
resid_values <- residuals( lin_model )
resid_values
  # and we can get a new plot, this time of the
  # residual values vs. minutes to see if there
  # is any pattern to that.
plot( minutes, resid_values )


  # and if we need to know the correlation coeeficient
  # we can get that via the cor() funtion
cor(minutes,pulse)
  # By the way, if we were to save that and get its square,
  # we will get the same value shown back in the summary
  # output that was named R-squared
hold_cor <- cor(minutes,pulse)
hold_cor^2