## Worksheet 04: Linear Regression and Correlationt

Years ago, back in the early and mid 90's to be more precise, I would go to an exerccise program. They had a stationary bike that had various sensors on it and that provided the users with data related to the workout session. In particular, every 30 seconds the device would print out a new line giving the
• Elaspsed time,
• Current cadence (i.e., number of pedal rotations per minute),
• The watts generated (i.e., the power of the pedalling),
• The users pulse (i.e., heart rate, beats per minute),
• The resistence (i.e., a measure of the load that the bike was putting on the pedals), and
• Calories expended (i.e., the number of calories brned during the previous 30 seconds).
Image 1 shows a table of such values for the first seven (7) minutes of my Sept. 10, 1996 session.

Image 1 Just looking at the figures in Image 1 it looks like my pulse just rises steadily over the seven minutes. We can use R to take a look at that. To do this we will use our USB drive and we will do the usual, create a new folder (in this case called `worksheet04`, copy `model.R` into that folder, and rename the file (in this case `rogerbike.R`). This is shown in Figure 1.

Figure 1 Then, we double click on the `rogerbike.R` file to open a new RStudio session. What we want to do is to get the data that was in Image 1 into our RStudio session. Figure 2 shows some comments and the four statements that not only specify the data but also display those values so that we can verify our action.

Figure 2 Once we run the highlighted lines of Figure 2 we get the output shown in Figure 3 in the Console pane.

Figure 3 It is important to verify that the values shown in Figure 3 match the values we had in Image 1 and they do. In addition, in the Environment pane, shown in Figure 4, we can see the two newly created variables and at least a few of the values that are in those variables.

Figure 4 The next thing to do is to at least look at a scatter plot of the pulse values versus the time (the minutes). Figure 5 shows the command as written in the Editor pane.

Figure 5 Figure 6 shows that nothing much happens in the Console pane when we run the highlighted lines of Figure 5.

Figure 6 But we do get, in the Plots tab, a nice scatter plot of the values.

Figure 7 Figure 7 is really quite accurate, but without grid lines it is a bit hard to see just where each of the dots is on the graph. Figure 8 shows the commands, in the Editor pane, needed to add grid lines to our graph.

Figure 8 Once those have been run we see the updated graph in Figure 9.

Figure 9 Clearly the dots in Figure 9 do not lie on a straight line, but they are not far from being on one. Our chalenge is to find the equation of the line that best fits this data. To do this we will use the function `lm(pulse~minutes)`. That function finds the best inear model to fit the data. Note the formation of the command, especially the use of the ~ character between the two variables that hold the data values. Also, note the order of the variables. We want pulse to be considered as a function of minutes. Therefore, we put the `pulse` variable first.

When we use the function in Figure 10 we assign the result of the fnction to a new variable, `lin_model`. We did this so that we can make reference to this result later. In fact the next command, again shown in Figure 10, just uses the name of the variable. That way R will display the contents of the variable, i.e., the result of using the `lm()` function.

Figure 10 When the highlighted lines of Figure 10 are run the result appears in the Console pane, shown in Figure 11.

Figure 11 The very brief output shown in Figure 11 is meant to tell us that the linear equation that best fits the data is given by
`pulse = 82.593 + 8.127*minutes`
Tht is really all that we need to know. With that equation, if you specify a certain number of minutes, say 3.2 minutes, then we can "predict" that the pulse, the heart rate, at that time in the exercise session will be `82.593+8.127*3.2` or `108.5994`, or, in practical terms, either `108` or `109`.

The linear model actually has much more in it than just the values shown in Figure 11. The `summary(line_model)` command shown in Figure 12 will display much more information.

Figure 12 Running the highlighted lines of Figure 12 gives Figure 13 in the Console pane.

Figure 13 Notice that in the expanded output of Figure 13 the same two values, `82.5934` and `8.7125` are given again, though in a completely different place (highlighted by the green box that has been placed in Figure 13).

Of course, the first thing that we want to do is to see the result of our work. Therefore, we formulate the `abline()` function shown in Figure 14 to add the computed line, in dar red, to our existing graph.

Figure 14 Sure enough, when we run that command we get the graph shown in Figure 15.

Figure 15 As another example of using the linear model to predict the pulse based onthe heart rate, we could form the command shown in Figure 16, using the values for the coefficients given back in Figure 13.

Figure 16 Running that command produces the output shown in Figure 17 which tells us that the predicted, the expected, value is `121.199`.

Figure 17 In the two instances above where we have used the values from the linear model we have relied on our being able to accurately transcribe those values. We can just extract them from the model. This is demonstrated by the commands shown in Figure 18.

Figure 18 Once those commands are executed we can see that the variable `lm_coeff` holds our two values, as shown in Figure 19.

Figure 19 In order to use the values we just refer to them as `lm_coeff` and `lm_coeff`. Therefore, we can redo the problem from Figure 16 as the commands shown in Figure 20.

Figure 20 The result, shown in Figure 21, is actually a bit more accurate since it is not using the rounded values that we used previously for the coefficients.

Figure 21 An interesting feature of the design of R is that we can get multiple values computed for a problem. For example, if we want to know the expected (prediced ) value for minutes equal to 1.75, 2.25, 3.75, 4.25, and 8.3333 then we can use the statements shown in Figure 22

Figure 22 Running those highighted lines produces the output of Figure 23.

Figure 23 Using that same strategy allows us to compute the residual values, that is, the observed minus the expectd value for each item in the list of `minutes.` The commands of Figure 24 will, when run, show the observed values, then show the expected values, and finally show the (observed-expected) values.

Figure 24 The results can be seen in Figure 25.

Figure 25 Looking at the residual values is important. It is so important that the linear model that we created, `lin_model`, actually holds all of those values. In fact, if we look back at Figure 13, shown again below, we see that the output gives us a summary of the residual values (outlined in red here).

Figure 13 (repeated) We can easily scan the list of our computed residuals in Figure 25 and verify that the minimum value was `-3.9121` and the maximum was `4.3429`, as shown in that summary area of Figure 13.

In order to pull out the residuals from the linear model, rather than compute them as we did above, we use the function `residuals()`. In Figure 26 we have the statement that uses that function and assigns the results to the variable `resid_values`. Then we display the values in that variable.

Figure 26 Figure 27 shows those values. When we compare these to the values that we computed and displayed in Figure 25 we see that they are the same.

Figure 27 Now that we have the values stored in `resid_values` we can take the next step, create of a plot of the residula values against the minutes associated with them. The command in Figure 28 will create such a plot.

Figure 28 Once we have run the commands of Figure 28 we get the plot shown in Figure 29.

Figure 29 What we want to see here is that there is no pattern to the plotted points. They should seem like they were just randomly placed on the graph. If they formed a pattern then we would be suspect of our linear model.

What remains is for us to find the correlation coefficient for the linear model. The function `cor()` will do this. The command as formulated in the Editor pane is shown in Figure 30.

Figure 30 The result of running the command is shown in Figure 31.

Figure 31 Recall that a correlation coefficient that is close to either `1` or `-1` is an indication of a strong linear relationship between the variables. In this case, the value `0.9900357` is a really high correlation.

We can take a minute to redo that computation but store the result in the variable `hold_cor`. Then we find the square of that value.

Figure 32 THe resut of the computation is shown in Figure 33.

Figure 33 We have seen the square of the correlation coefficient before! Again, back in Figure 13 it was given toward the bottom of the display as `Multiple R-squared: 0.9802`. That portion of Figure 13 is repeated below with the specific area outlined in red.

Figure 13 (just a portion) The correlation coefficient is often symbolized by the name r. Thus the square of the `correlation coefficient` is known as r2, or r-squared.

All that remains to be done is to end our session. We start by saving the file shown in the Editor, clicking on the icon, thus changing the file name to display in black, as shown in Figure 34.

Figure 34 THen, in the Console pane we enter the `q()` function and then we respond to the question with `y` as shown in Figure 35.

Figure 35 That ends our RStudio session and leaves us looking at the file explorer display, shown in Figure 36. [As usual, on a Mac in Finder we would not see the `.RData` and `.Rhistory` files even though they are in the directory.]

Figure 36 Here is a listing of the complete contents of the `rogerbike.R` file:
```#  We will look at real data from a 9/10/1996 exercise
#  session that Roger had working on a special stationary
#  bike that recorded various values during the exercise
#  session.  In particular, we will check out a linear
#  model that tries to predice PULSe from the TIME elapsesd
#  in the session.

#  first we need to enter the data points, TIME and PULSE
#  from the session record
minutes <- seq( from=0.5, to=7.0, by=0.5 )
minutes
pulse <- c(91, 93, 96, 97, 99, 105, 109, 112,
119, 123, 128, 132, 139, 140)
pulse
#  now, how about a quick look at a scatter plot of the
#  data points, nothing fancy.
plot( minutes, pulse )
# it might be a bit easier to read the graph if we
# have some grid lines
abline( h=seq(90,140,5), col="darkgrey", lty="dotted")
abline( v=seq(0.5,7,0.5), col="darkgrey", lty="dotted")

# then we can ask for the linear model for these
# data points
lin_model <- lm( pulse~minutes )
# and we can look at that model
lin_model
# that gave us a little of the information
# but we could get
# much more by using the summary function
summary( lin_model )
# based on the results we have seen we can plot
# the linear
# regression line by using the abline function
abline( lin_model, col="darkred" )
# if we want to find the expected value of the pulse
# when the minutes is 4.75 then we could just type
# the following command
82.5934+8.1275*4.75
# where the 82.5934 and 8.1275 come from the
# linear model output

# we can actually extract those values,
# 82.5934 and 8.1275,
# from the linear model so that we can use them
# without typing them
lm_coeff <- coefficients( lin_model )
lm_coeff
# then our previous question,
# find the expected value when
# minutes is 4.75 becomes
lm_coeff+lm_coeff*4.75
# the slight difference is due to the fact that this
# latter method is not rounding the values as with
# only 4 decimal places.

# The beauty of this approach is that
# we can get expected
# values for lots of possible values for minutes.
# For example, for minutes having the values
# 1.75, 2.25, 3.75, 4.25, and 8.3333 we could say
use_minutes <- c(1.75, 2.25, 3.75, 4.25,  8.3333)
lm_coeff+lm_coeff*use_minutes

# remember, we know all of the observed values
pulse
# and now we can get all the expected valeus
lm_coeff+lm_coeff*minutes
# the residual values are the observed - expected values
# and we could compute these as
pulse - (lm_coeff+lm_coeff*minutes)

# However, we have a way to get these directly.
# We can get the residual values for all of our
# original data points
# (that is the Observed values - the
# expected value) by
resid_values <- residuals( lin_model )
resid_values
# and we can get a new plot, this time of the
# residual values vs. minutes to see if there
# is any pattern to that.
plot( minutes, resid_values )

# and if we need to know the correlation coeeficient
# we can get that via the cor() funtion
cor(minutes,pulse)
# By the way, if we were to save that and get its square,
# we will get the same value shown back in the summary
# output that was named R-squared
hold_cor <- cor(minutes,pulse)
hold_cor^2
```