Years ago, back in the early and mid 90's to be more precise, I would go to an exerccise program. They had a stationary bike that had various sensors on it and that provided the users with data related to the workout session. In particular, every 30 seconds the device would print out a new line giving the

- Elaspsed time,
- Current cadence (i.e., number of pedal rotations per minute),
- The watts generated (i.e., the power of the pedalling),
- The users pulse (i.e., heart rate, beats per minute),
- The resistence (i.e., a measure of the load that the bike was putting on the pedals), and
- Calories expended (i.e., the number of calories brned during the previous 30 seconds).

Just looking at the figures in Image 1 it looks like my pulse just rises steadily over the seven minutes. We can use R to take a look at that. To do this we will use our USB drive and we will do the usual, create a new folder (in this case called

`worksheet04`

`model.R`

`rogerbike.R`

Then, we double click on the

`rogerbike.R`

Once we run the highlighted lines of Figure 2 we get the output shown in Figure 3 in the

It is important to verify that the values shown in Figure 3 match the values we had in

The next thing to do is to at least look at a scatter plot of the pulse values versus the time (the minutes). Figure 5 shows the command as written in the

Figure 6 shows that nothing much happens in the

But we do get, in the

Figure 7 is really quite accurate, but without grid lines it is a bit hard to see just where each of the dots is on the graph. Figure 8 shows the commands, in the

Once those have been run we see the updated graph in Figure 9.

Clearly the dots in Figure 9 do not lie on a straight line, but they are not far from being on one. Our chalenge is to find the equation of the line that best fits this data. To do this we will use the function

`lm(pulse~minutes)`

`pulse`

When we use the function in Figure 10 we assign the result of the fnction to a new variable,

`lin_model`

`lm()`

When the highlighted lines of Figure 10 are run the result appears in the

The very brief output shown in Figure 11 is meant to tell us that the linear equation that best fits the data is given by

`pulse = 82.593 + 8.127*minutes`

`82.593+8.127*3.2`

`108.5994`

`108`

`109`

The linear model actually has much more in it than just the values shown in Figure 11. The

`summary(line_model)`

Running the highlighted lines of Figure 12 gives Figure 13 in the

Notice that in the expanded output of Figure 13 the same two values,

`82.5934`

`8.7125`

Of course, the first thing that we want to do is to see the result of our work. Therefore, we formulate the

`abline()`

Sure enough, when we run that command we get the graph shown in Figure 15.

As another example of using the linear model to predict the pulse based onthe heart rate, we could form the command shown in Figure 16, using the values for the coefficients given back in Figure 13.

Running that command produces the output shown in Figure 17 which tells us that the predicted, the expected, value is

`121.199`

In the two instances above where we have used the values from the linear model we have relied on our being able to accurately transcribe those values. We can just extract them from the model. This is demonstrated by the commands shown in Figure 18.

Once those commands are executed we can see that the variable

`lm_coeff`

In order to use the values we just refer to them as

`lm_coeff[1]`

`lm_coeff[2]`

The result, shown in Figure 21, is actually a bit more accurate since it is not using the rounded values that we used previously for the coefficients.

An interesting feature of the design of R is that we can get multiple values computed for a problem. For example, if we want to know the expected (prediced ) value for minutes equal to 1.75, 2.25, 3.75, 4.25, and 8.3333 then we can use the statements shown in Figure 22

Running those highighted lines produces the output of Figure 23.

Using that same strategy allows us to compute the residual values, that is, the observed minus the expectd value for each item in the list of

`minutes.`

The results can be seen in Figure 25.

Looking at the residual values is important. It is so important that the linear model that we created,

`lin_model`

We can easily scan the list of our computed residuals in Figure 25 and verify that the minimum value was

`-3.9121`

`4.3429`

In order to pull out the residuals from the linear model, rather than compute them as we did above, we use the function

`residuals()`

`resid_values`

Figure 27 shows those values. When we compare these to the values that we computed and displayed in Figure 25 we see that they are the same.

Now that we have the values stored in

`resid_values`

Once we have run the commands of Figure 28 we get the plot shown in Figure 29.

What we want to see here is that there is no pattern to the plotted points. They should seem like they were just randomly placed on the graph. If they formed a pattern then we would be suspect of our linear model.

What remains is for us to find the

`cor()`

The result of running the command is shown in Figure 31.

Recall that a

`1`

`-1`

`0.9900357`

We can take a minute to redo that computation but store the result in the variable

`hold_cor`

THe resut of the computation is shown in Figure 33.

We have seen the square of the

`Multiple R-squared: 0.9802`

The

`correlation coefficient`

is known
as All that remains to be done is to end our session. We start by saving the file shown in the

THen, in the

`q()`

`y`

That ends our

`.RData`

`.Rhistory`

Here is a listing of the complete contents of the

`rogerbike.R`

# We will look at real data from a 9/10/1996 exercise # session that Roger had working on a special stationary # bike that recorded various values during the exercise # session. In particular, we will check out a linear # model that tries to predice PULSe from the TIME elapsesd # in the session. # first we need to enter the data points, TIME and PULSE # from the session record minutes <- seq( from=0.5, to=7.0, by=0.5 ) minutes pulse <- c(91, 93, 96, 97, 99, 105, 109, 112, 119, 123, 128, 132, 139, 140) pulse # now, how about a quick look at a scatter plot of the # data points, nothing fancy. plot( minutes, pulse ) # it might be a bit easier to read the graph if we # have some grid lines abline( h=seq(90,140,5), col="darkgrey", lty="dotted") abline( v=seq(0.5,7,0.5), col="darkgrey", lty="dotted") # then we can ask for the linear model for these # data points lin_model <- lm( pulse~minutes ) # and we can look at that model lin_model # that gave us a little of the information # but we could get # much more by using the summary function summary( lin_model ) # based on the results we have seen we can plot # the linear # regression line by using the abline function abline( lin_model, col="darkred" ) # if we want to find the expected value of the pulse # when the minutes is 4.75 then we could just type # the following command 82.5934+8.1275*4.75 # where the 82.5934 and 8.1275 come from the # linear model output # we can actually extract those values, # 82.5934 and 8.1275, # from the linear model so that we can use them # without typing them lm_coeff <- coefficients( lin_model ) lm_coeff # then our previous question, # find the expected value when # minutes is 4.75 becomes lm_coeff[1]+lm_coeff[2]*4.75 # the slight difference is due to the fact that this # latter method is not rounding the values as with # only 4 decimal places. # The beauty of this approach is that # we can get expected # values for lots of possible values for minutes. # For example, for minutes having the values # 1.75, 2.25, 3.75, 4.25, and 8.3333 we could say use_minutes <- c(1.75, 2.25, 3.75, 4.25, 8.3333) lm_coeff[1]+lm_coeff[2]*use_minutes # remember, we know all of the observed values pulse # and now we can get all the expected valeus lm_coeff[1]+lm_coeff[2]*minutes # the residual values are the observed - expected values # and we could compute these as pulse - (lm_coeff[1]+lm_coeff[2]*minutes) # However, we have a way to get these directly. # We can get the residual values for all of our # original data points # (that is the Observed values - the # expected value) by resid_values <- residuals( lin_model ) resid_values # and we can get a new plot, this time of the # residual values vs. minutes to see if there # is any pattern to that. plot( minutes, resid_values ) # and if we need to know the correlation coeeficient # we can get that via the cor() funtion cor(minutes,pulse) # By the way, if we were to save that and get its square, # we will get the same value shown back in the summary # output that was named R-squared hold_cor <- cor(minutes,pulse) hold_cor^2

©Roger M. Palay Saline, MI 48176 September, 2016