Worksheet 03.1: Descriptive Measures

The task here is to come up with descriptive measures for the data in Table 1.

Assuming that you have read through earlier pages and that you have mastered many of the steps that we use to set up our work, you can skim through the first ten Figures and their associated discussions.

We start by inserting our special USB drive. On the computer used for this demonstration that drive was assigned the letter F. The File Manager view of that drive is shown in Figure 1. {Recall that the images shown here may have been reduced to make a printed version of this page a bit shorter than it would otherwise appear. In most cases your browser should allow you to right click on an image and then select the option to View Image in order to see the image in its original form.}

Figure 1

Then we create a new folder on that drive. To do this here I just clicked on the New Folder icon toward the middle top of the window of Figure 1. The result is shown at the bottom of Figure 2.

Figure 2

Rather than accept the default folder name of New folder, we give it a new name, in this case worksheet031.

Figure 3

Once the folder is there we want to copy the model.R file that is on the USB drive into the folder. First we right click on the model.R file. This opens the options shown in Figure 4, where we point to the Copy option and then click on it.

Figure 4

Next we double click on our new folder name, worksheet031, to move into that folder, shown in Figure 5.

Figure 5

We can see, in Figure 5, that the folder is empty. But now we can right click in the folder and select the Paste option. That puts the copy of model.R into this folder.

Figure 6

However, we have learned that it is probably best to rename this file. To do that we click (just once) on the name of the file. That will allow us to edit the name of the file, as shown in Figure 7.

Figure 7

For this project we will use the name ws31.R. We change the file name to that, as we see in Figure 8.

Figure 8

And we can press the Enter key to move to Figure 9.

Figure 9

At this point we have our new folder and in that folder we have a renamed copy of our model file. We can double click on that file name to open a session of RStudio. This is shown in Figure 10.

Figure 10

An important relation here is that, because we started this session of RStudio from the file in our directory, this directory (this folder) is our working directory. We get a feeling for this in that the lower right pane of the RStudio window shows us the contents of our very own directory.

We also note that, because we have not done any work in this current directory, we have a blank Environment pane, the top right pane in the window. Therefore, if we want to use any of the functions that have been supplied on the USB drive, we will have to load those functions into the Environment via the source command.
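As a minimal sketch of what the source command does, the lines below write a one-line function to a temporary file and then load it. The file and the function name demo_fn are hypothetical stand-ins for the files supplied on the USB drive; they are not part of the course materials.

```r
# Hypothetical stand-in for a supplied file such as the ones on the USB drive
demo_file <- tempfile(fileext = ".R")
writeLines("demo_fn <- function(x) x + 1", demo_file)

exists("demo_fn")   # FALSE in a fresh session: nothing has been loaded yet
source(demo_file)   # run the file, which defines demo_fn in the Environment
exists("demo_fn")   # TRUE: the function now appears in the Environment pane
demo_fn(1)          # the loaded function can now be called

getwd()             # shows the current working directory
```

In the worksheet itself the path given to source is relative to the working directory, which is why it matters that the session was started from the file in our own folder.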

We will consistently type commands into the Editor pane, then highlight those commands, and then run the highlighted portion by clicking on the Run icon in the Editor pane. That last step will copy the highlighted commands to the Console pane and use R to execute them.

Certainly, you may decide to type all of the commands (and hopefully the comments) yourself as you follow along. However, all of the commands used in this page have been provided in machine readable form at the bottom of this page. If it were me doing the work I would find that listing, copy the lines from this page and paste them into the editor. Then, to follow along, just highlight the lines that you wish to execute.

We were given values above in Table 1. We want to generate those same values in our RStudio session. To do this we need to load the gnrnd4 function into our environment and then run the function using the values given with our table above. Once the values are generated, it makes sense to look at them so that we can compare them to the values in Table 1. The commands to do this are given in Figure 11.

Figure 11

When we run those commands we get the lines shown in the Console pane in Figure 12. In particular, comparing the numbers displayed in Figure 12 to the values given in Table 1, we see that we have generated exactly the required values.

Figure 12

If we look at the Environment pane, shown in Figure 13, we see, among other things, that our function is defined and that there are now 97 values in the variable L1.

Figure 13

We recall that the built-in function summary will tell us a lot about the data in our table. It will give us the median, the mean, the first and third quartile points, and the minimum and maximum values in our data. It will not give us the standard deviation for those values. The command sd(L1) will give us that value, but it only gives us the standard deviation assuming that the data represents a sample. We will have to load and use the function pop_sd to find the standard deviation assuming that the data that we have is a population. Figure 14 holds the commands for doing all of this.

Figure 14

We run the commands highlighted in Figure 14 to get the Console output shown in Figure 15.

Figure 15

We read the output in Figure 15 to find that the median is 235.0, the mean is 230.2, the minimum is 121.0, the maximum is 344.0, and the 1^st and 3^rd quartiles have the values 200.0 and 258.0, respectively. Then, considering the data to be a sample, the standard deviation is 46.04279 and therefore the variance is 2119.938. On the other hand, if the data represents a population then the standard deviation is 45.80484 and the variance is 2098.083.
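The two pairs of values are tied together by the sample size: the population variance is the sample variance multiplied by (n-1)/n. The sketch below checks this with the numbers reported above. The function pop_sd_sketch is a hypothetical illustration of the idea; it is not the pop_sd function supplied on the USB drive, whose code is not shown on this page.

```r
# Hypothetical illustration; the supplied pop_sd function may be written differently
pop_sd_sketch <- function(x) {
  sqrt(mean((x - mean(x))^2))   # divide by n, not by n - 1
}

# A tiny example where the answer is easy to verify by hand
small <- c(2, 4, 4, 4, 5, 5, 7, 9)
pop_sd_sketch(small)            # population standard deviation
sd(small)                       # sample standard deviation, slightly larger

# Converting the worksheet's sample values (n = 97) to population values
n <- 97
sample_sd <- 46.04279
pop_var <- sample_sd^2 * (n - 1) / n
pop_var                         # about 2098.083
sqrt(pop_var)                   # about 45.80484
```

With 97 values the factor 96/97 is close to 1, which is why the two standard deviations differ only slightly.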

In all of this we might be a little concerned that the mean is only given to 1 decimal place. We can look at some other ways to display this value a bit more accurately. First, we could compute it separately and assign its value to a variable. The commands to do this are given in Figure 16.

Figure 16

Running those commands produces the output in Figure 17. There we see the value of the mean is 230.2474.

Figure 17

However, if we look in the Environment pane, shown in Figure 18, we can see that R has calculated the mean, stored in xbar, to be 230.247422680412.

Figure 18

The 15 digits shown in Figure 18 are clearly more than we need. The 7 digits shown in Figure 17 are quite helpful. The 4 digits shown in Figure 15 do not really give us enough information, but the summary command, which produced Figure 15, is just so convenient!

There is a way to tell R to give us more digits in the display by default. We can use the options(digits=10) command to change the default output setting. This, along with another run of the summary command, is given in the Editor pane in Figure 19.

Figure 19

The result of running those two commands is in Figure 20. There we see that all of the values are given with more displayed digits.

Figure 20

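One might reasonably ask, "I set digits to 10, why are there only 7 shown for each value in Figure 20?" The answer is that the setting, digits=10, is taken by R as a "suggestion", not as a rule. (The summary display appears to use three fewer significant digits than the digits setting: 4 digits in Figure 15 under the default setting of 7, and 7 digits here.) We got what we wanted, more displayed digits, and that is enough.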
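If we only want more digits for a single value, a sketch of an alternative is to ask for a specific number of significant digits with the built-in format function, leaving the session-wide setting alone. The value assigned to xbar below is simply the mean copied from Figure 18.

```r
xbar <- 230.247422680412     # the mean, as displayed in the Environment pane

format(xbar, digits = 4)     # "230.2"        like the summary output
format(xbar, digits = 7)     # "230.2474"     like printing xbar by default
format(xbar, digits = 10)    # "230.2474227"  ten significant digits

round(xbar, 3)               # round to 3 decimal places instead
```

Note that format returns a character string, so it is meant for display and not for further arithmetic.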

At this point we have found our measures of central tendency and our measures of dispersion. Let us look at some graphs of the data. First, we can get a histogram.

Figure 21

When we run the command of Figure 21 we get the histogram shown in Figure 22.

Figure 22

The graph in Figure 22 is completely accurate, but it is a bit hard to understand, in part because we have no idea of where R has decided to make the breaks for the different bins or groups. If we had to guess, it certainly seems that one break is at 200 and another is at 300, with that interval broken into 5 bins. Therefore, we would expect that the bins are each 20 units wide and we can compute that the break points are at 120, 140, 160, 180, 200, 220, 240, 260, 280, 300, 320, 340, and 360.
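Rather than guess, we can ask R which break points it chose. The sketch below assumes a hypothetical stand-in for L1 (97 evenly spaced values from 121 to 344, not the real table values) so that it can be run on its own. The option plot=FALSE tells hist to return its computations without drawing anything.

```r
# Hypothetical stand-in for L1: same count, minimum, and maximum as the table
x_demo <- seq(121, 344, length.out = 97)

h <- hist(x_demo, plot = FALSE)   # compute the histogram, but do not draw it
h$breaks                          # the break points R chose
h$counts                          # the number of values in each bin
diff(h$breaks)                    # the width of each bin
```

In our own session we would run the same lines with L1 in place of x_demo.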

For the purpose of this course, the default graph is good enough. However, with just a few additional values we can really improve the quality of the graph. Three immediate changes will be:

to include the weird las=3 option which will cause all of the axis values to be drawn vertically (and thus perpendicular to the x-axis),
to include the option breaks=seq(110,350,15) which will set the breaks for the bins to start at 110 and to be 15 units wide and to not go over 350, and
to include the option xaxp=c(110,350,16) which will set the x-axis labels to start at 110 and go to 350 and to have 16 even steps along the way, thus coinciding with the values for our break points.
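To see why 16 steps on the axis line up with the breaks, we can look at the values that seq produces. This is just a quick check of the arithmetic; the name brks is chosen here for illustration.

```r
brks <- seq(110, 350, 15)   # start at 110, step by 15, do not pass 350
brks                        # 110 125 140 ... 335 350
length(brks)                # 17 break points
length(brks) - 1            # 16 bins, matching the 16 steps in xaxp
(350 - 110) / 15            # 16, so 350 is reached exactly
```

Because 240 is an exact multiple of 15, the last break point falls right on 350 and every data value from 121 to 344 lands inside some bin.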

The new command is shown in Figure 23.

Figure 23

Running the command does not produce anything to speak of in the Console pane, shown in Figure 24.

Figure 24

However, it does produce the histogram shown in Figure 25, a significant improvement over the default graph of Figure 22.

Figure 25

One aspect of the histograms that we have yet to examine is where a value that falls on a break point is placed. In the data in Table 1 there is exactly one 245. In Figure 25 there appears to be one more value in the 230-245 bin than there is in the 245-260 bin. The question is "In which bin is the data value 245?" The answer is that by default R makes each bin include its right break point but not its left break point. Thus, in mathematical nomenclature, the two bins are for values (230,245] and (245,260], where the parenthesis indicates the "open" end of the interval and the bracket indicates the closed end. The value 245 is therefore counted in the 230-245 bin.

We can change this default by including the option right=FALSE in our command. This is shown in Figure 26.

Figure 26

As expected, running the command does little but echo it in the Console pane.

Figure 27

But running the command does produce the new histogram shown in Figure 28.

Figure 28

We can see some change from Figure 25 to Figure 28. In particular, the 245-260 bin has grown. That is because it picked up the 245 value that used to be in the 230-245 bin. That 230-245 bin did not shrink because it picked up a replacement for the lost 245 value, namely the 230 value. We can see that because the 215-230 bin is now shorter than it was in Figure 25. The three bins that we have examined, in Figure 28, would now be described as [215,230), [230,245), and [245,260). We would say that these are closed on the left and open on the right.
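The effect of right=FALSE is easy to see on a tiny example. The sketch below uses three hypothetical values, two of which sit exactly on break points, and asks hist for the bin counts without drawing a graph.

```r
x <- c(230, 245, 250)            # hypothetical values; 230 and 245 are on break points
brks <- c(215, 230, 245, 260)    # three bins: 215-230, 230-245, 245-260

# Default: bins are closed on the right, so 230 and 245 each fall in the lower bin
right_closed <- hist(x, breaks = brks, plot = FALSE)$counts
right_closed                     # 1 1 1

# right=FALSE: bins are closed on the left, so 230 and 245 each move up one bin
left_closed <- hist(x, breaks = brks, right = FALSE, plot = FALSE)$counts
left_closed                      # 0 1 2
```

This is the same shift we saw in the worksheet data: the value on a break point moves from the bin on its left to the bin on its right.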

There are a few more tweaks that we could add to our histogram, just to make it easier to read:

the option main="For Worksheet 3.1" gives the graph a main title,
the option xlab="Table Values" replaces the default label for the x-axis,
the option ylab="count of values" replaces the default label for the y-axis,
the option xlim=c(110,350) explicitly sets the x-axis to that range,
the option ylim=c(0,16) explicitly sets the y-axis to that range,
the option cex.axis=.7 causes the labels for both axes to be given at 70% the regular size,
the option yaxp=c(0,16,8) makes the tick marks on the y-axis start at 0, end at 16, and have 8 steps along the way, and
the additional command abline( h=seq(2, 16, 2 ), col="blue", lty="dotted") will draw dotted blue horizontal lines over the histogram, starting at 2, ending at 16, and going in steps of 2.

This gives the commands shown in Figure 29.

Figure 29

Running those commands changes the Console to appear as in Figure 30.

Figure 30

And, doing so produces the graph shown in Figure 31.

Figure 31

It would be inappropriate to use a barplot for the data in Table 1. Remember that a barplot graphs the values given in the list. In this case the first bar would be of height 264, the second of height 257, the third of height 277, and so on. We could, however, get a barplot of the frequency of each different value in L1 by using the command barplot(table(L1)).

Figure 32

As usual, not much shows up on the Console pane.

Figure 33

The generated barplot, found in the Plot pane, is shown in Figure 34. From that plot we can tell that 12 values appear twice, five values appear three times, and two values appear 4 times. However, it is really hard to figure out just what those multiple values might be.
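One way around reading the plot is to look at the table itself. The sketch below uses a short hypothetical list of values, not the data of Table 1, to show how table can report which values repeat and how many values appear once, twice, and so on.

```r
x <- c(200, 200, 241, 241, 241, 243, 258, 258)   # hypothetical values

value_counts <- table(x)          # how many times each distinct value appears
value_counts

value_counts[value_counts > 1]    # only the values that appear more than once

table(value_counts)               # how many values appear 1 time, 2 times, 3 times
```

Run against L1, the last line would give the same tally that we read off the barplot, without having to count bars.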

Figure 34

We do remember that to find the mode of the values in L1 we can use the command Mode(L1), but only after that function has been loaded into the Environment.

Figure 35

The result of running the two commands is shown in Figure 36 where we see that there are two mode values, 241 and 243, and that each appears 4 times in the data.
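For those curious about how a mode can be found, here is a minimal sketch built on the table function. The name mode_sketch and the sample values are hypothetical; the Mode function supplied on the USB drive is not shown on this page and may well work differently.

```r
# Hypothetical illustration; not the supplied Mode function
mode_sketch <- function(x) {
  tab <- table(x)                          # count each distinct value
  as.numeric(names(tab)[tab == max(tab)])  # keep every value with the top count
}

x <- c(241, 243, 241, 200, 243, 258)       # hypothetical values with two modes
mode_sketch(x)                             # 241 243
```

Returning every value that reaches the highest count, rather than just the first, is what lets the sketch report two modes.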

Figure 36

We can move on to explore a boxplot but it will help to recall the values from our summary function at the same time. Figure 37 has both commands. Note that the boxplot command has been modified to include the option horizontal=TRUE.

Figure 37

We see the output of the summary command in Figure 38.

Figure 38

The boxplot is shown in Figure 39. As expected, the heavy middle bar is at the median; the left end of the box is at the 1^st quartile; the right end of the box is at the 3^rd quartile; the left end of the whisker is at the minimum; and the right end of the whisker is at the maximum.
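The whiskers reach the minimum and maximum here because no value is far enough from the box to be drawn as an outlier. By default boxplot ends each whisker at the most extreme value within 1.5 times the interquartile range of the box. A quick check, using the quartiles, minimum, and maximum read from the summary output (the variable names are chosen here for illustration):

```r
q1 <- 200                         # 1st quartile from summary
q3 <- 258                         # 3rd quartile from summary
iqr <- q3 - q1                    # interquartile range: 58

lower_fence <- q1 - 1.5 * iqr     # 113
upper_fence <- q3 + 1.5 * iqr     # 345

121 >= lower_fence                # TRUE: the minimum is inside the fence
344 <= upper_fence                # TRUE: the maximum is inside the fence
```

Had either extreme fallen outside its fence, the whisker would have stopped short and the extreme value would have been drawn as a separate point.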

Figure 39

Again, as was the case with the histogram, it is sufficient for this course to be able to produce such a graph. However, with just a few additional options we could dramatically improve the graph. Figure 40 shows the changes that we would make.

Figure 40

Figure 41 shows the resulting Console pane.

Figure 41

Figure 42 shows the new graph. It is much easier to approximate the various values by reading this chart than it was by reading the default chart shown in Figure 39.

Figure 42

Another way to look at the data is via a stem and leaf diagram. The built-in function is stem. We will just try that command.

Figure 43

The result is shown in Figure 44 below.

Figure 44

Figure 44 looks like a stem and leaf diagram. It was produced from the data via R. One would expect that it is completely correct. However, it does seem a bit strange that the stems go up by 2. Looking back at the original values we see that the second was 257 and the third was 277. Where are they in Figure 44?

The problem is that the built-in function groups values according to its own need. For the stem function it seems to want to group values so that the number of output lines remains fairly small. This produces the strange arrangement that we see in Figure 44. There is an option that we can specify to try to override that behaviour. Figure 44.1 shows that modified command.

Figure 44.1

Running the command of Figure 44.1 produces the diagram shown in Figure 44.2 where we now find the complete set of values. Note that 257 and 277 are both in the new diagram.
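To see where a particular value should land in a stem and leaf diagram with one stem per ten, we can split each value into its stem and its leaf by hand. The sketch below uses just a handful of values mentioned on this page (the first three values, the minimum, and the maximum), not the whole table.

```r
x <- c(264, 257, 277, 121, 344)   # a few values mentioned on this page

stems  <- x %/% 10                # integer division: 257 becomes stem 25
leaves <- x %% 10                 # remainder: 257 becomes leaf 7

split(leaves, stems)              # the leaves, grouped by stem
```

So 257 belongs on the stem 25 and 277 on the stem 27, each with a leaf of 7, which is where we find them once every stem is given its own line.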

Figure 44.2

Recognizing the default strange behavior of the built-in command, we do have a different function that you can load and run, namely, stem_leaf. Those commands appear in Figure 45.

Figure 45

Running those commands produces the diagram shown in Figure 46.

Figure 46

It is worth comparing these two solutions. Each has its advantages.

That leaves us with attempting a dot plot. To do that we need to first load the function into our Environment and then run the function.

Figure 47

The Console output, shown in Figure 48, does not tell us much other than that there were no errors.

Figure 48

The Plot pane, however, now holds the dot plot.

Figure 49

That plot is not particularly interesting. In fact, it is no more informative than was the bar plot that we saw in Figure 34.

Before we leave we should make sure that we have saved our Editor file. Because we have made changes in that file, its name appears in red in its Editor tab, as we see in Figure 50.

Figure 50

By clicking on the floppy disk icon we save the file and turn its name to a black font as in Figure 51.

Figure 51

Finally, we need to close our RStudio session. We do that in the Console pane via the q() command. Press the Enter key to perform the command, then respond to the question with y and press the Enter key again.

Figure 52

Here is a listing of the complete contents of the ws31.R file:

#we want to use gnrnd4 to generate our data
#  first we need to get that function into our
#  environment

source("../gnrnd4.R")

#  Then we can use it to generate our data

gnrnd4(503029604, 5200234)

# and we should at least look at it so that we can 
# compare it to the table on the web page
L1

#  Then we can get some descriptive measures
summary(L1)
#  along with the sample standard deviation and variance
sd(L1)
sd(L1)^2
# and then the population standard deviation and variance
source("../pop_sd.R")
pop_sd(L1)
pop_sd(L1)^2

# The summary command did not give us many 
# significant digits.  Look at it another way...

xbar <- mean(L1)
xbar

# Here is another way to get more out of commands 
# like summary
options(digits=10)
summary(L1)

# If we look back at the data we see that we have many 
# different values ranging from 121 to 344.  Let us get
# graphs of those values.
hist(L1)

# That was a histogram with the various parameters set
# by R.  We could, of course, specify some of them 
# so that we get a nicer histogram.  This is not
# a requirement of our course but it does not hurt to
# learn a little extra...
hist( L1, las=3, breaks=seq(110,350,15),
      xaxp=c(110,350,16))

# Note that these intervals are closed on the right
# but we can change that with 
hist( L1, las=3, breaks=seq(110,350,15),
      xaxp=c(110,350,16), right=FALSE)

# And, returning to the closed on the right intervals, 
# we could fix up a little more with 
hist(L1, main="For Worksheet 3.1", xlab="Table Values",
     ylab="count of values", xlim=c(110,350), ylim=c(0,16),
     breaks=seq(110,350,15), xaxp=c(110,350,16),las=3,
     cex.axis=.7, yaxp=c(0,16,8))
abline( h=seq(2, 16, 2 ), col="blue", lty="dotted")

# we could try to do a barplot, but remember that a barplot
# of the raw values is not useful here; a barplot of table(L1)
# will show the count (i.e., frequency) of each unique value

barplot( table(L1) )

# Looking at the bar plot we can see that there are a number of 
# values that repeat, 12 values appear twice, 5 values appear
# three times, and two values appear 4 times.  It is hard to 
# read the plot to find those two values, but we could "load" 
# and run the Mode function to help us.
source("../mode.R")
Mode(L1)

# while we are at it, let us see, again the summary values 
# and then generate a box and whisker chart
summary(L1)
boxplot(L1, horizontal = TRUE)

# and it would not hurt to try to make that look a little
# better
boxplot(L1, ylim=c(110,350), xaxp=c(110,350,16),
        horizontal=TRUE, main="For Worksheet 3.1 Data",
        las=3)
abline( v=seq(110, 350, 15 ), col="blue", lty="dotted")


# we can try a stem and leaf plot.  After all, we have values
# running from 121 to 344 and we could look at stems
# running from 12 to 34.
# first we can try this with the built-in stem function
stem(L1)

# that produced some questionable output since, looking
# at the data we do not find a 123 but we do find a 133
# which is not in the stem-leaf plot.  There is an option
# that may help this:
stem(L1, scale=3)


#And then we can load our version and try that...
source("../stem_leaf.R")
stem_leaf( L1, place=0)

# we can also do a dot plot, but that will not be much
# different from that bar plot that we did before.
source("../dot_plot.R")
dot_plot(L1)