Worksheet 02: Descriptive Measures

The task here is to come up with descriptive measures for the data in Table 1.

This page assumes that you have read through earlier pages and that you have mastered many of the steps that we use to set up our work. [If that is not the case then you should go back to the earlier pages where we walk through all of those steps.]

We start by inserting our special USB drive. On the computer used for this demonstration, that drive was assigned the letter F. Then we create a new folder on that drive. For this demonstration I have used the name worksheet02. We make a copy of the model.R file that is on the USB drive. We open the new folder and paste the copy of the file into the new folder. Finally, we rename the file; in this instance it has been called descriptive.R. That leaves us in the situation shown in Figure 1.

Figure 1
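For those who prefer typing to clicking, the same folder setup can be sketched from the R console. This is only an illustration: tempdir() stands in for the USB drive (F: in this demonstration), and the one-line model.R written here is a placeholder for the real file on the drive.

```r
# tempdir() is a stand-in for the USB drive used in the demonstration.
drive <- file.path(tempdir(), "usb_standin")
dir.create(drive, showWarnings = FALSE)

# Placeholder for the model.R file that already exists on the real drive.
writeLines("# model file", file.path(drive, "model.R"))

# Create the new folder, then copy model.R into it under its new name.
work_dir <- file.path(drive, "worksheet02")
dir.create(work_dir, showWarnings = FALSE)
file.copy(file.path(drive, "model.R"),
          file.path(work_dir, "descriptive.R"))

# Confirm that the new folder holds the renamed copy.
list.files(work_dir)
```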

Then, double click on the descriptive.R file to open RStudio with a copy of the contents of descriptive.R in the Editor pane.

Just thinking about the task at hand, we know that we will need to load a few of the functions supplied on the USB drive. Figure 2 shows the comments and commands written into the editor to start our project. [A complete listing of the commands that we will use in this demonstration is given at the end of this page.]

Figure 2

Note that those lines have been highlighted. We click on the Run icon to have those highlighted commands performed. The result is shown in Figure 3.

Figure 3

With no error messages in Figure 3 we assume everything went well. However, we can check in the Environment tab to verify that the two functions have been loaded. Figure 4 confirms that they have.

Figure 4
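The same check can be made from the Console rather than the Environment tab. In this sketch demo_fn is a hypothetical stand-in for a function that source() would load from the drive; in a real session we would ask about the loaded names instead.

```r
# Hypothetical stand-in for a function loaded by source().
demo_fn <- function(x) x

# exists() reports whether a name is defined in the session.
exists("demo_fn")

# is.function() confirms that the name refers to a function.
is.function(demo_fn)

# ls() lists the names in the current environment,
# much as the Environment tab does.
ls()
```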

Back in the Editor pane, we can add the comments and commands required to generate and list the new data.

Figure 5

Those lines were highlighted in Figure 5. We run them to get the output shown in Figure 6.

Figure 6

We make the essential check that the numbers we show in Figure 6 are identical to those shown in Table 1.

They are, so we can move forward. The summary() function will give us lots of important descriptive values. In Figure 7 you find a lengthy comment leading up to the use of the summary() function.

Figure 7

Once those lines are run we get the output shown in Figure 8.

Figure 8

We see that

the mean of the values in the table is 84.82,
the median of the values in the table is 82.70, so half the values are less than 82.70 and half of them are greater than 82.70,
the range of the values in the table is from 56.30 to 117.30,
the first quartile, Q₁, of the values in the table is 74.65, and
the third quartile, Q₃, of the values in the table is 96.30.
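Each of the values reported by summary() can also be requested on its own. The short sketch here uses a small made-up vector, not the Table 1 data, just to show which base R function gives which value.

```r
# Illustrative values only; these are not the Table 1 data.
x <- c(56.3, 74.1, 82.7, 87.8, 87.8, 96.3, 117.3)

mean(x)                              # arithmetic mean
median(x)                            # middle value of the sorted data
range(x)                             # smallest and largest values
quantile(x, probs = c(0.25, 0.75))   # first and third quartiles

# summary() reports the same values in a single call.
summary(x)
```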

As noted in the comment, it would not make sense to find the mode of the values in the table because there is so little repetition of values. [In fact, 87.8 occurs 3 times in the table, but that is not much in a data collection of 95 values.]
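If we want to see how much repetition a data collection has, the table() function counts the occurrences of each distinct value. The vector here is made up for illustration and is not the Table 1 data.

```r
# Illustrative values only; these are not the Table 1 data.
x <- c(56.3, 74.1, 82.7, 87.8, 87.8, 87.8, 96.3, 117.3)

# Count how many times each distinct value occurs.
counts <- table(x)

# Put the most frequent values first.
counts <- sort(counts, decreasing = TRUE)
head(counts, 3)

# The most frequent value, returned as text.
names(counts)[1]
```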

We can compute the standard deviation, but we recall that there are two different definitions, depending upon whether the data in Table 1 is a sample or a population. Since nobody told us which it is, we will compute both of them. Those commands are in Figure 9.

Figure 9

The result of running the commands is in Figure 10.

Figure 10

From that output we see that

if this is a sample then the standard deviation is 13.45017, but
if this is a population then the standard deviation is 13.37919.
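The two definitions differ only in the divisor: n - 1 for a sample and n for a population. The sketch here makes that concrete. The function pop_sd_sketch() is a hypothetical stand-in written for this illustration; the pop_sd() supplied on the USB drive may be written differently. The data vector is made up as well.

```r
# Hypothetical stand-in; the pop_sd() on the USB drive may differ.
pop_sd_sketch <- function(x) {
  sqrt(mean((x - mean(x))^2))   # divides by n, not n - 1
}

# Illustrative values only; these are not the Table 1 data.
x <- c(56.3, 74.1, 82.7, 87.8, 87.8, 96.3, 117.3)
n <- length(x)

sd(x)                       # sample standard deviation (divides by n - 1)
pop_sd_sketch(x)            # population standard deviation (divides by n)
sd(x) * sqrt((n - 1) / n)   # the same value, found by rescaling sd()

# The two values reported for Table 1 are related the same way, with n = 95.
13.45017 * sqrt(94 / 95)
```

The last line comes out at about 13.379, in agreement with the population value reported in Figure 10.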

So much for the measures describing the data. Let us turn to a picture of the data. The function hist() will create such an image.

Figure 11

Running the highlighted text of Figure 11 does not produce an image in the Console shown in Figure 12.

Figure 12

However, it does create the histogram shown in Figure 13 in the Plots tab.

Figure 13

The histogram gives us quite a good feel for the distribution of values in the data collection. We can see that there are just a few extreme values and that most of the values are clustered toward the middle. Looking back at the mean (84.82) and the median (82.70) we see that both values are in the tall middle rectangle. We can even try to read the frequency of values in each "bin" in the figure. For example, it looks like there are 10 values in the 60 to 70 range.
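Rather than estimating bar heights by eye, we can ask hist() for the numbers behind the picture. With plot=FALSE the function draws nothing and returns the bin endpoints and counts. The data here is a small made-up vector, not the Table 1 data, and the breaks are given explicitly so that the bins start at 50 and are 10 wide.

```r
# Illustrative values only; these are not the Table 1 data.
x <- c(56.3, 62.0, 68.4, 74.1, 79.9, 82.7, 87.8, 87.8, 96.3, 104.5, 117.3)

# plot = FALSE returns the histogram's numbers without drawing it.
h <- hist(x, breaks = seq(50, 120, by = 10), plot = FALSE)

h$breaks   # bin endpoints
h$counts   # number of values that fall in each bin
```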

A box and whisker plot should give us a different view of the data collection. The boxplot() function will produce such a view.

Figure 14

Again, running the highlighted text of Figure 14 does not produce much in the Console pane shown in Figure 15.

Figure 15

But it does produce, in the Plots tab, the diagram shown in Figure 16.

Figure 16

In Figure 16 we see the range of the values, the location of the first and third quartiles, and the position of the median.
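The whiskers reach the smallest and largest values only when no value lies too far from the box. By default boxplot() draws any value more than 1.5 times the interquartile range beyond the box as a separate point. The sketch here checks this using the values reported by summary(), on the assumption that the edges of the box match those reported quartiles.

```r
# Values reported by summary() earlier on this page.
q1   <- 74.65
q3   <- 96.30
low  <- 56.30
high <- 117.30

iqr <- q3 - q1                   # interquartile range
lower_fence <- q1 - 1.5 * iqr    # values below this would be drawn as points
upper_fence <- q3 + 1.5 * iqr    # values above this would be drawn as points

c(lower_fence, upper_fence)

# TRUE means the whiskers extend all the way to the minimum and maximum.
low >= lower_fence && high <= upper_fence
```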

The vertical alignment of Figure 16 is merely the default alignment. A horizontal alignment is often more pleasing. With just the small change shown in Figure 17 we can ask for the alternative arrangement.

Figure 17

Figure 18 shows the horizontal box and whisker plot of the same data.

Figure 18

Having seen the histogram before raises our expectations for getting more detailed information about the distribution of values in the data collection. A frequency table would provide much more specific information about the values in the table. When we first went through the development of a frequency table we found that through a series of steps we could have R create such a table. Rather than memorize and repeat those steps each time that we want to build such a frequency table we developed a function, called collate3(), that will create such a frequency table.

Recall that the process for building the frequency table depends upon us specifying the lower end of the first "bin" and the width of the "bins". We wrote collate3() so that if we just give it the data collection then it just tells us the range of the collection and it suggests a width for the "bins". Figure 19 shows the collate3() function when it is just given the data collection.

Figure 19

The result of running those highlighted lines is given in Figure 20.

Figure 20

We, however, will construct our "bins" to conform to the "bins" that R used in making the histogram shown back in Figure 13. In particular, we want to use the function as collate3(L1,use_low=50,use_width=10) as shown in Figure 21. Please note that the result of the function (it produces an entire table) is assigned to a new variable, in this case ft. Once ft has been given a value we just use its name to cause the system to display its contents, namely, the frequency table.

Figure 21

Running those highlighted statements produces the output, in the Console pane, shown in Figure 22.

Figure 22

In that frequency table we see the actual number of items in each "bin", along with the endpoints of each "bin", and the midpoint of each "bin". In addition we can see the relative frequency, the cumulative frequency, the cumulative relative frequency, and even the number of degrees that we would ascribe to sectors of a pie chart for the data collection.
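To make the meaning of those columns concrete, here is a minimal sketch of how such a table could be assembled from base R pieces. The function freq_table_sketch() and its column names are hypothetical; this is not the collate3() supplied on the USB drive, which may differ in its details. The sketch assumes that use_low is at or below the smallest value and that bins are closed on the right, as in the default for hist(). The data is made up.

```r
# Hypothetical sketch; the collate3() on the USB drive may differ.
freq_table_sketch <- function(x, use_low, use_width) {
  # Enough bins, each use_width wide, to reach the largest value.
  n_bins <- max(1, ceiling((max(x) - use_low) / use_width))
  breaks <- use_low + use_width * (0:n_bins)
  # Assumption: bins are closed on the right, as in hist()'s default.
  bins <- cut(x, breaks = breaks, right = TRUE, include.lowest = TRUE)
  freq <- as.vector(table(bins))
  rel_freq <- freq / length(x)
  data.frame(
    low          = breaks[-length(breaks)],
    high         = breaks[-1],
    midpoint     = breaks[-length(breaks)] + use_width / 2,
    freq         = freq,
    rel_freq     = rel_freq,
    cum_freq     = cumsum(freq),
    cum_rel_freq = cumsum(rel_freq),
    pie_degrees  = 360 * rel_freq    # share of a full circle
  )
}

# Illustrative values only; these are not the Table 1 data.
x <- c(56.3, 62.0, 68.4, 74.1, 79.9, 82.7, 87.8, 87.8, 96.3, 104.5, 117.3)
ft_sketch <- freq_table_sketch(x, use_low = 50, use_width = 10)
ft_sketch
```

The last row of the cumulative frequency column should equal the number of values, and the pie chart degrees should add up to 360, which gives two quick checks on any such table.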

The table of Figure 22 gives all of the information that we have, but it is not pretty. Figure 23 shows the command, View(ft), that causes a pretty version of that same table to appear as a new tab in the Editor pane.

Figure 23

Running the highlighted lines of Figure 23 produces no new output in the Console pane as shown in Figure 24.

Figure 24

However, it does create a new tab in the Editor pane, as shown in Figure 25, to show the pretty version of the frequency table.

Figure 25

Figure 25 does not contain any more information than did Figure 22. It just looks much better. [And, as was noted in the earlier introductory pages, there are some extra features available in this new presentation, though we will not present them again here.]

Before leaving this description of the data in Table 1 we note that the frequency table gave a bit more specific data than we could read from the histogram. However, we could enhance that histogram to make it much easier to read and understand. This added enhancement is presented just to show you how to do these sorts of things. You are not expected, in this course, to be proficient at making such enhancements.

Figure 26 shows the enhanced and expanded commands.

Figure 26

Once we run the commands highlighted in Figure 26, the Console appears as shown in Figure 27.

Figure 27

The enhanced histogram appears in the Plots tab shown in Figure 28.

Figure 28

Before we leave all of this we should save the editor file. We click on the descriptive.R tab there and then click on the Save icon. Doing so will cause the file name to be displayed in black, as shown in Figure 29.

Figure 29

Then we use the q() function, and we respond to the prompt with a y to quit our session, as shown in Figure 30.

Figure 30

At that point, when we look at the folder, shown in Figure 31, we should see not only our much larger descriptive.R file but also, if we are on a PC, the two hidden files .RData and .Rhistory, which hold the working environment and a historical list of all the commands that we gave in our session, respectively.

Figure 31

Here is a listing of the complete contents of the descriptive.R file:

   #This session will create a collection of data
   # and then find the appropriate descriptive
   # measures for that data.

   #first we will load a few of the functions that we
   #   will need here.  More may follow later.
source("../gnrnd4.R")
source("../pop_sd.R")
source("../collate3.R")
   # Now we are ready to create the data collection
gnrnd4( key1=1300259404, key2=13200853)
   # Then we can look at the data that we generated
L1
   # Looking at the data it appears that these
   # are continuous measures and that there is
   # little if any repetition of values.  Thus the
   # appropriate measures of central tendency are the mean
   # and the median.  We will look at the range and the
   # quartile points as measures of dispersion.  All of that
   # is done via the summary() function
summary(L1)
   # We also want to find the standard deviation of the
   # data, but to do this we need to know if this data
   # is a population or a sample.  For a sample we can use
   # the built-in function sd()
sd( L1 )
   # For a population we will use the function that we
   # already loaded, pop_sd()
pop_sd( L1 )
   # To get some graphic description we can generate a
   # histogram, the most basic form of which is created
   # by the function hist()
hist( L1 )
   # We can get a box and whisker plot of the data via
   # the function boxplot()
boxplot( L1 )
   # not that we need to do it, but we could turn this into a
   # horizontal form via a slightly modified use of that
   # same function
boxplot( L1, horizontal=TRUE )
   # and then, to follow up on the histogram, we could
   # produce a frequency table of grouped data by
   # using the collate3() function that we loaded earlier.

   # We already know the low and high values of the data
   # from our earlier summary() function.  Therefore, we
   # could skip the first step in running collate3().
   # However, in this example, we will do the full two
   # step process.  First run the function collate3() but
   # with only the one argument, L1.
collate3( L1 )
   # Now run it again, but this time we will tell it to
   # use 50 as the low value, and we will set the width
   # of the intervals to be 10.  Also, we will save the
   # result in a new variable, ft
ft <- collate3( L1, use_low=50, use_width=10 )
   # to see the result we can just give the variable name
ft
   # We could get a nicer looking version of that output
   # by using the View() function, note the capital V
View( ft )
   # it might be nice to return to our histogram and improve
   # it by changing the labels, changing the y-scale, and
   # then adding some horizontal lines. This will make it
   # easier to compare the histogram to the frequency table.
hist( L1, main="Worksheet #2 Histogram",
      xlab="values in table", ylab="Frequency of values",
      ylim=c(0,30), yaxp=c(0,30,10), las=2)
abline(h=seq(0,30,3), col="darkgrey",
       lty="dotted")