Explanation of collate3.R

Return to Computing in R: Frequency Tables -- Grouped Values

We will start this page with a straightforward listing of the file collate3.R almost as it exists at the web site. You can download that file to your computer by right clicking on the link collate3.R. The only difference between the real file and the listing here is that this listing has a trailing comment on every line giving the line number of that line.

collate3 <- function( lcl_list, use_low=NULL, use_width=NULL, ...)        #  1
  {                                                                       #  2
    ## This is a function that will mimic, to some extent, a program      #  3
    ## that we had on the TI-83/84 to put a list of values into           #  4
    ## bins and then compute the frequency, midpoint, relative frequency, #  5
    ## cumulative frequency, cumulative relative frequency, and the       #  6
    ## number of degrees to allocate in a pie chart for each bin.         #  7
                                                                          #  8
    ## One problem here is that getting interactive user input in R       #  9
    ## is a pain.  Therefore, if the use_low, and or use_width            # 10
    ## parameters are not specified, the function returns summary         # 11
    ## information and asks to be run again with the proper values        # 12
    ## specified.                                                         # 13
                                                                          # 14
   lcl_real_low <- min( lcl_list )                                        # 15
   lcl_real_high <- max( lcl_list )                                       # 16
   lcl_size <- length(lcl_list)                                           # 17
                                                                          # 18
   if( is.null(use_low) | is.null(use_width) )                            # 19
   {                                                                      # 20
                                                                          # 21
     cat(c("The lowest value is ",lcl_real_low ,"\n"))                    # 22
     cat(c("The highest value is ", lcl_real_high,"\n" ))                 # 23
     suggested_width <- (lcl_real_high-lcl_real_low) / 10                 # 24
     cat(c("Suggested interval width is ", suggested_width,"\n" ))        # 25
     cat(c("Repeat command giving collate3( list, use_low=value, use_width=value)","\n")) # 26
     cat("waiting...\n")                                                  # 27
     return( "waiting..." )                                               # 28
   }                                                                      # 29
  ## to get here we seem to have the right values                         # 30
  use_num_bins <- floor( (lcl_real_high - use_low)/use_width)+1           # 31
  lcl_max <- use_low+use_width*use_num_bins                               # 32
  lcl_breaks <- seq(use_low, lcl_max, use_width)                          # 33
  lcl_mid<-seq(use_low+use_width/2, lcl_max-use_width/2, use_width)       # 34
                                                                          # 35
  lcl_cuts<-cut(lcl_list, breaks=lcl_breaks, ...)                         # 36
  lcl_freq <- table( lcl_cuts )                                           # 37
  lcl_df <- data.frame( lcl_freq )                                        # 38
  lcl_df$midpnt <- lcl_mid                                                # 39
  lcl_df$relfreq <- lcl_df$Freq/lcl_size                                  # 40
  lcl_df$cumulfreq <- cumsum( lcl_df$Freq )                               # 41
  lcl_df$cumulrelfreq <- lcl_df$cumulfreq / lcl_size                      # 42
  lcl_df$pie <- round( 360*lcl_df$relfreq, 1 )                            # 43
                                                                          # 44
  lcl_df                                                                  # 45
  }                                                                       # 46
We can "walk through" the program to see what it does.

Lines 1 and 2

The first line assigns to the variable collate3 a value. That value is defined to be a function. The function is defined to have parameters and in this instance the function will have at least one parameter called lcl_list. This means that when someone actually invokes the function collate3() they will have to give it at least one argument which will be assigned to the parameter lcl_list. Thus, we could call the function with a statement such as collate3(L1) and that simple statement will start the function with L1 assigned to lcl_list.

The function heading continues, in line 1, by specifying two more parameters, namely, use_low and use_width. However, unlike the first parameter, lcl_list, these two parameters need not be specified when we call the collate3() function. That is why we did not need to have them when we used the command collate3(L1). Here is an example use of the collate3() function that uses these two additional parameters by specifying two more arguments: collate3( L1, 190, 10). Such a statement will cause 190 to be assigned to use_low and 10 to be assigned to use_width. R even lets us change the order of the assignment as long as the statement we use includes the names of the parameters. Thus, the statement collate3( L1, use_width=10, use_low=190) will cause 190 to be assigned to use_low and 10 to be assigned to use_width, even though the arguments are given in the opposite order. However, if we do not specify arguments for use_low and/or use_width then they will have the default value specified in line 1 as NULL.
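The matching rules just described can be tried out on a much smaller function. The sketch below uses show_args(), a hypothetical function that is not part of collate3.R, with the same style of heading. It simply reports which values ended up in use_low and use_width.

```r
# show_args() is a hypothetical stand-in, not part of collate3.R
show_args <- function( lcl_list, use_low=NULL, use_width=NULL )
  {
    list( low=use_low, width=use_width )
  }

by_position <- show_args( c(1,2,3), 190, 10 )                     # matched in order
by_name     <- show_args( c(1,2,3), use_width=10, use_low=190 )   # matched by name
defaults    <- show_args( c(1,2,3) )                              # both keep the default NULL

identical( by_position, by_name )   # same assignment either way
is.null( defaults$low )             # no argument was given, so NULL was used
```

Giving the arguments by position or by name leads to the same assignment, and leaving them out leaves both parameters as NULL.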

Finally, on line 1, the function heading includes an ellipsis, the three dots ..., and they indicate that we could send even more arguments to this function, but that at most we will be passing those arguments on to some function that we may call from within the body of this function collate3(). Thus, when we make the statement collate3( L1, 190, 10, right=FALSE), R finds no fault with the statement even though there is not a specific parameter in the heading to match the argument given as right=FALSE. As we will see below, in discussing line 36, the extra argument may be passed to yet another function.
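Here is a minimal sketch of how the ellipsis hands extra arguments along. The function bin_values() and the data in vals are made up for this illustration. The only thing taken from collate3() is the idea of placing ... in the heading and again in the call to cut().

```r
# bin_values() is a hypothetical wrapper, used only to illustrate "..."
bin_values <- function( lcl_list, lcl_breaks, ... )
  {
    cut( lcl_list, breaks=lcl_breaks, ... )   # any extra arguments go on to cut()
  }

vals <- c( 10, 15, 20 )                       # made-up data
default_bins <- bin_values( vals, c(0,10,20,30) )                # no extra argument
left_closed  <- bin_values( vals, c(0,10,20,30), right=FALSE )   # right=FALSE reaches cut()

as.character( default_bins )   # intervals closed on the right, such as (0,10]
as.character( left_closed )    # intervals closed on the left, such as [10,20)
```

Note that bin_values() has no parameter named right, yet the second call is accepted and changes which interval the values 10 and 20 fall into.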

Once the function heading is complete, with the closing right parenthesis at the end of line 1, we need to specify the body of the function. The body of the function will be enclosed in braces, between { and a matching }. Line 2 holds our opening left brace.

Please note that the location of that brace on a separate line is just a matter of style. In fact, there are many people who prefer to put that opening brace at the end of the heading, that is, at the end of line 1. Such a choice is again just a matter of style, though that is a popular style.

Lines 3 through 14

These lines are either blank or they are comments. Blank lines are helpful to a person reading the function because they provide vertical spacing between logical portions of the function. Comments start with the # symbol and extend from there to the end of the line. The use of ## in this particular function is again a matter of style. It helps to really set off the lines as comments. However, the second # is not needed at all.

As noted at the start of this page, this particular listing actually has a comment on every line because every line ends with the # followed by the line number. This was done to make it possible to reference specific lines in the function. Because the line numbers are in comments, the function even with the additional line numbers will work just as well as it does without the line numbers.

The comments in these lines are meant to explain what this function does. It is always nice if the comments accurately describe the actions of the function, but, obviously, since these are just comments, we could say anything here.

Lines 15 through 17

Understanding that any time collate3() is used it must be used with an argument that is matched to the parameter called lcl_list, these three statements set some local variables to values representing the minimum value, the maximum value, and the number of values in the calling argument.

Lines 19 through 29

These lines are all part of one statement. The statement starts with the if in line 19. That if is followed by a logical expression, enclosed in parentheses, also on line 19. If the value of the logical expression is true then the single statement following the closing right parenthesis will be performed. In this case that single statement is the block indicated by the opening left brace, {, on line 20, and extending through the closing right brace, }, on line 29. That block of statements is considered to be the one statement following the if() even though there are multiple statements inside the block. Again, note that it is a matter of style to locate the opening left brace, {, on a separate line. A different, popular style is to place that brace immediately after the closing parenthesis of the logical condition.

In the logical structure of the collate3() function, these are the statements that will take care of the situation where we do not specify arguments that match use_low and use_width. That is, when we call the function via the statement collate3(L1) these are the statements that provide a response that tells us, the user, the lowest value, the highest value, and makes a suggestion as to the width to use.

On line 19, the structure is.null(use_low) | is.null(use_width) uses the vertical bar, |, to represent the idea of or. Therefore, the meaning of that structure is to produce a true value if we are missing either one or both of the values use_low and use_width. If the calling statement had provided both of the values, as in collate3(L1,190,10), then that structure would evaluate as false or false which is the value false.
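The test on line 19 can be tried by itself in the Console. In this small sketch the values given to use_low and use_width are assigned directly, rather than arriving as arguments, just to show the two outcomes.

```r
use_low <- NULL
use_width <- 10
missing_one <- is.null(use_low) | is.null(use_width)    # true or false

use_low <- 190
missing_none <- is.null(use_low) | is.null(use_width)   # false or false

missing_one     # TRUE, so the block on lines 20 through 29 would run
missing_none    # FALSE, so that block would be skipped
```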

All of this means that if we are missing one or both of those values then the function will display (that is what cat() does) the lowest value, the highest value, and then a suggested width that is just one tenth of the difference between the high value and the low value. All of that is followed by a text message that the command should be given again specifying more values, and a message that the system is waiting.
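As a rough illustration, the same three cat() statements can be run on a short list of made-up values. The values in L1 below are invented for this sketch and are not data from the original page.

```r
L1 <- c( 192, 198, 205, 211, 219, 247 )   # made-up data for illustration
lcl_real_low <- min( L1 )
lcl_real_high <- max( L1 )
suggested_width <- (lcl_real_high-lcl_real_low) / 10    # (247-192)/10

cat(c("The lowest value is ",lcl_real_low ,"\n"))
cat(c("The highest value is ", lcl_real_high,"\n" ))
cat(c("Suggested interval width is ", suggested_width,"\n" ))
```

With these values the suggested width works out to 5.5. In practice we would probably choose a more convenient width, such as 10, when we repeat the command.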

Finally, on line 28, we find the return( "waiting..." ) command. The return causes R to leave the function, ending any further computation within the function, and, in this case, passing the value waiting... back as the final value of the function. In the cat() commands on lines 22 through 27, the \n is the character that ends a line of output and starts the next output text on a new line.
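The pattern of leaving a function early can be seen in miniature in the sketch below. The function check_width() is hypothetical. It copies only the shape of lines 19 through 29, not their content.

```r
# check_width() is a hypothetical miniature of the early-exit pattern
check_width <- function( use_width=NULL )
  {
    if( is.null(use_width) )
    {
      cat("waiting...\n")          # display the message and end the output line
      return( "waiting..." )       # leave the function right here
    }
    use_width * 2                  # reached only when use_width was given
  }

early <- check_width()       # displays waiting... and returns that text
late  <- check_width( 10 )   # skips the block and goes on to compute 20
```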

Line 30

Line 30 is another comment line. In this case the comment serves to let us know that the only way that we should be this far into the function is if we actually have values for use_low and use_width.

Lines 31 through 34

Our first task is to set up the partitions. We know the low end of the partitions, it is in use_low, and we know the width of the partitions, it is in use_width. In order to use the seq() function to set up the break points we still need to know the high value.

Line 31 computes the number of partitions we will need to cover the range of values by finding the range between the high data value, stored in lcl_real_high, and the starting value for our partitions, stored in use_low, dividing that range by the width of the partitions, using the function floor() to drop any decimal part of that answer, and then adding 1 to replace that dropped decimal part with a whole interval.

Line 32 can then compute the maximum value that we will use in our partitions. It does this by starting at the low value and then adding a width value for every partition that we will use.

Line 33 uses the seq() function to actually create the break points.

Line 34 uses a slight modification of that same seq() function to find the midpoints of each of the partitions. It does this by starting the sequence half an interval width above the low end and ending the sequence half an interval below the top end. Of course, the width between midpoint values is the same as the width of the intervals.
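To see the arithmetic of lines 31 through 34 with actual numbers, here is a small worked sketch. The high value of 247 is assumed just for illustration. The low value of 190 and the width of 10 match the sample call collate3(L1,190,10).

```r
lcl_real_high <- 247     # assumed highest data value, for illustration only
use_low <- 190
use_width <- 10

use_num_bins <- floor( (lcl_real_high - use_low)/use_width)+1    # floor(5.7)+1
lcl_max <- use_low+use_width*use_num_bins                        # 190+10*6
lcl_breaks <- seq(use_low, lcl_max, use_width)                   # 190, 200, ..., 250
lcl_mid <- seq(use_low+use_width/2, lcl_max-use_width/2, use_width)   # 195, 205, ..., 245

use_num_bins
lcl_breaks
lcl_mid
```

Here 57 divided by 10 is 5.7, floor() drops that to 5, and adding 1 gives 6 partitions. Six partitions need seven break points, and each partition has one midpoint.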

Line 35

Line 35 was left blank to add a little vertical spacing and to separate the preparation work from the rest of the work of the function. Again, this is a matter of style. It does not hurt R to have such vertical spacing and it can be a big help to a person reading the function.

Lines 36 through 43

This section is the real work of the function. Here the function cuts up the data, finds the frequencies, creates the data frame, and finds and appends more values to that data frame.

Line 36 uses the cut() function to create the partitions that correspond to each of the data values in lcl_list based on the breaks that were calculated back in line 33. In addition, because the function call ends with the ellipsis, if the function collate3() was called with extra arguments, such as right=FALSE, then any such extra arguments are passed on to this call of the cut() function.

Once the partitions have been created, line 37 uses them to create a table structure to hold the frequency of each partition which is the same as holding the frequency of the values in lcl_list that fall into each of the intervals. Remember that the table structure also holds the different values of the partitions, such as (210,220], as the labels for the frequencies.
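The following sketch runs lines 36 and 37 on a few made-up values so that we can see both the labels and the frequencies. The data and the break points are invented for this illustration.

```r
lcl_list <- c( 192, 198, 205, 211, 219, 220 )   # made-up data
lcl_breaks <- seq( 190, 230, 10 )               # 190, 200, 210, 220, 230

lcl_cuts <- cut( lcl_list, breaks=lcl_breaks )  # one interval label per data value
lcl_freq <- table( lcl_cuts )                   # count the values in each interval

names( lcl_freq )        # the interval labels
as.vector( lcl_freq )    # the frequencies, including 0 for an empty interval
```

Because the default intervals are closed on the right, the value 220 is counted in (210,220] and the interval (220,230] ends up with a frequency of 0.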

The next step is to convert the table structure that we had stored in lcl_freq into a data frame. Line 38 does this, creating the variable lcl_df.

Lines 39 through 43 merely add more columns to the data frame. First we append the already computed midpoints. Then we compute and append, in order, the relative frequency, the cumulative frequency, the relative cumulative frequency, and the number of degrees to allocate if we were to have to draw a pie chart.
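Here is a small sketch of the column-building steps, again using made-up data. The statements mirror lines 36 through 43, except that the midpoints are left out to keep the example short.

```r
lcl_list <- c( 192, 198, 205, 211, 219, 220 )   # made-up data
lcl_size <- length( lcl_list )

lcl_cuts <- cut( lcl_list, breaks=seq( 190, 230, 10 ) )
lcl_freq <- table( lcl_cuts )
lcl_df <- data.frame( lcl_freq )                # the counts land in a column named Freq

lcl_df$relfreq <- lcl_df$Freq/lcl_size                  # share of all the values
lcl_df$cumulfreq <- cumsum( lcl_df$Freq )               # running total of the counts
lcl_df$cumulrelfreq <- lcl_df$cumulfreq / lcl_size      # running total as a share
lcl_df$pie <- round( 360*lcl_df$relfreq, 1 )            # degrees out of 360

lcl_df
```

With frequencies of 2, 1, 3, and 0 out of 6 values, the pie column holds 120, 60, 180, and 0 degrees, which together make up the full 360.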

Line 44

Just another blank line that provides a little vertical spacing.

Line 45

Line 45 could have been written as return( lcl_df ) but because it is the last line to be performed in collate3() writing it as just lcl_df does the same thing. This last expression becomes the value of the function, the result of calling the function. Thus, if we had called the function with a statement such as
collate3(L1,190,10)
the result would be the value of the data frame that we created in the function. As such, that value would be displayed in the Console. On the other hand, if we had called the function with a statement such as
df <- collate3(L1,190,10)
the result would still be the value of the data frame that we created in the function but that value would be assigned to the variable df.
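
The two ways of ending a function can be compared with a pair of tiny functions. Both last_expr() and with_return() are hypothetical and exist only for this sketch.

```r
# Two hypothetical functions that differ only in how they hand back a value
last_expr   <- function( x ) { x * 2 }             # value of the last expression
with_return <- function( x ) { return( x * 2 ) }   # explicit return()

a <- last_expr( 5 )       # assigned, so nothing is displayed
b <- with_return( 5 )
identical( a, b )         # both styles give the same result

last_expr( 5 )            # not assigned, so the value is displayed in the Console
```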

Line 46

This is the closing brace, }, that matches the opening brace found back on line 2. This marks the end of the block statement that serves as the one statement following the function header.

©Roger M. Palay     Saline, MI 48176     November, 2015