Probability: Assessing Normality

Return to Topics page
We know that the normal distribution is continuous, symmetric, unimodal (it only has one modal region), and relatively free from outliers (extreme values, relative to the spread of other values, are extremely rare). We should also know that we will never find a real population or sample that is exactly normal. Every collection of real values is at best close to being normally distributed. It would be nice to have some method to say that some sample is approximately normal. This turns out to be a big challenge. About the best we can do is to come up with a list of characteristics that should disqualify a sample from being called approximately normal. We can go back to our earlier characterization and say that any one of the following should disqualify a sample from being called approximately normally distributed.

Disqualify as normal if a sample

has distinct values, and not a huge number of distinct values, so that it is hard to believe it is continuous,
is clearly not symmetric,
has more than one modal region,
has excessively many or extreme outliers, or
does not have a linear quantile plot

That last item will need a bit of explanation, but we will get to it later.

First, consider an example that has too few different values to be called continuous.
The sample in Table 1 may have 23 values but there are only 3 distinct values, 56, 57, and 58. The does not look like it is a continuous measure. We could certainly have a sample that has no more spread than this and yet it could easily be construed as continuous. Such a sample appears in Table 2.
Small samples mean that we just do not have enough data to claim that something is normal. How small is small? Nobody knows but clearly 4 is too small and 30 is probably more than enough. All the examples below will have more than 30 items so there should not be any question about sample size in those examples.

Our next sample is shown in Table 3.
We have plenty of values here. How do we look at things like symmetry, modal regions, and outliers? We have some tools to help, namely, dot plots, box and whisker plots, and histograms. We can do each of these for the data in Table 3.

We recall from our page on making dot plots that R has no built-in command to do dot plots. Instead, we developed our own function to make these plots. A listing of that function is included here for your convenience.

dot_plot<-function( this_list, ... )
{ 
  ## the first thing to do is to just sort the list into a local copy
  
  lcl_list <- sort( this_list )
  
  ## then we want a second list that is just as long as was the
  ## original list, because, in that second copy we will place the
  ## vertical position of the associated value in the sorted copy
  
  lcl_count <- lcl_list
  
  ## then, to start, we begin at the first item in the sorted list 
  ## It will have  avertical position of 1
  cur_val <- lcl_list[1]
  m <- 1
  lcl_count[1]<-1
  
  ## now we just move through the rest of the sorted
  ## list and if we are at the same value then we go up one 
  ## vertical level, but if we are at a new value we reset
  ## the vertical position to 1
  
  for (i in 2:length(lcl_list))
  { 
    x <- lcl_list[i]
    if ( x==cur_val )
    { m <- m+1
    lcl_count[ i ] <- m
    }
    else
    {
      cur_val <- x
      m <- 1
      lcl_count[i] <- m
    }
  }
  
  ## once we are done with that, we can just do a scatter plot on
  ## the two vectors that we have created.
  
  
  plot( lcl_list,lcl_count, xlab="", ylab="Frequency", ...)
}

Once the function is defined in our R session we can use it with the data in L1, the data values generated by the gnrnd4( key1=357848501, key2=15200083 ) function call. The command to do this is dot_plot(L1, ylim=c(0,14)). Figure 1 displays the contents of L1 along with our dot_plot() function call.

Figure 1

The result of the dot_plot() command is the dot plot shown in Figure 2.

Figure 2

That dot plot does not indicate that there is any problem with symmetry. There do not seem to be any outliers. The dot plot does raise questions about the sample in L1 having even one modal area. The values seem to be spread out, not collected in any areas at all.

A box plot might give us another view of the same data. The web page on Box plots gave numerous examples of ways to configure those charts, but for out purposes we can just take the default version of the command, boxplot(L1, horizontal=TRUE). It produces the plot shown in Figure 3.

Figure 3

Unfortunately, we do not learn much from this box plot. The left tail is a bit shorter than is the right tail, suggesting some asymmetry where the plot in Figure 2 did not, but the difference is not all that dramatic. There is no help here about the question of having a mode. There is some help in that the box plot does not indicate any recognized outliers. Recall that the box plot whisker lines stop at 1.5*IQR (the interquartile range) and leave outliers as marks beyond those whiskers. There are no such outliers in Figure 3. A histogram should give us better idea on the question of having a modal region. A simple hist(L1) produces the graph in Figure 4.

Figure 4

The histogram, for this data, could be interpreted in two ways. Fortunately, the conclusion about normality drawn from both interpretations is the same. On the one hand, we could say that the graph does suggest that there is a modal region at the extreme left and that data is skewed to the right (the long tail is on the right). That would mean that the data is asymmetric and not normally distributed. On the other hand, we could say that there is no strong modal area (we will see other examples where there will be one). Sure, there is a small pile of values to the left, but that could just be by chance. However, with no mode in the middle, we would conclude that this sample is not normally distributed.

Our last approach to this question is the quantile plot. We have not seen this before. The concept is a bit involved but worth understanding.

The idea is that if L1 represents an approximately normal distribution then we could sort the values in L1 and the resulting sorted list would correspond to the density of a normal distribution. That is, the lower and upper values would be more spread out and, as we approach the center of the values, they would be more tightly packed. We have 86 values in L1. In effect, we expect about 1/86th of the area under a normal curve to be between the sorted points. Well, we need to leave a little area for the tails too.

We could construct our own list where we are sure that the values correspond to the density of the normal distribution. The qnorm() function will allow us to find z-values for any size area that we choose. The entire area under the curve is 1 square unit. What we need is a list of z-scores that split the region into equal size areas. We start by finding 86 evenly spaced values strictly between 0 and 1. We can do this by first finding the number that is twice 86. That will be 86*2=172. Then the values that we want are 1/172, 3/172, 5/172, 7/172, ..., 167/172, 169/172, and 171/172. Each of those values has 2/172=1/86 between them. Of course, there are only 85 "between" areas. The final area is in the region below 1/172 (and above 0) and in the region above 171/172 (and below 1). From those values we can use qnorm() to find the corresponding z-score that has that area to its left. Our list of generated z-scores will be 86 values long and there will be an equal area between the scores.

Finally, if the values in L1 represent values from a normal distribution then when we plot the sorted version of L1 against our constructed list of z-scores the resulting plot should be points on a diagonal line. The following annotated statements will walk through this process.

n<-length(L1)            # find the number of items
q <- 2*n                 # make q be twice that number
p <- seq(1, q-1, 2 )     # make a sequence of the numerators
L2 <- p/q                # make a list of the values
L3 <- qnorm( L2 )        # get a list of z-scores
sorted_data <- sort(L1)  # get a sorted version of L1
plot(sorted_data, L3)    # make the plot

Figure 5 shows the console view of the statements.

Figure 5

Figure 6 shows the resulting plot.

Figure 6

Our expectation, if L1 is approximately normal, is to have the points fall on a diagonal line. The result shown in Figure 6 does not conform to that. As a result, we would say that we cannot justify calling L1 approximately normal.

Table 4 presents a new collection of data values.
First we use dot_plot(L1, ylim=c(0,14)) to generate the plot shown in Figure 7.

Figure 7

This shows a skewed distribution with a heavy count of items at the left end, meaning a long tail on the right: skewed to the right. We will use boxplot(L1, horizontal=TRUE) to generate the chart in Figure 8 to get another view of this.

Figure 8

Figure 8 confirms the view that the data values in L1 are skewed to the right. This is not a normal distribution. Still, we look at the histogram in Figure 9 generated by hist(L1).

Figure 9

The histogram shows the skewed distribution too.

Next we want to look at the quantile plot. We could just repeat the statements that we saw in Figure 5 to generate the plot for this data. However, since this sequence of statements is always the same we might as well put that sequence into a function. The statements in that function are:

assess_normality <- function( data_list )
{
  n <- length( data_list )
  sorted_data <- sort( data_list )
  q <- 2*n
  p <- seq(1, q-1, 2 )
  L2 <- p/q
  L3 <- qnorm(L2)
  plot( sorted_data, L3, 
        ylab="z values"
      )
}

and Figure 10 shows the console view of that function definition followed by our first use of the function.

Figure 10

The resulting graph is shown in Figure 11.

Figure 11

We already knew that the values in L1 do not represent a normal distribution and the fact that the dots in Figure 11 do not lie on a diagonal line merely confirms that determination.

Table 5 presents a new collection of data values.
The dot plot appears in Figure 12.

Figure 12

Those location of those dots look to be symmetric (within reason). There do not seem to be any outliers. And there seems to be a central modal region to the distribution. Nothing here says that this is not a normal distribution.

We look at the box plot in Figure 13.

Figure 13

This too looks completely normal. The median is about midway between the first and third quartile points, the whiskers seem about even and are even wider that the two middle regions.

We check the histogram in Figure 14.

Figure 14

Again, we find nothing to suggest that this is not a normal distribution.

Move to do a quantile plot via assess_normality( L1 ), shown in Figure 15.

Figure 15

Finally, we have a collection of data where the dots all nearly fall on a diagonal line. This is a confirmation that the distribution is approximately normal.

Table 6 presents a new collection of data values.
The dot plot appears in Figure 16.

Figure 16

We have a problem here. There seems to be two modal areas. We will see if there confirmations from the other views.

The box plot appears in Figure 17.

Figure 17

Figure 17 does not show any problems. We move to look at the histogram in Figure 18.

Figure 18

Figure 18 shows the two modal regions. This does not conform to the normal distribution.

The quantile plot should give us further confirmation.

Figure 19

Indeed, the dots in Figure 19 do not fall on a diagonal. The data in Table 6 do not represent a normally distributed sample.

Table 7 presents a new collection of data values.
The dot plot appears in Figure 20.

Figure 20

The distribution seems to be heavy around about 475 without having that be the central modal area. Also, the extreme low value is a concern. More views are needed.

We check out the box chart in Figure 21.

Figure 21

Figure 21 shows two clear outliers. Also, in that box plot we see that the median is not midway between the first and third quartile values. Finally, the first 25% of the values are much more spread out that are the last 25% of the values.

We turn to the histogram shown in Figure 22.

Figure 22

Figure 22 echoes the concerns that we have expressed above.

Finally, we look at the quantile plot in Figure 23.

Figure 23

The fact that the dots do not come close to being on a diagonal line confirms our view that the values in Table 7 are not a normally distributed collection of values.

Listing of all R commands used on this page

source( file="http://courses.wccnet.edu/~palay/math160r/gnrnd4.R")

gnrnd4( key1=745122201, key2=200056 )
gnrnd4( key1=2344122201, key2=20005600 )


gnrnd4( key1=357848501, key2=15200083 )


dot_plot<-function( this_list, ... )
{ 
  ## the first thing to do is to just sort the list into a local copy
  
  lcl_list <- sort( this_list )
  
  ## then we want a second list that is just as long as was the
  ## original list, because, in that second copy we will place the
  ## vertical position of the associated value in the sorted copy
  
  lcl_count <- lcl_list
  
  ## then, to start, we begin at the first item in the sorted list 
  ## It will have  avertical position of 1
  cur_val <- lcl_list[1]
  m <- 1
  lcl_count[1]<-1
  
  ## now we just move through the rest of the sorted
  ## list and if we are at the same value then we go up one 
  ## vertical level, but if we are at a new value we reset
  ## the vertical position to 1
  
  for (i in 2:length(lcl_list))
  { 
    x <- lcl_list[i]
    if ( x==cur_val )
    { m <- m+1
    lcl_count[ i ] <- m
    }
    else
    {
      cur_val <- x
      m <- 1
      lcl_count[i] <- m
    }
  }
  
  ## once we are done with that, we can just do a scatter plot on
  ## the two vectors that we have created.
  
  
  plot( lcl_list,lcl_count, xlab="", ylab="Frequency", ...)
}


dot_plot(L1, ylim=c(0,14))
boxplot(L1,  horizontal=TRUE)
hist(L1)


n<-length(L1)            # find the number of items
q <- 2*n                 # make q be twice that number
p <- seq(1, q-1, 2 )     # make a sequence of the numerators
L2 <- p/q                # make a list of the values
L3 <- qnorm( L2 )        # get a list of z-scores
sorted_data <- sort(L1)  # get a sorted version of L1
plot(sorted_data, L3)    # make the plot


gnrnd4( key1=734054702, key2=13900145 )
dot_plot(L1, ylim=c(0,14))
boxplot(L1,  horizontal=TRUE)
hist(L1)

assess_normality <- function( data_list )
{
  n <- length( data_list )
  sorted_data <- sort( data_list )
  q <- 2*n
  p <- seq(1, q-1, 2 )
  L2 <- p/q
  L3 <- qnorm(L2)
  plot( sorted_data, L3, 
        ylab="z values"
      )
}

assess_normality(L1)

gnrnd4( key1=236389404, key2=0001100438 )
dot_plot(L1, ylim=c(0,14))
boxplot(L1,  horizontal=TRUE)
hist(L1)
assess_normality( L1 )


gnrnd4( key1=227358705, key2=1800215,
        key3=2100290 )
dot_plot(L1, ylim=c(0,14), las=2)
boxplot(L1,  horizontal=TRUE)
hist(L1)
assess_normality( L1 )


gnrnd4( key1=830577509, key2=550819432219 )
dot_plot(L1, ylim=c(0,14), las=2)
boxplot(L1,  horizontal=TRUE)
hist(L1)
assess_normality( L1 )