0.3 Measures of Central Tendency

Chapter 0: Descriptive Statistics
0.1 Introduction
0.2 Kinds of Measurements
0.3 Measures of Central Tendency
0.4 Measures of Dispersion
0.5 Displaying Data

There are three measures of central tendency: the mean, the median, and the mode. To some extent, and in various cases, each of these values represents the "average" value in a set of data. These values produce a "typical" value for a set of data. Section 0.1 of this material demonstrated the difference between looking at a small set of data and looking at a larger set of data. For a tiny set of data we really do not need to do all of this work. However, for a larger set we compute the mean, median, and/or mode to give us a feel for the "typical" or "average" value in the set.

The mean is the "average" that we have learned in most math classes starting in elementary school. To find the mean we calculate the sum of the items in the data set, and then we divide that total by the number of items in the data set. Thus, the mean of the values

8, 7, 15, 4, 17, 6, 4, 5, 7, 8, 7 is

(8 + 7 + 15 + 4 + 17 + 6 + 4 + 5 + 7 + 8 + 7) / 11 = 88/11 = 8

The median is the "middle" value after all of the values have been sorted. We can sort our list of 11 values to get

4, 4, 5, 6, 7, 7, 7, 8, 8, 15, 17 and we find that 7 is the middle value. In the sorted list above we have 11 items, which means that the 6th item is in the middle. There are five numbers on each side of it in the sorted list. The value of the 6th item is 7, so 7 is the median value.

The mode is the most common, the most frequently occurring value in the list. If we examine the list again, we note that the number 7 appears 3 times, which is more times than any other value appears. Therefore, 7 is the mode value.
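
As a quick check of the three values found above, here is a minimal sketch in Python. It assumes Python's standard statistics module; the variable names are ours and are not part of the original material.

```python
import statistics

# The 11 example values from the text above.
values = [8, 7, 15, 4, 17, 6, 4, 5, 7, 8, 7]

mean_value = statistics.mean(values)      # sum of the items divided by the number of items
median_value = statistics.median(values)  # middle item of the sorted list
mode_value = statistics.mode(values)      # most frequently occurring item

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
```

For this small list the three results should agree with the hand calculations: a mean of 8, a median of 7, and a mode of 7.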

The Mean

Again, to find the mean we need to add all the values in the data set and divide the sum by the number of values. We can do this with any set of numbers but we should do it only in cases where adding the values makes sense. Interval and ratio measures are appropriate data for finding the mean. Nominal measures are not appropriate for finding the mean. Remember that nominal measures assign numbers as names. Let us go back to the example of the M&M's®. There we assigned numbers to the different colors of pieces of candy. We even reported a data set taken from a real pack of M&M's®. That data set was
{ 1, 1, 2, 1, 6, 6, 6, 7, 4, 1, 4, 3, 2, 3, 6, 2, 1, 6, 6, 6, 6, 7, 6, 6, 7, 2, 1, 6, 6, 7, 6, 4, 4, 2, 1, 6, 2, 4, 7, 6, 7, 3, 6, 6, 6, 6, 3, 6, 3, 7, 1, 6, 6, 2, 6}
There are 55 numbers in that data set. We could add up the values and divide by 55. This would produce the mean of that set of numbers. However, these are nominal measures. It does not make sense to add the numbers. A red (1) plus a green (2) do not make a yellow (3). Furthermore, our choice of coding values was totally arbitrary. We could have assigned 1 to dark brown and 2 to red and 3 to orange, and so on. This would have given us different values and would produce a different sum, and therefore a different mean. As tempting as it might be to compute a mean for nominal measures, the result is meaningless.
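
To see how arbitrary the coding is, the following sketch recodes the same 55 pieces of candy with a different set of number names and compares the results. It is only an illustration: the relabeling in recode is a hypothetical choice of ours, not a coding used in the original material, and the statistics module is Python's standard one.

```python
import statistics

# The 55 color codes reported in the text above (nominal measures).
candy = [1, 1, 2, 1, 6, 6, 6, 7, 4, 1, 4, 3, 2, 3, 6, 2, 1, 6, 6,
         6, 6, 7, 6, 6, 7, 2, 1, 6, 6, 7, 6, 4, 4, 2, 1, 6, 2, 4,
         7, 6, 7, 3, 6, 6, 6, 6, 3, 6, 3, 7, 1, 6, 6, 2, 6]

# A hypothetical alternative coding: same colors, different number names.
# This particular relabeling is an arbitrary choice made for illustration.
recode = {1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 1}
candy_recoded = [recode[code] for code in candy]

# The "mean color" depends on which numbers we happened to pick as names.
mean_original = statistics.mean(candy)
mean_recoded = statistics.mean(candy_recoded)

# The mode still names the same color under either coding.
mode_original = statistics.mode(candy)
mode_recoded = statistics.mode(candy_recoded)

print("mean with original codes:", round(mean_original, 3))
print("mean with recoded values:", round(mean_recoded, 3))
print("mode:", mode_original, "becomes", mode_recoded)
```

The candy in the pack has not changed, yet the two means differ. The mode, on the other hand, points to the same color in both codings, only under a different number name.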

What about using the mean with ordinal measures? In that case, the order of the numbers reflects the order of the data items. That is still not enough meaning and consistency to justify calculating a mean. One example of an ordinal measurement was assigning numeric values to the responses on an opinion questionnaire:

1. Strongly Disagree
2. Disagree
3. Neutral or no opinion
4. Agree
5. Strongly Agree

If two people rate a question as a 4, that does not mean that they have anywhere near the same sense of "agreement" with the question. Even the same person may mark two questions with a 4 and feel quite different about the actual extent of agreement with the two statements. Adding two Disagrees (2+2) does not have the same meaning as one Agree (4). Finding the mean of ordinal values does not make sense, and it should not be done.

Please be aware of the fact that few people understand the absurdity and inappropriateness of calculating the mean for ordinal values. In fact, finding such a mean is done all the time. Opinion surveys are given, results are scanned and coded, and the data is fed into computer programs that automatically calculate the mean of the data set. Then the results are studied and even published. The ease of doing a computation should not be confused with the appropriateness of doing it. The computer programs do not ask for the kind of data. Those programs are more than happy to compute the mean for any kind of data. It is the task of the responsible user to know the kind of data being studied, and the appropriate measures to use for that data.

When interval and ratio measures were introduced in an earlier section we noted that there is a degree of accuracy associated with these kinds of measurements. Finding the mean of interval and ratio measures produces a value that could easily be one of the measurements. It would make sense to find that the mean height of female math students is 143.5 cm. To find that the mean color of our package of M&M's® is 4.436 (which is the inappropriately computed mean of the values given above) does not make sense. There is no color assigned to that value. We had 4 as a blue and 5 as a light brown. The mean is not a color "between" these two. In the same way, to find that the mean response of students to an opinion survey statement is 3.741 may be an interesting calculation, but it is a meaningless value. One person's Strongly Disagree does not balance with another person's Strongly Agree to give a net opinion of "Neutral".

The mean is a powerful measure of central tendency for interval and ratio measurements. However, the mean is not without its problems. The value that we compute as the mean of a set of data can be highly influenced by extreme values in that data. This is especially true for small data sets or even for larger ones if the extreme values are outrageously extreme. Let us demonstrate this.

Our earlier example had the 11 values,

8, 7, 15, 4, 17, 6, 4, 5, 7, 8, 7 with a sum of 88 making the mean equal 88/11 or 8. All of the values in that data set are at least close to the mean, and one gets the feeling that the mean, 8, is representative of the numbers in the data set. Let us see what happens to the mean if we change just one value to be much more extreme, that is, to be far away from the rest of the values. Let us change 17 to 160. Now our data set is: 8, 7, 15, 4, 160, 6, 4, 5, 7, 8, 7 and the sum of those values is 88-17+160 or 231. There are still 11 data items in the set. However, we now calculate the mean as 231/11 or 21. The calculations are correct, but this value, 21, no longer seems representative of the 11 values in the data set. The one extreme value has "moved the mean" away from the rest of the values.

It is important to note that the influence of the one extreme value in the case above is related to both the magnitude of that value and to the size of the data set. As noted earlier, we tend to find small data sets in math problems. In those cases, an extreme value, such as the 160 above, will have a major impact on the value of the mean. However, if we have a data set that is similar to our original example, but with 1000 items that have a mean of 8, then changing one item from 17 to 160 will have a much smaller effect. For the mean of 1000 items to be 8 we must have the sum of those 1000 items be 8000 (since the mean is the sum of the items divided by 1000). If we change that one item from 17 to 160 we merely increase the sum of all the items by 143 (we take away the 17 and add the 160). Therefore, the new sum of all the items will be 8143 and the new mean will be 8143/1000 or 8.143, different from the original value of 8, but not all that different, and still feeling representative of the numbers in the data set. The value 160 is quite extreme compared with the values in our original list, but its impact is minimized by the large number of values (now 1000 values) in the much larger data set.

At the same time, even with a larger data set, outrageously extreme values, such as changing the 17 to a 160,000,000 will still make a huge difference in the value of the mean. Even with our 1000 element set, changing the one item to 160,000,000 will change the sum of all values to 160,007,983, and that makes the mean be 160,007,983/1000 or 160,007.983, quite a change from the original value of 8.
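
The three cases above follow the same arithmetic: recover the old sum from the mean, swap one value, and divide again. Here is a small sketch of that arithmetic. The function name mean_after_replacement is ours and hypothetical, and it assumes we know only the number of items and their mean, not the individual values.

```python
def mean_after_replacement(n, old_mean, old_value, new_value):
    """Mean of n items after one item changes from old_value to new_value."""
    old_sum = n * old_mean                     # the mean is the sum divided by n
    new_sum = old_sum - old_value + new_value  # take away the old item, add the new one
    return new_sum / n

# 11 items with mean 8, changing 17 to 160.
small_set = mean_after_replacement(11, 8, 17, 160)

# 1000 items with mean 8, changing 17 to 160.
large_set = mean_after_replacement(1000, 8, 17, 160)

# 1000 items with mean 8, changing 17 to 160,000,000.
outrageous = mean_after_replacement(1000, 8, 17, 160_000_000)

print(small_set, large_set, outrageous)
```

The same change of one item moves the mean from 8 to 21 in the small set, but only to 8.143 in the set of 1000. Only the outrageously extreme value overwhelms the larger set.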

Attention should be paid to outrageously extreme values. The mean wealth of people staying at a particular hotel on a particular night would certainly be affected by a Bill Gates visit, even if the hotel had 500 guests. However, the mean height of guests at that same 500-guest hotel would be changed by a visit from an NBA team, but not by that much. The NBA players are much taller than the other guests, but they are not outrageously taller.

For data sets with a large number of items, and where outrageously extreme values are not possible, it is hard to change the mean of the data set by adding or changing a small percentage of the values. For example, let us say that there are 10,000 registered students at Washtenaw Community College, and that the mean age of those students is 28.5 years. Assuming we are not going to change those 10,000 students, how many 18-year-old students (additional new high school graduates) would WCC need to enroll in order to bring the mean age down to 25? The answer is that 5,000 new, additional 18-year-old students would have to be enrolled. [The actual calculation of this is left as an exercise.] WCC would need a 50% increase in enrollment, all 18-year-olds, in order to lower the mean age by 3.5 years.
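
Without working the exercise itself, we can at least confirm that the stated answer produces the desired mean. The sketch below simply combines the two groups; the variable names are ours.

```python
# Figures taken from the example above.
current_students = 10_000
current_mean_age = 28.5
new_students = 5_000       # the answer stated in the text
new_student_age = 18

# Total of all ages: each group contributes (number of students) x (mean age).
total_age = (current_students * current_mean_age
             + new_students * new_student_age)

combined_mean_age = total_age / (current_students + new_students)
print(combined_mean_age)
```

Trying smaller values for new_students shows how slowly the combined mean moves when the original group is large.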

The Median

To find the median of a data set we need to sort the values and find the middle value. We cannot sort something if it does not have an order. Of course our numbers have an order. We can always sort a set of numbers. Ordinal, interval, and ratio measurements use the numbers to reflect the order of the underlying data. Therefore, it is appropriate to look at the median of ordinal, interval, and ratio measurements. Nominal measurements merely assign numbers as names. There is no order to the underlying data. Therefore, even though we could sort the numbers used in a nominal measure, it is not appropriate to do so, and it is inappropriate to talk about the median of nominal measures.

The 11-number data set that we have been using as an example is:

8, 7, 15, 4, 17, 6, 4, 5, 7, 8, 7 We never stated the kind of measurements that were behind this set of numbers. If these are nominal measures then we should not use the example data set to demonstrate finding the median. If, however, these 11 values represent ordinal, interval, or ratio measures then we can use the example to find a median. We sort these values as 4, 4, 5, 6, 7, 7, 7, 8, 8, 15, 17 and then we find the middle term, the 6th one from either end, and we identify that item as being the median. Note that there are just as many values that are to the left of the median in the sorted list as there are to the right of the median value.

It was convenient to have an odd number of items in our example. With an odd number of items there is always a middle term. What do we do when we have an even number of items? What if we remove the first three from our old list and look at the remaining list of 8 items:

4, 17, 6, 4, 5, 7, 8, 7 Those items sort to 4, 4, 5, 6, 7, 7, 8, 17 We note that there is no middle item. We have a rule to take care of this case. The rule tells us to find the two middle items, find their mean, and then use the mean of those two middle items as the median. In our short example we have 4, 4, 5, 6, 7, 7, 8, 17 and (6+7)/2 = 6.5. Therefore, the median is 6.5. Note that for this data the median turned out to be a value that was not even in the set of data values.

Before we go any further, let us develop a method for calculating the position of the middle item or items. Let us say that we have an odd number of values, which we will call n. We know that n is an odd number. Therefore, n+1 is an even number, which means that n+1 is evenly divisible by 2. In fact, if n is odd then (n+1)/2 is the item number of the middle term in the sorted list. In our earlier example, when we had 11 items in the list, the middle term was (11+1)/2 or the 6th item in the sorted list. The 6th item has 5 items to its left and 5 items to its right.

On the other hand, if we have an even number of items then we need the item numbers of the two middle terms. If the number of items is n, and n is even, then n is evenly divisible by 2, and the two middle terms will be items n/2 and (n/2)+1. In our example of 8 items, the two middle items are 8/2 (the 4th) and (8/2)+1 (the 5th) items in the sorted list. Remember that once we find the two middle terms, the median is the mean of those two values.
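
The two position rules can be written out directly. The following is a minimal sketch, assuming a non-empty list of numbers; the function name median_by_position is ours. Note that the item numbers in the text start at 1, while Python list indexes start at 0, so the code subtracts 1 when it looks up an item.

```python
def median_by_position(values):
    """Median found by locating the middle item number(s) in the sorted list."""
    if not values:
        raise ValueError("the median of an empty list is not defined")
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the middle term is item number (n+1)/2.
        middle = (n + 1) // 2
        return ordered[middle - 1]
    # Even n: the two middle terms are items n/2 and (n/2)+1.
    lower = n // 2
    upper = n // 2 + 1
    return (ordered[lower - 1] + ordered[upper - 1]) / 2

eleven_items = [8, 7, 15, 4, 17, 6, 4, 5, 7, 8, 7]
eight_items = [4, 17, 6, 4, 5, 7, 8, 7]

print(median_by_position(eleven_items))  # uses the 6th sorted item
print(median_by_position(eight_items))   # uses the 4th and 5th sorted items
```

For the two example lists this should reproduce the medians found by hand, 7 and 6.5.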

In the 8-item example given above, the median turned out to be 6.5, a value that was not one of the values in the original list. This happens often in math class and test problems, but rarely in the real world where we have much larger data sets. With a large data set with an even number of items each item tends to appear many times. Most of the time with large data sets the two middle items will have the same value. In that case, the median is the mean of those two values, which is the same value.

We recall that the mean is affected by extreme values. The median is not so affected. We found the median of

8, 7, 15, 4, 17, 6, 4, 5, 7, 8, 7 to be 7. If we replace the 17 by 160 (as we did when we were looking at the mean), the median remains the same. We could even replace the 17 by 160,000,000 and the median would stay the same. The median is not affected by even extremely outrageous values. The median wealth of people staying at a hotel on a given night will not change much at all if Bill Gates suddenly registers there. Nor will the median height of all the guests change much if 11 or 12 NBA players check in. Adding 12 very tall guests merely shifts the "middle guest" up 6 people, and we expect that the people in the middle height range are all about the same height. The median value will probably increase, but not by a huge amount.
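
Here is a short sketch of that comparison, using Python's standard statistics module. It replaces the 17 first with 160 and then with 160,000,000, and computes both the median and the mean each time. The variable names are ours.

```python
import statistics

original = [8, 7, 15, 4, 17, 6, 4, 5, 7, 8, 7]

# Replace the single 17 with more and more extreme values.
extreme = [160 if value == 17 else value for value in original]
outrageous = [160_000_000 if value == 17 else value for value in original]

data_sets = (original, extreme, outrageous)
medians = [statistics.median(data) for data in data_sets]
means = [statistics.mean(data) for data in data_sets]

print("medians:", medians)
print("means:  ", means)
```

The medians should all be 7, while the mean climbs from 8 to 21 and then far beyond.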

The Mode

The mode is the most frequently occurring value in the data set. The mode requires neither adding the values nor sorting them into an order (although sorting does help group values together so that we can count the frequency of each and determine the mode). As such, the mode is appropriate for all kinds of measurements. It is the only measure of central tendency that should be used with nominal measures. However, as universally applicable as the mode is, it really does not tell us much by itself. It is merely a popularity contest. To be really helpful we need to know the extent of that popularity. If we find the mode manufacturer of vehicles in the parking lot is 2 (Ford, in our much earlier example), we do not know if there was just one more Ford vehicle than there were each of GM, DC, Toyota, Honda, and/or others, or if there were more Ford vehicles than all others combined, or somewhere between these extremes. All we know is that Ford vehicles outnumber those of any other vehicle manufacturer.
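
One way to report the extent of that popularity is to give the count along with the mode. The vehicle counts from the earlier example are not repeated in this section, so the sketch below uses the candy color codes listed earlier instead. It assumes Python's standard collections.Counter; the variable names are ours.

```python
from collections import Counter

# The 55 candy color codes from earlier in this section (nominal measures).
candy = [1, 1, 2, 1, 6, 6, 6, 7, 4, 1, 4, 3, 2, 3, 6, 2, 1, 6, 6,
         6, 6, 7, 6, 6, 7, 2, 1, 6, 6, 7, 6, 4, 4, 2, 1, 6, 2, 4,
         7, 6, 7, 3, 6, 6, 6, 6, 3, 6, 3, 7, 1, 6, 6, 2, 6]

counts = Counter(candy)

# most_common(2) lists the two most frequent codes with their counts.
(mode_code, mode_count), (runner_up_code, runner_up_count) = counts.most_common(2)

mode_share = mode_count / len(candy)

print("mode:", mode_code, "appearing", mode_count, "times")
print("runner-up:", runner_up_code, "appearing", runner_up_count, "times")
print("share of all pieces:", round(mode_share, 2))
```

Here the mode is far ahead of the runner-up, though it still accounts for fewer than half of the pieces. Reporting the counts makes that visible; the mode alone does not.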

In our introduction to mode we need to cover the case where there is a tie. For such a case we say that there are two mode values, or three if there is a three-way tie, and so on. We will often see these kinds of data sets in math problems. For example, we might have the data set

8, 7, 15, 4, 17, 6, 4, 5, 7, 8, 7 where we find that the mode is 7, or the data set 4, 7, 15, 4, 17, 6, 4, 5, 7, 8, 7

where there are two mode values, 4 and 7. This is called a bi-modal data set.
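
Ties can be handled in code as well. The sketch below assumes Python 3.8 or later, where the standard statistics module provides multimode, which returns a list of all the values tied for most frequent.

```python
import statistics

one_mode = [8, 7, 15, 4, 17, 6, 4, 5, 7, 8, 7]
two_modes = [4, 7, 15, 4, 17, 6, 4, 5, 7, 8, 7]

# multimode returns every value that ties for the highest frequency,
# in the order the values are first encountered in the data.
modes_first = statistics.multimode(one_mode)
modes_second = statistics.multimode(two_modes)

print(modes_first)
print(modes_second)
```

The first list should give the single mode 7, and the second should give both 4 and 7.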

In real life, with large data sets, there are rarely any ties. The web version of this page generates a data set of 163 values at this point and reports its most frequently occurring value. If you reload the page you will see another list. It is possible to have two or more values that appear equally often in these lists, but it should not happen all that often. (The lists generated for this section do not have a requirement for a single or double mode. The numbers are generated and then the mode is determined. However, we expect to find a bi-modal distribution for about 1 out of every 12 lists generated here. This means that if you reload the page over and over you will probably see a bi-modal list within the first 12 examples, even more likely within the first 24 examples, and almost certainly within the first 36 examples.)

It is interesting to note how the median, the mean, and the mode compare in such a generated data set. All three measures of central tendency are close, and all seem to represent the center of the values in the list. Even if there is more than one mode, the modal values should be quite close together. All of this is a consequence of the way the values are generated within the page. Often when we have interval or ratio measures, the expectation is that extreme values will happen rarely, while the values toward the center of the data will happen more often. Thus, in those situations, it is reasonable to expect that the mode will be close to the mean and the median.

Remember that for nominal measures we should not be computing the mean or median. The mode stands by itself in those cases. The mode joins the median, though not the mean, as being reasonably computed for strictly ordinal values (that is, for values that have an order, but that are not interval or ratio in nature). However, the mode may be far different from the median in ordinal measurements.
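
The words "probably", "even more likely", and "almost certainly" can be given rough numbers. The sketch below takes the rate of 1 bi-modal list in every 12 from the text and assumes, as our own simplification, that each reload is independent of the others. The function name chance_within is ours.

```python
# Rate of bi-modal lists stated in the text.
p_bimodal = 1 / 12

def chance_within(reloads, p=p_bimodal):
    """Chance of seeing at least one bi-modal list in the given number of reloads.

    Assumes every reload is independent, which is our simplification.
    """
    chance_of_none = (1 - p) ** reloads
    return 1 - chance_of_none

chances = {reloads: chance_within(reloads) for reloads in (12, 24, 36)}

for reloads, chance in chances.items():
    print(reloads, "reloads:", round(chance, 3))
```

Under these assumptions the chances come out to roughly 65%, 88%, and 96% for 12, 24, and 36 reloads.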

Up to this point we have emphasized the overall most popular value as being the mode. Furthermore, we noted that a two-way tie in overall popularity is called a bi-modal data set. We will return to this concept later, after we have presented frequency distribution charts in Section 0.5.

Summary for Measures of Central Tendency

The measures of central tendency are the mean, the median, and the mode. These values are computed from the data that we are given. The mean and median are two different attempts to find a single number to represent the "center" of the data values. The mode (which may be one or more values) is used to identify the most popular value or values in the data set. In many cases, as in the example above, the mode will approximate the "center" of the data values. Each measure has its strengths and weaknesses. Each measure has its appropriate and its inappropriate uses.

The mode is the only appropriate measure of central tendency for data that is nominal in nature, data that does not come from a "yardstick" measurement and that does not have an underlying order.

The median is appropriate if the original data has an underlying order. The median is especially useful if extreme values significantly move the mean of the data. The median is generally the preferred measure of central tendency if our values consist of counting things, rather than measuring them.

The mean is the most appropriate measure of central tendency if the original data stems from measurements. We need to remember that the mean may be influenced by extreme values. The mean is useful in characterizing large sets of counted values, especially when used in conjunction with the median of the data values.

With these measures of central tendency in hand, we now turn to "Measures of Dispersion" to tell us how close the data values are to the central values.
