There are three measures of central tendency: the mean, the median, and the mode. To some extent, and in various cases, each of these values represents the "average" value in a set of data. These values produce a "typical" value for a set of data. Section 8a.1 of this material demonstrated the difference between looking at a small set of data and looking at a larger set of data. For a tiny set of data we really do not need to do all of this work. However, for a larger set we compute the mean, median, and/or mode to give us a feel for the "typical" or "average" value in the set.
The mean is the "average" that we have learned in most math classes starting in elementary school. To find the mean we calculate the sum of the items in the data set, and then we divide that total by the number of items in the data set. Thus, the mean of the values 8, 7, 15, 4, 17, 6, 4, 5, 7, 8, 7 is
(8 + 7 + 15 + 4 + 17 + 6 + 4 + 5 + 7 + 8 + 7)/11 = 88/11 = 8
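The median is the "middle" value after all of the values have been sorted. We can sort our list of 11 values to get 4, 4, 5, 6, 7, 7, 7, 8, 8, 15, 17. The middle value, the 6th item in the sorted list, is 7. Therefore, 7 is the median.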
The mode is the most common, the most frequently occurring value in the list. If we examine the list again, we note that the number 7 appears 3 times, which is more times than any other value appears. Therefore, 7 is the mode value.
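The three calculations above can be checked with a short Python sketch. This is only an illustration under simple assumptions: the variable names are ours, and the median line relies on the list having an odd number of items.

```python
from collections import Counter

# The 11 example values from the text.
values = [8, 7, 15, 4, 17, 6, 4, 5, 7, 8, 7]

# Mean: sum of the items divided by the number of items.
mean = sum(values) / len(values)

# Median: the middle item of the sorted list (valid here because the length is odd).
ordered = sorted(values)
median = ordered[len(ordered) // 2]

# Mode: the value with the highest count, together with that count.
mode, mode_count = Counter(values).most_common(1)[0]

print(mean, median, mode, mode_count)  # prints: 8.0 7 7 3
```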
What about using the mean with ordinal measures? In that case, the order of the numbers reflects the order of the data items. That is still not enough meaning and consistency to justify calculating a mean. One example of an ordinal measurement was assigning numeric values to the responses on an opinion questionnaire, such as responses running from Strongly Disagree to Strongly Agree.
Be aware that few people understand how absurd and inappropriate it is to calculate the mean for ordinal values. In fact, finding such a mean is done all the time. Opinion surveys are given, results are scanned and coded, and the data is fed into computer programs that automatically calculate the mean of the data set. Then the results are studied and even published. The ease of doing a computation should not be confused with the appropriateness of doing it. The computer programs do not ask what kind of data they are given. Those programs are more than happy to compute the mean for any kind of data. It is the task of the responsible user to know the kind of data being studied, and the appropriate measures to use for that data.
When interval and ratio measures were introduced in an earlier section we noted that there is a degree of accuracy associated with these kinds of measurements. Finding the mean of interval and ratio measures produces a value that could easily be one of the measurements. It would make sense to find that the mean height of female math students is 143.5 cm. That does not suggest that we will find a female math student whose height is 143.5 cm. In fact, as noted in the earlier discussion, there is no likelihood of finding such a student since measurements are only approximate. We may or may not find a female math student whose height is 143.5 cm, but knowing that the mean height of such students is 143.5 cm gives us a good feel for some "center" of the measurements for the group. For one thing, we know that, unless all of the female math students are 143.5 cm tall, there are some shorter and some taller than that measure. There could be 5 that are 142.5 cm tall and 1 that is 148.5 cm tall, but there have to be students on both sides of the mean (again, unless they are all the same height).
On the other hand, to find that the mean color of our package of M&M's® is 4.436 (which is the correctly though inappropriately computed mean of the values given above) does not make sense. There is no color assigned to that value. We had 4 as a blue and 5 as a light brown. The mean is not a color "between" these two. In the same way, to find that the mean response of students to an opinion survey statement is 3.741 may be an interesting calculation, but it is a meaningless value. One person's Strongly Disagree does not balance with another person's Strongly Agree to give a net opinion of "Neutral".
The mean is a powerful measure of central tendency for interval and ratio measurements. However, the mean is not without its problems. The value that we compute as the mean of a set of data can be highly influenced by extreme values in that data. This is especially true for small data sets or even for larger ones if the extreme values are outrageously extreme. Let us demonstrate this.
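Our earlier example had the 11 values 8, 7, 15, 4, 17, 6, 4, 5, 7, 8, 7, with a sum of 88 and a mean of 8. If we change just one value, the 17, to 160, the sum becomes 88 - 17 + 160 = 231 and the mean becomes 231/11 = 21. That new mean is larger than every value in the list except the 160 itself, so it no longer feels representative of the data.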
It is important to note that the influence of the one extreme value in the case above is related to both the magnitude of that value and to the size of the data set. As noted earlier, we tend to find small data sets in math problems. In those cases, an extreme value, such as the 160 above, will have a major impact on the value of the mean. However, if we have a data set that is similar to our original example, but with 1000 items that have a mean of 8, then changing one item from 17 to 160 will have a much smaller effect. For the mean of 1000 items to be 8 we must have the sum of those 1000 items be 8000 (since the mean is the sum of the items divided by 1000). If we change that one item from 17 to 160 we merely increase the sum of all the items by 143 (we take away the 17 and add the 160). Therefore, the new sum of all the items will be 8143 and the new mean will be 8143/1000 or 8.143, different from the original value of 8, but not all that different, and still feeling representative of the numbers in the data set. 160 is quite extreme compared to the values in our original list, but its impact is minimized by the large number of values (now 1000 values) in the much larger data set.
At the same time, even with a larger data set, outrageously extreme values, such as changing the 17 to a 160,000,000 will still make a huge difference in the value of the mean. Even with our 1000 element set, changing the one item to 160,000,000 will change the sum of all values to 160,007,983, and that makes the mean be 160,007,983/1000 or 160,007.983, quite a change from the original value of 8.
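The arithmetic in the last few paragraphs follows one pattern: recover the old sum from the mean, swap one value, and divide again. A minimal Python sketch of that pattern follows. The function name `mean_after_change` is our own illustrative choice, not something from the original material.

```python
def mean_after_change(n, old_mean, old_value, new_value):
    """Mean of n items after one item changes from old_value to new_value."""
    old_sum = n * old_mean                      # the sum is the mean times the count
    new_sum = old_sum - old_value + new_value   # take away the old item, add the new one
    return new_sum / n

small = mean_after_change(11, 8, 17, 160)            # the 11-item example
large = mean_after_change(1000, 8, 17, 160)          # the 1000-item example
huge = mean_after_change(1000, 8, 17, 160_000_000)   # an outrageously extreme value

print(small, large, huge)
```

The three results show the same single change moving the mean a great deal in the small set, very little in the large set, and enormously once the new value is outrageously extreme.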
Attention should be paid to outrageously extreme values. The mean wealth of people staying at a particular hotel on a particular night would certainly be affected by a Bill Gates visit, even if the hotel had 500 guests. However, the mean height of guests at that same 500-guest hotel would be changed by a visit from an NBA team, but it would not be changed by that much. The NBA players are much taller than the other guests, but they are not outrageously taller.
For data sets with a large number of items, and where outrageously extreme values are not possible, it is hard to change the mean of the data set by adding or changing a small percentage of the values. For example, let us say that there are 10,000 registered students at Washtenaw Community College, and that the mean age of those students is 28.5 years. Assuming we are not going to change those 10,000 students, how many 18-year-old students (additional new high school graduates) would WCC need to enroll in order to bring the mean age down to 25? The answer is 5,000 new, additional 18-year-old students would have to be enrolled. [The actual calculation of this is left as an exercise.] WCC would need a 50% increase in enrollment, all 18-year-olds, in order to lower the mean age by 3.5 years.
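For readers who want to check their answer to that exercise, here is one possible Python sketch. It assumes the setup stated above (10,000 students, mean age 28.5, new students all aged 18, target mean 25). The variable names are ours.

```python
current_students = 10_000
current_mean_age = 28.5
new_age = 18
target_mean = 25

# We want x such that
#   (current_sum + new_age * x) / (current_students + x) == target_mean.
# Rearranging gives
#   x = (current_sum - target_mean * current_students) / (target_mean - new_age).
current_sum = current_students * current_mean_age
x = (current_sum - target_mean * current_students) / (target_mean - new_age)

# Confirm the result by recomputing the mean with the x new students included.
new_mean = (current_sum + new_age * x) / (current_students + x)

print(x, new_mean)
```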
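The 11-number data set that we have been using as an example is 8, 7, 15, 4, 17, 6, 4, 5, 7, 8, 7. Sorted, it becomes 4, 4, 5, 6, 7, 7, 7, 8, 8, 15, 17, and the median is the middle value, 7.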
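It was convenient to have an odd number of items in our example. With an odd number of items there is always a middle term. What do we do when we have an even number of items? What if we remove the first three values (i.e., the 8, 7, and 15) from our original list and look at the remaining list of 8 items: 4, 17, 6, 4, 5, 7, 8, 7? Sorted, that list is 4, 4, 5, 6, 7, 7, 8, 17. There is no single middle item. Instead there are two middle items, the 6 and the 7, and the median is the mean of those two values: (6 + 7)/2 = 6.5.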
Before we go any further, let us develop a method for calculating the position of the middle item or items. Let us say that we have an odd number of values, which we will call n. We know that n is an odd number. Therefore, n+1 is an even number, which means that n+1 is evenly divisible by 2. In fact, if n is odd then (n+1)/2 is the item number of the middle term in the sorted list. In our earlier example, when we had 11 items in the list, the middle term was (11+1)/2 or the 6th item in the sorted list. The 6th item has 5 items to its left and 5 items to its right.
On the other hand, if we have an even number of items then we need the item numbers of the two middle terms. If the number of items is n, and n is even, then n is evenly divisible by 2, and the two middle terms will be items n/2 and (n/2)+1. In our example of 8 items, the two middle items are 8/2 (the 4th) and (8/2)+1 (the 5th) items in the sorted list. Remember that once we find the two middle terms, the median is the mean of those two values.
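The position rules for odd and even n can be written as a short Python sketch. The function names `middle_positions` and `median` are our own illustrative choices, and the sketch assumes a non-empty list.

```python
def middle_positions(n):
    """1-based position(s) of the middle item(s) in a sorted list of n items."""
    if n % 2 == 1:
        return [(n + 1) // 2]      # odd n: the single middle item
    return [n // 2, n // 2 + 1]    # even n: the two middle items

def median(values):
    """Median of a non-empty list, using the position rules above."""
    ordered = sorted(values)
    # Convert 1-based positions to Python's 0-based indexes.
    picks = [ordered[p - 1] for p in middle_positions(len(ordered))]
    return sum(picks) / len(picks)  # the mean of one item is that item itself

print(middle_positions(11), middle_positions(8))
print(median([8, 7, 15, 4, 17, 6, 4, 5, 7, 8, 7]))
print(median([4, 17, 6, 4, 5, 7, 8, 7]))
```

Averaging the picked items handles both cases with one expression, so the odd case needs no separate branch.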
In the 8-item example given above, the median turned out to be 6.5, a value that was not one of the values in the original list. This happens often in math class and test problems, but rarely in the real world where we have much larger data sets. With a large data set each item tends to appear many times. Most of the time with large data sets the two middle items will have the same value. In that case, the median is the mean of those two values, which is the same value.
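We recall that the mean is affected by extreme values. The median is not so affected. We found the median of 8, 7, 15, 4, 17, 6, 4, 5, 7, 8, 7 to be 7. If we change the 17 to 160, the sorted list becomes 4, 4, 5, 6, 7, 7, 7, 8, 8, 15, 160. The middle item is still 7, so the median does not change at all, even though the same change moves the mean from 8 to 21.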
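The contrast between the mean and the median can be seen with Python's standard `statistics` module. This sketch reuses the 11 example values and the change from 17 to 160 discussed earlier. The variable names are ours.

```python
import statistics

original = [8, 7, 15, 4, 17, 6, 4, 5, 7, 8, 7]
# Replace the single 17 with the extreme value 160.
changed = [160 if v == 17 else v for v in original]

mean_before = statistics.mean(original)
mean_after = statistics.mean(changed)
median_before = statistics.median(original)
median_after = statistics.median(changed)

print(mean_before, mean_after)      # the mean moves a long way
print(median_before, median_after)  # the median stays where it was
```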
In our introduction to mode we need to cover the case where there is a tie. For such a case we say that there are two mode values, or three if there is a three-way tie, and so on. We will often see these kinds of data sets in math problems. For example, we might have the data set
where there are two mode values, 4 and 7. This is called a bi-modal data set.
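A tie for the mode can be detected by counting every value and keeping all values that reach the highest count. The Python sketch below is one way to do that. The function name `modes` is ours. As a stand-in example we use the 8 values left after removing 8, 7, and 15 from the original list, which is our own choice of data and which happens to contain 4 twice and 7 twice.

```python
from collections import Counter

def modes(values):
    """Return every value that ties for the highest count, in ascending order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(modes([8, 7, 15, 4, 17, 6, 4, 5, 7, 8, 7]))  # one mode
print(modes([4, 17, 6, 4, 5, 7, 8, 7]))            # two modes: a bi-modal list
```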
In real life, with large data sets, there are rarely any ties. Here is a data set of 163 values:
The most frequently occurring value in the list is its mode. If you reload this page you will see another list. It is possible to have two or more values that appear equally often in these lists. It may have happened here. However, it should not happen all that often. (The lists generated for this section do not have a requirement for a single or double mode. The numbers are generated and then the mode is determined. However, we expect to find a bi-modal distribution for about 1 out of every 12 lists generated here. This means that if you reload this page over and over you will probably see a bi-modal list within the first 12 examples, even more likely within the first 24 examples, and almost certainly within the first 36 examples.)
It is interesting to note how the median, the mean, and the mode compare in the previous data set. All three measures of central tendency are close, and all seem to represent the center of the values in the list. Even if there is more than one mode, the modal values should be quite close together. All of this is a consequence of the way the values were generated within this page. Often when we have interval or ratio measures, the expectation is that extreme values will happen rarely, while the values toward the center of the data will happen more often. Thus, in those situations, it is reasonable to expect that the mode will be close to the mean and the median.
Remember that for nominal measures we should not be computing the mean or median. The mode stands by itself in those cases. The mode joins the median, though not the mean, as being reasonably computed for strictly ordinal values (that is, for values that have an order, but that are not interval or ratio in nature). However, the mode may be far different from the median in ordinal measurements.
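The "1 out of every 12" figure quoted above can be turned into rough probabilities for 12, 24, and 36 reloads. The Python sketch below assumes that each reload is independent and that the chance of a bi-modal list is exactly 1/12 each time. Both are simplifying assumptions of ours, and the function name `chance_within` is hypothetical.

```python
P_BIMODAL = 1 / 12  # assumed chance that any one generated list is bi-modal

def chance_within(k, p=P_BIMODAL):
    """Chance of seeing at least one bi-modal list in k independent reloads."""
    return 1 - (1 - p) ** k

for k in (12, 24, 36):
    print(k, round(chance_within(k), 3))
```

Under those assumptions the chances come out to roughly 65%, 88%, and 96%, which matches the wording "probably", "even more likely", and "almost certainly".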
Up to this point we have emphasized the overall most popular value as being the mode. Furthermore, we noted that a two-way tie in overall popularity is called a bi-modal data set. We will return to this concept later, after we have presented frequency distribution charts in Section 0.5.
The mode is the only appropriate measure of central tendency for data that is nominal in nature, data that does not come from a "yardstick" measurement and that does not have an underlying order.
The median is appropriate if the original data has an underlying order. The median is especially useful if extreme values significantly move the mean of the data. The median is generally the preferred measure of central tendency if our values consist of counting things, rather than measuring them.
The mean is the most appropriate measure of central tendency if the original data stems from measurements. We need to remember that the mean may be influenced by extreme values. The mean is useful in characterizing large sets of counted values, especially when used in conjunction with the median of the data values.
With these measures of central tendency in hand, we now turn to "Measures of Dispersion" to tell us how close the data values are to the central values.
©Roger M. Palay
Saline, MI 48176
November, 2010