Sampling

We start our look at inferential statistics by introducing the idea of taking a sample of a population. Sampling is the process of selecting a number of (we hope) representative values or items from a population of values or items. There are times when cost, practicality, or time makes it impossible for us to examine an entire population. Consider three examples.

First, let us say that we are about to buy a huge shipment of bolts of fabric. Perhaps there are 5000 bolts in the shipment. We want to be sure that the items in the shipment, the individual bolts of fabric, meet our strict standards. We simply cannot afford the time or the expense of unrolling each bolt so that we can inspect the 40 or 100 yards of cloth that are in each bolt. Instead, we decide to get a sample, a small number, perhaps 50, of representative bolts from the shipment. We carefully inspect each of those 50. Certainly, if all 50 are flawless, we have no reason to reject the shipment. Now it is certainly possible that the 5000 bolts in the shipment had 4950 bad bolts and only 50 good ones! But it is highly unlikely that we would have selected just the good ones to examine.

Likewise, if we inspect 50 bolts and they are all flawed beyond our standards, then we are going to reject the entire shipment. We tell the seller that we do not have the time or the money to search through the remaining 4950 bolts to see if there are any good ones. The whole shipment goes back and we turn to someone else to supply us with fabric.

The two instances above give two extremes. In inferential statistics we come up with understandings and rules for what we should do when we do not have such extreme cases. What should we do if in the 50 bolts that we do inspect we find just 2 that have some flaws in them? What should we do if we find 5 that are flawed? In the end we are going to base our decision on our examination of the sample of 50 bolts out of the population of 5000 bolts.
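To put a number on "highly unlikely" in the first extreme case, here is a minimal Python sketch. It assumes the 50 bolts are drawn at random, without replacement, from the 5000, so the count of flawed bolts in the sample follows a hypergeometric distribution. The function name prob_flawed_in_sample is made up for this illustration, and the "250 flawed bolts" scenario is a hypothetical figure chosen only to show an in-between case; it does not come from the shipment described above.

```python
from math import comb

def prob_flawed_in_sample(population, flawed, sample_size, k):
    """Probability that exactly k of the sampled items are flawed.

    Assumes a simple random sample drawn without replacement
    (the hypergeometric distribution).
    """
    good = population - flawed
    ways_to_pick_sample = comb(flawed, k) * comb(good, sample_size - k)
    return ways_to_pick_sample / comb(population, sample_size)

# Extreme case from the text: 4950 bad bolts, 50 good ones,
# and every one of the 50 bolts we inspect happens to be good.
p_all_good = prob_flawed_in_sample(5000, 4950, 50, 0)
print(f"P(all 50 sampled bolts are good) = {p_all_good:.3e}")

# HYPOTHETICAL in-between case: assume 250 of the 5000 bolts (5%) are flawed.
# How likely is it that we see at most 2 flawed bolts in our sample of 50?
p_at_most_2 = sum(prob_flawed_in_sample(5000, 250, 50, k) for k in range(3))
print(f"P(at most 2 flawed in the sample) = {p_at_most_2:.3f}")
```

The first probability is vanishingly small, which is the sense in which picking only the 50 good bolts is "highly unlikely." The second calculation hints at why the in-between cases need rules: under the assumed 5% flaw rate, finding 2 or fewer flawed bolts in a sample of 50 is not at all unusual.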

As a second example, consider the prognostications that surround elections. A year before the election we start getting statements like "Candidate X has 45% of the vote, candidate Y has 38% of the vote, and 17% are undecided." How do they get these numbers? They certainly do not hold a secret ballot for all potential voters and then just count the votes. Instead, these pollsters select a sample of the population and ask that sample how they would vote. Then, based on the results of that survey of the people in the sample, the pollsters come up with their estimate of what the general population would do if the election were held at that moment. In the small print they also tell you that their numbers are, at best, guesses and that there is an anticipated error of upwards of 4 or 5%.
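A small simulation can show why the pollsters attach an error figure to their numbers. The Python sketch below assumes, purely for illustration, that the quoted 45% / 38% / 17% split is the true state of the whole population, and that each poll asks 500 randomly chosen people; both the sample size and the function name simulate_poll are assumptions made here, not details of any real poll.

```python
import random

def simulate_poll(true_shares, sample_size, rng):
    """Draw one random sample of voters and return the observed share of each answer."""
    answers = list(true_shares)
    weights = [true_shares[answer] for answer in answers]
    sample = rng.choices(answers, weights=weights, k=sample_size)
    return {answer: sample.count(answer) / sample_size for answer in answers}

# ASSUMPTION: these are the true population shares (taken from the quoted statement).
true_shares = {"X": 0.45, "Y": 0.38, "undecided": 0.17}

rng = random.Random(2015)  # fixed seed so the run can be repeated

# Run five independent polls of 500 people each and print what each one "finds".
for poll_number in range(1, 6):
    result = simulate_poll(true_shares, sample_size=500, rng=rng)
    formatted = {answer: f"{share:.1%}" for answer, share in result.items()}
    print(f"Poll {poll_number}: {formatted}")
```

Each simulated poll gives slightly different percentages even though the underlying population never changes. That spread from sample to sample is what the small print about anticipated error is describing.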

While thinking of a third example I was driving around southeastern Michigan. I noticed that the non-personalized, standard-issue, Michigan blue and white automobile license plates seem to always start with A, B, C, or D (at least as of December, 2015). An example of such a plate would be DHP 4507. [In fact that is probably someone's plate, but we really do not care who that someone is.] However, for whatever reason, I am curious to know what proportion of these standard plates start with A, what proportion start with B, and so on. I interest you in this question and we decide to try to find the answer. Now, neither you nor I are going to go around and systematically find and record the license plate of every car in Michigan. That is just too big a task. However, we could get a sample of cars, maybe all of the cars in the WCC parking lots as we and a few of our friends drive through them, and we could easily get a count of the standard blue and white Michigan car plates that start with each of the letters A through D. Then, based on the results of that sample, we could make at least a good guess at the relative proportion of each type of plate in the population. Given that there are fewer than 4000 parking spots on campus, we and 9 of our friends could easily gather the required data in just an hour or so. Again, we would be using the results of looking at a sample in order to make a good estimate of the situation in the population.
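Once the plates have been written down, turning them into proportions is a simple tally. The Python sketch below shows one way to do it. Apart from DHP 4507, which is the example plate from the text, every plate in the list is made up for illustration, and the function name first_letter_proportions is likewise an invention of this sketch.

```python
from collections import Counter

def first_letter_proportions(plates):
    """Return the proportion of plates that start with each initial letter."""
    # Skip blank entries, ignore stray spaces, and treat letters case-insensitively.
    counts = Counter(plate.strip()[0].upper() for plate in plates if plate.strip())
    total = sum(counts.values())
    return {letter: counts[letter] / total for letter in sorted(counts)}

# MADE-UP sample data (only "DHP 4507" appears in the text above).
observed_plates = [
    "DHP 4507", "BKL 1123", "CAB 9921", "DLM 3300",
    "AZT 0042", "DQR 7781", "CNN 2210", "BXY 6604",
]

for letter, proportion in first_letter_proportions(observed_plates).items():
    print(f"{letter}: {proportion:.1%}")
```

With only eight made-up plates the percentages mean nothing, of course. The same tally applied to a few thousand plates from the parking lots is what would give us our estimate for the population.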

One immediate issue to consider in this last example is whether or not the distribution of initial letters on license plates has anything to do with the registered location of the vehicle. It is possible, after all, that initial letters on plates allocated to the Grand Rapids area are different from those allocated to the Ann Arbor area. To test this we could call one of our friends in Grand Rapids and ask that friend to gather similar information from some large parking area around Grand Rapids. Then we could compare the results and see if it looks like there is any area-related difference. Again we are using sample data to make statements about the overall population.
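As a rough sketch of what "compare the results" might look like, the Python below puts the two sets of counts side by side as proportions and reports the difference for each letter. All of the counts are hypothetical, invented only to make the example run, and compare_proportions is a name made up here. Deciding whether a difference is large enough to matter is a separate question that this sketch does not try to answer.

```python
def compare_proportions(counts_a, counts_b):
    """Return {letter: (share in A, share in B, share in A minus share in B)}."""
    letters = sorted(set(counts_a) | set(counts_b))
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    rows = {}
    for letter in letters:
        share_a = counts_a.get(letter, 0) / total_a
        share_b = counts_b.get(letter, 0) / total_b
        rows[letter] = (share_a, share_b, share_a - share_b)
    return rows

# HYPOTHETICAL counts of plates by initial letter in each area.
ann_arbor_counts = {"A": 120, "B": 310, "C": 405, "D": 365}
grand_rapids_counts = {"A": 95, "B": 240, "C": 330, "D": 335}

comparison = compare_proportions(ann_arbor_counts, grand_rapids_counts)
for letter, (share_aa, share_gr, difference) in comparison.items():
    print(f"{letter}: Ann Arbor {share_aa:.1%}, Grand Rapids {share_gr:.1%}, "
          f"difference {difference:+.1%}")
```

Converting to proportions first matters because the two samples need not be the same size; raw counts from a sample of 1200 plates and a sample of 1000 plates cannot be compared directly.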

What we see is that samples can be really helpful in learning about a population without having to test, inspect, or use all of the elements of that population. What remains to be considered is how we get a representative sample. We will do this within the following areas: