Multiple testing issues; Sample Size Issues

Multiple Testing

This issue is meant to address the pitfall of running test after test after test, looking for some value that will allow us to reject some null hypothesis. The classic, though nasty, example of this is to imagine some graduate student looking for a topic for a dissertation. Nobody wants to write a dissertation that says he/she could not find a significant difference in the topic. Therefore, the graduate student gathers some sample data and then runs many tests, all at the 0.05 level of significance, looking for a factor or two that gives evidence to reject some null hypothesis. With the power and speed of computers the graduate student could easily run 50 to 100 such tests.

The problem, and danger, of this is that the tests were run individually at the 0.05 significance level. Running a test at the 0.05 significance level means that we are willing to make a Type I error, that is, to reject a null hypothesis that is actually true, 5% of the time, i.e., on average, 1 out of 20 times. That means that even if we have 50 null hypotheses where all of them are true, if we take samples and test those hypotheses at the 0.05 significance level, we can expect to reject a true hypothesis about 5% of the time, meaning that we could easily be rejecting 2 or 3 true hypotheses out of the group of 50 just based on the probability of getting a slightly weird sample.
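To put rough numbers on that, here is a small Python sketch. It is only an illustration: the function names are made up for this example, and the second calculation assumes that the tests are independent of each other, which the discussion above does not require and which real data may not satisfy.

```python
def expected_false_rejections(num_tests, alpha=0.05):
    """Average number of true null hypotheses rejected when all are true."""
    return num_tests * alpha


def prob_at_least_one_false_rejection(num_tests, alpha=0.05):
    """Chance of one or more Type I errors among the tests.

    ASSUMPTION: the tests are independent, so the chance that every
    test avoids a Type I error is (1 - alpha) ** num_tests.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    for m in (1, 20, 50, 100):
        print(m,
              expected_false_rejections(m),
              round(prob_at_least_one_false_rejection(m), 3))
```

For 50 tests the expected number of false rejections is 2.5, which is where the "2 or 3" above comes from, and under the independence assumption the chance of at least one false rejection is about 92%.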

The bottom line here is that you cannot just do a "shotgun" investigation of lots of null hypotheses. If you find yourself in a position where it looks like you have to do this, then one approach, the Bonferroni method, is to multiply the attained significance on each test by the number of tests in the group. This would mean that if we were running a group of 7 tests and the attained significance on the fourth one was 0.02 (a value low enough to reject the null hypothesis at the 0.05 level) we would treat that test as if the attained significance was 7*0.02=0.14 (a value that is not sufficiently low to justify rejecting the null hypothesis).
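The Bonferroni adjustment just described can be sketched in a few lines of Python. This is a minimal illustration, not a substitute for a statistics package. The function name is made up for this example, and the list of attained significance values is hypothetical except for the fourth one, 0.02, which comes from the example above. Capping the adjusted value at 1 is a common convention that the text above does not mention.

```python
def bonferroni_adjust(p_values):
    """Multiply each attained significance by the number of tests.

    The result is capped at 1.0 because a probability cannot exceed 1
    (a common convention, added here as an assumption).
    """
    num_tests = len(p_values)
    return [min(1.0, p * num_tests) for p in p_values]


if __name__ == "__main__":
    # Hypothetical attained significances for a group of 7 tests;
    # only the fourth value (0.02) is taken from the example in the text.
    attained = [0.30, 0.45, 0.08, 0.02, 0.61, 0.12, 0.77]
    alpha = 0.05
    for raw, adj in zip(attained, bonferroni_adjust(attained)):
        decision = "reject" if adj < alpha else "do not reject"
        print(f"raw={raw:.2f}  adjusted={adj:.2f}  {decision}")
```

With these values the fourth test has an adjusted significance of about 0.14, so it no longer justifies rejecting its null hypothesis at the 0.05 level.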

Sample Size Issues

Determining sample size can be quite a challenge. On the one hand, a bigger sample is clearly more representative. On the other, sampling is, or at least in the real world can be, quite expensive. This is particularly true if the measurement on the sample destroys the sample. If you want to measure the mean time that a fire retardant slows down a house fire it would be really expensive to set multiple houses on fire just to get some measurements. (Just to spread some unverified rumors, I have been led to believe that in a certain major industry, the "magic" number for expensive destructive samples is six. That seems low to me, but I have not talked to any responsible person in the industry to verify the number and/or to learn why it works if that is the number.)

Another issue with sample size is that you can effectively reject almost any null hypothesis if you make your sample large enough. Let me explain. First, if the null hypothesis is true, as in absolutely true, then increasing the sample size will not let me effectively reject the null hypothesis all the time. However, look at the case where the null hypothesis is false, but not by much. Consider the case where the null hypothesis is that the mean value of some measure on a population is 34.37. Now, if the true mean value for that measure is 34.362, and assuming we are running the test with the alternative hypothesis that the true mean is not equal to 34.37, do we really want any sample to reject our null hypothesis? That is, do we really care if the true mean is 34.37 or 34.362? Pick any standard deviation you want for the sample. I can find a sample size that is large enough to make the standard error, the standard deviation of the sample mean, so small that 34.37 is more than 5 standard errors away from 34.362. In that case it is almost a certainty that any sample will yield results that justify our rejecting the null hypothesis.
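The arithmetic behind that claim can be sketched in Python. The standard error of the sample mean is the standard deviation divided by the square root of the sample size, so we want the smallest n for which 5 standard errors fit inside the gap between 34.37 and 34.362. The function name and the standard deviations tried here are made up for this illustration, and the sketch treats the chosen standard deviation as if it were known.

```python
import math


def min_sample_size(null_mean, true_mean, sigma, num_std_errors=5.0):
    """Smallest n with num_std_errors * sigma / sqrt(n) <= |null - true|.

    ASSUMPTION: sigma is treated as the known standard deviation, so the
    standard error of the sample mean is sigma / sqrt(n).
    """
    gap = abs(null_mean - true_mean)
    if gap == 0:
        raise ValueError("the null hypothesis is exactly true; no n will do")
    return math.ceil((num_std_errors * sigma / gap) ** 2)


if __name__ == "__main__":
    # Hypothetical standard deviations: "pick any standard deviation you want".
    for sigma in (0.5, 1.0, 5.0):
        n = min_sample_size(34.37, 34.362, sigma)
        print(f"sigma={sigma}: n={n}, standard error={sigma / math.sqrt(n):.6f}")
```

With a standard deviation of 1 the required sample size is on the order of 390,000. That is large, but it is finite, and that is the whole point: a big enough sample will almost always reject a null hypothesis that is off by a difference nobody cares about.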

I understand that the previous discussion is quite complex and a bit convoluted. The point is that the real world is not statistics class. There are consequences to decisions. If you are going to make a decision based on a statistic you need to understand how your choice of statistical tests and sample size will affect the statistical decision.

Not to be too snotty, but in my career I have been asked, at various times, to find some statistical value. My first response is to ask if there is a decision to be made based on the result of that analysis. If the answer is "No" then I just make up an answer. Thus, if the head of Admissions at WCC asks for the average age of admitted students, my question is "Will you do anything different depending upon the answer?" When the answer to that comes back as "No" then I pull a number out of the air, perhaps 23.416 (adding the decimal places to make it look good) and life goes on at no cost to me or to the school.

However, if the answer is, "Yes, if the mean age is under 23 then we will completely change our advertising strategy," then I start asking, "Are you sure? If the average comes in at 22.9 will you still make the changes?" Until we can get to the true "cut point" there is no fair way to move forward with the process.


©Roger M. Palay Saline, MI 48176 November, 2021