Multiple Testing Issues; Sample Size Issues
Return to Topics page
Multiple Testing
This issue is meant to address the pitfall of
running test after test after test,
looking for some value that will allow us to reject some null hypothesis.
The classic, though nasty, example of this is to imagine
some graduate student looking for a topic for a dissertation.
Nobody wants to write a dissertation that says he/she could not find
a significant difference in the topic. Therefore, the graduate student
gathers some sample data and then runs many tests,
all at the 0.05 level of significance, looking for some factor or
two that indicates that there is evidence to reject some hypothesis.
With the power and speed of computers the graduate
student could easily run 50 to 100 such tests.
The problem, and danger, of this is that the tests were run
individually at the 0.05 significance level.
The meaning of running tests at the 0.05 significance level
is that we are willing to be wrong, that is make a Type I error,
5% of the time, i.e., on average, 1 out of 20 times. That means that even if we have
50 null hypotheses where all of them are true, if we take samples and test those
hypotheses at the 0.05 significance level, we can expect to reject
a true hypothesis about 5% of the time, meaning that we could easily
be rejecting 2 or 3 true hypotheses out of the group of 50 just based on the
probability of getting a slightly weird sample.
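We can sketch this arithmetic with a small simulation. The sketch below, with a hypothetical batch of 50 tests all having true null hypotheses, uses the fact that under a true null the p-value is uniformly distributed on (0, 1):

```python
import random

random.seed(1)

# Simulate batches of 50 hypothesis tests where every null hypothesis
# is TRUE. Under a true null, the p-value is uniform on (0, 1), so we
# can model each test by drawing a p-value directly.
num_tests = 50
alpha = 0.05
trials = 10_000

false_rejections = []
for _ in range(trials):
    p_values = [random.random() for _ in range(num_tests)]
    false_rejections.append(sum(p < alpha for p in p_values))

mean_rejections = sum(false_rejections) / trials
# On average about 50 * 0.05 = 2.5 true hypotheses get rejected per batch.
print(f"Average false rejections per batch of {num_tests}: {mean_rejections:.2f}")
```

The average lands near 2.5, matching the "2 or 3 true hypotheses out of the group of 50" figure above.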
The bottom line here is that you cannot just do a "shotgun"
investigation of lots of null hypotheses. If you find yourself in a position
where it looks like you have to do this, then one suggestion, the
Bonferroni method, suggests that you multiply the attained significance
on each test by the number of tests in the group. This would mean that if we were
running a group of 7 tests and the attained significance on the fourth one was 0.02
(a value low enough to reject the null hypothesis at the 0.05 level) we would treat
that test as if the attained significance was 7*0.02=0.14 (a value that is not sufficiently low
to justify rejecting the null hypothesis).
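Here is the Bonferroni adjustment from the example worked out in code. The seven p-values are hypothetical, chosen so that the fourth one is the 0.02 from the text:

```python
# Bonferroni method: multiply each attained significance (p-value)
# by the number of tests in the group, capping the result at 1.0.
# These p-values are hypothetical; the fourth mirrors the example above.
p_values = [0.31, 0.08, 0.55, 0.02, 0.47, 0.12, 0.90]
num_tests = len(p_values)

adjusted = [min(1.0, p * num_tests) for p in p_values]
for raw, adj in zip(p_values, adjusted):
    verdict = "reject" if adj < 0.05 else "fail to reject"
    print(f"raw p = {raw:.2f} -> adjusted p = {adj:.2f} ({verdict})")
# The fourth test becomes 7 * 0.02 = 0.14, no longer below 0.05.
```

After the adjustment none of the seven tests rejects at the 0.05 level, which is exactly the protection the method is meant to give.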
Sample Size Issues
Determining sample size can be quite a challenge.
On the one hand, a bigger sample is clearly more representative.
On the other, sampling is, or at least in the real world can be, quite expensive.
This is particularly true if the measurement on the sample destroys the sample.
If you want to measure the mean time that a fire retardant slows down a house fire,
it would be really expensive to set multiple houses on fire just to get some measurements.
(Just to spread some unverified rumors, I have been led to believe that in a certain major industry,
the "magic" number for expensive destructive samples is six.
That seems low to me, but I have not talked to
any responsible person in the industry to verify the number
and/or to learn why it works if that is the number.)
Another issue with sample size is that you can effectively reject almost any
null hypothesis if you make your sample large enough. Let me explain.
First, if the null hypothesis is true, as in absolutely true,
then increasing the sample size will not let me effectively
reject the null hypothesis all the time.
However, look at the case where the null hypothesis is false, but not by much.
Consider the case where the null hypothesis is that
the mean value of some measure on a population is 34.37. Now, if the true mean value for that
measure is 34.362, and assuming we are running the test with the alternative hypothesis
that the true mean is not equal to 34.37,
do we really want any sample to reject our null hypothesis?
That is, do we really care if the true mean is 34.37
or 34.362?
Pick any standard deviation you want for the sample.
I can find a sample size that is large enough to make the standard error,
the standard deviation of the sample mean, so small that 34.37 is more than 5 standard deviations away from
34.362. In that case it is almost a certainty that any sample will yield
results that justify our rejecting the null hypothesis.
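The claim above can be checked with a little arithmetic. The sketch below uses the means from the example (34.37 hypothesized, 34.362 true) and a hypothetical standard deviation of 2.0 (pick any value you like); it finds a sample size large enough that the standard error puts the hypothesized mean more than 5 standard errors from the truth:

```python
import math

# Null hypothesis: mu = 34.37; true mean: 34.362 (from the example).
mu_null = 34.37
mu_true = 34.362
sigma = 2.0  # hypothetical population standard deviation

gap = abs(mu_null - mu_true)  # 0.008, a difference nobody cares about

# The standard error of the sample mean is sigma / sqrt(n).
# We need sigma / sqrt(n) < gap / 5, i.e. n > (5 * sigma / gap) ** 2.
n = math.ceil((5 * sigma / gap) ** 2)
se = sigma / math.sqrt(n)
print(f"Required sample size: {n:,}")
print(f"Standard error at that n: {se:.6f}")
print(f"Gap measured in standard errors: {gap / se:.2f}")
```

With sigma = 2.0 the required n works out to about 1.56 million. That is a huge sample, but with modern data collection it is not unthinkable, and it would reject the null hypothesis essentially every time even though the true mean differs from 34.37 by a practically meaningless 0.008.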
I understand that the previous discussion is quite complex and a bit convoluted.
The point is that the real world is not statistics class. There are consequences to decisions.
If you are going to make a decision based on statistics you need to understand
how your choice of statistical tests and sample size will affect the
statistical decision.
Not to be too snotty, but in my career I have been asked, at various times, to find some
statistical value. My first response is to ask if there is a decision to be made based on
the result of that analysis. If the answer is "No" then I just make up an answer.
Thus, if the head of Admissions at WCC asks for the average
age of admitted students, my question is
"Will you do anything different depending upon the answer?"
When the answer to that comes back as "No" then I pull a number out of the air,
perhaps 23.416 (adding the decimal places to make it look good) and life goes on at no cost to
me or to the school.
However, if the answer is, "Yes, if the mean age is under 23 then we will completely
change our advertising strategy," then I start asking, "Are you sure?
If the average comes in at 22.9
will you still make the changes?"
Until we can get to the true "cut point" there is no fair
way to move forward with the process.
©Roger M. Palay
Saline, MI 48176 November, 2021