Explore Hypothesis Tests, Two Pop., Diff Proportions


The script below provides a way to
  1. Create two populations of values with given proportions of specified characteristics. With some care and attention the proportion of one or more of the characteristics can be made identical or not.
  2. Specify a particular characteristic.
  3. Specify the size of the samples to be taken from the two populations.
  4. Specify the significance level to use for a hypothesis test that the proportions of the specified characteristics are equal, i.e., that the difference of the proportions is zero.
  5. Specify the number of times to take such samples.
  6. Perform the sampling and, for each sample, run the hypothesis test, based on the proportions of the desired characteristic in the two samples, that the difference of the population proportions is 0 vs. the alternative that the difference is not zero.
  7. Keep track of the number of times that the null hypothesis is rejected or is not rejected.
  8. Report that count.
If we start with the two populations having the same proportion for a specific characteristic then by asking for a significant number of samples, say 10,000, we can see that we really do make a Type I error, reject H0 when it is true, at about the level of significance. On the other hand, if we start with the two populations having different proportions for a characteristic then by asking for a significant number of samples, say 10,000, we can see how often we are making a Type II error, not rejecting H0 when it is false. This we can explore as we change just how different the proportions are and/or just how big we make our samples.

In the folder containing the function scripts for this course create a new directory, copy the model.R file to that directory, rename the file in the new directory, double click on the file to open Rstudio. Then copy all of the text below the line and paste it into your Rstudio editor pane. Then, you can highlight the entire script and run it to use the default values. After that you can go back and change parameters and run the script again to explore the consequences of those changes.
  
# Look at doing a hypothesis test that the difference in the
# the proportion of a specific characteristic in two populations
# is zero against the alternative hypothesis that the 
# proportion is not the same.

#   First create the two populations

#   We will specify the relative frequency of the characteristics
#   in each of the two populations

rel_freq_1 <- c( 11, 25, 7, 19, 13 )
rel_freq_2 <- c( 16, 14, 16, 19, 10 )
#     These happen to have the same total but that is not
#     necessary.  However, for this set of values 
#     it will mean that characteristic 4
#     will have the same proportion in the two populations.

#   Set a target  size for the populations
target_size <- 1000
#     Then we will make the populations be the smallest multiple of
#     the required size of the two relative frequencies that
#     is greater than the target size

sum_1 <- sum( rel_freq_1 )
mult_factor_1 <-trunc(target_size/sum_1)+1
num_factor_1 <- length(rel_freq_1)
pop_1 <- rep( (1:num_factor_1), rel_freq_1*mult_factor_1 )


sum_2 <- sum( rel_freq_2 )
mult_factor_2 <-trunc(target_size/sum_2)+1
num_factor_2 <- length(rel_freq_2)
pop_2 <- rep( (1:num_factor_2), rel_freq_2*mult_factor_1 )

table( pop_1 )
table( pop_2 )

#   then look at the proportions, just to verify the two
#   populations

n_1 <- length( pop_1 )
n_2 <- length( pop_2 )
pop_1_prop <- table( pop_1 )/n_1
pop_1_prop
pop_2_prop <- table( pop_2 )/n_2
pop_2_prop
#   Now we want to repeat the process of looking at samples
#   and, from the proportion in each sample, compute a confidence
#   interval.  

#   the first thing is to select one of the
#   characteristics to study.

which_char <- 4  # which, originally, we set to have the 
                 # same proportion in both populations

#  We actually know the propulation percents for this in 
#  the two populations so we can get the true difference

true_diff <- pop_1_prop[ which_char ] - pop_2_prop[which_char ]
true_diff

num_reps <- 1000
sig_level <- 0.05

#  then since we will be using the normal approximation here
#  we can find the z_alpha_2 value
z_alpha_2 <- sig_level/2
z_val <- qnorm( z_alpha_2, lower.tail=FALSE)

#   Set up the size of our samples
samp_one_size <- 76
samp_two_size <- 68

num_accept <- 0
num_reject <- 0

for( i in (1:num_reps) )
{  # choose samples from pop one get sample proportion
  index_1 <- as.integer( runif( samp_one_size, 1, 1001))
  samp_1 <- pop_1[ index_1 ]
  samp_num_choice_1  <- table( samp_1 )[ which_char]
  samp_1_prop <- samp_num_choice_1 / samp_one_size
  
  # choose samples from pop two get sample mean
  index_2 <- as.integer( runif( samp_two_size, 1, 1001))
  samp_2 <- pop_2[ index_2 ]
  samp_num_choice_2  <- table( samp_2 )[ which_char]
  samp_2_prop <- samp_num_choice_2 / samp_two_size
  
  
  this_diff <- samp_1_prop - samp_2_prop
  #   we want to use the pooled proportion to get 
  #   the standard error
  pooled <- (samp_num_choice_1+samp_num_choice_2)/
            (samp_one_size+samp_two_size)
  s_e <- sqrt( pooled*(1-pooled)*
                 (1/samp_one_size + 1/samp_two_size  ) )
  
  # get the attained significance
  if( this_diff > 0 )
  { attained <- pnorm( this_diff, 0, s_e, lower.tail=FALSE)*2} else
  { attained <- pnorm( this_diff, 0, s_e)*2}

  if( attained > sig_level )
  { num_accept <- num_accept+1} else
    
  { num_reject <- num_reject + 1}
}
#  report the number of times we accept
num_accept
#  and the number of times we reject H0
num_reject