Explore Confidence Interval, Two Pop., Diff means, σ's unknown


The script below provides a way to
  1. Create two populations of values with specified means and standard deviations.
  2. Find the mean and standard deviation of each population.
  3. Specify the size of a sample to be taken from each population.
  4. Specify the confidence level to use.
  5. Specify the number of times to take such samples.
  6. Perform the sampling and, for each sample, generate a confidence interval for the difference of the population means, and do so as if we did not know the true standard deviations of the populations.
  7. Keep track of the number of times that the generated confidence interval actually contains the true mean difference.
  8. Report that count.
  9. Report the standard deviation of the collection of the differences of the two sample means.
By asking for a significant number of samples, say 10,000, we can see that we really do get close to the specified confidence level of successes.

In the folder containing the function scripts for this course create a new directory, copy the model.R file to that directory, rename the file in the new directory, double click on the file to open Rstudio. Then copy all of the text below the line and paste it into your Rstudio editor pane. Then, you can highlight the entire script and run it to use the default values. After that you can go back and change parameters and run the script again to explore the consequences of those changes.
  
# We look at a confidence interval for the
# difference of two population means when we 
# do not know
# the standard deviation of each population.

#  first set up some goal populations
pop_one_mean <- 23.2
pop_one_sd <- 5.3
pop_two_mean <- 26.8
pop_two_sd <- 6.2

#  then create the two populations, each with 
#  1000 values

#  First generate an approximate standard normal
pop_one <- rnorm( 1000 )

#  Then get its mean and standard deviation
mu_1 <- mean( pop_one)
#  we want the standard deviation of the population
source("../pop_sd.R")
sd_1 <- pop_sd( pop_one )

#  Then create the distribution we want
pop_one <- ( (pop_one-mu_1)/sd_1 )* pop_one_sd+pop_one_mean

#   finally, verify that we have the right population
mean( pop_one )
pop_sd( pop_one )

#  Now do the same thing for pop_two

#  First generate an approximate standard normal
pop_two <- rnorm( 1000 )

#  Then get its mean and standard deviation
mu_2 <- mean( pop_two)

sd_2 <- pop_sd( pop_two )

#  Then create the distribution we want
pop_two <- ( (pop_two-mu_2)/sd_2 )* pop_two_sd+pop_two_mean

#   finally, verify that we have the right population
mean( pop_two )
pop_sd( pop_two )


#   Now we want to repeat the following process of
#   getting two samples, one from each population,
#   and then generating the confidence interval
#   for the difference of the population means.

#   While we are at this, and because we know what that
#   difference is, we can count the number of times 
#   that the true difference is inside our interval.

sig_level <- 0.93
samp_one_size <- 43
samp_two_size <- 38

num_reps <- 1000
num_success <- 0
num_fail <- 0
true_diff <- pop_one_mean - pop_two_mean


#  from the confidence level we want to find the
#  area in the two tails
alpha <- 1 - sig_level
alpha_div_2 <- alpha/2

# We know that we will be using the Student's-t
# distribution.  Therefore, we need to find the degrees
# of freedom.  The simple way is to use 1 less than the
# smaller of the two sample sizes.  The more complex
# approach, which gives a higher value for the degrees of
# freedom, involves some computation.  Furthermore, that 
# computation depends on the two sample standard 
# deviations, so we will have to wait until we get the 
# samples.


for( i in (1:num_reps) )
{  # choose samples from pop one get sample mean and sd
   index_1 <- as.integer( runif( samp_one_size, 1, 1001))
   samp_1 <- pop_one[ index_1 ]
   xbar_1 <- mean( samp_1 )
   sd_1 <- sd( samp_1 )
   
   # choose samples from pop two get sample mean and sd
   index_2 <- as.integer( runif( samp_two_size, 1, 1001))
   samp_2 <- pop_two[ index_2 ]
   xbar_2 <- mean( samp_2 )
   sd_2 <- sd( samp_2 )
   this_diff <- xbar_1 - xbar_2
   
   #  now we can compute the degrees of freedom to use
   d_1 <- sd_1^2/ samp_one_size
   d_2 <- sd_2^2/ samp_two_size
   df <- (d_1 + d_2)^2/ 
     (d_1^2/(samp_one_size-1)+d_2^2/(samp_two_size-1))
 
   #  finally we can get the t-value to use
   t_val <- qt( alpha_div_2, df, lower.tail=FALSE)

   #  and we get our standard error from the 
   #  sample standard deviations
   s_e <- sqrt( d_1+d_2 )
   
   # get the confidence interval
   ci_low <- this_diff - t_val*s_e
   ci_high <- this_diff + t_val*s_e
   in_ci <- (ci_low <= true_diff ) &&
            ( true_diff <= ci_high )
   if( in_ci )
   { num_success <- num_success+1} else
  
   { num_fail <- num_fail + 1}
}
#  report the number of successes
num_success