Explore Confidence Interval, Two Pop., Diff Proportions
The script below provides a way to
- Create two populations of values with specified proportions of different characteristics.
- Select one of those characteristics.
- Find the true difference in the proportion of that characteristic between the populations.
- Specify the size of the samples to be taken from each population.
- Specify the confidence level to use.
- Specify the number of times to take such samples.
- Perform the sampling and, for each sample, generate a confidence interval
for the difference of the population proportions based on the proportions of the
specified characteristic in the two samples.
- Keep track of the number of times that the generated confidence interval
actually contains the true proportion difference.
- Report that count.
By asking for a large number of samples, say 10,000, we can see that
the fraction of intervals that contain the true difference really does
come close to the specified confidence level.
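As a minimal sketch of the interval that is built for each pair of samples, the function below computes the normal-approximation confidence interval for a difference of two proportions. The function name diff_prop_ci and its arguments are assumptions made for illustration; they are not part of the function scripts for this course.

```r
# Hypothetical helper (not from the course scripts): normal-approximation
# confidence interval for p_1 - p_2 from two independent samples.
# x_1, x_2 are the counts with the characteristic; n_1, n_2 the sample sizes.
diff_prop_ci <- function( x_1, n_1, x_2, n_2, conf_level = 0.95 )
{ p_hat_1 <- x_1 / n_1
  p_hat_2 <- x_2 / n_2
  # standard error of the difference of the two sample proportions
  s_e <- sqrt( p_hat_1*(1-p_hat_1)/n_1 + p_hat_2*(1-p_hat_2)/n_2 )
  # critical value that leaves (1-conf_level)/2 in the upper tail
  z_val <- qnorm( (1-conf_level)/2, lower.tail=FALSE )
  point_est <- p_hat_1 - p_hat_2
  c( low = point_est - z_val*s_e, high = point_est + z_val*s_e )
}

# Example: 27 of 76 in the first sample, 12 of 68 in the second
diff_prop_ci( 27, 76, 12, 68, conf_level = 0.93 )
```

The interval is centered on the difference of the sample proportions, and it widens as the confidence level rises or as the sample sizes shrink.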
In the folder containing the function scripts for this course, create a new directory, copy the model.R file to that directory,
rename the file in the new directory, and double-click the file to open RStudio.
Then copy all of the text below the line and paste it into your RStudio editor pane. You can then highlight the entire
script and run it to use the default values. After that, you can go back, change parameters, and run the script again to
explore the consequences of those changes.
# Look at building a confidence interval for the difference in
# the proportion of a specific characteristic in two populations.
# First create the two populations
# We will specify the relative frequency of the characteristics
# in each of the two populations
rel_freq_1 <- c( 11, 25, 7, 19, 13 )
rel_freq_2 <- c( 16, 14, 16, 19, 10 )
# These happen to have the same total (75), but that is not
# necessary. Because the totals match, characteristic 4
# (19 in both) has the same proportion in the two populations.
# Set a target size for the populations
target_size <- 1000
# Then we will make each population the smallest multiple of
# the sum of its relative frequencies that is greater than
# the target size
sum_1 <- sum( rel_freq_1 )
mult_factor_1 <- trunc(target_size/sum_1)+1
num_factor_1 <- length(rel_freq_1)
pop_1 <- rep( (1:num_factor_1), rel_freq_1*mult_factor_1 )
sum_2 <- sum( rel_freq_2 )
mult_factor_2 <- trunc(target_size/sum_2)+1
num_factor_2 <- length(rel_freq_2)
pop_2 <- rep( (1:num_factor_2), rel_freq_2*mult_factor_2 )
table( pop_1 )
table( pop_2 )
# then look at the proportions, just to verify the two
# populations
n_1 <- length( pop_1 )
n_2 <- length( pop_2 )
pop_1_prop <- table( pop_1 )/n_1
pop_1_prop
pop_2_prop <- table( pop_2 )/n_2
pop_2_prop
# Now we want to repeat the process of looking at samples
# and, from the proportion in each sample, compute a confidence
# interval.
# the first thing is to select one of the
# characteristics to study.
which_char <- 2
# We actually know the population proportions for this in
# the two populations so we can get the true difference
true_diff <- pop_1_prop[ which_char ] - pop_2_prop[ which_char ]
true_diff
num_reps <- 10000
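# sig_level holds the confidence level (here 93%)
sig_level <- 0.93
# then since we will be using the normal approximation here
# we can find the critical z value: z_alpha_2 holds the
# upper-tail area alpha/2, and z_val is the z value that
# cuts off that area
z_alpha_2 <- (1-sig_level)/2
z_val <- qnorm( z_alpha_2, lower.tail=FALSE)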
# Set up the size of our samples
samp_one_size <- 76
samp_two_size <- 68
num_success <- 0
num_fail <- 0
for( i in (1:num_reps) )
{ # choose a sample from pop one, get the sample proportion;
  # the indices are drawn uniformly from 1 to n_1, with replacement
  index_1 <- as.integer( runif( samp_one_size, 1, n_1+1 ))
  samp_1 <- pop_1[ index_1 ]
  # count the sampled items that have the chosen characteristic
  samp_num_choice_1 <- sum( samp_1 == which_char )
  samp_1_prop <- samp_num_choice_1 / samp_one_size
  # choose a sample from pop two, get the sample proportion
  index_2 <- as.integer( runif( samp_two_size, 1, n_2+1 ))
  samp_2 <- pop_2[ index_2 ]
  samp_num_choice_2 <- sum( samp_2 == which_char )
  samp_2_prop <- samp_num_choice_2 / samp_two_size
  this_diff <- samp_1_prop - samp_2_prop
  s_e <- sqrt( samp_1_prop*(1-samp_1_prop)/samp_one_size +
               samp_2_prop*(1-samp_2_prop)/samp_two_size )
  # get the confidence interval
  ci_low <- this_diff - z_val*s_e
  ci_high <- this_diff + z_val*s_e
  # check whether the interval contains the true difference
  in_ci <- (ci_low <= true_diff ) &&
           ( true_diff <= ci_high )
  if( in_ci )
  { num_success <- num_success+1} else
  { num_fail <- num_fail + 1}
}
# report the number of successes
num_success
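
The count reported above is easier to compare with the confidence level once it is expressed as a proportion of the repetitions. The sketch below does this in a compact form. Because each sample is drawn with replacement, the number of sampled items with the chosen characteristic follows a binomial distribution, so the sketch generates those counts directly with rbinom rather than building the populations. The function name sim_coverage, its arguments, and the use of rbinom are assumptions made for illustration; they are not part of the script above.

```r
# Hypothetical sketch: estimate how often the interval contains p_1 - p_2.
# p_1, p_2 are the true population proportions; size_1, size_2 the sample sizes.
sim_coverage <- function( p_1, p_2, size_1, size_2,
                          conf_level = 0.93, num_reps = 10000 )
{ z_val <- qnorm( (1-conf_level)/2, lower.tail=FALSE )
  # sample proportions for all repetitions at once
  p_hat_1 <- rbinom( num_reps, size_1, p_1 ) / size_1
  p_hat_2 <- rbinom( num_reps, size_2, p_2 ) / size_2
  this_diff <- p_hat_1 - p_hat_2
  s_e <- sqrt( p_hat_1*(1-p_hat_1)/size_1 + p_hat_2*(1-p_hat_2)/size_2 )
  true_diff <- p_1 - p_2
  in_ci <- ( this_diff - z_val*s_e <= true_diff ) &
           ( true_diff <= this_diff + z_val*s_e )
  # proportion of intervals that contain the true difference
  mean( in_ci )
}

set.seed( 1 )   # arbitrary seed, only to make the run repeatable
sim_coverage( 25/75, 14/75, 76, 68 )
```

With the default relative frequencies, characteristic 2 has proportion 25/75 in the first population and 14/75 in the second, so the returned value should land near the 0.93 confidence level.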