Two Samples: σ known


Before we start, we should recognize that the situation described here almost never exists. How are we to know the standard deviation of two populations when we do not know their respective mean values? This situation is presented here, even though it would be strange to find it in real life (though not strange to find it on a stat test) because it is a straightforward computation that sets the pattern for the situations that will be presented in subsequent pages.

We will start with
  1. two independent, random samples from two populations;
  2. probably different sample sizes, though this is not a requirement;
  3. each sample should have at least 30 items in it, although if the underlying populations are normal then we can get away with smaller samples.
There is nothing strange about those issues. However, for this presentation we will assume that we know one more thing.
  1. We know the values of standard deviation for each population, σ1 for the first population and σ2 for the second.
The two populations have unknown mean values, μ1 for the first and μ2 for the second.

Confidence interval for the difference μ1 - μ2

We construct a confidence interval for the difference of the two means in the same way that we constructed one for the mean of a population. That is, we need to know
  1. the desired level of the confidence interval,
  2. a point estimate for the difference of the means, and
  3. the standard deviation of the difference of sample means when we know the size of the samples.
But we can do this.
  1. Specifying a level for the confidence interval is easy. We describe it through α, the total area outside of our interval: if we want a 95% confidence level then α=0.05, so that we will have 95% of the area inside the interval.
  2. If we have two samples then they have respective sample means x̄1 and x̄2. The best point estimate for μ1 - μ2 is x̄1 - x̄2.
  3. The distribution of the "difference in sample means" for samples of size n1 and n2 taken from populations that have means μ1 and μ2 with standard deviations σ1 and σ2 is normal with mean μ1 - μ2 and standard deviation √(σ1²/n1 + σ2²/n2).
All of that means that our confidence interval is given by
(x̄1 - x̄2) ± zα/2 * √(σ1²/n1 + σ2²/n2)
where zα/2 is the value of z, for the standard normal distribution, that has α/2 of the area to its right.
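
The claim about the distribution of the difference in sample means can be checked informally by simulation. The sketch below is only an illustration: the population means (50 and 45), the seed, and the number of repetitions are made-up values, and the populations are assumed to be normal. The standard deviations and sample sizes are borrowed from Case 1.

```r
# Simulation sketch: the population means below are hypothetical.
set.seed(1)                       # arbitrary seed, for reproducibility
mu_one <- 50                      # assumed value, unknown in practice
mu_two <- 45                      # assumed value, unknown in practice
sigma_one <- 7.4
sigma_two <- 6.2
n_one <- 38
n_two <- 31

# draw many pairs of samples; record each difference of sample means
diffs <- replicate(10000,
  mean(rnorm(n_one, mean=mu_one, sd=sigma_one)) -
  mean(rnorm(n_two, mean=mu_two, sd=sigma_two)))

mean(diffs)                       # should be near mu_one - mu_two
sd(diffs)                         # should be near the value computed below
sqrt(sigma_one^2/n_one + sigma_two^2/n_two)
```

The simulated mean and standard deviation should land close to μ1 - μ2 and to the square root expression, though they will not match exactly.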

Now we are ready to walk through an example or two.

Case 1: We have two populations that are each normally distributed. We want to create a 90% confidence interval for the difference of the means. (Therefore, we want 5% of the area below the interval and 5% of the area above the interval.) We know that the standard deviation of the first is σ1 = 7.4 and the standard deviation of the second is σ2 = 6.2. We take samples of size n1 = 38 and n2 = 31, respectively. Those samples yield sample means x̄1 = 17.4 and x̄2 = 15.6. Of course that means that x̄1 - x̄2 = 17.4 - 15.6 = 1.8 becomes our point estimate. The standard deviation of the distribution of the difference of the sample means, for samples of the given sizes, is √(7.4²/38 + 6.2²/31). The value of that expression is about 1.6374. All that remains is to find, for the standard normal distribution, the value of z that has 5% of the area to the right of that value. We could use the tables, a calculator, or the computer to find this. It turns out to be about 1.645. Therefore, our confidence interval is
1.8 ± 1.645*1.6374
1.8 ± 2.693
(-0.893, 4.493)
which leaves us with a margin of error of 2.693.

Of course, we could perform all of those computations in R by using the commands
sigma_one <- 7.4
sigma_two <- 6.2
n_one <- 38
n_two <- 31
x_one <- 17.4
x_two <- 15.6
pe <- x_one - x_two
pe
sd <- sqrt(sigma_one^2/n_one + sigma_two^2/n_two)
sd
z_half_alpha <- qnorm(0.05,lower.tail=FALSE)
z_half_alpha
moe <- z_half_alpha*sd
moe
pe-moe
pe+moe
The console result of those commands is given in Figure 1.

Figure 1

Case 2: We have two populations that are each normally distributed. We want to create a 95% confidence interval for the difference of the means. (Therefore we want 2.5% of the area below the interval and 2.5% of the area above the interval.) We know that the standard deviation of the first is σ1 = 12.8 and the standard deviation of the second is σ2 = 15.3. We take samples of size n1 = 43 and n2 = 52, respectively. Those samples are given in Table 1 and Table 2 below.




We really have all the values that we need now: the known σ's, the sample sizes, the sample means (134.3582 and 144.3327), and the confidence level. We could just plug those values into the statements that we used above and, from them, get the confidence interval. Figure 2 shows the console view of such statements.

Figure 2

The result is that we have a 95% confidence interval: (-15.626,-4.324) rounded to 3 decimal places.

Of course, we used the sample mean values that this page supplied. If those had not been given, then we would have to calculate them. We could do this with the commands
source("../gnrnd4.R")
gnrnd4( key1=1840954204, key2=0012801348 )
n_one <- length(L1)
n_one
x_one <- mean(L1)
x_one

gnrnd4( key1=1481765104, key2=0015301435 )
n_two <- length(L1)
n_two
x_two <- mean(L1)
x_two
the console view of which appears in Figure 3.

Figure 3

Then we could follow that with the more standard commands
sigma_one <- 12.8
sigma_two <- 15.3
pe <- x_one - x_two
pe
sd <- sqrt(sigma_one^2/n_one + sigma_two^2/n_two)
sd
alpha <- 0.05
z_half_alpha <- qnorm(alpha/2,lower.tail=FALSE)
z_half_alpha
moe <- z_half_alpha*sd
moe
pe-moe
pe+moe
to produce the results in Figure 4.

Figure 4

It does seem that this would be a good time to capture this process in a function. Consider the function ci_2known() shown below.
ci_2known <- function (
  sigma_one, n_one, x_one,
  sigma_two, n_two, x_two,
  cl=0.95)
{
    # try to avoid some common errors
    if( (cl <=0) | (cl>=1) )
      {return("Confidence level must be strictly between 0.0 and 1")
      }
    if( (sigma_one <= 0)  | (sigma_two <= 0) )
      {return("Population standard deviation must be positive")}
    if( (as.integer(n_one) != n_one ) |
        (as.integer(n_two) != n_two ) )
      {return("Sample size must be a whole number")}

    pe <- x_one - x_two
    sd <- sqrt(sigma_one^2/n_one + sigma_two^2/n_two)
    alpha <- 1-cl
    z_half_alpha <- qnorm(alpha/2,lower.tail=FALSE)
    moe <- z_half_alpha*sd
    low_end <- pe-moe
    high_end <- pe+moe
    result <- c(low_end, high_end, pe, 
                sd, moe, alpha/2,
                z_half_alpha)
    names(result)<-c("CI Low","CI High","Pnt. Est.",
                    "St. Error", "M of E",
                    "alpha/2","z-score")
    return( result )
}
This function takes care of all of our work. Figure 5 demonstrates the use of the function, using the values that we have for Case 2, via the two statements
source("../ci_2known.R")
ci_2known( 12.8, 43, 134.3582,
           15.3, 52, 144.3327, 0.95)
the first of which merely loads the function [we would have had to place the file ci_2known.R into the parent directory (folder) prior to doing this].

Figure 5

One of the problems associated with the command of Figure 5 is that it required us to re-type some of the arguments. With a little bit of preplanning, we could have accomplished the same thing via
gnrnd4( key1=1481765104, key2=0015301435 )
L2 <- L1
gnrnd4( key1=1840954204, key2=0012801348 )
ci_2known( 12.8, length(L1), mean(L1),
           15.3, length(L2), mean(L2),
           0.95)
Note that we have generated the second list first and then we copied that list of values to the new variable L2. We need to do that because gnrnd4() always overwrites the values in L1. The result of the four-statement sequence is shown in Figure 6.

Figure 6

Rather than go through another example, Table 3 gives you the data that you need to do 5 more problems. In fact, each time you reload the page you will get 5 new problems. Following Table 3 you will find the related answers in Table 4. Because the cases presented in the tables above are dynamic I cannot provide screenshots of getting those results in R. You should be able to do these computations and get approximately the same results.

The interpretation of a confidence interval is unchanged. In Case 7 above, the confidence interval is
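
One way to see that interpretation is to repeat the sampling many times and count how often the interval captures the true difference. The following is a rough sketch, not part of the original examples: the population means (134 and 144), the seed, and the 5000 repetitions are assumptions chosen only for illustration, and the populations are taken to be normal. The standard deviations and sample sizes are those of Case 2.

```r
# Coverage sketch: the population means below are hypothetical.
set.seed(2)                       # arbitrary seed, for reproducibility
mu_one <- 134                     # assumed value, unknown in practice
mu_two <- 144                     # assumed value, unknown in practice
sigma_one <- 12.8
sigma_two <- 15.3
n_one <- 43
n_two <- 52
sd_diff <- sqrt(sigma_one^2/n_one + sigma_two^2/n_two)
moe <- qnorm(0.025, lower.tail=FALSE)*sd_diff   # 95% margin of error

# for each repetition, build the interval and check whether it
# contains the true difference mu_one - mu_two
covers <- replicate(5000, {
  pe <- mean(rnorm(n_one, mean=mu_one, sd=sigma_one)) -
        mean(rnorm(n_two, mean=mu_two, sd=sigma_two))
  (pe - moe <= mu_one - mu_two) & (mu_one - mu_two <= pe + moe)
})
mean(covers)                      # proportion of intervals that cover
```

The proportion should be close to 0.95, which is what a 95% confidence level promises about the method rather than about any single interval.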

Hypothesis Testing for the difference of two means

Remember, we are still in a situation where we know, by some miracle, the standard deviation of both populations. Instead of building a confidence interval we want to test the null hypothesis that the means of the populations are identical. That is, we have H0: μ1 = μ2. We take note here that we could restate the null hypothesis as H0: μ1 - μ2 = 0. Using this form of the hypothesis is more convenient in that it causes the distribution of the difference in sample means to be centered at 0.

To do this we need an alternative hypothesis, H1. We have three possible alternatives and we choose the one that is meaningful for the situation that arises. Those three are:
  • H1: μ1 < μ2 (restated version: H1: μ1 - μ2 < 0), a one-tail test
  • H1: μ1 > μ2 (restated version: H1: μ1 - μ2 > 0), a one-tail test
  • H1: μ1 ≠ μ2 (restated version: H1: μ1 - μ2 ≠ 0), a two-tail test.

To perform a test on H0: μ1 - μ2 = 0 we need a test statistic. That will be the difference in the two sample means, namely, x̄1 - x̄2. We also need a level of significance for the test and we need to know the distribution of the difference in sample means. The level of significance will be stated in the problem. If the null hypothesis is true then the distribution of the difference in the sample means is as above, normal with mean 0 and standard deviation given by √(σ1²/n1 + σ2²/n2).

Case 8: Let us consider a specific example. We have two populations and we know that σ1 = 3.48 and σ2 = 4.12. We have H0: μ1 = μ2 and our alternative hypothesis is H1: μ1 < μ2. Or, restated, we have H0: μ1 - μ2 = 0 and our alternative hypothesis is H1: μ1 - μ2 < 0. We want to run the test with the level of significance set at 0.05. That is, even if the null hypothesis is true we are willing to make a Type I error, incorrectly reject the null hypothesis, 5% of the time. We know that we are going to take samples of size 38 and 47, respectively.

The critical value approach:

The z-value for the standard normal population that has 5% of the area to its left is about -1.645. The standard deviation of the distribution of the difference in sample means for samples of size 38 and 47, given the standard deviations of the populations being 3.48 and 4.12, is √(3.48²/38 + 4.12²/47), which is about 0.82453. Our one-tail critical value is the z-score, -1.645, times our standard error, 0.82453, below the distribution mean. However, remember that under the null hypothesis the distribution of the difference in the sample means will have a mean of 0. As a result, our critical value will be -1.645*0.82453 ≈ -1.356. Therefore, we will take our random samples (according to the given sample sizes) and if the difference between the means, x̄1 - x̄2, is less than the critical value, i.e., less than -1.356, then we will reject H0, otherwise we will say that we have insufficient evidence to reject the null hypothesis.

We take our two random samples. They have sample mean values of 25.46 and 26.91, respectively. 25.46 - 26.91 = -1.45 which is less than the critical value. Therefore, we reject H0 at the 0.05 level of significance.

The R statements for those computations are:
sigma_one <- 3.48
sigma_two <- 4.12
n_one <- 38
n_two <- 47
sd <- sqrt(sigma_one^2/n_one + sigma_two^2/n_two)
sd
z_alpha <- qnorm(0.05)
z_alpha
crit_low <- sd*z_alpha
crit_low
x_one <- 25.46
x_two <- 26.91
pe <- x_one - x_two
pe
The console view of those statements is given in Figure 7.

Figure 7



The attained significance approach:

For the attained significance approach we still need to compute the standard error value which we know is about 0.82453. Then we take our two samples and get the sample means. From those values we compute the difference, 25.46 - 26.91 = -1.45. Because the null hypothesis has 0 as the value of the difference of the true means, our question becomes "What is the probability of getting a difference of sample means such as we found or one even more negative?" We "normalize" that difference by dividing it by the standard error. That produces a z-value, -1.45/0.82453 ≈ -1.759, for a standard normal distribution. As such we can use a table, a calculator, or the computer to find the probability of getting -1.759 or a value more negative than it. That probability, 0.0393, is the attained significance. We were running the test at the 0.05 level of significance. Our attained significance 0.0393 is less than the 0.05 so we reject H0.

The R statements for those computations are:
sigma_one <- 3.48
sigma_two <- 4.12
n_one <- 38
n_two <- 47
sd <- sqrt(sigma_one^2/n_one + sigma_two^2/n_two)
sd
x_one <- 25.46
x_two <- 26.91
pe <- x_one - x_two
pe
z <- pe/ sd
z
attained <- pnorm(z)
attained
The console view of those statements is given in Figure 8.

Figure 8

Just for the record, the computation of Figure 8 did not need to be so extensive. Recall that the pnorm() function allows us to specify the mean and standard deviation rather than forcing us to compute the "normalized" value of z. Also, we can enter the difference of the means as the raw difference. The statements shown in Figure 9 employ these techniques.

Figure 9

One seemingly strange part of Figure 9, the use of sd=sd, deserves some additional comment. The name of the parameter in the pnorm() function is sd, just as the parameter for the mean is mean. We need to assign a value to that parameter. A structure such as sd=3.12 would tell pnorm() to use that value for the standard deviation. Prior to the use of pnorm() in Figure 9 we defined a variable, sd to hold the computed standard error. It did not need to be called sd but there is no harm in using that name. Thus, the structure sd=sd tells pnorm() to assign the value stored in the variable sd, the one on the right of the equal sign, to the parameter sd, the one on the left of the equal sign.
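
Because Figure 9 is an image, the statements themselves do not appear in the text. A minimal sketch of what such statements could look like, using the Case 8 values and the sd=sd structure just described, is given here. It is a reconstruction from the description, not a copy of the figure.

```r
sigma_one <- 3.48
sigma_two <- 4.12
n_one <- 38
n_two <- 47
# standard error of the difference in sample means
sd <- sqrt(sigma_one^2/n_one + sigma_two^2/n_two)
# give pnorm() the raw difference along with the mean and the
# standard deviation of the distribution; no need to compute z
attained <- pnorm(25.46 - 26.91, mean=0, sd=sd)
attained
```

The value of attained should agree with the one found from the normalized z-value, about 0.0393.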

Case 9: Let us consider another example. We have two populations and we know that σ1 = 37.2 and σ2 = 35.1. We have H0: μ1 = μ2 and our alternative hypothesis is H1: μ1 ≠ μ2. Or, restated, we have H0: μ1 - μ2 = 0 and our alternative hypothesis is H1: μ1 - μ2 ≠ 0. We want to run the test with the level of significance set at 0.02. That is, even if the null hypothesis is true we are willing to make a Type I error, incorrectly reject the null hypothesis, 2% of the time. We recognize that this is a two-tail test. We will reject H0 if the difference in the sample means is sufficiently far from 0 in either the positive or negative direction. Finally, we know that we are going to take samples of size 53 and 44, respectively.

The critical value approach:

The z-value for the standard normal population that has 1% of the area to its left is about -2.326 and the z-value for the standard normal population that has 1% of the area to its right is about 2.326. The standard deviation of the distribution of the difference in sample means for samples of size 53 and 44, given the standard deviations of the populations being 37.2 and 35.1, is √(37.2²/53 + 35.1²/44), which is about 7.356. Our two-tailed critical values are the low z-score, -2.326, times our standard error, 7.356, below the distribution mean, and the high z-score, 2.326, times our standard error, 7.356, above the distribution mean. However, remember that under the null hypothesis the distribution of the difference in the sample means will have a mean of 0. As a result, our critical values will be -2.326*7.356 ≈ -17.113 and 2.326*7.356 ≈ 17.113 (using the unrounded z-score and standard error). Therefore, we will take our random samples (according to the given sample sizes) and if the difference between the means, x̄1 - x̄2, is not between -17.113 and 17.113 then we will reject H0, otherwise we will say that we have insufficient evidence to reject the null hypothesis.

We take our two random samples. They have sample mean values of 276.42 and 261.13, respectively. 276.42 - 261.13 = 15.29 which is between the critical values. Therefore, we do not reject H0 at the 0.02 level of significance.

The R statements for those computations are:
sigma_one <- 37.2
sigma_two <- 35.1
n_one <- 53
n_two <- 44
sd <- sqrt(sigma_one^2/n_one + sigma_two^2/n_two)
sd
z_alpha <- qnorm(0.01)
z_alpha
crit_low <- sd*z_alpha
crit_low
crit_high <- -crit_low
crit_high
x_one <- 276.42
x_two <- 261.13
pe <- x_one - x_two
pe
The console view of those statements is given in Figure 10.

Figure 10

The attained significance approach:

With the given values of the population standard deviations and the known sample sizes we compute the standard error as before, namely, 7.356. Then we take our samples and find the sample means, 276.42 and 261.13. We note that the difference 276.42 - 261.13 will be positive. Therefore, we need to find the probability of getting that value or higher from our distribution which has mean 0 and standard deviation 7.356. We can do this with a pnorm() function call. The result then needs to be multiplied by 2 because we are looking at a two-tailed test. We are answering the question "What is the probability of being that far away from the mean, in either direction?" We calculated one side so we need to double it to get the total probability. The one-side value is 0.0188, so the total will be 0.0376, a value that is larger than our level of significance, 0.02, in the problem statement. Therefore, we do not reject H0 at the 0.02 level of significance.

The following R statements do these computations:
sigma_one <- 37.2
sigma_two <- 35.1
n_one <- 53
n_two <- 44
sd <- sqrt(sigma_one^2/n_one + sigma_two^2/n_two)
attained <- pnorm(276.42-261.13, 
                  mean=0,sd=sd,lower.tail=FALSE)
attained
attained*2
The console image of those statements is given in Figure 11.

Figure 11
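
The statements above rely on our noticing that the difference is positive. As a small variation, offered only as a sketch, we can take the absolute value of the difference and always look at the lower tail; then the same statement works whichever sign the difference has. The variable names sd_diff and diff_means are introduced here just for this illustration.

```r
sigma_one <- 37.2
sigma_two <- 35.1
n_one <- 53
n_two <- 44
sd_diff <- sqrt(sigma_one^2/n_one + sigma_two^2/n_two)
diff_means <- 276.42 - 261.13
# area at or below -|difference|, doubled to cover both tails
attained_two_tail <- 2*pnorm(-abs(diff_means), mean=0, sd=sd_diff)
attained_two_tail
# subtracting in the other order gives the same result
2*pnorm(-abs(-diff_means), mean=0, sd=sd_diff)
```

This should reproduce the doubled value found above, about 0.0376.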

At this point it seems appropriate to try to capture these commands in a function so that we can use the function to solve any problem like these. Consider the following function definition:
hypoth_2test_known <- function(
  sigma_one, n_one, mean_one,
  sigma_two, n_two, mean_two,
  H1_type=0, sig_level=0.05)
{ # perform a hypothesis test for the difference of
  # two means when we know the standard deviations of
  # the two populations and the alternative hypothesis is
  #      !=  if H1_type==0
  #      <   if H1_type < 0
  #      >   if H1_type > 0
  # Do the test at sig_level significance.

  diff_sd <- sqrt( sigma_one^2/n_one + sigma_two^2/n_two )
  if( H1_type==0)
     { z <- abs( qnorm(sig_level/2))}
  else
     { z <- abs( qnorm(sig_level))}
  to_be_extreme <- z*diff_sd
  decision <- "Reject"
  diff <- mean_one - mean_two
  if( H1_type < 0 )
     { crit_low <-  - to_be_extreme
       crit_high <- "n.a."
       if( diff > crit_low)
         { decision <- "do not reject"}
       attained <- pnorm( diff, mean=0, sd=diff_sd)
       alt <- "mu_1 < mu_2"
     }
  else if ( H1_type == 0)
     { crit_low <-  - to_be_extreme
       crit_high <-  to_be_extreme
       if( (crit_low < diff) & (diff < crit_high) )
          { decision <- "do not reject"}

       if( diff < 0 )
         { attained <- 2*pnorm(diff, mean=0, sd=diff_sd)}
       else
         { attained <- 2*pnorm(diff, mean=0, sd=diff_sd,
                             lower.tail=FALSE)
         }
       alt <- "mu_1 != mu_2"
     }
  else
     { crit_low <- "n.a."
       crit_high <-  to_be_extreme
       if( diff < crit_high)
         { decision <- "do not reject"}
       attained <- pnorm(diff, mean=0, sd=diff_sd,
                         lower.tail=FALSE)
       alt <- "mu_1 > mu_2"
     }
 result <- c( alt, sigma_one, n_one, mean_one,
                   sigma_two, n_two, mean_two,
                   diff_sd, diff, sig_level, z,
                   crit_low, crit_high, attained, decision)
 names(result) <- c("H1:",
                    "sigma_one", "n_one","mean_one",
                    "sigma_two", "n_two","mean_two",
                    "std. err.", "difference", "sig level", "z",
                    "critical low", "critical high",
                    "attained", "decision")
 return( result)
  
}
Once this function has been installed we can accomplish the analysis of Case 8 above via the command
hypoth_2test_known(3.48, 38, 25.46,
                   4.12, 47, 26.91,
                   -1, 0.05)
Figure 12 shows the console view of that command.

Figure 12

In a similar fashion, Case 9 could have been handled via
hypoth_2test_known(37.2, 53, 276.42,
                   35.1, 44, 261.13,
                   0, 0.02)
Figure 13 shows the console view of that command.

Figure 13

Clearly, the command shortens our work and still provides us with all of the details.
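
One detail is worth noting if we want to reuse the values that the function returns. Because the result combines text (such as the decision) with numbers in a single vector built by c(), R stores every element as text. The sketch below uses a small stand-in vector, not the function itself, to show the effect and one way to get a number back; the name std_err is introduced only for this illustration.

```r
# stand-in for a few elements of the vector that
# hypoth_2test_known() returns; values taken from Case 8
result <- c("mu_1 < mu_2", 0.82453, "Reject")
names(result) <- c("H1:", "std. err.", "decision")
class(result)                     # every element is stored as text
# convert one element back to a number before doing arithmetic
std_err <- as.numeric(result["std. err."])
std_err * 2
```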

Case 10: We can use the function in a new problem. We start with two populations with known standard deviations, namely, 15.97 and 21.62. Furthermore, we know that the two populations have a normal distribution. As such we can "get away with" having slightly smaller samples. Our null hypothesis is that the means of the two populations are the same. Thus, we have H0: μ1 - μ2 = 0 and our alternative hypothesis is H1: μ1 - μ2 > 0. We want to test H0 at the 0.025 level of significance. (This is a one-tail test.) We agree to take samples from each population, where the sample sizes are 24 and 28, respectively. Table 5 has the sample from the first population.

Table 6 has the sample from the second population.

For this information we need to do a little processing. We want the sample size and the sample mean for each of the two samples. We can generate the data and get those values with the R commands:
gnrnd4( key1=2185232304, key2=0159705826 )
L1
n1 <- length(L1)
m1 <- mean(L1)
gnrnd4( key1=2325842704, key2=0212604672 )
L1
n2 <- length(L1)
m2 <- mean(L1)
It was not necessary to display the data but it is nice to have the display so that we can confirm that we have correctly entered the gnrnd4() commands. The console view follows in Figure 14.
Figure 14

Then, to do the problem we just need the command
hypoth_2test_known(15.97, n1, m1,
                   21.62, n2, m2,
                   1, 0.025)
That command produces the output of Figure 15.
Figure 15

Figure 15 gives us the standard error, 5.227, the two sample means, 58.015 and 48.664, the difference of the means, 9.351, the critical value, 10.245 (there is only one since this is a one-tail test), the attained significance, 0.0368, and even the decision, do not reject. We would reach that decision using the critical value approach because the difference was not greater than the critical value, or by the attained significance approach because the attained significance was not less than the level of significance.
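
As a check on those values, we can redo the key computations directly. The sketch below does not regenerate the samples; it simply takes the difference of the sample means, 9.351, as reported from Figure 15. The variable names are chosen for this illustration.

```r
sigma_one <- 15.97
n_one <- 24
sigma_two <- 21.62
n_two <- 28
diff_means <- 9.351               # difference reported in Figure 15
sig_level <- 0.025
std_err <- sqrt(sigma_one^2/n_one + sigma_two^2/n_two)
std_err
# one-tail test to the right: only a high critical value
crit_high <- qnorm(sig_level, lower.tail=FALSE)*std_err
crit_high
attained <- pnorm(diff_means, mean=0, sd=std_err, lower.tail=FALSE)
attained
diff_means > crit_high            # critical value approach
attained < sig_level              # attained significance approach
```

Both comparisons should come out FALSE, in agreement with the decision not to reject H0.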

Table 7 presents some sample problems. The solutions to those problems are given, case by case, in Table 8. Please recall that the problems given above change every time that the page is reloaded. Because that is the case, we cannot provide an RStudio solution to the problems just given. There is, however, a page that takes problems from a previously generated version of this page and displays both of the tables, one giving the problems and the other giving the answers, along with the RStudio solutions to those problems. That page of demonstrations is at Test for Comparing Means of Two Populations, Sigmas Known.

©Roger M. Palay     Saline, MI 48176     February, 2016