Inference for the Mean

Contents of this page:
Assumptions
Confidence Interval
Hypothesis Test
Sample Size
Paired Data

After all the theory, here are some examples. Actually, we have discussed almost everything here already.

Method

1-sample t

Assumptions

The methods discussed here work if:
the data comes from a simple random sample
the data comes from a normal distribution, or the sample size is large enough

Confidence Interval


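A $100(1-\alpha)\%$ confidence interval for the population mean $\mu$ is given by

$$\bar{x} \pm t_{n-1,\alpha/2} \cdot s/\sqrt{n}$$

where $\bar{x}$ is the sample mean, $s$ is the sample standard deviation and $n$ is the sample size.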

Example: Consider again the data set for newborn babies and the drug status of their mothers. Previously we found the following summary information:

Find 90% confidence intervals for the mean length of the babies in the three groups.

Note: $100(1-\alpha)\% = 90\%$, so $\alpha = 0.1$

1) Drug free
$n=39$, $\bar{x} = 51.1$, $s=2.9$, $t_{n-1,\alpha/2} = t_{38,0.05} = 1.686$, so
$\bar{x} \pm t_{n-1,\alpha/2} \cdot s/\sqrt{n} = 51.1 \pm 1.686 \times 2.9/\sqrt{39} = 51.1 \pm 0.78$ or (50.32, 51.88)

2) First Trimester
$n=19$, $\bar{x} = 49.3$, $s=2.5$, $t_{n-1,\alpha/2} = t_{18,0.05} = 1.734$, so
$\bar{x} \pm t_{n-1,\alpha/2} \cdot s/\sqrt{n} = 49.3 \pm 1.734 \times 2.5/\sqrt{19} = 49.3 \pm 0.99$ or (48.31, 50.29)

3) Throughout
$n=36$, $\bar{x} = 48.0$, $s=3.6$, $t_{n-1,\alpha/2} = t_{35,0.05} = 1.690$, so
$\bar{x} \pm t_{n-1,\alpha/2} \cdot s/\sqrt{n} = 48.0 \pm 1.690 \times 3.6/\sqrt{36} = 48.0 \pm 1.01$ or (46.99, 49.01)
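
The three hand calculations above follow the same recipe, so they can be checked with a few lines of code. What follows is a minimal Python sketch, not part of the MINITAB workflow used on this page. The function name `t_interval` is made up for illustration, and the critical values are typed in from a t table, since the Python standard library has no t distribution.

```python
from math import sqrt

def t_interval(n, mean, s, t_crit):
    """Return (margin, lower, upper) for mean ± t_crit * s / sqrt(n)."""
    margin = t_crit * s / sqrt(n)
    return margin, mean - margin, mean + margin

# Summary statistics from the example: (n, sample mean, s, t_{n-1, 0.05}).
# The critical values are assumptions read off a t table, not computed here.
groups = {
    "Drug free":       (39, 51.1, 2.9, 1.686),
    "First trimester": (19, 49.3, 2.5, 1.734),
    "Throughout":      (36, 48.0, 3.6, 1.690),
}

for name, (n, mean, s, t_crit) in groups.items():
    margin, lower, upper = t_interval(n, mean, s, t_crit)
    print(f"{name}: {mean} ± {margin:.2f}, i.e. ({lower:.2f}, {upper:.2f})")
```

Because the inputs are rounded summary statistics, the results can differ in the last digit from what statistical software reports on the raw data.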

MINITAB uses the command Stat > Basic Statistics > 1-Sample t to do all the work, find confidence intervals and much more:

Example: Find a 90% confidence interval for the mean length of babies of mothers who never took cocaine.

Solution.
First we need to check the assumptions:
Graph > Boxplot, Simple, Drug Free and Graph > Probability Plot, Single, Drug Free
both show that the data is reasonably normal

Stat > Basic Statistics > 1-Sample t, Samples in column=Drug free, Options > Confidence level=90.0
A 90% CI for the length of babies of Drug Free mothers is (50.318, 51.882)

Hypothesis Test

The details of the hypothesis test for a population mean are as follows:

Null Hypothesis: $H_0: \mu = \mu_0$

Note: $\mu_0$ is not the symbol "$\mu_0$" but a specific number which you need to get from the problem.

Alternative Hypothesis: Choose one of the following, depending on the problem:
a) $H_a: \mu > \mu_0$
b) $H_a: \mu < \mu_0$
c) $H_a: \mu \neq \mu_0$

Test Statistic: $T = \sqrt{n}\,(\bar{x} - \mu_0)/s$

Rejection Region:
If your alternative is a) $H_a: \mu > \mu_0$, then reject $H_0$ if $T > t_{n-1,\alpha}$
If your alternative is b) $H_a: \mu < \mu_0$, then reject $H_0$ if $T < -t_{n-1,\alpha}$
If your alternative is c) $H_a: \mu \neq \mu_0$, then reject $H_0$ if $|T| > t_{n-1,\alpha/2}$

Example: Test at the 10% level whether the mean length of "Drug Free" babies is more than 50 cm.

Solution by hand:

1) Parameter: mean $\mu$
2) Method: 1-sample t
3) Assumptions: the boxplot and the normal probability plot show no problem with the normal assumption
4) $\alpha = 0.1$
5) $H_0: \mu = 50.0$ (mean length is 50 cm)
6) $H_a: \mu > 50.0$ (mean length is more than 50 cm)
7) $T = \sqrt{n}\,(\bar{x}-\mu_0)/s = \sqrt{39}\,(51.1-50)/2.9 = 2.37$
8) We reject $H_0$ if $T > t_{n-1,\alpha} = t_{38,0.1} = 1.304$.
$T = 2.37 > 1.304$, so we do reject the null hypothesis
9) The mean length is statistically significantly higher than 50 cm
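
The test statistic and the three rejection regions can be written down compactly in code. The following Python sketch is only an illustration: the function names `t_statistic` and `reject_h0` and the labels "greater", "less" and "two-sided" are made up here, and the critical value is again taken from a t table rather than computed.

```python
from math import sqrt

def t_statistic(n, mean, s, mu0):
    """Test statistic T = sqrt(n) * (mean - mu0) / s."""
    return sqrt(n) * (mean - mu0) / s

def reject_h0(T, t_crit, alternative):
    """Apply the rejection region for the chosen alternative.

    t_crit must be t_{n-1, alpha} for "greater" and "less",
    and t_{n-1, alpha/2} for "two-sided".
    """
    if alternative == "greater":
        return T > t_crit
    if alternative == "less":
        return T < -t_crit
    if alternative == "two-sided":
        return abs(T) > t_crit
    raise ValueError(f"unknown alternative: {alternative!r}")

# "Drug Free" example: H0: mu = 50 vs Ha: mu > 50 at alpha = 0.1
T = t_statistic(n=39, mean=51.1, s=2.9, mu0=50.0)
# t_{38, 0.1} = 1.304 is read off a t table (an input, not computed here)
decision = reject_h0(T, t_crit=1.304, alternative="greater")
print(f"T = {T:.2f}, reject H0: {decision}")
```

Keeping the alternative as an explicit argument makes it harder to pair a one-sided test with a two-sided critical value by accident.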

Solution by computer:


1) Parameter: mean $\mu$
2) Method: 1-sample t
3) Assumptions: the boxplot and the normal probability plot show no problem with the normal assumption
4) $\alpha = 0.1$
5) $H_0: \mu = 50.0$ (mean length is 50 cm)
6) $H_a: \mu > 50.0$ (mean length is more than 50 cm)
7) p-value = 0.011 (Stat > Basic Statistics > 1-sample t > Samples in column: Drug Free, test mean: 50.0, Options > Alternative: greater than)
8) p-value = 0.011 < $\alpha = 0.1$, so we do reject the null hypothesis
9) The mean length is statistically significantly higher than 50 cm

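Notice the advantage of the p-value approach: $p = 0.011$ shows how strong the evidence is, not just the decision at one level. We reject $H_0$ at $\alpha = 0.1$ (and would also at $\alpha = 0.05$), but if we had chosen $\alpha = 0.01$ we would have failed to reject $H_0$.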
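
To see where such a p-value comes from, it can be approximated without statistical software by integrating the density of the t distribution numerically. The sketch below is an illustration under stated assumptions: the helper names `t_pdf` and `upper_tail` are made up, Simpson's rule stands in for whatever method MINITAB uses internally, and the test statistic is rebuilt from the rounded summary statistics, so the result only approximately matches the 0.011 reported above.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # By symmetry P(T > 0) = 0.5, so subtract the area between 0 and t.
    return 0.5 - total * h / 3

T = sqrt(39) * (51.1 - 50.0) / 2.9   # test statistic from the example
p = upper_tail(T, df=38)             # one-sided p-value for Ha: mu > 50
print(f"p = {p:.4f}")
for alpha in (0.1, 0.05, 0.01):
    print(f"alpha = {alpha}: reject H0 is {p < alpha}")
```

The loop makes the point of the paragraph above concrete: one p-value answers the question for every choice of level at once.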

Sample Size Calculations

One of the most important questions facing a researcher is how large a sample is needed to be able to draw valid conclusions. Here are some helpful formulas. We start with the formula for the confidence interval:

$$\bar{x} \pm t_{n-1,\alpha/2} \cdot s/\sqrt{n}$$

or, as we have previously written it: sample mean ± error.
A sample size calculation starts with a decision on how large an error we are willing to accept. Let's call this error $E$.
In our formula we have the term $t_{n-1,\alpha/2}$, which includes the sample size $n$ that we are trying to find. Here we will assume that $n$ is going to be reasonably large, say at least 50. In that case $t_{n-1,\alpha/2} \approx z_{\alpha/2}$, the critical value from a standard normal distribution, which we have encountered already.
So we get the following formula:

$$E = z_{\alpha/2} \cdot s/\sqrt{n}$$

Unfortunately this formula still contains an unknown quantity, namely $s$. This of course is the sample standard deviation, an estimate of the population standard deviation $\sigma$, and it is not available before the data has been collected. Here are several possible ideas:
Is there already an estimate of $\sigma$ we can use, maybe from a previous or from a similar study?
If not, maybe we can do a pilot study (something that is very often a good idea anyway).

Once we have some idea what $\sigma$ is, we can replace the $s$ in our formula with this $\sigma$ and then solve for $n$, rounding up to the next whole number:

$$n = (z_{\alpha/2} \cdot \sigma / E)^2$$

Example: We found that a 90% confidence interval for the mean length of babies of "Drug Free" mothers is (50.32, 51.88), or 51.1 ± 0.78, so the error on this estimate is 0.78. What sample size would be needed to find a 90% confidence interval with an error of 0.5?
We can use the sample standard deviation $s=2.9$ as a guess for the population standard deviation, so we have $\sigma=2.9$. We want a 90% confidence interval, so we have
$100(1-\alpha)\%=90\%$, $\alpha=0.1$, $z_{\alpha/2} = z_{0.05} = 1.645$. Therefore
$n = (z_{\alpha/2} \cdot \sigma/E)^2 = (1.645 \times 2.9/0.5)^2 = 91.03$, which we round up to $n = 92$
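
The same calculation can be sketched in Python using only the standard library, where `statistics.NormalDist` supplies the normal critical value. The function name `sample_size` and its parameter names are made up for this illustration, and the guess for the population standard deviation has to come from you, as discussed above.

```python
from math import ceil
from statistics import NormalDist

def sample_size(sigma, error, conf_level):
    """Smallest n with z_{alpha/2} * sigma / sqrt(n) <= error."""
    alpha = 1 - conf_level
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    return ceil((z * sigma / error) ** 2)     # always round up

# Guess sigma = 2.9 from the earlier sample, target error E = 0.5, 90% confidence
n = sample_size(sigma=2.9, error=0.5, conf_level=0.90)
print(n)
```

Since the error enters the formula squared, cutting the acceptable error in half requires roughly four times as many observations.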

Paired Data

Example: A group of people have participated in a weight loss program involving diet and exercise. In order to assess the program each participant was weighed twice, once before the program and once after. The data is
Subject   Weight Before   Weight After
Paul      189             175
July      135             129
Linda     156             163
Carlos    213             192
...       ...             ...
Jose      191             196
Now the obvious question is: does the program work?
At first this looks like a new type of problem; after all, we have two variables for each person, before and after. But what is important here is not the two numbers but their difference: for Paul what matters is not that he weighed 189 pounds before the program and 175 pounds afterwards, but that he lost 14 pounds (189-175). This is called a paired data problem.

The analysis of paired data is actually quite simple: compute the difference for each pair, and then treat it as a one-sample problem for the population mean.

So for the data above compute the differences
Subject   Weight Difference
Paul       14
July        6
Linda      -7
Carlos     21
...       ...
Jose       -5

and now do the inference for these numbers.

Warning: don't forget the minus signs!
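
As a small illustration of this step, here is a Python sketch that computes the differences for the rows shown above. It uses only the five subjects that are visible in the table (the rows hidden behind "..." are not invented), and the variable names are made up for the example.

```python
# Only the rows visible in the table; the elided rows ("...") are left out.
weights = {
    "Paul":   (189, 175),
    "July":   (135, 129),
    "Linda":  (156, 163),
    "Carlos": (213, 192),
    "Jose":   (191, 196),
}

# difference = before - after, so a positive value is a weight loss
# and a negative value is a weight gain
differences = {name: before - after for name, (before, after) in weights.items()}

for name, d in differences.items():
    print(f"{name}: {d:+d}")
```

Printing the sign explicitly is one way to make sure the negative differences are not lost when the numbers are copied into a later calculation.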

Some comments:

1) the most natural null hypothesis for a paired data problem is $H_0: \mu_d = 0$. Here $\mu_d$ is the population mean of the differences

2) the sample size n is the number of pairs (or differences), not the number of observations in the original sample

Example: In order to see whether there is a difference in the prices of foods at two local supermarkets we bought a basket of the same items at each of the two stores. Test at the 5% level whether the mean prices are the same. The data is in foods.

Solution by hand: First compute the mean and the standard deviation of the differences:

Make column diff with Calc > Calculator, Store in diff, Expression: 'Market 1' - 'Market 2'
Stat > Basic Statistics > Display Descriptive Statistics, diff


1) Parameter: mean of differences $\mu_d$
2) Method: 1-sample t
3) Assumptions: the boxplot and the normal probability plot of the differences show no problem with the normal assumption
4) $\alpha = 0.05$
5) $H_0: \mu_d = 0$ (mean prices are the same)
6) $H_a: \mu_d \neq 0$ (mean prices are different)
7) $n=15$, $\bar{x}=-0.0367$, $s=0.157$
$T = \sqrt{n}\,(\bar{x}-0)/s = \sqrt{15}\,(-0.0367-0)/0.157 = -0.9$
8) We reject $H_0$ if $|T| > t_{n-1,\alpha/2} = t_{14,0.025} = 2.145$. $|T| = 0.9 < 2.145$, so we fail to reject the null hypothesis
9) There is no statistically significant evidence that the mean prices in the two stores are different
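
Steps 7 and 8 can be reproduced from the summary statistics of the differences with a short Python sketch. The variable names are made up for illustration, the critical value is read off a t table rather than computed, and because the inputs are rounded the statistic agrees with software output only to about two digits.

```python
from math import sqrt

# Summary statistics of diff = 'Market 1' - 'Market 2'
n, mean_diff, s_diff = 15, -0.0367, 0.157
t_crit = 2.145   # t_{14, 0.025} from a t table (an input, not computed here)

# H0: mu_d = 0, so the hypothesized value subtracted from the mean is 0
T = sqrt(n) * (mean_diff - 0) / s_diff

# Two-sided alternative: reject only if |T| exceeds the critical value
reject = abs(T) > t_crit
print(f"T = {T:.2f}, reject H0: {reject}")
```

Note that `n` is the number of baskets items compared (the number of pairs), not the total number of prices collected.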

Solution by computer:

1) Parameter: mean of differences $\mu_d$
2) Method: 1-sample t
3) Assumptions: the boxplot and the normal probability plot of the differences show no problem with the normal assumption
4) $\alpha = 0.05$
5) $H_0: \mu_d = 0$ (mean prices are the same)
6) $H_a: \mu_d \neq 0$ (mean prices are different)
7) $p = 0.381$ (Stat > Basic Statistics > Paired t, First Sample: Market 1, Second Sample: Market 2)
8) $p = 0.381 > 0.05 = \alpha$, so we fail to reject the null hypothesis
9) There is no statistically significant evidence that the mean prices in the two stores are different

We could have also run a 1 sample t test on "diff", with exactly the same result.

Example: Kelby Childers asked subjects to perform several tasks before and after 24 hours of sleep deprivation. One task involved the subjects lifting weights until muscle failure. The data in sleep is a count of the number of bench presses done before ("Pre") and after ("Post") the 24 hours. Find a 95% confidence interval for the mean difference in the number of bench presses.

Solution by computer: Make column diff with Calc > Calculator, Store in diff, Expression: 'Pre' - 'Post'
Draw a boxplot and a normal plot of diff to check normality; it seems ok.
Stat > Basic Statistics > Paired t, First Sample: Pre, Second Sample: Post
95% CI for mean difference: (1.46524, 3.40976)

Solution by hand: A boxplot and the normal plot show the differences in presses to be reasonably normal.
Stat > Basic Statistics > Display Descriptive Statistics, diff
$n=16$, $\bar{x}=2.438$, $s=1.825$
$100(1-\alpha)\%=95\%$, $\alpha=0.05$, $t_{15,0.025}=2.131$
$\bar{x} \pm t_{n-1,\alpha/2} \cdot s/\sqrt{n} = 2.438 \pm 2.131 \times 1.825/\sqrt{16} = 2.438 \pm 0.972$
So a 95% confidence interval for the mean difference in bench presses is (1.466, 3.410)
Note that the whole confidence interval lies above 0, which indicates that the mean difference is positive; that is, the subjects were able to bench press more before the sleep deprivation than after it, just what we would expect if there is an effect at all.
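
The hand calculation and the check whether the interval contains 0 can be sketched in Python as follows. The variable names are made up for illustration, the critical value is typed in from a t table, and the inputs are the rounded summary statistics, so the endpoints agree with the MINITAB interval only to about three digits.

```python
from math import sqrt

# Summary statistics of diff = 'Pre' - 'Post'
n, mean_diff, s_diff = 16, 2.438, 1.825
t_crit = 2.131   # t_{15, 0.025} from a t table (an input, not computed here)

margin = t_crit * s_diff / sqrt(n)
lower, upper = mean_diff - margin, mean_diff + margin
print(f"95% CI: ({lower:.3f}, {upper:.3f})")

# If 0 is outside the interval, "no mean difference" is not a plausible value
print("interval excludes 0:", lower > 0 or upper < 0)
```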

For more on inference for a single mean see page 412 of the textbook.