A hypothesis test is a statistical method that answers a yes-no question.
Example Is the average GPA of undergraduates at the Colegio less than 2.8?
Example Is the average income of men in Puerto Rico higher than the average income of women?
Example Is there a relationship between smoking and lung cancer?
A hypothesis test is usually phrased in the form of two statements rather than a question. These statements are called the null hypothesis (H0) and the alternative or research hypothesis (H1 or Ha).
Example
H0: The average GPA of undergraduates at the Colegio is 2.8 (or maybe even higher).
Ha: The average GPA of undergraduates at the Colegio is less than 2.8.
Example
H0: The average income of men and women in Puerto Rico is the same.
Ha: The average income of men in Puerto Rico is higher than the average income of women.
Example
H0: There is no relationship between smoking and lung cancer.
Ha: There is a relationship between smoking and lung cancer.
Now instead of deciding whether we should answer the question with yes or no we are going to decide which statement we believe is true, but of course this is (almost) the same thing.
Often we will make our decision based on a parameter, and the value of the corresponding statistic. If so we can also express the hypotheses in terms of population parameters. If the hypothesis is written in terms of parameters this (almost always) means that the null hypothesis has an = sign.
Example Is the average GPA of undergraduates at the Colegio less than 2.8? Here we are looking at an "average". In Statistics we have several ways to compute an "average", such as the mean or the median. Which of these is better depends on many considerations. Let's say we use the mean. Now the standard symbol for a population mean is μ, and so we can write the hypotheses as follows:
H0: μ = 2.8
Ha: μ < 2.8
Example Is the average income of men in Puerto Rico higher than the average income of women? Again we are interested in "averages", and let's say here we decide to use the median. The population median is sometimes denoted by λ. But there are two medians: the median income of men and the median income of women. Let's denote them by λM and λW, respectively. Then the hypotheses are:
H0: λM = λW
Ha: λM > λW
Example Is there a relationship between smoking and lung cancer? Now we are looking at the relationship between the two variables "smoking" and "lung cancer". If both are continuous variables (and some other conditions hold) we could use Pearson's correlation coefficient ρ as a measure of the strength of the relationship:
H0: ρ = 0
Ha: ρ ≠ 0
Hypothesis Testing: Formalism and Notation
A complete hypothesis test has to have all of the following parts:
1) Parameter of interest
2) Method of analysis
3) Assumptions of Method
4) Type I error probability α
5) Null hypothesis H0
(in terms of parameter and in plain language)
6) Alternative hypothesis Ha
(in terms of parameter and in plain language)
7) Test statistic
8) Rejection region and decision on test
9) Conclusion (in plain language)
Warning In any homework or exam, if any of these parts are missing you will lose points!
Example: Over the last five years the average score in the final exam of a course was 73 points. This semester a class with 27 students used a new textbook, and the mean score in the final was 78.1 points with a standard deviation of 7.1.
Question: did the class using the new textbook do (statistically significantly) better?
For this specific example the complete hypothesis test looks as follows:
1) Parameter: mean μ
2) Method: 1-sample t test
3) Assumptions: data comes from normal distribution, or n large. Checked boxplot
4) α = 0.05
5) H0: μ = 73 (mean score is still 73)
6) Ha: μ > 73 (mean score is higher than 73)
7) T = √n·(x̄ − μ0)/s (x̄: sample mean, s: sample standard deviation, μ0: mean under H0)
8) reject H0 if T > t26, 0.05 = 1.706; here T = 3.81 > 1.706, so we reject the null hypothesis
9) The mean score in the final is statistically significantly higher than before.
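Steps 7) and 8) can be retraced by hand or with a few lines of code. The following is only a sketch in Python (the course itself uses MINITAB); it assumes the usual 1-sample t statistic and takes the critical value 1.706 from step 8) rather than computing it. Because it starts from the rounded summary figures, the statistic it prints need not agree to the last digit with the value quoted above, but the decision is the same.

```python
import math

# Summary figures from the textbook example.
n = 27          # class size
xbar = 78.1     # sample mean of the final exam scores
s = 7.1         # sample standard deviation
mu0 = 73.0      # mean score under H0

# One-sample t statistic: how many standard errors the sample mean lies above mu0.
standard_error = s / math.sqrt(n)
t_stat = (xbar - mu0) / standard_error

# Critical value t(26, 0.05), copied from step 8) of the text, not computed here.
t_crit = 1.706
reject_h0 = t_stat > t_crit

print(f"standard error = {standard_error:.3f}")
print(f"T = {t_stat:.2f}, critical value = {t_crit}, reject H0: {reject_h0}")
```

The variable names are ours; only the four summary numbers and the critical value come from the example.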
There is a second technique for carrying out a hypothesis test called the p-value approach. It differs from the above only in steps 7) and 8). In part 7) we use MINITAB to find a number called the p-value. This has to include the sequence of commands you used. In 8) we decide whether or not to reject the null hypothesis based on this p-value as follows:
| p < α | p > α |
| reject H0 | fail to reject H0 |
Example for carrying out a hypothesis test using the p-value approach. We will do the same example as above for the new textbook:
1) Parameter: mean μ
2) Method: 1-sample t test
3) Assumptions: data comes from normal distribution, or n large. Checked boxplot
4) α = 0.05
5) H0: μ = 73 (mean score is still 73)
6) Ha: μ > 73 (mean score is higher than 73)
7) p =0.000101 (Stat > Basic Statistics > 1-sample t > Summarized data: Sample size: 27, Mean: 78.1, Standard Deviation: 7.1, Test mean: 73, Options > Alternative: greater than )
8) p = 0.000101 < α = 0.05, so we reject the null hypothesis
9) The mean score in the final is statistically significantly higher than before.
What is the p-value? It is the probability of observing our data, or something even more extreme, if the null hypothesis is true. So in the example above it is P(X̄ ≥ 78.1 | μ = 73), where X̄ is the sample mean.
The advantage of the p-value approach is that in addition to the decision on whether or not to reject the null hypothesis it also gives us some idea on how close a decision it was. Here with p=0.0001 we would have rejected the null hypothesis even if we had chosen α = 0.01, so it was not a close thing at all.
Let's do a little simulation to understand the p-value
Example We will do the following:
• generate 50 observations from a normal distribution with mean μ and standard deviation σ=1.0
• find the p-value of the test H0: μ=10.0 vs Ha: μ≠10.0
• repeat 1000 times
• draw the histogram of the 1000 p-values and find the percentage of p-values<0.05
This is done in the MACRO pvalue, run it with
CTRL-L, %k:\3101\pvalue 10.0
Now if mu=10.0, the null hypothesis is true. As we can see in that case any number between 0 and 1 is equally likely to be our p-value, and so 5% are < 0.05.
As mu gets farther away from the hypothesized value of 10.0 (the null hypothesis is "more" wrong), the p-values start to "bunch up" around 0, so we correctly reject the null hypothesis more and more often.
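If MINITAB is not at hand, the same experiment can be sketched in Python. This is not the pvalue macro itself, only a rough stand-in under our own assumptions: the two-sided p-value of the 1-sample t test is found by numerically integrating the t density with Simpson's rule, and all function names, the seed and the number of integration steps are ours.

```python
import math
import random
import statistics

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=200):
    """P(|T| > |t|), using Simpson's rule on [0, |t|] (steps must be even)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3          # P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * central_half)

def simulate_p_values(mu, mu0=10.0, sigma=1.0, n=50, reps=1000, seed=1):
    """p-values of H0: mu = mu0 for reps samples drawn with true mean mu."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(reps):
        x = [rng.gauss(mu, sigma) for _ in range(n)]
        t = (statistics.fmean(x) - mu0) / (statistics.stdev(x) / math.sqrt(n))
        p_values.append(two_sided_p(t, n - 1))
    return p_values

def text_histogram(p_values, bins=10):
    """Counts of p-values in equal-width bins on [0, 1]."""
    counts = [0] * bins
    for v in p_values:
        counts[min(int(v * bins), bins - 1)] += 1
    return counts

for mu in (10.0, 10.2, 10.5):
    p = simulate_p_values(mu)
    share = sum(v < 0.05 for v in p) / len(p)
    print(f"mu = {mu}: bin counts {text_histogram(p)}, share below 0.05 = {share:.3f}")
```

With mu = 10.0 the ten bin counts should come out roughly equal, and with larger mu the first bin should swallow most of the p-values.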
Example Let's return for a moment to the 1970's draft. When we ran the simulation we were really carrying out the following hypothesis test:
1) Parameter: Pearson's correlation coefficient ρ
2) Method: Test for ρ
3) Assumptions: Data comes from normal distribution. Checked.
4) α = 0.05
5) H0: ρ=0 (no relationship between Day of Year and Draft Number)
6) Ha: ρ≠0 (some relationship between Day of Year and Draft Number)
7) p<1/1000 (from simulation)
8) p<α = 0.05, so we reject the null hypothesis
9) There is a statistically significant relationship between Day of Year and Draft Number.
Example Say a pharmaceutical company has developed a new drug, and they want to show that it is better than the currently available ones. They carry out a clinical trial with a treatment and a control group. For each patient they record the days until the disease is cured. Let μT be the mean number of days for the treatment group, and μC be the mean number of days for the control group. Eventually they will carry out a hypothesis test to see whether the new drug is better. Here they would use the hypotheses
H0: μT=μC (the new drug does not work better)
Ha: μT<μC (the new drug does work better)
At first it seems a little strange to students that we would choose "new drug is not better than the old one" as H0, but there are good reasons for this approach as we will see later. In practice it is very easy for us: H0 always has the = sign!
Warning: Of the 9 parts of a hypothesis test, the first 6 (at least in theory) should be done before looking at the data. The following is not allowed: say we did a study of students at the Colegio. We asked them many questions. Afterwards we computed correlation coefficients for all the pairs of variables and found a high correlation between "Income" and "GPA". Then we carried out a hypothesis test H0: ρ = 0 vs. Ha: ρ ≠ 0.
The problem here is that this hypothesis test was suggested to us by the data, but hypothesis tests only work as advertised if the hypotheses are formulated without consideration of the data.
Related to this problem is the following issue: as we said we will always use H0 with = (for example μ = 0). On the other hand there are three commonly used alternative hypotheses:
Ha: μ > 0
Ha: μ < 0
Ha: μ ≠ 0
Go back to our example of the new textbook. Here we have the following
Correct: we pick Ha: μ > 73 because we want to prove that the new textbook works better than the old one.
Wrong: we pick Ha: μ > 73 because the sample mean score was 78.1, so if anything the new scores are higher than the old ones.
Warning:
A null hypothesis looks something like this
H0: μ=14.5
not
H0=14.5 (What is the parameter?)
or
μ=14.5 (Is this H0 or Ha?)
Warning: getting the correct null hypothesis is very important - if you do everything else right but picked the wrong statement as your null hypothesis you will always get the wrong answer!
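When we carry out a hypothesis test in the end we always face one of the following situations:
| | H0 is true | H0 is false |
| fail to reject H0 | correct decision | type II error (probability β) |
| reject H0 | type I error (probability α) | correct decision |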
So there are two possible mistakes (type I and type II), and the probabilities for making them (α and β). These two mistakes, though, are treated completely differently in statistics: when we do a hypothesis test we decide ahead of time what we are willing to accept as a type I error probability α, and then accept whatever the type II error probability β is. Well, not quite, but wait and see.
We have already talked about confidence intervals. At first a confidence interval and a hypothesis test seem to be very different but they are actually closely related. So finding a 90% confidence interval is related to carrying out a hypothesis test with α=0.1 because 100·(1-α)%=90% leads to α=0.1.
As we saw before if you find a 95% CI instead of a 90% CI you make the interval wider. Similarly if you make α smaller, thereby reducing the probability of falsely rejecting the null hypothesis you (almost always) make β larger, that is you increase the probability of falsely accepting a wrong null hypothesis. The only way to make both α and β smaller is by increasing the sample size n.
How do you choose α? This in practice is a very difficult question. What you need to consider is the consequences of the type I and the type II errors.
Example In our example above with the new textbook, what does it mean to "commit the type I error"? If we do, what are the consequences? What does it mean to "commit the type II error" and what are its consequences?
Type I error: Reject H0 although H0 is true
H0: μ=73
Scores are the same with new textbook as they were with old textbook
new textbook is not better
This is the truth, but we don't know this, based on our experiment and the hypothesis test we reject H0, that is now we think the textbook is better
Consequences?
• We will change textbooks for everybody
• New students will not be able to buy used books, previous students will not be able to sell their books
• Professors have to rewrite their material, prontuarios etc.
• Professors will not consider other new textbooks that might really be better
• but scores will not go up, all of this is for nothing
Type II error: Fail to reject H0 although H0 is false
H0: μ=73 (new textbook is not better)
This is false, but we don't know this, based on our experiment and the hypothesis test we fail to reject H0, that is now we think the textbook is not better
Consequences?
• We will not change to the new textbook
• scores will not go up, but they would have if we had changed
• more students would have passed the course, got an A etc., but now they won't
• Professors will consider other new textbooks, but those might really be worse than the one we just rejected.
Note that in this example we will probably find out that we committed the type I error because we will observe that over the next few years the scores are not going up. On the other hand if we commit the type II error we are not likely to ever find out!
Example Let's have another look at the Salk Polio Vaccine trials. So, how did they "prove" that the vaccine worked? First off, this is a problem of comparing two proportions, the proportion of children who get the vaccine but get polio anyway (let's denote this proportion by πV) and the proportion of unvaccinated children who get polio (πU). At the end of the vaccine trials they carried out a hypothesis test with
H0 : πV=πU (polio rates are the same)
Ha : πV<πU (polio rate with vaccine is lower than the polio rate without the vaccine)
So, what would it mean here to "commit the type I error" or "commit the type II error"? What would be their consequences? Which of them do you think is the most serious? What would that say about how you should choose α?
Many fields such as psychology, biology etc. have developed standards over the years. The most common one is α = 0.05. It has mainly historical reasons:


Here is an excerpt from this book:
..., therefore, we know the standard deviation of a population, we can calculate the standard deviation of [p. 102]
the mean of a random sample of any size, and so test whether or not it differs significantly from any fixed value.
If the difference is many times greater than the standard error, it is certainly significant, and it is a convenient
convention to take twice the standard error as the limit of significance ; this is roughly equivalent to the
corresponding limit α=.05 or 1 in 20, ...
Recent research in psychology has shown that α=0.05 is a fairly good standard, and we will use this if nothing else is said. It is not the only one, though. For example in Physics we often use α=2.9·10^-7=0.00000029!
In hypothesis testing we choose α but we don't have any influence on β. One thing we can do is study its behaviour:
Example Let's illustrate the issue with a little simulation. For this we will generate some data and carry out a hypothesis test as follows:
Calc > Random Data > Normal, Generate 20 rows of data , Store in c1, Mean: 10.0, Standard Deviation: 3.0
Now we carry out the following hypothesis test:
1) Parameter: mean μ
2) Method: 1-sample t test
3) Assumptions: data comes from normal distribution (true because this is how data is generated)
4) α = 0.05
5) H0: μ=10.0
6) Ha: μ≠10.0
But we generated the data, so we know that μ=10.0. Therefore we know that the null hypothesis is true, and so if we commit an error it will be the type I error.
Now if we keep doing the above many times, what will happen? According to the theory we should fail to reject H0 (which is the correct conclusion) 95% of the time, and we should reject H0 (commit the type I error) 5% of the time.
This is done in the MINITAB macro test. Run it with this command:
CTRL-l, %k:\3101\test 20 10.0 3 0.05
Now let's change things a bit. Instead of generating data from a normal with mean 10.0, generate it from a normal with mean 12.0, but still carry out the test above for μ=10.0. So now we know that H0 is false, and we should reject it. If we don't we will commit the type II error. How often do we do this? Again run the macro:
CTRL-l, %k:\3101\test 20 12.0 3 0.05
As we see the method correctly rejects H0 about 81% of the time and so commits the type II error 19% of the time. So we find for this exact setup β=0.19
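For readers without access to the macro, here is a rough Python stand-in for this experiment. It is not the MINITAB macro test; the argument order (n, true mean, standard deviation, α) is only our reading of the commands above, and the p-value is obtained by numerically integrating the t density, so treat it as a sketch. The simulated percentages will wobble a little around the figures quoted in the text.

```python
import math
import random
import statistics

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=200):
    """P(|T| > |t|), using Simpson's rule on [0, |t|] (steps must be even)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3)

def rejection_rate(n, mu, sigma, alpha, mu0=10.0, reps=2000, seed=2):
    """Share of simulated samples in which H0: mu = mu0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x = [rng.gauss(mu, sigma) for _ in range(n)]
        t = (statistics.fmean(x) - mu0) / (statistics.stdev(x) / math.sqrt(n))
        if two_sided_p(t, n - 1) < alpha:
            rejections += 1
    return rejections / reps

# H0 is true (mu = 10.0): every rejection is a type I error.
type_1_rate = rejection_rate(20, 10.0, 3.0, 0.05)
# H0 is false (mu = 12.0): every failure to reject is a type II error.
type_2_rate = 1 - rejection_rate(20, 12.0, 3.0, 0.05)

print(f"type I error rate  (mu = 10.0): {type_1_rate:.3f}")
print(f"type II error rate (mu = 12.0): {type_2_rate:.3f}")
```

Changing the four arguments of rejection_rate lets you explore how the true mean, the standard deviation, α and the sample size affect β.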
Here are some interesting cases:
Effect of the true mean
| Macro | Percentage of type II error |
| CTRL-l, %k:\3101\test 20 10.5 3 0.05 | 89% |
| CTRL-l, %k:\3101\test 20 11.0 3 0.05 | 71% |
| CTRL-l, %k:\3101\test 20 11.5 3 0.05 | 43% |
| CTRL-l, %k:\3101\test 20 12.0 3 0.05 | 19% |
| CTRL-l, %k:\3101\test 20 12.5 3 0.05 | 6% |
so the further the true mean is from the one specified in H0, the less likely we are to commit the type II error, or
The more wrong the null hypothesis is, the more likely we are to make the right decision
Effect of the standard deviation
| Macro | Percentage of type II error |
| CTRL-l, %k:\3101\test 20 10.5 3 0.05 | 89% |
| CTRL-l, %k:\3101\test 20 10.5 2.5 0.05 | 87% |
| CTRL-l, %k:\3101\test 20 10.5 2.0 0.05 | 82% |
| CTRL-l, %k:\3101\test 20 10.5 1.5 0.05 | 70% |
| CTRL-l, %k:\3101\test 20 10.5 1.0 0.05 | 44% |
| CTRL-l, %k:\3101\test 20 10.5 0.5 0.05 | 1% |
so the smaller the standard deviation, the less likely we are to commit the type II error, or
the closer together the data is, the easier it is to find a small difference between the true and the hypothesized mean
Effect of α
| Macro | Percentage of type II error |
| CTRL-l, %k:\3101\test 20 11.5 3 0.1 | 31% |
| CTRL-l, %k:\3101\test 20 11.5 3 0.05 | 43% |
| CTRL-l, %k:\3101\test 20 11.5 3 0.01 | 70% |
| CTRL-l, %k:\3101\test 20 11.5 3 0.005 | 79% |
| CTRL-l, %k:\3101\test 20 11.5 3 0.001 | 91% |
so the smaller the α, the larger the β, or
the less likely it is that we commit one error, the more likely it is that we commit the other
Effect of Sample Size n
| Macro | Percentage of type II error |
| CTRL-l, %k:\3101\test 20 10.5 3 0.05 | 89% |
| CTRL-l, %k:\3101\test 50 10.5 3 0.05 | 80% |
| CTRL-l, %k:\3101\test 100 10.5 3 0.05 | 63% |
| CTRL-l, %k:\3101\test 200 10.5 3 0.05 | 35% |
| CTRL-l, %k:\3101\test 300 10.5 3 0.05 | 18% |
| CTRL-l, %k:\3101\test 400 10.5 3 0.05 | 9% |
| CTRL-l, %k:\3101\test 500 10.5 3 0.05 | 4% |
so the larger the sample size, the smaller the β, or
the more information (data) we have the better a job we can do
Above we found the probability of the type II error β. In real life one usually calculates the power of a test, which is simply 1-β = P(correctly reject H0|H0 is wrong). MINITAB can calculate these numbers exactly using the Stat > Power and Sample Size command.
Example If we run CTRL-l, %k:\3101\test 400 10.5 3 0.05 we find the type II error probability is 0.09. Now Stat > Power and Sample Size > 1-sample t, Sample size: 400, Difference: 0.5 (10.5-10), Standard deviation: 3, yields 0.913926, and 1-0.913926 = 0.09.
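The power can also be approximated without MINITAB. The sketch below is not what MINITAB does internally: it replaces the t distribution by the standard normal, which is our simplifying assumption and is reasonable only when n is not too small. The function name and its parameters are ours. For n = 400 it should land close to the 0.913926 reported by MINITAB, and for the smaller sample sizes close to the simulated percentages in the sample size table.

```python
from statistics import NormalDist

def approx_power(n, diff, sigma, alpha=0.05):
    """Normal approximation to the power of a two-sided one-sample test.

    n     -- sample size
    diff  -- true mean minus the mean stated in H0
    sigma -- population standard deviation
    alpha -- type I error probability
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)     # two-sided critical value
    shift = diff * n ** 0.5 / sigma       # true difference in standard errors
    # Probability that the test statistic falls in either rejection tail.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

for n in (20, 50, 100, 200, 300, 400, 500):
    power = approx_power(n, diff=0.5, sigma=3.0)
    print(f"n = {n:3d}: power = {power:.3f}, beta = {1 - power:.3f}")
```

Because the power depends on n only through the product diff·√n/σ, halving the difference we want to detect requires four times the sample size.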
Importance of Sample Size
Example Using the best currently available treatment the mean survival time of patients with a certain type of terminal cancer is 122 days. A pharmaceutical company has just developed a new drug for these patients which they believe will lead to longer survival times. To test this they randomly select 13 patients and give them this treatment. The mean survival time of these patients turns out to be 127 days with a standard deviation of 45 days. So they carry out the following hypothesis test.
1) Parameter: mean
2) Method: 1-sample t test
3) Assumptions: data come from a normal distribution (Checked boxplot)
4) α = 0.05
5) H0: μ=122 (same survival times as with old treatment, new treatment is not better)
6) Ha: μ>122 (longer survival times than with old treatment, new treatment is better)
7) p=0.348 (Stat>Basic Statistics>1-sample t, Summarized data, Sample Size 13, Mean 127, Stand Dev 45, Test mean 122, Options: Alternative: greater than)
8) p=0.348 > α, so we fail to reject the null hypothesis
9) There is not enough evidence to conclude that this new treatment is better than the old one.
So far, so good. Now let's say that instead of 13 patients the company did the study with 1300 patients. They find:
1) Parameter: mean
2) Method: 1-sample t test
3) Assumptions: data come from a normal distribution (Checked boxplot)
4) α = 0.05
5) H0: μ=122 (same survival times as with old treatment, new treatment is not better)
6) Ha: μ>122 (longer survival times than with old treatment, new treatment is better)
7) p=0.000 (Stat>Basic Statistics>1-sample t, Summarized data, Sample Size 1300, Mean 127, Stand Dev 45, Test mean 122, Options: Alternative: greater than)
8) p=0.000 < α, so we reject the null hypothesis
9) The new treatment is statistically significantly better than the old one.
As you see, whether a difference of 5 days is statistically significant depends on the sample size of the study! This is true no matter what the difference is. Let's do this example again, but now say that the mean survival time in the study was 122.12 days, just under 3 hours more. Even this difference is statistically significant, although we would need a sample size of about 4 million.
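The two p-values can be checked from the summary figures alone. The following Python sketch is our own stand-in for the MINITAB command: it computes the 1-sample t statistic and gets the one-sided p-value by numerically integrating the t density. The function names and the number of integration steps are our choices.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail_p(t, df, steps=2000):
    """P(T > t), using Simpson's rule on [0, |t|] (steps must be even)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    tail = max(0.0, 0.5 - total * h / 3)
    return tail if t >= 0 else 1.0 - tail

def one_sample_t_greater(n, xbar, s, mu0):
    """t statistic and p-value for H0: mu = mu0 vs Ha: mu > mu0."""
    t = (xbar - mu0) / (s / math.sqrt(n))
    return t, upper_tail_p(t, n - 1)

# Same mean, standard deviation and test mean; only the sample size changes.
for n in (13, 1300):
    t, p = one_sample_t_greater(n, xbar=127.0, s=45.0, mu0=122.0)
    print(f"n = {n:4d}: T = {t:.3f}, p = {p:.6f}")
```

The difference of 5 days and the standard deviation of 45 days are identical in both runs; only the standard error 45/√n shrinks, and with it the p-value.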
So, after we carried out a hypothesis test, what can we conclude? There are always the following possibilities:
• If we rejected the null hypothesis:
--- we reject H0 because H0 is false
--- we committed the type I error (but we know the probability of doing so - α)
• If we failed to reject the null hypothesis:
--- we failed to reject H0 because H0 is true
--- we committed the type II error
--- we failed to reject H0 because our sample size was too small!
In real life we never know what the correct reason is!
Example So in the case of the company in real life they would not (yet) give up on the new drug, but understanding that n=13 is very small they would repeat the clinical trial (if possible) with a larger sample size.
Statistical vs. Practical Significance
Often you read something like: the new drug was shown to be statistically significantly better than previous drugs. What does that mean? First of all it (usually) means that somebody carried out a hypothesis test and rejected the null hypothesis of no difference between the drugs. But should you care?
Example Say you have to go to a hospital for some checkups. Nothing complicated or dangerous, but you will need to be in the hospital for a few days. You have a choice of hospital A here in Mayaguez, or hospital B in San Juan (assume for a moment you are from Mayaguez). You recently read in the newspaper about a survey done in both hospitals where patients were asked to rate the hospital on things such as: Were the doctors nice? Was the food ok? Did they let you watch TV? In this survey hospital A got a score of 57% and hospital B got 61%. This difference turned out to be statistically significant. Where will you go?
Example Say you have to go to a hospital for some dangerous surgery. You have a choice of hospital A here in Mayaguez, or hospital B in Miami (again assume you are from Mayaguez). You recently read in the newspaper about a study done in both hospitals on how patients who had that surgery did. In this study hospital A had a survival rate of 57% and hospital B had 61%, but this difference turned out not to be statistically significant (?). Where will you go?
Just because something is statistically significant does not automatically mean it is important, and just because something is not statistically significant does not mean you should not care.
There is one common misuse of hypothesis testing you should be aware of. It concerns searching for something, anything significant:
Example: There is a famous (infamous?) case of three psychiatrists who studied a sample of schizophrenic persons and a sample of nonschizophrenic persons. They measured 77 variables for each subject - religion, family background, childhood experiences etc. Their goal was to discover what distinguishes persons who later become schizophrenic. Using their data they ran 77 hypothesis tests of the significance of the differences between the two groups of subjects, and found 2 significant at the 2% level. They immediately published their findings.
What's wrong here? Remember, if you run a hypothesis test at the 2% level and the null hypothesis of no relationship is true, you expect to reject it 2% of the time, but 2% of 77 is about 1 or 2, so just by random fluctuations they could (should?) have rejected that many null hypotheses! This is not to say that the variables they found to be different between the two groups were not really different, only that their method did not prove that.
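The arithmetic behind this argument is short enough to write down. The sketch below assumes, purely for illustration, that all 77 null hypotheses are true and that the 77 tests are independent of each other; neither is stated in the story, and with correlated variables the exact probabilities would differ.

```python
tests = 77      # number of hypothesis tests the psychiatrists ran
alpha = 0.02    # significance level of each test

# Expected number of false rejections if every null hypothesis is true.
expected_false_rejections = tests * alpha

# Binomial probabilities under the independence assumption.
p_none = (1 - alpha) ** tests
p_exactly_one = tests * alpha * (1 - alpha) ** (tests - 1)
p_at_least_one = 1 - p_none
p_at_least_two = 1 - p_none - p_exactly_one

print(f"expected false rejections: {expected_false_rejections:.2f}")
print(f"P(at least one false rejection):  {p_at_least_one:.3f}")
print(f"P(at least two false rejections): {p_at_least_two:.3f}")
```

So under these assumptions finding two "significant" variables out of 77 is not surprising at all.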
Example Let's do a little simulation to illustrate the problem: First we generate 50 observations from a normal distribution with mean 10.0 and standard deviation 1.5:
Calc > Random Data > Normal, generate 50 observations, store in x, mean 10.0, standard deviation 1.5
Now we test H0:μ=10.0 vs Ha:μ≠10.0:
Stat > Basic Statistics > 1-sample t, samples in column: x, Test mean: 10.0
Now we generated the data, so we know for sure that H0 is true, and we should fail to reject it. But if we keep doing this sooner or later we will in fact reject H0, that is we will commit the type I error.
I wrote a little MINITAB Macro to repeat the above simulation 10 times. Run it with
CTRL-L, %K\3101\multinf
For more on hypothesis testing see chapter 8 of the textbook.