Hypothesis Testing
Contents of this page:
Introduction
H0 and Ha
Type I and Type II errors
Statistical vs. Practical Significance
Importance of Sample Size
p-value Approach - Level of Significance
Warning

Introduction

A hypothesis test is a statistical method that answers a yes-no question, phrased in the form of a statement:
Example 1: Men in Puerto Rico are on average 5' 9'' tall
Example 2: The average income of men in Puerto Rico is higher than the average income of women.
Example 3: Smoking causes lung cancer.

A complete hypothesis test should have all of the following parts:
1) Type I error probability α
2) Null hypothesis H0
3) Alternative hypothesis Ha
4) Test statistic
5) Rejection region
6) Conclusion

Example: Over the last five years the average score in the final exam of a course was 73 points. This semester a class with 27 students used a new textbook, and the mean score in the final was 78.1 points with a standard deviation of 7.1.
Question: did the class using the new textbook do (statistically significantly) better?

For this specific example the complete hypothesis test looks as follows:
1) α = 0.05
2) H0: μ = 73
3) Ha: μ > 73
4) T = √n (x̄ - 73) / s, where x̄ is the sample mean, s the sample standard deviation and n the sample size (Look a little bit familiar? Think of the central limit theorem.)
5) reject H0 if T > t(26, 0.05) = 1.706, the critical value of the t distribution with 26 degrees of freedom
6) T = 3.81 > 1.706, so we reject the null hypothesis. It appears that the mean score in the final is really higher.
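The test above can be checked by hand or with a few lines of code. The following is a minimal sketch, assuming the test statistic is the usual one-sample t statistic √n (sample mean - 73) / s; the function and variable names are illustrative only, and the critical value 1.706 is copied from step 5 rather than computed. With the summary numbers exactly as stated in the example (mean 78.1), the statistic evaluates to about 3.73; a sample mean of 78.2 would give about 3.81. Either value is far above 1.706, so the decision is the same.

```python
import math

def t_statistic(sample_mean, mu_0, sample_sd, n):
    """One-sample t statistic: sqrt(n) * (sample mean - mu_0) / s."""
    return math.sqrt(n) * (sample_mean - mu_0) / sample_sd

# Summary numbers from the textbook example
n = 27              # students in the class
sample_mean = 78.1  # mean score this semester
sample_sd = 7.1     # standard deviation of the scores
mu_0 = 73.0         # historical mean, the value under H0

# Critical value for 26 degrees of freedom at level 0.05, taken from the text
t_critical = 1.706

T = t_statistic(sample_mean, mu_0, sample_sd, n)
reject = T > t_critical
print(f"T = {T:.2f}, critical value = {t_critical}, reject H0: {reject}")
```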

H0 and Ha

A hypothesis test includes a null hypothesis H0 and an alternative hypothesis Ha. Either one or the other has to be true.
What to pick as H0? The null hypothesis has to be chosen in such a way that it completely describes the situation.
Example 1: H0: Men in Puerto Rico are on average 5' 9'' tall
Why? There is one and only one way in which the men are (on average) 5' 9'' tall, but there are (infinitely) many ways in which the average height is not 5' 9'' (it might be a tiny bit higher, somewhat higher, a lot higher, a lot lower, ...)
Example 2: H0: The average incomes are the same
Example 3: H0: Smoking does not cause lung cancer

Often a hypothesis can be expressed in terms of population parameters. If the hypothesis is written in terms of parameters this (almost always) means that the null hypothesis has an = sign.
Example 1: H0: μ = 5' 9'' vs. Ha: μ > 5' 9''
Example 2: H0: μ(Men) = μ(Women) vs. Ha: μ(Men) > μ(Women)
Example 3: H0: ρ = 0 vs. Ha: ρ ≠ 0

Often the alternative hypothesis is called the research hypothesis. This reflects the following logic of scientific research: Say we are trying to discover a new particle in physics. We build a big particle accelerator and detector and run our experiment. Here we would use the hypotheses H0: We have not found a new particle vs. Ha: We have found a new particle. This corresponds to a prudent and careful approach to scientific research: we will believe in a new discovery only if the data tell us that we have made such a discovery. Let the data prove it!

Warning: In the 6 parts of a hypothesis test, the first 3 (at least in theory) should be done before looking at the data. The following is not allowed: say we did a study of students at the Colegio. We asked them many questions. Afterwards we computed correlation coefficients for all the pairs of variables and found a high correlation between "Income" and "GPA". Then we carried out a hypothesis test of H0: ρ = 0 vs. Ha: ρ ≠ 0.
The problem here is that this hypothesis test was suggested to us by the data, but hypothesis tests only work as advertised if the hypotheses are formulated without consideration of the data.

Related to this problem is the following issue: as you will see, we will always use an H0 with = (for example μ = 0). On the other hand, there are three commonly used alternative hypotheses:
Ha: μ > 0
Ha: μ < 0
Ha: μ ≠ 0

Go back to our example of the new textbook. Here we have the following:
Correct: we pick Ha: μ > 73 because we want to prove that the new textbook works better than the old one.
Wrong: we pick Ha: μ > 73 because the sample mean score was 78.1, so if anything the new scores are higher than the old ones.

Type I and Type II errors

When we carry out a hypothesis test, in the end we always face one of the following situations:
[Figure hypfig2.png: table of the possible outcomes of a hypothesis test]

There is a useful analogy in the legal system: Say a person is accused of a crime. They go to trial and in the end
[Figure hypfig3.png: table of the possible outcomes of the trial]
Which error is worse, convicting an innocent person or acquitting a guilty one? At least in our western legal system we do what we can to avoid the first mistake, convicting an innocent person, and live with the consequence that on occasion a guilty person goes free. Similarly, in statistics when we do a hypothesis test we decide ahead of time what we are willing to accept as the type I error probability α, and then accept whatever the type II error probability β is. Well, not quite, but wait and see.

There is a close analogy between the α and β in a hypothesis test and the confidence level and the width of the interval (error) in a confidence interval. If you make α smaller, thereby reducing the probability of falsely rejecting the null hypothesis, you (almost always) make β larger, that is, you increase the probability of falsely accepting a wrong null hypothesis. The only way to make both α and β smaller is by increasing the sample size n.
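The trade-off between the two error probabilities can be made concrete with a small calculation. The sketch below is only an illustration under assumptions that are not part of the text: a one-sided z-test of H0: mean = 0 with known standard deviation 1, a true mean of 0.2, and a sample size of 30. The function name type_ii_error is hypothetical. It prints the type II error probability for several choices of the type I error probability.

```python
from statistics import NormalDist

def type_ii_error(alpha, n, mu_true=0.2, sigma=1.0):
    """Type II error probability of a one-sided z-test of H0: mu = 0.

    Assumes known sigma and a true mean mu_true > 0 (illustrative values).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha)            # reject H0 if Z > z_crit
    shift = (n ** 0.5) * mu_true / sigma     # mean of Z when mu = mu_true
    return z.cdf(z_crit - shift)             # P(fail to reject | mu = mu_true)

# Smaller type I error probability -> larger type II error probability
for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f}: beta = {type_ii_error(alpha, n=30):.3f}")

# A larger sample reduces the type II error at the same alpha
print(f"alpha = 0.05, n = 120: beta = {type_ii_error(0.05, n=120):.3f}")
```

With the sample size held fixed, each reduction in the type I error probability raises the type II error probability; only the larger sample lowers the latter without touching the former.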

How do you choose α? In practice this is a very difficult question. What you need to consider are the consequences of the type I and the type II errors.

Example: In our example above with the new textbook, what does it mean to "commit the type I error"? If we do, what are the consequences? What does it mean to "commit the type II error" and what are its consequences?

Say a pharmaceutical company has developed a new treatment for stomach cancer. They carry out a clinical trial. The outcome of the trial is used in a hypothesis test with H0: New drug is not better than old drugs vs. Ha: New drug is better than old drugs. What are the type I and II errors here, and what are their consequences?

Many fields such as psychology, biology etc. have developed standards over the years. The most common one is α = 0.05, and we will use this if nothing else is said.

Statistical vs. Practical Significance

Often you read something like: the new drug was shown to be statistically significantly better than previous drugs. What does that mean? First of all it (usually) means that somebody carried out a hypothesis test and rejected the null hypothesis of no difference between the drugs. But should you care?

Example 1: Say you have to go to a hospital for some checkups. Nothing complicated or dangerous, but you will need to be in the hospital for a few days. You have a choice of hospital A here in Mayaguez, or hospital B in San Juan (assume for a moment you are from Mayaguez). You recently read in the newspaper about a survey done in both hospitals where patients were asked to rate the hospital on things such as: Were the doctors nice? Was the food ok? Did they let you watch TV? In this survey hospital A got a score of 57% and hospital B got 61%. This difference turned out to be statistically significant. Where will you go?

Example 2: Say you have to go to a hospital for some dangerous surgery. You have a choice of hospital A here in Mayaguez, or hospital B in Miami (again assume you are from Mayaguez). You recently read in the newspaper about a study done in both hospitals on how patients who had that surgery did. In this study hospital A had a survival rate of 57% and hospital B had 61%, but this difference turned out not to be statistically significant (?). Where will you go?

Just because something is statistically significant does not automatically mean it is important, and just because something is not statistically significant does not mean you should not care.

Importance of Sample Size

The outcome of a hypothesis test depends not only on whether the null hypothesis is true or false but also on the sample size. If the sample size is small, even a large difference might not be statistically significant even if it is of practical importance. If the sample size is large, even a small difference might be statistically significant even if it does not make any practical difference.

Illustration: Say we are testing H0: μ = 0 vs. Ha: μ > 0.
Although we don't actually know this, it turns out that in reality μ = 0.2, so the null hypothesis is wrong and we should reject it. The next graph shows the probability of making the right decision (rejecting the null hypothesis) depending on the sample size. Clearly, for a small sample size we are very likely to make the wrong decision. In fact, the sample size has to be 68 in order to correctly reject the null hypothesis at least 50% of the time!

[Figure hyptestfig2.png: probability of rejecting the null hypothesis as a function of the sample size]
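The curve in the figure can be approximated numerically. The sketch below rests on assumptions the text does not state: a one-sided z-test at level 0.05 with known standard deviation 1. The function names power and smallest_n are hypothetical. Under these assumptions the smallest sample size that gives at least a 50% chance of rejecting the null hypothesis comes out as 68, which agrees with the number quoted above.

```python
from statistics import NormalDist

def power(n, mu_true=0.2, sigma=1.0, alpha=0.05):
    """P(reject H0: mu = 0) for a one-sided z-test when the true mean is mu_true.

    Assumes a known standard deviation sigma (an illustrative choice).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha)                       # reject if Z > z_crit
    return 1 - z.cdf(z_crit - (n ** 0.5) * mu_true / sigma)

def smallest_n(target, **kwargs):
    """Smallest sample size whose power reaches the target probability."""
    n = 1
    while power(n, **kwargs) < target:
        n += 1
    return n

# Probability of correctly rejecting H0 for a few sample sizes
for n in (10, 30, 68, 100, 200):
    print(f"n = {n:3d}: power = {power(n):.2f}")

print("smallest n with power >= 0.5:", smallest_n(0.5))
```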

So, if we fail to reject the null hypothesis, what can we conclude? There are always two possibilities:
--- we failed to reject H0 because H0 is true
--- we failed to reject H0 because our sample size was too small
In real life we never know what the correct reason is!

p-value Approach - Level of Significance

There is a second method for carrying out a hypothesis test called the p-value approach. It has the following parts:
1) Type I error probability α
2) Null hypothesis H0
3) Alternative hypothesis Ha
4) Find p value
5) Decision and Conclusion

We will decide whether or not to reject the null hypothesis based on the p value. The p value p is computed by the computer. This number is then compared to α as follows:
p < α: reject H0
p > α: fail to reject H0

Example of a hypothesis test using the p-value approach. We will redo the example of the new textbook from above:
1) α = 0.05
2) H0: μ = 73
3) Ha: μ > 73
4) p = 0.000101 (from computer)
5) p = 0.000101 < α = 0.05, so we reject the null hypothesis. It appears that the mean score in the final is really higher.

What is the p value? It is the probability of observing our data or something even more extreme if the null hypothesis is true. So in the example above it is P(mean score on final exam ≥ 78.2 | μ = 73)

The advantage of the p value approach is that in addition to the decision on whether or not to reject the null hypothesis, it also gives us some idea of how close a decision it was. Here with p = 0.0001 we would have rejected the null hypothesis even if we had chosen α = 0.01, so it was not a close thing at all.
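To see where such a p value comes from, here is a minimal sketch that computes an upper-tail probability of the t distribution with 26 degrees of freedom for the textbook example. The Python standard library has no t distribution, so the sketch integrates the t density numerically with Simpson's rule; the helper names t_density and t_upper_tail are hypothetical. The value it prints need not match the figure quoted above, which came from unspecified software, but it is likewise far below both 0.05 and 0.01.

```python
import math

def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """P(T >= t) for t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_density(0.0, df) + t_density(t, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_density(i * h, df)
    # By symmetry P(T >= 0) = 0.5, so subtract the area between 0 and t
    return 0.5 - total * h / 3

# Summary numbers from the textbook example
n, sample_mean, sample_sd, mu_0, alpha = 27, 78.1, 7.1, 73.0, 0.05

T = math.sqrt(n) * (sample_mean - mu_0) / sample_sd
p = t_upper_tail(T, df=n - 1)
print(f"T = {T:.2f}, p = {p:.5f}, reject H0: {p < alpha}")
```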

Warning

There is one common misuse of hypothesis testing you should be aware of. It concerns searching for something, anything significant:
Example: There is a famous (infamous?) case of three psychiatrists who studied a sample of schizophrenic persons and a sample of nonschizophrenic persons. They measured 77 variables for each subject - religion, family background, childhood experiences etc. Their goal was to discover what distinguishes persons who later become schizophrenic. Using their data they ran 77 hypothesis tests of the significance of the differences between the two groups of subjects, and found 2 significant at the 2% level. They immediately published their findings.

What's wrong here? Remember, if you run a hypothesis test at the 5% level you expect to reject a true null hypothesis of no relationship 5% of the time, and 5% of 77 is about 3 or 4, so just by random fluctuations they could (should?) have rejected that many null hypotheses! This is not to say that the variables they found to be different between the two groups were not really different, only that their method did not prove that.
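The arithmetic behind this warning can be sketched as follows. The calculation assumes that all 77 null hypotheses are true and that the tests are independent of each other; the second assumption is a simplification that is unlikely to hold exactly for variables measured on the same subjects. The function name prob_at_least is hypothetical.

```python
from math import comb

def prob_at_least(k, n_tests, alpha):
    """P(at least k rejections) among n_tests independent tests of true nulls."""
    below = sum(
        comb(n_tests, j) * alpha ** j * (1 - alpha) ** (n_tests - j)
        for j in range(k)
    )
    return 1 - below

n_tests, alpha = 77, 0.05

# Expected number of rejections when every null hypothesis is true
expected = n_tests * alpha
print(f"expected false rejections: {expected:.2f}")

# Chance of seeing at least one, and at least two, "significant" results
print(f"P(at least 1): {prob_at_least(1, n_tests, alpha):.3f}")
print(f"P(at least 2): {prob_at_least(2, n_tests, alpha):.3f}")
```

Under these assumptions, finding a couple of significant results among 77 tests is what pure chance would produce most of the time.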