Hypothesis Testing

A hypothesis is a statement about a population parameter. In its most general form it can be written as

H0: θ ∈ Θ0

where θ is a parameter (possibly a vector) and Θ0 is a region in the parameter space Θ.

Example: X~Ber(p), Θ=[0,1], Θ0={0.5}, so we are testing whether p=0.5.
Example: X~N(μ,σ), Θ=ℝ×(0,∞), Θ0=(100,∞)×(0,∞), so we are testing whether μ>100.

In addition to the null hypothesis we also write down the alternative hypothesis Ha, usually (but not always) the complement of Θ0. So a hypothesis test makes a choice between H0 and Ha.

A hypothesis that "fixes" the parameter (θ=θ0) is called simple; otherwise it is called composite (for example θ>θ0).

A complete hypothesis test should have all of the following parts:

1) Type I error probability α
2) Null hypothesis H0
3) Alternative hypothesis Ha
4) Test statistic
5) Rejection region
6) Conclusion

Example: Over the last five years the average score in the final exam of a course was 73 points. This semester a class with 27 students used a new textbook, and the mean score in the final was 78.1 points with a standard deviation of 7.1.
Question: did the class using the new textbook do (statistically significantly) better?

For this specific example the complete hypothesis test might look as follows:

1) α = 0.05
2) H0: μ = 73
3) Ha: μ > 73
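4) T = √n (x̄ − 73)/s, where x̄ is the sample mean, s is the sample standard deviation and n = 27 is the sample size. Under H0, T has a t distribution with n−1 = 26 degrees of freedom.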
5) reject H0 if T>qt(1-0.05,26) = 1.706
6) T = √27 (78.1 − 73)/7.1 = 3.73 > 1.706, so we reject the null hypothesis. It appears that the mean score in the final really is higher.
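As a rough check, steps 4 to 6 can be sketched in a few lines of Python. This is only an illustration that recomputes the statistic from the summary numbers quoted in the example. The function name one_sample_t is ours, and the critical value is copied from step 5 rather than computed, because the Python standard library has no t quantile function.

```python
import math

def one_sample_t(xbar, mu0, s, n):
    """One-sample t statistic from summary data: sqrt(n) * (xbar - mu0) / s."""
    return math.sqrt(n) * (xbar - mu0) / s

# Summary numbers quoted in the textbook example
n, xbar, s, mu0 = 27, 78.1, 7.1, 73.0

# Critical value qt(1-0.05, 26) = 1.706, copied from step 5 (not computed here)
critical_value = 1.706

T = one_sample_t(xbar, mu0, s, n)
reject = T > critical_value
print(f"T = {T:.2f}, reject H0: {reject}")
```

Only summary statistics are needed here because the t statistic depends on the data through x̄, s and n alone.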

In the 6 parts of a hypothesis test, the first 3 (at least in theory) should be done before looking at the data. The following is not allowed: say we did a study of students at the Colegio and asked them many questions. Afterwards we computed correlation coefficients for all pairs of variables and found a high correlation between "Income" and "GPA". Then we carried out a hypothesis test of H0: r = 0 vs. Ha: r ≠ 0.

The problem here is that this hypothesis test was suggested to us by the data, but (most standard) hypothesis tests only work as advertised if the hypotheses are formulated without consideration of the data.
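A small simulation can make the problem concrete. The sketch below is a simplified stand-in for the correlation study, not a reproduction of it: we assume m independent standard normal test statistics, all generated with the null hypothesis true, and then "test" only the largest one at the nominal α = 0.05. The function name and the choice of one-sided z statistics are our assumptions.

```python
import random
from statistics import NormalDist

def snooping_rejection_rate(m, alpha=0.05, reps=20000, seed=1):
    """Fraction of runs in which the largest of m null z statistics exceeds
    the nominal one-sided critical value."""
    rng = random.Random(seed)
    cv = NormalDist().inv_cdf(1 - alpha)  # critical value for a single pre-planned test
    hits = 0
    for _ in range(reps):
        # m test statistics, every one of them generated with H0 true
        largest = max(rng.gauss(0.0, 1.0) for _ in range(m))
        if largest > cv:
            hits += 1
    return hits / reps

for m in (1, 5, 20):
    # the last column is the theoretical rate 1 - (1 - alpha)^m for independent tests
    print(m, snooping_rejection_rate(m), round(1 - 0.95 ** m, 3))
```

With m = 1 the test was chosen in advance and the rejection rate stays near α. As m grows, the test picked by the data rejects a true null hypothesis far more often than advertised.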

Go back to our example of the new textbook. Here we have the following:

Correct: we pick Ha: μ > 73 because we want to prove that the new textbook works better than the old one.
Wrong: we pick Ha: μ >73 because the sample mean score was 78.1, so if anything the new scores are higher than the old ones.

Type I and Type II errors

When we carry out a hypothesis test, in the end we always face one of the following situations:

                     H0 is true          H0 is false
fail to reject H0    correct decision    type II error
reject H0            type I error        correct decision

In statistics, when we do a hypothesis test we decide ahead of time what type I error probability α we are willing to accept, and then accept whatever the type II error probability β turns out to be. Generally, if you make α smaller, thereby reducing the probability of falsely rejecting a true null hypothesis, you make β larger, that is, you increase the probability of failing to reject a false null hypothesis. For a given test, the only way to make both α and β smaller is to increase the sample size n.
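The trade-off can be illustrated numerically. The following sketch assumes a setting that is not in the example above: a one-sided z-test of H0: μ = μ0 vs Ha: μ > μ0 with known σ, where the true mean is μ0 + effect. In that case β = Φ(z(1−α) − √n·effect/σ), with Φ the standard normal distribution function. The function name type2_error and the numbers used are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def type2_error(alpha, n, effect, sigma=1.0):
    """beta of the one-sided z-test of H0: mu = mu0 vs Ha: mu > mu0
    when the true mean is mu0 + effect and sigma is known."""
    cv = STD_NORMAL.inv_cdf(1 - alpha)         # reject H0 if the z statistic exceeds cv
    shift = math.sqrt(n) * effect / sigma      # mean of the z statistic under the alternative
    return STD_NORMAL.cdf(cv - shift)          # probability of failing to reject

for n in (10, 40):
    for alpha in (0.10, 0.05, 0.01):
        beta = type2_error(alpha, n, effect=0.5)
        print(f"n={n:3d}  alpha={alpha:.2f}  beta={beta:.3f}")
```

For fixed n, β goes up as α goes down. Moving from n = 10 to n = 40 lowers β at every α.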

How do you choose α? In practice this is a very difficult question. What you need to consider are the consequences of the type I and the type II errors.

Many fields such as psychology, biology etc. have developed standards over the years. The most common one is α = 0.05, and we will use this if nothing else is said.

p-value

In real life a frequentist hypothesis test is usually done by computing the p-value, that is, the probability of observing the data or something even more extreme, given that the null hypothesis is true.

Example (above): p = P(mean score on final exam > 78.1 | μ = 73)

Then the decision is made as follows:

p < α        p > α
reject H0    fail to reject H0

The advantage of the p-value approach is that, in addition to the decision on whether or not to reject the null hypothesis, it also gives us some idea of how close a decision it was. If p=0.035<α=0.05 it was a close decision; if p=0.0001<α=0.05 it was not.
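For the textbook example the p-value is the upper tail probability of a t distribution with 26 degrees of freedom at the observed test statistic. The sketch below computes it from the summary numbers of the example using only the Python standard library. Since that library has no t distribution, we integrate the density numerically with Simpson's rule. The helper names t_pdf and t_upper_tail are ours, and the integration is an implementation choice, not part of the example.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0: one half minus a Simpson's-rule integral of the
    density over [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

# Summary numbers quoted in the textbook example
n, xbar, s, mu0, alpha = 27, 78.1, 7.1, 73.0, 0.05

T = math.sqrt(n) * (xbar - mu0) / s
p_value = t_upper_tail(T, n - 1)
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"T = {T:.2f}, p-value = {p_value:.5f}: {decision}")
```

The p-value comes out far below α = 0.05, so by the criterion above this was not a close decision.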

The p-value depends on the observed sample, which is a random variable, so it in turn is a random variable. What is its distribution?

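Example: say X~N(μ,1) and we want to test H0: μ=0 vs Ha: μ>0. We use the rejection region {X>cv} where cv=qnorm(1-α). Now let Y~N(μ,1) be independent of X, assume we observe X=x and denote the p-value by p(x). Then, writing Φ for the standard normal distribution function,

p(x) = P(Y > x | μ=0) = 1 − Φ(x)

and if the null hypothesis is true we have for 0<t<1

P(p(X) ≤ t) = P(1 − Φ(X) ≤ t) = P(X ≥ Φ⁻¹(1−t)) = 1 − Φ(Φ⁻¹(1−t)) = t

so if the null hypothesis is true the distribution of the p-value is uniform [0,1]. This turns out to be true in general for test statistics with a continuous distribution. In pval(μ) we do a simulation to show what the distribution of the p-value is (in this example) if the null hypothesis is false.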
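A simulation along these lines might look as follows in Python. This is a sketch of the idea only, not the pval routine referred to above. The names simulate_pvalues and fraction_below are hypothetical, and the p-value is taken to be 1 − Φ(x), the upper tail of N(0,1) at the observed x.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def simulate_pvalues(mu, reps=10000, seed=1):
    """Draw X ~ N(mu, 1) reps times and return the p-values 1 - Phi(x)
    of the test H0: mu = 0 vs Ha: mu > 0."""
    rng = random.Random(seed)
    return [1 - STD_NORMAL.cdf(rng.gauss(mu, 1.0)) for _ in range(reps)]

def fraction_below(pvalues, t):
    """Empirical distribution function of the p-values at t."""
    return sum(p <= t for p in pvalues) / len(pvalues)

for mu in (0.0, 1.0, 2.0):
    pv = simulate_pvalues(mu)
    # for a uniform [0,1] distribution these fractions would be close to 0.05, 0.25, 0.5
    print(mu, [round(fraction_below(pv, t), 3) for t in (0.05, 0.25, 0.5)])
```

For μ = 0 the empirical fractions should be close to t itself, as expected for a uniform distribution. For μ > 0 the p-values pile up near 0, which is what gives the test its power.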

Bayesian Hypothesis Testing

Strictly speaking, hypothesis testing is not a Bayesian concept. To begin with, if we wanted to test the hypothesis H0: θ=θ0 we would need to start with a prior that puts some probability on the point {θ0}, otherwise the hypothesis will always be rejected. If we do that we can simply compute P(H0 is true | data), and if this probability is smaller than some threshold (similar to the type I error probability) we reject the null hypothesis.

Instead of the probability P(H0 is true | data) we often compute the Bayes factor, given as follows: say X1, .., Xn~f(x|θ) and θ~g, then the posterior density is g(θ|x) ∝ L(θ)g(θ). Write Θ1 for the parameter region of the alternative. The belief about H0 before the experiment is described by the prior odds ratio P(θ∈Θ0)/P(θ∈Θ1), and the belief about H0 after the experiment is described by the posterior odds ratio P(θ∈Θ0|x)/P(θ∈Θ1|x). The Bayes factor is then the ratio of the posterior to the prior odds ratios (a ratio of ratios):

BF = [P(θ∈Θ0|x)/P(θ∈Θ1|x)] / [P(θ∈Θ0)/P(θ∈Θ1)]
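As an illustration, take the Bernoulli example from the beginning with H0: p = 0.5. The sketch below rests on assumptions of our own: the prior puts probability prior_prob_null on the point {0.5} and spreads the rest uniformly over [0,1], and the data (60 successes in 100 trials) are made up. Under these assumptions the posterior odds equal the prior odds times the ratio of the two marginal likelihoods, so that ratio is the Bayes factor.

```python
from math import comb

def bayes_factor_point_null(k, n, p0=0.5):
    """Bayes factor for H0: p = p0 against H1: p ~ Uniform(0, 1),
    given k successes in n Bernoulli trials."""
    m0 = comb(n, k) * p0 ** k * (1 - p0) ** (n - k)  # P(data | H0)
    m1 = 1 / (n + 1)  # binomial likelihood integrated against the uniform prior
    return m0 / m1

def posterior_prob_null(k, n, prior_prob_null=0.5, p0=0.5):
    """P(H0 is true | data), obtained from posterior odds = prior odds * Bayes factor."""
    prior_odds = prior_prob_null / (1 - prior_prob_null)
    posterior_odds = prior_odds * bayes_factor_point_null(k, n, p0)
    return posterior_odds / (1 + posterior_odds)

# Hypothetical data: 60 successes in 100 trials
k, n = 60, 100
print(f"Bayes factor = {bayes_factor_point_null(k, n):.3f}")
print(f"P(H0 | data) = {posterior_prob_null(k, n):.3f}")
```

The Bayes factor does not depend on how much prior probability is put on {0.5}, only on how the prior is spread within H0 and within the alternative. That is one reason it is often reported instead of P(H0 is true | data).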