Methods for Finding Hypothesis Tests

Direct Approach

Say we have X1, .., Xn ~ Pois(q) and we want to test H0: q=q0 vs. H1: q > q0
We know that is the mle of q, so a test based on seems reasonable.
Clearly large values of will indicate that the alternative is more likely to be true than the null hypothesis, so a reasonable rejection region is {>cv}. To find cv we need to solve the equation a = P(>cv | q=q0) = 1 - P( X1+..+Xn < n*cv | q=q0), and so cv = qpois(1-a,nq0)/n

What if we want to test H0: q=q0 vs. H1: qq0? Now the critical region should be { < cv1 or > cv2}. It turns out, though, that the choice of cv1 and cv2 is not unique. For example, cv1 = 0 and cv2 = qpois(1-a,nq0)/n would work. One popular strategy is the following
cv1 = qpois(a/2,nq0)/n and cv2 = qpois(1-a/2,nq0)/n.

This test is implemented in poistest1. It also draws the power curve.

Likelihood Ratio Tests

Say we want to test H0: q Q0 vs. H1: q Q0c. Then the likelihood ratio test statistic is defined by
hyp1fig1.png - 1851 Bytes
A likelihood ratio test (LRT) is any test that has a rejection region of the form {x: l(x) ≤ c}

The constant c here is not important, it will be found once we deside on the type I error probability a. It may better to think of this as "We reject H0 if l(x) is small".

Note that the supremum in the denominator is found over the whole parameter space, so this is just like finding the mle, and then finding the corresponding value of the likelihood function.

Note that we in the numerator we find the supremum over a subset of the one used in the denominator, so we always have l(x) ≤ 1

The logic of the LRT is this: In the denominator we have the likelihood of observing the data we did observe, given the most favourable paramaters (the mle) possible. In the numerator we find the likelihood assuming the null hypothesis is true. If their ratio is much smaller than 1, then there are parameters outside the null hypothesis which are much more likely than any in the null hypothesis, and we would reject the null hypothesis.

Example: Let X1, .., Xn be a sample from N(q,1). Consider testing H0: q=q0 vs. H1: qq0. Here Q0={q0} and so the numerator of l(x) is L(q0|x). For the denominator we have to find sup{L(q|x),q }:

and so

Now an LRT test rejects the null hypothesis if l(x) < c for some constant c. But

In other words the LRT test rejects the null hypothesis if l(x) is small, which is equivalent to |-q0| being large.
What is the constant c3? It depends on a, namely

for example, say n=10 and we want a=0.05, then c3=qnorm(1-0.05/2)/sqrt(10) = 0.62

Example: Let X1, .., Xn be a sample from a population witha pdf f(x|q) = e-(x-q) if x>q and 0 otherwise. (This is an exponential r.v with rate 1, shifted by q). The likelihood function is given by
hyp1fig6.png - 3754 Bytes
Consider testing H0: qq0 vs. H1: q>q0 for some specified value q0. The function L(q|X) is positive and increasing on - < q ≤ X(1), and then drops to 0. Thus the denominator of l(X), the unrestricted maximum of L(q|X) is L(X(1)|X). If X(1)q0, the numerator of l(X) is also L(X(1)|X). But since we maximize over qq0, the numerator is L(q0|X)if X(1) > q0. Therefore the likelihood ratio statistic is given by
hyp1fig7.png - 3377 Bytes
The likelihood ratio function is drawn in lrtexp
An LRT rejects the null hypothesis if l(X) ≤ c, which is clearly equivalent to a test which rejects the null hypothesis if X(1) ≥ c'. To determine the value of c' for some specific n and a we would now need the distribution of X(1). This is not impossible (actually, not even very hard, we have a formula for the 1st order statistic) but we can also do this via simulation. The calculations are also in lrtexp.

Asymptotic Distribution of the LRT

Suppose X1, .., Xn are iid f(x|q) and we wish to test H0: q Q0 vs. H1: q Q0c. Then under some regularity conditions the distribution of -2logl(X) converges to the distribution of a c2(p). Here p is difference of the number of free parameters in Q and the number of free parameters in Q0.

Note that if we let be the mle of L, that is L(|X) = sup{L(q|X):q Q} and let be the numerator of the LRT, that is L(|X) = sup{L(q|X):q Q0}, then
hyp1fig8.png - 3452 Bytes

Example: Let's return to the example of the mixture of two normals, and let's say we want to test H0: a=0.5 vs. H1: a 0.5. For the denominator of the LRT we can use the routine mixmle2. For the numerator we could redo the calculations leading to mixmle2, but now assuming a=0.5. Alternatively (and a whole lot less work, but considerably slower) we can use the R routine optim, general purpose minimizer. The test is run in the routine mixtest1
How good a test is this? That is, does it achieve its nominal type I error, and what is its power? We study this on one example in mixtest1power where we generate 200a N(0,1)'s and 200(1-a) N(3,0.52) for a from 0.5 (meaning the null hypothesis is true) to 0.8 (meaning it is quite wrong). To compute the power we run 1000 simulations for each of 20 values of a. Even on a fast desktop computer this takes about 20 minutes! And a complete study should include the effects of n and of different values of m1, s1, m2 and s2! The result of this study is in mixpoweres[[1]]

Example: Again consider the mixture of normals example, and let's say we wish to test H0: m1-m2=2.5, s1=s2 vs H1: either one or the other is not true.
Here under H0 we have three parameters: a, m1 (and once we have m1, m2 is fixed) and s1 (again then s2 is fixed).
The test is run in the routine mixtest2
Now the power of the test is a function of two variables, the true difference in means, and the true difference in standard deviations. In mixtestpower2 we compute the power for some examples. The result of two cases is in mixpoweres[[2]] and mixpoweres[[3]]

Large Sample Tests based on the CLT

Suppose we wish to test a hypothesis about a real-valued parameter q, and Tn is a point estimator of q. Say sn is the standard deviation of Tn. Now if some form of the CLT shows that (Tn-q)/sn converges in distribution to N(0,1) we can use this as a basis for a test.
Sometimes sn also depends on unknown parameters. In that case we can use an estimate of sn such as Sn instead.

A test based on Zn = (Tn-q)/Sn is often called a Wald test.

Example: Let X1, .., Xn be iid Ber(p). Consider testing H0: p = p0 vs. H1: p p0. The MLE of p is = . Since is the sample mean the CLT applies and states that for any p
hyp1fig9.png - 2291 Bytes
Of course we don't know p, but again we can estimate the p's in the denominator by and so we get a test with the test statistic

and we reject the null hypothesis is |Zn| > za/2.

Instead of replacing the p's in the denominator by we could also have used p0. Then another test is based on
hyp1fig11.png - 1677 Bytes
which rejects the null hypothesis if |Z'n| > za/2.

Which of these tests is better? Well, that depends on the power function. In powerbertest we draw the power function for both tests, for testing p=0.2. As we see the powerfunctions cross, so it depends on the true value of p which test is better.
In the case of Z' we can actually compute the power directly from a binomial cdf, see the black curve.

Bayesian Hypothesis Testing

In the Bayesian framework hypothesis tests are based on P(H0 is true |X) and P(H1 is true |X). In many ways these probabilities are exactly what a researcher desires to know, but they can only be found at the price of a prior distribution.
A Bayesian hypothesis test might then reject the null hypothesis if P(H1 is true |X) > 1-a.

Example: Let X1, .., Xn be iid N(m,s2) and let the prior distribution on m be N(m0,t2), where s2, m0 and t2 are known. Say we wish to test H0: mm0 vs. H1: m > m0. It can be shown that then the posterior distribution p(m|X) is N(nt2+s2m)/(nt2+s2),s2t2/(nt2+s2)).
Say we decide to accept H0 if P(H0 is true |X) > P(H1 is true |X) and reject H0 otherwise. So we reject H0 if P(mm0|X) ≥ 1/2.
Since p(m|X) is symmetric, this is true iff the mean of p(m|X) is less than or equal to m0. Therefore H0 is accepted if m0 + s2(m0-m)/(nt2)

Tests based on Simulation

The idea here is very simple: generate lots of simulated data from the same distribution as the real data, assuming the null hypothesis is true. For each run compute the corresponding test statistic, and then compare these values to the one from the data.

Example: Say X1, .., Xn are iid P(l). We wish to test H0: l=l0 vs. H1: ll0.
We know that the sample mean is the mle of l, and so we can base a test on and we reject H0 if is to far from l. For this we compute B simulated data sets X'1, .., X'n iid P(l0) and find the a/2 and the 1-a/2 quantiles. This will give us the critical values cv1 and cv2, and we reject H0 if either < cv1 or if > cv2

The calculations are in poistest and a power calculation is in powerpoistest

There are also some tests already built on simulation. Say we have X1, .., Xn from some distribution f(x) with mean m1 and Y1, .., Ym from the same distribution f(x) with mean m2. We wish to test H0: m1=m2 vs. H1: m1 m2.
Now under the null hypothesis the two samples come from the exact same distribution with the same mean. So their order is completely random, and if we find the test statistic
hyp1fig12.png - 681 Bytes
it should have a mean of 0. We could now proceed as above, generating data from f etc., but here is another idea: let's but the X's and the Y's in one vector of length n+m, let's find a random permutation of this vector, and then split it up into the first n and the last m numbers, calling them X' and Y'. Because under H0 any rearrangement of the data is just as likely as any other, this new data set is just as good as the original one. So we can now find T' from these observations. Because we have complete mixed the X's and the Y's we have ET'=0 for sure. Generating many T's we get an idea of "likey" values of T, and can compare the one in the real data set to them.
This is called a permutation test.

Example: Say we have X1, .., Xn iid N(m1,s2) and Y1, .., Ym iid N(m2,s2). Now the LRT test for this is based on
hyp1fig13.png - 2293 Bytes
which has a t distribution with n+m-2 degrees of freedom, and we reject H0 if |T|>qt(1-a/2). Let's compare this test (which is known to be optimal) to a permution test.
The 2-sample t test is already implemented in t.test, and we can find its power either via simulation or analytically. The permutation test ist implemented in perm.test. Its power has to be found via simulation, but because each permutation already includes a loop over 1000 runs, and now a simulation of a single point on the power curve will need to run perm.test 1000 times as well, and we will want say 50 points on the power curve, this mean we will run the inner loop of perm.test about 50 million times. On a desktop this takes about 2 hours. One example (for n=10 and m=15) was already run, and the result is in powertperm.plot. As we can see, the permutation test is just about as good as the t test, at least for these parameter values.