Evaluating Hypothesis Tests

The Power of a Test

In a hypothesis test, the type I error probability α is defined by α=P(reject H0|H0 is true) and is chosen by the analyst at the beginning of the test. The type II error probability β, on the other hand, is defined by β=P(accept H0|H0 is false).

Example: Say we have X1, .., Xn~Ber(p) and for some reason we want to test H0: p=0.5 vs Ha: p=0.6.

Now X̄ is the mle of p, and a large value of X̄ indicates that the null hypothesis is wrong, so we might use a test with the rejection region {X̄>cv} for some critical value cv. Then

α = P(X̄>cv|p=0.5) = 1-P(∑X≤n·cv|p=0.5) = 1-pbinom(n·cv,n,0.5)
1-α=pbinom(n·cv,n,0.5)
n·cv=qbinom(1-α,n,0.5)
cv=qbinom(1-α,n,0.5)/n

and so

β = P(X̄≤cv|p=0.6) = P(∑X≤n·cv|p=0.6) = pbinom(n·cv,n,0.6) = pbinom(qbinom(1-α,n,0.5),n,0.6)

As a numerical example say α=0.05 and n=100, then cv=0.58 and β=0.3774
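The numbers in this example can be reproduced with a few lines of R. This is a minimal sketch; the variable names (k, alpha.actual) are chosen here for illustration. Because the binomial distribution is discrete, qbinom returns the smallest k with pbinom(k,n,0.5) ≥ 1-α, so the actual type I error probability is at most α rather than exactly α.

```r
# Critical value and type II error for H0: p=0.5 vs Ha: p=0.6
alpha <- 0.05
n <- 100

# smallest k with P(sum(X) <= k | p=0.5) >= 1-alpha
k <- qbinom(1 - alpha, n, 0.5)
cv <- k / n

# beta = P(sum(X) <= k | p=0.6), the probability of failing to reject
beta <- pbinom(k, n, 0.6)

# actual type I error probability of the rejection region {sum(X) > k}
alpha.actual <- 1 - pbinom(k, n, 0.5)

round(c(cv = cv, beta = beta, alpha.actual = alpha.actual), 4)
```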

Example: Say we have X1, .., Xn~Ber(p) and now we want to test H0: p=0.5 vs Ha: p>0.5.

Now we can repeat the above, and again we find cv=qbinom(1-α,n,0.5)/n. But when we want to find β we have a problem: we don't know what p is. What we can do is find β as a function of p:

β(p) = P(X̄≤cv|p) = P(∑X≤n·cv|p) = pbinom(n·cv,n,p) = pbinom(qbinom(1-α,n,0.5),n,p)

In real life we usually calculate the power of the test, defined by Pow(p)=1-β(p). It has two advantages:

1) it gives the probability of correctly rejecting a false null hypothesis
2) Pow(p0)=α

The power curve for this test is drawn in berpower. The curve for n=100, α=0.05 is in blue, for other values in red.
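A power function of this kind might be coded as follows. This is only a sketch: the function name ber.power and its arguments are hypothetical and are not the routine referred to above.

```r
# Power of the test with rejection region {sum(X) > k}, k = qbinom(1-alpha, n, p0)
# (ber.power is an illustrative name, not part of any package)
ber.power <- function(p, n = 100, alpha = 0.05, p0 = 0.5) {
  k <- qbinom(1 - alpha, n, p0)   # n*cv
  1 - pbinom(k, n, p)             # Pow(p) = 1 - beta(p)
}

p <- seq(0.5, 0.8, length.out = 61)
pow <- ber.power(p)

# a few values of the power curve for n=100 and for n=50
data.frame(p = p, n100 = round(pow, 3),
           n50 = round(ber.power(p, n = 50), 3))[c(1, 21, 41, 61), ]
```

Calling plot(p, pow, type="l") draws the curve. Because of the discreteness of the binomial, the computed value at p=0.5 is slightly below α.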

Example: Say we have X1, .., Xn~N(μ,σ), σ known, and now we want to test H0: μ=μ0 vs Ha: μ≠μ0. Again X̄ is the mle of μ, and under H0 we have Z=√n(X̄-μ0)/σ ~ N(0,1), so a test might use the rejection region {|Z|>cv}, where cv=qnorm(1-α/2) so that P(|Z|>cv|μ0)=α.

The power curve for this test is drawn in mean.pow(1).
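When the true mean is μ, Z is normal with mean √n(μ-μ0)/σ and sd 1, so the power of the two-sided test {|Z|>cv} with cv=qnorm(1-α/2) can be written down directly. A minimal sketch, with the hypothetical function name z.power and arbitrary default values:

```r
# Power of the two-sided z test of H0: mu=mu0 (z.power is an illustrative name)
z.power <- function(mu, mu0 = 0, sigma = 1, n = 10, alpha = 0.05) {
  cv <- qnorm(1 - alpha / 2)
  shift <- sqrt(n) * (mu - mu0) / sigma   # mean of Z when the true mean is mu
  # P(Z < -cv) + P(Z > cv) for Z ~ N(shift, 1)
  pnorm(-cv - shift) + 1 - pnorm(cv - shift)
}

mu <- seq(-2, 2, by = 0.5)
round(z.power(mu), 3)
```

At μ=μ0 the function returns α, and it is symmetric around μ0.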

Example: Again we have X1, .., Xn~N(μ,σ), σ known, and now we want to test H0: μ=μ0 vs Ha: μ≠μ0, but this time we will use the median M as an estimator of μ. Again a reasonable rejection region is {|M-μ0|>cv}. The problem is, what is the distribution of M? It can be found as follows: if n is odd we have M=X[(n+1)/2], the (n+1)/2 order statistic of X1, .., Xn, so

f_M(x|μ) = n!/[((n-1)/2)!]² · φ(x|μ) · Φ(x|μ)^((n-1)/2) · (1-Φ(x|μ))^((n-1)/2)

where φ(x|μ) and Φ(x|μ) are the density and cdf of normal rv's with mean μ (and sd σ). If n is even M=(X[n/2]+X[(n+2)/2])/2, and it can be shown that

med.pow(2) draws the density of the median (in red) together with the normal density (in blue) if n is odd.
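For odd n the median is the (n+1)/2 order statistic, and the standard order-statistic formula gives its density as n!/(m!)² · φ · Φ^m · (1-Φ)^m with m=(n-1)/2. A minimal sketch in R; the function name dmedian is hypothetical:

```r
# Density of the sample median of n iid N(mu, sigma) observations, n odd
# (dmedian is an illustrative name)
dmedian <- function(x, mu = 0, sigma = 1, n = 9) {
  stopifnot(n %% 2 == 1)
  m <- (n - 1) / 2
  const <- exp(lgamma(n + 1) - 2 * lgamma(m + 1))   # n!/(m!)^2
  Fx <- pnorm(x, mu, sigma)
  const * dnorm(x, mu, sigma) * Fx^m * (1 - Fx)^m
}

# sanity check: the density should integrate to 1
integrate(dmedian, -Inf, Inf)$value
```

Calling curve(dmedian(x), -2, 2) and adding curve(dnorm(x), add=TRUE) gives a comparison with the normal density.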
Now

1-α = P(|M-μ0|≤cv | μ0) = ∫_{μ0-cv}^{μ0+cv} f_M(x|μ0) dx

and we can find cv using the integrate function in R. Set cv=0, then

1) set cv=cv+0.01
2) find a=integrate(f,mu0-cv,mu0+cv)$value, where f is the density of M under H0
3) if a>1-α, done, otherwise go back to 1)

Finally β(μ)=P(μ0-cv<M<μ0+cv | μ) which we can again find using the integrate function.
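The search for cv and the calculation of β(μ) might be coded as follows for odd n. This is a sketch under stated assumptions: the function names (dmedian, median.cv, median.beta) and the default values n=9, σ=1 are chosen here for illustration.

```r
# density of the median for odd n (standard order-statistic formula)
dmedian <- function(x, mu, sigma, n) {
  m <- (n - 1) / 2
  Fx <- pnorm(x, mu, sigma)
  exp(lgamma(n + 1) - 2 * lgamma(m + 1)) * dnorm(x, mu, sigma) * Fx^m * (1 - Fx)^m
}

# increase cv in steps of 0.01 until P(|M-mu0| <= cv | mu0) exceeds 1-alpha
median.cv <- function(mu0 = 0, sigma = 1, n = 9, alpha = 0.05, step = 0.01) {
  cv <- 0
  repeat {
    cv <- cv + step
    a <- integrate(dmedian, mu0 - cv, mu0 + cv,
                   mu = mu0, sigma = sigma, n = n)$value
    if (a > 1 - alpha) return(cv)
  }
}

# beta(mu) = P(mu0-cv < M < mu0+cv | mu)
median.beta <- function(mu, cv, mu0 = 0, sigma = 1, n = 9) {
  integrate(dmedian, mu0 - cv, mu0 + cv, mu = mu, sigma = sigma, n = n)$value
}

cv <- median.cv()
beta <- sapply(c(0, 0.5, 1), median.beta, cv = cv)
round(c(cv = cv, power = 1 - beta), 3)
```

The step size 0.01 limits the accuracy of cv; uniroot could be used instead of the loop if a more precise value is needed.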

med.pow(3) draws the power curves of the median (in red) together with the power of the test based on the sample mean (in blue).

Example: Again we have X1, .., Xn~N(μ,σ), σ known, and again we want to test H0: μ=μ0 vs Ha: μ≠μ0. If we are worried about possible outliers we might decide to use a trimmed mean as our estimator: the 100p% trimmed mean is defined by

where ⌊x⌋ and ⌈x⌉ are called the floor and the ceiling functions. In other words, to find the 100p% trimmed mean, eliminate the 100p% smallest and the 100p% largest observations and find the mean of the rest. Note: mean = 0% trimmed mean and median ≈ 50% trimmed mean. In R use the mean(x,trim=p) function.
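As a small illustration with made-up data, the following sketch compares mean(x,trim=p) with a calculation by hand. R's mean drops floor(n·p) observations from each end, which is R's own convention; it agrees with the verbal description whenever n·p is an integer.

```r
# made-up data with one outlier (illustration only)
x <- c(2.1, 3.4, 1.9, 50, 2.8, 3.0, 2.5, 2.2, 3.1, 2.7)
p <- 0.2
n <- length(x)

m0 <- mean(x)             # ordinary mean, pulled up by the outlier
tp <- mean(x, trim = p)   # 20% trimmed mean

# the same by hand: drop the g smallest and the g largest observations
g <- floor(n * p)
by.hand <- mean(sort(x)[(g + 1):(n - g)])

m50 <- mean(x, trim = 0.5)  # trim=0.5 returns the median

c(mean = m0, trimmed = tp, by.hand = by.hand, trim50 = m50, median = median(x))
```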

Again a reasonable test has rejection region {|Tp-μ0|>cv}. But what is the distribution of Tp? It cannot be found analytically for a general p, so either we do some heavy math every time we want a different p, or we need a different solution. Here is one based on simulation:

to find cv:

1) generate Y1, .., Yn~N(μ0,σ), calculate Tp, call it Tp(1)
2) repeat 1) many times, say 1000 times
3) Find cv such that 100α% of the |Tp-μ0|'s are greater than cv

to find β(μ):

1) generate Y1, .., Yn~N(μ,σ), calculate Tp, call it Tp(1)
2) repeat 1) many times, say 1000 times
3) Find β(μ) as the proportion of Tp's such that |Tp-μ0|≤cv (the proportion with |Tp-μ0|>cv estimates the power)
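The two simulation recipes above can be sketched in R as follows. The function names (sim.tp, trim.cv, trim.beta), the seed and the default B=1000 are assumptions made for this illustration. Note that β(μ) is the proportion of simulated values falling inside the acceptance region, and the power is one minus that.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# B simulated trimmed means of samples of size n from N(mu, sigma)
sim.tp <- function(B, n, mu, sigma, p) {
  replicate(B, mean(rnorm(n, mu, sigma), trim = p))
}

# cv such that 100*alpha% of the |Tp-mu0| simulated under H0 exceed it
trim.cv <- function(n, mu0, sigma, p, alpha = 0.05, B = 1000) {
  tp <- sim.tp(B, n, mu0, sigma, p)
  unname(quantile(abs(tp - mu0), 1 - alpha))
}

# beta(mu): proportion of simulated Tp with |Tp-mu0| <= cv
trim.beta <- function(mu, cv, n, mu0, sigma, p, B = 1000) {
  tp <- sim.tp(B, n, mu, sigma, p)
  mean(abs(tp - mu0) <= cv)
}

cv <- trim.cv(n = 9, mu0 = 0, sigma = 1, p = 0.25)
beta <- sapply(c(0, 0.5, 1), trim.beta,
               cv = cv, n = 9, mu0 = 0, sigma = 1, p = 0.25)
round(c(cv = cv, power = 1 - beta), 3)
```

With only 1000 runs the results carry noticeable simulation error; a larger B gives smoother power curves at the cost of run time.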

med.pow(4) draws the power curves of Tp (in green) together with those of the median (in red) and of the test based on the sample mean (in blue). Because this takes a while to run, here is one example, med.pow(4,n=9, p=0.25):

As one would expect, the power of the 25% trimmed mean test is between those of the mean and the median.

Neyman Pearson Theory

Definition: Let C be a class of tests for testing H0: θ∈Θ0 vs Ha: θ∈Θ0^c. A test in C with power function Pow(θ) is a uniformly most powerful (UMP) class C test if Pow(θ)≥Pow'(θ) for every θ∈Θ0^c and every power function Pow' of a test in C.

If the class C is the class of all tests with level α it is called the UMP level α test.

Theorem (Neyman-Pearson Lemma)

Consider testing H0: θ=θ0 vs Ha: θ=θ1, using a test with rejection region R given by

x∈R if f(x|θ1)>k·f(x|θ0)
and
x∈R^c if f(x|θ1)<k·f(x|θ0)
for some k≥0, and
α=P(X∈R|θ0)

Then

a) (sufficiency) Any test of this form is a UMP level α test.
b) (necessity) If there exists a test of this form with k>0, then every UMP level α test is of this form.

Example: Let X1, .., Xn~N(μ,σ), σ known, and assume we want to test H0: μ=μ0 vs Ha: μ=μ1. Then

As k goes from 0 to ∞, log(k) goes from -∞ to ∞ and so does the whole right side; therefore a UMP level α test is of the form {X̄<cv}, where

α=P(X̄<cv|μ=μ0)

and so
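Under H0 we have X̄~N(μ0,σ/√n), so solving α=P(X̄<cv|μ=μ0) for cv gives cv=μ0+qnorm(α)·σ/√n. A minimal R sketch of this calculation follows, assuming the rejection region {X̄<cv} stated above; the function name np.cv and the numerical values (μ0=0, μ1=-0.5, σ=1, n=25) are made up for illustration.

```r
# Critical value of the test with rejection region {Xbar < cv}
# (np.cv is an illustrative name)
np.cv <- function(mu0, sigma, n, alpha = 0.05) {
  mu0 + qnorm(alpha) * sigma / sqrt(n)
}

mu0 <- 0; mu1 <- -0.5; sigma <- 1; n <- 25
cv <- np.cv(mu0, sigma, n)

# check of the level: P(Xbar < cv | mu0)
level <- pnorm(cv, mean = mu0, sd = sigma / sqrt(n))

# power at the alternative: P(Xbar < cv | mu1)
power <- pnorm(cv, mean = mu1, sd = sigma / sqrt(n))

round(c(cv = cv, level = level, power = power), 4)
```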