Inference for a Proportion (or Percentage) p

Contents of this page:
Assumptions
Confidence Interval
Hypothesis Test
Sample Size

In this section we will discuss inference for proportions (or percentages) such as the percentage of people who prefer Coke over Pepsi, who will vote PNP in the next election, who earn more than $50,000 per year etc.
Say we do a survey of n people and ask them "Do you prefer Coke over Pepsi?" If we let X be the number of people in the sample who say "yes", then we already know X~Bin(n,p). The object of interest here is p, the proportion of people in the whole population who prefer Coke over Pepsi. The natural point estimate of p is X/n. We will often use the notation $\hat{p}$ = X/n to indicate that we are thinking of a point estimate for p.

Note: Mostly we use Greek letters for parameters. Here is one exception, because we already used p in Bin(n,p). Some textbooks use π, though.

Example Say in a survey of 500 people 312 say they prefer Coke over Pepsi. Then a point estimate for the proportion of people who prefer Coke over Pepsi is $\hat{p}$ = X/n = 312/500 = 0.624

Note Most often problems are stated in terms of percentages instead of proportions but all the methods use proportions. Simply multiply by 100% at the end.

Example A point estimate for the percentage of people who prefer Coke over Pepsi is 62.4%

Method

Based on normal approximation

Assumptions

All the methods discussed in this section are what we call large sample methods, that is they require a certain minimal sample size to work. Specifically we require that
n$\hat{p}$ ≥ 5 and n(1-$\hat{p}$) ≥ 5
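
The check above is easy to automate. The following is a minimal sketch, not part of the original notes; the function name `large_sample_ok` and its `threshold` argument are made up for illustration. It uses the fact that n$\hat{p}$ is just the number of "yes" answers X and n(1-$\hat{p}$) is the number of "no" answers n-X.

```python
def large_sample_ok(x, n, threshold=5):
    """Check the large-sample conditions n*p_hat >= 5 and n*(1 - p_hat) >= 5.

    Since p_hat = x/n, n*p_hat equals x (the successes) and
    n*(1 - p_hat) equals n - x (the failures).
    """
    successes = x
    failures = n - x
    return successes >= threshold and failures >= threshold


# Coke vs. Pepsi survey from the example above: 312 "yes" out of 500
coke_ok = large_sample_ok(312, 500)
# A hypothetical survey with only 2 "yes" answers fails the check
tiny_ok = large_sample_ok(2, 500)
print(coke_ok, tiny_ok)
```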

Confidence Interval


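A 100(1-α)% confidence interval for the population proportion p is given by

$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

This formula uses the critical values $z_{\alpha/2}$, which can be found in a table of standard normal critical values.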

Example: Alcohol on college campuses is a very serious problem. But how common is it? A survey of 17,096 students in US four-year colleges collected information on drinking behavior and alcohol-related problems (Henry Wechsler et al., "Health and Behavioral Consequences of Binge Drinking in College", Journal of the American Medical Association, 272 (1994)). The researchers defined "frequent binge drinking" as having five or more drinks in a row three or more times in the past two weeks. According to this definition 3,314 students were classified as frequent binge drinkers.

Problem: Find a point estimate for the percentage of frequent binge drinkers.
Solution: A point estimate for the proportion of frequent binge drinkers is

$$\hat{p} = X/n = 3314/17096 = 0.194$$

therefore a point estimate for the percentage is 19.4%.

Problem: Find a 99% confidence interval for the percentage of frequent binge drinkers.
Solution:
Assumptions: n$\hat{p}$ = 17096·0.194 ≈ 3314 ≥ 5 and n(1-$\hat{p}$) = 17096(1-0.194) ≈ 13782 ≥ 5
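100(1-α)% = 99%, so α = 0.01, so α/2 = 0.005 and $z_{0.005}$ = 2.58. Therefore

$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 0.194 \pm 2.58\sqrt{\frac{0.194(1-0.194)}{17096}} = 0.194 \pm 0.008$$

and we find a 99% confidence interval for the percentage of frequent binge drinkers to be (18.6%, 20.2%).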
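
For readers who want to check such a calculation by computer, here is a minimal sketch in Python using only the standard library. It is not part of the original notes: the function name `proportion_ci` is made up for illustration, and the critical value is taken from `statistics.NormalDist` rather than from a table, so it uses 2.5758 instead of the rounded 2.58.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(x, n, conf_level=0.95):
    """Large-sample (normal approximation) confidence interval for p.

    x: number of successes, n: sample size.
    Returns the lower and upper limits as proportions.
    """
    p_hat = x / n
    alpha = 1 - conf_level
    # critical value z_{alpha/2} of the standard normal distribution
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Binge drinking survey: 3314 frequent binge drinkers among 17096 students
low, high = proportion_ci(3314, 17096, conf_level=0.99)
print(f"99% CI: ({100 * low:.1f}%, {100 * high:.1f}%)")
```

For the binge drinking data the half-width is about 2.58·0.0030 ≈ 0.008, so the limits come out at roughly 0.186 and 0.202.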

Hypothesis Test

Null Hypothesis: H0: p = p0

Alternative Hypothesis: Choose one of the following
a) Ha: p > p0
b) Ha: p < p0
c) Ha: p ≠ p0

Test Statistic:

$$Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$$

Rejection Region:
If your alternative is a) Ha: p > p0, then reject H0 if Z > $z_{\alpha}$
If your alternative is b) Ha: p < p0, then reject H0 if Z < -$z_{\alpha}$
If your alternative is c) Ha: p ≠ p0, then reject H0 if |Z| > $z_{\alpha/2}$

Example As we said in the discussion on probability, the South African mathematician John Kerrich, while in a German POW camp during WWII, tossed a coin 10000 times and got 5067 heads. Test at the 5% level of significance whether this result is compatible with a fair coin.
Solution:
$\hat{p}$ = 5067/10000 = 0.5067

1) Parameter: proportion p
2) Method: based on normal approximation
3) Assumptions: n$\hat{p}$ = 10000·0.5067 = 5067 ≥ 5 and n(1-$\hat{p}$) = 10000(1-0.5067) = 4933 ≥ 5
4) α = 0.05
5) H0: p = 0.5 (50% of flips result in "Heads", coin is fair)
6) Ha: p ≠ 0.5 (coin is not fair)
7) $$Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} = \frac{0.5067 - 0.5}{\sqrt{0.5(1-0.5)/10000}} = 1.34$$

8) We reject H0 if |Z| > $z_{\alpha/2}$. Here α = 0.05, so α/2 = 0.025 and $z_{0.025}$ = 1.96. Therefore we reject H0 if |Z| > 1.96. We have |Z| = 1.34 < 1.96, so we fail to reject the null hypothesis.
9) There is no evidence that John Kerrich's coin was unfair.
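
The steps of this test can also be sketched in a few lines of Python. This is an illustration only, not part of the original notes: the function name `proportion_z_test` is made up, it covers only the two-sided alternative c), and the critical value comes from `statistics.NormalDist` instead of a table.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(x, n, p0, alpha=0.05):
    """Two-sided large-sample test of H0: p = p0 vs Ha: p != p0.

    Returns the test statistic Z, the critical value z_{alpha/2},
    and whether H0 is rejected.
    """
    p_hat = x / n
    # the standard error is computed under H0, so it uses p0, not p_hat
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z, z_crit, abs(z) > z_crit


# Kerrich's coin: 5067 heads in 10000 tosses, H0: p = 0.5
z, z_crit, reject = proportion_z_test(5067, 10000, p0=0.5)
print(f"Z = {z:.2f}, critical value = {z_crit:.2f}, reject H0: {reject}")
```

Note that the denominator uses p0 rather than $\hat{p}$, which is the main difference from the standard error in the confidence interval.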

Example Same as above, using the p-value approach
1) Parameter: proportion p
2) Method: based on normal approximation
3) Assumptions: n$\hat{p}$ = 10000·0.5067 = 5067 ≥ 5 and n(1-$\hat{p}$) = 10000(1-0.5067) = 4933 ≥ 5
4) α = 0.05
5) H0: p = 0.5 (50% of flips result in "Heads", coin is fair)
6) Ha: p ≠ 0.5 (coin is not fair)
7) p-value = 0.184 (Stat > Basic Statistics > Proportion, check summarized data, Number of Trials: 10000, Number of Events: 5067, Options > check box, leave everything else as is)
8) p-value = 0.184 > 0.05, so we fail to reject the null hypothesis.
9) There is no evidence that John Kerrich's coin was unfair.
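
If no statistics package is at hand, a p-value based on the normal approximation can be computed directly. The sketch below is an illustration, not part of the original notes; the function name `proportion_p_value` and its `alternative` argument are made up. For Kerrich's data it gives about 0.180, close to but not identical with the 0.184 quoted above. A small gap like this would be expected if the software uses an exact binomial calculation instead of the normal approximation, but that is an assumption, not something stated in the notes.

```python
from math import sqrt
from statistics import NormalDist


def proportion_p_value(x, n, p0, alternative="two-sided"):
    """p-value of the large-sample test of H0: p = p0.

    alternative: "greater" (Ha: p > p0), "less" (Ha: p < p0),
    or "two-sided" (Ha: p != p0).
    """
    z = (x / n - p0) / sqrt(p0 * (1 - p0) / n)
    std_normal = NormalDist()
    if alternative == "greater":
        return 1 - std_normal.cdf(z)
    if alternative == "less":
        return std_normal.cdf(z)
    return 2 * (1 - std_normal.cdf(abs(z)))


# Kerrich's coin: 5067 heads in 10000 tosses, two-sided alternative
p_value = proportion_p_value(5067, 10000, p0=0.5)
print(f"p-value = {p_value:.3f}")
```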

Example Say John Kerrich's coin was actually not a fair coin but one with P(Heads) = 0.505. How often would he have had to flip his coin to reject the null hypothesis?

Of course now we don't have any data, so we have to guess what X might have been. For example if he had flipped this coin 10000 times we would expect him to get about 10000·0.505=5050 heads. Running the test with these numbers we find:

n        X        p-value
10000    5050     0.322
20000    10100    0.159
30000    15150    0.084
40000    20200    0.046

so if he had flipped his coin about 40000 times he would have rejected the null hypothesis of a fair coin.
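
A table like this can be reproduced approximately with a short loop. The sketch below is not part of the original notes, and the helper name `two_sided_p_value` is made up. It uses the normal approximation, so its p-values (about 0.317, 0.157, 0.083 and 0.046) are slightly smaller than those in the table, which presumably come from a different (possibly exact) calculation in the software. The conclusion is the same: the p-value first drops below 0.05 at about 40000 flips.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(x, n, p0=0.5):
    """Two-sided p-value for H0: p = p0 using the normal approximation."""
    z = (x / n - p0) / sqrt(p0 * (1 - p0) / n)
    return 2 * (1 - NormalDist().cdf(abs(z)))


true_p = 0.505  # assumed P(Heads) of the unfair coin
results = []
for n in (10000, 20000, 30000, 40000):
    x = round(true_p * n)  # number of heads we would expect to see
    results.append((n, x, two_sided_p_value(x, n)))

for n, x, p in results:
    print(f"{n:>6} {x:>6} {p:.3f}")
```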

Sample Size Calculation

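As with the mean we start with the formula for the confidence interval:
A 100(1-α)% confidence interval for the population proportion p is given by

$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

So the error here is given by

$$E = z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

and solving this equation for n yields:

$$n = \hat{p}(1-\hat{p})\left(\frac{z_{\alpha/2}}{E}\right)^2$$

We have the same problem as with the mean here: our formula still contains $\hat{p}$, but we can't find $\hat{p}$ until we have a sample, and we are still trying to figure out what the sample size should be. The same ideas as with the mean, such as doing a pilot study, work here as well. In addition we have something else we can do here. Notice that the curve $\hat{p}(1-\hat{p}) = \hat{p} - \hat{p}^2$ is a parabola, opening downwards:

[Figure: graph of $\hat{p}(1-\hat{p})$ against $\hat{p}$, a parabola opening downwards with its maximum 1/4 at $\hat{p}$ = 1/2]

Notice that $\hat{p}(1-\hat{p}) \le 1/4$
and so we have

$$n = \hat{p}(1-\hat{p})\left(\frac{z_{\alpha/2}}{E}\right)^2 \le \frac{1}{4}\left(\frac{z_{\alpha/2}}{E}\right)^2 = \left(\frac{z_{\alpha/2}}{2E}\right)^2$$

In other words, no matter what the population proportion p is, a sample size of

$$n = \left(\frac{z_{\alpha/2}}{2E}\right)^2$$

will always be sufficient.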

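Example You want to do a survey of likely voters for the next election. You want to find a 95% confidence interval for the percentage of voters for the PNP, with an error of E=0.03. What sample size is required?
Solution: 100(1-α)% = 95%, so α = 0.05, so α/2 = 0.025 and $z_{0.025}$ = 1.96. Therefore

$$n = \left(\frac{z_{\alpha/2}}{2E}\right)^2 = \left(\frac{1.96}{2 \cdot 0.03}\right)^2 = 1067.1$$

so, rounding up, a sample of 1068 likely voters is required.

Example Same as above, but for the PIP. Here we already know that p is around 5%, therefore we should use the exact formula:

$$n = \hat{p}(1-\hat{p})\left(\frac{z_{\alpha/2}}{E}\right)^2 = 0.05(1-0.05)\left(\frac{1.96}{0.03}\right)^2 = 202.8$$

so, rounding up, a sample of 203 likely voters is enough.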

From the graph above you see that for proportions close to 0 or 1 a much smaller sample size is enough.
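
Both sample size calculations can be sketched in one small Python function. This is an illustration, not part of the original notes: the function name `sample_size` and the argument name `p_guess` are made up. When no guess for p is supplied it uses the worst case $\hat{p}(1-\hat{p})$ = 1/4, and it always rounds up to the next whole number.

```python
from math import ceil
from statistics import NormalDist


def sample_size(error, conf_level=0.95, p_guess=None):
    """Sample size needed to estimate a proportion to within +/- error.

    If p_guess is None, the conservative bound p(1-p) <= 1/4 is used;
    otherwise p_guess(1 - p_guess) is plugged into the formula.
    """
    alpha = 1 - conf_level
    z = NormalDist().inv_cdf(1 - alpha / 2)
    pq = 0.25 if p_guess is None else p_guess * (1 - p_guess)
    return ceil(pq * (z / error) ** 2)


# PNP example: no prior knowledge of p, E = 0.03, 95% confidence
n_pnp = sample_size(0.03)
# PIP example: p is known to be around 5%
n_pip = sample_size(0.03, p_guess=0.05)
print(n_pnp, n_pip)
```

With E = 0.03 and 95% confidence this gives 1068 for the conservative bound and 203 when p is taken to be about 0.05.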

Look again at the formula for the sample size. There is something truly amazing about what is not part of this formula!

For more on inference for a single proportion see page 430 of the textbook.

Warning
We now have formulas for the mean and formulas for the proportion. These are completely different!

Example Formulas for confidence interval:

• Mean μ:

• Proportion p: $\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$

If you use the formula for the mean although your problem is about a proportion you are guaranteed 0 points. I am typing this while grading a final exam and there are any number of students who have made this type of mistake. Some of them will fail the class because of this!