Some Special Random Variables

Bernoulli RV
Geometric RV
Binomial RV
Normal (Gaussian) Distribution
Central Limit Theorem

Bernoulli Trials

The most basic experiment possible is one that has only two possible outcomes:

• flip a coin - heads or tails
• roll a die - get a six or don't
• take a class - pass or fail
• person smokes - yes or no
• person has open heart surgery - person survives or dies

Any such experiment is called a Bernoulli trial. In order to have a random variable we "code" one outcome as 0 and the other as 1. Then the pmf is given by
x      0     1
P(x)   1-p   p
Often we arbitrarily call one outcome a "success" and the other a "failure". The probability of "success" is denoted by p, and then of course the probability of "failure" is q=1-p.

Example: Flip a fair coin, "success" = "get tails", p = P("success")=0.5, "failure" = "get heads", q = 1-p = 0.5

Example: Roll a fair die, "success" = "get a six", P("success")=1/6, "failure" = "don't get a six", q = 1-p = 5/6

Example: We randomly choose an employee of WRInc., "success" = "employee is female", P("success")=206/527, "failure" = "employee is male", q = 1-p = 321/527

Once we have a distribution of a rv we can find formulas for the population mean and standard deviation: say X is a Bernoulli rv with success probability p, then
Population Mean: m=p
Population Standard Deviation: s=√(pq)
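These two formulas can be checked directly from the pmf table. The following is a minimal Python sketch (Python is not part of these notes, and the function name bernoulli_mean_sd is made up for illustration) that computes the mean and standard deviation from their definitions and prints them next to p and √(pq):

```python
import math

def bernoulli_mean_sd(p):
    """Mean and standard deviation of a Bernoulli rv, computed from its pmf."""
    pmf = {0: 1 - p, 1: p}                       # the table x -> P(x)
    mean = sum(x * prob for x, prob in pmf.items())
    var = sum((x - mean) ** 2 * prob for x, prob in pmf.items())
    return mean, math.sqrt(var)

# Die example: "success" = "get a six", p = 1/6
p = 1 / 6
mean, sd = bernoulli_mean_sd(p)
print(mean, sd)                      # from the definitions
print(p, math.sqrt(p * (1 - p)))     # from the formulas m = p and s = sqrt(pq)
```

Both printed lines should agree up to floating point rounding.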

Geometric RV

Say we carry out a sequence of identical Bernoulli trials until the first success occurs. The rv X is the number of trials needed. Then X is called a geometric rv and we have P(X=k) = q^(k-1) p for k = 1, 2, 3, ...

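Note we already know this formula: by independence, the first k-1 trials must each be a failure (probability q each) and the k-th trial must be a success (probability p).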

Note we use the following shorthand: X~G(p)

Example We roll a fair die until the first six occurs. Let X be the number of rolls. Find the pmf of X.
Here we have a sequence of Bernoulli trials:
• first roll - six or not
• second roll - six or not
• third roll - six or not
etc.

All rolls are identical and independent, so X is a geometric rv with p=1/6 and P(X=k) = (5/6)^(k-1) · (1/6)

Example In a certain population a genetic condition is present in about 35% of the population. If we randomly select people from this population what is the probability that the fifth person tested is the first with the condition?
We have X~G(0.35), so P(X=5) = 0.65^4 · 0.35 = 0.0625

Let's continue this example. What is the probability that none of the first 10 people tested has the condition?
If none of the first 10 people tested has the condition, then the first success has to come after the 10th trial. So

P(none of the first 10 people tested has the condition) = P(X>10) = 1-P(X≤10)

Now of course we could find this probability as follows:

P(X≤10) = P(X=1)+P(X=2)+..+P(X=10)

but there is actually a simple formula for this: P(X≤k) = 1-q^k

so we find
P(X>10) = 1-P(X≤10) = 1-(1-0.65^10) = 0.65^10 = 0.0135
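The two probabilities in this example can be reproduced with a few lines of Python. This is only a sketch: the helper names geom_pmf and geom_cdf are made up here and do not come from MINITAB or the textbook.

```python
def geom_pmf(k, p):
    """P(X = k): k-1 failures followed by the first success on trial k."""
    return (1 - p) ** (k - 1) * p

def geom_cdf(k, p):
    """P(X <= k), using the shortcut formula 1 - q^k."""
    return 1 - (1 - p) ** k

p = 0.35
print(round(geom_pmf(5, p), 4))         # fifth person is the first with the condition
print(round(1 - geom_cdf(10, p), 4))    # none of the first 10 has the condition

# The long way: add up P(X=1), ..., P(X=10) and compare with the shortcut
long_way = sum(geom_pmf(k, p) for k in range(1, 11))
print(abs(long_way - geom_cdf(10, p)) < 1e-12)
```

The last line confirms that summing the individual probabilities gives the same answer as the shortcut formula.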

Again we have formulas for the population mean and standard deviation: say X~G(p) then
Population Mean: m=1/p
Population Standard Deviation: s=√(q)/p

Example A certain rare genetic condition appears in about 4% of the population. If we randomly screen people, about how often do we find somebody with this condition?
We have X~G(0.04), so m=1/0.04=25, about every 25th person will have the condition.
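A quick simulation makes the "every 25th person" statement concrete. The sketch below (plain Python; the function name and the seed are arbitrary choices for illustration) repeats the screening experiment many times and averages the number of people screened until the first case:

```python
import random

def trials_until_first_success(p, rng):
    """Run Bernoulli trials with success probability p; return how many were needed."""
    k = 1
    while rng.random() >= p:    # a failure happens with probability 1 - p
        k += 1
    return k

rng = random.Random(3015)       # fixed, arbitrary seed so the run is repeatable
p = 0.04
draws = [trials_until_first_success(p, rng) for _ in range(100_000)]
print(sum(draws) / len(draws))  # should be close to 1/p = 25
```

The average will not be exactly 25, but with this many repetitions it should be close.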

Binomial RV

Again we carry out a sequence of independent Bernoulli trials, all with success probability p. Now, though, the number of trials, n, is known and we are interested in the number of successes X. X is said to have a Binomial distribution with parameters n and p. We use the following notation: X~Bin(n,p)

Example: what is the probability of exactly 3 "heads" in 5 tosses of a fair coin?
The important step here is to see that this is a problem for the Binomial rv:
• we have a sequence of Bernoulli trials (each flip has two possible outcomes - heads or tails)
• successive trials are independent (one flip does not affect the other)
• the probability of "success" p is the same for each flip (say "success" = "coin shows heads", then p=0.5 on each flip)
• the number of interest is the number of successes (we are interested in the number of "heads" = "success")
so we see that X~Bin(5,0.5)

Example: We randomly choose 10 employees of WRInc. What is the probability that exactly 5 of them are female?
• we have a sequence of Bernoulli trials (each person is either male or female)
• successive trials are independent (true if selection is done with replacement, still "almost" true if not)
• the probability of "success" p is the same for each selection (true if selection is done with replacement, still "almost" true if not). If selection is done without replacement we have:
P(first person is female) = 206/527 = 0.390892
P(second person is female | first person was female) = 205/526 = 0.389734
and those are almost the same.
• the number of interest is the number of successes (we are interested in the number of females = "success")
so we see that X~Bin(10,0.391)

Example X is the number of rainy days during the next week. Is X Binomial?

Example X is the number of rainy days during the next six months. Is X Binomial?

If X~G(p) we had a simple formula for P(X=k). Now X~Bin(n,p), so how do we find P(X=k)? Again there exists a formula but it is more complicated. Worse, there is no simple formula at all for P(X≤k). Instead we will find probabilities either using MINITAB or using tables:

Using the Binomial tables we can find probabilities of the form P(X≤x) for various n and p.

Using MINITAB we can find two types of probabilities:
• P(X=x) : Enter x (or x's) in a column, say c1. Then Calc > Probability Distributions > Binomial, check Probability, enter c1, n and p.
• P(X≤x) : Enter x (or x's) in a column, say c1. Then Calc > Probability Distributions > Binomial, check Cumulative Probability, enter c1, n and p.

Note: we have P(X=x) = P(X≤x)-P(X≤x-1) so you can also find these probabilities from the table

Example Say X~Bin(20,0.4) Find P(X=8)

Using MINITAB: Enter 8 in a column, say c1. Then Calc > Probability Distributions > Binomial, check Probability, enter c1, 20 and 0.4. Answer: P(X=8)=0.179706
Using the Binomial tables P(X=8)= P(X≤8)-P(X≤7) = 0.595-0.416 = 0.179
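For readers without MINITAB or the tables, the same number can be computed from the "more complicated" formula, which is P(X=k) = C(n,k) p^k q^(n-k), where C(n,k) is the number of ways to choose k of the n trials. The following Python sketch is an illustration only; the function names are made up, and math.comb needs Python 3.8 or later.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binom_cdf(k, n, p):
    """P(X <= k): no shortcut here, just add up the pmf from 0 to k."""
    return sum(binom_pmf(j, n, p) for j in range(k + 1))

# X ~ Bin(20, 0.4): P(X = 8) directly, and as P(X <= 8) - P(X <= 7)
print(round(binom_pmf(8, 20, 0.4), 6))
print(round(binom_cdf(8, 20, 0.4) - binom_cdf(7, 20, 0.4), 6))
```

Both lines should reproduce the MINITAB answer above; the table answer differs slightly only because the table entries are rounded.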

Note: Say we want P(3≤X≤5) Then we have
P(3≤X≤5)
= P(X=3 or X=4 or X=5)
= P(X=0 or X=1 or X=2 or X=3 or X=4 or X=5) - P(X=0 or X=1 or X=2)
= P(X≤5) - P(X≤2)

For the same reason we have P(X≥5)=1-P(X≤4)

Example We randomly choose 10 employees of WRInc. What is the probability that exactly 5 of them are female? Let X be the number of employees chosen that are female. Then X~Bin(10,206/527).
p=206/527=0.3909 is not in the table, so we have to use MINITAB, which gives
P(X=5) = 0.1928

Example Say we flip a fair coin 100 times. What is the probability of getting at most 55 "heads"?
Let X be the number of "heads" in 100 flips of a fair coin. Then X~Bin(100,0.5). So
P(X ≤ 55) = 0.8644

Example Say we flip a fair coin 100 times. What is the probability of getting at least 55 "heads"?
Again let X be the number of "heads" in 100 flips of a fair coin so that X~Bin(100,0.5). Now
P(X ≥ 55) = 1-P(X≤54) = 1-0.8159 = 0.1841

Example Say we flip a fair coin 100 times. What is the probability of getting between 45 and 55 "heads", inclusive?
Let X be the number of "heads" in 100 flips of a fair coin. Then X~Bin(100,0.5). So
P(45 ≤ X ≤ 55) = P(X ≤ 55) - P(X ≤ 44) = 0.8644 - 0.1356 = 0.7287
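The three coin-flip answers above are all built from cumulative probabilities P(X≤x). As a rough cross-check outside MINITAB, here is a short Python sketch (binom_cdf is a made-up helper that simply sums the binomial pmf):

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Bin(n, p), summing the pmf from 0 to k."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k + 1))

n, p = 100, 0.5
at_most_55 = binom_cdf(55, n, p)                       # P(X <= 55)
at_least_55 = 1 - binom_cdf(54, n, p)                  # P(X >= 55) = 1 - P(X <= 54)
between = binom_cdf(55, n, p) - binom_cdf(44, n, p)    # P(45 <= X <= 55)
print(round(at_most_55, 4), round(at_least_55, 4), round(between, 4))
```

Note how "at least 55" subtracts P(X≤54), not P(X≤55), and "between 45 and 55" subtracts P(X≤44): for a discrete rv the endpoints matter.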

Again we have formulas for the population mean and standard deviation: say X~Bin(n,p) then
Population Mean: m=np
Population Standard Deviation: s=√(npq)

Example: Say we know that 60% of the general population prefer Coke over Pepsi. If we ask 500 randomly selected people whether they prefer Coke over Pepsi, what are the mean and the standard deviation of the number of people who say "yes"?
Let X be the number who say "yes", then X~Bin(500,0.6), therefore
m=500×0.6 = 300 and s=√(500×0.6×0.4) = 10.95

Example Say we want to do a mail survey, that is we send letters with questionnaires to randomly selected people and hope they fill it out and send it back. From long experience it is known that such surveys have a "return rate" of about 25%, that is only 1 in 4 people send their survey back. How many surveys do we need to send out to be 99% sure to get at least 100 back?
Say we send out n questionnaires. Let the rv X be the number of questionnaires we get back, then X~Bin(n,0.25). We need to solve the equation P(X≥100) = 0.99.
Solution by "Trial and error": We have X~Bin(n,0.25) and we want to solve the equation P(X≥100)=0.99, but then
P(X≥100)=1-P(X≤99)=0.99, or P(X≤99)=1-0.99=0.01
Now just play around with different values of n until you find one that works: enter 99 in c1, go to Calc > Probability Distributions > Binomial, check Cumulative Probability, enter c1, your choice of n and probability 0.25, and check whether the resulting probability is about 0.01.
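The trial and error can also be automated. The sketch below is one possible way to do it in Python, not the method used in class: it starts at n=100 (we cannot get 100 surveys back from fewer letters) and increases n by one until P(X≥100) first reaches 0.99. The function names and default arguments are made up for illustration.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Bin(n, p), summing the pmf from 0 to k."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k + 1))

def smallest_n(target_returns=100, p=0.25, confidence=0.99):
    """Smallest n with P(X >= target_returns) >= confidence, where X ~ Bin(n, p)."""
    n = target_returns
    # P(X >= target) = 1 - P(X <= target - 1), and it grows as n grows
    while 1 - binom_cdf(target_returns - 1, n, p) < confidence:
        n += 1
    return n

n = smallest_n()
print(n, 1 - binom_cdf(99, n, 0.25))
```

Because P(X≥100) increases with n, the first n that passes the check is the smallest number of surveys that does the job.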

For more on the binomial distribution see page 247 of the textbook.

Normal (Gaussian) Distribution

This (for good reasons we will see shortly) is the most important distribution of them all! First, it is already familiar to you because it results in data that gives bell-shaped histograms.
A normal random variable has two parameters, denoted by m and s. As a notation we use X~N(m , s)
What is the meaning (interpretation) of the parameters? It is of course that m is the population mean and s is the population standard deviation.
Example: In the next figure we have 4 examples of normal densities with different means and standard deviations, drawn on the same scale:

An important special case is Z~N(0,1) which is called the standard normal. Actually any normal can be turned into a standard normal:
If X~N(m , s), then Z=(X-m)/s~N(0,1)
or vice versa:
If Z~N(0,1) and X=m+sZ, then X~N(m , s)

How do we find probabilities from a normal distribution? First, the only kind of probability we find is of the form P(X<x). Those are found as follows:
• Minitab: Enter x (or x's) in a column, say c1. Then Calc > Probability Distributions > Normal, check Cumulative Probability, enter c1, m and s.
• Table: look up the probability in the standard normal (z) table.

For a binomial rv X we could find two kinds of probabilities, P(X=x) and P(X≤x). This was possible because a binomial rv is a discrete rv. For a normal rv X (and in general for any continuous rv) it turns out that P(X=x)=0, no matter what x is. A consequence of this is that for a normal rv X we have P(X≤x) = P(X<x). So the only probabilities we need to find are of the form P(X<x). So we have

Example If X~Bin(n,p) then P(X≥5) = 1-P(X≤4)

Example If X~N(m,s) then P(X≥5) = 1-P(X<5) ≠ 1-P(X≤4)

Of course MINITAB or the tables give only one kind of probability, how do we find others? We need two formulas:
•P(X>x) = 1-P(X<x)
•P(x<X<y) = P(X<y) - P(X<x)

Example: Say X~N(10,3).

Find P(X>13)
• MINITAB: P(X>13) = 1-P(X<13) = 1-0.8413 = 0.1587
• Tables: P(X>13) =
P((X-m)/s > (13-10)/3) =
P(Z>1.00) = 1-P(Z<1.00) = 1-0.8413 = 0.1587

In order to use the tables the trick is to immediately change from X to Z (using Z=(X-m)/s).

Find P(8<X<11)
• MINITAB: P(8<X<11) = P(X<11)-P(X<8) = 0.3780
• Tables: P(8<X<11) =
P( (8-10)/3 < (X-m)/s < (11-10)/3) =
P(-0.67 < Z < 0.33) =
P( Z<0.33) - P(Z<-0.67) =
0.6293 - 0.2514 = 0.3779
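The same two probabilities can be computed with Python's standard library, which is offered here only as an alternative to MINITAB and the tables. The class statistics.NormalDist (Python 3.8 or later) takes the mean and the standard deviation, matching the N(m, s) notation of these notes; the variable names are arbitrary.

```python
from statistics import NormalDist

X = NormalDist(mu=10, sigma=3)        # X ~ N(10, 3): mean 10, standard deviation 3

p_above_13 = 1 - X.cdf(13)            # P(X > 13) = 1 - P(X < 13)
p_between = X.cdf(11) - X.cdf(8)      # P(8 < X < 11) = P(X < 11) - P(X < 8)
print(round(p_above_13, 4), round(p_between, 4))

# The table route: switch to Z = (X - m)/s and use the standard normal
Z = NormalDist()                      # Z ~ N(0, 1)
print(round(1 - Z.cdf((13 - 10) / 3), 4))
```

The first and last printed probabilities agree because standardizing does not change the probability, only the scale on which it is looked up.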

Example The scores on a certain standardized test are known to have a normal distribution with mean 100 and standard deviation 30.
What is the probability that a person taking this test scores above 115?
We have X~N(100,30).

• (with MINITAB:) P(X>115) = 1-P(X<115) = 0.3085
• (with tables:) P(X>115) =
P((X-100)/30 >(115-100)/30) =
P(Z>0.5) =
1-P(Z<0.5) = 1-0.6915 = 0.3085

What is the probability that a person taking this test scores between 90 and 110 points?
• (with MINITAB:) P(90<X<110) = P(X<110)-P(X<90) = 0.6306-0.3694 = 0.2611
• (with tables:)
P(90<X<110) =
P((90-100)/30<(X-m)/s<(110-100)/30) =
P(-0.33<Z<0.33) =
P( Z<0.33) - P(Z<-0.33) =
0.6293 - 0.3707 = 0.2586

Notice the difference in the two answers. This is due to the rounding in the tables of 1/3 = 0.3333... to 0.33.

Sometimes we have to solve the inverse problem, that is we are given the probability and want to find the corresponding x, that is we want to solve the equation p=P(X<x) for x.

• (MINITAB) Say p = P(X<x): Enter p (or p's) in a column, say c1. Then Calc > Probability Distributions > Normal, check Inverse Cumulative Probability, enter c1, m and s.
• (Tables) Solve the corresponding problem for a standard normal (p=P(Z<z)) by finding the number inside the z table closest to p, and then
x=m+sz

Example (Above cont.) In the above test what score x is such that 75% of people taking the test will score less than x?
We have to solve the equation P(X<x) = 0.75. We find

• (with MINITAB:) x = 120.23
• (with tables:) We need z such that P(Z<z)=0.75. The number in the z table closest to 0.75 is z=0.67. So x=m+sz = 100+30*0.67 = 120.1

Example (Above cont.) In the above test what score x is such that 25% of people taking the test will score less than x?
We have to solve the equation P(X<x) = 0.25. We find

• (with MINITAB:) x = 79.76
• (with tables:) We need z such that P(Z<z)=0.25. The number in the z table closest to 0.25 is z=-0.67. So x=m+sz = 100+30*(-0.67) = 79.9
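The inverse problem can also be sketched in Python. Here NormalDist.inv_cdf from the standard library plays the role of MINITAB's "Inverse Cumulative Probability"; the variable names are made up for illustration.

```python
from statistics import NormalDist

scores = NormalDist(mu=100, sigma=30)   # test scores, X ~ N(100, 30)

q3 = scores.inv_cdf(0.75)   # the x with P(X < x) = 0.75
q1 = scores.inv_cdf(0.25)   # the x with P(X < x) = 0.25
print(round(q3, 2), round(q1, 2))

# Table-style route: solve for z in the standard normal, then x = m + s*z
z = NormalDist().inv_cdf(0.75)
print(round(z, 2), round(100 + 30 * z, 2))
```

The small gap between these values and the table answers comes from the table's z being rounded to two decimals.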

Remark: There is a connection here to the percentiles discussed earlier. Above we actually found the 1st and 3rd population quartiles of this test!

Remark Recall that P(X<m)=0.5. This implies:
if p<0.5 then x<m
if p>0.5 then x>m

Example (Above cont.)
for p=0.75 we found x=120.23>100=m
for p=0.25 we found x=79.76<100=m

For more on the normal distribution see page 270 of the textbook.

Central Limit Theorem

Why is the normal distribution so important? The reason is the central limit theorem which states that under some very general conditions the sample mean has (approximately) a normal distribution, no matter what the distribution of the observations.

Example As an illustration consider the MACRO normmix: it generates 1000 observations from some bimodal random variable, clearly not from a normal distribution. But if we generate several columns of this data, and then find the mean, those means start to look like observations from a normal distribution:

%k:\3015\normmix c1
Graph> Histogram (with curvefit) shows that the data is very far from a normal distribution

%k:\3015\normmix c2
Calc > Row Statistics, Mean, Input Variables c1 c2
Graph> Histogram (with curvefit) shows that the data is still far from a normal distribution but it starts to look much better

The next graph shows the histograms for n=1 to n=6:

So even though we often don't know the distribution of the population, if we are interested in the mean and if we have enough data we can often use the normal distribution to find probabilities etc.
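The experiment with normmix can be imitated in Python. The sketch below rests on an assumption: the exact distribution generated by normmix is not given in these notes, so an equal mixture of N(0,1) and N(6,1) is used as a stand-in bimodal population. All names and the seed are made up for illustration.

```python
import random
from statistics import mean, stdev

rng = random.Random(3015)    # arbitrary fixed seed for a repeatable run

def normmix_draw():
    """One observation from a bimodal mixture (an assumed stand-in for normmix)."""
    center = 0.0 if rng.random() < 0.5 else 6.0
    return rng.gauss(center, 1.0)

def row_means(n, rows=1000):
    """For each of `rows` rows, the mean of n mixture observations."""
    return [mean([normmix_draw() for _ in range(n)]) for _ in range(rows)]

def share_near_center(values, width=1.0):
    """Fraction of values within `width` of 3, the valley between the two peaks."""
    return sum(abs(v - 3.0) < width for v in values) / len(values)

for n in (1, 2, 6, 30):
    m = row_means(n)
    print(n, round(mean(m), 2), round(stdev(m), 2), round(share_near_center(m), 2))
```

For n=1 almost no observations fall near 3, because that is the gap between the two peaks. As n grows the means pile up around 3 and their spread shrinks, which is the bell shape the central limit theorem predicts.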