Limit Theorems

Limits

say we have a sequence of numbers an. Then there is just one definition of a "limit", namely an→a iff for every e>0 there exists an ne such that |an-a|<e for all n>ne

Things already get a little more complicated if we go to sequences of functions. Here there are two ways in which they can converge:
Pointwise Convergence: fn(x)→f(x) pointwise for all xS iff for every xS and every e>0 there exists an ne,x such that |fn(x)-f(x)|<e for all n>ne
Uniform Convergence: fn(x)→f(x) uniformly for all xS iff for every xS and every e>0 there exists an ne (independent of x!) such that |fn(x)-f(x)|<e for all n>ne
and there is a simple hierarchy: uniform convergence implies pointwise convergence but not vice versa

Now when we go to probabilities it gets a bit more complicated. Say we have a sequence of rv's Xn with means mn and cdf's Fn, and a rv X with mean m and cdf F. Then we have:

Convergence in Mean Xn→X in mean iff mnm
Convergence in Distribution (Law) Xn→X in law iff Fn(x)→F(x) pointwise for all x where F is continuous
Convergence in Probability Xn→X in probability iff for every e>0 limn→∞P(|Xn-X|≥e)=0
Almost Sure Convergence Xn→X almost surely iff for every e>0 P(limn→∞|Xn-X|<e)=1

and now unfortunately there is no simple hierarchy either. Here are some relationships, without proofs:

Theorem

a) convergence in probability implies convergence in distribution. The reverse is true if the limit is a constant.
b) almost sure convergence implies convergence in probability, but not vice versa

Example Let the sample space S be [0,1] with the uniform distribution. Define X1, X2 etc as follosw:

X1(s)=s+I[0,1](s)
X2(s)=s+I[0,½](s)
X3(s)=s+I[½,1](s)
X4(s)=s+I[0,1/3](s)
X5(s)=s+I[1/3,2/3](s)
X6(s)=s+I[2/3,1](s)
X7(s)=s+I[0,¼](s)
and so on

Let X(s)=s. Now for each n there is an m and an i such Xn(s)=s+I[i/m,(i+1)/m](s) and as n→∞ also m→∞. So
P(|Xn-X|) = P(I[i/m,(i+1)/m])=1/m→0, so Xn converges to X in probability. However, vor every s[0,1] Xn(s)=s infinitely often and Xn(s)=s+1 infinitely often, so |Xn(s)-X(s)| alternates between 0 and 1 but does not converge. Therefore Xn does not converge to X almost surely.

(Weak) Law of Large Numbers

Let X1, X2, ... be a sequence of independent and identically distributed (iid) r.v.'s having mean m. Then coverges to m in probability

proof (assuming in addition that V(Xi)=s2 < )

We apply Chebyshev's inequality to Zn:

It is best to think of this (and other) limits theorems not as one theorem but as a family of theorems, all with the same conclusion but with different conditions. For example there are weak laws even if the Xn's are not independent, don't have the same mean and don't have even have finite standard deviations.

This theorem forms the bases of (almost) all simulation studies: say we want to find a parameter q of a population. We can generate data from a random variable X with pdf (pmf) f(x|q) such that Eh(X) = q. Then by the law of large numbers

(Strong) Law of Large Numbers

Let X1, X2, ... be a sequence of independent and identically distributed (iid) r.v.'s having mean m. Then coverges to m almost surely

proof: find it in most standard textbooks

Example : in a game a player rolls 5 fair dice. He then moves his game piece along k fields on a board, where k is the smallest number on the dice + largest number on the dice. For example if his dice show 2, 2, 3, 5, 5 he moves 2+5 = 7 fields. What is the mean number of fields q a player will move?
To do this analytically would be quite an excercise. To do it via simulation is easy:
Let X be an independent random vector of length 5, with X[j] 1,..,6 and P(X[j]=k)=1/6
let h(x) = min(x)+max(x), then Eh(X) = q

Let X1, X2, .. be iid copies of X, then by the law of large numbers

The simulation is implemented in exminmax

Example : A company has a system (website, computers, telephone bank, machine, ...) that is mission-critical, and so they have two identical systems with the second one taking over automatically when the first one fails. From experience they know that each systems failure time has an exponential distribution with mean 84.6 hours. Once a system is down its repair time has a U[1,5] distribution. What is the probability that both systems are down simultaneously?
Note: there is the possibility that both systems go down, system 1 gets fixed but breaks again before system 2 is fixed, so that they both are down immediately. The probability of this happening is so small though that we will ignore this.
Let Ti, i=1,2 be the failure time of system i. Let R be the repair time of system 1. Then the probability of both systems being down is P(R > T2) = P(R-T2>0) = E[I(0, )(R-T2)]
So we have h(R,T2) = I(0, )(R-T2)

The simulation is implemented in exsystems1

Example , cont. What is the mean time per year when both systems are down?
Let Ti(k) be the failure times of system i in order with k=1 the first time, k=2 the second time etc.. Let R(k) be the same for the repair time. So we are looking at a sequence T1(1), R(1),T1(2), T2(2) etc.

The simulation is implemented in exsystems2

Example , cont. Say the company has the option to contract another repair man, which would lower the repair time such that then R~U[1,3]. It costs the company $8950 per hour when the system is down. How much can they pay this new repair man so it is a good idea to contract him?
We found that without the new guy we have a downtime of about 6 hours per year for a yearly cost of $53700. If we higher the new guy we have an annual downtime of about 2.5 hours for a total cost of $22400. So we should pay him at most $31300 per year.

Example , cont. Say we have a contract with our main customer that specifies that our downtime can not exceed 10 hours per year, otherwise we have to pay a fine. We decide we are willing to accept a 10% chance that we exceed the time limit. Should we proceed with these conditions?

The simulation is implemented in exsystems3 We see that there is about a 15% chance of the failure time exceeding 10 hours, so we should not proceed.

Recall: a random variable X is said to be normally distributed with mean m and variance s2 if it has density:

If m=0 and s=1 we say X has a standard normal distribution.

We use the symbol F for the distribution function of a standard normal r.v., so

Central Limit Theorem


Let X1, X2, .. be an iid sequence of r.v.'s with mean m and standard deviation s. Let n be the sample mean of the first n observations. Then

proof for the proof we will make an even stronger assumption, namely that the mgf of Xn exists in an open neighborhood of 0. We will show that the mgf's of √n((n-m)/s) converge to the mgf of a standard normal rv, which is given by

Let Yn=(Xn-m)/s, then √n((n-m)/s) = 1/√n·∑Yi, so

We now expand this into a Taylor series:

because EY0=E1=1, EY1=0 and EY2=1.
An application of Taylor's theorem shows the rest term nR(t/√n) goes to 0 as n→∞. So

where we use a well-known lemma from real analysis: if an→a, then (1+an/n)n→ea

Example Let's do a simulation to illustrate the CLT: we will use the most basic r.v. of all, called a Bernoulli r.v. which has P(X=0)=1-p and P(X=1)=p (Think indicator function for the coin toss}. So we sample n Bernoulli r.v. with "success paramater p" and find their sample mean. Note that

The simulation is done in the routine cltexample1
cltexample1(0.5,10) shows that this approximation is quite good even for small n if p is "nice" (neither small nor large)
cltexample1(0.001,100) shows that this approximation is quite bad even for large n if p is "bad" (very small or very large)

The version of the CLT for Bernoulli rv's is famous all by itself, it is called the DeMoivre-Laplace theorem. It was the first CLT with a rigorous proof.

Example In cltexample2 we generate a sequence of rv's which are not independent, don't have the same mean or standard deviation and the CLT holds anyway!

Example In cltexample3 we generate a sequence of rv's which are independent but the CLT does not hold!

Example Let's return to the example of the company with the mission-critical system above. Let D be the total downtime per year. D has some strange distribution, but it also is the sum of independent random variables, although they are not iid because some of them are exponentials and others are uniform. In exsystems4 we rerun the simulation in exsystem3, and this time we also find the sample mean and standard deviation of the simulated D's. Finally we overlay the histogram (scaled to integrate to 1) with the corresponding density. The histogram shows that D is somewhat normal anyway.
We find mD=6.0 and sD=4.0. So

Example Maybe the most important quantity in Statistics is the sample mean =1/n∑Xi. Here is an example: say the ages of people in a town have some distribution with mean 31.37 and standard deviation 12.34. If we randomly select a person, what is the probability that person is over 35 years old?
We have a rv X with m=31.37 and s=12.34. We want P(X>35.0) but we don't know the density of X, so there is no way to do this.
Let's say we could sample 25 people, what is the probability that their mean age is over 35? Now we want P(> 35.0), but

Example The CLT has found applications in just about any field of mathematics or science. Here is an application to number theory:

Erdös-Kac CLT
Say we pick an integer at random from {1,2,..,n} Then the interger has about loglog(n)+F(√loglog(n)) prime divisors.

As we saw above, the CLT is really a family of theorems, all with the same conclusion but with different assumptions. In fact, there are probably a 1000 different CLT's! Here is what is probably the most famous of them:

Lindeberg-Feller
For each n, let Xn,m, 1≤m≤n, be independent random variables with EXn,m=0. Suppose