Inequalities and Limit Theorems

Two very useful inequalities

Markov's Inequality:
If X takes on only nonnegative values, then for any a>0

$$P(X \ge a) \le \frac{E[X]}{a}$$

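proof (for a continuous X with density f; in the discrete case replace the integrals with sums):

$$E[X] = \int_0^\infty x f(x)\,dx \ge \int_a^\infty x f(x)\,dx \ge a\int_a^\infty f(x)\,dx = a\,P(X \ge a)$$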

Chebyshev's Inequality:
If X is a r.v. with mean μ and variance σ², then for any k>0:

$$P(|X-\mu| \ge k\sigma) \le \frac{1}{k^2}$$

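proof: apply Markov's inequality to the nonnegative r.v. (X-μ)² with a = k²σ²:

$$P(|X-\mu| \ge k\sigma) = P\left((X-\mu)^2 \ge k^2\sigma^2\right) \le \frac{E[(X-\mu)^2]}{k^2\sigma^2} = \frac{1}{k^2}$$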

Example Consider the uniform random variable with f(x) = 1 if 0<x<1, 0 otherwise. We already know that μ=0.5 and σ=1/√12 = 0.2887. Now Chebyshev says
P(|X-0.5|>k·0.2887)≤1/k²

For example

P(|X-0.5|>1·0.2887)≤1/1² = 1 (rather boring!)
or
P(|X-0.5|>3·0.2887)≤1/3² = 1/9

Actually P(|X-0.5|>0.866) = 0, because |X-0.5| can never exceed 0.5, so this is not a very good upper bound.
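To see how loose the bound is for other values of k, here is a minimal Python sketch comparing the Chebyshev bound with the exact tail probability of the uniform example. The function names are made up for this illustration; the exact formula uses the fact that |X-0.5| is uniform on (0, 0.5).

```python
import math

def chebyshev_bound(k):
    """Chebyshev upper bound on P(|X - mu| > k*sigma), capped at 1."""
    return min(1.0, 1.0 / k**2)

def exact_tail_uniform(k):
    """Exact P(|X - 0.5| > k*sigma) for X ~ U(0,1), sigma = 1/sqrt(12)."""
    t = k / math.sqrt(12.0)
    # |X - 0.5| is uniform on (0, 0.5), so the tail is 1 - 2t until it hits 0
    return max(0.0, 1.0 - 2.0 * t)

for k in (1, 1.5, 2, 3):
    print(k, round(chebyshev_bound(k), 4), round(exact_tail_uniform(k), 4))
```

The bound holds for every distribution with a finite variance, which is why it can be far from sharp for any particular one.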

(Weak) Law of Large Numbers

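Let X1, X2, ... be a sequence of independent and identically distributed (iid) r.v.'s having mean μ. Then for all ε>0

$$P\left(\left|\frac{X_1+\dots+X_n}{n}-\mu\right| \ge \varepsilon\right) \to 0 \quad \text{as } n\to\infty$$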

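proof (assuming in addition that V(Xi)=σ²<∞)
Let Zn = (X1+..+Xn)/n. By independence,

$$E[Z_n] = \mu, \qquad V(Z_n) = \frac{\sigma^2}{n}$$

We apply Chebyshev's inequality to Zn, choosing k so that k·σ/√n = ε:

$$P(|Z_n-\mu| \ge \varepsilon) \le \frac{\sigma^2}{n\varepsilon^2} \to 0 \quad \text{as } n\to\infty$$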

This theorem forms the basis of (almost) all simulation studies: say we want to find a parameter θ of a population. We can generate data from a random variable X with pdf (pmf) f(x|θ) such that E[h(X)] = θ. Then by the law of large numbers

$$\frac{1}{n}\sum_{i=1}^n h(X_i) \to E[h(X)] = \theta \quad \text{in probability}$$

Example : in a game a player rolls 5 fair dice. He then moves his game piece along k fields on a board, where k is the smallest number on the dice + largest number on the dice. For example if his dice show 2, 2, 3, 5, 5 he moves 2+5 = 7 fields. What is the mean number of fields θ a player will move?
To do this analytically would be quite an exercise. To do it via simulation is easy:
Let X be a random vector of length 5 with independent components, where X[j] takes the values 1,..,6 and P(X[j]=k)=1/6
let h(x) = min(x)+max(x), then Eh(X) = θ

Let X1, X2, .. be iid copies of X, then by the law of large numbers

$$\frac{1}{n}\sum_{i=1}^n h(X_i) \to \theta$$

The simulation is implemented in exminmax.
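The exminmax routine itself is not reproduced in these notes. As a stand-in, here is a minimal Python sketch of the same idea; the names h and simulate_theta, the seed and the number of runs are choices made for this illustration only.

```python
import random

def h(x):
    """Number of fields moved: smallest die plus largest die."""
    return min(x) + max(x)

def simulate_theta(n, seed=0):
    """Average of h over n simulated rolls of 5 fair dice."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n):
        roll = [rng.randint(1, 6) for _ in range(5)]  # 5 independent fair dice
        total += h(roll)
    return total / n

print(h([2, 2, 3, 5, 5]))        # the example from the text: 2 + 5
print(simulate_theta(100_000))   # estimate of theta
```

As a check on the output: 7-X has the same distribution as X and min(7-X) = 7-max(X), so E[min(X)] + E[max(X)] = 7, and the estimate should settle near 7 as n grows.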

Central Limit Theorem

This is one of the most famous theorems in all of mathematics / statistics. Without it, statistics as a science would not have existed until very recently.

We first need the definition of a normal (or Gaussian) r.v.:
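A random variable X is said to be normally distributed with mean μ and variance σ² if it has density:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad -\infty < x < \infty$$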

If μ=0 and σ=1 we say X has a standard normal distribution.
We use the symbol Φ for the distribution function of a standard normal r.v., so

$$\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\,dt$$

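Let X1, X2, .. be an iid sequence of r.v.'s with mean μ and standard deviation σ. Then for all z

$$\lim_{n\to\infty} P\left(\frac{\sqrt{n}\,(\bar{X}-\mu)}{\sigma} \le z\right) = \Phi(z)$$

where X̄ = (X1+..+Xn)/n is the sample mean of the first n observations.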

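Note that

$$E\left[\frac{\sqrt{n}\,(\bar{X}-\mu)}{\sigma}\right] = 0, \qquad V\left[\frac{\sqrt{n}\,(\bar{X}-\mu)}{\sigma}\right] = \frac{n}{\sigma^2}\cdot\frac{\sigma^2}{n} = 1$$

so the scaling in the CLT is just right to match the standard normal r.v.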

Let's do a simulation to illustrate the CLT: we will use the most basic r.v. of all, called a Bernoulli r.v., which has P(X=0)=1-p and P(X=1)=p (think of the indicator function for a coin toss). So we sample n Bernoulli r.v.'s with "success parameter" p and find their sample mean. Note that

$$E[X] = p, \qquad V(X) = p(1-p)$$

so by the CLT √n(X̄-p)/√(p(1-p)) is approximately standard normal for large n.

The simulation is done in the routine cltexample 1
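The cltexample routine is not shown here. The following Python sketch is an independent illustration, with n=200, p=0.3, 5000 repetitions and all function names chosen only for this example. It standardizes the sample mean of n Bernoulli r.v.'s and compares the empirical distribution function with Φ, which is computed from math.erf.

```python
import math
import random

def phi(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def standardized_means(n, p, reps, seed=0):
    """reps draws of sqrt(n)*(xbar - p)/sqrt(p(1-p)) for Bernoulli(p) samples of size n."""
    rng = random.Random(seed)
    sd = math.sqrt(p * (1.0 - p))
    out = []
    for _ in range(reps):
        xbar = sum(rng.random() < p for _ in range(n)) / n
        out.append(math.sqrt(n) * (xbar - p) / sd)
    return out

z = standardized_means(n=200, p=0.3, reps=5000)
for c in (-1.0, 0.0, 1.0):
    empirical = sum(v <= c for v in z) / len(z)
    print(c, round(empirical, 3), round(phi(c), 3))
```

Because the sample mean of Bernoulli r.v.'s only takes values on a grid, the agreement at a single point can be off by a few percent even for moderately large n; it improves as n increases.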

Approximation Methods

Say we have a r.v. X with density f, a function h and we want to know V(h(X)). Of course by definition we have

$$V(h(X)) = \int h(x)^2 f(x)\,dx - \left(\int h(x) f(x)\,dx\right)^2$$

but sometimes these integrals (sums) are very difficult to evaluate. In this section we discuss some methods for approximating the variance.

Recall: If a function h(x) has derivatives of order r, that is if h^(r)(x) exists, then for any constant a the Taylor polynomial of order r about a is defined by

$$T_r(x) = \sum_{i=0}^{r} \frac{h^{(i)}(a)}{i!}\,(x-a)^i$$

One of the most famous theorems in mathematics, Taylor's theorem, states that the remainder of the approximation, h(x)-Tr(x), goes to 0 faster than the highest-order term:
Taylor's theorem: if h^(r)(a) exists, then

$$\lim_{x\to a} \frac{h(x)-T_r(x)}{(x-a)^r} = 0$$

There are various formulas for the remainder term, but we won't need them here.

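Example : say h(x) = log(x+1) and we want to approximate h near x=0, so a=0. Then we have

$$h'(x) = \frac{1}{x+1}, \quad h''(x) = -\frac{1}{(x+1)^2}, \quad \dots, \quad h^{(i)}(0) = (-1)^{i-1}(i-1)!$$

$$T_r(x) = \sum_{i=1}^{r} (-1)^{i-1}\frac{x^i}{i} = x - \frac{x^2}{2} + \frac{x^3}{3} - \dots + (-1)^{r-1}\frac{x^r}{r}$$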

The approximation is illustrated in taylor
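As a rough numerical companion (not the taylor routine itself), the sketch below evaluates the Taylor polynomial of log(1+x) about 0 for a few orders r and compares it with math.log. The function name taylor_log1p is made up for this example.

```python
import math

def taylor_log1p(x, r):
    """Taylor polynomial of order r for log(1 + x) about a = 0."""
    return sum((-1) ** (i - 1) * x**i / i for i in range(1, r + 1))

for x in (0.1, 0.5):
    for r in (1, 2, 4):
        approx = taylor_log1p(x, r)
        error = abs(approx - math.log(1.0 + x))
        print(x, r, round(approx, 6), round(error, 6))
```

The error shrinks as r grows and is much smaller at x=0.1 than at x=0.5, in line with the theorem: the approximation is local to a=0.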

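For our purposes we will need only first-order approximations (that is, using the first derivative), but we will need a multivariate extension as follows: say X1, ..,Xn are r.v.'s with means μ1, .. ,μn and define X=(X1, ..,Xn) and μ=(μ1, .. ,μn). Suppose there is a differentiable function h(X) for which we want an approximation of the variance. Define

$$h_i'(\mu) = \frac{\partial}{\partial t_i} h(t)\Big|_{t_1=\mu_1,\dots,t_n=\mu_n}$$

The first order Taylor expansion of h about μ is

$$h(t) = h(\mu) + \sum_{i=1}^{n} h_i'(\mu)(t_i-\mu_i) + \text{Remainder}$$

Forgetting about the remainder we have

$$E[h(X)] \approx h(\mu) + \sum_{i=1}^{n} h_i'(\mu)\,E[X_i-\mu_i] = h(\mu)$$

and

$$V(h(X)) \approx E\left[(h(X)-h(\mu))^2\right] \approx E\left[\left(\sum_{i=1}^{n} h_i'(\mu)(X_i-\mu_i)\right)^2\right] = \sum_{i=1}^{n} [h_i'(\mu)]^2\, V(X_i) + 2\sum_{i>j} h_i'(\mu)\,h_j'(\mu)\,\mathrm{cov}(X_i,X_j)$$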

Example : say we have a sample X1, ..,Xn from a Bernoulli r.v. with success parameter p. One popular measure of the probability of winning a game is the odds p/(1-p). For example when you roll a fair die the odds of getting a six are (1/6)/(1-(1/6)) = 1/5, that is 1:5.
An obvious estimator for p is p̂, the sample mean, or here the proportion of "successes" in the n trials. Then an obvious estimator for the odds is p̂/(1-p̂). The question is, what is the variance of this estimator?
Using the above approximation we get the following: let h(p)=p/(1-p), so h'(p)=1/(1-p)² and

$$V\left(\frac{\hat{p}}{1-\hat{p}}\right) \approx [h'(p)]^2\, V(\hat{p}) = \frac{1}{(1-p)^4}\cdot\frac{p(1-p)}{n} = \frac{p}{n(1-p)^3}$$

The routine varapp1 illustrates how good an approximation this is.
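A check of this kind can be sketched in a few lines of Python. This is not the varapp1 routine; the function names and the values p=0.3, n=200 and 4000 repetitions are assumptions chosen for this illustration. It compares the first-order approximation p/(n(1-p)³) with the sample variance of simulated odds estimates.

```python
import random

def approx_var_odds(p, n):
    """First-order approximation of V(phat/(1-phat))."""
    return p / (n * (1.0 - p) ** 3)

def simulated_var_odds(p, n, reps, seed=0):
    """Sample variance of phat/(1-phat) over reps simulated Bernoulli samples."""
    rng = random.Random(seed)
    vals = []
    for _ in range(reps):
        phat = sum(rng.random() < p for _ in range(n)) / n
        if phat < 1.0:  # the odds are undefined if every trial is a success
            vals.append(phat / (1.0 - phat))
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / (len(vals) - 1)

approx = approx_var_odds(0.3, 200)
sim = simulated_var_odds(0.3, 200, reps=4000)
print(round(approx, 5), round(sim, 5))
```

The approximation drops all higher-order terms, so it tends to be closer for large n and for p not too near 1, where h'(p) changes quickly.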

Example : We have a rv X~U[0,1], and a rv Y~U[0,X]. Find an approximation of V[Y/(1+Y)]
Note: this is called a hierarchical model.
We have:
1) fX(x)=1 if 0<x<1, 0 otherwise
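2) fY|X=x(y|x)=1/x if 0<y<x, 0 otherwise
Now let h(y)=y/(1+y), so h'(y)=1/(1+y)². Because Y|X=x is uniform on [0,x], we have E[Y|X]=X/2 and V[Y|X]=X²/12, and therefore

$$E[Y] = E\big[E[Y|X]\big] = E[X/2] = \frac{1}{4}$$

$$V[Y] = E\big[V[Y|X]\big] + V\big[E[Y|X]\big] = \frac{E[X^2]}{12} + \frac{V[X]}{4} = \frac{1}{36} + \frac{1}{48} = \frac{7}{144}$$

$$V\left[\frac{Y}{1+Y}\right] \approx [h'(1/4)]^2\, V[Y] = \left(\frac{16}{25}\right)^2 \cdot \frac{7}{144} = \frac{112}{5625} \approx 0.0199$$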
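The hierarchical model is easy to simulate directly, which gives a check on the first-order approximation. The Python sketch below is an illustration with made-up function names. It uses E[Y]=1/4 and V[Y]=7/144, which follow from conditioning on X, and compares the resulting approximation with the sample variance of simulated values of Y/(1+Y).

```python
import random

def delta_method_var():
    """First-order approximation of V[Y/(1+Y)] for X ~ U[0,1], Y|X=x ~ U[0,x]."""
    mu_y = 1.0 / 4.0                   # E[Y] = E[X/2]
    var_y = 7.0 / 144.0                # V[Y] = E[X^2]/12 + V[X]/4
    slope = 1.0 / (1.0 + mu_y) ** 2    # h'(mu_y) for h(y) = y/(1+y)
    return slope**2 * var_y

def simulated_var(reps, seed=0):
    """Sample variance of Y/(1+Y) from reps draws of the hierarchical model."""
    rng = random.Random(seed)
    vals = []
    for _ in range(reps):
        x = rng.random()          # X ~ U[0,1]
        y = rng.uniform(0.0, x)   # Y | X=x ~ U[0,x]
        vals.append(y / (1.0 + y))
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / (len(vals) - 1)

approx = delta_method_var()
sim = simulated_var(100_000)
print(round(approx, 4), round(sim, 4))
```

In this example the simulated variance typically comes out somewhat below the approximation, a reminder that a first-order expansion ignores the curvature of h.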

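Example : let's consider the random vector (X,Y) with joint pdf f(x,y) = 6x, 0 < x < y < 1. Say we want to find V(X/Y). Then if we consider the function h(x,y) = x/y we have

$$\frac{\partial h}{\partial x} = \frac{1}{y}, \qquad \frac{\partial h}{\partial y} = -\frac{x}{y^2}$$

so the first-order approximation is

$$V(X/Y) \approx \frac{1}{\mu_Y^2}\,V[X] + \frac{\mu_X^2}{\mu_Y^4}\,V[Y] - 2\,\frac{\mu_X}{\mu_Y^3}\,\mathrm{cov}(X,Y)$$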

Now we need to find μX=E[X], V[X], μY=E[Y], V[Y] and cov(X,Y):
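Before working these out by integration, the five quantities can be estimated by simulation and plugged into the first-order formula. The Python sketch below does this and also estimates V(X/Y) directly for comparison. The sampling scheme is an assumption of this sketch, not something taken from the notes: Y has marginal density 3y² on (0,1), and given Y=y the density of X is 2x/y² on (0,y), so both can be drawn by inverting their distribution functions.

```python
import random

def sample_xy(rng):
    """One draw of (X, Y) from f(x,y) = 6x on 0 < x < y < 1."""
    u = 1.0 - rng.random()   # in (0, 1], so y is never 0
    v = rng.random()
    y = u ** (1.0 / 3.0)     # F_Y(y) = y^3
    x = y * v ** 0.5         # F(x|y) = (x/y)^2
    return x, y

def mean(a):
    return sum(a) / len(a)

def cov(a, b):
    """Sample covariance; cov(a, a) is the sample variance."""
    ma, mb = mean(a), mean(b)
    return sum((s - ma) * (t - mb) for s, t in zip(a, b)) / (len(a) - 1)

rng = random.Random(0)
pairs = [sample_xy(rng) for _ in range(100_000)]
xs = [p[0] for p in pairs]
ys = [p[1] for p in pairs]
mx, my = mean(xs), mean(ys)
vx, vy, cxy = cov(xs, xs), cov(ys, ys), cov(xs, ys)

# first-order approximation of V(X/Y) with h(x,y) = x/y
approx = vx / my**2 + mx**2 * vy / my**4 - 2.0 * mx * cxy / my**3
ratios = [x / y for x, y in pairs]
direct = cov(ratios, ratios)
print(round(approx, 4), round(direct, 4))
```

The printed moments can then be compared with the values obtained analytically.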