Conditioning and Importance Sampling

Conditioning

We have previously seen a famous formula for conditional expectations: E{E[X|Y]}=E[X]. Here is another example:

Example Say (X,Y) is a discrete rv with joint pmf given by
Y
    0 1
  0 0.1 0
X 1 0.1 0.2
  2 0 0.6

So the marginal of X is

X
0 1 2
0.1 0.3 0.6

and E[X]=0·0.1+1·0.3+2·0.6=1.5

Also the marginal of Y is

Y
0 1
0.2 0.8

Now the conditional pmf of X|Y=0 is
X|Y=0
0 1
0.5 0.5

so E[X|Y=0]=0·0.5+1·0.5+2·0=0.5
and the conditional pmf of X|Y=1 is


X|Y=1
1 2
0.25 0.75

so E[X|Y=1]=0·0+1·0.25+2·0.75=1.75.

Now let the rv Z=E[X|Y], then
Z
0.5 1.75
0.2 0.8

and finally E{[X|Y]}=E[Z]=0.5·0.2+1.75·0.8=1.5=E[X]

There is also an equivalent formula for the conditional variance:

V[X]=E[V(X|Y)]+V[E(X|Y)]

Let's see:
V[X]=E[X2]-E[X]2 = 02·0.1+12·0.3+22·0.6 - 1.52 = 0.45

Now V[E(X|Y)] = V[Z] = 0.52·0.2+1.752·0.8 - 1.52 = 2.5 - 2.25 = 0.25

Also V[X|Y] is a rv (just like E[X|Y]) with pmf
V[X|Y=0] = E[X2|Y=0] -E[X|Y=0]2)
V[X|Y=1] = E[X2|Y=1] -E[X|Y=1]2)
so if we set Z1=V[X|Y] we have
Z1
0.25 0.1875
0.2 0.8
and E[V(X|Y)] = E[Z1] = 0.25·0.2+0.1875·0.8 = 0.2

and so V(E[X|Y])+E[V(X|Y)] = 0.2+0.25 = 0.45

vr1 varifies all those calculations.


So, how can we use this formula for reducing the variance of our simulation estimators? Because V[X|Y]>0 always we have V[X]≥V[E(X|Y)] for any rv Y. So say we run a simulation yielding a rv X with E[X]=θ and the simulation yields a second rv Y, such that E[X|Y] is known. Since E{E[X|Y]}=E[X]=θ it follows that E[X|Y] is also an unbiased estimator of θ and has a variance not larger than X itself.

Example Say we would like to use simulation to estimate the value of π(=3.14...). A straight-forward simulation is as follows: generate V1, V2 iid U[-1,1]. If V12+V22≤1 set Zi=1, otherwise 0. Run the simultion n times, then 4(∑Zi)/n is an estimator of π

This is done in pi_est(1)

Now let's use the estimator E[Z|V1] instead of Z. Note

and so (1-V12)½ is a better estimator than Z alone.

This is done in pi_est(2)

Note that this new estimator has another adventage: it needs only one U[0,1] per simulation run.

How much better is it? Let's see:

Importance Sampling

Example say we have a rv X geometric with p=0.5. We want to find P(log(X!)>50).
Let's try to solve this problem analytically. First, log(x!) is an increasing function of x, so there exists x50 such that log(x!)>50 iff x>x50, so that P(log(X!)>50)=P(X≥x50). Finding x50 analytically is hopeless, though. We can do it with R by trial and error, using log(factorial(n)) for different values of n. We find n=22.5, so

or about 2.38·10-7

How about an R check? The problem with this is that the probability p we want to find is very small, so in a simple simulation as shown in vr2(1) we can expect the outcome of interest only about every 1 in 4.2million runs. In order to get some resonably good estimate we probably need to run the simulation with n=109.

Here is an idea: the problem is that our event of interest, log(X!)>50, is very rare, it almost never happens. Let's instead sample from a distribution Y which has large values much more often, so that log(Y!)>50 happens more often. For example, let's try Y geometric with p=0.05, see vr2(2). It seems P(log(Y!)>50)=0.35, but there is a problem, we get a warning from R: value out of range in 'gammafn'. The reason is that R calculates log(y!) by first calculating y! and then taking log, and it finds y! via the gamma function, but for y bigger than about 170 y! can no longer be found that way. But log(175!)=732.33, not so big at all. We need a different way to calculate log(y!). Here are two ideas:

• log(y!) = log(∏k=1yk)=∑k=1ylog(k)
so we could use the R command sum(log(1:y))

• in logfac(y) I calculate log(y!) directly for y from 0 to 9, and then use
log(y!)=0.918938533205+(y+0.5)*log(y)-y+(1/12-1/(360*y2))/y;
this is based on Stirlings' Approximation:

Both of these work, but the second is much faster, so we will use it.

So P(log(Y!)>50)=0.35. But what good is that? I want X! Well:

so if we sample from Y and find the sum here we still get an estimate of the probability for X. This is shown in vr2(3)

In general we have the following: Let X be a rv' with pmf (pdf) f and and Y a rv' with pmf (pdf) g. Say we want to find E[h(X)]. Then

Note this was done for discrete rv's but it works just as well for continuous ones.

Note how to choose Y? Obviously we need Y such that it can't happen that P(Y=x)>0 and P(X=x)=0. In general we should choose a Y with the same "support" as X, that is P(X=x)>0 iff P(Y=x)>0. It is not necessary to have a Y that "looks like" X. For example in the case above we could have choosen Y with pmf fY(x)=6/(π2x2). See vr2(4). It is also a good idea to choose Y such that the event of interest (log(Y!)>50) happens about 50% of the time.

Example say X, Y and Z have a standard normal distribution. Find P(|XYZ|>K), for example K=10
Now there is no way to do this analytically, and again the probability is very small. So we will use IS with X', Y' and Z' generated from normal distributions with mean 0 and standard deviation s. For our case of K=10 s=3 works good. In general, for some K play around a bit to find a good s. See vr3

Example (From a book by Robert and Casella) let X~Cauchy and we want to use simulation to estimate τ=P(X>2)

Note:

Method 1: (Direct Simulation) generate X1, .., Xn iid Cauchy, estimate τ=1/n∑I[2,∞)(Xi)
Note

see vr5(1)

Method 2: (Direct Simulation, using a special feature of the problem) Make use of the fact that a Cauchy is symmetric around 0, so P(X>2)=½P(|X|>2), so generate X1, .., Xn iid Cauchy, estimate τ=1/2n∑I[2,∞)(|X|i)

see vr5(2)

Method 3: ( Direct Simulation, using a special feature of the problem) Make use of the fact that

see vr5(3)

Method 4 : ( Direct Simulation, using a special feature of the problem)

see vr5(4)

Method 5 : (Importance sampling) Let's use the rv Ywith density g(x)=2/x, x>2. Note

so this is actually the same as Method 4, with the same variance, see vr5(5) use the rv Ywith density g(x)=2/x, x>2. Note

so this is actually the same as Method 4, with the same variance, see vr5(5)