MCMC - Markov Chain Monte Carlo

The starting point of this section is the result that if the Markov chain {Xn} is irreducible and ergodic, then
mark1fig10.png - 1299 Bytes
The idea is to use this as follows: say I want to generate data from a distribution p. Now if I can find an irreducible and ergodic Markov chain {Xn} which has p as its stationary measure, we can generate observations from {Xn}, wait a while until its limiting distribution is reached (?) and then take the {Xn} as if they came from p.

The Hastings-Metropolis Algorithm

Let's say we want to generate observations from a distribution p on 1,..,m. Let Q be the transition probability matrix of an irreducible Markov chain on 1,..,m. Define the Markov chain {Xn} as follows: When Xn=i a r.v. X such that P(X=j)=qij is generated. (This of course means we need to know how to generate observations from Q). If X=j, then set Xn+1=j with probability aij and equal to i with probability 1-aij. Now Xn is a Markov chain with transition probabilities given by:
mark2fig1.png - 2681 Bytes

Now this markov chain will be time-reversible and have stationary measure p if
piPij=pjPji for all ji.
This is equivalent to
piqijaij=pjqjiaji
and is easy to check that this will be satisfied if we set
mark2fig2.png - 1506 Bytes

One of the reasons this algorithm is so useful is the following: say we only know the values in p up to a constant, that is we have a sequence {bj,j=1,..m}, bj≥0 and ∑bj=B. We want to generate observations from p with pj=bj/B. Then the above algorithm works without the need to find B because
mark2fig3.png - 2276 Bytes
This is great because For example if we want to generate data from a posterior distribution B is the integral over the marginal distribution, which might be very difficult to find.

With this we get the Hastings-Metropolis Algorithm:
1) Choose an irreducible Markov chain with transition probabilities Q and choose some integer k between 1 and m
2) Let n=0 and X0=k
3) generate a r.v. X such that P(X=j)=qx0j and generate U~U[0,1]
4) If U < bXqX,Xn/bXnqXn,X then NS=X, else NS=Xn
5) n=n+1, Xn=NS
6) Go to 3

Notice the similarities between this algorithm and the accept-reject method. The main differences, and the reason this algorithm is so useful, are that here we don't need to find c, which usually requires a maximization and we don't need B, as discussed above.

Example : Let's begin with a very simple Example , X~N(m,s2). First we need the "proposal" distribution Q. We are actually quite free to make almost any choice here. Let's try the following two: if we are at the point x we choose the next point from a
a) U[x-e,x+e] for some e>0.
b) N(x,e2) for some e>0.

For a) we have qxy=1 if x-e<y<x+e, 0 otherwise, and for b) we get qxy=dnorm(y,x,e). So the algorithm uses:
1a) X=runif(1,Xn-e,Xn+e])
1b) X=rnorm(1,Xn,e)

2a) bXqX,Xn/bXnqXn,X = dnorm(X,m,s)/dnorm(Xn,m,s)
2b) bXqX,Xn/bXnqXn,X = dnorm(X,m,s)dnorm(Xn,X,e)/(dnorm(Xn,m,s)dnorm(X,Xn,e))

This is implemented in mcmc1.
Try the following two cases:
1) m=0,s=1 and e=0.1
2) m=20,s=3 and e=1

Compare this algorithm, and its implementation, with the accept-reject algorithm. Here we needed practically no calculations at all.

The are two main difficulties with the MCMC method in practice:
1) It can take a lot of computational effort, For example if we want to generate just 1 variate at a time we still might have to generate 10000 others before the stationary distribution is reached.
2) It can be very difficult in practice to know when the stationary distribution is reached, that is when the "burn-in" period is over. There are Example s where the chain seems to have settled down for very long periods but is not actually at the stationary distribution yet.

Example : Let's consider again the normal mixture model and carry out a Bayesian analysis. To keep things simple we assume that m1, s1, m2 and s2 are known and the only parameter is a. An obvious prior on a is a beta distribution, and again to keep things simple we will use a~beta(t,t), that is we use a prior centered at 0.5. Then we have

Finding the exact posterior distribution means finding the marginal distribution which in this case is hopeless analytically. Fortunately we don't need it for the Hastings-Metropolis algorithm.
The routine is implemented in mcmc2. Compare the cases a=0.9, N=100, t=1, a=0.9, N=100, t=10 and a=0.9, N=1000, t=10,

The Gibbs Sampler

Suppose we are given a joint density f(x,y1,..,yp) and are interested in the marginal f(x). Of course we have
mark2fig7.png - 1888 Bytes
but this p-dimensional integral might be very difficult to evaluate. The Gibbs sampler instead generates a sequence of simulated values of f(x) which can then be used to estimate the desired quantity. It works by iteratively generating observations from the conditional distributions f(x|Yi=yi,i=1,..,p) and f(yj|X=x, Yi=yi,i=1,..,p, i j). Here is the algorithm:
1) pick initial values Yi(0)=yi(0) and set k=1.
2) X(k) ~ f(x|Yi(k)=yi(k),i=1,..,p)
3) Yi(k+1) ~ f(yj|X(k)=x(k), Yi(k)=yi(k),i=1,..,p, i j)
4) goto 2

Example : Say we have X~Bin(n,p) and p~Beta(a,b) and we want a sample from the posterior distribution p|X. Then the joint distribution of X and p is the beta-binomial distribution given by
mark2fig8.png - 2578 Bytes
To use the Gibbs sampler we need the conditional distributions of X|p and p|X:
1) X|p ~ Bin(n,p)
2) p|x ~ Beta(x+a,n-x+b)
so the Gibbs sampler is as follows:
1) Choose an initial value for p, say p(0)=0.5 and set k=1
2) X(k) ~ Bin(n,p(k))
2) p(k+1) ~ Beta(x(k)+a,n-x(k)+b)

This is implemented in mcmc3. Here we use the following strategy to get "independent" samples: run the sampler for a short period (B=10) and then start from scratch. For each run pick the last observation.

It can be shown that the Gibbs sampler is actually a special case of the Hastings-Metropolis algorithm: Let X = (X1, .., Xn) be a r.v. with probability mass function p(x) that need only be specified up to a multiplicative constant, and suppose that we want to generate a r.v. whose distribution is that of X. That is we want to generate a r.v. with mass function (or density) p(x) = Cg(x) where g(x) is known but C is not. Using the Gibbs sampler assumes that for any i and values xj, j i, we can generate a r.v. X with pmf
P(X=x) = P(Xi=x|Xj=xj,j i)
It operates the Hastings-metropolis algorithm on a Markov chain with states x = (x1, .., xn) and with transition probabilities defined as follows: Whenever the present state is x, a coordinate that is equally likely to be any of the 1,..,n is chosen. If coordinate i is chosen, then a r.v. X whose probability mass function is as above is generated and if X=x the state y=(x1, .., xn-1, x, xn+1,..,xn) is considered as the next candidate state. In other words, with x and y given the Gibbs sampler uses the Hastings -Metropolis algorithm with
mark2fig4.png - 2857 Bytes
Because we want the limiting distribution to be p the vector y is then accepted as the new state with probability
mark2fig5.png - 3498 Bytes
So in the Gibbs sampler the candidate state is always accepted as the next state.