

Now this markov chain will be time-reversible and have stationary measure π if
πiPij=πjPji for all j≠i.
This is equivalent to
πiqijαij=πjqjiαji
and is easy to check that this will be satisfied if we set
One of the reasons this algorithm is so useful is the following: say we only know the values in π up to a constant, that is we have a sequence {bj,j=1,..m}, bj≥0 and ∑bj=B. We want to generate observations from π with πj=bj/B. Then the above algorithm works without the need to find B because

With this we get the Hastings-Metropolis Algorithm:
1) Choose an irreducible Markov chain with transition probabilities Q and choose some integer k between 1 and m
2) Let n=0 and X0=k
3) generate a r.v. X such that P(X=j)=qx0j and generate U~U[0,1]
4) If U<bXqX,Xn/bXnqXn,X then NS=X, else NS=Xn
5) n=n+1, Xn=NS
6) Go to 3
Notice the similarities between this algorithm and the accept-reject method. The main difference, and the reason this algorithm is so useful, are that here we don't need to find c, which usually requires a maximization and we don't need B, as discussed above.
Example : Let's begin with a very simple example , X~N(μ,σ). First we need the "proposal" distribution Q. We are actually quite free to make almost any choice here. Let's try the following two: if we are at the point x we choose the next point from a
a) U[x-e,x+e] for some e>0.
b) N(x,e) for some e>0.
For a) we have qxy=1 if x-e<y<x+e, 0 otherwise, and for b) we get qxy=dnorm(y,x,e). So the algorithm uses:
1a) X=runif(1,Xn-e,Xn+e])
1b) X=rnorm(1,Xn,e)
2a) bXqX,Xn/bXnqXn,X = dnorm(X,μ,σ)/dnorm(Xn,μ,σ)
2b) bXqX,Xn/bXnqXn,X = dnorm(X,μ,σ)dnorm(Xn,X,e)/(dnorm(Xn,μ,σ)dnorm(X,Xn,e))
This is implemented in mcmc1.
Try the following two cases:
1) μ=0,σ=1 and e=0.1
2) μ=20,σ=3 and e=1
Compare this algorithm, and its implementation, with the accept-reject algorithm. Here we needed practically no calculations at all.
There are two main difficulties with the MCMC method in practice:
1) It can take a lot of computational effort, For example if we want to generate just 1 variate at a time we still might have to generate 10000 others before the stationary distribution is reached.
2) It can be very difficult in practice to know when the stationary distribution is reached, that is when the "burn-in" period is over. There are examples where the chain seems to have settled down for very long periods but is not actually at the stationary distribution yet.
Example : say we want to generate r.v.'s X such that P(X=k)=c/kr, r>1 and k=1,2, ... We did this already for k=2 using the accept-reject algorithm, with the Cauchy as the Y distribution. Now we would need to know c for any value of r, which is not possible. Let's instead use MH. First we use the following:
if Xn<=m X=U{1..(2m+1)} for some m (here m=10)
otherwise X=U{-m,,m}
so qX,Xn =1/(2m+1) and bXqX,Xn/bXnqXn,X= (c/Xr)/(c/Xnr) = (Xn/X)r
see mcmc6.
next let's try X=Bin(2*Xn[i-1],0.5)+1 . This shows that not all choice of Q work, see mcmc6(2).
Example : Let's consider a normal mixture model, that is we have N1~N(μ1, σ1), N2~N(μ2, σ2), Z~Ber(α) and we observe
X=ZN1+(1-Z)N2
let's we want to carry out a Bayesian analysis. This means we will treat the parameters as random variables. To keep things simple we assume that μ1, σ1, μ2 and σ2 are known and the only parameter is α. As a rv α has a distribution, called the prior. An obvious choice is a beta distribution (because 0≤α≤1), and again to keep things simple we will use α~beta(τ,τ), that is we use a prior centered at 0.5. In a standard Bayesian analysis we will have to calculate the posterior distribution, that conditional density of α|X1=x1, X2=x2,..,Xn=xn. We have the joint density

Finding the exact posterior distribution means finding the marginal distribution which in this case is hopeless analytically. Fortunately we don't need it for the Hastings-Metropolis algorithm.
The routine is implemented in mcmc2. Compare the cases α=0.9, N=100, τ=1, α=0.9, N=100, τ=10 and α=0.9, N=1000, τ=10,
Example Let's generate data from the rv (X,Y) with f(x,y)=c/(x+y) 1<x<2, 1<y<2 using the Hastings-Metropolis algorithm.
Here we will use the following Markov process: if Xn[i-1,1]=x, choose
X[1]~U[1,2ε] if x<1+2ε
X[1]~U[x-ε,x+ε] if 1+ε<x<2-ε
X[1]~U[2-2ε,2] if x>2-2ε
for some ε>0, and the same for X[2].
This is implememnted in mcmc9. Note that for ε<0.4 or so this does not work, the chain gets stuck close to the corners. Of course for ε=0.5 we have X[i]~U[1,2], so the chain is an independent sequence, and the method still works!
Example Recall the example at the end of the section on general methods for generating data: we have a random vector (X1,..,Xd) with density f(x1,..,xd)=c∏xi, 0≤x1≤..≤xd≤1. To generate data from this density use the following Markov process Q: if Y is the last point, randomly choose a coordinate j from 1:d and choose X=Y except that Xj~U[Yj-1,Yj+1] (with Y0=0 and Yd+1=1). Notice that X again is a possible point from this rv and that
bXqX,Xn/bXnqXn,X = Xj/Yj
see mcmc4.
This seems almost to easy! How can we verify that this indeed generates the right data? In general this is really impossible, but let's at least do it for some special cases:
d=2:

d=3:
Notice that
if d=2 fx2(x)=4x3
if d=3 fx3(x)=6x5
so maybe
fxd(x)=2dx2d-1?
using mcmc4 it appears to be true
Example In one of the homeworks we showed that if X,Y are iid Pois(λ), E[X|X+Y=n]=n/2. Let's say we want to generalize this and find E[X1|X1+..+Xd=n]. In a direct simulation approach we would do the following:
1) generate X1,..,Xdiid P(λ)
2) if X1+..+Xd=n set Z=X1, otherwise go to 1)
3) repeat 1 and 2 say 1000 times and the find the mean of the Z
The problem is that if d is not small we will rarely find X1+..+Xd=n and so we will need to run through 1 and 2 many times to find an acceptable X1. Of course we have X1+..+Xd~P(dλ), so

for example if d=5, λ=1 and n=10 we have p=0.018, so we would find a good candidate only every 1/0.018=55 tries. mcmc5(1) does it anyway.
Now let's use the Metropolis-Hastings algorithm. We begin with the point (n,0,..,0). Then in each step we choose i~U[1,..,d], j~U[1,..,d] but not i, z~rpois(1,λ) and set x(k+1)=x(k). Finally if
U[0,1]<dpois(z,λ)/dpois(x[i],λ) we set x(k+1)[i]=z, x(k+1)[j]=n-∑x(k+1)[-i]
so that again we have x 1+..+xd=n
In this case we can use the direct simulation as a check on the MCMC simulation, at least where the direct simulation is not to slow. First let's check the case d=2, λ=1.0, where we know the correct answer; n/2. The dots are the estimated values using the MCMC algorithm
Next consider the case where n is the right size so that X1+..+Xd=n reasonably often, namely n=dλ. Using λ=2 we find
so it appears our MCMC simulation works.
Now let's see whether we can use our simulation to derive a formula for μ(d,n,λ)=E[X1|X1+..+Xd=n]. In the next graph we have the plots for λ=1, n=1:20 and d=3:11 together with the least squares regerssion lines:
It appears that as a function of n μ(n,d,1) is linear with an intercept of 0. How about its depencence on d? In the next graph we have the plot of n vs. the slope of the least squares regression lines, togther with several transformations:

The log-log transform yields a straight line relationship! Its lsr is given by y=-0.02040 -0.98070x, so it seems the intercept might be 0 and the slope -1, which means
log(μ(d,n,1)) proportional to -log(d)
or
μ(d,n,1) proportional to 1/d
The next graph has the slopes in the original scale, with the 1/d line:
and that seems to fit really well! So now we know (or at least suspect) μ(d,n,1)=n/d, which fits the known result (μ(2,n,1)=n/2) perfectly!
Last, the dependence on λ. In the next graph we do the simulation for n=5, d=3, 4, 5 and 6 and λ from 0.1 to 10:

It seems there is no dependence on λ, and so we find our function:
μ(d,n,λ)=n/d
Actually, we can also just do the math:
And finally, a really easy proof:
t=E[∑X|∑X=t]=∑E[Xi|∑X=t]=nE[X1|∑X=t]
Suppose we want to generate data from a random vector X1, .., Xn with joint density f(x1,..,xn). Unfortunately we know f only up to a constant, that is f(x1,..,xn)=c·u(x1,..,xn). Now the conditional distribution of Xk|X1=x1, .,Xk-1=xk-1, ,Xk+1=xk+1,..,Xn=xn is given by
so the density of the conditional distribution function does not depend on c. The idea of the Gibbs sampler is to generate a sequence of simulated values of fk(xk|x1,..,xk-1,xk+1,..,xn) with k going from 1 to n and then starting all over again.
Example Say we want to generate from (X,Y)~N(μ,Σ) where μ=(μx,ny) and
Recall

see mcmc7
It can be shown that the Gibbs sampler is actually a special case of the Hastings-Metropolis algorithm: Let X = (X1, .., Xn) be a r.v. with probability mass function f(x) that need only be specified up to a multiplicative constant, and suppose that we want to generate a r.v. whose distribution is that of X. That is we want to generate a r.v. with mass function (or density) f(x) = cu(x) where u(x) is known but c is not. Using the Gibbs sampler assumes that for any i and values xj, j ≠i, we can generate a r.v. X with pmf
P(X=x) = P(Xi=x|Xj=xj,j≠i)
It operates the Hastings-Metropolis algorithm on a Markov chain with states x = (x1, .., xn) and with transition probabilities defined as follows: Whenever the present state is x, a coordinate that is equally likely to be any of the 1,..,n is chosen. If coordinate i is chosen, then a r.v. X whose probability mass function is as above is generated and if X=x the state y=(x1, .., xn-1, x, xn+1,..,xn) is considered as the next candidate state. In other words, with x and y given the Gibbs sampler uses the Hastings -Metropolis algorithm with

Because we want the limiting distribution to be p the vector y is then accepted as the new state with probability

So in the Gibbs sampler the candidate state is always accepted as the next state.
Example Here is one of the standard models used in the actuarial sciences (Insurance) to model the number of claims that might have to be paid on a certain type of policy: 
The idea is this: there is a random number N of policies of the same type (car insurance, health ins, etc.) Obviously N>0 otherwise it's to boring. Each insurance has a probability Y to be claimed, so X is the number of policies that get claimed.
We want to generate data for X. In order to use the Gibbs sampler we need all the conditional distributions. We already have X|Y=y,N=n~Bin(n,y). It can be shown that
So the Gibbs sampler is as follows. Set k=0, pick initial values Yk=y and Nk=n.
1)
set k=k+1
2)
generate Xk~Bin(Yk-1,Nk-1)
3) generate Yk~Beta(Xk+α,Nk-Xk+β)
3) generate M~Pois(λ[1-Yk]) and set Nk=M+Xk
4) go to 1)
see mcmc8
Example : One of the main uses of the Gibbs sampler is in Bayesian analysis. Say we have X~Bin(n,p) and p~Beta(α,β) and we want a sample from the posterior distribution p|X. Then the joint distribution of X and p is the beta-binomial distribution given by

To use the Gibbs sampler we need the conditional distributions of X|p and p|X:
1) X|p ~ Bin(n,p)
2) p|x ~ Beta(x+α,n-x+β)
so the Gibbs sampler is as follows:
1) Choose an initial value for p, say p(0)=0.5 and set k=1
2) X(k) ~ Bin(n,p(k))
2) p(k+1) ~ Beta(x(k)+α,n-x(k)+β)
As a specific example, say in a sample of 100 employees of a company we have 37 womean and 63 men, and we want to find a 90% interval estimate for the percentage of female employees. We have no prior knowledge of this company, so we will use U[0,1] (-beta(1,1)) as our prior.
This is implemented in mcmc3. Notice that the interval is very similar to the frequentist solution,
±1.96√(
(1-
)/n) = (0.28,0.46)
Historically Bayesian analysis suffered from the problem that prior distributions had to be chosen so it was possible to calculate the posterior distribution (one popular choice are so-called conjugate priors, where the posterior distribution is the same as the prior one, except for the parameters) even though those priors were not a good description of our "prior belief". The Gibbs sampler allows us to be much more free in our choice of prior.