The EM algorithm seems at first to solve a very specific problem but it turns out to be quite useful in general
Example Let's return to the normal mixture model we considered earlier: Y1~N(μ1,σ1), Y2~N(μ2,σ2), Z~Ber(p) and X=(1-Z)Y1+ZY2 Let's assume for the moment that in addition to the X1, .., Xn we also observe Z 1, .., Z n . Then

but this is the same as the equation for X~N(μ,σ) where we only count those xi's where zi=0, so it has the solution
and similar for the other parameters of the normals.
So if we knew the zi's this would be a simple problem. On the other hand,

so if we knew the parameters we could estimate each of the zi's.
This is then the basic idea of the EM algorithm:
• in the M step assume you know the z1,..,zn and estimate the parameters.
• in the E step use these parameters to estimate the z1,..,zn.
See em1 for an implementation.
Example Let's look at a special case of the above, with p=½, μ1=0, σ1=1, μ2=0, σ2=1 and we assume that σ1 and σ2 are known, so we are fitting for p, μ1 and μ2 only.
norm.mix() generate 1000 observations from this mixture distribution and draws the contourplot for μ1vs μ2. (for p=1/2). As we can see, there are two maxima, one at about the right point (0,3) and another at (3,0). In norm.mix(2) we draw the log-likelihood onto the the subspace {(x,y): x+y=3}, which shows the same thing.
norm.mix(3, start) runs the EM algorithm from a starting point start. Check
norm.mix(3,c(0.3,-2,5))
norm.mix(3,c(0.3,4,5))
norm.mix(3,c(0.7,4,1))
The EM algorithm was originally invented by Dempster in 1977 to deal with a common problem in Statistics called censoring: say we are doing a study on survival of patients after cancer surgery. Any such study will have a time-limit after which we will have to start with the data analysis, but hopefully there will still be some patients who are alive, so we don't know their survival times, but we do know that the survival times are greater than the time that has past sofar. We say the data is censored at time T. The number of patients with survival times >T is important information and should be used in the analysis. If we order the observations into (x1, .., xn) the uncensored observations (the survival times of those patients that are now dead) and (xn+1, .., xn+m) the censored data, the likelihood function can be written as

because all we know of the censored data is that P(Xi>T)=1-F(T|θ)
If we had also observed the survival-times of the censored patients, say z=(zn+1, .., zn+m) we could have written the complete-data likelihood

and again we can use the EM algorithm to estimate θ:
• in the M step assume you know the z1,..,zn and estimate θ.
• in the E step use θ to estimate the z1,..,zn
Example Say Xi~Exp(θ) and we have data (x1, .., xn) and we know that m observations were censored at T. Now
so the EM algorithm proceeds as follows:
• in the M step assume you know the z1,..,zn and estimate θ=1/mean({x1, .., x n,zn+1, .., zn+m}).
• in the E step use θ to estimate the z1,..,zn =1/θ+T
see em2