Methods for Finding Estimators

Method of Moments

Let X = X1, ..,Xn be a sample from a distribution with pmf(pdf) f(x|q1,..,qk). Define the ith sample moment by mi = (Xi1+..+Xin)/n. Analogously define the ith population moment by mi = EXi. Of course mi is a function of the q1,..,qk. So we can find estimators of q1,..,qk by solving the system of k equations in k unknowns mi=mi i=1,..,k

Example: say X1, ..,Xn are iid N(m,s2). Here q1=m and q2=s2. Then
est2fig1.png - 6618 Bytes

Maximum Likelihood

Say f(x|q) is the joint pdf (pmf) of a sample X. Then, given that X=x is observed, the function of q defined by L(q|x) = f(x|q) is called the likelihood function.
Notice that there is absolutely no difference between the pdf and the likelihood function. They differ only in what is assumed to be known (q in the case of the pdf, x in the case of the likelihood function) and what is assumed to be a variable (the other way around).
Notice that in general the likelihood function is not a pdf (pmf) because when integrated over dq we (usually) don't get 1.

Often we will consider the log-likelihood function l(q|x)=log(L(q|x))

Example 1: say X1, ..,Xn are iid U[0,q], q>0. Then

Examples can be drawn with the function likefuns with which=1

Example 2: say X1, ..,Xn are iid Ber(p). Then

Examples can be drawn with the function likefuns with which=2

Example 3: We have observations X1, ..,Xn which are independent. We know that our population is made up of two groups (Men - Women, say) and each observation comes from one or the other group but we don't know which. Observations from group i have a N(mi,si2), i=1,2, distribution. We want to look at the log-likelihood function.
What we have here is called a mixture distribution. Say that proportion of members of group 1 in the population is a. Let's introduce a latent (unobservable) r.v. Zi, which is 1 if observation Xi comes from group 1, and 2 if it comes from 2. Then

where we use the notation F(x|m,s) for the cdf of a normal r.v and j(x|m,s) for its density.
Unfortunately this expression does not simplify! Also, it is a function in 5 dimensions, so just looking at it is difficult. In the routine mixplot we draw the log-likelihood for any choice of the parameters.

At least in the discrete case the likelihood function gives the probability that for a certain value of the parameter we observe the data we did observe. It is reasonable, then, to take as our estimator for the parameter that value of the likelihood function that maximizes this probability. In other words, we pick as our estimate that value of the parameter that is most likely, given the observed data. This is called the maximum likelihood method, and is by far the most common technic for finding estimators.
Because log is a monotone function, the value at which the maximum is found is the same for the log-likelihood function and the likelihood function. Using the loglikelihood is often easier, but not always possible because the likelihood function can be 0.

Example 1: say X1, ..,Xn are iid U[0,q], q>0. Above we have seen that the likelihood function is 0 from 0 to max{Xi}, where it jumps to q-n and then gradually declines back to 0. Therefore the maximum is attained at max{Xi}, and that is the maximum likelihood estimator of q.

Example 2: say X1, ..,Xn are iid Ber(p)
drawing the log-likelihood function we see that it has a single maximum. We can find this maximum using calculus:
est2fig5.png - 6132 Bytes
Because of the graph we know that the sample mean is the mle. In general, though, a point where the derivative is 0 might be a local or a global maximum, a local or a global minimum, a sattlepoint etc. And as in the first example, the maximum might be obtained at the boundary, in which case the derivative is of no use at all. This is easy in one dimension where we can draw the graph, even in two dimensions, but gets very hard in higher dimensions.

Example 3: here we have five parameters. To start lets keep it simple and assume we know m1, s1, m2, s2 and we want to estimate a. Then

This is a non-linear equation which can not be solved explicitely, so we will have to do it numerically. A standard method in numerical analysis for solving equations of the form h(x)=0 is Newton's method:
pick a starting point x1
find xn+1 = xn - h(xn)/h'(xn)
If the starting point is close enough to a solution of the equation, the sequence will converge to it

Example 3: The routine mixmle1 implements Newton's method for this problem. Note that

How about the complete problem with 5 parameters? For this we need a multivariate extension of Newton's method. Say f(x) is a real-valued function in n, and we wish to find a maximum (or more generally an extremal point) of f. Let Df be the gradient of f, that is Dfi(x) = d/dxi(f(x)) and let H be the Hessian matrix defined by H(x)ij = d2/dxidxj(f(x)). Then we have
pick a starting point x1
find xn+1 = xn - H-1(xn) Df(xn)

We will now apply this to our problem. This is quite a lengthy and nontrivial exercise in calculus, and it is important to do this slowly and carefully. It is also very important to choose some good notation to keep from getting lost.
We will use s rather than s2 as our parameter.
We begin by finding the first and second derivatives of the density of a normal with respect to the parameters:

Next we introduce some useful notation:

With this we find the gradient:

and the Hessian matrix (note that H[i,j]=H[j,i]):



The routine is implemented in mixmle2

Note: say we have m1 = X1, then

so the log-likelihood function has singularities at all points where mk = Xi and sj = 0.
This means that finding the mle here is a difficult problem, and a vey good start value is needed.
Consider the data set mixmledata. To get a good starting value draw the histogram of the data. There seem to be two peaks, one at about 13 and the other at about 19. The minimum is at 7.55. From the empirical rule we know that (m-3s,m+3s) covers the range of observations from a normal, so applying this to the right hand peak we find m1 =13 and m1-3s1 = 7.33, therefore s1 = (m1 - 7.55)/3 = (13-7.55)/3 = 1.8. For the normal on the right we find s2 = (22.5 - m2)/3 = (22.5-19)/3 = 1.2. Finally, the peak on the left appears to be slightly larger than the one on the right, so we start with a = 0.6. With this we start the routine with (0.6,13,1.8,19,1.2).
The routine returns (0.5718, 11.62, 1.9633, 18.99, 1.586). But is this really the mle? Unfortunately trying to make sure of this is quite hard. One thing to do is to draw the histogram with the fitted density, which looks quite good. Another way to check the result is to draw the log-likelihood function around the mle, one variable at a time. This is done in mixmlecheck.

Bayesian Point Estimation

We have already seen how to use a Bayesian approach to do finding point estimators, namely using the mean of the posterior distribution. Of course one could also use the median or any other measure of central tendency.