Methods for Finding Estimators

Method of Moments

Let X = (X_1, ..., X_n) be a sample from a distribution with pmf (pdf) f(x|θ_1, ..., θ_k). Define the ith sample moment by m_i = (X_1^i + ... + X_n^i)/n. Analogously, define the ith population moment by μ_i = E[X^i]. Of course μ_i is a function of θ_1, ..., θ_k, so we can find estimators of θ_1, ..., θ_k by solving the system of k equations in k unknowns m_i = μ_i, i = 1, ..., k.

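Example: say X_1, ..., X_n are iid N(μ,σ). Here θ_1 = μ and θ_2 = σ^2. Then μ_1 = E[X] = μ and μ_2 = E[X^2] = σ^2 + μ^2, so solving m_1 = μ and m_2 = σ^2 + μ^2 gives the estimators μ̂ = m_1 = X̄ and σ̂^2 = m_2 - m_1^2 = (1/n) ∑ (X_i - X̄)^2.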
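The two moment equations for the normal example can be checked numerically. The following is only a minimal sketch in Python; the function name moments_normal and the simulated sample are assumptions made for illustration and are not part of the course routines.

```python
import random

def moments_normal(xs):
    """Method-of-moments estimates (mu, sigma^2) from the first two sample moments."""
    n = len(xs)
    m1 = sum(xs) / n                 # first sample moment
    m2 = sum(x * x for x in xs) / n  # second sample moment
    # Solve m1 = mu and m2 = sigma^2 + mu^2 for mu and sigma^2.
    return m1, m2 - m1 ** 2

random.seed(1)
# Simulated N(10, 2) sample; random.gauss takes the standard deviation.
sample = [random.gauss(10.0, 2.0) for _ in range(5000)]
mu_hat, var_hat = moments_normal(sample)
print(mu_hat, var_hat)  # should be close to 10 and 4
```

Note that the second estimate is the variance with divisor n, not n-1.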

Method of Least Squares

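Example: say X_1, ..., X_n ~ N(μ,σ), σ assumed known. Now μ is the mean of the normal distribution, so the observations should be scattered around μ. If we estimate μ by, say, a, then ε_i = X_i - a is called the ith residual or error. A measure of the overall error is T(a) = ∑ (X_i - a)^2, the sum of squared residuals. Setting dT/da = -2 ∑ (X_i - a) = 0 gives a = X̄, so the least squares estimator of μ is the sample mean.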

We did not have to use T(a) = ∑ (X_i - a)^2; other possible choices are

• T(a) = ∑ |X_i - a|, which leads to a = median(X)
• T(a) = max{|X_i - a|}, which leads to the midrange a = (min(X) + max(X))/2
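The effect of the choice of T can be seen with a brute-force search. This is a rough sketch only: the helper names total_error and grid_minimizer and the small data set are made up for illustration, and a grid search is used simply because it needs no calculus.

```python
def total_error(xs, a, kind):
    """Overall error T(a) for one of three measures."""
    if kind == "squared":
        return sum((x - a) ** 2 for x in xs)
    if kind == "absolute":
        return sum(abs(x - a) for x in xs)
    if kind == "maximum":
        return max(abs(x - a) for x in xs)
    raise ValueError(kind)

def grid_minimizer(xs, kind, steps=10000):
    """Approximate minimizer of T(a) on an equally spaced grid over the data range."""
    lo, hi = min(xs), max(xs)
    grid = [lo + (hi - lo) * j / steps for j in range(steps + 1)]
    return min(grid, key=lambda a: total_error(xs, a, kind))

data = [1.0, 2.0, 4.0, 7.0, 16.0]  # mean 6, median 4, midrange 8.5
a_squared = grid_minimizer(data, "squared")
a_absolute = grid_minimizer(data, "absolute")
a_maximum = grid_minimizer(data, "maximum")
print(a_squared, a_absolute, a_maximum)
```

Up to the grid resolution, the three minimizers agree with the mean, the median and the midrange of the data.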

Maximum Likelihood

The idea here is this: the likelihood function gives the likelihood (not the probability!) of a value of the parameter given the observed data, so why not choose the value of the parameter that best "matches" the observed data, that is, the value with the greatest likelihood?

Example: say X_1, ..., X_n are iid U[0,θ], θ > 0. Then

L(θ|x) = ∏ f(x_i|θ) = 1/θ^n if θ ≥ max(x_i), and 0 otherwise.

Now L(θ|x) is 0 on (0, max(x_i)), at max(x_i) it jumps to 1/(max(x_i))^n, and then it decreases monotonically as θ gets bigger. So the maximum is attained at θ = max(x_i), and therefore the mle is max(X_i).
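The shape of this likelihood is easy to tabulate. A small sketch, with a made-up sample and a hypothetical function name uniform_likelihood:

```python
def uniform_likelihood(theta, xs):
    """L(theta | x) for an iid U[0, theta] sample with nonnegative observations."""
    if theta < max(xs):
        return 0.0  # some observation would lie outside [0, theta]
    return theta ** (-len(xs))

xs = [0.8, 2.9, 1.7, 2.2]
theta_hat = max(xs)  # the mle
for theta in (2.0, 2.8, theta_hat, 3.0, 4.0):
    print(theta, uniform_likelihood(theta, xs))
```

The likelihood is zero below the largest observation and strictly decreasing from there on.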

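Example: say X_1, ..., X_n are iid Ber(p). First notice that a positive function f has an extremal point at x iff log(f) does as well, because d/dx{log(f(x))} = f'(x)/f(x) = 0 iff f'(x) = 0. So we can maximize the log-likelihood instead of the likelihood:

L(p|x) = ∏ p^(x_i) (1-p)^(1-x_i) = p^(∑x_i) (1-p)^(n-∑x_i)
l(p) = log L(p|x) = (∑x_i) log p + (n - ∑x_i) log(1-p)
dl/dp = (∑x_i)/p - (n - ∑x_i)/(1-p) = 0

which has the solution p̂ = (∑x_i)/n = x̄, so the mle of p is the sample mean.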

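Example: say X_1, ..., X_m ~ Bin(n,p), with both p and n unknown (we write m for the sample size because n is now a parameter). We want to find the mles of p and n. We have

L(n,p|x) = ∏ C(n,x_i) p^(x_i) (1-p)^(n-x_i), n ≥ max(x_i)

and for any fixed n, setting the derivative of the log-likelihood with respect to p equal to zero gives p̂(n) = (∑x_i)/(mn). So for any n we have the mle of p, and we only need to search through the values of n for the overall mle. This is done in mle.bin.n.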
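The search over n can be sketched as follows. This is not the routine mle.bin.n, whose code is not shown here; the function names, the cap n_max on the search and the sample of counts are all assumptions made for illustration.

```python
import math

def log_binom_coef(n, x):
    """log of the binomial coefficient C(n, x), via the log-gamma function."""
    return math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)

def profile_loglik(n, xs):
    """Log-likelihood at (n, p_hat(n)), where p_hat(n) = sum(xs) / (len(xs) * n)."""
    p = sum(xs) / (n * len(xs))
    ll = 0.0
    for x in xs:
        ll += log_binom_coef(n, x)
        if x > 0:
            ll += x * math.log(p)
        if n - x > 0:
            ll += (n - x) * math.log(1.0 - p)
    return ll, p

def mle_binomial(xs, n_max=200):
    """Search n = max(xs), ..., n_max (an arbitrary cap) for the largest profile log-likelihood."""
    best = None
    for n in range(max(1, max(xs)), n_max + 1):
        ll, p = profile_loglik(n, xs)
        if best is None or ll > best[0]:
            best = (ll, n, p)
    return best[1], best[2]

counts = [3, 5, 4, 6, 2, 5, 4, 3, 5, 4]
n_hat, p_hat = mle_binomial(counts)
print(n_hat, p_hat)
```

The cap matters: when the sample variance is not smaller than the sample mean the likelihood keeps increasing in n, and the search simply returns n_max.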

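Example: X_1, ..., X_n ~ N(μ,σ):

l(μ,σ) = -n log σ - (n/2) log(2π) - ∑ (x_i - μ)^2/(2σ^2)
dl/dμ = ∑ (x_i - μ)/σ^2 = 0
dl/dσ = -n/σ + ∑ (x_i - μ)^2/σ^3 = 0

so the mles are μ̂ = x̄ and σ̂^2 = (1/n) ∑ (x_i - x̄)^2.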

Example: We have observations X_1, ..., X_n which are independent. We know that our population is made up of two groups (men and women, say) and each observation comes from one or the other group, but we don't know which. Observations from group i have a N(μ_i,σ_i) distribution, i = 1, 2. We want to estimate the parameters.
What we have here is called a mixture distribution. Say the proportion of members of group 1 in the population is α. Let's introduce a latent (unobservable) r.v. Z_i, which is 1 if observation X_i comes from group 1, and 2 if it comes from group 2. Then

P(X_i ≤ x) = P(X_i ≤ x|Z_i = 1)P(Z_i = 1) + P(X_i ≤ x|Z_i = 2)P(Z_i = 2) = αΦ(x|μ_1,σ_1) + (1-α)Φ(x|μ_2,σ_2)
f(x|α,μ_1,σ_1,μ_2,σ_2) = αφ(x|μ_1,σ_1) + (1-α)φ(x|μ_2,σ_2)
l(α,μ_1,σ_1,μ_2,σ_2) = ∑ log[αφ(x_i|μ_1,σ_1) + (1-α)φ(x_i|μ_2,σ_2)]

where we use the notation Φ(x|μ,σ) for the cdf of a normal r.v. and φ(x|μ,σ) for its density.
Unfortunately this expression does not simplify! Also, it is a function of 5 variables, so just looking at it is difficult.

To start let's keep it simple and assume we know μ_1, σ_1, μ_2, σ_2 and we want to estimate α. Then

dl/dα = ∑ [φ(x_i|μ_1,σ_1) - φ(x_i|μ_2,σ_2)] / [αφ(x_i|μ_1,σ_1) + (1-α)φ(x_i|μ_2,σ_2)] = 0

This is a non-linear equation which cannot be solved explicitly, so we will have to do it numerically. A standard method in numerical analysis for solving equations of the form h(x) = 0 is Newton's method:
pick a starting point x_1
find x_{n+1} = x_n - h(x_n)/h'(x_n)
If the starting point is close enough to a solution of the equation (and h' is not 0 there), the sequence will converge to it.

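The routine mixmle1 implements Newton's method for this problem. Set φ_i = φ(X_i|μ_2,σ_2) and ψ_i = φ(X_i|μ_2,σ_2) - φ(X_i|μ_1,σ_1). Then

l(α) = ∑ log(φ_i - αψ_i)
dl/dα = -∑ ψ_i/(φ_i - αψ_i)
d^2l/dα^2 = -∑ ψ_i^2/(φ_i - αψ_i)^2

and the Newton iteration is α_{n+1} = α_n - l'(α_n)/l''(α_n).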
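A possible Python version of this iteration is sketched below. It is not the code of mixmle1, which is not reproduced in these notes; the function names, the starting value 0.5, the stopping rule and the simulated data are assumptions. With φ_i and ψ_i as defined above, the mixture density at X_i is φ_i - αψ_i.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma), with sigma the standard deviation."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mle_alpha(xs, mu1, sigma1, mu2, sigma2, alpha=0.5, tol=1e-10, max_iter=100):
    """Newton's method for the score equation dl/dalpha = 0."""
    phi = [normal_pdf(x, mu2, sigma2) for x in xs]
    psi = [p - normal_pdf(x, mu1, sigma1) for p, x in zip(phi, xs)]
    for _ in range(max_iter):
        ratios = [s / (p - alpha * s) for p, s in zip(phi, psi)]
        d1 = -sum(ratios)                 # first derivative of l
        d2 = -sum(r * r for r in ratios)  # second derivative of l
        step = d1 / d2
        alpha -= step
        if abs(step) < tol:
            break
    return alpha

random.seed(3)
true_alpha = 0.3
# Simulated data: group 1 is N(0,1), group 2 is N(5,1).
data = [random.gauss(0.0, 1.0) if random.random() < true_alpha
        else random.gauss(5.0, 1.0) for _ in range(2000)]
alpha_hat = mle_alpha(data, 0.0, 1.0, 5.0, 1.0)
print(alpha_hat)  # should be close to 0.3
```

The sketch does not guard against an iterate leaving the interval (0,1); with well separated groups and a start at 0.5 this is not an issue, but a more careful routine would check for it.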

Next let's consider the case where we know α, σ_1 and σ_2 and want to estimate μ_1 and μ_2. For this we need a multivariate extension of Newton's method. Say h(x) is a real-valued function on R^n, and we wish to find a maximum (or more generally an extremal point) of h. Let ∇h be the gradient of h, that is ∇h_i(x) = ∂h(x)/∂x_i, and let H be the Hessian matrix defined by H(x)_ij = ∂^2 h(x)/∂x_i∂x_j. Then we have

pick a starting point x_1
find x_{n+1} = x_n - H^{-1}(x_n) ∇h(x_n)

Here this means:

Also note that

The routine mixmle2 implements Newton's method for this problem.
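The two-parameter iteration might look as follows in Python. Again this is only a sketch and not the code of mixmle2; the names, the starting values and the simulated data are assumptions. It uses the weights w_i = αφ(x_i|μ_1,σ_1)/f(x_i), with f the mixture density, and the abbreviations a_i = (x_i - μ_1)/σ_1^2 and b_i = (x_i - μ_2)/σ_2^2. Differentiating the log-likelihood then gives the gradient (∑ w_i a_i, ∑ (1-w_i) b_i) and the Hessian entries H_11 = ∑ [w_i(1-w_i)a_i^2 - w_i/σ_1^2], H_22 = ∑ [w_i(1-w_i)b_i^2 - (1-w_i)/σ_2^2] and H_12 = -∑ w_i(1-w_i)a_i b_i.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma), with sigma the standard deviation."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def newton_means(xs, alpha, sigma1, sigma2, mu1, mu2, tol=1e-10, max_iter=100):
    """Two-dimensional Newton iteration for (mu1, mu2); alpha and the sigmas are known."""
    for _ in range(max_iter):
        g1 = g2 = h11 = h22 = h12 = 0.0
        for x in xs:
            f1 = alpha * normal_pdf(x, mu1, sigma1)
            f2 = (1.0 - alpha) * normal_pdf(x, mu2, sigma2)
            w = f1 / (f1 + f2)  # share of the mixture density due to group 1
            v = 1.0 - w
            a = (x - mu1) / sigma1 ** 2
            b = (x - mu2) / sigma2 ** 2
            g1 += w * a  # gradient
            g2 += v * b
            h11 += w * v * a * a - w / sigma1 ** 2  # Hessian
            h22 += w * v * b * b - v / sigma2 ** 2
            h12 -= w * v * a * b
        det = h11 * h22 - h12 * h12
        s1 = (h22 * g1 - h12 * g2) / det  # components of H^{-1} times gradient
        s2 = (h11 * g2 - h12 * g1) / det
        mu1, mu2 = mu1 - s1, mu2 - s2
        if max(abs(s1), abs(s2)) < tol:
            break
    return mu1, mu2

random.seed(5)
# Simulated data: alpha = 0.3, group 1 is N(0,1), group 2 is N(5,1).
data = [random.gauss(0.0, 1.0) if random.random() < 0.3
        else random.gauss(5.0, 1.0) for _ in range(1000)]
mu1_hat, mu2_hat = newton_means(data, 0.3, 1.0, 1.0, mu1=0.5, mu2=4.5)
print(mu1_hat, mu2_hat)  # should be close to 0 and 5
```

The 2x2 system is solved by hand to keep the sketch free of dependencies; for more parameters a linear algebra library would be the natural choice.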

How about the complete problem with 5 parameters?

We will now apply this to our problem. This is quite a lengthy and nontrivial exercise in calculus, and it is important to do this slowly and carefully. It is also very important to choose some good notation to keep from getting lost.
We will use σ rather than σ^2 as our parameter.
We begin by finding the first and second derivatives of the density of a normal with respect to the parameters:

Next we introduce some useful notation:

With this we find the gradient:

and the Hessian matrix (note that H[i,j]=H[j,i]):



Note: say we have μ_1 = X_1, then

φ(X_1|μ_1,σ_1) = 1/(√(2π)σ_1) → ∞ as σ_1 → 0

so the log-likelihood function has singularities at all points where μ_k = X_i and σ_k = 0.
This means that finding the mle here is a difficult problem, and a very good starting value is needed.
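The singularity is easy to see numerically. In the following sketch (made-up data and parameter values, hypothetical function names) μ_1 is set equal to the first observation and σ_1 is made smaller and smaller:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma), with sigma the standard deviation."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_loglik(xs, alpha, mu1, sigma1, mu2, sigma2):
    """Log-likelihood of the two-component normal mixture."""
    return sum(math.log(alpha * normal_pdf(x, mu1, sigma1)
                        + (1.0 - alpha) * normal_pdf(x, mu2, sigma2))
               for x in xs)

xs = [1.2, 3.4, 2.2, 5.1, 4.0]
sigmas = [1.0, 0.1, 0.01, 0.001]
# mu1 sits exactly on the first observation while sigma1 shrinks.
logliks = [mixture_loglik(xs, 0.5, xs[0], s, 3.0, 1.5) for s in sigmas]
for s, ll in zip(sigmas, logliks):
    print(s, ll)
```

The log-likelihood keeps growing as σ_1 shrinks, because the term of the first observation is unbounded while the second component keeps all other terms bounded from below.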

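Example: say X = (X_1, ..., X_k) has a multinomial distribution with parameters m and p_1, ..., p_k (we assume m is known). The log-likelihood is l(p_1, ..., p_k) = c + ∑ x_i log p_i, where c does not depend on the p_i. If we simply find the derivatives of the log-likelihood we find

dl/dp_i = x_i/p_i = 0, i = 1, ..., k

and this system has no solution. The problem is that we are ignoring the condition p_1 + ... + p_k = 1. So we really have the problem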

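Maximize l(p_1, ..., p_k) given p_1 + ... + p_k = 1

One way to do this is with the method of Lagrange multipliers: find the extremal points of

l(p_1, ..., p_k) - λ(p_1 + ... + p_k - 1):

x_i/p_i - λ = 0, so p_i = x_i/λ, i = 1, ..., k

Summing over i and using p_1 + ... + p_k = 1 and x_1 + ... + x_k = m gives λ = m, and so the mle is p̂_i = x_i/m.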

Maximum likelihood estimators have a number of nice properties. One of them is their invariance under transformations. That is, if θ̂ is the mle of θ, then g(θ̂) is the mle of g(θ).

Example: say X_1, ..., X_n ~ Ber(p), so we know that the mle is p̂ = X̄. Say we are interested in θ = p - q = p - (1-p) = 2p - 1, the difference in proportions. Then of course p = (1+θ)/2, and by invariance the mle of θ is θ̂ = 2p̂ - 1 = 2X̄ - 1.

Example: say X_1, ..., X_n ~ N(μ,σ), so we know that the mle of σ^2 is σ̂^2 = (1/n) ∑ (X_i - X̄)^2. But then the mle of σ is the square root of σ̂^2.

Theorem: Let X_1, ..., X_n be iid f(x|θ). Let θ̂ denote the mle of θ, and let g(θ) be a continuous function of θ. Under some regularity conditions on f we have

√n[g(θ̂) - g(θ)] → N(0, √v(θ))

where v(θ) is the Rao-Cramer lower bound. That is, g(θ̂) is a consistent and asymptotically efficient estimator of g(θ).

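Example: say X_1, ..., X_n are iid Pois(λ), and we want to estimate g(λ) = P(X=0|λ) = e^(-λ). First we need the mle:

l(λ) = -nλ + (∑x_i) log λ - ∑ log(x_i!)
dl/dλ = -n + (∑x_i)/λ = 0, so λ̂ = x̄

Therefore the mle of g(λ) is g(λ̂) = e^(-X̄). According to the theorem

√n[e^(-X̄) - e^(-λ)] → N(0, √v(λ))

where v(λ) is the Rao-Cramer lower bound for estimating e^(-λ). Instead of calculating that directly we can use the approximations we previously discussed:
E[g(X̄)] ≈ g(λ) = e^(-λ) and
Var[g(X̄)] ≈ [g'(λ)]^2 Var(X̄) = λe^(-2λ)/n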

zero.pois draws the histogram with the approximate density.
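A quick simulation can be used to check the approximation Var[e^(-X̄)] ≈ λe^(-2λ)/n. The following sketch is not the routine zero.pois and draws no histogram; the function names, the values λ = 2 and n = 50, the number of repetitions and the simple Poisson generator (Knuth's multiplication method) are assumptions chosen for illustration.

```python
import math
import random
import statistics

def poisson_draw(lam, rng):
    """One Poisson(lam) variate by Knuth's multiplication method (fine for small lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_zero_prob_mle(lam, n, reps, seed=0):
    """Simulate reps values of the mle exp(-xbar) of P(X=0)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        xbar = sum(poisson_draw(lam, rng) for _ in range(n)) / n
        estimates.append(math.exp(-xbar))
    return estimates

lam, n = 2.0, 50
est = simulate_zero_prob_mle(lam, n, reps=2000)
sim_mean, sim_sd = statistics.mean(est), statistics.pstdev(est)
approx_sd = math.exp(-lam) * math.sqrt(lam / n)  # sqrt(lam * exp(-2 lam) / n)
print(sim_mean, math.exp(-lam))
print(sim_sd, approx_sd)
```

For these settings the simulated mean and standard deviation should be close to, but not exactly equal to, the approximate values, because the approximation ignores terms of smaller order in 1/n.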

Bayesian Point Estimation

We have already seen how to use a Bayesian approach to find point estimators, namely by using the mean of the posterior distribution. Of course one could also use the median or any other measure of central tendency. A popular choice, for example, is the mode of the posterior distribution.

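Example: Let's say we have X_1, ..., X_n ~ Ber(p) and p ~ Beta(α,β). Then we already know that

p|x_1, ..., x_n ~ Beta(α + ∑x_i, n - ∑x_i + β)

and so, writing k = ∑x_i, we can estimate p as follows:

posterior mean: (α + k)/(α + β + n)
posterior mode: (α + k - 1)/(α + β + n - 2), if α + k > 1 and n - k + β > 1

As n→∞ the prior parameters α and β become negligible, and the posterior mean and mode clearly approach the sample proportion k/n. In fact so does the median, though that is somewhat more complicated to show.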
This is implemented in berbeta
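For a Beta(a,b) distribution the mean is a/(a+b) and, if a > 1 and b > 1, the mode is (a-1)/(a+b-2), so the two estimators are simple to compute. The sketch below is not the routine berbeta; the function name and the data are made up for illustration.

```python
def beta_posterior_estimates(xs, alpha, beta):
    """Posterior mean and mode of p for Bernoulli data xs and a Beta(alpha, beta) prior."""
    n, k = len(xs), sum(xs)
    a, b = alpha + k, beta + n - k  # parameters of the posterior Beta
    mean = a / (a + b)
    # The mode formula only applies when both parameters exceed 1.
    mode = (a - 1) / (a + b - 2) if a > 1 and b > 1 else None
    return mean, mode

xs = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]  # k = 7 successes in n = 10 trials
post_mean, post_mode = beta_posterior_estimates(xs, alpha=2.0, beta=2.0)
print(post_mean, post_mode, sum(xs) / len(xs))
```

With the Beta(2,2) prior both estimates are pulled from the sample proportion 0.7 towards the prior mean 0.5.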

Example Let's say we have X1, .., Xn~N(μ,σ) and we want to estimate both μ and σ. So we need priors on both parameters:

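For μ let's use the improper prior g(μ) = 1 for all μ, and for σ we will use Jeffreys' prior p(σ) = 1/σ, σ > 0, and we will assume that the priors for μ and σ are independent. We then get the joint prior on (μ,σ) to be proportional to 1/σ on R × R^+. Therefore

p(μ,σ|x) ∝ (1/σ) σ^(-n) exp(-∑ (x_i - μ)^2/(2σ^2)) = σ^(-n-1) exp(-[(n-1)s^2 + n(x̄ - μ)^2]/(2σ^2))

where s^2 = ∑ (x_i - x̄)^2/(n-1) is the sample variance. As a function of μ this is symmetric around x̄, and so if we use the mean of the posterior distribution we get the sample mean as the estimator of the population mean.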

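How about σ? The marginal posterior of σ^2 turns out to be a scaled inverse-χ^2 distribution, that is, the distribution of (n-1)s^2/Z where Z ~ χ^2(n-1) and s^2 = ∑ (x_i - x̄)^2/(n-1) is the sample variance. Its scale parameter is s^2 and its mean is (n-1)s^2/(n-3) for n > 3, which is close to s^2 unless n is small.

We see that with these priors the Bayesian and Frequentist (maximum likelihood) estimators are the same for μ and nearly the same for σ^2. If we use such "flat" priors that often turns out to be the case.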