One of the most important "objects" in Statistics is the likelihood function, defined as follows: let X=(X1, .., Xn) be a random vector with joint pdf f(x|θ). Then the likelihood function L is defined as
L(θ|x)=f(x|θ)
This must be one of the most deceptively simple equations in math; at first it seems to be just a change in notation, L instead of f. What really matters, and makes a huge difference, is that in the pdf we consider the x's as variables and θ as fixed, whereas in the likelihood function we consider θ as the variable and the x's as fixed. Essentially we consider what happens after an experiment is done, that is, what the data can tell us about the parameter.
Things simplify a bit if X1, .., Xn is an iid sample. Then the joint density is given by f(x|θ)=∏f(xi|θ).
Example X1, .., Xn~Ber(p): L(p|x)=∏p^xi·(1-p)^(1-xi)=p^(∑xi)·(1-p)^(n-∑xi), 0<p<1
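Example X1, .., Xn~N(μ,σ), with σ the standard deviation: L(μ,σ|x)=∏(2πσ²)^(-1/2)·exp(-(xi-μ)²/(2σ²))=(2πσ²)^(-n/2)·exp(-∑(xi-μ)²/(2σ²))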
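Example X1, .., Xn~Γ(α,β), taking β to be the scale parameter: L(α,β|x)=∏xi^(α-1)·exp(-xi/β)/(Γ(α)·β^α)=(Γ(α)·β^α)^(-n)·(∏xi)^(α-1)·exp(-∑xi/β)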
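Example Y1~N(μ1,σ1), Y2~N(μ2,σ2), Z~Ber(p) and X=(1-Z)Y1+ZY2, with Z independent of Y1 and Y2. Then X has the mixture density f(x|p,μ1,σ1,μ2,σ2)=(1-p)·φ(x;μ1,σ1)+p·φ(x;μ2,σ2), where φ(x;μ,σ) is the N(μ,σ) density, so for an iid sample X1, .., Xn we get L(p,μ1,σ1,μ2,σ2|x)=∏[(1-p)·φ(xi;μ1,σ1)+p·φ(xi;μ2,σ2)]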
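To make the change of viewpoint concrete for the Ber(p) example, here is a minimal Python sketch. The sample `xs` and the function names are made up for illustration; the point is only that the data stay fixed while p varies, and that for an iid sample the likelihood is the product of the individual pmf values.

```python
def bernoulli_pmf(x, p):
    """Ber(p) pmf at a single observation x in {0, 1}."""
    return p if x == 1 else 1.0 - p


def likelihood(p, xs):
    """L(p|x): product of the individual pmf values for an iid sample."""
    value = 1.0
    for x in xs:
        value *= bernoulli_pmf(x, p)
    return value


# Hypothetical observed sample (7 successes in 10 trials); it stays fixed.
xs = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

# Only the parameter p varies.
for p in (0.3, 0.5, 0.7, 0.9):
    print(f"p={p:.1f}  L(p|x)={likelihood(p, xs):.6f}")
```

Among the four values tried, p=0.7 gives the largest likelihood, which matches the observed proportion of successes.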
Example An urn contains N balls. ni of these balls have the number "i" on them, i=1,..,k, and ∑ni=N. Say we randomly select a ball from the urn, note its number and put it back. We repeat this m times. Let the rv Xi be the number of draws that showed the number i, and let X=(X1, .., Xk).
Now the pmf of X is given by
P(X1=x1,..,Xk=xk)=m!/(x1!·..·xk!)·(n1/N)^x1·..·(nk/N)^xk
for any x1,..,xk with xi≥0 and ∑xi=m.
Now let's assume we don't know n1,..,nk and want to estimate them. First we can make a slight change in the parameterization: pi=ni/N, i=1,..,k. The resulting random vector is called the multinomial rv with parameters m, p1, .., pk.
Note that if k=2 then X1~Bin(m,p1) and X2=m-X1~Bin(m,p2), so the multinomial is a generalization of the binomial.
The likelihood function is given by
L(p1,..,pk|x)=m!/(x1!·..·xk!)·p1^x1·..·pk^xk
where p1+..+pk=1 and x1+..+xk=m.
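As a rough numerical illustration of the multinomial likelihood, the sketch below evaluates it for fixed counts and two candidate parameter vectors. The counts and the function name `multinomial_likelihood` are hypothetical and chosen only for this example.

```python
import math


def multinomial_likelihood(ps, xs):
    """L(p1..pk|x) = m!/(x1!...xk!) * p1^x1 * ... * pk^xk with m = sum(xs)."""
    m = sum(xs)
    coef = math.factorial(m)
    for x in xs:
        coef //= math.factorial(x)  # stays an exact integer at every step
    prob = 1.0
    for p, x in zip(ps, xs):
        prob *= p ** x
    return coef * prob


# Hypothetical counts from m = 10 draws with k = 3 labels.
xs = [5, 3, 2]

print(multinomial_likelihood([0.5, 0.3, 0.2], xs))
print(multinomial_likelihood([1 / 3, 1 / 3, 1 / 3], xs))
```

With k=2 the same function reproduces the binomial pmf, in line with the remark that the multinomial generalizes the binomial.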
There is a common misconception about the likelihood function: because it is the same as the pdf (pmf) it has the same properties. This is not true because the likelihood function is a function of the parameters, not the variables.
Example X~Ber(p), so f(x)=(1-p)^(1-x)·p^x, x=0,1, 0<p<1
As a function of x with a fixed p we have f(x)≥0 for all x and f(0)+f(1)=1, but as a function of p with a fixed x, say x=1, we have L(p|1)=p and ∫L(p|1)dp=1/2 (integrating over 0<p<1), so the likelihood does not integrate to 1.
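The difference between the two viewpoints in the Ber(p) example can be checked numerically. The following Python sketch sums the pmf over x for a fixed p, and approximates the integral of the likelihood over p for fixed x=1 with a simple midpoint rule; the value p=0.3 and the number of grid cells are arbitrary choices.

```python
def bernoulli_pmf(x, p):
    """f(x|p) = (1-p)^(1-x) * p^x for x in {0, 1}."""
    return (1 - p) ** (1 - x) * p ** x


# As a function of x (p fixed): the pmf values add up to 1.
p = 0.3
total_over_x = bernoulli_pmf(0, p) + bernoulli_pmf(1, p)

# As a function of p (x = 1 fixed): midpoint-rule approximation of the
# integral of L(p|1) over 0 < p < 1.
cells = 10_000
integral_over_p = sum(bernoulli_pmf(1, (i + 0.5) / cells) for i in range(cells)) / cells

print(total_over_x, integral_over_p)
```

The first number is 1 up to rounding, the second is close to 1/2, so the likelihood is not a density in p.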
It turns out that for many problems the log of the likelihood function is a more manageable entity, mainly because it turns the product into a sum: for an iid sample, l(θ|x)=log L(θ|x)=∑log f(xi|θ).
Example X1, .., Xn~Ber(p):
l(p|x)=log[p^(∑xi)·(1-p)^(n-∑xi)]=(∑xi)·log p+(n-∑xi)·log(1-p)
(worry about Xi=0 for all i or Xi=1 for all i yourself)
The log-likelihood function is drawn in like(1,p=0.5)
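The routine like is not reproduced here. As a stand-in, this Python sketch tabulates the Bernoulli log-likelihood on a grid of p values instead of drawing it; the sample `xs`, the grid and the function name are assumptions made for illustration.

```python
import math


def bernoulli_loglik(p, xs):
    """l(p|x) = (sum xi) * log p + (n - sum xi) * log(1 - p), for 0 < p < 1."""
    s, n = sum(xs), len(xs)
    return s * math.log(p) + (n - s) * math.log(1 - p)


# Hypothetical sample with 6 successes in 10 trials.
xs = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1]

# Grid over the open interval (0, 1); the endpoints are excluded because
# log(0) is undefined.
grid = [i / 100 for i in range(1, 100)]
best_p = max(grid, key=lambda p: bernoulli_loglik(p, xs))

for p in (0.2, 0.4, 0.6, 0.8):
    print(f"p={p:.1f}  l(p|x)={bernoulli_loglik(p, xs):.4f}")
print("grid point with the largest log-likelihood:", best_p)
```

On this grid the largest value occurs at the sample proportion of successes.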
Example X1, .., Xn~N(μ,σ), with σ the standard deviation: l(μ,σ|x)=-(n/2)·log(2π)-n·log σ-∑(xi-μ)²/(2σ²)
The log-likelihood function is drawn in like(2,mu=0,sig=1)
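One practical reason the log-likelihood is more manageable shows up in floating point arithmetic. The sketch below, which assumes σ is the standard deviation and uses artificial data generated with `math.sin`, compares the raw product of normal densities with the sum form of the log-likelihood.

```python
import math


def normal_pdf(x, mu, sigma):
    """N(mu, sigma) density, sigma being the standard deviation (assumed)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def normal_loglik(mu, sigma, xs):
    """l(mu, sigma|x) = -(n/2) log(2 pi) - n log(sigma) - sum (xi - mu)^2 / (2 sigma^2)."""
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi) - n * math.log(sigma) - ss / (2 * sigma ** 2)


# Artificial data, used only to have 2000 numbers in [-1, 1].
xs = [math.sin(i) for i in range(2000)]

# Likelihood as a raw product: every factor is below 0.4, so the product
# underflows to exactly 0.0 in double precision.
product = 1.0
for x in xs:
    product *= normal_pdf(x, 0.0, 1.0)

# The sum form stays finite.
loglik = normal_loglik(0.0, 1.0, xs)
print(product, loglik)
```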
Example X1, .., Xn~Γ(α,β), taking β to be the scale parameter: l(α,β|x)=-n·log Γ(α)-n·α·log β+(α-1)·∑log xi-∑xi/β
The log-likelihood function is drawn in like(3,alpha=1,beta=1)
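A minimal Python sketch of the Gamma log-likelihood follows. It assumes that β is a scale parameter, that is f(x|α,β)=x^(α-1)·exp(-x/β)/(Γ(α)·β^α); if the notes use β as a rate instead, β has to be replaced by 1/β. The data and the function name are hypothetical.

```python
import math


def gamma_loglik(alpha, beta, xs):
    """Gamma log-likelihood with shape alpha and scale beta (assumed)."""
    n = len(xs)
    return (
        -n * math.lgamma(alpha)               # -n log Gamma(alpha)
        - n * alpha * math.log(beta)          # -n alpha log beta
        + (alpha - 1) * sum(math.log(x) for x in xs)
        - sum(xs) / beta
    )


# Hypothetical positive observations.
xs = [0.5, 1.2, 2.0]

# With alpha = 1 and beta = 1 the model is the standard exponential,
# so the log-likelihood reduces to -sum(xs).
print(gamma_loglik(1.0, 1.0, xs))
print(gamma_loglik(2.0, 0.5, xs))
```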
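Example Y1~N(μ1,σ1), Y2~N(μ2,σ2), Z~Ber(p) and X=(1-Z)Y1+ZY2, with Z independent of Y1 and Y2: for an iid sample X1, .., Xn we have l(p,μ1,σ1,μ2,σ2|x)=∑log[(1-p)·φ(xi;μ1,σ1)+p·φ(xi;μ2,σ2)], where φ(x;μ,σ) is the N(μ,σ) density. Here the log does not simplify the individual terms because each term is itself a sum.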
The log-likelihood function is drawn in like(4,p=0.5,mu=0,sig=1,mu2=2,sig2=1)
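For the mixture example, a log-likelihood can be sketched in Python as below. This assumes Z is independent of Y1 and Y2, so that X has density (1-p)·φ(x;μ1,σ1)+p·φ(x;μ2,σ2), and that σ1, σ2 are standard deviations; the data and function names are illustrative only.

```python
import math


def normal_pdf(x, mu, sigma):
    """N(mu, sigma) density, sigma being the standard deviation (assumed)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def mixture_loglik(p, mu1, sig1, mu2, sig2, xs):
    """Sum over the sample of log[(1-p) phi(xi; mu1, sig1) + p phi(xi; mu2, sig2)]."""
    return sum(
        math.log((1 - p) * normal_pdf(x, mu1, sig1) + p * normal_pdf(x, mu2, sig2))
        for x in xs
    )


# Hypothetical observations.
xs = [-0.4, 0.1, 0.3, 1.8, 2.2, 2.5]

# Same parameter values as in the call like(4,p=0.5,mu=0,sig=1,mu2=2,sig2=1).
print(mixture_loglik(0.5, 0.0, 1.0, 2.0, 1.0, xs))
```

Two sanity checks follow directly from the formula: with p=0 the mixture reduces to the N(μ1,σ1) model, and with identical components the value does not depend on p.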
Example say X has a multinomial distribution with parameters m, p1,..,pk, then l(p1,..,pk|x)=log m!-∑log xi!+∑xi·log pi
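Likelihood Principle: if x and y are two sample points such that L(θ|x)=C(x,y)·L(θ|y) for all θ∈Θ, where the constant C(x,y) does not depend on θ, then the conclusions drawn from x and y should be identical.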
So if two sample points have proportional likelihoods, they contain the same information about the parameter.
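Example say X1, .., Xn~N(μ,σ), Y1, .., Yn~N(μ,σ) and assume σ is known. Writing x̄ and ȳ for the sample means and using ∑(xi-μ)²=∑(xi-x̄)²+n(x̄-μ)², we have
L(μ|x)=(2πσ²)^(-n/2)·exp(-∑(xi-x̄)²/(2σ²))·exp(-n(x̄-μ)²/(2σ²))
so if x̄=ȳ then L(μ|x)=C(x,y)·L(μ|y) with C(x,y)=exp(-[∑(xi-x̄)²-∑(yi-ȳ)²]/(2σ²)), which does not depend on μ. According to the likelihood principle, if two experiments with the probability model N(μ,σ), σ known, observe the same sample means, they should give the same result.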
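The proportionality in the normal example can be checked numerically. In this Python sketch the two samples are made up so that both have mean 3, σ=1 is treated as known, and the likelihood ratio is computed through log-likelihoods for several values of μ.

```python
import math


def normal_loglik(mu, sigma, xs):
    """Log-likelihood of N(mu, sigma), sigma being the standard deviation."""
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi) - n * math.log(sigma) - ss / (2 * sigma ** 2)


# Two hypothetical samples with the same sample mean (3) but different spread.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 2.5, 3.0, 3.5, 4.0]
sigma = 1.0  # treated as known

# L(mu|x) / L(mu|y) for several values of mu.
ratios = [
    math.exp(normal_loglik(mu, sigma, xs) - normal_loglik(mu, sigma, ys))
    for mu in (0.0, 1.0, 3.0, 5.0)
]

# The constant C(x, y) from the sums of squares around the sample means.
ss_x = sum((x - 3.0) ** 2 for x in xs)
ss_y = sum((y - 3.0) ** 2 for y in ys)
constant = math.exp(-(ss_x - ss_y) / (2 * sigma ** 2))

print(ratios, constant)
```

The ratio is the same for every μ and agrees with the constant, which is what proportional likelihoods mean.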
Example Consider the following problem: we have a Bernoulli trial with success parameter p, and we wish to estimate p.
Experiment 1: in this experiment we repeat the Bernoulli trial 20 times, so the rv X~Bin(20,p). We find x=7. Therefore
L1(p|7)=(20 choose 7)·p^7·(1-p)^13
Experiment 2: in this experiment we repeat the Bernoulli trials until the 7th success, so the rv Y~NegBin(7,p). We find y=20, therefore
L2(p|20)=(19 choose 6)·p^7·(1-p)^13
So now L1(p|7)=C(7,20)·L2(p|20) with C(7,20)=(20 choose 7)/(19 choose 6)=20/7, which does not depend on p, and so according to the likelihood principle both experiments should result in the same estimate of p, regardless of the fact that we performed completely different experiments.
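The two experiments can be compared numerically with a short Python sketch. It takes NegBin(7,p) to count the number of trials needed to reach the 7th success, as described in Experiment 2; the function names and the grid are choices made for this illustration.

```python
import math


def binom_lik(p, n=20, x=7):
    """Experiment 1: x successes in a fixed number n of trials."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)


def negbin_lik(p, r=7, y=20):
    """Experiment 2: the r-th success occurs on trial y, so the last trial
    is a success and the first y-1 trials contain r-1 successes."""
    return math.comb(y - 1, r - 1) * p ** r * (1 - p) ** (y - r)


# The ratio of the two likelihoods for several values of p.
ratios = [binom_lik(p) / negbin_lik(p) for p in (0.1, 0.35, 0.5, 0.9)]

# Both likelihoods are maximized at the same grid point.
grid = [i / 100 for i in range(1, 100)]
best1 = max(grid, key=binom_lik)
best2 = max(grid, key=negbin_lik)

print(ratios, best1, best2)
```

The ratio equals 20/7 for every p, and both likelihoods peak at the same value of p, as the likelihood principle requires.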
The likelihood principle is a good general principle for a statistical procedure, but there are common situations where it is violated. For example, an important task in Statistics is model checking. Say we have the following probability model: X1, .., Xn~N(μ,σ), and we want to estimate μ. But then we worry whether our data really follow a normal distribution, so we do some checking, for example draw a boxplot. This, though, violates the likelihood principle, because for one dataset we might decide that the normal assumption is wrong whereas for another we might accept it, even though both have the same sample mean.