Bayesian Statistics

In the classical, or frequentist, approach to Statistics we consider a parameter θ to be a fixed although unknown quantity. A random sample X1, .., Xn is drawn from a population indexed by θ and, based on the observed values in the sample, knowledge about the true value of θ is obtained. In the Bayesian approach θ is considered a quantity whose variation can be described by a probability distribution (called a prior distribution). This is a subjective distribution describing the experimenter's belief, and it is formulated before the data is seen. A sample is then taken from a population indexed by θ and the prior distribution is updated with this new information. The updated distribution is called the posterior distribution. This updating is done using Bayes' formula, hence the name Bayesian Statistics.

Let's denote the prior distribution by π(θ) and the sampling distribution by f(x|θ). Then the joint pdf (pmf) of X and θ is given by

f(x,θ)=f(x|θ)π(θ)

the marginal distribution of X is

m(x)=∫ f(x|θ)π(θ)dθ

and finally the posterior distribution is the conditional distribution of θ given the sample x and is given by

π(θ|x) = f(x|θ)π(θ)/m(x)

We can also write this in terms of the likelihood function:

π(θ|x1,..,xn) = L(θ|x1,..,xn)π(θ)/m(x)
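The three steps above (joint, marginal, posterior) can be sketched numerically on a grid of θ values. This is only an illustration: the uniform prior, the Bernoulli sampling distribution, the data vector x and all variable names are assumptions made for the sketch, not part of the text.

```r
# Grid approximation of Bayes' formula (illustrative sketch).
theta <- seq(0.001, 0.999, length.out = 999)  # grid of parameter values
width <- theta[2] - theta[1]                  # grid spacing

prior <- dunif(theta)                         # assumed prior pi(theta)
x <- c(1, 0, 0, 1, 0)                         # hypothetical Bernoulli sample

# Likelihood L(theta | x): product of f(x_i | theta) over the sample
lik <- sapply(theta, function(t) prod(dbinom(x, size = 1, prob = t)))

joint <- lik * prior                          # f(x, theta)
m_x <- sum(joint) * width                     # marginal m(x) as a Riemann sum
posterior <- joint / m_x                      # pi(theta | x)

sum(posterior) * width                        # should be (numerically) 1
theta[which.max(posterior)]                   # grid value with highest posterior density
```

The marginal m(x) acts only as a normalizing constant here: it does not depend on θ, so the shape of the posterior is already determined by likelihood times prior.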

Example: You want to see whether it is really true that coins come up heads and tails with probability 1/2. You take a coin from your pocket and flip it 10 times. It comes up heads 3 times. As frequentists we would now use the sample mean as an estimate of p, the true probability of heads, and find p̂ = 3/10 = 0.3.

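A Bayesian analysis would proceed as follows: let X1, .., Xn be iid Ber(p). Then Y = X1+..+Xn is Bin(n,p). Now we need a prior on p. Of course p is a probability, so it has values in [0,1]. One distribution on [0,1] we know is the Beta, so we will use a Beta(α,β) as our prior. Remember, this is a purely subjective choice, and anybody can use their own. The joint distribution of Y and p is given by

f(y,p) = C(n,y) p^y (1-p)^(n-y) · Γ(α+β)/(Γ(α)Γ(β)) p^(α-1) (1-p)^(β-1) = C(n,y) Γ(α+β)/(Γ(α)Γ(β)) p^(y+α-1) (1-p)^(n-y+β-1)

where C(n,y) is the binomial coefficient. Integrating out p gives the marginal distribution of Y,

m(y) = C(n,y) Γ(α+β)/(Γ(α)Γ(β)) · Γ(y+α)Γ(n-y+β)/Γ(n+α+β)

which is known as the beta-binomial distribution.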

Note that (Y,p) is a random vector where one component is continuous (p) and the other is discrete (Y). So here we are combining a pdf with a pmf. It turns out that this is ok.
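A quick numerical check that mixing a pmf with a pdf causes no trouble: integrating the product of the Binomial pmf and the Beta pdf over p for each y, and then summing over y, should give 1. The values n = 10, a = 2, b = 3 (standing for α and β) are arbitrary choices for this sketch.

```r
n <- 10; a <- 2; b <- 3     # illustrative values, not taken from the text

# Joint "density" of (Y, p): Binomial pmf in y times Beta pdf in p
joint <- function(p, y) dbinom(y, n, p) * dbeta(p, a, b)

# Marginal pmf of Y: integrate the joint over p for each y = 0, ..., n
m <- sapply(0:n, function(y) integrate(joint, 0, 1, y = y)$value)

# Closed form of the same marginal (beta-binomial pmf), using the Beta function
m_exact <- choose(n, 0:n) * beta(0:n + a, n - 0:n + b) / beta(a, b)

sum(m)                      # total probability, should be close to 1
max(abs(m - m_exact))       # numerical and closed form should agree
```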

The posterior distribution of p given y is then

π(p|y) = f(y,p)/m(y) = Γ(n+α+β)/(Γ(y+α)Γ(n-y+β)) p^(y+α-1) (1-p)^(n-y+β-1)

which is a Beta(y+α, n-y+β) distribution.

Of course we still need to "extract" some information about the parameter p from the posterior distribution. Once the sampling distribution and the prior are chosen, the posterior distribution is fixed (even though it may not be easy or even possible to find it analytically), but how we proceed now is completely open and there are in general many choices. If we want to estimate p, a natural estimator is the mean of the posterior distribution, given here by

B = (y+α)/(α+β+n)
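For a Binomial sample with a Beta(α,β) prior the posterior is the Beta(y+α, n-y+β) distribution, whose mean is the B above. The following sketch checks both statements numerically by normalizing likelihood times prior on a grid. The prior parameters a = 2 and b = 3 are arbitrary assumptions for the illustration.

```r
n <- 10; y <- 3; a <- 2; b <- 3            # a, b: illustrative prior parameters
h <- 0.001
p <- (1:1000 - 0.5) * h                    # midpoints of a grid on [0, 1]

joint <- dbinom(y, n, p) * dbeta(p, a, b)  # f(y | p) * pi(p)
post_grid <- joint / (sum(joint) * h)      # divide by the marginal m(y)
post_exact <- dbeta(p, y + a, n - y + b)   # Beta(y + a, n - y + b) density

max(abs(post_grid - post_exact))           # discretization error only
sum(p * post_grid) * h                     # posterior mean from the grid
(y + a) / (a + b + n)                      # posterior mean from the formula
```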

This can be written as

B = [n/(α+β+n)] · (y/n) + [(α+β)/(α+β+n)] · (α/(α+β))

and we see that the posterior mean is a linear combination (in fact a weighted average) of the prior mean α/(α+β) and the sample mean y/n.

How about our problem with the 3 heads in the 10 flips? Well, we have to completely specify the prior distribution, that is, we have to choose α and β. The choice depends again on our belief. For example, if we feel strongly that this coin is just like any other coin and therefore really should be a fair coin, we should choose them so that the prior puts almost all its weight around 1/2. For example, with α=β=100 we get E[p]=0.5 and V[p]=0.0012. Then B = (3+100)/(100+100+10) = 0.4905 is our estimate for the probability of heads. Clearly for such a strong prior the actual sample almost does not matter. For example, for y=0 we would have found B = 0.476 and for y=10 it would be B = 0.524.

Maybe we have never even heard the word "coin" and have no idea what one looks like, let alone what the probability of "heads" might be. Then we could choose α=β=1, that is the uniform distribution, as our prior. This would indicate our complete lack of knowledge regarding p. (This is called an uninformative prior.) Now we find B = (3+1)/(1+1+10) = 0.333, which is close to the sample mean 0.3, pulled slightly towards the prior mean 1/2. (The mode of this posterior, 3/10, is exactly the sample mean.)
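The numbers for the two priors can be recomputed directly from the formulas for the posterior mean and for the variance of a Beta distribution. The helper names post_mean, post_mode and beta_var are made up for this sketch.

```r
# Posterior mean and mode for y successes in n trials with a Beta(a, b) prior
post_mean <- function(y, n, a, b) (y + a) / (a + b + n)
post_mode <- function(y, n, a, b) (y + a - 1) / (a + b + n - 2)

# Variance of a Beta(a, b) distribution
beta_var <- function(a, b) a * b / ((a + b)^2 * (a + b + 1))

n <- 10
beta_var(100, 100)                    # prior variance for a = b = 100
post_mean(c(0, 3, 10), n, 100, 100)   # strong prior: estimates stay near 1/2
post_mean(3, n, 1, 1)                 # uniform prior: 4/12
post_mode(3, n, 1, 1)                 # uniform prior: 3/10, the sample mean
```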

In bayescoin with which==1 we study the effect of the sample size on the estimate of p.

In bayescoin with which==2 we study the effect of alpha=beta on the estimate of p. A larger alpha means a prior more concentrated around 1/2.
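The code of bayescoin is not shown here. As a rough stand-in (not the routine itself), both effects can be seen from the posterior mean formula alone. The grids of sample sizes and prior parameters below, and the choice to keep the share of heads at 30%, are assumptions for this sketch.

```r
post_mean <- function(y, n, a, b) (y + a) / (a + b + n)

# (1) Effect of the sample size: 30% heads throughout, prior fixed at a = b = 100
n <- c(10, 100, 1000, 10000)
by_n <- post_mean(0.3 * n, n, 100, 100)

# (2) Effect of the prior: y = 3 heads in 10 flips, a = b varies
ab <- c(1, 5, 20, 100)
by_prior <- post_mean(3, 10, ab, ab)

round(rbind(n = n, estimate = by_n), 3)           # moves towards 0.3 as n grows
round(rbind(alpha = ab, estimate = by_prior), 3)  # moves towards 0.5 as alpha grows
```

With more data the estimate is pulled towards the sample mean, while a more concentrated prior pulls it towards 1/2.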

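Example: Say X~Bin(n,p) with p known and n unknown. Again, a Bayesian analysis begins with a prior, this time on n. Now n = 1, 2, .. and so a prior is any sequence a1, a2, .. with ai≥0 and ∑ai=1. The posterior probability of n=k is then proportional to ak times the binomial probability of observing x successes in k trials, ak·C(k,x) p^x (1-p)^(k-x), which is 0 for k<x.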

If we want to find an estimate for n we can use for example the mode, that is the n which has the largest posterior probability.

Here are some specific examples: say we observe x=217 and we know p=0.37. In addition,

a) we know only that n≤750, so we choose ai=1/750 if 1≤i≤750, 0 otherwise,
bayes.bin.n(217,0.37,rep(1/750,750))

b) we know n is most likely 500 with a standard deviation of 50,
bayes.bin.n(217,0.37,dnorm(1:750,500,50))

c) we know that n≤750 and that n is a multiple of 50,
bayes.bin.n(217,0.37,ifelse(c(1:750)%%50,0,1))

d) we know this was one of the four experiments we did, with n=510, 525, 550 or 575
a=rep(0,750)
a[c(510,525,550,575)]=1
bayes.bin.n(217,0.37,a)
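The code of bayes.bin.n is not shown here. A minimal stand-in, under the assumption that it multiplies the prior by the binomial likelihood and normalizes, could look like the following. The name posterior_n is hypothetical, and the real routine may do more, for example draw the posterior.

```r
# Hypothetical stand-in for a routine like bayes.bin.n (its real code is not shown).
posterior_n <- function(x, p, prior) {
  n <- seq_along(prior)             # candidate values n = 1, 2, ...
  post <- prior * dbinom(x, n, p)   # prior times likelihood; 0 wherever n < x
  post / sum(post)                  # normalize, so the prior need not sum to 1
}

# Case a): flat prior on 1, ..., 750
post_a <- posterior_n(217, 0.37, rep(1/750, 750))
which.max(post_a)                   # posterior mode of n

# Case d): only four candidate values of n
a <- rep(0, 750)
a[c(510, 525, 550, 575)] <- 1
post_d <- posterior_n(217, 0.37, a)
which.max(post_d)
round(post_d[c(510, 525, 550, 575)], 3)
```

Because of the normalization in the last step, unnormalized priors such as dnorm(1:750,500,50) or a 0/1 indicator vector can be passed in directly, as in cases b) to d).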

The Big Question: Bayesian or Frequentist?

Should you be a Bayesian?

Bayesian Statistics has a lot of good features. To begin with, it answers the right question, P(Hypothesis|Data). There are others as well:

Decision Theory

There is a branch of mathematics concerned with decision making. It is conceptually a very useful and important one:

• Should you buy a new car, or keep the old one for another year?
• Should you invest your money into the stock market or buy fixed-interest bonds?
• Should the government lower taxes or instead use the tax revenue for direct investments?

In decision theory one starts out by choosing a loss function, that is a function that assigns a value (maybe in terms of money) to every possible action and every possible outcome.

Example: You are offered the following game: you can either take $10 (let's call this action a), or you can flip a coin (action b). If the coin comes up heads you win $50, if it comes up tails you lose $10. So there are two possible actions, take the $10 or flip the coin, and three possible outcomes: you win $10, you win $50 or you lose $10. We need a value for each combination. One obvious answer is this one:

L(a)=10, L(b,"heads")=50, L(b,"tails")=-10

But there are other possibilities. Say you are in a bar. You already had food and drinks and your tab is $27. Now you notice that you only have $8 in your pocket (and no credit card etc.). Now if you win or lose $10 it doesn't matter: either way you can't pay your bill, and it will be very embarrassing when it comes to paying. But if you win $50, you are fine. Now your loss function might be:

L(a)=0, L(b,"heads")=1000, L(b,"tails")=0
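One way to compare the two actions under either assignment is to average the values over the coin flip. The sketch below treats the numbers as payoffs where larger is better, as the examples suggest, and assumes the coin is fair. Both the averaging rule and the names used are assumptions for the illustration.

```r
p_heads <- 0.5   # assumption: the coin is fair

# Expected value of each action, given the values assigned to the outcomes
expected_value <- function(L_a, L_heads, L_tails) {
  c(a = L_a, b = p_heads * L_heads + (1 - p_heads) * L_tails)
}

money <- expected_value(10, 50, -10)   # values equal to the money won or lost
bar   <- expected_value(0, 1000, 0)    # the values from the bar-tab story

money
bar
names(which.max(money))                # action with the larger expected value
names(which.max(bar))
```

Flipping the coin comes out ahead under both assignments, but by a much wider margin in the bar-tab version, where only the $50 win has any value at all.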

The next piece in decision theory is the decision function. The idea is this: let's carry out an experiment, and depending on the outcome of the experiment we choose an action.

• Should you invest your money into the stock market or buy fixed-interest bonds?
Let's do this: we wait until tomorrow. If the Dow Jones goes up, we invest in the stock market, otherwise in bonds.

In decision theory a decision rule is called inadmissible if there is another rule that is at least as good whatever the true state of nature is, and strictly better for at least one state. Obviously it makes no sense to pick an inadmissible rule.

So what's the connection to Bayesian Statistics? First there are Bayesian decision rules, which combine prior knowledge with the outcome of the experiment.

• based on the movement of the Dow Jones in the last year, I have a certain probability that it will go up over the next year.

Now there is a famous theorem (the complete class theorem) that says that, under suitable conditions, all admissible rules are Bayesian decision rules for some prior (or limits of such rules).

Optimality

Obviously when we do something it would be nice to do it in an optimal (best) way. It turns out that in Bayesian statistics it is often possible to show that a certain method is best, that is, at least as good as any other.

Why to be a Frequentist

Because you hate priors

or, to put it better, because you don't like the subjectivity introduced by priors. In Bayesian statistics it is entirely possible that two scientists who have the same data available and use the same method of analysis come to different conclusions, because they have different priors.

Because Frequentist methods work

For most of the history of Statistics, that is from about 1900 to about 1960, there was (essentially) only Frequentist Statistics. In this time (and since) many methods have been developed that work very well in practice. Many of those turn out to be also Bayesian methods when the right prior is used, but not all!

Example: One of the most useful modern methods, called the Bootstrap, is a purely Frequentist method with no Bayesian theory. (Actually there is something called the Bayesian bootstrap, but it is not the same as the classical bootstrap.)

Example: A standard technique in regression is to study the residuals. This, though, violates the likelihood principle and is therefore not allowed under the Bayesian paradigm. Actually, most Bayesians study the residuals anyway.

Simplicity

Even for the easiest problems ("estimate the mean GPA of students at the Colegio") a Bayesian analysis always seems to be complicated (choose a prior and a loss function, calculate the posterior, extract the estimate from the posterior, try to do all of this optimally). Frequentist solutions are often quick and easy.

So? Be Both!