Estimation

Point Estimation

Here we will discuss some basic ideas of statistical inference. In this area we use the information available from a sample to make inferences (educated guesses) about the corresponding population. Often this means using a statistic to estimate the corresponding but unknown parameter.

Example: We did a survey of the students at the Colegio. For that we interviewed 150 randomly selected students. We really want to know the mean GPA of all the students at the Colegio (a parameter), and we use the mean GPA of the students in the sample (a statistic) as an estimate.

Each population parameter has a corresponding statistic and vice versa.
Sample mean - population mean
Sample median - population median
Sample standard deviation - population standard deviation
Sample 1st quartile - population 1st quartile
Sample correlation coefficient - population correlation coefficient
Sample least squares coefficients - population least squares coefficients
etc.

Each of these sample numbers is called a point estimate.
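
As a small illustration of the list above, the following sketch computes several point estimates in R. The data are simulated and purely hypothetical (they are not the Colegio survey), and the names gpa and point_estimates are made up for this example.

```r
# Hypothetical data: simulated GPAs standing in for a survey of 150 students.
# The values 2.6 and 0.6 are arbitrary choices for illustration only.
set.seed(1)
gpa <- rnorm(150, mean = 2.6, sd = 0.6)
gpa <- round(pmin(pmax(gpa, 0), 4), 2)   # keep values on the 0-4 GPA scale

# Each sample number below is a point estimate of the
# corresponding population parameter.
point_estimates <- c(
  mean   = mean(gpa),
  median = median(gpa),
  sd     = sd(gpa),
  q1     = unname(quantile(gpa, 0.25))
)
print(round(point_estimates, 3))
```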

Interval Estimation

In real life a point estimate is rarely enough. Usually we also need some estimate of the error in our estimate.
Example: A census of all the students at the Colegio 10 years ago showed a mean GPA of 2.75. In our survey of 150 students we find today a mean GPA of 2.53. Does this prove that today's GPAs are lower? Actually, the difficult part here is what we mean by "prove". After all, if we repeated our survey tomorrow with a different sample of 150 students, their mean GPA would almost certainly not be exactly 2.53 again. But how far away from 2.53 might it be? Could it actually be higher than 2.75? Looking at it from a different point of view: if the mean GPA has not changed, could a random sample of 150 students have a mean GPA as low as 2.53?

One way to answer such questions is to find an interval estimate rather than a point estimate. Specifically, we will consider a type of interval estimate called a confidence interval.

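We will learn about confidence intervals using the mean as an example. Here the formal definition is:
a 100(1-α)% confidence interval for the population mean is given by
$$\bar{x} \pm t_{n-1,\alpha/2}\,\frac{s}{\sqrt{n}}$$
where \bar{x} is the sample mean, s is the sample standard deviation and n is the sample size.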

First notice that the interval is given in the form point estimate ± error, which is quite often true in Statistics, although not always.

We already know all the ingredients of this formula with the exception of t_{n,α}, which is the 100(1-α)% quantile of a distribution called Student's t distribution with n degrees of freedom. In the context here it is often called a critical value. Quantiles in general are defined as follows: say the random variable X has distribution function F. Then the 100p% quantile of F is the solution x_p of F(x_p) = P(X ≤ x_p) = p. The t distribution is part of R, and its quantiles are found using the command qt(p, df).
Examples: t_{5,0.05} = qt(0.95, 5) = 2.015, t_{15,0.1} = qt(0.9, 15) = 1.3406
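
The two examples can be checked directly in R. The sketch below also uses pt, the distribution function of the t distribution, to confirm that the quantile solves F(x_p) = p. The variable names are made up for this illustration.

```r
# Critical values from the examples, using the upper-tail convention
# t_{n, alpha} = qt(1 - alpha, n).
t_5_05  <- qt(1 - 0.05, df = 5)    # about 2.015
t_15_10 <- qt(1 - 0.10, df = 15)   # about 1.3406

# pt is the distribution function F of the t distribution,
# so F evaluated at the 95% quantile should give back 0.95.
p_back <- pt(t_5_05, df = 5)

print(c(t_5_05 = t_5_05, t_15_10 = t_15_10, p_back = p_back))
```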

Example for the confidence interval: Say in our survey we found in our sample a mean GPA of 2.53 with a standard deviation 0.65. Find a 90% confidence interval for the mean GPA:
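Here n = 150 and 100(1-α)% = 90%, so α = 0.1 and α/2 = 0.05. The critical value is t_{149,0.05} = qt(0.95, 149) = 1.655, and the error is
$$t_{149,0.05}\,\frac{s}{\sqrt{n}} = 1.655 \cdot \frac{0.65}{\sqrt{150}} = 0.088$$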
and so our 90% confidence interval is (2.53-0.088, 2.53+0.088) = (2.442, 2.618)
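
The same calculation can be sketched in R from the summary statistics alone. The numbers come from the example; the variable names are chosen for this illustration.

```r
# Summary statistics from the survey example.
xbar <- 2.53   # sample mean GPA
s    <- 0.65   # sample standard deviation
n    <- 150    # sample size

conf_level <- 0.90
alpha <- 1 - conf_level

# Critical value with n - 1 degrees of freedom, then the error term.
crit   <- qt(1 - alpha / 2, df = n - 1)
margin <- crit * s / sqrt(n)

ci <- c(lower = xbar - margin, upper = xbar + margin)
print(round(ci, 3))
```

Changing conf_level to 0.99 repeats the calculation for a 99% interval.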

What does that mean: a 90% confidence interval for the mean is (2.442, 2.618)? The interpretation is this: suppose that over the next year statisticians (and other people using statistics) all over the world compute 100,000 90% confidence intervals, many for the mean, others maybe for medians or standard deviations or other parameters. Then about 90%, or about 90,000, of those intervals will actually contain the parameter that is supposed to be estimated; the other 10,000 or so will not.
It is tempting to interpret the confidence interval as follows: having found our 90% confidence interval of (2.442, 2.618), we are now 90% sure that the true mean GPA (the one for all the students at the Colegio) is somewhere between 2.442 and 2.618. Strictly speaking this interpretation is not correct, because once we have computed the interval (2.442, 2.618) the true mean GPA is either in it or not. Nevertheless, at least intuitively, this interpretation is also useful.
Let's do a simulation to illustrate this. We generate m data sets, each with n observations from a N(μ, σ²) distribution. Then we compute the 90% confidence interval for each of these m data sets. Finally we find the percentage of simulation runs where the confidence interval contains the true mean. The calculations are done in the function cisim.
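
The source of cisim is not shown here, so the following is only a minimal sketch of what such a simulation might look like, not the actual function. The name cisim_sketch, the default values and all argument names other than cl are assumptions made for this illustration.

```r
# A hypothetical stand-in for cisim (the real function is not shown).
# m: number of simulated data sets, n: observations per data set,
# mu, sigma: true mean and standard deviation, cl: confidence level.
cisim_sketch <- function(m = 10000, n = 20, mu = 0, sigma = 1, cl = 0.9) {
  alpha <- 1 - cl
  crit  <- qt(1 - alpha / 2, df = n - 1)

  # One data set per column: n rows, m columns.
  x <- matrix(rnorm(m * n, mean = mu, sd = sigma), nrow = n)

  xbar   <- colMeans(x)
  s      <- apply(x, 2, sd)
  margin <- crit * s / sqrt(n)

  # Does each interval contain the true mean?
  covered <- (xbar - margin <= mu) & (mu <= xbar + margin)

  # Percentage of runs in which the interval covers mu.
  100 * mean(covered)
}

set.seed(111)
print(cisim_sketch())
```

The printed percentage should be close to 90, and it varies a little from run to run because the data sets are random.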

But why would we be willing to accept a 10% chance of being "wrong", that is, of getting an interval that does not contain the true parameter? Well, we don't have to. After all, we chose to compute a 90% confidence interval. Instead we could have found a 99% confidence interval and left only a 1% chance of being "wrong". Here is what would happen:
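100(1-α)% = 99%, so 1-α = 0.99, α = 0.01, α/2 = 0.005, t_{149,0.005} = qt(0.995, 149) = 2.609
$$2.53 \pm 2.609 \cdot \frac{0.65}{\sqrt{150}} = 2.53 \pm 0.138$$
and so our 99% confidence interval is (2.392, 2.668).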
So this interval is wider: it has a width of 2*0.138 = 0.276, compared to a width of 2*0.088 = 0.176 for the 90% confidence interval. Finding confidence intervals therefore involves a trade-off: if we make the probability of being wrong smaller, we (almost always) make the interval wider. The only way to make an interval narrower without changing the confidence level is to get a larger data set!
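
The trade-off can be made concrete with a short sketch. The helper ci_width and the sample size of 600 are hypothetical choices for this illustration, and the standard deviation is assumed to stay at 0.65 for the larger sample. The printed widths can differ from the ones in the text in the last digit, because the text rounds the error before doubling it.

```r
# Width of the t confidence interval for the mean:
# twice the error term crit * s / sqrt(n).
ci_width <- function(cl, n, s) {
  alpha <- 1 - cl
  2 * qt(1 - alpha / 2, df = n - 1) * s / sqrt(n)
}

s <- 0.65
widths <- c(
  cl90_n150 = ci_width(0.90, 150, s),  # the 90% interval from the example
  cl99_n150 = ci_width(0.99, 150, s),  # higher confidence, same data: wider
  cl99_n600 = ci_width(0.99, 600, s)   # higher confidence, more data: narrower
)
print(round(widths, 3))
```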

cisim has an argument cl for confidence level, so we can check that this works for other levels as well.

In cisim we do all the calculations from scratch. In fact this confidence interval is already implemented in R: the t.test command reports it along with the results of the t-test.
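
As a quick check, the sketch below compares the interval reported by t.test with the one computed from the formula. The data are simulated and hypothetical; only t.test and its conf.level argument come from R itself.

```r
# Hypothetical data standing in for the GPAs of 150 sampled students.
set.seed(2)
gpa <- rnorm(150, mean = 2.5, sd = 0.65)

# Built-in: t.test returns the confidence interval in $conf.int.
fit <- t.test(gpa, conf.level = 0.90)
ci_builtin <- as.numeric(fit$conf.int)

# From scratch, using the formula for the mean.
n <- length(gpa)
margin <- qt(0.95, df = n - 1) * sd(gpa) / sqrt(n)
ci_manual <- c(mean(gpa) - margin, mean(gpa) + margin)

print(rbind(ci_builtin, ci_manual))
```

The two rows should agree up to floating point error.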