Example: We did a survey of the students at the Colegio. For that we interviewed 150 randomly selected students. We really want to know the mean GPA of all the students at the Colegio (a parameter), and we use the mean GPA of the students in the sample (a statistic) as an estimate.
Each population parameter has a corresponding statistic and vice versa.
Sample mean - population mean
Sample median - population median
Sample standard deviation - population standard deviation
Sample 1st quartile - population 1st quartile
Sample correlation coefficient - population correlation coefficient
Sample least squares coefficients - population least squares coefficients
etc.
Each of these sample numbers is called a point estimate.
A point estimate by itself does not tell us how far it might be from the true parameter. One way to address this is to find an interval estimate rather than a point estimate. Specifically, we will consider a type of interval estimate called a confidence interval.
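We will learn about confidence intervals using the mean as an example. Here the formal definition is:
A 100(1-a)% confidence interval for the population mean is given by
$$\bar{x} \pm t_{n-1,\,a/2}\,\frac{s}{\sqrt{n}}$$
where $\bar{x}$ is the sample mean, $s$ is the sample standard deviation and $n$ is the sample size.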
First notice that the interval is given in the form point estimate ± error, which is quite often true in Statistics, although not always.
We already know all the ingredients of this formula with the exception of $t_{n,a}$, which is the 100(1-a)% quantile of a distribution called Student's t distribution with n degrees of freedom. In the context here it is often called a critical value. (In the confidence interval for the mean it is used with n-1 degrees of freedom and with a/2 in place of a.) Quantiles in general are defined as follows: say the random variable X has distribution function F. Then the 100p% quantile of F is the solution $x_p$ of $F(x_p) = P(X \le x_p) = p$. The t distribution is part of R, and its quantiles are found using the command qt(p, df).
Examples: $t_{5,0.05}$ = qt(0.95, 5) = 2.015, $t_{15,0.1}$ = qt(0.9, 15) = 1.3406
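The two critical values above, and the quantile definition itself, can be checked directly in R. This is only a small sketch; the variable names are made up for illustration, and pt is R's distribution function of the t distribution (the F in the definition).

```r
# t_{n,a} is the 100(1-a)% quantile of the t distribution with n degrees of freedom
t_5_005 <- qt(1 - 0.05, df = 5)
t_15_01 <- qt(1 - 0.10, df = 15)
round(t_5_005, 3)
round(t_15_01, 4)

# Quantile definition: x_p solves F(x_p) = p, so pt() undoes qt()
p   <- 0.95
x_p <- qt(p, df = 5)
pt(x_p, df = 5)
```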
Example for the confidence interval: Say in our survey we found in our sample a mean GPA of 2.53 with a standard deviation of 0.65. Find a 90% confidence interval for the mean GPA:
100(1-a)% = 90%, so 1-a = 0.9, a = 0.1, a/2 = 0.05, $t_{149,0.05}$ = 1.655, so the error is 1.655 × 0.65/√150 = 0.088,
and so our 90% confidence interval is (2.53-0.088, 2.53+0.088) = (2.442, 2.618)
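The same calculation can be sketched in R from the summary statistics alone. The variable names below are illustrative, and the critical value uses n-1 = 149 degrees of freedom.

```r
# Summary statistics from the survey example
n    <- 150    # sample size
xbar <- 2.53   # sample mean GPA
s    <- 0.65   # sample standard deviation
cl   <- 0.90   # confidence level

a     <- 1 - cl
tcrit <- qt(1 - a / 2, df = n - 1)   # critical value t_{n-1, a/2}
err   <- tcrit * s / sqrt(n)         # the "error" in point estimate +- error
ci    <- c(lower = xbar - err, upper = xbar + err)
round(ci, 3)
```

Rounded to three digits, this should agree with the interval (2.442, 2.618) found by hand.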
What does that mean: a 90% confidence interval for the mean is (2.442, 2.618)? The interpretation is this: suppose that over the next year statisticians (and other people using statistics) all over the world compute 100,000 90% confidence intervals, many for the mean, others maybe for medians or standard deviations or other parameters. Then about 90% of those intervals, or about 90,000, will actually contain the parameter they are supposed to estimate; the other 10,000 or so will not.
It is tempting to interpret the confidence interval as follows: having found our 90% confidence interval of (2.442, 2.618), we are now 90% sure that the true mean GPA (the one for all the students at the Colegio) is somewhere between 2.442 and 2.618. Strictly speaking this interpretation is not correct, because once we have computed the interval (2.442, 2.618) the true mean GPA is either in it or not. Nevertheless, at least intuitively, this interpretation is also useful.
Let's do a simulation to illustrate this. We generate m data sets, each with n observations from a normal distribution N(μ, σ²). Then we compute the 90% confidence interval for each of these m data sets. Finally we find the percentage of simulation runs where the confidence interval contains the true mean μ. The calculations are done in the function cisim.
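The code of cisim is not reproduced in these notes, so the following is only a hypothetical reconstruction of what such a function might look like. The argument names m, n, mu and sigma and all default values are assumptions; cl stands for the confidence level.

```r
# Hypothetical sketch of a cisim-style function (not the original code);
# argument names and defaults are assumptions made for illustration.
cisim <- function(m = 10000, n = 20, mu = 0, sigma = 1, cl = 0.90) {
  a       <- 1 - cl
  tcrit   <- qt(1 - a / 2, df = n - 1)   # critical value, same for every run
  covered <- logical(m)
  for (i in seq_len(m)) {
    x   <- rnorm(n, mean = mu, sd = sigma)   # one simulated data set
    err <- tcrit * sd(x) / sqrt(n)
    covered[i] <- (mean(x) - err < mu) && (mu < mean(x) + err)
  }
  100 * mean(covered)   # percentage of intervals that contain the true mean
}

set.seed(1)        # only to make the run reproducible
cisim(cl = 0.90)   # should come out close to 90
cisim(cl = 0.99)   # should come out close to 99
```

Because the data are random, the percentages will vary a little from run to run, but with m = 10000 they should stay within about a percentage point of the nominal level.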
But why would we be willing to accept a 10% chance of being "wrong", that is, of getting an interval that does not contain the true parameter? Well, we don't have to; after all, we chose to compute a 90% confidence interval. Instead we could have found a 99% confidence interval and left only a 1% chance of being "wrong". Here is what would happen:
100(1-a)% = 99%, so 1-a = 0.99, a = 0.01, a/2 = 0.005, $t_{149,0.005}$ = 2.609, so the error is 2.609 × 0.65/√150 = 0.138 and the 99% confidence interval is (2.53-0.138, 2.53+0.138) = (2.392, 2.668).
So this interval is wider: it has a width of 2*0.138 = 0.276, compared to a width of 2*0.088 = 0.176 for the 90% confidence interval. So finding confidence intervals involves a trade-off: if we make the probability of being wrong smaller, we (almost always) make the interval wider. The only way to make an interval narrower without changing the confidence level is to get a larger data set!
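The trade-off can be made concrete with a few lines of R. This is a sketch that keeps the sample standard deviation fixed at 0.65 for every sample size, which is an assumption made purely for illustration; ci_error is a hypothetical helper, not a built-in function.

```r
s <- 0.65   # sample standard deviation from the GPA example (held fixed here)

# Error (half-width) of the t interval for sample size n and confidence level cl
ci_error <- function(n, cl) qt(1 - (1 - cl) / 2, df = n - 1) * s / sqrt(n)

# Same n = 150, higher confidence level: wider interval
width_90 <- 2 * ci_error(150, 0.90)
width_99 <- 2 * ci_error(150, 0.99)
round(c(width_90, width_99), 3)

# Same 90% level, larger samples: narrower interval
widths_by_n <- sapply(c(150, 600, 2400), function(n) 2 * ci_error(n, 0.90))
round(widths_by_n, 3)
```

The widths computed this way can differ from 2*0.138 and 2*0.088 in the last digit, because the hand calculation rounds the error before doubling it. Quadrupling the sample size roughly halves the width, since the error shrinks like 1/√n.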
cisim has an argument cl for confidence level, so we can check that this works for other levels as well.
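In cisim we do all the calculations from scratch. In fact, the confidence interval for the mean is already implemented in R: it is part of the output of the t.test command, whose conf.level argument sets the confidence level.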
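As a short sketch of how this might be used: the survey data themselves are not given in these notes, so the vector gpa below is simulated stand-in data, and its mean and standard deviation are assumptions chosen to resemble the example.

```r
set.seed(42)
# Simulated stand-in for the 150 GPAs (not the real survey data)
gpa <- rnorm(150, mean = 2.5, sd = 0.65)

# 90% confidence interval for the mean as reported by t.test
res <- t.test(gpa, conf.level = 0.90)
res$conf.int

# The same interval computed from the formula, for comparison
n      <- length(gpa)
err    <- qt(0.95, df = n - 1) * sd(gpa) / sqrt(n)
manual <- c(mean(gpa) - err, mean(gpa) + err)
manual
```

Both approaches should give the same two endpoints, since t.test uses the same t-based formula for a one-sample interval.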