, For example what is the error in
? Of course this is very easy and we already know the answer. But let's further say we don't know it and we want to use Monte Carlo simulation to find out. Formaly what this means is the following:
' of X'1, .., X'n
', For example find their standard deviation.But what do we do if we don't know that our sample came from a normal? A simple idea then is to replace sampling from the actual distribution function by sampling from the next best thing, the empirical distribution function, and our simulation becomes
Step 1) generate X*1, .., X*n from the empirical distribution function of X1, .., Xn
Step 2) find the sample mean
* of X*1, .., X*n
Step 3) repeat Step 1 and Step 2 many times (say 1000 times)
Step 4) Study the MC sample means
*, for example find their standard deviation.
What does it mean, generate X*1, .., X*n from the empirical distribution function of X1, .., Xn?Actually it means sampling with replacement from X1, .., Xn, that is X*1 is any of the original X1, .., Xn with probability 1/n.
In any one bootstrap sample an original observation, say X1, may appear once, several times, or not at all.
Example : Let's illustrate these ideas using an Example from a very good book on the bootstrap, "An Introduction to the Bootstrap" by Bradley Efron and Robert Tibshirani. The following table shows the results of a small experiment, in which 7 out of 16 mice were randomly selected to receive a new medical treatment, while the remaining 9 mice were assigned to the control group. The treatment was intended to prolong survival after surgery. The table shows the survival times in days:
| Treatment | Control |
| 94 197 16 38 | 52 104 146 10 |
| 99 141 23 | 50 31 40 27 46 |
| Treatment | Control | |
| Mean | 86.86 | 56.22 |
| Standard Deviation | 25.24 | 14.14 |
Let's say next that instead of using the mean we wish to use the median to measure average survival. We find the following:
| Treatment | Control | |
| Median | 94 | 46 |
| Standard Error | ? | ? |
= s(x). How accurate is
? Here is the algorithm to find the bootstrap estimate of the standard error in
:
*(b) = s(x*b), b=1,..,B
) by the sample standard deviation of the bootstrap replications.
Example : Recall the 1970's military draft. The data set is in draft$data. In draft$f we find the bootstrap estimate of the standard error of the correlation coefficient to be 0.05.
Note that the histogram of bootstrap replications appears to be normal, so we might now find confidence intervals for the correlation coefficient based on normal theory.
Example :: Let's consider again an Example from Efron and Tisbhirani, concerning the problem of curve fitting. The data set cholostyramine shows the data for 164 men who participated in a study to see if the drug cholostyramine lowered blood cholesterol levels. The men were supposed to take six packets of cholostyramine per day, but many of them actually took much less. The variable "Compliance", which we will call "z" measure the percentage of the drug actually taken (?). The variable "Improvement", called "y", measures the decrease in total blood plasma cholesterol level.
In cholostyramine$f(1) we plot the scatterplot of z vs. y and we see that there is a relationship between compliance and improvement.
We would like to fit a curve to this dataset, a curve that describes the relationship between z and y. The idea here is that there is true curve describing the relationship, which we want to estimate. The curve is the conditional expectation of y given z, written as r(z) = E[y|z]. If we knew the bivariate distribution that generated the observations(z,y), we could of course find this function analytically. Here, though, we will try to estimate it from the data.
Generally there are two approaches to curve fitting: on the one hand we can first decide on the functional form of the relationship, and then find the curve within this family that best "fits" the data. On the other hand we can use a so-called non-parameteric regression method.
For our data set clearly a linear model is not good, so instead we will use a quadratic model of the form y = β0 + β1z + β2z2. The parameters in the model β0, β1 and β2 are estimated via least squares, that is by minimizing the least squares error

In R this can be done easily using the function lm
There are a number of non-parametric regression methods available. We will use a method called loess. It works as follows:
Say we want to estimate the regression curve at the point with x-coordinate z. Then the n points xi = (zi, yi) are ranked according to |zi-z|, and the αn nearest points, that is those with |zi-z| smallest, are identified. Here α is a user-supplied "tuning parameter". Call this neighborhood of z N(z).
Next a weighted least squares linear regression model is fit to the points in N(z).
Basically, loess computes a local linear least squares model. Again, in R this can be done easily using the function loess
.
The function cholostyramine$f(2) plots the data and adds the quadratic and the loess (for alpha=0.6) regression curve.
How can the bootstrap help us with understanding the variability in these curve estimates? First of, we can use it to plot not just one, but many bootstrap curves.This is done in cholostyramine$f(3). We can think of these "curve sets" as a likely envelope of the true regression curve. One interesting aspect is that for all the bootstrap loess curves there is a steeper rise on the right. This would indicate that, For example , there is a sizable difference between a compliance of say 80% when compared to say 100%. Let's investigate this a little more:
Define the parameter θ as the increase in improvement when going from 80% compliance to 100%. We can estimate θ either using the quadratic model, or the loess method:

The calculations are in cholostyramine$f(4)
Is the value of 0.88 significantly smaller than the 1.74 for the loess method? For this we can again use the bootstrap, see cholostyramine$f(5)