First note that the residuals are not independent because they sum to 1. Their variance-covariance matrix is given by

H is called the hat matrix. This is measuring the leverage of an observation because if hii is large changing yi will change the fit appreciably. The trace of H is p, the number of regressors, so large often means 2 or 3 times p/n.
Let's illustrate these ideas using the dataset on the Scottish Hill Races. In hills.fun we find the data matrix X (which=1), the hat matrix H (which=2) and the observations with large leverage (which=3). There we also draw the scatterplots to see which of the observations have large leverage.
One of the consequences of having large leverage is that the corresponding observation has a smaller variance (?) One way to "fix" this is to compute the standardized residuals:
Another problem for the residuals is if one observation has an error which is very large because then the estimate of s will be too large as well, and this makes all the standardized residuals to small. A possible way to proceed is to find the studentized residuals as follows:
1) fit the model but without the use of the ith observation, then use the model to find the prediction y(i) for yi.
2) The studentized residuals are

An easy way to find the studentized residuals is using the formula

Studentized residuals are sometimes also called jackknifed residuals.
In hills.fun with which=5 and which=6 we draw the residual vs. fits plots and the normal plots using the three residuals
In hills.fun with which=7 we again draw the stud. residual vs. fits plot, and identify two outliers, Bens of Jura and Knock Hill. Bens of Jura has a high leverage and a high studentized residual. Knock Hill has a very high residual but not a large leverage. Predicting the time for the Knock Hill race from the model (which=8) we find predict(fit)=13.5, more than an hour less than the time in the dataset. This could possibly be a missprint.
The leverage discussed above has a major drawback as a measure of "outlierness": it ignores the response y. In general we are most concerned with observations whose removal will change the fitted model dramatically. How can we identify such "influential" observations? One measure of influence is Cook's distance. It uses the leave-one-out cross-validation idea as the studentized residuals discussed above. Say we eliminate the ith observation and fit the model using the remaining i-1 observations. Let's call the resulting vector of parameter estimates
(i)
. As always denote the the estimtor when we use all the data
. Then Cook's distance is defined by

As a rule of thumb values of Di large than 1 should be considered for removal. A plot of Cook's distance is part of the plot() method for lm objects, and is drawn in hills.fun(9).
What happens if we eliminate Knock Hill and/or Bens of Jura from the analysis? which=10 shows that the removal of Knock Hills does not change things much wheras the removal of Bens of Jura does.
One undesirable feature of the fit is the (statistically significantly) negative intercept (?). Of course we could set it equal to 0 by ourselves but it would still be nicer if the model did this automatically.
Next we have a look at a dataset describing the relationship between body weight and brain weight of mammals. In mammals.fun(1) we draw the scatterplot of Brain weight by Body weight. This graph is had to read because most of the space is taken up by just two observations, identified as Asian and African Elephants. The problem here is that the distributions of both body and brain are scewed to the right, as can be seen by drawing the corresponding boxplots (which==2). This kind of problem can often be fixed using a log transform. In which==3 and 4 we draw the corresponding boxplots and scatterplots, and we see that this indeed solves the problem.
What does this transform do?

and we see that this fits a power model.
In which==5 we do the fit and study the diagnostic plots. Observation #32 appears as unusual in some of the diagnostic plots, and we find it to belong to the species "Human"
Next we study the dataset Medieval Churches. A scatterplot of Area by Perimeter (churches.fun(1)) shows a strong relationship, but one that is clearly not linear. We also have some skedness to the right again. As before we can try and remmedy that via a trandformation, for example using logs (which==2). This seems to do an ok job. But is there a way to do better, maybe in some sense find an "optimal" transformation?
Box-Cox suggested to use a transform of the form

where l governs the transform and a is included to avoid problems with cases of y=0. Because 0's do not occur in our dataset will simply use a=0.
Box & Cox (1964) showed that the profile likelihood function for l is given by

and y. is the geometric mean of the observations and RSS(z(l)) is the residual sum of squares for the regression of z(l).
The routine boxcox draws the profile likelihood curve together with a 95% confidence interval. Basically any value of l in this interval yields a transformation close to optimal. For our dataset this is done by churches.fun with which=3. It shows that values between 0.45 and 0.8 are acceptable. The "nicest" value in this interval is 0.5, corresponding to the square root transform of Area. Notice that it sufficient to use √(Area) instead of (Area0.5-1)/0.5.
In which==4 we fit the corresponding model and study the diagnostic plots, and in which==5 we draw the fitted line plot. Note that the resulting model still has a hint of unequal variance.