Preliminary Analysis and Mallows' Cp
In a simple regression problem we decide whether the model is good or bad (normal residuals, etc.) based on the residuals vs. fits plot and the normal plot. Eventually we will do the same in a multiple regression problem, but it is useful to do some preliminary checks early on. If there are obvious problems, we might spot them early and try to fix them without wasting a lot of time.
In a preliminary analysis we try to answer the following questions:
| Predictor | Graph | Is there a relationship? | Do the residuals have a normal distribution? | Do we have equal variance? | Is a linear relationship likely? |
|---|---|---|---|---|---|
Is there a problem with highly correlated predictors?
Here is what you need to do:
Graph
Scatterplot if the predictor is continuous, boxplot if it is discrete.
Is there a relationship?
Look at the graph and calculate the correlation coefficient, together with its p-value.
Example
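As a rough illustration of this check, the sketch below computes the Pearson correlation coefficient by hand and attaches an approximate two-sided p-value. The data are made up, the function names `pearson_r` and `approx_p_value` are hypothetical, and the p-value uses the Fisher z approximation rather than the exact t-test that statistics packages usually report, so the two may differ slightly for small samples.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Approximate two-sided p-value for H0: no correlation.

    Assumption: Fisher z transformation (needs n > 3 and |r| < 1),
    not the exact t-test.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Made-up illustration data: one continuous predictor and the response
predictor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
response = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.3, 8.9]

r = pearson_r(predictor, response)
p = approx_p_value(r, len(predictor))
print(f"r = {r:.3f}, approximate p-value = {p:.4f}")
```

A correlation close to +1 or -1 with a small p-value suggests that the predictor is related to the response.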
Do the residuals have a normal distribution?
If the residuals do not have a normal distribution, the most common sign is one or more severe outliers.
Example
Do we have equal variance?
Look for the usual "cone" shape in the scatterplot, or for boxes of very different sizes in the boxplot. Either one suggests unequal variance.
Example
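For a discrete predictor, the "size of the box" is the interquartile range (IQR) of the response within each level, so the visual check can be backed up with numbers. The following is a minimal sketch with made-up data; the helper `iqr` is hypothetical, and since plotting software uses slightly different quartile definitions, the values may not match a drawn boxplot exactly.

```python
from statistics import quantiles


def iqr(values):
    """Interquartile range: the height of the box in a boxplot."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Made-up response values for three levels of a discrete predictor
groups = {
    "low": [10.1, 10.4, 9.8, 10.0, 10.3, 9.9],
    "medium": [12.0, 13.1, 11.2, 12.6, 11.8, 12.9],
    "high": [15.0, 19.5, 11.0, 17.2, 13.1, 20.4],
}

# One IQR per level; very different values hint at unequal variance
box_sizes = {level: iqr(values) for level, values in groups.items()}
for level, size in box_sizes.items():
    print(f"{level}: IQR = {size:.2f}")
```

In this made-up data the spread grows with the level of the predictor, which is the boxplot analogue of the cone shape.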
Is a linear relationship likely?
In the scatterplot, judge this the same way as in simple regression. In the boxplot, check whether the boxes line up along a straight line.
Example
Is there a problem with highly correlated predictors?
As a ballpark figure, any correlation between two predictors above 0.9 (in absolute value) can cause trouble.
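One way to carry out this check is to compute the correlation for every pair of predictors and flag the pairs beyond the cutoff. The sketch below assumes made-up predictors (a height recorded in two units, plus an age); the names `pearson_r`, `predictors` and `THRESHOLD` are illustrative only.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up predictors: the two height columns carry the same information
predictors = {
    "height_cm": [150, 160, 165, 172, 180, 188],
    "height_in": [59.1, 63.0, 65.0, 67.7, 70.9, 74.0],
    "age": [34, 21, 45, 29, 52, 38],
}

THRESHOLD = 0.9  # ballpark cutoff from the text

# Collect every pair of predictors whose |r| exceeds the cutoff
flagged = []
for (name_a, a), (name_b, b) in combinations(predictors.items(), 2):
    r = pearson_r(a, b)
    if abs(r) > THRESHOLD:
        flagged.append((name_a, name_b, round(r, 3)))

print(flagged)
```

Here only the two height columns are flagged, so one of them would be a candidate for removal.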
Often, if the preliminary analysis shows no problems, a linear model will work just fine, but at the end you should always check the residuals vs. fits plot and the normal plot, just as before.
Mallows' Cp
As before, we want our models to be parsimonious. So if a predictor is not really useful, maybe we should not include it in our model. The decision on which predictors to use can be made based on Mallows' Cp statistic: the smaller the better.
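The text above does not spell out the formula. The usual definition is Cp = SSE_p / MSE_full - n + 2p, where SSE_p is the residual sum of squares of the candidate model, MSE_full is the mean squared error of the model with all predictors, n is the number of observations and p is the number of coefficients in the candidate model (including the intercept). The sketch below computes Cp for every subset of predictors using only the standard library. The data and the names `solve`, `sse` and `mallows_cp` are made up for illustration; in practice a statistics package would do this step.

```python
from itertools import combinations


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def sse(columns, y):
    """Residual sum of squares of a least-squares fit with intercept."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(rows, y))


def mallows_cp(data, y):
    """Cp for every subset of predictors; data maps predictor name -> column."""
    n = len(y)
    names = list(data)
    mse_full = sse([data[nm] for nm in names], y) / (n - len(names) - 1)
    result = {}
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            p = size + 1  # coefficients including the intercept
            result[subset] = sse([data[nm] for nm in subset], y) / mse_full - n + 2 * p
    return result


# Made-up data: y was built from x1 and x2 plus small noise; x3 is unrelated
data = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "x2": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3],
    "x3": [4, 4, 6, 2, 3, 7, 5, 5, 1, 6],
}
y = [6.2, 5.6, 9.8, 8.4, 13.7, 18.1, 14.5, 18.5, 19.5, 19.5]

cp = mallows_cp(data, y)
for subset, value in sorted(cp.items(), key=lambda item: item[1]):
    print(subset, round(value, 2))
```

The subsets are printed from smallest to largest Cp, so the preferred candidates appear first. Note that for the full model Cp always equals p by construction, so the statistic is only informative for comparing the smaller models against it.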