"Best Fit" and the Problem of Overfitting

R²

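Say we have two "good" models. How can we decide which is better? One measure of "quality of fit" is R², the coefficient of determination. It is defined as

R² = (1 - SSE/SST)·100%

where SSE is the sum of squared residuals (observed minus fitted values) and SST is the sum of squared deviations of the observed responses from their mean.
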
Often R² is described as the percentage of variation in the response explained by the regression on the predictor(s). We always have:

• R² = 0%: no relationship between X and Y
• R² = 100%: perfect relationship between X and Y
• If R² of model A is greater than R² of model B, then model A is better than model B

R² is part of the output of the MINITAB regression command.
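
Outside MINITAB, R² can also be computed by hand. The following is a minimal Python sketch, assuming the usual definition R² = (1 - SSE/SST)·100%, where SSE is the sum of squared residuals and SST the total sum of squares. The data and the helper names `r_squared` and `fit_line` are made up for illustration and are not part of the course material.

```python
def r_squared(observed, fitted):
    """Return R^2 in percent, computed as (1 - SSE/SST) * 100."""
    mean_y = sum(observed) / len(observed)
    sse = sum((y - f) ** 2 for y, f in zip(observed, fitted))  # unexplained variation
    sst = sum((y - mean_y) ** 2 for y in observed)             # total variation
    return (1 - sse / sst) * 100


def fit_line(x, y):
    """Least squares intercept and slope for the linear model y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up illustration data, not one of the course data sets.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

intercept, slope = fit_line(x, y)
fitted = [intercept + slope * xi for xi in x]
r2 = r_squared(y, fitted)
print(f"R^2 = {r2:.1f}%")
```

Because the made-up points lie very close to a straight line, the printed R² is close to 100%.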

r and R²

There is a close connection between Pearson's correlation coefficient r and R²: for a linear model with one predictor, R² = r²·100%!

Example: Alcohol and Tobacco data (without NI): r = 0.784, so R² = r²·100% = 0.784²·100% = 0.615·100% = 61.5%
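
The connection between r and R² can be checked numerically. The sketch below is only an illustration: the data are made up (they are NOT the Alcohol and Tobacco data), the function names are hypothetical, and R² is assumed to be computed as (1 - SSE/SST)·100% from the least squares line.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient r."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def linear_r_squared(x, y):
    """R^2 in percent of the least squares line, as (1 - SSE/SST) * 100."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    sst = sum((b - mean_y) ** 2 for b in y)
    return (1 - sse / sst) * 100


# Made-up illustration data (NOT the Alcohol and Tobacco data).
x = [3.0, 3.5, 4.1, 2.8, 4.6, 3.9, 3.2]
y = [5.1, 6.0, 6.3, 4.7, 6.1, 5.2, 5.9]

r = pearson_r(x, y)
r2_from_r = r ** 2 * 100              # r squared, as a percentage
r2_from_fit = linear_r_squared(x, y)  # R^2 from the fitted line
print(f"r = {r:.3f}, r^2*100% = {r2_from_r:.1f}%, R^2 = {r2_from_fit:.1f}%")

# The arithmetic of the example in the text, with r = 0.784.
example_r2 = round(0.784 ** 2 * 100, 1)
```

The two percentages agree up to rounding error. This holds for the straight-line model with one predictor; it is not claimed here for other models.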

Comparing Polynomial Models

Unfortunately we cannot use R² to choose between two polynomial models, say the quadratic and the cubic model, because the model with the higher power will never have a smaller R²!

Example: Elusage
Linear model: R² = 78.0%
Quadratic model: R² = 84.7%
Cubic model: R² = 84.7%
...
Model with power 10: whatever R² is, it will be at least 84.7%, and probably even higher.

The reason for this is simple: Say we find the best quadratic model, which is
Usage = 196.7 - 4.640·Temperature + 0.03073·Temperature²
Now we add the cubic term Temperature³ as a predictor. One cubic model is
Usage = 196.7 - 4.640·Temperature + 0.03073·Temperature² + 0·Temperature³
This is of course the same as the model above, so it has R² = 84.7%. But the least squares cubic model is the best cubic model, so its R² cannot be smaller (and usually will be a bit higher, even if the cubic term is not necessary). Question: which of these polynomial models should you use?
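
The argument above can be tried out numerically. The following Python sketch fits least squares polynomials of degree 1, 2 and 3 to a made-up data set (NOT the Elusage data) and collects their R² values. The helpers `solve` and `poly_r_squared` are hypothetical names. The fit uses the normal equations on a rescaled predictor, which is one simple way to do it and leaves R² unchanged.

```python
def solve(a, b):
    """Solve the linear system a*c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix (copy)
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    coef = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][k] * coef[k] for k in range(i + 1, n))
        coef[i] = (m[i][n] - s) / m[i][i]
    return coef


def poly_r_squared(x, y, degree):
    """R^2 in percent of the least squares polynomial of the given degree."""
    n = len(x)
    mean_x = sum(x) / n
    scale = max(abs(xi - mean_x) for xi in x)
    # Centering and rescaling x improves numerical stability; R^2 is unchanged.
    rows = [[((xi - mean_x) / scale) ** p for p in range(degree + 1)] for xi in x]
    k = degree + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coef = solve(xtx, xty)  # normal equations (X'X) c = X'y
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    mean_y = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return (1 - sse / sst) * 100


# Made-up data, loosely in the spirit of usage vs. temperature.
temperature = [22, 27, 31, 36, 40, 45, 49, 54, 58, 63, 68, 73]
usage = [110, 95, 88, 70, 66, 52, 50, 41, 36, 34, 29, 33]

r2 = {d: poly_r_squared(temperature, usage, d) for d in (1, 2, 3)}
for d, value in r2.items():
    print(f"degree {d}: R^2 = {value:.1f}%")
```

Whatever data are used, the R² values in this sketch cannot decrease as the degree goes up, which is exactly why R² alone cannot pick the degree.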

For the Alcohol vs Tobacco data (without NI) and a 9th degree polynomial we get:


It is always possible to find a polynomial model which fits the data set perfectly, that is, it has R² = 100% (as long as no x value occurs with two different y values)!
But: we want our models to fit the relationship, not the random fluctuations in the dataset.
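
The perfect-fit claim above can be illustrated with a small Python sketch. It uses Lagrange interpolation, which is one standard way to write down the polynomial of degree at most n-1 through n points with distinct x values; this is a swapped-in technique for illustration, not the MINITAB computation. The data are made-up noise around a flat line, so there is no real relationship to find, and `lagrange_value` is a hypothetical helper name.

```python
def lagrange_value(x_points, y_points, x_new):
    """Evaluate at x_new the polynomial of degree <= n-1 through the n given points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(x_points, y_points)):
        term = yi
        for j, xj in enumerate(x_points):
            if j != i:
                term *= (x_new - xj) / (xi - xj)
        total += term
    return total


# Made-up data: 10 points scattered around 5, with no trend in x.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [5.2, 4.7, 5.9, 5.1, 4.4, 5.6, 5.0, 4.8, 5.7, 4.9]

# "Fit" of the degree 9 polynomial at the observed x values.
fitted = [lagrange_value(x, y, xi) for xi in x]
mean_y = sum(y) / len(y)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
sst = sum((yi - mean_y) ** 2 for yi in y)
r2 = (1 - sse / sst) * 100
print(f"R^2 of the degree 9 polynomial: {r2:.1f}%")

# Prediction between two observed points, for comparison with the data range.
between = lagrange_value(x, y, 9.5)
print(f"prediction at x = 9.5: {between:.2f} (data range {min(y)} to {max(y)})")
```

The polynomial reproduces every observation, so R² is 100% even though the data contain nothing but noise. Comparing the prediction at x = 9.5 with the range of the data shows how such a model behaves between the points it was fitted to.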

A model should be parsimonious, that is, as simple as possible. (Occam's razor)

Solution: Use the polynomial model of lowest degree where the p-value of the hypothesis test for the highest order term is less than 0.05 (that is where the highest order term is statistically significant)

Example Elusage
Quadratic model:
Predictor        p-value
Temperature      0.000
Temperature**2   0.000

So in the quadratic model the highest order term (Temperature**2) has a p-value less than 0.05.

Cubic model:
Predictor        p-value
Temperature      0.097
Temperature**2   0.442
Temperature**3   0.752

So in the cubic model the highest order term (Temperature**3) has a p-value greater than 0.05.

so the polynomial model of lowest degree where the p-value of the hypothesis test for the highest order term is less than 0.05 is the quadratic, which is then our best choice.
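
The p-value comparison above can be sketched in Python without MINITAB. Everything in this block is an illustration under stated assumptions: the data are simulated from a quadratic curve plus fixed made-up noise (NOT the Elusage data), the function names are hypothetical, the p-value is the usual two-sided t-test for the highest order coefficient, and the t distribution is integrated numerically with Simpson's rule. The loop at the end is one possible reading of the rule: raise the degree until the highest order term is no longer significant.

```python
import math


def solve(a, b):
    """Solve a*c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    coef = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][k] * coef[k] for k in range(i + 1, n))
        coef[i] = (m[i][n] - s) / m[i][i]
    return coef


def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value of a t statistic via Simpson integration of the t density."""
    const = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    area = const * (1 + (1 + upper * upper / df) ** (-(df + 1) / 2))  # end points
    for k in range(1, steps):
        u = k * h
        area += (4 if k % 2 else 2) * const * (1 + u * u / df) ** (-(df + 1) / 2)
    area *= h / 3
    return max(0.0, 1 - 2 * area)


def highest_term_p_value(x, y, degree):
    """p-value of the t-test for the highest order term of a polynomial model."""
    n, k = len(x), degree + 1
    mean_x = sum(x) / n
    scale = max(abs(xi - mean_x) for xi in x)
    # Rescaling x does not change the test for the highest order term.
    rows = [[((xi - mean_x) / scale) ** p for p in range(k)] for xi in x]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coef = solve(xtx, xty)
    sse = sum((yi - sum(c * v for c, v in zip(coef, r))) ** 2 for r, yi in zip(rows, y))
    df = n - k
    unit = [0.0] * degree + [1.0]
    var_last = (sse / df) * solve(xtx, unit)[-1]  # s^2 times last diagonal entry of (X'X)^-1
    return t_two_sided_p(coef[-1] / math.sqrt(var_last), df)


# Simulated data: a quadratic curve plus fixed made-up noise.
noise = [0.8, -0.5, 0.3, -0.9, 0.6, -0.2, 0.4, -0.7, 0.5, -0.3, 0.9, -0.6]
x = list(range(1, 13))
y = [50 - 6 * xi + 0.3 * xi ** 2 + e for xi, e in zip(x, noise)]

p_values = {d: highest_term_p_value(x, y, d) for d in (2, 3)}
for d, p in p_values.items():
    print(f"degree {d}: p-value of highest order term = {p:.3f}")

chosen = 1
for d in (2, 3, 4):
    if highest_term_p_value(x, y, d) >= 0.05:
        break
    chosen = d
print("chosen degree:", chosen)
```

For these simulated data the quadratic term is significant and the cubic term is not, so the sketch settles on the quadratic model, mirroring the Elusage example.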

Choosing between good Models

In choosing the best model (from our short list) proceed as follows:

Model is "good" = no pattern in the Residual vs. Fits plot

Step 1: If a linear model is good, use it; you are done.

If the linear model is not good, proceed as follows:
Step 2: Check the square root model, the exponential model and the power model and see which of these (if any) are good.
Step 3: Find the best polynomial model.
Step 4: Of the good models found in Steps 2 and 3, choose the one which has the highest R².

Example Elusage.