| Variable | Relationship? | Normal Residuals? | Equal Variance? | Linear Model? |
| JanTemp | r=-0.106 | Yes | Yes | Yes |
| JulyTemp | r=0.322 | Yes | Yes | Yes |
| RelHum | r=-0.101 | Yes | Yes | Yes |
| Rain | r=0.433 | Yes | Yes | Yes |
| Education | r=-0.508 | Yes | Yes | Yes |
| PopDensity | r=0.252 | Yes | Yes | Yes |
| NonWhite | r=0.647 | Yes | Yes | Yes |
| WhiteCollar | r=-0.289 | Yes | Yes | Yes |
| Pop | No | | | |
| Pop/House | r=0.368 | Yes | Yes | Yes |
| Income | r=-0.283 | Yes | Yes | Yes |
| HCPot | No | | | |
| NOxPot | No | | | |
| SO2Pot | r=0.416 | Yes | Yes | Yes |
| NOx | No | | | |
Four predictors (Pop, HCPot, NOxPot and NOx) show no usable relationship with Mortality. We apply the LOGT transform to each of them and repeat the preliminary analysis:
| Variable | Relationship? | Normal Residuals? | Equal Variance? | Linear Model? |
| LOGT(Pop) | r=0.085 | Yes | Yes | Yes |
| LOGT(HCPot) | r=0.125 | Yes | Yes | Yes |
| LOGT(NOxPot) | r=0.280 | Yes | Yes | Yes |
| LOGT(NOx) | r=0.280 | Yes | Yes | Yes |
The transformations were successful, so we remove the variables Pop, HCPot, NOxPot and NOx from the dataset and use LOGT(Pop), LOGT(HCPot), LOGT(NOxPot) and LOGT(NOx) instead.
Next we look at the correlations between the predictors. We find:
a) there are sizable correlations (for example cor(NonWhite,JulyTemp)=0.602)
b) LOGT(NOxPot) and LOGT(NOx) are perfectly correlated.
Because of a), interpreting (understanding) the final model will be difficult.
Because of b), we cannot use both predictors in the same model, so we eliminate one of them, say LOGT(NOxPot).
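A check like the one above can be sketched in Python with NumPy. The data here are synthetic (column `C` is built as an exact linear function of column `A`), and `correlated_pairs` is a hypothetical helper, not part of any library. The sketch also shows why a perfectly correlated pair cannot stay in the model: the design matrix loses a rank, so the least squares coefficients are no longer uniquely determined.

```python
import numpy as np


def correlated_pairs(X, names, threshold):
    """Return (name_i, name_j, r) for every pair of columns with |r| >= threshold."""
    R = np.corrcoef(X, rowvar=False)  # columns are variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(R[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(R[i, j])))
    return pairs


# Synthetic predictors, NOT the data from the text.
rng = np.random.default_rng(0)
a = rng.normal(size=50)
b = rng.normal(size=50)
c = 2.0 * a + 3.0  # exact linear function of a, so cor(A, C) = 1
X = np.column_stack([a, b, c])
names = ["A", "B", "C"]

perfect = correlated_pairs(X, names, threshold=0.9999)
print(perfect)

# With an intercept column the design matrix has 4 columns,
# but its rank is lower because C carries no new information.
design = np.column_stack([np.ones(len(a)), X])
rank = np.linalg.matrix_rank(design)
print("columns:", design.shape[1], "rank:", rank)
```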
Next we fit a model with all the predictors and check the assumptions:
The regression equation is
Mortality = 1230 - 1.89 JanTemp - 1.79 JulyTemp + 0.53 RelHum + 1.41 Rain - 7.57 Education + 0.00409 PopDensity + 5.01 NonWhite - 1.80 WhiteCollar - 34.7 Pop/House - 0.00059 Income + 0.091 SO2Pot + 6.5 LOGT(Pop) - 52.1 LOGT(HCPot) + 64.1 LOGT(NOx)
with R2=77.0%
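For readers who want to reproduce this kind of fit outside a statistics package, here is a minimal least squares sketch in Python with NumPy. It uses synthetic data with three made-up predictors, not the mortality data, and `fit_ols` is a hypothetical helper name.

```python
import numpy as np


def fit_ols(X, y):
    """Least squares fit with an intercept; returns (coefficients, R^2)."""
    A = np.column_stack([np.ones(len(y)), X])  # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta
    sse = float(residuals @ residuals)            # error sum of squares
    sst = float(((y - y.mean()) ** 2).sum())      # total sum of squares
    return beta, 1.0 - sse / sst


# Synthetic data, NOT the data from the text: the third predictor is pure noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = 10.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=60)

beta, r2 = fit_ols(X, y)
print("intercept and slopes:", np.round(beta, 2))
print("R^2 = {:.1%}".format(r2))
```

The first entry of `beta` is the intercept and the remaining entries are the slopes, in the same order as the columns of `X`.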
The residual vs fits plot looks fine, so there is no problem with the model
The normal plot is ok, so there is no problem with the normal assumption.
The residual vs fits plot looks fine, so there is no problem with the equal variance assumption.
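The normal plot used above can be built by hand: sort the residuals and pair them with standard normal quantiles. The Python sketch below does this with the standard library only. It assumes Blom's plotting positions (i - 0.375)/(n + 0.25), which is one common convention among several, and it uses made-up residuals; `normal_plot_points` and `plot_correlation` are hypothetical helper names. The correlation of the plotted points is only a rough numerical companion to looking at the plot.

```python
import math
from statistics import NormalDist


def normal_plot_points(residuals):
    """Pair sorted residuals with standard normal quantiles (Blom positions)."""
    n = len(residuals)
    std_normal = NormalDist()
    quantiles = [std_normal.inv_cdf((i - 0.375) / (n + 0.25))
                 for i in range(1, n + 1)]
    return list(zip(quantiles, sorted(residuals)))


def plot_correlation(points):
    """Correlation of the plotted points; close to 1 means a nearly straight line."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up residuals, NOT from the fitted model in the text.
base_q = [q for q, _ in normal_plot_points(list(range(20)))]
normal_like = [5.0 * q for q in base_q]          # exactly normal-shaped
skewed = [math.exp(2.0 * q) for q in base_q]     # strongly right-skewed

r_normal = plot_correlation(normal_plot_points(normal_like))
r_skewed = plot_correlation(normal_plot_points(skewed))
print(f"normal-shaped: {r_normal:.3f}   skewed: {r_skewed:.3f}")
```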
Next we use best subset regression to see whether we can find a model with fewer predictors. It suggests a model based on JanTemp, Rain, PopDensity, NonWhite, WhiteCollar and LOGT(NOx) with Mallows' Cp=4.3.
Fitting this model we find
Mortality = 944 - 1.94 JanTemp + 1.92 Rain + 0.00644 PopDensity + 4.19 NonWhite - 2.72 WhiteCollar + 39.1 LOGT(NOx)
with R2=74.2%
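The best subset search used to find this model can be sketched in Python with NumPy. The sketch fits every subset of the predictors and computes Mallows' Cp as SSE_p/MSE_full - n + 2p, where p counts the coefficients of the subset model including the intercept. It ranks the models by smallest Cp, which is an assumption about how the "best" model is picked; statistics packages may present the candidates differently. The data are synthetic and `subset_sse` and `best_subsets_cp` are hypothetical helper names.

```python
from itertools import combinations

import numpy as np


def subset_sse(X, y, cols):
    """Error sum of squares of the least squares fit using the columns in cols."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta
    return float(residuals @ residuals)


def best_subsets_cp(X, y):
    """Return (Cp, columns) for every non-empty subset, smallest Cp first."""
    n, k = X.shape
    mse_full = subset_sse(X, y, range(k)) / (n - k - 1)
    results = []
    for size in range(1, k + 1):
        for cols in combinations(range(k), size):
            p = size + 1  # number of coefficients, including the intercept
            cp = subset_sse(X, y, cols) / mse_full - n + 2 * p
            results.append((cp, cols))
    return sorted(results)


# Synthetic data, NOT the data from the text: only columns 0 and 1 matter.
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 4))
y = 5.0 + 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=60)

results = best_subsets_cp(X, y)
best_cp, best_cols = results[0]
print("best subset:", best_cols, "Cp = {:.1f}".format(best_cp))
```

Note that the number of subsets doubles with every added predictor, so an exhaustive search like this is only practical for a moderate number of predictors.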
Again we check the assumptions:
The residual vs fits plot looks fine, so there is no problem with the model
The normal plot is ok, so there is no problem with the normal assumption.
The residual vs fits plot looks fine, so there is no problem with the equal variance assumption.
Notice that the final model has some variables that had a large p-value in the full model and are not significant even in the final model, for example PopDensity (p=0.337 in the full model, p=0.080 in the final model). The reason is this: by itself PopDensity does not tell us much about Mortality, but together with the other predictors it is a little useful.
In general the opposite could also happen: a predictor might be significant in the full model, but might not be part of the best model.
The important part here is this: the best model is chosen according to Mallows' Cp, not based on the hypothesis tests and not based on the correlations.