Analysis of House Price Data

Preliminary Analysis


Sqfeet
Graph Scatterplot
Is there a relationship? Yes, r=0.891 (p=0.00)
Do the residuals have a normal distribution? Yes
Do we have equal variance? Yes
Is a linear relationship likely? Yes

Floors
Graph Multiple Boxplot
Is there a relationship? If so, weak, r=0.321 (p=0.090)
Do the residuals have a normal distribution? Yes
Do we have equal variance? Yes
Is a linear relationship likely? Yes

Bedrooms
Graph Multiple Boxplot
Is there a relationship? Yes, r=0.674 (p=0.00)
Do the residuals have a normal distribution? Yes
Do we have equal variance? Yes
Is a linear relationship likely? Yes

Baths
Graph Multiple Boxplot
Is there a relationship? Yes, r=0.741 (p=0.00)
Do the residuals have a normal distribution? Yes
Do we have equal variance? Yes
Is a linear relationship likely? Yes
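
The r values quoted above come from MINITAB. As a rough sketch of what is being computed, the following Python code calculates the sample correlation coefficient and the t statistic used to test whether the true correlation is zero. The function names and the data are made up for illustration and are not the house price data. The p-value would come from comparing t with a t distribution on n - 2 degrees of freedom, which is not done here.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length >= 3")
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for H0: rho = 0, on n - 2 degrees of freedom (requires |r| < 1)."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# Made-up illustrative values, NOT the house price data.
sqfeet = [10.0, 12.5, 15.0, 18.0, 20.5, 24.0]
price = [80.0, 95.0, 118.0, 140.0, 150.0, 185.0]
r = pearson_r(sqfeet, price)
t = t_statistic(r, len(sqfeet))
print(f"r = {r:.3f}, t = {t:.2f} on {len(sqfeet) - 2} df")
```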

There are several possible outliers, especially observations #2 and #29. It is difficult to say what should be done with them. Here is what the graphs look like without #2:

These look acceptable, so we will use the dataset without observation #2.

---------------------------------------------------------------------------------------------
This is the end of the preliminary analysis. Note that so far there has been no mention of regression, residuals vs. fits plots or normal plots. The whole point of a preliminary analysis is to make decisions about possible transformations and/or polynomial models early, based solely on scatterplots and/or boxplots.
---------------------------------------------------------------------------------------------

The regression equation is:
Price = 12.0 + 8.16 Sqfeet - 26.5 Floors - 9.29 Bedrooms + 37.4 Baths

Stat > Regression > Regression, Response= Price, Predictors= Sqfeet Floors Bedrooms Baths, Graphs > Normal Plot and Residuals vs. Fits Plot
Here are the diagnostic plots:

This appears to be a good model, and the assumptions of normally distributed residuals with equal variance appear to be satisfied.

The highest correlation between predictors is r=0.743 (Floors-Baths)

R² = 88.6%
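
For readers without MINITAB, here is a minimal sketch of how a multiple regression fit and its R² can be computed by least squares. It solves the normal equations directly, which is fine for a small, well-behaved example but less numerically stable than the methods statistical packages use. The helper names (solve, fit_ols) and the data are made up for illustration and are not the house price data.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(rows, y):
    """Least-squares fit with a constant; returns (coefficients, residuals, r_squared)."""
    X = [[1.0] + list(row) for row in rows]  # prepend the constant column
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in X]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    ybar = sum(y) / len(y)
    sse = sum(e * e for e in resid)
    sst = sum((yi - ybar) ** 2 for yi in y)
    return beta, resid, 1.0 - sse / sst

# Made-up illustrative data with two predictors, NOT the house price data.
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0), (6.0, 4.0)]
y = [9.5, 7.6, 19.2, 18.1, 25.8, 25.3]
beta, resid, r2 = fit_ols(rows, y)
print("coefficients (constant first):", [round(b, 3) for b in beta])
print("R-squared:", round(r2, 3))
```

The residuals returned by fit_ols are what the normal plot and the residuals vs. fits plot are drawn from.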

The constant is not statistically significant (p-value = 0.504), so we might consider fitting a no-intercept model.

Of the predictor variables, only Bedrooms is not statistically significant (p-value > 0.05); all the others are.

Can we eliminate any predictors from the model? Using best subsets regression we find that the best model uses Sqfeet, Floors and Baths (Mallows' Cp=4.8).
Stat > Regression > Best Subsets, Response= Price, Free Predictors= Sqfeet Floors Bedrooms Baths
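
As a sketch of how Mallows' Cp is usually defined: for a submodel with p parameters (the constant included), Cp = SSE_p / MSE_full - (n - 2p), where MSE_full is the mean squared error of the model with all candidate predictors. The short Python function below implements that formula. The function name and all numbers are hypothetical and are not taken from the house price output.

```python
def mallows_cp(sse_sub, mse_full, n, p_sub):
    """Mallows' Cp for a submodel with p_sub parameters (constant included)."""
    return sse_sub / mse_full - (n - 2 * p_sub)

# Hypothetical numbers for illustration only, NOT from the house price data.
n = 30            # number of observations (assumed)
p_full = 5        # four predictors plus the constant
sse_full = 500.0  # error sum of squares of the full model (assumed)
mse_full = sse_full / (n - p_full)

# For the full model, Cp always equals its number of parameters.
print(mallows_cp(sse_full, mse_full, n, p_full))  # 5.0

# A three-predictor submodel (4 parameters) with a slightly larger SSE.
print(mallows_cp(530.0, mse_full, n, 4))  # 4.5
```

A submodel is considered adequate when its Cp is small and close to its own p. By this formula the full model's Cp is always exactly its parameter count.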

The regression equation of this model is:
Price = - 1.8 + 7.40 Sqfeet - 19.7 Floors + 30.6 Baths
R² = 87.7%

Note that the model with all four predictors has Cp=5.0. But Cp is a statistic, so its exact value depends on the sample. So is the model with Sqfeet, Floors and Baths statistically significantly better than the model with all four predictors? We would need a hypothesis test to answer this question, but MINITAB does not provide one.
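
One standard way to compare two nested regression models by hand is the partial F-test. The sketch below computes its statistic from the two R² values. It assumes both models include a constant and were fitted to the same observations. The function name is made up, and because the sample size is not stated in the text, the value n_assumed is a placeholder only. Since the R² values are rounded, the resulting F is only approximate.

```python
def partial_f(r2_full, r2_reduced, n, p_full, q):
    """F statistic for dropping q predictors from a model with p_full parameters.

    Compare the result with an F distribution on (q, n - p_full) degrees of freedom.
    """
    numerator = (r2_full - r2_reduced) / q
    denominator = (1.0 - r2_full) / (n - p_full)
    return numerator / denominator

# R-squared values as reported in the text (rounded); n_assumed is a PLACEHOLDER.
n_assumed = 28
f_stat = partial_f(r2_full=0.886, r2_reduced=0.877, n=n_assumed, p_full=5, q=1)
print(f"F = {f_stat:.2f} on (1, {n_assumed - 5}) df, under the assumed n")
```

When only one predictor is dropped (q = 1), this F statistic equals the square of that predictor's t statistic in the full model.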

For more on Mallows' Cp see page 603 of the textbook.