Exercises - Multiple Regression

Problem 1

Let's revisit the Olympic Long Jump. We want to find a model to predict the men's and the women's gold medal winning long jump for next year, but now using both time and sex as predictors. First rescale Time so that the numbers are not so big: use the year 1900 as a baseline (year zero), that is, replace Time by Time - 1900. Then code Gender as 0 for Men and 1 for Women.
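The recoding described above can be sketched in a few lines of Python. The function name `recode` and the string labels "Men" and "Women" are assumptions made for this illustration; the (year, sex) pairs are there only to show the transformation and carry no jump lengths.

```python
def recode(year, sex):
    """Return (Time, Gender) with 1900 as year zero and Men = 0, Women = 1."""
    gender_codes = {"Men": 0, "Women": 1}  # assumed labels for the sex column
    return year - 1900, gender_codes[sex]


# Illustrative (year, sex) pairs only; no measurements are implied.
for year, sex in [(1896, "Men"), (1948, "Women"), (2000, "Men")]:
    print(year, sex, "->", recode(year, sex))
```

Years before 1900 get a negative Time, which is harmless for the regression.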

Part 1

Find the parallel lines model for predicting Long Jump from Time and Gender. What is the equation for the men, and what is it for the women? What is the R2? Is it a good model?

Part 2

Find the equal intercept model for predicting Long Jump from Time and Gender. What is the equation for the men, and what is it for the women? What is the R2? Is it a good model?

Part 3

Find the independent lines model for predicting Long Jump from Time and Gender. What is the equation for the men, and what is it for the women? What is the R2? Is it a good model?

Part 4

Which of the three is the best model? Use the best model to find 95% prediction intervals of the gold medal winning long jump for men and women next year.

Part 5

Here is an alternative way to deal with outliers. We already know that 1896 is an outlier for the men's long jump. Make a new variable, called a dummy variable, and name it "1896". It has a value of 1 for 1896 and 0 for all other times. Include this new variable in your parallel lines model. What is the least squares equation? What is the R2? What is the 95% PI for the gold medal winning long jump for men next year now?

Problem 2

Consider the data set Employee Attitudes.
Analyze this data set. Carry out a preliminary analysis, then find the best linear model relating the ratings to the predictor variables.
Use the model you found to be best to find a 99% interval estimate for the mean rating score of supervisors whose employees give them the mean of the scores in every question.

Solutions

Problem 1

Part 1

Long Jump = 278 + 0.695 Time - 63.6 Gender or
Men: Long Jump = 278 + 0.695 Time
Women: Long Jump = 214.4 + 0.695 Time
R2 = 89.8%
It is a good model according to the residual vs. fits plot.
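The separate equations for men and women follow from plugging Gender = 0 and Gender = 1 into the fitted equation. A minimal sketch of that arithmetic is below; the function name `group_lines` and its argument names are made up for this example, and the interaction coefficient is included (defaulting to 0) so the same function also covers models with a Time*Gender term.

```python
def group_lines(b0, b_time, b_gender=0.0, b_interaction=0.0):
    """Split y = b0 + b_time*Time + b_gender*Gender + b_interaction*Time*Gender
    into (intercept, slope) pairs for Gender = 0 (men) and Gender = 1 (women)."""
    men = (b0, b_time)
    women = (b0 + b_gender, b_time + b_interaction)
    return men, women


# Coefficients of the parallel lines model from this part
men, women = group_lines(278, 0.695, b_gender=-63.6)
print("Men:   intercept %.1f, slope %.3f" % men)
print("Women: intercept %.1f, slope %.3f" % women)
```

With no interaction term the two slopes are equal, which is what makes the lines parallel.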

Part 2

Long Jump = 273 + 0.757 Time - 0.807 Time*Gender or
Men: Long Jump = 273 + 0.757 Time
Women: Long Jump = 273 - 0.05 Time
R2 = 79.8%
It is not a good model according to the residual vs. fits plot.

Part 3

Long Jump = 279 + 0.662 Time - 81.9 Gender + 0.258 Time*Gender or
Men: Long Jump = 279 + 0.662 Time
Women: Long Jump = 197.1 + 0.920 Time
R2 = 90.4%
It is a good model according to the residual vs. fits plot.
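The three models in Parts 1 to 3 differ only in which columns enter the regression. The sketch below, which assumes NumPy is available, builds the design matrix for each model and fits it by ordinary least squares. The function names and the model labels ("parallel", "equal_intercept", "independent") are invented for this illustration, and the numbers in the demo are synthetic and noise-free, not the Olympic data.

```python
import numpy as np


def design_matrix(time, gender, model):
    """Columns of the regression for each of the three models."""
    time = np.asarray(time, dtype=float)
    gender = np.asarray(gender, dtype=float)
    ones = np.ones_like(time)
    columns = {
        "parallel": [ones, time, gender],                    # shared slope
        "equal_intercept": [ones, time, time * gender],      # shared intercept
        "independent": [ones, time, gender, time * gender],  # nothing shared
    }
    return np.column_stack(columns[model])


def fit(time, gender, y, model):
    """Least squares coefficients and R^2 for the chosen model."""
    X = design_matrix(time, gender, model)
    y = np.asarray(y, dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ coef) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return coef, 1 - sse / sst


# Synthetic, noise-free numbers (not the Olympic data), only to exercise the code
time = [0, 10, 20, 30, 0, 10, 20, 30]
gender = [0, 0, 0, 0, 1, 1, 1, 1]
y = [5 + 2 * t - 3 * g + 0.5 * t * g for t, g in zip(time, gender)]
for model in ("parallel", "equal_intercept", "independent"):
    coef, r2 = fit(time, gender, y, model)
    print(model, np.round(coef, 3), round(r2, 4))
```

Because the independent lines model contains every column of the other two, its R2 can never be lower than theirs; whether the extra column is worth it is a separate question.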

Part 4

Both the independent lines model and the parallel lines model have a Mallows' Cp of 4.0, so they are equally good. Using the parallel lines model, we find the 95% PI for the men to be (327, 373) and for the women to be (264, 309).
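For reference, Mallows' Cp under its usual definition is SSE_p / MSE_full - n + 2p, where p counts the coefficients including the intercept. The sketch below assumes that definition; the function name and the values of n and MSE are hypothetical, chosen only to show that the largest model always gets Cp equal to its own number of coefficients (4 for the independent lines model).

```python
def mallows_cp(sse_sub, mse_full, n, p):
    """Cp = SSE_p / MSE_full - n + 2p, with p counting the intercept."""
    return sse_sub / mse_full - n + 2 * p


# Hypothetical values, not taken from the Olympic data
n = 40           # number of observations (assumed)
p_full = 4       # intercept, Time, Gender, Time*Gender
mse_full = 25.0  # MSE of the full model (assumed)

# For the full model itself, SSE = MSE_full * (n - p_full), so Cp = p_full
sse_full = mse_full * (n - p_full)
print(mallows_cp(sse_full, mse_full, n, p_full))
```

A smaller model is judged adequate when its Cp is close to or below its own p.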

Part 5

The regression equation is Long Jump = 282 + 0.640 Time - 63.5 Sex - 29.4 "1896". The R2 is 91.8% and the 95% PI for men is (328, 369).
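To see what the dummy variable does, the fitted equation can be evaluated with the dummy switched on and off. This is only a sketch using the rounded coefficients quoted above; the function and argument names are made up for the illustration.

```python
def predict_long_jump(time, sex, is_1896):
    """Fitted value from the equation above (time = year - 1900, sex: 0 men, 1 women)."""
    return 282 + 0.640 * time - 63.5 * sex - 29.4 * is_1896


time_1896 = 1896 - 1900  # Time = -4
with_dummy = predict_long_jump(time_1896, sex=0, is_1896=1)
without_dummy = predict_long_jump(time_1896, sex=0, is_1896=0)
print(round(with_dummy, 2), round(without_dummy, 2))
```

The dummy is 0 for every other year, so it shifts only the 1896 fitted value and leaves predictions for later years to the remaining three terms.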

Problem 2

All the predictors are continuous, so we use the scatterplot for all of them.
Complaints
Relationship: strong positive (r = 0.825)
Normal Residuals: O.K.
Equal Variance: O.K.
Linear Model: O.K.

Privileges
Relationship: weak positive (r = 0.426)
Normal Residuals: O.K.
Equal Variance: O.K.
Linear Model: O.K.

Learn New Things
Relationship: positive (r = 0.624)
Normal Residuals: O.K.
Equal Variance: O.K.
Linear Model: O.K.

Raises
Relationship: positive (r = 0.590)
Normal Residuals: O.K.
Equal Variance: O.K.
Linear Model: O.K.

Too Critical
Relationship: weak positive (r = 0.156)
Normal Residuals: O.K.
Equal Variance: O.K.
Linear Model: O.K.

Advancing
Relationship: weak positive (r = 0.155)
Normal Residuals: O.K.
Equal Variance: O.K.
Linear Model: O.K.

The highest correlation between two predictor variables is 0.669 (Complaints and Raises). This is not large enough to cause a problem for the regression.
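Checking the predictors against each other amounts to scanning the off-diagonal entries of their correlation matrix. A minimal sketch, assuming NumPy and a data matrix with one column per predictor; the function name is invented for this example and the demo columns are made-up numbers, not the Employee Attitudes data.

```python
import numpy as np


def max_predictor_correlation(X, names):
    """Return (r, name_i, name_j) for the pair of columns with the largest |r|."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    best = (0.0, None, None)
    k = corr.shape[0]
    for i in range(k):
        for j in range(i + 1, k):  # upper triangle only, skipping the diagonal
            if abs(corr[i, j]) > abs(best[0]):
                best = (corr[i, j], names[i], names[j])
    return best


# Made-up columns, only to exercise the function
X_demo = np.column_stack([
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 11],
    [5, 1, 4, 2, 3],
])
print(max_predictor_correlation(X_demo, ["a", "b", "c"]))
```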

Using all the predictors in a regression model yields:
The regression equation is: Rating = 10.8 + 0.613 Complaints - 0.073 Privileges + 0.320 Learn new Things + 0.082 Raises + 0.038 Too Critical - 0.217 Advancing
Residual vs. Fits plot and normal plot look good, so this is a good model and the assumptions are justified.
R-Sq = 73.3%
According to the hypothesis tests for the predictor variables, Complaints is highly significant (p value = 0.001), Learn new Things is borderline significant (p value = 0.070), and none of the other predictors is significant.

According to the best subset regression, the best model uses Complaints and Learn new Things, with a Mallows' Cp of 1.1.
The model is: Rating = 9.87 + 0.644 Complaints + 0.211 Learn new Things
R-Sq = 70.8%
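Best subset regression can be sketched as fitting every combination of predictors and ranking the fits by Mallows' Cp. The code below is one possible implementation under that assumption, using NumPy; it is not the procedure of any particular statistics package, and the function names are made up. The demo uses seeded synthetic data in which only two of four predictors matter.

```python
import itertools
import numpy as np


def sse(X, y):
    """Residual sum of squares of the least squares fit of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)


def best_subsets_cp(X, y, names):
    """Return (Cp, predictor names) for every non-empty subset, smallest Cp first."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    ones = np.ones((n, 1))
    mse_full = sse(np.hstack([ones, X]), y) / (n - k - 1)
    results = []
    for size in range(1, k + 1):
        for idx in itertools.combinations(range(k), size):
            p = size + 1  # coefficients including the intercept
            sub = np.hstack([ones, X[:, list(idx)]])
            cp = sse(sub, y) / mse_full - n + 2 * p
            results.append((cp, [names[i] for i in idx]))
    return sorted(results, key=lambda r: r[0])


# Synthetic data (not the Employee Attitudes data): only "a" and "c" matter
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(50, 4))
y_syn = 3 + 2 * X_syn[:, 0] - 1.5 * X_syn[:, 2] + rng.normal(scale=0.1, size=50)
for cp, subset in best_subsets_cp(X_syn, y_syn, ["a", "b", "c", "d"])[:3]:
    print(round(cp, 2), subset)
```

With six predictors, as in this problem, the search covers 63 subsets, which is still cheap to enumerate.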

Variable            Mean
Rating              64.63
Complaints          66.60
Privileges          53.13
Learn new Things    56.37
Raises              64.63
Too Critical        74.77
Advancing           42.93

The 99% confidence interval for the mean rating is (61.19, 68.08).
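As a sanity check, the point estimate at the center of this interval can be reproduced by plugging the predictor means into the two-variable model. The sketch below uses only the rounded coefficients and means quoted above; the variable names are chosen for this illustration.

```python
coef = {"Intercept": 9.87, "Complaints": 0.644, "Learn new Things": 0.211}
means = {"Complaints": 66.60, "Learn new Things": 56.37}

# Fitted mean rating at the mean of each predictor in the model
point_estimate = coef["Intercept"] + sum(coef[name] * means[name] for name in means)

lower, upper = 61.19, 68.08  # the 99% confidence interval quoted above
midpoint = (lower + upper) / 2

print(round(point_estimate, 2), round(midpoint, 2))
print(lower < point_estimate < upper)
```

A least squares fit with an intercept passes through the point of means, so the estimate should equal the mean Rating of 64.63; the small difference comes from the rounded coefficients.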