Exercises - Simple Regression

Problem 1

This problem uses the Hubble's Constant dataset.

Part 1

Find the slope of the least squares regression line.

Part 2

Check the assumptions of least squares regression.

Part 3

Find the 68% interval estimate for the velocity of a galaxy that is 0.25 parsec from Earth. Is this a "prediction" or an "extrapolation" problem?

Problem 2

For this exercise we will use the Olympics data set, specifically the Discus throw.

Part 1

Is there a relationship between the year and the discus throw? If so, how strong is it?

Part 2

Are there any outliers in the data set? If there are any, what would you do with them?

Part 3

Is a linear model o.k. for the relationship between discus and year?

Part 4

Using the linear model, find a 99% interval estimate for the gold medal winning discus throw in the next Olympics.

Part 5

For each of the following models, find the corresponding equation and decide whether the model gives a good fit to the data: quadratic, cubic, square root, exponential and power.

Part 6

Decide which of the models is best.

Solutions

Problem 1

Part 1

Run the command Stat-Regression-Regression and read the first lines of the output:
The regression equation is
Velocity = - 40.8 + 454 Distance
The slope is the coefficient of Distance, so the slope is 454.
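
For readers without the menu-driven software, the slope can also be computed directly from the least squares formulas. The following is a minimal sketch, not the course's method: the function name least_squares_line is hypothetical, and the data are made up for illustration (they are NOT the Hubble measurements).

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of squares of x and sum of cross products around the means
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up illustration data, not the Hubble dataset
distance = [0.5, 1.0, 1.5, 2.0]
velocity = [200.0, 400.0, 700.0, 900.0]

intercept, slope = least_squares_line(distance, velocity)
print(f"Velocity = {intercept:.1f} + {slope:.1f} Distance")
```

Applied to the actual Distance and Velocity columns, this should reproduce the regression equation reported above, up to rounding.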

Part 2

There are three assumptions:
a) Do we have a good model? - Check the residuals vs. fits plot
b) Do the residuals have a normal distribution? - Check the normal probability plot
c) Do the residuals have equal variance? - Check the residuals vs. fits plot
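
The quantities behind both diagnostic plots can be computed by hand. The sketch below is an illustration under stated assumptions: the data are made up (not the Hubble measurements), the intercept and slope passed in are the least squares fit for that made-up data, the helper names are hypothetical, and the plotting positions (i - 0.375)/(n + 0.25) are one common convention that is not necessarily the one the software uses.

```python
from statistics import NormalDist


def fits_and_residuals(xs, ys, intercept, slope):
    """Return the fitted values and the residuals (observed minus fitted)."""
    fits = [intercept + slope * x for x in xs]
    residuals = [y - f for y, f in zip(ys, fits)]
    return fits, residuals


def normal_scores(n):
    """Approximate expected standard normal quantiles for n ordered values."""
    return [NormalDist().inv_cdf((i - 0.375) / (n + 0.25))
            for i in range(1, n + 1)]


# Made-up illustration data, not the Hubble dataset
distance = [0.5, 1.0, 1.5, 2.0]
velocity = [200.0, 400.0, 700.0, 900.0]

# -50 and 480 are the least squares intercept and slope for the data above
fits, residuals = fits_and_residuals(distance, velocity, -50.0, 480.0)

# Points for the residuals vs. fits plot: look for patterns or changing spread
residual_plot_points = list(zip(fits, residuals))

# Points for the normal probability plot: look for a roughly straight line
scores = normal_scores(len(residuals))
probability_plot_points = list(zip(scores, sorted(residuals)))
```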

Part 3

We want the interval for one galaxy, so we need the prediction interval.
Go to Stat-Regression-Regression, go to Options, type the desired x value in the box, and change 95 to the desired confidence level.
Answer: the 68% PI for Velocity is (-174.3, 319.9).
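
The interval reported by the software follows the standard prediction interval formula for simple regression: fitted value plus or minus t times s times sqrt(1 + 1/n + (x0 - xbar)^2 / Sxx). The sketch below implements that formula under some assumptions: the function name is hypothetical, the data are made up (not the Hubble measurements), and the critical value t_crit must be supplied by the caller from a t table with n - 2 degrees of freedom. The value 1.0 used in the example is a placeholder, not the correct critical value for any particular confidence level.

```python
import math


def prediction_interval(xs, ys, x0, t_crit):
    """Prediction interval for one new observation at x0.

    t_crit is the two-sided t critical value with n - 2 degrees of freedom
    for the desired confidence level; it is looked up by the caller.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residual standard deviation s, with n - 2 degrees of freedom
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))

    y_hat = intercept + slope * x0
    half_width = t_crit * s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
    return y_hat - half_width, y_hat + half_width


# Made-up illustration data, not the Hubble dataset
distance = [0.5, 1.0, 1.5, 2.0]
velocity = [200.0, 400.0, 700.0, 900.0]

# t_crit = 1.0 is a placeholder value chosen only to keep the example simple
lower, upper = prediction_interval(distance, velocity, x0=1.25, t_crit=1.0)
```

The leading 1 under the square root is what separates a prediction interval for one observation from a confidence interval for the mean response, and it is why the prediction interval is always the wider of the two.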

Difference between prediction and extrapolation:
If the x value lies within the range of the observed x values, it is prediction; otherwise it is extrapolation.
Above we found the interval for Distance = 0.25. The range of Distance in the dataset is 0.032 to 2.000, so this is a prediction problem. If we had found the PI for Distance = 2.7, it would have been an extrapolation problem.
Note: The word "prediction" is used with two different meanings here:
a) Prediction interval: an interval estimate for an individual observation
b) Prediction (as opposed to extrapolation): estimation at an x value inside the observed range
For example, say we want the 68% interval for a galaxy that is 2.7 parsec from Earth. The answer is the 68% prediction interval (907.5, 1463.4). It is called a prediction interval, but it is also an extrapolation.
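
The prediction vs. extrapolation rule stated above amounts to a simple range check. A minimal sketch, with a hypothetical function name and the observed Distance range taken from the solution text:

```python
def classify_x(x0, x_min, x_max):
    """Return "prediction" if x0 lies within the observed x range,
    otherwise "extrapolation"."""
    if x_min <= x0 <= x_max:
        return "prediction"
    return "extrapolation"


# Observed range of Distance quoted in the solution: 0.032 to 2.000
print(classify_x(0.25, 0.032, 2.0))  # prediction
print(classify_x(2.7, 0.032, 2.0))   # extrapolation
```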


Problem 2

Part 1

From the scatter plot of discus vs. year it is clear that we have a strong positive relationship. The Pearson correlation of Year and Discus is r = 0.979 with a p-value reported as 0.000 (that is, less than 0.0005).
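
The correlation coefficient quoted above can be computed from its definition. This is a sketch only: the function name is hypothetical and the two small datasets are made up to show the extreme cases r = 1 and r = -1. They are not the Olympics data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Made-up illustration data, not the Olympics dataset
years = [1900, 1904, 1908, 1912]
increasing = [10.0, 20.0, 30.0, 40.0]   # perfectly linear, rising
decreasing = [40.0, 30.0, 20.0, 10.0]   # perfectly linear, falling

r_up = pearson_r(years, increasing)
r_down = pearson_r(years, decreasing)
```

A value such as r = 0.979 sits close to the upper extreme, which is what "strong positive relationship" means here.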

Part 2

A possible outlier is the observation in the lower left corner of the scatter plot. This outlier comes from the first Olympic games in 1896. It appears to be only a slight outlier, therefore I would leave it alone.

Part 3

The residuals vs. fits plot shows a pattern similar to an upside-down parabola, therefore a linear model is not o.k.

Part 4

Because only one person will win this gold medal, we find the 99% prediction interval. It is (2640, 3240).

Part 5

Model        Equation                                      Fit
Square Root  y = -246 + 0.15√x                             not o.k.
Exponential  y = 0.0046 × 10^(0.0029x)                     not o.k.
Power        y = 10^(-39.6) × x^13.0                       not o.k.
Quadratic    y = -253000 + 248x - 0.06x^2                  o.k.
Cubic        y = 892000 - 1510x + 0.844x^2 - 0.00015x^3    o.k.

Part 6

Step 1: The linear model is bad, so we go on.
Step 2: Transformations. None of them are good.
Step 3: Polynomial models.

Model      p-value of highest order term
Quadratic  0.005
Cubic      0.388

The quadratic term is significant but the cubic term is not, so the quadratic is the polynomial model.
Step 4: The best model is the quadratic.
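
Step 3 can be written as a small rule: raise the degree as long as the highest order term stays significant. The sketch below is one way to express it. The function name is hypothetical, and the 0.05 significance level is an assumption, since the solution does not state which cutoff it uses. The p-values are the ones from the table above.

```python
def choose_polynomial_degree(p_values, alpha=0.05):
    """Pick the highest polynomial degree whose highest order term is
    significant, stopping at the first degree that is not.

    p_values maps degree -> p-value of the highest order term.
    alpha = 0.05 is an assumed significance level.
    Returns None if even the lowest degree listed is not significant.
    """
    chosen = None
    for degree in sorted(p_values):
        if p_values[degree] < alpha:
            chosen = degree
        else:
            break
    return chosen


# p-values of the highest order term, taken from the solution's table
p_values = {2: 0.005, 3: 0.388}
best_degree = choose_polynomial_degree(p_values)
print(best_degree)  # 2, the quadratic model
```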