Regression

If there is a relationship between variables "x" and "y", can we describe it?
We do that by finding a model, that is, an equation y = f(x).
Here we keep it very simple and consider only linear relationships, that is, equations of the form y = mx + b. In statistics, though, we use a slightly different notation, with b0 for the intercept and b1 for the slope:
y = b0 + b1x

The logic here is this: if we know x, we can compute y. Unfortunately there are always "errors" in this calculation, so the answer y varies even for the same x. For example, let x be the number of hours a student studies for an exam, and y the score on the exam. Say we know from long experience that y = 50 + 5x. So even a student who doesn't study at all (x = 0) would still get around 50 points, and for every hour studied the score goes up by about 5 points.
But of course there are many other factors influencing the grade, such as general ability, previous experience, being healthy on the day of the exam, exam anxiety etc., so for any specific student the score will not be exactly what the equation predicts. If three students all study 6 hours, the equation predicts a score of 50 + 5*6 = 80 for all of them, but one might get a 69, the next a 78 and the third a 99. What the equation predicts is actually the mean score of all students who study 6 hours.
This is illustrated in the next graph:

where the scores of the people who studied 6 hours are in red, and their mean score is marked by an X.
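
The idea that the equation predicts a mean rather than an individual score can be tried out with a small simulation. The following is only a sketch in Python: the model y = 50 + 5x is the one from the text, but the normally distributed error, its standard deviation of 10 points, the function name simulate_scores and the fixed random seed are all assumptions made here for illustration.

```python
import random
import statistics


def simulate_scores(hours, n_students, noise_sd=10.0, seed=0):
    """Simulate exam scores for students who all studied `hours` hours.

    The line y = 50 + 5x is taken from the text. The normal error with
    standard deviation `noise_sd` is an assumption, used only to mimic
    the "other factors" that make individual scores differ.
    """
    rng = random.Random(seed)
    predicted = 50 + 5 * hours
    return [predicted + rng.gauss(0, noise_sd) for _ in range(n_students)]


scores = simulate_scores(hours=6, n_students=10_000)

# Individual scores are spread out around the prediction ...
print(min(scores), max(scores))
# ... but their mean is close to 50 + 5*6 = 80.
print(statistics.mean(scores))
```

With many simulated students the mean settles near 80, while single scores can be far from it, which is the same picture as the three students in the example.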

b0 and b1 are numbers that depend on the population from which the data (X,Y) is drawn. Therefore they are parameters just like the mean or the median.

A standard problem is this: we have a dataset and we believe there is a linear relationship between x and y. We would like to know the equation y = b0 + b1x, that is, we need to "guess" what b0 and b1 are. We will estimate them by a method called least squares regression. Here are the formulas:

b1 = Sxy / Sxx
b0 = ȳ - b1·x̄

where x̄ and ȳ are the means of the x and y values, Sxx = Σ(x - x̄)² and Sxy = Σ(x - x̄)(y - ȳ).

As an example consider again the toy example from before:
x 1 2 3 4 5
y 1 3 2 2 5

where we find b1 = 0.7 and b0 = 0.5, so the least squares regression line is
y = 0.5 + 0.7x
and the graph with the line looks like this:
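
The estimates for the toy data can also be checked with a short calculation outside MINITAB. The following is a minimal Python sketch; the function name least_squares and the variable names are chosen here for illustration, and it uses the standard least squares formulas b1 = Sxy/Sxx and b0 = ȳ - b1·x̄.

```python
def least_squares(x, y):
    """Return (b0, b1) for the least squares line y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross products around the means
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx          # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1


x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 2, 5]
b0, b1 = least_squares(x, y)
print(round(b0, 3), round(b1, 3))  # 0.5 0.7
```

Here x̄ = 3, ȳ = 2.6, Sxx = 10 and Sxy = 7, which gives b1 = 7/10 = 0.7 and b0 = 2.6 - 0.7*3 = 0.5.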

In MINITAB we have Stat > Regression > Regression to do the calculations:

Example: for the draft data, use Stat > Regression > Regression with Response = Draft Number and Predictors = "Day of the Year":
The regression equation is
Draft Number = 225 - 0.226 Day of Year
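
To see how such an equation is read, it can be evaluated for a few days of the year. This is just a small Python sketch: the coefficients 225 and -0.226 are the rounded values printed in the output above, and the function name predicted_draft_number is made up here.

```python
def predicted_draft_number(day_of_year):
    """Evaluate the fitted line from the output above (rounded coefficients)."""
    return 225 - 0.226 * day_of_year


# First day, a day near the middle, and the last day of the year
for day in (1, 183, 366):
    print(day, round(predicted_draft_number(day), 1))
```

Because the slope is negative, the predicted (mean) draft number gets smaller the later in the year the day falls.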

A graph that is always worth drawing is the fitted line plot, that is, the scatterplot with the fitted model on top. It is done in Stat > Regression > Fitted Line Plot

Here are three important facts about least squares regression:

1) Consider what happens if we predict y for the mean of the x values, x̄. Because b0 = ȳ - b1·x̄, we get

b0 + b1·x̄ = (ȳ - b1·x̄) + b1·x̄ = ȳ

so the point (x̄, ȳ) always lies on the least squares regression line.

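2) As we see in the graph above, there is a relationship between the least squares regression line and the correlation coefficient r:

b1 = r·sqrt(Syy/Sxx)

where Sxx and Syy are the sums of squared deviations of the x values and of the y values from their means.

One consequence of this is that if r = 0, then b1 = 0. This fits together: r = 0 means there is no linear relationship, and b1 = 0 means we have a horizontal line, so knowing x does not change the predicted y.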

Note that in the draft data we have Sxx = Syy, because both x and y consist of the numbers 1 to 366, only in different orders, and the order does not matter in a sum. Therefore, for this data set the slope of the least squares regression line, -0.226, is the same as the correlation coefficient r!
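
Both statements can be checked numerically. The sketch below is in Python and makes two assumptions: the helper names are invented for illustration, and the second data set is a random shuffle of the numbers 1 to 366, not the actual draft data. It only serves to show that the slope and r agree whenever x and y contain the same numbers.

```python
import math
import random


def sums_of_squares(x, y):
    """Return (Sxx, Syy, Sxy), the sums of squares around the means."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xx, s_yy, s_xy


def slope_and_r(x, y):
    """Return the least squares slope b1 and the correlation coefficient r."""
    s_xx, s_yy, s_xy = sums_of_squares(x, y)
    return s_xy / s_xx, s_xy / math.sqrt(s_xx * s_yy)


# Toy data from the text: b1 equals r * sqrt(Syy / Sxx)
x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 2, 5]
s_xx, s_yy, _ = sums_of_squares(x, y)
b1, r = slope_and_r(x, y)
print(b1, r * math.sqrt(s_yy / s_xx))

# Random permutation of 1..366 (NOT the real draft data): Sxx = Syy,
# so the slope and r coincide up to rounding error.
rng = random.Random(1)
days = list(range(1, 367))
numbers = days[:]
rng.shuffle(numbers)
b1_perm, r_perm = slope_and_r(days, numbers)
print(b1_perm, r_perm)
```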

3) We have seen previously that for the correlation coefficient it does not matter which variable we choose as x and which as y, that is, we have cor(x,y) = cor(y,x). This is not true for regression. For example, in our introductory example above we found y = 0.5 + 0.7x, but if we do the regression reversing the roles of x and y we find the least squares regression line to be x = 1.02 + 0.761y. Note that this is not the same equation that we would get if we simply solved for x:
y = 0.5 + 0.7x, therefore y - 0.5 = 0.7x, therefore x = -0.71 + 1.43y
In regression it is important to distinguish between the independent variable (x), which is used to predict, and the dependent variable (y), which is predicted.
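
The difference between the two lines can be reproduced for the toy data. The following Python sketch uses a made-up helper fit_line based on the standard least squares formulas; it fits the line once with y as the response and once with x as the response, and compares the second fit with what solving the first equation for x would give.

```python
def fit_line(predictor, response):
    """Return (intercept, slope) of the least squares line for response on predictor."""
    n = len(predictor)
    p_bar = sum(predictor) / n
    r_bar = sum(response) / n
    s_pp = sum((p - p_bar) ** 2 for p in predictor)
    s_pr = sum((p - p_bar) * (q - r_bar) for p, q in zip(predictor, response))
    slope = s_pr / s_pp
    return r_bar - slope * p_bar, slope


x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 2, 5]

b0, b1 = fit_line(x, y)   # y = b0 + b1*x
c0, c1 = fit_line(y, x)   # x = c0 + c1*y, roles reversed

print(round(c0, 2), round(c1, 3))             # 1.02 0.761
# Solving y = b0 + b1*x for x gives x = -b0/b1 + (1/b1)*y instead:
print(round(-b0 / b1, 2), round(1 / b1, 2))   # -0.71 1.43
```

The two regressions minimize different things: the first makes the errors in the y direction small, the second the errors in the x direction, so the two lines coincide only when all points lie exactly on one line.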

For more on least squares regression see page 185 of the textbook.