Regression to the Mean

Consider the golf scores dataset. The next graph shows fitted line plots for the first and second rounds of four tournaments: the 2006 AT&T Pebble Beach National Pro-Am, the 2006 Honda Classic, the 2006 Byron Nelson Classic and the 2007 Sony Open:

In all four cases (and these were randomly chosen from all the tournaments) the slope of the regression line is positive but less than 1. Is this surprising?

• That the slope is positive is explained by the fact that golfers (even pros) have long-term ups and downs: for weeks or months they play well, and then they play not so well. So a golfer who plays badly in the first round is also likely to play badly in the second, and vice versa.

• That the slope is less than one is a general phenomenon, often observed in real life, called regression to the mean. Here it works like this: a golfer at any one moment in time has a given scoring ability. It will change over time, but not from one day to the next. So if somebody shot a high score one day, they are likely not playing well overall, but they also likely had an exceptionally bad day on top of that. The next day their overall ability is still the same, but they likely won't have a bad day as well, so their score should go down. By the same logic, if somebody shot a very low score one day, they are not likely to be able to do it again the next day.
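The argument in the two points above can be checked with a small simulation. This is only a sketch under made-up assumptions, not an analysis of the tournament data: each simulated golfer gets a fixed "ability" (the same in both rounds) plus independent day-to-day "luck" in each round, and the mean of 72 and the standard deviations of 1.5 and 3.0 strokes are hypothetical values chosen for illustration. Under this model the slope of round 2 on round 1 should be close to var(ability) / (var(ability) + var(luck)), which is positive but less than 1.

```python
import random

def simulate_rounds(n_players, ability_sd, luck_sd, seed=0):
    """Simulate two rounds per player: score = fixed ability + daily luck."""
    rng = random.Random(seed)
    round1, round2 = [], []
    for _ in range(n_players):
        ability = rng.gauss(72.0, ability_sd)  # same ability in both rounds
        round1.append(ability + rng.gauss(0.0, luck_sd))  # luck on day 1
        round2.append(ability + rng.gauss(0.0, luck_sd))  # fresh luck on day 2
    return round1, round2

def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

# Hypothetical values, not estimated from the golf data
ABILITY_SD, LUCK_SD = 1.5, 3.0

round1, round2 = simulate_rounds(20000, ABILITY_SD, LUCK_SD)
slope = ols_slope(round1, round2)
expected_slope = ABILITY_SD**2 / (ABILITY_SD**2 + LUCK_SD**2)

# The 10% of players with the highest (worst) round 1 scores
order = sorted(range(len(round1)), key=lambda i: round1[i], reverse=True)
worst = order[: len(order) // 10]
worst_r1 = sum(round1[i] for i in worst) / len(worst)
worst_r2 = sum(round2[i] for i in worst) / len(worst)

print(f"fitted slope: {slope:.3f}, slope implied by the model: {expected_slope:.3f}")
print(f"worst 10% in round 1: mean {worst_r1:.2f} in round 1, {worst_r2:.2f} in round 2")
```

The players who did worst in round 1 should still score above the overall average in round 2 (their ability really is a bit lower), but much closer to it, because the bad luck of round 1 does not carry over.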

This is one of the most misunderstood principles of statistics. Say a player has done very well in the first round of a major tournament, but then he plays not so well in the second. Almost always commentators will say that he "felt the pressure" and so played worse. In reality it is likely just regression to the mean: the first day he played well above his natural ability, and the second day he came back to it.

This is a phenomenon that we find everywhere. For example, it helps doctors: who goes to a doctor? People who don't feel well. But some of them would get better anyway, doctor or no doctor. Yet they will all think it was the doctor who did it! This, by the way, is also one of the explanations for the famous placebo effect.

Note: there is something special about these datasets: x and y measure the same thing (number of strokes needed for a round). In such a case it is often a good idea to draw the scatterplot with the same x and y scale.

Note: the slope for the AT&T is really small: 0.087. Is there even a relationship between the first and second rounds? Here are the correlation coefficients:

Tournament              r       p-value
Sony Open               0.305   0.000
AT&T Pebble Beach       0.088   0.261
Honda Classic           0.259   0.002
Byron Nelson Classic    0.138   0.091
Shell Houston Open      0.294   0.000

So for the AT&T and Byron Nelson tournaments there is no statistically significant relationship between the two rounds (at these sample sizes!). Any idea why these two are different?
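The remark about sample sizes can be made concrete with a rough calculation. The sketch below uses the Fisher z-transformation with a normal approximation to get an approximate two-sided p-value for a correlation r from n pairs. This is not necessarily the test used to produce the table above, and because the number of golfers in each tournament is not given here, the values of n are purely hypothetical.

```python
import math
from statistics import NormalDist

def approx_correlation_p_value(r, n):
    """Approximate two-sided p-value for H0: no correlation.

    Uses the Fisher z-transformation: atanh(r) * sqrt(n - 3) is roughly
    standard normal under H0. An approximation, not an exact test.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

r_att = 0.088  # correlation reported for the AT&T tournament

# Hypothetical sample sizes, chosen only to show the effect of n
for n in (50, 150, 1000):
    p = approx_correlation_p_value(r_att, n)
    print(f"r = {r_att}, n = {n}: approximate p-value = {p:.3f}")
```

The same small correlation that is not significant with a field of a hundred or so golfers would be clearly significant with a much larger sample, which is why "no significant relationship" here should not be read as "no relationship".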