Correlation

Generally, if there are two or more variables, we are interested in their relationships. We want to investigate two questions:
1) Is there a relationship? - correlation
2) If there is a relationship, can we describe it? - regression

Correlation

Example: Let's consider the 1970's military draft.
How can we check whether the draft was really "random"? Let's have a look at the scatterplot of "Day of the Year" and "Draft Number".

It certainly does not appear that there is a relationship between "Day of the Year" and "Draft Number", but is this really true?

As an additional check we can look at Pearson's correlation coefficient r.

Computation

For the formula of r we first need an extension of our sums of squares formula: the sum of squares of two variables x and y is defined by

Sxy = ∑xy - (∑x)(∑y)/n

The new part is ∑xy, which means: first multiply the pairs of x and y values, then add those products up.

Pearson's correlation coefficient is computed as follows:

r = Sxy/√(Sxx·Syy)

As a simple numerical example consider the following:
x 1 2 3 4 5
y 1 3 2 2 5

and then we get:
n=5
∑x = 15
∑y = 13
∑x² = 55
∑y² = 43
∑xy = 1×1 + 2×3 + 3×2 +4×2 +5×5 = 46

Sxx = ∑x² - (∑x)²/n = 55 - 15²/5 = 10
Syy = ∑y² - (∑y)²/n = 43 - 13²/5 = 9.2
Sxy = ∑xy - (∑x)(∑y)/n = 46 - 15×13/5 = 7

r = Sxy/√(Sxx·Syy) = 7/√(10×9.2) = 0.73
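
The hand calculation above can be checked with a few lines of Python. This is only a sketch that follows the sums-of-squares formulas step by step; the variable names are chosen here for illustration and are not part of MINITAB.

```python
from math import sqrt

x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 2, 5]
n = len(x)

# Raw sums used in the formulas
sum_x = sum(x)
sum_y = sum(y)
sum_x2 = sum(v * v for v in x)             # square first, then add
sum_y2 = sum(v * v for v in y)
sum_xy = sum(a * b for a, b in zip(x, y))  # multiply the pairs, then add

# Sums of squares
Sxx = sum_x2 - sum_x ** 2 / n
Syy = sum_y2 - sum_y ** 2 / n
Sxy = sum_xy - sum_x * sum_y / n

r = Sxy / sqrt(Sxx * Syy)
print(round(Sxx, 2), round(Syy, 2), round(Sxy, 2), round(r, 2))
```

Rounded to two digits, the printed values should agree with the hand calculation (10, 9.2, 7 and 0.73).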

Warning: some common mistakes here:
1)

2) ∑x² is not the same as (∑x)²
3) Don't forget the square root in Sxy/√(Sxx·Syy)
4) Not remembering the formula, or parts of it. MEMORIZE IT!

The correlation coefficient is like the mean, median, standard deviation, Q1 etc.: it comes in two versions:

• it is a statistic when it is found from a sample
• it is a parameter when it belongs to a population

In the first case we use the symbol r, in the second case we use ρ.

In MINITAB we use Stat > Basic Statistics > Correlation to carry out the calculations. For example, we find that the correlation of "Day of the Year" and "Draft Number" is r=-0.226.

Properties of the Correlation Coefficient:

• always -1 ≤ r ≤ 1
• r close to 0 means very small or even no correlation (relationship)
• r close to ±1 means a very strong correlation
• r=-1 or r=1 means a perfect linear correlation (that is, in the scatterplot the dots form a straight line)
• r<0 means a negative relationship (as x gets bigger y gets smaller)
• r>0 means a positive relationship (as x gets bigger y gets bigger)
• r is scale invariant, that is, changing the units of measurement does not change r. For example, consider the following data:
x' 3 5 7 9 11
y' 1 7 4 4 13
then cor(x',y') = cor(x,y) = 0.73 because x'=2x+1 and y'=3y-2
• r treats x and y symmetrically, that is cor(x,y) = cor(y,x)
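
The last two properties can be verified numerically. The sketch below uses a small helper, pearson_r, written here for illustration from the sums-of-squares formulas (it is not a MINITAB command). It also shows one caveat that follows from the formula: rescaling with a negative multiplier keeps the size of r but flips its sign.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r computed from the sums-of-squares formulas."""
    n = len(x)
    sxx = sum(v * v for v in x) - sum(x) ** 2 / n
    syy = sum(v * v for v in y) - sum(y) ** 2 / n
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 2, 5]
x_new = [2 * v + 1 for v in x]  # 3 5 7 9 11
y_new = [3 * v - 2 for v in y]  # 1 7 4 4 13

r_original = pearson_r(x, y)
r_rescaled = pearson_r(x_new, y_new)       # change of units
r_swapped = pearson_r(y, x)                # roles of x and y exchanged
r_flipped = pearson_r([-v for v in x], y)  # negative multiplier

print(round(r_original, 2), round(r_rescaled, 2),
      round(r_swapped, 2), round(r_flipped, 2))
```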

Pearson's correlation coefficient only measures linear relationships; it does not work if a relationship is nonlinear. As examples consider the following, all of which clearly have about the same strength of relationship:

Pearson's correlation coefficient is only useful for the first case. Another situation where Pearson's correlation coefficient does not work is when there are outliers in the dataset. Even just one outlier can determine the correlation coefficient:
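
Both problems can be seen in a small numerical sketch. The data below are made up purely for illustration, and pearson_r is a helper written here from the sums-of-squares formulas. In the first dataset y is completely determined by x (y = x²), yet r is 0; in the second, five points with r = 0 get a large positive r once a single far-away point is added.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r computed from the sums-of-squares formulas."""
    n = len(x)
    sxx = sum(v * v for v in x) - sum(x) ** 2 / n
    syy = sum(v * v for v in y) - sum(y) ** 2 / n
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    return sxy / sqrt(sxx * syy)

# Nonlinear relationship: y is an exact function of x, but not a linear one
x_curve = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [v * v for v in x_curve]
r_curve = pearson_r(x_curve, y_curve)

# Outlier: five points without any linear relationship ...
x_base = [1, 2, 3, 4, 5]
y_base = [3, 1, 2, 1, 3]
r_without = pearson_r(x_base, y_base)

# ... and the same points plus one observation far away from the rest
r_with = pearson_r(x_base + [20], y_base + [20])

print(r_curve, r_without, round(r_with, 2))
```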

Weak vs. no Correlation

It is important to keep two things separate: a situation with two variables which are uncorrelated (ρ=0) and two variables with a weak correlation (ρ≠0 but small). In either case we would find an r close to 0 (but essentially never exactly 0!). Finding out which case it is might be impossible, especially for small datasets.
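
To see why the two cases are hard to tell apart, here is a small simulation sketch. It rests on assumptions made only for this illustration: the data are drawn from normal distributions, the weak correlation is ρ=0.1, and each sample has n=20 pairs. The helper names pearson_r and sample_r are likewise chosen here.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson's r computed from the sums-of-squares formulas."""
    n = len(x)
    sxx = sum(v * v for v in x) - sum(x) ** 2 / n
    syy = sum(v * v for v in y) - sum(y) ** 2 / n
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    return sxy / sqrt(sxx * syy)

def sample_r(rho, n, rng):
    """Draw n pairs with population correlation rho; return the sample r."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    # y = rho*x + noise, scaled so that the population correlation is rho
    y = [rho * a + sqrt(1 - rho ** 2) * rng.gauss(0, 1) for a in x]
    return pearson_r(x, y)

rng = random.Random(1)  # fixed seed so the run can be repeated
n = 20
r_none = [sample_r(0.0, n, rng) for _ in range(200)]  # rho = 0
r_weak = [sample_r(0.1, n, rng) for _ in range(200)]  # rho = 0.1

print("rho=0  :", round(min(r_none), 2), "to", round(max(r_none), 2))
print("rho=0.1:", round(min(r_weak), 2), "to", round(max(r_weak), 2))
```

The two ranges of sample correlations overlap heavily, so a single r from a sample of this size cannot tell us which of the two situations we are in.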

Back to the draft

So, how about the draft? Well, we found r=-0.226. But of course the question is whether -0.226 is close to 0, close enough to conclude that all went well. Actually, the question really is whether the corresponding parameter ρ=0! Let's do a simulation to study this question. Simulations, or Monte Carlo methods, are very powerful methods for "experimenting" with numbers.

Simulation for the 1970's Military Draft

Doing a simulation means teaching the computer to repeat the essential part of an experiment many times. Here the experiment is the draft. What are the important features of this experiment?

• there are the numbers 1-366 in the order from 1 to 366 (in "Day of the Year")

• there are the numbers 1-366 in some random order (in "Draft Number")

In MINITAB we can do this as follows: get the numbers in "Day of the Year" in random order using Calc > Random Data > Sample from Columns, sample 366 rows from "Day of the Year", and store the result in an empty column, say c6. Then we can find the correlation coefficient of "Day of the Year" and c6.

But of course this is only one "run" of the simulation. We would like to repeat this now many times, say 1000 times. Unfortunately this is very difficult with MINITAB. We need to use a very advanced feature called MACROS. These are basically little computer programs. The one I wrote for this is called draftsim. It repeats the steps above 1000 times.
In M119 you can run this MACRO as follows:
Open draft.mpj
hit CTRL-L, type (or copy-paste): %k:\3101\draftsim 'Day of Year'
Here I have the result of one such run:

The MACRO also calculates the percentage of runs with a correlation coefficient farther from 0 than -0.226, shown in the session window. Of the 1000 simulation runs none had an r as far away from 0 as -0.226. This means one of two things happened:

• The draft went fine, but something extremely unlikely happened (something with a probability less than 1 in 1000)
• Something went wrong in the draft.

A probability of less than 1 in 1000 is generally considered too unlikely to be due to chance, so we will conclude that something did go wrong.
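
For readers without MINITAB, the simulation steps described above can be sketched in Python. This is not the draftsim macro (its contents are not shown here); it is a minimal re-creation of the idea, and the names simulate_draft and pearson_r as well as the fixed seed are choices made for this illustration.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson's r computed from the sums-of-squares formulas."""
    n = len(x)
    sxx = sum(v * v for v in x) - sum(x) ** 2 / n
    syy = sum(v * v for v in y) - sum(y) ** 2 / n
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    return sxy / sqrt(sxx * syy)

def simulate_draft(n_days, observed_r, runs=1000, seed=0):
    """Fraction of random lotteries with |r| at least as large as |observed_r|."""
    rng = random.Random(seed)
    days = list(range(1, n_days + 1))  # 1, 2, ..., n_days in order
    draft = days[:]                    # same numbers, to be shuffled
    extreme = 0
    for _ in range(runs):
        rng.shuffle(draft)             # one fair lottery
        if abs(pearson_r(days, draft)) >= abs(observed_r):
            extreme += 1
    return extreme / runs

fraction = simulate_draft(366, -0.226)
print(fraction)
```

The printed fraction should be 0 or very close to it, in line with the result of the macro. Calling simulate_draft(31, -0.226) instead repeats the experiment with only 31 numbers.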

So the next time you see a sample correlation coefficient r=-0.226, can you again conclude that the corresponding population correlation coefficient ρ≠0? Unfortunately no! For example, say that instead of using the day of the year the military had used the day of the month (1-31). Now it looks as follows:

Make a column called Day of Month with 1-31
hit CTRL-L, type (or copy-paste): %k:\3101\draftsim 'Day of Month'
Here I have the result of one such run:

and we see that now about 22% of the simulation runs have |r|>0.226! In these calculations you always have to consider the sample size: for a large one (say 366) we can distinguish -0.226 from 0, for a small one (say 31) we cannot.
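
A short calculation makes the effect of the sample size concrete. It relies on a standard result that is not derived in these notes: when n values are put in a completely random order, the correlation with the original order has mean 0 and standard deviation 1/√(n-1). The sketch below expresses the observed r=-0.226 in units of that standard deviation for both sample sizes.

```python
from math import sqrt

observed_r = -0.226
results = {}
for n in (366, 31):
    sd_r = 1 / sqrt(n - 1)      # spread of r under a purely random ordering
    z = abs(observed_r) / sd_r  # distance of r from 0 in standard deviations
    results[n] = (sd_r, z)
    print(n, round(sd_r, 3), round(z, 2))
```

For n=366 the observed value is more than 4 standard deviations away from 0, which is very unusual; for n=31 it is only a little more than 1 standard deviation away, which happens all the time.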

This discussion is part of a topic we will discuss in greater detail later, namely hypothesis testing.

Here is a little more on the 1970's Military Draft

Correlation vs. Causation


Say we have found a correlation between variables "x" and "y". How can we understand and interpret that relationship?
x = number of firemen responding to a fire
y = damages done by the fire
Say there is a positive correlation between x and y (and in real life there will be!).
Does this mean x causes y?

Please note that saying "x causes y" is not the same as "x determines y". There are usually many other factors besides x that influence y, maybe even some more important than x. For example, say x="Time studied for Exam" and y="Score on Exam". Let's assume that there is a positive correlation between x and y. It is reasonable to conclude that x causes y, that is, studying longer improves the scores. But of course there are also many other factors such as general ability, previous experience, being healthy on the day of the exam, exam anxiety, having a hangover etc.

Discuss smoking vs. lung cancer

The only perfectly satisfactory way to establish a causation is to carry out a randomized experiment, for example a clinical trial. An observational study is always somewhat suspect because we never know about hidden biases.

Things to look for when trying to establish a causation:
•   correlation is strong - the correlation between smoking and lung cancer is very strong
•   correlation is consistent over many experiments - many studies of different kinds of people in different countries over a long time period all have shown this correlation
•   higher doses are associated with stronger responses - people who smoke more have a higher chance of lung cancer
•   the cause comes before the response in time - lung cancer develops after years of smoking. The number of men dying of lung cancer rose as smoking became more common, with a lag of about 30 years. Lung cancer kills more men than any other form of cancer. Lung cancer was rare among women until women started to smoke. Lung cancer in women rose along with smoking, again with a lag of about 30 years, and has now passed breast cancer as the leading cause of cancer deaths among women.
•   the cause is plausible - lab experiments on animals show that tars from cigarette smoke cause cancer.

For more on the correlation coefficient see page 537 of the textbook.