Modern computers allow us to experiment with data.
As an example, consider the data set on the 1970's draft.
In R the dataset draft is organized as a matrix, with 366 rows and two columns. Just typing draft shows the content of the dataset.
In Statistics we really like to look at pictures. For this type of data the standard one is called the scatterplot (really just the data plotted in a Cartesian coordinate system):
To draw this graph in R is very easy: just type plot(draft)
It certainly does not appear that there is a relationship between "Day of the Year" and "Draft Number", but is this really true? As a first hint that this may not be so, let's add the least squares regression line:

This is done with the command
abline(lm(draft[,2]~draft[,1]),lwd=2)
which needs a few explanations:
• in R the rows and the columns of a matrix can be accessed with the [,] notation, so draft[,1] is the first column of the matrix (draft[1,] would be the first row)
• the least squares regression method is a special case of what is called the linear model, and in R the calculations are done with the lm command. It uses the formula structure y~x with the response (y) on the left and the predictor (x) on the right. So lm(draft[,2]~draft[,1]) finds the least squares regression for y(=Draft Number) vs x(=Day)
Type summary(lm(draft[,2]~draft[,1])) alone to see all the things calculated by this command.
• the abline command adds a line to a graph. One way to use it is to type abline(1.5,0.7), which adds a line of the form y=1.5+0.7x. But there are also lots of special versions of this command. So if you call the abline command with an argument that is an lm object, R knows to add the least squares regression line.
• Another argument of abline is lwd, which stands for line width, so adding lwd=2 makes the line thicker. In general you can use the commands args() and help() to see the details of the commands.
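The commands explained above can also be tried without the course data. The following is only a sketch: fake.draft is a hypothetical stand-in matrix filled with a random shuffle, not the real lottery numbers, and the names x, y and fit are chosen just for this illustration.

```r
# Stand-in for the course's draft matrix (NOT the real draft data):
# column 1 is the day of the year, column 2 a randomly shuffled "draft number".
set.seed(1)
fake.draft <- cbind(Day = 1:366, Number = sample(1:366))

x <- fake.draft[, 1]   # first column: day of the year
y <- fake.draft[, 2]   # second column: draft number
fake.draft[1, ]        # first row of the matrix

fit <- lm(y ~ x)       # least squares regression, response ~ predictor
coef(fit)              # intercept and slope of the fitted line

plot(fake.draft)       # scatterplot of column 2 against column 1
abline(fit, lwd = 2)   # add the fitted line, drawn with double width
```

Because the stand-in numbers are shuffled at random, the fitted line here should be nearly flat.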
Back to the draft. If there is no relationship between the x and the y variables, then the line should be flat. Ours seems to have a negative slope, so maybe there is a problem. Of course, the specific data we have depends on the sample we have drawn, and the line will never be perfectly flat. The question is, how much of a slope is too much?
As a second way to look at the data we might find the correlation coefficient r.
Recall some properties of the correlation coefficient:
Always -1≤r≤1
r close to 0 means very small or even no correlation (relationship)
r close to -1 means strong negative relationship
r close to +1 means strong positive relationship
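To see where these properties come from, it can help to compute r directly from its definition once. The vectors x and y below are made up for illustration and have nothing to do with the draft data; r.manual is a name chosen here.

```r
# Made-up illustration data, not from the draft data set
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)

# Pearson's r from its definition: sum of products of deviations,
# divided by the square root of the product of the sums of squares
r.manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r.manual      # close to +1: strong positive relationship
cor(x, y)     # the built-in function gives the same value
cor(x, -y)    # flipping the sign of y flips the sign of r
```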
So, how about the draft? With the command cor(draft[,1],draft[,2]) we find r=-0.226. But of course the question is whether -0.226 is close to 0, close enough to conclude that all went well. In effect we want to do a hypothesis test. This is a method that chooses one of two options. Here these are:
H0: Draft was random vs. Ha: Draft was not random
We have already decided to use Pearson's correlation coefficient as a measure of "randomness" (or more precisely of "independence" of the two variables "Day of the Year" and "Draft Number"). It comes in two versions:
• r - a statistic, that is a number computed from a sample
• ρ - a parameter, that is a number belonging to a population
We have found r=-0.226, but the real question is whether or not ρ=0, so we can rewrite the hypotheses as follows:
H0: ρ=0 (= draft was random)
Ha: ρ≠0 (= draft was not random)
The "traditional" way to answer this question would be to find the sampling distribution of r. For example, if it can be assumed that the central limit theorem applies here (and it does), then a number closely related to r has a sampling distribution which is a t distribution, and the value of this test statistic can be compared to a t table. All of this is implemented in the command cor.test, so type cor.test(draft[,1],draft[,2]). We see that p-value = 1.3·10^-5, which is smaller than any reasonable type I error probability α, so we reject the null hypothesis. It appears that ρ≠0.
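The "number closely related to r" is usually taken to be t = r·sqrt(n-2)/sqrt(1-r^2), which under H0 is compared to a t distribution with n-2 degrees of freedom. As a rough check, here is a sketch that recomputes the test by hand from the rounded value r=-0.226 and n=366. The names t.stat and p.value are chosen for this illustration, and because r is rounded the result only approximates the cor.test output.

```r
r <- -0.226   # correlation reported in the text (rounded)
n <- 366      # number of days in the data set

# Test statistic: under H0 (rho = 0) approximately t with n - 2 df
t.stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value: probability of a t value at least this extreme
p.value <- 2 * pt(-abs(t.stat), df = n - 2)

t.stat
p.value
```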
The above works perfectly fine, but in general there could be two problems:
• Just about every statistical method has assumptions, what do we do if these are either violated or hard to verify?
• What if we wish to use a test statistic with no known sampling distribution?
In these situations (and many others) we can try to do a simulation:
• Days=1:366 makes a variable Days with the numbers 1,2,3,..,366
• y=sample(Days,size=366,replace=F) randomly shuffles the numbers in Days, just what the draft lottery was supposed to do
• cor(Days,y) finds the correlation.
In R it is generally a good idea not to save intermediate steps. Here we can combine the above into just one line, without saving the Days and the y:
• cor(1:366,sample(1:366,size=366,replace=F))
But we need to do this many times, so let's automate the process:
• z=rep(0,1000) makes a variable z with 1000 entries, all 0
• R is not just a statistics program, it is actually also a programming language, like C++, BASIC, Fortran etc. As such it has all the standard parts of a programming language, for example a for loop:
for(i in 1:1000) z[i]=cor(1:366,sample(1:366,size=366,replace=F))
does the above 1000 times, putting the correlation coefficients in the vector z.
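Put together, the steps above make a short script. The call to set.seed is an addition (it is not part of the commands above) and is there only so that the run can be repeated with the same result; the value 2024 is arbitrary.

```r
set.seed(2024)        # assumption: fixed seed, only for reproducibility

z <- rep(0, 1000)     # storage for 1000 correlation coefficients
for (i in 1:1000) {
  # shuffle the days 1..366 and correlate them with the unshuffled days
  z[i] <- cor(1:366, sample(1:366, size = 366, replace = FALSE))
}

summary(z)            # the simulated correlations cluster around 0
max(abs(z))           # largest correlation seen under a fair shuffle
```

Comparing max(abs(z)) with the observed 0.226 already gives a feeling for how unusual the draft data are.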
Another great feature of R is that we can write our own functions. For this we use the fix command. Typing fix(draft.sim) opens an editor in a new window. There we can write our function. In addition to the above, draft.sim lets us choose how often we want to repeat the simulation (n times, with a default of n=1000), draws a histogram of the z vector, and counts the number of times a z value exceeds 0.226 in absolute value.
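The function itself is not listed here, so the following is only a sketch of what such a function might look like, based on the description. The argument cutoff, the plot labels, and the choice to return the count are assumptions made for this illustration.

```r
# A possible draft.sim: simulate n random drafts, draw a histogram of the
# correlations, and count how many are at least as extreme as the cutoff.
draft.sim <- function(n = 1000, cutoff = 0.226) {
  z <- rep(0, n)
  for (i in seq_len(n)) {
    z[i] <- cor(1:366, sample(1:366, size = 366, replace = FALSE))
  }
  hist(z, main = "Simulated correlations", xlab = "r")
  sum(abs(z) > cutoff)   # number of runs with |r| above the cutoff
}

draft.sim()           # uses the default of 1000 runs
draft.sim(n = 5000)   # more runs give a finer picture
```

Dividing the returned count by n gives a simulation estimate of the p-value.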
We find that of the 1000 simulation runs none had an r as far away from 0 as -0.226. In the terminology of hypothesis testing we would say that the test has a p-value less than 0.001, and so would reject the null hypothesis. We would conclude that the draft was not random.
So, something did go wrong! Here is a little more on the 1970's Military Draft
Census: If all the entities of a population are included
Sample: any subset of the population
Random sample: a sample found through some randomization (flip of a coin, random numbers on computer etc.)
Simple Random Sample (SRS): each "entity" in the population has an equal chance of being chosen for the sample (This is assumed for all the methods discussed in the later parts of this course)
Stratified Sample: First divide the population into subgroups, then do an SRS in each subgroup.
Example: Male-Female, Freshman-Sophomore-Junior-Senior
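To make the difference between an SRS and a stratified sample concrete, here is a small sketch with a made-up population of 1000 students. The data frame pop, its columns, and the sample sizes are all hypothetical.

```r
set.seed(3)   # assumption: fixed seed, only for reproducibility

# Made-up population: 1000 students, each with a class year
pop <- data.frame(
  id   = 1:1000,
  year = sample(c("Freshman", "Sophomore", "Junior", "Senior"),
                size = 1000, replace = TRUE)
)

# Simple random sample: 40 students, each equally likely to be chosen
srs <- pop[sample(nrow(pop), size = 40), ]

# Stratified sample: an SRS of 10 students within each class year
strata <- split(pop, pop$year)
strat  <- do.call(rbind,
                  lapply(strata, function(g) g[sample(nrow(g), size = 10), ]))

table(srs$year)     # counts per year vary from sample to sample
table(strat$year)   # exactly 10 per year, by construction
```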
Variable: any characteristic of the entities in the population
Data: (singular: datum - plural: data)
Parameter: any numerical quantity associated with a population
Statistic: any numerical quantity associated with a sample
Parameter vs. Statistic
Absolutely fundamental to understanding what goes on in Statistics!
The logic here is this: we really would like to know a parameter. This could be done by carrying out a census but doing a census is complicated, expensive, difficult, time consuming, maybe even impossible. So instead we find a sample and the corresponding statistic. Then we use this number as a guess ("Estimate") of the parameter.
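A small simulation can illustrate this logic. In the sketch below the population is made up (100000 values from an exponential distribution, an arbitrary choice), so that for once we can actually compute the parameter and compare it with the statistic. The names population, mu, smp and xbar are chosen for this illustration.

```r
set.seed(4)   # assumption: fixed seed, only for reproducibility

# Made-up population; in practice we could not look at all of it
population <- rexp(100000, rate = 1 / 50)

mu <- mean(population)                  # parameter: the population mean

smp  <- sample(population, size = 100)  # simple random sample of size 100
xbar <- mean(smp)                       # statistic: the sample mean

mu
xbar   # our estimate of mu; it changes from sample to sample
```

Rerunning the last three lines without resetting the seed shows how the statistic varies while the parameter stays fixed.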