Introduction to ESMA 3101

This page discusses some general concepts of Statistics.

Motivation for learning Statistics

Quote: If I had only one hour to live, I would choose to live it in statistics class because it would seem to last forever (A student's complaint)


Why does Statistics appear to be so boring?

Amazingly enough, a person investigating any of these questions might well end up using the same statistical method to answer them! (It is called the two-sample t test and we will study it at the end of this semester)

The power, strength (and beauty) of statistics lies in its universal applicability!

What is Statistics?

Statistics is the Science of data (or information)

----- How to collect data
----- How to analyze data
----- How to present data

Why everybody should know a little bit about Statistics - Misuse of Statistics

Statistics can be used in many ways to make things appear to be something they are not (lying!)

Another quote: There are Lies, Damn Lies and Statistics (Benjamin Disraeli)

What can we do with Statistics?

Let's consider for a moment the data from WR Inc and have a look at some of the things that statistics can be used for. We will use the data for the whole company in the worksheet CensusData.

First of all there are 23791 observations, one for each employee, each with 6 pieces of information (variables), for a total of 23791×6 = 142746 values. Trying to look at so much information is very difficult, so organizing it in some fashion is very useful. For example, just making a little table can help:
          Counts   Percentage
Male      14281    60%
Female     9510    40%

By the way, simple tasks like counting how many females there are in this company are what computers are very good at. Here we could use the command Stat > Tables > Tally Individual Variables, Gender to do most of the work for us.
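Outside MINITAB, the same kind of tally takes only a few lines. The following Python sketch is an illustration only and is not part of the course software. It starts from the two counts in the table above rather than from the worksheet itself, and recomputes the total and the percentages.

```python
# Counts taken from the table above.
counts = {"Male": 14281, "Female": 9510}
total = sum(counts.values())  # number of employees

# Percentage of the company in each group.
percentages = {gender: 100 * n / total for gender, n in counts.items()}

print(f"{'Gender':<8}{'Count':>8}{'Percent':>9}")
for gender, n in counts.items():
    print(f"{gender:<8}{n:>8}{percentages[gender]:>8.1f}%")
```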

Another good way to study a dataset is via graphs. For example, it seems reasonable that there should be a connection between job level and income. After all, people with a better job usually make more money. Is this true for our company? For this we can use a scatterplot:

The above graph is drawn using the command Graph > Scatterplot > Simple, Y Variable=Income, X Variable=Job Level.

Clearly there is a positive relationship: the higher the job level, the higher the income. An important problem in Statistics is to try and find an equation relating the two variables. Here is an answer:

The MINITAB commands for this graph are: Graph > Regression > Fitted Line Plot, Response=Income, Predictor=Job Level

This is the same graph as above, together with a fitted line, that is a straight line that in a certain way fits the data very well. The equation is:

Income = 26541 + 2473 Job Level

Any idea what this equation is telling us about the Incomes at WR Inc?
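One way to approach this question is to plug a few job levels into the equation and watch how the predicted income changes. The sketch below does this in Python. The function name predicted_income and the job levels 1 to 5 are chosen here for illustration only; the actual range of job levels at WR Inc is not stated on this page.

```python
def predicted_income(job_level):
    """Income predicted by the fitted line Income = 26541 + 2473 Job Level."""
    return 26541 + 2473 * job_level

# Job levels 1 to 5 are an assumption made for this illustration.
for level in range(1, 6):
    print(f"Job level {level}: predicted income {predicted_income(level)}")

# Change in predicted income when the job level goes up by one.
step = predicted_income(2) - predicted_income(1)
print("Increase per job level:", step)
```

The increase per job level is the slope of the line, and the value at job level 0 is the intercept.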

An important question would be whether there is job discrimination in this company, that is whether men are paid more than women. How can we find out? Let's compute the average income of the men and the average income of the women. But before we do we need to understand that

• the two will not be exactly the same!
• so one will be higher than the other, just by random chance.

In fact even if there is no job discrimination there is a 50-50 chance that the average income of the men is (a little bit) higher than the average income of the women. Of course if there is job discrimination we would expect the average income of the men to be substantially higher than the average income of the women. What we need to find out is whether the men's income is statistically significantly higher!
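The 50-50 claim can be checked with a small simulation. The Python sketch below invents many companies in which gender has no effect on income at all, and counts how often the men's average still comes out higher. Everything in it is an assumption made for illustration: the group sizes (60 men, 40 women), the made-up income distribution, and the function name men_higher_fraction. It does not use the WR Inc data.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is reproducible

def men_higher_fraction(n_men=60, n_women=40, repetitions=2000):
    """Fraction of simulated companies in which the men's mean income is
    higher, when gender has no effect at all on income."""
    men_higher = 0
    for _ in range(repetitions):
        # Everybody's income comes from the same made-up distribution.
        incomes = [random.gauss(33000, 8000) for _ in range(n_men + n_women)]
        men, women = incomes[:n_men], incomes[n_men:]
        if statistics.mean(men) > statistics.mean(women):
            men_higher += 1
    return men_higher / repetitions

fraction = men_higher_fraction()
print(f"Men's average was higher in {fraction:.1%} of the simulated companies")
```

The fraction should come out close to one half, which is why a higher average for the men is not, by itself, evidence of discrimination.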

Something is statistically significant if it cannot be explained by random chance alone.

Example: 4 heads in 4 flips of a fair coin has a probability of 1 in 16 or 6.25%, so this would not be considered unusual.

Example: 10 heads in 10 flips of a fair coin has a probability of 1 in 1024 or about 0.1%, so this would be considered very unusual. In fact one would now conclude that this coin is not a fair coin.

Note: What is and what is not statistically significant is a question of probability.
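The probabilities in the two coin examples come from multiplying 1/2 by itself once for every flip. A short Python sketch of that arithmetic follows; the function name prob_all_heads is chosen here for illustration.

```python
from fractions import Fraction

def prob_all_heads(flips):
    """Probability that a fair coin shows heads on every one of `flips` flips."""
    return Fraction(1, 2) ** flips

for flips in (4, 10):
    p = prob_all_heads(flips)
    print(f"{flips} heads in {flips} flips: 1 in {p.denominator} = {float(p):.2%}")
```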

Back to WRInc:
Stat > Basic Statistics > Descriptive Statistics, Variable=Income, by Variable=Gender.
We find that the average income is
Female: $33151
Male: $33521
so the difference is 33521-33151 = 370.

• Is this a "substantial" or a "little" difference?
• Is this a "statistically significant" difference?
• Does it "prove" discrimination?

"prove" here has just about the same meaning as it does in a criminal trial: beyond any reasonable doubt.

To answer the question we would need to do a hypothesis test.
Actually, one way to answer that question is to use the method mentioned above, called the two-sample t test. You might be surprised to learn that the difference is indeed too large to be due to random chance! Of course it might be explained by other factors, such as more men in higher job levels.

Some basic terms of Statistics

Population: all of the entities (people, events, things etc.) that are the focus of a study
Example 1 Say we are interested in the average age of the undergraduate students at the Colegio.
Example 2 All possible hurricanes, past and future. Clearly this is a much more complicated population than the undergraduate students at the Colegio. In order to properly describe it we will need probability.

Census: If all the entities of a population are included in the study.
Example 1 If we ask the Registrar's Office they might give us the ages of all these students, and we would have done a census.
Example 2 Impossible

Sample: any subset of the population
Example 1: Let's take the students in the room as a sample.
Example 2: all the hurricanes during the last 10 years.

Random sample: a sample found through some randomization (flip of a coin, random numbers on computer etc.)
Example 1: Are you a random sample?
Example 2 yes

Simple Random Sample (SRS): each "entity" in the population has an equal chance of being chosen for the sample.
Example 1 Are you a simple random sample?
Example 2 yes

Stratified Sample: First divide population into subgroups, then do a SRS in each subgroup.
Example 1: Gender (Male-Female), Year (Freshman - Sophomore - Junior - Senior), Departments (English - Math - ..)
Example 2: hurricanes by category 1-5
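The two steps of a stratified sample (divide into subgroups, then take an SRS in each) can be sketched in Python as follows. The population of 1000 students, their randomly assigned years, the sample size of 25 per year, and the function name stratified_sample are all made up for this illustration.

```python
import random

random.seed(2)  # fixed seed so the run is reproducible

# A made-up population of 1000 students, each tagged with a year.
years = ["Freshman", "Sophomore", "Junior", "Senior"]
population = [(student_id, random.choice(years)) for student_id in range(1000)]

def stratified_sample(population, per_stratum):
    """Split the population by year, then draw a simple random sample
    of size `per_stratum` inside each group."""
    strata = {}
    for student_id, year in population:
        strata.setdefault(year, []).append(student_id)
    return {year: random.sample(ids, per_stratum) for year, ids in strata.items()}

sample = stratified_sample(population, per_stratum=25)
for year in years:
    print(year, len(sample[year]), "students, first few ids:", sample[year][:5])
```

Unlike a single SRS of 100 students, this design guarantees that every year is represented by exactly 25 students.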

Data: the collection of many pieces of information
Example 1 a table with the ages of the students in our sample
Example 2 all the data available about a hurricane: track, windspeed, air pressure etc.

Parameter: any numerical quantity associated with a population
Example 1 If we had the ages of all 10000 or so undergraduate students we could calculate the average, and it would be a parameter
Example 2 the average top windspeed of the strongest hurricane in any one year. This is a number that nobody knows or can know, even theoretically.
Example 3 The mean income of the employees of WRInc. Because we have the income data for all the employees this is a parameter.

Statistic: any numerical quantity associated with a sample
Example 1 let's calculate your average age. You are a sample, so this is a statistic.
Example 2 Take the last 10 years as a sample, and calculate the average of the top windspeeds of the strongest hurricane in each year.
Example 3 The mean IQ of the employees of WRInc. Because we have the IQ data for only 500 of the employees this is a statistic.

Note there is one value of the population parameter but there are many different values of the statistic, depending on the sample that was selected.

Statistical Error: the uncertainty in the value of the statistic due to the fact that we only used a sample.
Example The mean income of the employees of WRInc (33373) has no statistical error because it is a parameter.
Let's find a sample of employees. Here is how:
Calc > Random Data > Sample from Columns > Sample 100 rows from column Income, store samples in c7.
Now find the mean: Stat > Basic Statistics > Display Descriptive Statistics, Variable= c7
You see the answer is not 33373.
We can repeat the above, and get a different answer (almost) every time.
How far apart these numbers are is what the statistical error is telling us.
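The same experiment can be imitated in Python. The sketch below does not use the WR Inc worksheet: the population is a made-up stand-in of 23791 simulated incomes, so its mean will not be 33373. What carries over is the idea of drawing 100 rows at random, computing the mean, and repeating.

```python
import random
import statistics

random.seed(3)  # fixed seed so the run is reproducible

# Made-up stand-in for the Income column: 23791 simulated incomes.
population = [random.gauss(33000, 8000) for _ in range(23791)]
parameter = statistics.mean(population)  # one fixed value for the population

# Draw 100 rows at random, compute the mean, and repeat 10 times.
sample_means = [statistics.mean(random.sample(population, 100)) for _ in range(10)]

print(f"Population mean (parameter): {parameter:.0f}")
for m in sample_means:
    print(f"Sample mean (statistic):     {m:.0f}")
print(f"Spread of the sample means:  {statistics.stdev(sample_means):.0f}")
```

The parameter is printed once and never changes, while each sample gives a somewhat different statistic. The spread of those sample means is one way to put a number on the statistical error.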

Bias: any systematic difference between the population and the sample with respect to a variable.
Example 1 Are you (the class) a biased sample?
Example 2 Are the last 10 years a biased sample?
Avoiding bias is the main reason for using random samples.

For more on these basic terms see pages 11-24 of the textbook.

Discrete vs. Continuous Variables

Example Let's consider again WRInc

a) Variable Gender: most obvious thing to do: count how many males and females are in the company

b) Variable Income: find average income of employees.

Why the difference?

In an introductory course like this one we will do a lot of work "by hand" because that's a good way to learn the basics. In real life nobody does Statistics by hand; we use computers for all the calculations. That leaves two tasks for the human being doing Statistics:

Decide

• what is the best method for analysing a specific dataset?
• what is the result of the analysis telling you about the experiment?

Most important here are

• the computer will not do these steps for you
• (Almost always) the computer will do the analysis you ask it to do, even if this analysis is complete nonsense

In order to know what method to use it is important to understand some basic features of your data. One is its data type:

We categorize variables as follows:
Type: Discrete (Categorical, Qualitative)
Description: May be numeric, may not be (words, dates, etc.). If it is numbers, then relatively few different values are repeated many times.
Examples:
1) A student's major
2) Age at which a student graduated from High School
3) Number of times a student took precalculus until they passed

Type: Continuous (Quantitative)
Description: Always numbers. Almost all values are different, with few if any repetitions.
Examples:
1) Yearly income of a family in Puerto Rico
2) Temperature in Mayaguez at 12 Noon
3) Amount paid for the phone bill

Example Your student id number

Note: Sometimes a variable is actually both discrete and continuous:
1) Age of people in Puerto Rico
2) GPAs of students at the Colegio
3) Number of named tropical storms in a year

Note Often whether a variable is discrete or continuous depends on how (and how precisely) it is measured.

Example Our variable is "rain yesterday"

• Did it rain at all yesterday? "Yes" or "No" → discrete
• We put a cup outside. The cup has marks for each cubic inch of rain. Our data is the number of cubic inches. Values will be 0, 1, maybe 2. → discrete
• We put a bucket outside. At the end of the day we measure the amount of water with a scale. Our data is again the number of cubic inches, but values now will be 0.00, 0.513 etc. → continuous
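The effect of measurement precision can be seen by recording the same amounts twice, once coarsely and once finely, and counting how many different values appear. The Python sketch below uses made-up rain amounts for 365 days (an assumption for illustration, not real weather data) and imitates the cup by rounding to whole cubic inches and the bucket by rounding to three decimals.

```python
import random

random.seed(4)  # fixed seed so the run is reproducible

# Made-up daily rain amounts in cubic inches for 365 days (illustration only).
true_amounts = [random.uniform(0, 2.5) for _ in range(365)]

cup = [round(x) for x in true_amounts]        # marks at whole cubic inches
bucket = [round(x, 3) for x in true_amounts]  # measured to three decimals

print("cup:   ", len(set(cup)), "different values among", len(cup))
print("bucket:", len(set(bucket)), "different values among", len(bucket))
```

The cup data has only a handful of different values, each repeated many times, so it behaves like a discrete variable. The bucket data has almost no repetitions, so it behaves like a continuous one.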

Example Let's look at the variables in the survey of WRInc

For more on this see page 32 of the textbook.