Introduction to ESMA 3015

This page discusses some general concepts of Statistics.

Motivation for learning Statistics

Quote: If I had only one hour to live, I would choose to live it in statistics class because it would seem to last forever (A student's complaint)
Why does Statistics appear to be so boring?

Amazingly enough, a person investigating any of these questions might well end up using the same statistical method to answer them! (It is called the two-sample t test and we will study it at the end of this semester)

The power, strength (and beauty) of statistics lies in its universal applicability!

What is Statistics?

Statistics is the Science of data (or information)
----- How to collect data
----- How to analyze data
----- How to present data

Why everybody should know a little bit about Statistics - Misuse of Statistics

Statistics can be used in many ways to make things appear to be something that they are not (lying!)

Another quote: There are Lies, Damn Lies and Statistics (Benjamin Disraeli)

Do you need to know Statistics?

• If you are about to graduate, are planning to be unemployed and live in a cave up in the mountains, the answer is no.

• If you are about to graduate and are planning to go to work, you are likely to see some of this material again. Statistics is used today in (almost) every field of work.

• If you still have some time until you graduate, you will see this stuff in many of your other courses, and if you understand some Statistics it will make those other courses easier.

• If you are planning to go to Graduate school (soon or in a few years) the material discussed in this course is absolutely essential for your success, no matter what area you are going into.

What can we do with Statistics?

Let's consider for a moment the data from WR Inc and have a look at some things that statistics can be used for.

First of all there are 527 observations, each with 7 pieces of information for a total of 527×7 = 3689. Trying to look at so much information is very difficult, so organizing it in some fashion is very useful. For example, just making a little table can help:
          Counts   Percentage
Male      321      61%
Female    206      39%

By the way, simple tasks like counting how many females there are in this company are what computers are very good at. Here we could use the command Stat > Tables > Tally Individual Variables, Gender to do most of the work for us.
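For readers without MINITAB, the same tally can be sketched in Python. The WR Inc data file is not reproduced on this page, so the Gender column below is rebuilt from the counts in the table (321 males, 206 females). The variable names are illustrative only.

```python
from collections import Counter

# Rebuild the Gender column from the counts reported in the table
# (the real WR Inc data file is not included here).
gender = ["Male"] * 321 + ["Female"] * 206

counts = Counter(gender)
total = sum(counts.values())

# Percentage of each category, rounded to a whole percent as in the table
percentages = {g: round(100 * n / total) for g, n in counts.items()}

for g in counts:
    print(f"{g:<8}{counts[g]:>6}{percentages[g]:>5}%")
```

With the real data file one would read the Gender column from the file instead of rebuilding it from the counts.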

Another good way to study a dataset is via graphs. For example, it seems reasonable that there should be a connection between job level and income. After all, people with a better job usually make more money. Is this true for our company? For this we can use a scatterplot:

The above graph is drawn using the command Graph > Scatterplot > Simple, Y Variable=Income, X Variable=Job Level.

Clearly there is a positive relationship, the higher the job level, the higher the income. An important problem in Statistics is to try and find an equation relating the two variables. Here is an answer:

The MINITAB commands for this graph are: Graph > Regression > Fitted Line Plot, Response=Income, Predictor=Job Level

This is the same graph as above, together with a fitted line, that is a straight line that in a certain way fits the data very well. The equation is:

Income = 25259 + 4730 Job Level

Any idea what this equation is telling us about the Incomes at WR Inc?
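One way to explore this question is to plug a few job levels into the equation and look at the predicted incomes. The sketch below does this in Python. The job levels 1 to 5 are hypothetical inputs, because the actual range of job levels at WR Inc is not stated on this page.

```python
def predicted_income(job_level):
    """Income predicted by the fitted line: 25259 + 4730 * Job Level."""
    return 25259 + 4730 * job_level

# Hypothetical job levels, chosen only for illustration
for level in range(1, 6):
    print(level, predicted_income(level))

# Moving up one job level changes the prediction by the same amount each time
step = predicted_income(4) - predicted_income(3)
print("change per job level:", step)
```

Keep in mind that these are predictions from the line, not the incomes of actual employees. The scatterplot shows how much individual incomes vary around it.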

The graph also shows us several features we might or might not have suspected:

• there are many more people with low job levels than high ones.
• the range of incomes gets larger as the job level gets higher

Your most important tool in this course (and in life!) is always your brain! Anything we see in a dataset, say in a graph, needs an explanation. Often this requires some special knowledge of the science behind the data, but sometimes just thinking about it helps. So how might one explain these two features of our dataset?

An important question would be whether there is job discrimination in this company, that is whether men are paid more than women. How can we find out? Let's compute the average income of the men and the average income of the women. But before we do we need to understand that

• the two will not be exactly the same!
• so one will be higher than the other, just by random chance.

In fact even if there is no job discrimination there is a 50-50 chance that the average income of the men is (a little bit) higher than the average income of the women. Of course if there is job discrimination we would expect the average income of the men to be substantially higher than the average income of the women. What we need to find out is whether the men's income is statistically significantly higher!
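The 50-50 claim can be checked with a small simulation. This is a minimal sketch, assuming that incomes for both groups come from the same normal distribution with mean 40000 and standard deviation 10000. Those two numbers are invented for illustration and are not taken from the WR Inc data. Only the group sizes (321 and 206) come from the table of counts.

```python
import random

rng = random.Random(3015)  # fixed seed so the run is repeatable

def average(values):
    return sum(values) / len(values)

def men_higher_fraction(n_men=321, n_women=206, reps=2000):
    """Fraction of simulated companies in which the men's average is higher,
    when both groups are drawn from the SAME income distribution."""
    wins = 0
    for _ in range(reps):
        # Assumed income model (illustrative only): normal(40000, 10000)
        men = [rng.gauss(40000, 10000) for _ in range(n_men)]
        women = [rng.gauss(40000, 10000) for _ in range(n_women)]
        if average(men) > average(women):
            wins += 1
    return wins / reps

fraction = men_higher_fraction()
print(fraction)
```

Because there is no discrimination built into the simulation, the printed fraction should be close to one half. Any difference between the two averages is due to random chance alone.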

Let's see:
Stat > Basic Statistics > Descriptive Statistics, Variable=Income, by Variable=Gender.
From the output, the difference in average income is 40205.99 - 38639.32 = 1566.67.
• Is this a "substantial" or a "little" difference?
• Is this a "statistically significant" difference?
• Does it "prove" discrimination?

This is a problem called "hypothesis testing" in Statistics.
Actually, one way to answer that question is to use the method mentioned above, called the two-sample t test. (The answer is that there is no evidence of job discrimination at WRInc).
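To give a first impression of what such a test computes, here is a minimal Python sketch of a two-sample t statistic. It assumes the unequal-variance (Welch) form, which may or may not be the version used later in the course. The incomes in the two lists are made up for illustration and are NOT the WR Inc data.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """t statistic for the difference in means (Welch, unequal variances)."""
    standard_error = sqrt(variance(sample_a) / len(sample_a)
                          + variance(sample_b) / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error

# Made-up incomes for illustration only; NOT the WR Inc data
men = [41000, 38500, 45200, 36900, 40100, 39800]
women = [39500, 37200, 42800, 36100, 40900, 35400]

t = welch_t(men, women)
print(round(t, 2))
```

The statistic compares the difference in averages with the amount of variation we would expect by random chance. A value close to 0 means the difference is easily explained by chance. Turning the statistic into a formal decision requires the t distribution, which is not covered in this sketch.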

Some basic terms of Statistics

Population: all of the entities (people, events, things etc.) that are the focus of a study
Example 1 Say we are interested in the average age of the undergraduate students at the Colegio.
Example 2 All possible hurricanes, past and future. Clearly this is a much more complicated population than the undergraduate students at the Colegio. In order to properly describe it we will need probability.

Census: a study in which all the entities of a population are included.
Example 1 If we ask the Registrar's Office they might give us the ages of all these students, and we would have done a census.
Example 2 Impossible

Sample: any subset of the population
Example 1: Let's take the students in the room as a sample.
Example 2: all the hurricanes during the last 10 years.

Random sample: a sample found through some randomization (flip of a coin, random numbers on computer etc.)
Example 1: Are you a random sample?
Example 2 yes

Simple Random Sample (SRS): each "entity" in the population has an equal chance of being chosen for the sample.
Example 1 Are you a simple random sample?
Example 2 yes

Stratified Sample: First divide the population into subgroups, then take an SRS in each subgroup.
Example 1: Gender (Male-Female), Year (Freshman - Sophomore - Junior - Senior), Departments (English - Math - ..)
Example 2: hurricanes by category 1-5
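The difference between a simple random sample and a stratified sample can be sketched in Python. The population below is hypothetical: 1000 student ids, each tagged with a year, so that every year has exactly 250 students. The sample sizes (40 in total, 10 per year) are also arbitrary choices for illustration.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical population: 1000 student ids, each tagged with a year
years = ["Freshman", "Sophomore", "Junior", "Senior"]
population = [(student_id, years[student_id % 4]) for student_id in range(1000)]

# Simple random sample: every student has the same chance of being chosen
srs = rng.sample(population, 40)

# Stratified sample: split into subgroups, then take an SRS inside each one
strata = {}
for student in population:
    strata.setdefault(student[1], []).append(student)

stratified = []
for year, members in strata.items():
    stratified.extend(rng.sample(members, 10))

for year in years:
    in_srs = sum(1 for s in srs if s[1] == year)
    in_stratified = sum(1 for s in stratified if s[1] == year)
    print(year, in_srs, in_stratified)
```

In the stratified sample each year is represented by exactly 10 students. In the simple random sample the number per year is left to chance and will usually differ from year to year.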

Data: the collection of many pieces of information
Example 1 a table with the ages of the students in our sample
Example 2 all the data available about a hurricane: track, windspeed, air pressure etc.

Parameter: any numerical quantity associated with a population
Example 1 If we had the ages of all 10000 or so undergraduate students we could calculate the average, and it would be a parameter
Example 2 the average top windspeed of the strongest hurricane in any one year. This is a number that nobody knows or can know, even theoretically.

Statistic: any numerical quantity associated with a sample
Example 1 let's calculate your average age. You are a sample, so this is a statistic.
Example 2 Take the last 10 years as a sample, and calculate the average of the top windspeeds of the strongest hurricane in each year.

Note there is one value of the population parameter but there are many different values of the statistic, depending on the sample that was selected.
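This can be illustrated with a short simulation. The population of 10000 student ages below is hypothetical (ages picked at random between 17 and 25) and is not real enrollment data. The sample size of 30 is also an arbitrary choice.

```python
import random

rng = random.Random(2024)  # fixed seed so the run is repeatable

# Hypothetical population of 10000 student ages (illustrative, not real data)
population = [rng.choice(range(17, 26)) for _ in range(10000)]
parameter = sum(population) / len(population)  # one fixed value

# Each new sample of 30 students gives its own value of the statistic
statistics_seen = []
for _ in range(5):
    sample = rng.sample(population, 30)
    statistics_seen.append(sum(sample) / len(sample))

print("parameter:", round(parameter, 2))
print("statistics:", [round(s, 2) for s in statistics_seen])
```

The parameter is computed once from the whole population. The five sample averages differ from each other and from the parameter, simply because each one is based on a different set of 30 students.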

Statistical Error: the uncertainty in the value of the statistic due to the fact that we only used a sample.
Example 1 Based on you as a sample we found an average of ___ . How far from the true parameter value might this be?
Example 2 Based on the last 10 years as a sample we found an average top windspeed of 137.4mph. How far from the true parameter value might this be?

Bias: any systematic difference between the population and the sample with respect to a variable.
Example 1 Are you (the class) a biased sample?
Example 2 Are the last 10 years a biased sample?
Avoiding bias is the main reason for using random samples.

For more on these basic terms see pages 11-13 of the textbook.

Discrete vs. Continuous Variables

Example Let's consider again WRInc

a) Variable Gender: most obvious thing to do: count how many males and females are in the company

b) Variable Income: find average income of employees.

Why the difference?

In an introductory course like this one we will do a lot of work "by hand" because that's a good way to learn the basics. In real life nobody does Statistics by hand; we use computers for all the calculations. That leaves two tasks for the human being doing Statistics:

Decide

• what is the best method for analysing a specific dataset?
• what is the result of the analysis telling you about the experiment?

Most important here are

• the computer will not do these steps for you
• (Almost always) the computer will do the analysis you ask it to do, even if this analysis is complete nonsense

In order to know what method to use it is important to understand some basic features of your data. One is its data type:

We categorize variables as follows:
Type: Discrete (Categorical, Qualitative)
Description: May be numeric or not (words, dates, etc.). If the values are numbers, then relatively few different values are repeated many times.
Examples:
1) A student's major
2) Age at which a student graduated from High School
3) Number of times a student took precalculus until they passed

Type: Continuous (Quantitative)
Description: Always numbers. Almost all values are different, with few if any repetitions.
Examples:
1) Yearly income of a family in Puerto Rico
2) Temperature in Mayaguez at 12 Noon
3) Amount paid for the phone bill

Example Your student id number

Note: Sometimes a variable is actually both discrete and continuous:
1) Age of people in Puerto Rico
2) GPAs of students at the Colegio
3) Number of named tropical storms in a year

Note Often whether a variable is discrete or continuous depends on how (and how precisely) it is measured.

Example Our variable is "rain yesterday"

• Did it rain at all yesterday? "Yes" or "No" → discrete
• We put a cup outside. The cup has marks for each cubic inch of rain. Our data is the number of cubic inches. Values will be 0, 1, maybe 2. → discrete
• We put a bucket outside. At the end of the day we measure the amount of water with a scale. Our data is again the number of cubic inches, but values now will be 0.00, 0.513 etc. → continuous
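The effect of measurement precision can be sketched in Python. The rainfall readings below are hypothetical values chosen for illustration. The cup is assumed to report only the highest whole-number mark that the water reached.

```python
# Hypothetical daily rainfall in cubic inches, as measured with the bucket
bucket_readings = [0.00, 0.513, 1.27, 0.08, 0.00, 1.94, 0.36, 0.77]

# Assumption: the cup only shows the highest whole-number mark reached
cup_readings = [int(r) for r in bucket_readings]

distinct_cup = sorted(set(cup_readings))
distinct_bucket = sorted(set(bucket_readings))

print("cup:   ", len(distinct_cup), "different values", distinct_cup)
print("bucket:", len(distinct_bucket), "different values")
```

The same rainfall gives only a few different values, repeated many times, when read from the cup, but almost all values are different when measured with the bucket. That matches the descriptions of discrete and continuous variables given earlier.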

Example Let's look at the variables in the survey of WRInc

For more on this see page 15 of the textbook.