Data Summaries 2

Contents of this page:
Measures of Central Tendency
Mean vs. Median
Measures of Variability
Empirical Rule

Measures of Central Tendency

Common question: what was the average grade on the last exam? - We want just one number to describe all the numbers in the dataset.

How do we calculate the "average"?

Usual answer: mean

Example Three of your friends are 19, 20 and 23 years old. What is their average age?

Answer: (19+20+23)/3 = 62/3 = 20.7

Example
Take Babe Ruth's homeruns again. What was his homerun average while with the Yankees? The data is
54 59 35 41 46 25 47 60 54 46 49 46 41 34 22, so
mean = ∑X/n = 659/15 = 43.9, where n = 15 is the number of observations.

Note This formula uses the summation notation ∑X. It simply means: add up the numbers in X.

Note The standard symbol in Statistics for the sample mean is x̄ (read "x bar").
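Note (a sketch in Python, which is not the software used in this class): the "add up, then divide by how many" recipe can be checked on both examples above. The function name mean and the variable names are chosen here only for illustration.

```python
# Babe Ruth's homeruns per season with the Yankees (data from the text).
homeruns = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]

# Ages of the three friends (data from the text).
friends = [19, 20, 23]


def mean(values):
    """Sample mean: add up the numbers, then divide by how many there are."""
    return sum(values) / len(values)


print(round(mean(friends), 1))   # 20.7
print(sum(homeruns))             # 659
print(round(mean(homeruns), 1))  # 43.9
```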

Advice The most important thing you can do in this class (and, more importantly, in life!) after doing a calculation is to ask yourself

Does my answer make sense?

If you find that the average age of your three friends is 507.9, you have to know that this answer is wrong. You might not know why, and in an exam you might not have the time to find out, but you should say that you believe your answer is wrong and why you think so. You will get points for this, whereas an obviously wrong answer will usually get 0!
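Note (a sketch in Python): one simple "does it make sense?" check for a mean is that it can never be smaller than the smallest observation or larger than the largest one. The function name mean_is_plausible is made up for this illustration. Passing the check does not prove an answer is right; failing it proves the answer is wrong.

```python
def mean_is_plausible(candidate, values):
    """A mean always lies between the smallest and the largest observation."""
    return min(values) <= candidate <= max(values)


# Ages of the three friends (data from the text).
friends = [19, 20, 23]

print(mean_is_plausible(20.7, friends))   # True
print(mean_is_plausible(507.9, friends))  # False
```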

Example Which of the following are obviously not correct for the mean of Babe Ruth's homeruns, and why?

a) 43.2
b) 17.9
c) -45.6
d) 49.5
e) 59.0
f) 35.4

There are other methods for computing an "average", though. For example:

Median: the observation "in the middle" of the ordered data set:
22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60
There are 15 observations, so the middle one is the 8th: Median = 46

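What if the Babe had left the Yankees a year earlier? Then there are only 14 observations and no single one is in the middle, so we take the mean of the two middle ones (the 7th and 8th):
25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60
Median = (46+46)/2 = 46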

Warning Forgetting to order the dataset is a popular mistake. It is one that should be easy to catch because the answer almost never makes any sense. For example, the observation in the middle of the unordered dataset of Babe Ruth's homeruns is 60, but that can't be the "average"!
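Note (a sketch in Python): the median recipe, including the ordering step and the case of an even number of observations, might look as follows. The function name median is chosen for illustration; the last line shows what goes wrong if the data is not ordered first.

```python
# Babe Ruth's homeruns per season with the Yankees, in the order given in the text.
homeruns = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]


def median(values):
    """Middle observation of the ORDERED data (mean of the two middle ones if n is even)."""
    ordered = sorted(values)  # the step that is easy to forget
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median(homeruns))       # 46   (15 observations)
print(median(homeruns[:-1]))  # 46.0 (14 observations, last season dropped)

# The mistake from the warning: taking the middle of the unordered data.
print(homeruns[len(homeruns) // 2])  # 60
```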

MINITAB finds both the mean and the median (along with some other summaries) with the Stat > Basic Statistics > Display Descriptive Statistics command. For example, for the Income of the employees of WRInc we find a mean income of $33373 and a median income of $32400.
Here there is a difference of almost $1000 between the mean and the median. So which one is the right "average"?

Mean vs. Median

Consider the prices of houses recently sold (in $1000):

56, 59, 65, 66, 66, 70, 87, 89, 95, 95, 95, 99, 101, 105, 105, 110, 950

Here we find Mean=136.1, but actually only 1 house sold for more than 136.1!
On the other hand Median=95, and by its definition about half the houses sold for at least that much.
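Note (a sketch in Python, using the standard statistics module): the numbers for the house prices can be reproduced as follows, including the count of houses that sold for more than the mean. The variable names are chosen for illustration.

```python
import statistics

# Prices of houses recently sold, in $1000 (data from the text).
prices = [56, 59, 65, 66, 66, 70, 87, 89, 95, 95, 95, 99, 101, 105, 105, 110, 950]

mean_price = statistics.mean(prices)
median_price = statistics.median(prices)

# How many houses sold for more than the mean / for at least the median?
above_mean = sum(1 for p in prices if p > mean_price)
at_least_median = sum(1 for p in prices if p >= median_price)

print(round(mean_price, 1))  # 136.1
print(median_price)          # 95
print(above_mean)            # 1
print(at_least_median)       # 9
```

The single very expensive house (950) pulls the mean far above almost all the other prices, while the median barely notices it.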

Whether the mean or the median is a better measure of "average" depends on the question asked.

Example 1: What is the price of a "regular" house in this neighborhood? Median = $95000
Example 2: Say property taxes are 2% of the sales price per year. Based on the "average" sales price, how much property tax is paid per year in this neighborhood if there are a total of 250 houses? Here we need the mean, because a total is the number of houses times the mean:
250*0.02*136100 = $680500
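Note (a sketch in Python): the property tax calculation, with the same calculation using the median shown only for contrast. It assumes, as the example does, that the houses sold recently are typical of all 250 houses. The variable names are chosen for illustration.

```python
houses = 250
tax_rate = 0.02          # 2% of the sales price per year
mean_price = 136_100     # dollars, rounded mean from the text
median_price = 95_000    # dollars, median from the text

# A total equals (number of items) x (mean), so the mean is the right "average" here.
tax_using_mean = houses * tax_rate * mean_price
tax_using_median = houses * tax_rate * median_price

print(round(tax_using_mean))    # 680500
print(round(tax_using_median))  # 475000
```

Under that assumption, the figure based on the median would understate the total by roughly $200000.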

Example The government has just released the data for a study of Puerto Rican households. One of the variables was household income

• you read in El Nuevo Dia that the mean income in PR is $23100
• you read in The San Juan Star that the median income in PR is $20400

Which of these numbers is better?

Without any explanation of what the number will be used for, this question has no answer. Both the mean and the median are perfectly good ways to calculate an "average".

Misuse of Statistics: Mean vs. Median
Say the owner of a McDonald's wants to compute the "average" hourly wage for the people working there. Do you think she will use the mean or the median? What if it is the Union that wants to find the "average"?

For more on mean and median see section 3.3 of the textbook.

We have previously talked about ethics. The choice between mean and median can also be an ethical question.

Measures of Variability

A statistician is standing with one foot in an icebucket and the other foot in a burning fire. He says: on average I feel fine.

A "measure of central tendency" is a good start for describing a set of numbers, but it does not tell the whole story. Consider the two examples in the next graph:

Here we have two datasets. Both have a mean of 5, but they are clearly very different, with different "spreads". We would like to have some way to measure this "spread-out-ness".

Range: the first such measure is the range of the observations, defined as largest observation - smallest observation.

Example For Babe Ruth Homeruns we find range = 60-22 = 38.

Note Some textbooks and/or computer programs define the range as the pair of numbers (smallest, largest).
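Note (a sketch in Python): both versions of the range, the single number used in this class and the (smallest, largest) pair used elsewhere, for Babe Ruth's homeruns.

```python
# Babe Ruth's homeruns per season with the Yankees (data from the text).
homeruns = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]

smallest = min(homeruns)
largest = max(homeruns)

# Range as defined in this class: largest - smallest.
data_range = largest - smallest
print(data_range)           # 38

# Range as some textbooks and programs report it: the pair (smallest, largest).
print((smallest, largest))  # (22, 60)
```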

Standard Deviation
Another way to measure "spread-out-ness" is via the standard deviation. To define it we need some notation. Call Sxx the sum of squares of x. It is given by
Sxx = ∑x² - (∑x)²/n
where n is the number of observations.
Remember the summation notation:
• (∑x)² means find the sum of the numbers in the x variable, then square the sum
• ∑x² means find the square of each number in the x variable, then add them up.

Note these are not the same: (2+3)² = 5² = 25, but 2² + 3² = 4+9 = 13!
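Note (a sketch in Python): the difference between "square the sum" and "sum the squares" for the two numbers 2 and 3 used above. The variable names are chosen for illustration.

```python
x = [2, 3]

# Square of the sum: add first, then square.
square_of_sum = sum(x) ** 2

# Sum of the squares: square each number first, then add.
sum_of_squares = sum(value ** 2 for value in x)

print(square_of_sum)   # 25
print(sum_of_squares)  # 13
```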

With this the standard deviation is defined by
s = √(Sxx/(n-1))

Example Find the standard deviation of Babe Ruth Homeruns.

Warning Forgetting the √ is a very popular mistake. Generally, if you do, the answer comes out much too big.
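Note (a sketch in Python): the standard deviation of Babe Ruth's homeruns, step by step. It assumes the usual definitions Sxx = ∑x² - (∑x)²/n and s = √(Sxx/(n-1)), which reproduce the value 11.25 quoted in these notes. The result is compared with the stdev function from Python's statistics module.

```python
import math
import statistics

# Babe Ruth's homeruns per season with the Yankees (data from the text).
homeruns = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]

n = len(homeruns)                         # 15
sum_x = sum(homeruns)                     # sum of x
sum_x2 = sum(v ** 2 for v in homeruns)    # sum of x squared

sxx = sum_x2 - sum_x ** 2 / n             # sum of squares Sxx
without_root = sxx / (n - 1)              # what you get if you forget the square root
s = math.sqrt(without_root)               # standard deviation

print(sum_x, sum_x2)                          # 659 30723
print(round(without_root, 1))                 # 126.5
print(round(s, 2))                            # 11.25
print(round(statistics.stdev(homeruns), 2))   # 11.25
```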

In an exam you will not have time to write out all the details. Instead, learn how to use your calculator. Here is what a correct solution looks like:

Example Find the standard deviation of the following dataset

Now we have two ways to measure the "spread-out-ness": the range and the standard deviation. Unfortunately the two do not give the same number. For example, we have found range=38 and sd=11.25 for Babe Ruth's Homeruns. As a rule of thumb, the standard deviation is often close to range/4.

Example Babe Ruth's Homeruns: range/4 = 38/4 = 9.5, s = 11.25.
You can use this as a quick check on your answer: if you forget the √ your answer for the standard deviation would be 126.5, very different from 9.5.
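Note (a sketch in Python, using the standard statistics module): the quick check for Babe Ruth's homeruns. The correct standard deviation is of the same size as range/4, while the value obtained by forgetting the √ (which is the variance) is more than ten times larger.

```python
import statistics

# Babe Ruth's homeruns per season with the Yankees (data from the text).
homeruns = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]

rough_guess = (max(homeruns) - min(homeruns)) / 4   # range/4
s = statistics.stdev(homeruns)                      # standard deviation
forgot_root = statistics.variance(homeruns)         # same calculation without the square root

print(rough_guess)            # 9.5
print(round(s, 2))            # 11.25
print(round(forgot_root, 1))  # 126.5
```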

In MINITAB we use again Stat > Basic Statistics > Display Descriptive Statistics to find the standard deviation.

For more on range and standard deviation see section 3.4 of the textbook.

Empirical Rule

Above we learned the following:

• If we have the dataset, how do we calculate the mean and the standard deviation?

Now we will look at the following question:

• If we know only the mean and the standard deviation, what do they tell us about the dataset, or more precisely, what do they tell us about an individual observation in the dataset?

Example You read in the newspaper about a study on the age when a criminal committed his first crime. They found that the mean age was 18.3 with a standard deviation of 2.6 years. What is this telling you?

The information "mean age was 18.3", or with our notation x̄=18.3, is pretty easy to understand - somewhere around age 18 people start to commit crimes. But what about "with a standard deviation of 2.6 years"? For this we can use the empirical rule:

if a data set has a bell-shaped histogram, then about 95% of the observations fall into the interval
(x̄-2s, x̄+2s)

Example Back to the example. We have x̄=18.3 and s=2.6, so
x̄-2s = 18.3-2×2.6 = 13.1
x̄+2s = 18.3+2×2.6 = 23.5
so about 95% of the criminals are between 13.1 and 23.5 years old when they first commit a crime.
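Note (a sketch in Python): the interval from the empirical rule, mean minus and plus two standard deviations, for the numbers in this example. The function name empirical_rule_interval is made up for this illustration.

```python
def empirical_rule_interval(mean, sd):
    """Interval (mean - 2*sd, mean + 2*sd) that holds about 95% of bell-shaped data."""
    return (mean - 2 * sd, mean + 2 * sd)


# Mean and standard deviation of the age at first crime (from the text).
low, high = empirical_rule_interval(18.3, 2.6)

print(round(low, 1), round(high, 1))  # 13.1 23.5
```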

Knowing the mean and the standard deviation and using the empirical rule makes it possible to make some guesses about what an observation might be like.

Above we said that s is often close to range/4. The empirical rule explains why: (x̄-2s, x̄+2s) contains about 95% of the data, so x̄-2s should be close to the smallest observation and x̄+2s should be close to the largest observation. So
range = largest-smallest is close to (x̄+2s) - (x̄-2s) = 4s
or
s is close to range/4
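Note (a sketch in Python, using the standard statistics module): the same reasoning tried out on Babe Ruth's homeruns. Whether this small dataset has a bell-shaped histogram is not checked here, so this only illustrates the arithmetic.

```python
import statistics

# Babe Ruth's homeruns per season with the Yankees (data from the text).
homeruns = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]

x_bar = statistics.mean(homeruns)
s = statistics.stdev(homeruns)

low, high = x_bar - 2 * s, x_bar + 2 * s
inside = [v for v in homeruns if low < v < high]

print(round(low, 1), round(high, 1))                    # 21.4 66.4
print(min(homeruns), max(homeruns))                     # 22 60
print(len(inside), "of", len(homeruns))                 # 15 of 15
print(round(4 * s, 1), max(homeruns) - min(homeruns))   # 45.0 38
```

Here the lower end of the interval is very close to the smallest observation, the upper end is somewhat above the largest one, and 4s is of the same size as the range.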

Example Again back to our example. For the empirical rule to work, the data should have a bell-shaped histogram. Do you think this is true for this example?

For more on the empirical rule see section 3.4.3 of the textbook.