Outliers - Detection

Many of the methods discussed in this class don't work well if the dataset has outliers. An outlier is any observation that is in some way unusual compared with the rest of the data. For example, consider the Alcohol vs. Tobacco data: a scatterplot shows one unusual observation (Northern Ireland). In general, there are several ways in which an observation might be an outlier:

Case 1: One continuous variable


To check for outliers in one variable, draw a boxplot. In MINITAB, outliers are indicated by stars that lie some distance away from the box.
Consider the following example:
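The numbers below are made up for illustration. A common convention, which MINITAB's boxplot is assumed here to follow, flags any value that lies more than 1.5 times the interquartile range (IQR) beyond the edges of the box. A minimal Python sketch of that rule (the function name boxplot_outliers is hypothetical):

```python
import statistics

def boxplot_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the box (Q1 to Q3)."""
    # method="exclusive" interpolates the quartiles at position p * (n + 1)
    q1, _, q3 = statistics.quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Made-up sample, not course data
sample = [2, 3, 3, 4, 4, 5, 5, 6, 6, 15]
print(boxplot_outliers(sample))  # [15]
```

Here Q1 = 3 and Q3 = 6, so the fences sit at -1.5 and 10.5, and only the value 15 falls outside them.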

Case 2: Two continuous variables


Treatment of Outliers

If we have an outlier in a dataset, what do we do? First and foremost, don't ignore it! Most statistical methods are very sensitive to outliers, and often they simply don't work when outliers are present.

Example: Is there a relationship between Alcohol and Tobacco expenditures in England? Because we have two continuous variables, we might use Pearson's correlation coefficient to answer this question:

1) α = 0.05
2) H0: r = 0 (no relationship)
3) H1: r ≠ 0 (some relationship)
4) p = 0.509
5) p > α, so we fail to reject H0; there is not enough evidence to conclude that there is a relationship.

BUT

without Northern Ireland:
1) α = 0.05
2) H0: r = 0 (no relationship)
3) H1: r ≠ 0 (some relationship)
4) p = 0.007
5) p < α, so we reject H0; there is a statistically significant relationship.

Which one is right? The first result is misleading: a single outlier (Northern Ireland) hides a relationship that holds for all the other observations.
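To see where the two p-values come from, here is a rough Python sketch of the test. The expenditure figures are the ones usually quoted for this dataset; they are not listed in these notes, so treat them as an assumption and check them against the course worksheet. The helper names are hypothetical. The test statistic is t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom, and the p-value is found here by numerically integrating the t density.

```python
import math

# Assumed values (not given in these notes); Northern Ireland is the last entry.
alcohol = [6.47, 6.13, 6.19, 4.89, 5.63, 4.52, 5.89, 4.79, 5.27, 6.08, 4.02]
tobacco = [4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51, 4.56]

def pearson_r(x, y):
    """Pearson's correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_density(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def correlation_test(x, y, steps=2000):
    """Return (r, two-sided p-value) for H0: no linear relationship."""
    df = len(x) - 2
    r = pearson_r(x, y)
    t = abs(r) * math.sqrt(df / (1 - r * r))
    # Simpson's rule (steps must be even) for the area under the density on [0, |t|]
    h = t / steps
    area = t_density(0, df) + t_density(t, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_density(i * h, df)
    area *= h / 3
    return r, 1 - 2 * area

r_all, p_all = correlation_test(alcohol, tobacco)
r_cut, p_cut = correlation_test(alcohol[:-1], tobacco[:-1])
print(f"all observations:         r = {r_all:.3f}, p = {p_all:.3f}")
print(f"without Northern Ireland: r = {r_cut:.3f}, p = {p_cut:.3f}")
```

If the assumed figures match the course data, the two p-values should come out close to the 0.509 and 0.007 quoted above.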

Here is another illustration of the problem: consider the MINITAB MACRO corr. It calculates the correlation coefficient of Alcohol vs. Tobacco:
CTRL-l, %K:\3102\corr 'Alcohol' 'Tobacco'

Now change the numbers for Northern Ireland to something else and see what happens.
In fact, you can make the correlation coefficient be any number (between -1 and 1, of course) just by moving the point for Northern Ireland. For example, try (4,10) and (20,15).

The correlation coefficient is supposed to summarize the relationship across all the observations, but if there are outliers, they can completely dominate the coefficient.
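For those without access to the macro, the same experiment can be sketched in Python. Two assumptions: the expenditure figures are the ones usually quoted for this dataset (they are not listed in these notes), and each suggested point is read as (Alcohol, Tobacco). The function names are hypothetical.

```python
import math

# Assumed values (not given in these notes); Northern Ireland is the last entry.
alcohol = [6.47, 6.13, 6.19, 4.89, 5.63, 4.52, 5.89, 4.79, 5.27, 6.08, 4.02]
tobacco = [4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51, 4.56]

def pearson_r(x, y):
    """Pearson's correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def r_with_last_point_moved(new_alcohol, new_tobacco):
    """Recompute r after replacing the Northern Ireland point."""
    x = alcohol[:-1] + [new_alcohol]
    y = tobacco[:-1] + [new_tobacco]
    return pearson_r(x, y)

print("original point:", round(pearson_r(alcohol, tobacco), 3))
for point in [(4, 10), (20, 15)]:
    print("moved to", point, ":", round(r_with_last_point_moved(*point), 3))
```

Under these assumptions, the first point pulls the coefficient below zero and the second pushes it close to 1, even though the other ten observations never change.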

So, if there are outliers, what do we do?

1) Find a method that is not sensitive to outliers. For example, alternatives to Pearson's correlation coefficient include Spearman's rank correlation coefficient and Kendall's rank correlation coefficient (Kendall's tau), although neither of them works any better here.

2) Try to "adjust" the outliers. We know what "caused" the Alcohol number for Northern Ireland to be off, so maybe we can adjust it.

3) Eliminate the outlier(s).