ANOVA - Basics

Types of Problems where ANOVA is used:

There are two ways to look at ANOVA problems:

a) Traditional View: The data consists of measurements taken from several groups.
Example: Measurements of the length of babies of mothers, with the mother belonging to one of three groups (Drug Free, 1st Trimester, Throughout)

b) Modern View: We have one continuous response variable and one (or more) discrete factors (predictor variables)
Example: Response=Length of Baby, Factor=Drug Use

MINITAB can handle the data in either format, but to keep things simple in this class we will assume that it is in the modern view. If it is not, use the Data > Stack > Columns command.
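
Outside MINITAB, the same stacking step can be sketched in a few lines of Python. This is only an illustration: the measurements are made-up placeholders (not the course data), and the function name stack_columns is hypothetical.

```python
def stack_columns(groups):
    """Turn {group name: [measurements]} into (response, factor) columns."""
    response, factor = [], []
    for level, values in groups.items():
        for value in values:
            response.append(value)   # one continuous response column
            factor.append(level)     # one discrete factor column
    return response, factor

# Hypothetical baby lengths (cm), one list per group (traditional view)
unstacked = {
    "Drug Free": [51.0, 49.5, 52.1],
    "1st Trimester": [49.0, 50.2],
    "Throughout": [47.8, 48.5, 46.9],
}

# Modern view: one response column and one factor column
length, drug_use = stack_columns(unstacked)
for y, level in zip(length, drug_use):
    print(f"{y:5.1f}  {level}")
```

Each row of the stacked data pairs one measurement with the level of the factor it belongs to, which is the layout assumed in the rest of these notes.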

Notation: In ANOVA we use the word factor instead of the word predictor, but they mean the same.
The values of a factor (and there are only a few because factors are discrete variables) are called the levels.
Example: in Mothers and Cocaine Use "Drug Status" is a factor, and it has the three levels Drug Free, 1st Trimester and Throughout.

If there is more than one factor, each combination of one level from each factor is called a factor-level combination.

Basic Question
a) Traditional View: Is there a difference in the (population) means of the groups?
This means we are testing
H0: m1 = ... = mk vs.
Ha: mi ≠ mj for some i≠j
b) Modern View: Is there a relationship between the factor(s) and the response?
The hypotheses are the same as in the traditional view.
The test is done by the Stat > ANOVA > Oneway command

The basic question that ANOVA tries to answer is whether there is a difference of the population means of the different groups. So why is it called Analysis of Variance?

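The answer is that the test compares the variation between the group means with the variation within the groups: if the group means vary much more than we would expect from the variation within the groups, the population means are probably not all equal. In other words, we can study the differences in group means by studying the variances.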
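
As a rough illustration of how variances enter a test about means, the sketch below computes the usual one-way F statistic, the ratio of the between-group mean square to the within-group mean square. The three samples are invented for the example and the function name one_way_f is hypothetical; MINITAB reports this ratio as F in its ANOVA table.

```python
def one_way_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within, k - 1, n - k

# Hypothetical data: three groups of three observations each
samples = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df1, df2 = one_way_f(samples)
print(f"F = {f_stat:.2f} on {df1} and {df2} degrees of freedom")
```

A large F means the group means are spread out far more than the scatter within the groups would explain, which is evidence against H0.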

Modeling

Finding a model was the main task in regression. In ANOVA the main task is to decide whether there is a difference in group means. Nevertheless, there is something like modeling here as well.

Assumptions of the ANOVA test

1) The residuals have a normal distribution
2) The residuals have equal variance (are homoscedastic)

Check these assumptions just as you did in regression problems.

An example where ANOVA doesn't work: if there are outliers.

Multiple Comparison:

If the null hypothesis above is rejected, we would like to continue and find out exactly how the means are different.
Here are all the possibilities in the case of three groups:

1) a1=a2≠a3
2) a1≠a2=a3
3) a1=a3≠a2
4) a1≠a3≠a2 (all three different)

Each of these possibilities has different consequences!

To see which of them is most likely true, run a multiple comparison method such as Tukey's test.
The printout of Tukey's method compares one level of the factor with another, one by one. It starts with
Drug Free subtracted from:
                  Lower  Center  Upper
First Trimester  -3.880  -1.800  0.280

What matters to us are "Lower" and "Upper", the endpoints of a confidence interval for the difference between the two means. If they have different signs (plus/minus), the interval contains 0 and we cannot find a stat. signif. difference between the two levels; if they have the same sign, we can.
Here "-3.880" and "0.280" have different signs, so "Drug Free" and "First Trimester" are not stat. signif. different.

For the others: Drug Free - Throughout: "-4.818" and "-1.382" have the same sign, so "Drug Free" and "Throughout" are stat. signif. different
First Trimester - Throughout: "-3.408" and "0.808", so "First Trimester" and "Throughout" are not stat. signif. different
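
The sign rule used above can be written as a small check. The sketch below is only an illustration in Python: the interval endpoints are the ones quoted in the text, while the function name is_significant is hypothetical.

```python
def is_significant(lower, upper):
    """A difference is significant when the interval excludes 0."""
    return lower > 0 or upper < 0

# (Lower, Upper) endpoints quoted in the text above
intervals = {
    ("Drug Free", "First Trimester"): (-3.880, 0.280),
    ("Drug Free", "Throughout"): (-4.818, -1.382),
    ("First Trimester", "Throughout"): (-3.408, 0.808),
}

results = {pair: is_significant(lo, hi) for pair, (lo, hi) in intervals.items()}
for (a, b), sig in results.items():
    verdict = "different" if sig else "not different"
    print(f"{a} vs {b}: {verdict}")
```

Only the pair whose interval lies entirely on one side of 0 is flagged as different.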

How to present the results of a multiple comparison:
1) Order the groups by the sample means
2) Underline together those groups which have not been found to be stat. signif. different.
3) If possible, simplify

Example:

Throughout  1stTrimester  Drug Free
________________________
            _______________________

Mothers and Cocaine Use

Here is a complete analysis of the Mothers and Cocaine Use data.

Ordered and unordered Factors


ANOVA in the form described here does not use any information about an ordering of the factor levels. For example, consider the following two situations:

Data Set 1: Data on the prices of T-shirts and where they were bought. Is there a difference in the prices depending on the store location? ANOVA: p-value=0.015.

Data Set 2: Data on the prices of T-shirts and their sizes. Is there a difference in the prices depending on the size? ANOVA: p-value=0.015.

But in data set 2 there is a natural ordering of the factor "size", and the data is not consistent with this ordering, so here we should be much more cautious with any conclusions than in the case of data set 1.

What to do if the assumptions fail

1) Try a transformation of the response variable
a) If the violation of the assumptions is "small", try the square root (SQRT)
b) If the violation of the assumptions is "large", try the logarithm (LOGT)

2) If the transformations don't work, use a nonparametric method (In the case of a oneway ANOVA use Kruskal-Wallis)
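
To give an idea of what the Kruskal-Wallis test does, here is a minimal Python sketch of its test statistic H, which replaces the observations by their ranks in the pooled sample. The data are invented, the function name kruskal_wallis_h is hypothetical, and the sketch leaves out the correction for ties that statistical packages commonly apply.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction)."""
    pooled = sorted(x for g in groups for x in g)
    n = len(pooled)
    # Average rank for each distinct value (ranks start at 1)
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    # Sum over groups of (rank sum)^2 / group size
    total = sum(sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)

# Hypothetical data: three groups with no overlap at all
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h = kruskal_wallis_h(groups)
print(f"H = {h:.2f}")
```

Under H0, H is compared with a chi-square distribution with k-1 degrees of freedom (5% critical value 5.991 for k=3 groups); for samples as small as in this toy example that approximation is rough.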

Also, in either case you should use the median instead of the mean and IQR/1.35 instead of the standard deviation in your summary table.
Example: Capacity of Wells:
Rock Type      N   Median   IQR/1.35
Dolomite      50     1.72       6.92
Limestone     50     0.45       1.45
Siliclastic   50     0.46       0.96
Metamorphic   50     0.30       0.79
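
Why divide the IQR by 1.35? For a normal distribution the IQR is about 1.35 standard deviations, so IQR/1.35 is on the same scale as the standard deviation. The Python sketch below checks this and computes the summary for one group; the sample values are invented and the function name robust_summary is hypothetical.

```python
from statistics import NormalDist, median, quantiles

# For a normal distribution the IQR is about 1.35 standard deviations
z = NormalDist()
iqr_in_sd_units = z.inv_cdf(0.75) - z.inv_cdf(0.25)
print(f"IQR of a standard normal: {iqr_in_sd_units:.3f}")

def robust_summary(values):
    """Return (n, median, IQR/1.35) for one group."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles, default 'exclusive' method
    return len(values), median(values), (q3 - q1) / 1.35

# Hypothetical skewed sample. Quartile conventions differ between
# packages, so MINITAB may report slightly different values.
sample = [0.1, 0.2, 0.3, 0.4, 0.5, 0.9, 7.5]
n, med, spread = robust_summary(sample)
print(f"N={n}  Median={med:.2f}  IQR/1.35={spread:.2f}")
```

Unlike the mean and standard deviation, these two numbers are barely affected by the single large value 7.5.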

How to do a Multiple Comparison in a Two-way ANOVA with Interaction
For the Film Coatings dataset we found a significant interaction between Temperature and Pressure. In the Interaction plot we see that the combination Temp=High and Pressure=Mid results in the thinnest film. But is this combination stat. signif. better than all the others?
To be able to run a multiple comparison we need to turn this into a one-way ANOVA problem. We can do this by combining the two factors Temperature and Pressure into one new factor, call it TP, with levels Low Low, Low Mid etc. We can do this with the Concatenate command in the Data menu. Now we run the one-way ANOVA with Tukey's multiple comparison. The result is:
HM ML HL MM MH HH LL LM LH
____________________          
    _________________________      
          _______________  
            _______________

We are only interested in the thinnest film, so we conclude that High-Mid is not stat. signif. better than Mid-Low, High-Low or Mid-Mid, but is stat. signif. better than the others.
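
The step of combining Temperature and Pressure into the single factor TP, used above, might look like this outside MINITAB. This is a sketch in Python with made-up factor columns, not the Film Coatings data.

```python
# Hypothetical factor columns, one entry per observation
temperature = ["Low", "Low", "Mid", "High", "High"]
pressure = ["Low", "Mid", "Mid", "Mid", "High"]

# Combine the two factors into one factor TP, as described above
tp = [f"{t} {p}" for t, p in zip(temperature, pressure)]
print(tp)

# With 3 levels each, TP can take up to 3 * 3 = 9 distinct levels
levels = ["Low", "Mid", "High"]
all_tp_levels = [f"{t} {p}" for t in levels for p in levels]
print(len(all_tp_levels))
```

Each level of TP is one factor-level combination, so a one-way ANOVA on TP compares all nine combinations with each other.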

In this problem we are interested in the stat. signif. best combination of temperature and pressure. If we want to do a multiple comparison of just one of the factors, we can use the General Linear Model command:
Stat > ANOVA > General Linear Model, Responses: Thickness, Model: Temperature Pressure Temperature*Pressure, Comparisons > Terms: Temperature
Note that you need to include the interaction term Temperature*Pressure in the model; otherwise a model without interaction is fit, which we know is wrong.

The MINITAB output for the pairwise comparison is different from that of the two-way command: it gives the p-values for the pairwise tests of no difference:

mLow=mMid, p-value=0.00, they are different
mLow=mHigh, p-value=0.00, they are different
mMid=mHigh, p-value=0.88, they are not different

So we find:
Temperature
Low Mid High
    ________

When to use Twoway or General Linear Model

Use General Linear Model if
a) you want to do a multiple comparison
b) you have an unbalanced design
c) you have a more complicated design

otherwise use Twoway