There are two ways to look at ANOVA problems:
a) Traditional View: The data consists of measurements taken from several groups.
Example: Measurements of the length of newborn babies, with each mother belonging to one of three groups (Drug Free, 1st Trimester, Throughout)
b) Modern View: We have one continuous response variable and one (or more) discrete factors (predictor variables)
Example: Response=Length of Baby, Factor=Drug Use
MINITAB can handle the data in either format, but to keep things simple in this class we will assume that it is in the modern view. If it is not, use the Data > Stack > Columns command.
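For readers who want to see what stacking does outside MINITAB, here is a minimal Python sketch. The language choice, the function name stack_columns and the numbers are assumptions made for illustration only; they are not part of the course data.

```python
# Hypothetical unstacked ("traditional view") data: one column per group.
# The numbers are made up for illustration only.
unstacked = {
    "Drug Free": [51.2, 50.1, 52.3],
    "1st Trimester": [49.8, 48.7],
    "Throughout": [48.0, 47.5, 49.1],
}


def stack_columns(columns):
    """Return (response, factor) lists with one entry per measurement."""
    response, factor = [], []
    for level, values in columns.items():
        for value in values:
            response.append(value)  # the continuous response
            factor.append(level)    # the group label becomes the factor level
    return response, factor


length, drug_use = stack_columns(unstacked)
for y, g in zip(length, drug_use):
    print(f"{g:15s} {y}")
```

After stacking there is one response column (Length) and one factor column (Drug Use), which is the modern view.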
Notation: In ANOVA we use the word factor instead of the word predictor, but they mean the same.
The values of a factor (and there are only a few because factors are discrete variables) are called the levels.
Example: in Mothers and Cocaine Use "Drug Status" is a factor, and it has the three levels Drug Free, 1st Trimester and Throughout.
If there is more than one factor, each combination of one level from each factor is called a factor-level combination.
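As a small illustration, the factor-level combinations of two factors can be listed in Python. The sketch assumes two factors with the levels Low, Mid and High each, as in the Film Coatings example later in these notes; the variable names are hypothetical.

```python
from itertools import product

# Levels of two factors (taken from the Film Coatings example in these notes).
temperature_levels = ["Low", "Mid", "High"]
pressure_levels = ["Low", "Mid", "High"]

# Every pairing of one temperature level with one pressure level.
combinations = list(product(temperature_levels, pressure_levels))

print(len(combinations))  # number of factor-level combinations
for temperature, pressure in combinations:
    print(temperature, pressure)
```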
Basic Question
a) Traditional View: Is there a difference in the (population) means of the groups?
This means we are testing
H0: μ1 = ... = μk vs.
Ha: μi ≠ μj for some i≠j
b) Modern View: Is there a relationship between the factor(s) and the response?
The hypotheses are the same as in the traditional view.
The test is done by the Stat > ANOVA > Oneway command
The basic question that ANOVA tries to answer is whether there is a difference of the population means of the different groups. So why is it called Analysis of Variance?

In other words we can study the differences in group means by studying the variances.
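To make the link between means and variances concrete, here is a minimal Python sketch of the usual one-way F statistic, which divides the variation between the group means by the variation within the groups. The data are made up and the function name f_statistic is hypothetical; MINITAB reports the same kind of ratio in its ANOVA table.

```python
from statistics import mean

# Made-up measurements for three groups (illustration only).
groups = {
    "Drug Free": [51.0, 52.0, 50.0, 53.0],
    "1st Trimester": [49.0, 50.0, 48.0, 51.0],
    "Throughout": [47.0, 48.0, 46.0, 49.0],
}


def f_statistic(groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    all_values = [y for ys in groups.values() for y in ys]
    grand_mean = mean(all_values)
    k = len(groups)       # number of groups
    n = len(all_values)   # total number of observations

    # Variation of the group means around the grand mean.
    ss_between = sum(len(ys) * (mean(ys) - grand_mean) ** 2
                     for ys in groups.values())
    # Variation of the observations around their own group mean.
    ss_within = sum((y - mean(ys)) ** 2
                    for ys in groups.values() for y in ys)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


f_value = f_statistic(groups)
print(f_value)
```

A large F value means the group means vary much more than the spread inside the groups would explain, which is evidence against H0.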
It is important to note that many of the methods we discussed in the class answer the same question but are used for different types of data:
Example A: Say we have the data from an experiment studying the connection between smoking and lung cancer. In this study "smoking" was measured as "I smoke" or "I do not smoke" and "lung cancer" was measured as "I have lung cancer" or "I do not have lung cancer", that is, both variables are discrete. Now we would test
H0: Classifications are independent vs. Ha: Classifications are dependent
using the chi-square test of independence
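A minimal Python sketch of this test for a 2x2 table follows. The counts are made up, the function name is hypothetical, and no continuity correction is applied, so the numbers may differ slightly from software that uses one.

```python
from math import erfc, sqrt

# Hypothetical 2x2 table of counts (made up for illustration).
# Rows: smoker / non-smoker; columns: lung cancer / no lung cancer.
table = [[30, 70],
         [10, 90]]


def chi_square_statistic(table):
    """Pearson chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two classifications were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


stat = chi_square_statistic(table)
# A 2x2 table has 1 degree of freedom, and the upper tail probability of a
# chi-square(1) variable at x equals erfc(sqrt(x / 2)).
p_value = erfc(sqrt(stat / 2))
print(stat, p_value)
```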
Example B: Same as above, but now we measure "lung cancer" as the percentage of the people who have lung cancer, so we have a discrete factor (smoking) and a continuous response (lung cancer). Now we would test
H0: α1=α2 vs Ha: α1≠α2
using ANOVA
Example C: Same as above, but now we measure "smoking" as the number of cigarettes per day, so both variables are continuous. Now we would test
H0:ρ=0 vs Ha: ρ≠0
using Pearson's correlation coefficient
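As an illustration, Pearson's correlation coefficient and the usual t statistic for testing ρ=0 can be computed in a few lines of Python. The paired data below are made up and the function name pearson_r is hypothetical.

```python
from math import sqrt

# Made-up paired data (illustration only).
cigarettes_per_day = [0, 5, 10, 15, 20, 25]
lung_cancer_pct = [1.0, 1.8, 2.9, 4.1, 4.8, 6.2]


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r = pearson_r(cigarettes_per_day, lung_cancer_pct)
n = len(cigarettes_per_day)
# Test statistic for H0: rho = 0; compare with a t distribution on n - 2 df.
t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
print(r, t)
```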
• Even though the null hypotheses are written differently, they all mean the same: no relationship between the variables.
• If you use the wrong method you are almost certain to get a wrong answer!
Check these assumptions just as you did in regression problems.
An example where ANOVA doesn't work:
If there are outliers:
1) α1=α2≠α3
2) α1≠α2=α3
3) α1=α3≠α2
4) α1≠α3≠α2
Each of these possibilities has different consequences!
To see which of them is most likely true, run a multiple comparison method such as Tukey's test.
The printout of Tukey's method compares one level of the factor with another, one by one. It starts with
Drug Free subtracted from:
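| | Lower | Center | Upper |
|---|---|---|---|
| First Trimester | -3.880 | -1.800 | 0.280 |
The limits "-3.880" and "0.280" have different signs, so the interval contains 0 and "Drug Free" and "First Trimester" are not stat. signif. different.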
For the others:
Drug Free - Throughout: "-4.818" and "-1.382", both negative, so the interval does not contain 0 and "Drug Free" and "Throughout" are stat. signif. different
First Trimester - Throughout: "-3.408" and "0.808", so "First Trimester" and "Throughout" are not stat. signif. different
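The rule for reading these intervals can be written as a short Python sketch. The limits are the ones quoted from the printout; the function name significantly_different is hypothetical.

```python
# Confidence limits (lower, upper) as quoted from the Tukey printout.
intervals = {
    ("Drug Free", "First Trimester"): (-3.880, 0.280),
    ("Drug Free", "Throughout"): (-4.818, -1.382),
    ("First Trimester", "Throughout"): (-3.408, 0.808),
}


def significantly_different(lower, upper):
    """True when the interval excludes 0, i.e. both limits have the same sign."""
    return lower > 0 or upper < 0


results = {}
for (a, b), (lower, upper) in intervals.items():
    results[(a, b)] = significantly_different(lower, upper)
    verdict = "different" if results[(a, b)] else "not different"
    print(f"{a} vs {b}: ({lower}, {upper}) -> {verdict}")
```

Only the pair whose interval excludes 0 is declared stat. signif. different.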
How to present the results of a multiple comparison:
1) Order the groups by the sample means
2) Underline together those groups which have not been found to be stat. signif. different.
3) If possible, simplify
Example:
Throughout   1stTrimester   Drug Free
_________________________
             ________________________
Data Set 1: Data on the prices of T-shirts and where they were bought. Is there a difference in the prices depending on the store location? ANOVA: p-value=0.015.
Data Set 2: Data on the prices of T-shirts and their sizes. Is there a difference in the prices depending on the size? ANOVA: p-value=0.015.
But in data set 2 there is a natural ordering of the factor "size", and the data is not consistent with this ordering, so here we should be much more cautious with any conclusions than in the case of data set 1.
1) Try a transformation of the response variable
a) If the problem is "small", try the SQRT
b) If the problem is "large", try the LOGT
2) If the transformations don't work, use a nonparametric method (In the case of a oneway ANOVA use Kruskal-Wallis)
Also, in either case you should use the median instead of the mean and IQR/1.35 instead of the standard deviation in your summary table.
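For readers curious about what Kruskal-Wallis computes, here is a minimal Python sketch of its H statistic. It replaces the data by their ranks, which is why outliers and skewness matter much less. The data are made up, the function name is hypothetical, and the sketch assumes there are no tied values (statistical software applies a correction for ties).

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic for a list of groups; assumes no ties."""
    pooled = sorted(y for ys in groups for y in ys)
    # Rank 1 goes to the smallest observation in the pooled sample.
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    n_total = len(pooled)
    # Sum over groups of (rank sum squared) / (group size).
    sum_term = sum(sum(rank[y] for y in ys) ** 2 / len(ys) for ys in groups)
    return 12 / (n_total * (n_total + 1)) * sum_term - 3 * (n_total + 1)


# Made-up data for three groups (illustration only).
samples = [[1.2, 3.4, 2.2], [4.1, 5.6, 6.3], [7.7, 8.9, 9.5]]
h_value = kruskal_wallis_h(samples)
print(h_value)
```

Under H0 the statistic is compared with a chi-square distribution with (number of groups - 1) degrees of freedom; this approximation is rough for very small samples.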
Example: Capacity of Wells:
| Rock Type | N | Median | IQR/1.35 |
|---|---|---|---|
| Dolomite | 50 | 1.72 | 6.92 |
| Limestone | 50 | 0.45 | 1.45 |
| Siliclastic | 50 | 0.46 | 0.96 |
| Metamorphic | 50 | 0.30 | 0.79 |
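The two summary statistics in such a table can be computed with Python's standard library, as in the sketch below. The data are made up (they are not the wells data), and quartile conventions differ between programs, so the IQR may not match MINITAB exactly. The divisor 1.35 is used because for normally distributed data the IQR is about 1.35 standard deviations.

```python
from statistics import median, quantiles

# Made-up, right-skewed sample with one large value (illustration only).
capacity = [0.1, 0.3, 0.4, 0.5, 0.9, 1.7, 6.0]

# quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
q1, _, q3 = quantiles(capacity, n=4)

center = median(capacity)
spread = (q3 - q1) / 1.35  # robust stand-in for the standard deviation

print(center, spread)
```

Unlike the mean and the standard deviation, neither number changes if the largest value 6.0 is replaced by an even larger one.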
How to do a Multiple Comparison in a Two-way ANOVA with Interaction
For the Film Coatings dataset we found a significant interaction between Temperature and Pressure. In the Interaction plot we see that the combination Temp=High and Pressure=Mid results in the thinnest film. But is this combination stat. signif. better than all the others?
To be able to run a multiple comparison we need to turn this into a one-way ANOVA problem. We can do this by combining the two factors Temperature and Pressure into one new factor, call it TP, with levels Low Low, Low Mid, etc. In MINITAB this is done with the Data > Concatenate command. Now we run the one-way ANOVA with Tukey's multiple comparison. The result is:
| HM | ML | HL | MM | MH | HH | LL | LM | LH | |
|---|---|---|---|---|---|---|---|---|---|
| ____________________ | |||||||||
| _________________________ | |||||||||
| _______________ | |||||||||
| _______________ | |||||||||
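The step of combining two factors into the single factor TP can be sketched in Python as follows. The factor columns are made up for illustration; the two-letter labels follow the convention of the table above (first letter Temperature, second letter Pressure).

```python
# Made-up factor columns (illustration only), one entry per observation.
temperature = ["Low", "Low", "Mid", "High", "High"]
pressure = ["Low", "Mid", "High", "Mid", "Low"]

# Combine the two factors into one: first letter of each level, e.g. "HM"
# stands for Temperature=High and Pressure=Mid.
tp = [t[0] + p[0] for t, p in zip(temperature, pressure)]

print(tp)
```

The new column tp is then used as the single factor in a one-way ANOVA.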
In this problem we are interested in the stat. signif. best combination of temperature and pressure. If we want to do a multiple comparison of just one of the factors, we can use the General Linear Model command:
Stat > ANOVA > General Linear Model, Responses: Thickness, Model: Temperature Pressure Temperature*Pressure, Comparisons > Terms: Temperature
Note: you need to include the interaction term Temperature*Pressure in the model; otherwise a model without interaction is fit, which we know is wrong.
The output of MINITAB for the pairwise comparison is different from that of the two-way command: it gives the p-values for the pairwise tests of no difference:
μLow=μMid, p-value=0.00, they are different
μLow=μHigh, p-value=0.00, they are different
μMid=μHigh, p-value=0.88, they are not different
So we find:
Temperature
Low   Mid   High
      __________
When to use Twoway or General Linear Model
Use General Linear Model if
a) you want to do a multiple comparison
b) you have an unbalanced design
c) you have a more complicated design
otherwise use Twoway
For more on ANOVA see chapter 13 of the textbook.