Analysing Discrete Data

Case 1: Chisquare Test for Independence

Data : Each observation has two classifications

Example 1 : Rogaine
Classification 1: Group (Treatment or Control)
Classification 2: Amount of Hair Growth (No Growth, New Vellus, Minimal Growth, Moderate Growth, Dense Growth)

Example 2: Drowning in Los Angeles:
Classification 1: Gender (Male or Female)
Classification 2: Method of Drowning (Swimming Pool, Ocean, etc.)

Basic Question: Is there a connection between the classifications? (Or: are they independent?)

Step 1: Graph (Multiple Barchart)

Example 1 : Rogaine

Commands: open worksheet rogaine.mtw (data in the form of a table)
Graph > Barchart, Values from a table, Cluster, ok
Graph variables c2-c6, Row labels: Groups
Bar Chart options > Show Y as Percent, Within categories of level 1

Example 2 : Drowning
The graph has to be based on percentages, but the graphs MINITAB produces on its own are not suitable here. Compute the percentages yourself instead:
Calc > Calculator, Store in %Male, Expression: ROUND('Male' / SUM('Male') * 100,1)
Calc > Calculator, Store in %Female, Expression: ROUND('Female' / SUM('Female') * 100,1)
Graph > Barchart, Values from a table, Cluster, ok
Graph variables %Male %Female, Row labels: Methods
Bar Chart options > Decreasing Y
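
The two Calculator steps above turn each column of counts into percentages of its own column total. As a rough illustration outside MINITAB, the same arithmetic can be sketched in Python. The counts and method names below are made up for the example; they are not the Los Angeles data.

```python
def column_percentages(counts, decimals=1):
    """Percent of the column total for each count,
    mirroring ROUND(c / SUM(c) * 100, 1) from the Calculator step."""
    total = sum(counts)
    return [round(100 * c / total, decimals) for c in counts]

# Hypothetical counts (NOT the real drowning data), one entry per method.
methods = ["Swimming Pool", "Ocean", "Other"]
male = [30, 20, 10]
female = [10, 20, 30]

pct_male = column_percentages(male)
pct_female = column_percentages(female)

for method, m, f in zip(methods, pct_male, pct_female):
    print(f"{method:15s} %Male = {m:5.1f}   %Female = {f:5.1f}")
```

Each percentage column adds up to (about) 100, so the two groups can be compared even though they have different sizes.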


In most cases the graph should show percentages rather than counts, because the groups being compared usually have different sizes.

Step 2: Hypothesis Test
H0: Classifications are independent
Ha: Classifications are dependent

Example Rogaine:
H0: Classifications are independent = Rogaine does not work
Ha: Classifications are dependent = Rogaine does work

Example Drowning
H0: Classifications are independent = there is no difference in the method of drowning between men and women.
Ha: Classifications are dependent = there is some difference in the method of drowning between men and women.

Commands: Stat > Tables > Chisquare Test, Columns containing the table (remember to use the counts here, not the percentages)
The printout includes the p value.

Just like the other methods we talked about in this class, the Chisquare test for Independence also has its assumptions. They are:
• no more than 20% of the expected cell counts are less than 5
• all expected cell counts are greater than 1


If either condition fails, join (combine) some of the classifications, for example by merging rare categories, and run the test again.
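
The assumptions are stated in terms of expected cell counts, which MINITAB prints along with the test. As a minimal sketch of what goes on behind the printout, the Python below computes the expected counts (row total times column total divided by grand total), the chisquare statistic, the degrees of freedom, and checks the two conditions. The table is hypothetical and the function names are chosen for this illustration only.

```python
def expected_counts(table):
    """Expected count per cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chisquare_independence(table):
    """Return (statistic, degrees of freedom, expected counts)."""
    expected = expected_counts(table)
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df, expected

def assumptions_ok(expected):
    """At most 20% of expected counts below 5, and all of them above 1."""
    cells = [e for row in expected for e in row]
    share_below_5 = sum(e < 5 for e in cells) / len(cells)
    return share_below_5 <= 0.20 and min(cells) > 1

# Hypothetical table of counts: rows are methods, columns are Male, Female.
table = [[30, 10],
         [20, 20],
         [10, 30]]

statistic, df, expected = chisquare_independence(table)
print("expected counts:", expected)
print("chisquare statistic:", statistic, "with df =", df)
print("assumptions satisfied:", assumptions_ok(expected))
```

The p value is the probability that a chisquare variable with these degrees of freedom exceeds the statistic; in this class it is read off the MINITAB printout.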

Case 2: Chisquare Goodness-of-Fit Test

Data : Each observation has one classification. The worksheet has a column "Observed" with the counts and either a column "Expected" or information describing the theory to be tested.
If there is no "Expected" column, use the theory to compute it.

Example 1 : Seat belt use
Column "Observed" with the counts for each type of injury
Column "Historical Percentages" with the theory to be tested
Make column 'Expected' by Calc > Calculator, Store in Expected, Expression: ROUND(SUM('Observed') * 'Historical Percentages' / 100,1)

Example 2: Gregor Mendel's pea experiment
Column "Observed"
Theory: counts of peas should have ratios 9:3:3:1
Note 9+3+3+1=16, so 9/16th of the peas should be smooth yellow, 3/16th wrinkled yellow and so on.
To make column 'Expected' do this:
Make column 'Proportions' with the numbers 9, 3, 3, 1, then use the calculator to divide it by 16 (= 9+3+3+1)
Make column 'Expected' by Calc >Calculator, Store in Expected, Expression: ROUND(SUM('Observed') * 'Proportions',1)
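
Both examples build 'Expected' the same way: split the total number of observations according to the theory. A small Python sketch of that step follows. It assumes the widely quoted counts from Mendel's experiment (315, 101, 108, 32), which may or may not match the class worksheet, and it skips the rounding to one decimal. The function name is made up for this illustration.

```python
def expected_from_theory(observed, weights):
    """Split the total count according to the theory.

    'weights' can be ratios (9, 3, 3, 1) or percentages (adding to 100);
    dividing by their sum turns either one into proportions.
    """
    total = sum(observed)
    weight_sum = sum(weights)
    return [total * w / weight_sum for w in weights]

# Assumed data: commonly cited counts for Mendel's peas, in the order
# smooth yellow, wrinkled yellow, smooth green, wrinkled green.
observed = [315, 101, 108, 32]
ratios = [9, 3, 3, 1]

expected = expected_from_theory(observed, ratios)
for o, e in zip(observed, expected):
    print(f"observed {o:4d}   expected {e:7.2f}")
```

The expected counts add up to the same total as the observed counts, which is a quick check that the column was built correctly.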

Graphs: once you have columns 'Observed' and 'Expected' you can do the multiple barchart for them. Note: here you do not need to use percentages.

Hypothesis Testing:
H0: Data agrees with Theory
Ha: Data does not agree with Theory

Example: For the seat belt data the theory is that the historical percentages are still correct, that is, seat belts do not make a difference.
H0: Data agrees with Theory (Seat belts do not work)
Ha: Data does not agree with Theory (Seat belts do work)

Example: For Mendel's pea data the theory is that the counts have (roughly) the ratios 9:3:3:1
H0: Data agrees with Theory (Ratios are 9:3:3:1)
Ha: Data does not agree with Theory (Ratios are not 9:3:3:1)

Compute the test statistic T with Calc > Calculator, Store in T, Expression:
SUM(('Observed' - 'Expected')**2 / 'Expected')
Calc > Probability Distributions > Chi Square, Cumulative probability
Input Column: T
Degrees of Freedom: number of classifications - 1
The printout gives P( X <= x ); the p value is 1 - P( X <= x )
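
To see the whole calculation in one place, here is a hedged Python sketch of the same steps: the statistic T, the degrees of freedom, and the upper-tail probability. It uses only the standard library. The helper chi2_upper_tail is written for this illustration and relies on the closed-form expressions for whole-number degrees of freedom. The observed counts are the commonly cited Mendel figures, assumed here rather than taken from the class worksheet.

```python
import math

def chi2_upper_tail(x, df):
    """P(X > x) for a chisquare variable with whole-number df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # even df: exp(-x/2) * sum of (x/2)^k / k! for k = 0 .. df/2 - 1
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # odd df: start from the df = 1 tail and add one term per extra 2 df
    total = math.erfc(math.sqrt(half))
    for k in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * half ** (k - 0.5) / math.gamma(k + 0.5)
    return total

def goodness_of_fit(observed, expected):
    """Return (T, degrees of freedom, p value)."""
    t = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return t, df, chi2_upper_tail(t, df)

# Assumed data: commonly cited Mendel counts and the 9:3:3:1 expected counts.
observed = [315, 101, 108, 32]
expected = [312.75, 104.25, 104.25, 34.75]

t, df, p_value = goodness_of_fit(observed, expected)
print(f"T = {t:.3f}, df = {df}, p value = {p_value:.3f}")
```

A large p value like the one printed here means the data give no reason to doubt the 9:3:3:1 theory.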

The use of the chisquare goodness of fit test in detecting cheating
Suppose two students take a multiple choice test. Their answer sheets are as follows:

Because each question has 4 wrong answers, we would expect about 25% of the wrong answers to match when both students miss the same question by chance. If the two students have a much higher rate of matching wrong answers this might indicate cheating.
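
The idea can be sketched as a goodness-of-fit test with two classifications, "wrong answers match" and "wrong answers differ", with expected proportions 25% and 75%. The Python below is only an illustration: the answer key and both answer sheets are invented, the function names are hypothetical, and it assumes five choices per question with all wrong answers equally likely.

```python
import math

def matching_wrong_answers(key, sheet_a, sheet_b):
    """Count questions both students got wrong, and how many of those match."""
    both_wrong = 0
    matching = 0
    for correct, a, b in zip(key, sheet_a, sheet_b):
        if a != correct and b != correct:
            both_wrong += 1
            if a == b:
                matching += 1
    return both_wrong, matching

def matching_test(both_wrong, matching, match_rate=0.25):
    """Goodness-of-fit test with 2 classifications (1 degree of freedom)."""
    observed = [matching, both_wrong - matching]
    expected = [both_wrong * match_rate, both_wrong * (1 - match_rate)]
    t = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(t / 2))  # chisquare upper tail for df = 1
    return t, p_value, expected

# Invented 30-question test; both students answer the first 10 correctly.
key = "ABCDE" * 6
wrong_a = "BCDEA" * 4          # always differs from the key
wrong_b = "CDEAB" * 4          # differs from the key and from wrong_a
sheet_a = key[:10] + wrong_a
sheet_b = key[:10] + wrong_a[:12] + wrong_b[12:]

both_wrong, matching = matching_wrong_answers(key, sheet_a, sheet_b)
t, p_value, expected = matching_test(both_wrong, matching)
print(f"both wrong: {both_wrong}, matching: {matching}, expected: {expected[0]}")
print(f"T = {t:.2f}, p value = {p_value:.4f}")
# T is also large when far FEWER answers match than expected, so only
# treat a small p value as suspicious if matching is above expected[0].
```

With these made-up sheets, 12 of the 20 shared wrong answers match where about 5 would be expected, and the p value is well below 0.05.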

For more on the chisquare test for independence see page 722 of the textbook.
For more on chisquare goodness-of-fit test see page 693 of the textbook.