Using Discrete Predictors in a Regression Model

How to include predictor variables that are discrete in your regression model

Example: Environmental safety and health data

First we need to change Sex to a Numeric column so it can be used as a predictor. We do this as follows:
Data > Code > Text to Numeric, Code Data from Columns: Sex, Into Columns: SexCode, Female=0, Male=1
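Outside Minitab, the same recoding step can be sketched in a few lines of Python. The sample values in `sex` are hypothetical (the worksheet itself is not reproduced here); only the coding Female=0, Male=1 comes from the step above.

```python
# Hypothetical sample of the text column; NOT the real worksheet.
sex = ["Female", "Male", "Male", "Female"]

# Same coding as in the notes: Female=0, Male=1.
SEX_CODES = {"Female": 0, "Male": 1}


def code_sex(values):
    """Map text labels to numeric codes; an unknown label raises KeyError."""
    return [SEX_CODES[v] for v in values]


sex_code = code_sex(sex)
print(sex_code)
```

Using a dictionary lookup rather than an if/else means a misspelled label fails loudly instead of being coded silently as 0.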
Now we can run the regression:

Stat > Regression > Regression, Response= es&h, Predictors= Yrs Serv SexCode
ES&H = 7.04 + 0.0969 Yrs Serv - 2.59 SexCode
The residual vs. fits and normal plot look good, so this is a good model.

Or is it?

Let's do the following: run the regression again, but this time store the residuals and the fits in the worksheet:

Stat > Regression > Regression, Response= es&h, Predictors= Yrs Serv SexCode, Storage > Residuals, Fits

The residual vs. fits plot is of course just the scatterplot of the residuals vs. the fits. But how about this graph:
Graph > Scatterplot, with groups and regression > Y variables=RESI1 ,X variables=FITS1, Categorical variable: Sex

The same principle applies: any pattern in the residual vs. fits plot is a problem!

So what's going on here?

Using SexCode as an additive predictor variable: Parallel Lines Model


Stat > Regression > Regression, Response= es&h, Predictors= Yrs Serv SexCode
ES&H = 7.04 + 0.0969 Yrs Serv - 2.59 SexCode

What does our equation predict for a Female? Now SexCode=0, and so:

Female: ES&H = 7.04 + 0.0969 Yrs Serv - 2.59·0, so
ES&H = 7.04 + 0.0969 Yrs Serv

Same for a Male, and now SexCode=1:

Male: ES&H = 7.04 + 0.0969 Yrs Serv - 2.59·1, so
ES&H = 4.45 + 0.0969 Yrs Serv

Both lines have the same slope (0.0969), and so they are parallel.

Important: If we use a discrete variable as just another predictor in a regression model we always fit parallel lines!

What does the fitted line plot look like? We need to work a little bit to get this graph:

Make a new column x containing the values 0 and 30 (the range of values in Yrs Serv)
Stat > Regression > Regression, Response= es&h, Predictors= Yrs Serv SexCode, Options > Prediction intervals x 0, check store Fits.
Stat > Regression > Regression, Response= es&h, Predictors= Yrs Serv SexCode, Options > Prediction intervals x 1, check store Fits.
Graph > Scatterplot, With groups > Y variables=es&h ,X variables= Yrs Serv, Categorical variable : Sex
Right-click graph, Add > Calculated Line, y=PFIT1, x=x
Right-click graph, Add > Calculated Line, y=PFIT2, x=x
Change the colors of the lines to match the dots.

This is also called an additive model because going from one value of the discrete predictor (say 0) to another (say 1) changes every fit by the same amount (here 4.45 - 7.04 = -2.59, the coefficient of SexCode), no matter what the value of the continuous predictor is.
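The constant gap can be checked numerically. The sketch below uses only the coefficients of the fitted equation above; the function name `predict_additive` is made up for illustration. It evaluates both lines at the endpoints 0 and 30 of the column x, which is what the stored fits PFIT1 and PFIT2 hold.

```python
# Coefficients copied from the fitted equation in the notes:
# ES&H = 7.04 + 0.0969 Yrs Serv - 2.59 SexCode
B0, B_YRS, B_SEX = 7.04, 0.0969, -2.59


def predict_additive(yrs_serv, sex_code):
    """Fitted ES&H score under the parallel lines (additive) model."""
    return B0 + B_YRS * yrs_serv + B_SEX * sex_code


# Endpoints of the two fitted lines over the range 0 to 30 years,
# playing the role of the stored fits PFIT1 and PFIT2.
x = [0, 30]
pfit_female = [predict_additive(v, 0) for v in x]
pfit_male = [predict_additive(v, 1) for v in x]

# Male minus Female equals the SexCode coefficient at every x.
gaps = [m - f for f, m in zip(pfit_female, pfit_male)]
print(pfit_female, pfit_male, gaps)
```

Whatever value of Yrs Serv is plugged in, the gap stays at the SexCode coefficient, which is exactly what "parallel" means.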

Sometimes parallel lines are a good model, but not always. Here is an example where using parallel lines is clearly wrong:

But parallel lines are automatically what you get if you simply fit the response versus the continuous and the discrete predictor!

Using SexCode as a multiplicative predictor variable: Equal Intercept Model

Here is another, rather strange, way to use SexCode:
Calc > Calculator, Store in: Yrs Serv*SexCode, Expression: Yrs Serv*SexCode

Stat > Regression > Regression, Response= es&h, Predictors= Yrs Serv Yrs Serv*SexCode
ES&H = 5.60 + 0.167 Yrs Serv - 0.136 Yrs Serv*SexCode

Again, what does this look like for Females and Males?
Female: ES&H = 5.60 + 0.167 Yrs Serv - 0.136·0·Yrs Serv = 5.60 + 0.167 Yrs Serv
Male: ES&H = 5.60 + 0.167 Yrs Serv - 0.136·1·Yrs Serv = 5.60 + 0.031 Yrs Serv

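This model always fits lines with the same intercept: at Yrs Serv = 0 the product Yrs Serv*SexCode is 0 for both groups, so both lines start at 5.60.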

Commands:
Stat > Regression > Regression, Response= es&h, Predictors= Yrs Serv Yrs Serv*SexCode, Options > Prediction intervals x 0, check store Fits.
Stat > Regression > Regression, Response= es&h, Predictors= Yrs Serv Yrs Serv*SexCode, Options > Prediction intervals x x, check store Fits.
Graph > Scatterplot, With groups > Y variables=es&h ,X variables= Yrs Serv, Categorical variable : Sex
Right-click graph, Add > Calculated Line, y=PFIT1, x=x
Right-click graph, Add > Calculated Line, y=PFIT2, x=x
Change the colors of the lines to match the dots.
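The prediction values "x 0" and "x x" in the commands make sense once the product column is written out: for a Female the product Yrs Serv*SexCode is 0, for a Male it equals Yrs Serv itself. A minimal sketch using the coefficients of the fitted equation above (the function names are illustrative, not Minitab's):

```python
# Coefficients copied from the fitted equation in the notes:
# ES&H = 5.60 + 0.167 Yrs Serv - 0.136 Yrs Serv*SexCode
B0, B_YRS, B_PROD = 5.60, 0.167, -0.136


def predict_equal_intercept(yrs_serv, sex_code):
    """Fitted ES&H score under the equal intercept model."""
    product = yrs_serv * sex_code  # 0 for Females, yrs_serv for Males
    return B0 + B_YRS * yrs_serv + B_PROD * product


def slope(sex_code):
    """Slope of the fitted line for one group."""
    return B_YRS + B_PROD * sex_code


x = [0, 30]
pfit_female = [predict_equal_intercept(v, 0) for v in x]
pfit_male = [predict_equal_intercept(v, 1) for v in x]
print(pfit_female, pfit_male, slope(0), slope(1))
```

Both lists start at the same value because the product term vanishes at 0 years of service; only the slopes differ.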

Using SexCode both as an additive and as a multiplicative predictor variable: Independent Lines Model

Finally, use SexCode twice, once alone and once as Yrs Serv*SexCode:

Stat > Regression > Regression, Response= es&h, Predictors= Yrs Serv SexCode Yrs Serv*SexCode
ES&H = 7.32 + 0.0722 Yrs Serv - 3.20 SexCode + 0.0653 Yrs Serv*SexCode

What happens now?
Female: ES&H = 7.32 + 0.0722 Yrs Serv - 3.20·0 + 0.0653·0·Yrs Serv = 7.32 + 0.0722 Yrs Serv
Male: ES&H = 7.32 + 0.0722 Yrs Serv - 3.20·1 + 0.0653·1·Yrs Serv = 4.12 + 0.1375 Yrs Serv

So this fits two separate lines, each with its own intercept and its own slope.


This graph is easy to get: Graph > Scatterplot > with Groups and Regression

Note: you can get the same two equations by splitting the dataset into two parts, the scores and years of the Females and the scores and years of the Males, and then running a simple regression on each part. Doing one multiple regression has some advantages, though. For example, you get one R² for the whole problem instead of a separate one for each part.
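The equivalence in the note can be verified on a small example. The sketch below uses synthetic numbers (NOT the ES&H data) and plain Python. It fits one simple regression per group, builds the four coefficients of the single model from them, and checks that the residuals are orthogonal to every predictor column, which is the condition that characterizes the least squares solution.

```python
def simple_fit(x, y):
    """Least squares intercept and slope for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


# Synthetic illustration data, invented for this sketch.
yrs = [1, 5, 10, 20, 2, 8, 15, 25]
code = [0, 0, 0, 0, 1, 1, 1, 1]
score = [7.0, 7.9, 8.1, 9.0, 4.5, 5.0, 6.4, 7.3]


def subset(c):
    """Years and scores of the group with SexCode == c."""
    xs = [x for x, d in zip(yrs, code) if d == c]
    ys = [y for y, d in zip(score, code) if d == c]
    return xs, ys


a_f, b_f = simple_fit(*subset(0))  # Female line
a_m, b_m = simple_fit(*subset(1))  # Male line

# Coefficients of the single model, in the order
# constant, Yrs Serv, SexCode, Yrs Serv*SexCode.
beta = [a_f, b_f, a_m - a_f, b_m - b_f]

columns = [[1] * len(yrs), yrs, code, [x * d for x, d in zip(yrs, code)]]
fitted = [sum(b * col[i] for b, col in zip(beta, columns))
          for i in range(len(yrs))]
resid = [y - f for y, f in zip(score, fitted)]

# Normal equations: each entry is zero up to rounding error.
normal_eqs = [sum(r * c for r, c in zip(resid, col)) for col in columns]
print(beta)
print(normal_eqs)
```

Read the other way round, the SexCode coefficient is the difference between the two intercepts and the product coefficient is the difference between the two slopes.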

Including RaceCode as well

The same principle applies when including RaceCode in the analysis. First code RaceCode as NonWhite=0, White=1. Then make a variable RaceCode*Yrs Serv. Here are all the possible models:

1) ES&H by Yrs Serv, SexCode and RaceCode = all parallel lines

2) ES&H by Yrs Serv, SexCode*Yrs Serv and RaceCode = two pairs of lines with equal intercept

3) ES&H by Yrs Serv, SexCode and RaceCode*Yrs Serv = two pairs of lines with equal intercept

4) ES&H by Yrs Serv, SexCode*Yrs Serv and RaceCode*Yrs Serv = four lines with equal intercept

5) ES&H by Yrs Serv, SexCode, SexCode*Yrs Serv and RaceCode = two pairs of parallel lines

6) ES&H by Yrs Serv, SexCode, RaceCode and RaceCode*Yrs Serv = two pairs of parallel lines

7) ES&H by Yrs Serv, SexCode, SexCode*Yrs Serv, RaceCode and RaceCode*Yrs Serv = four different lines
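The descriptions in this list can be checked mechanically by counting how many distinct intercepts and slopes each model allows across the four Sex/Race groups. All coefficient values below are hypothetical, chosen only so that no two groups coincide by accident; the model numbers follow the list above.

```python
from itertools import product

# Hypothetical coefficients, for illustration only.
B0, B_YRS = 7.0, 0.1
COEF = {
    "SexCode": -2.5,
    "RaceCode": 1.0,
    "SexCode*Yrs Serv": 0.05,
    "RaceCode*Yrs Serv": 0.02,
}

MODELS = {
    1: {"SexCode", "RaceCode"},
    2: {"SexCode*Yrs Serv", "RaceCode"},
    3: {"SexCode", "RaceCode*Yrs Serv"},
    4: {"SexCode*Yrs Serv", "RaceCode*Yrs Serv"},
    5: {"SexCode", "SexCode*Yrs Serv", "RaceCode"},
    6: {"SexCode", "RaceCode", "RaceCode*Yrs Serv"},
    7: {"SexCode", "SexCode*Yrs Serv", "RaceCode", "RaceCode*Yrs Serv"},
}


def line_for(terms, sex, race):
    """Intercept and slope of the fitted line for one Sex/Race group."""
    def use(name):
        return COEF[name] if name in terms else 0.0

    intercept = B0 + use("SexCode") * sex + use("RaceCode") * race
    slope = (B_YRS + use("SexCode*Yrs Serv") * sex
             + use("RaceCode*Yrs Serv") * race)
    return round(intercept, 6), round(slope, 6)


def count_distinct(terms):
    """(number of distinct intercepts, number of distinct slopes)."""
    lines = [line_for(terms, s, r) for s, r in product([0, 1], repeat=2)]
    return len({a for a, _ in lines}), len({b for _, b in lines})


for number, terms in MODELS.items():
    print(number, count_distinct(terms))
```

For example, model 1 gives four intercepts and one slope (four parallel lines), while model 4 gives one intercept and four slopes.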

Best Model

Which of these is best? Use Stat > Best Subsets to see that the best linear model is the one with four parallel lines, using the predictors Yrs Serv, SexCode and RaceCode, with a Mallows' Cp of 2.2.

If we leave RaceCode out and find the best model based on Yrs Serv, SexCode and SexCode*Yrs Serv, we find that it is the model with all predictors (Mallows' Cp=4.0).

There is something strange here: with RaceCode in the model, SexCode*Yrs Serv is not needed; without RaceCode it is. How is this possible?
One possible explanation would be a high correlation between SexCode and RaceCode. But that is not the case here: the correlation is actually 0! (Why?)
What else might explain this? Note that in the Best Subsets regression with Yrs Serv, SexCode and SexCode*Yrs Serv the model with just Yrs Serv and SexCode is second best (Mallows' Cp=4.2). Is this model actually statistically significantly worse than the model with Mallows' Cp=4.0? Cp is a statistic, that is, it depends on random fluctuations. Whether a difference of 0.2 is statistically significant is hard to tell, but my guess is that it is not.
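For reference, here is a sketch of how Cp is computed, assuming the standard textbook definition (the notes do not show Minitab's internals, and the numbers below are hypothetical). One consequence of this definition is that the model containing all candidate predictors always gets a Cp equal to its number of coefficients, which is consistent with the value 4.0 reported for the model with three predictors plus a constant.

```python
def mallows_cp(sse_sub, mse_full, n, p):
    """Mallows' Cp for a candidate model (standard definition, assumed).

    sse_sub  -- error sum of squares of the candidate model
    mse_full -- mean squared error of the model with all candidate predictors
    n        -- number of observations
    p        -- number of coefficients in the candidate model,
                counting the constant
    """
    return sse_sub / mse_full - n + 2 * p


# Hypothetical numbers, for illustration only.
n, p_full, sse_full = 50, 4, 92.0
mse_full = sse_full / (n - p_full)

# For the full model SSE / MSE = n - p, so Cp = (n - p) - n + 2p = p.
cp_full = mallows_cp(sse_full, mse_full, n, p_full)
print(cp_full)
```

So a Cp of 4.0 for the full model carries no information by itself; what matters is how close the smaller models come to their own p.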

Lines and Interaction

Above we explained the problem of using discrete predictors in a regression model in terms of parallel lines vs. equal intercept or independent lines. Another way to look at this is in terms of interactions between the predictors. Parallel lines are ok if the discrete and the continuous predictors do not interact, that is, if the effect of the continuous predictor on the response is the same for every value of the discrete predictor. Product terms such as Yrs Serv*SexCode and Yrs Serv*RaceCode are therefore often called interaction terms. For your purposes in this class (and later when doing work such as this) simply remember to include product terms when you have discrete predictors. You can then use Best Subsets regression to decide whether they are in fact necessary. But you have to include them yourself first, otherwise you will never find out!
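In symbols, the three models for one continuous predictor and one 0/1 predictor can be summarized as follows. The notation is introduced here for illustration and is not from the notes: x stands for Yrs Serv, d for SexCode, and the β's for the fitted coefficients.

```latex
\begin{align*}
\text{Parallel lines:}    \quad & y = \beta_0 + \beta_1 x + \beta_2 d \\
\text{Equal intercept:}   \quad & y = \beta_0 + \beta_1 x + \beta_3 x d \\
\text{Independent lines:} \quad & y = \beta_0 + \beta_1 x + \beta_2 d + \beta_3 x d
\end{align*}
```

For d = 0 the line is always β0 + β1 x. For d = 1 the intercept becomes β0 + β2 and the slope becomes β1 + β3, so β3 measures the interaction: how much the effect of x changes between the two groups.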

Nonlinear Models

Finally, there is good reason to believe that no linear model is appropriate for this dataset. Why?

A little more than you need to know for this class: A good model for this dataset needs to take into account the fact that ES&H is bounded between 1 and 10. This can be done using a transformation on the response ES&H. A possible solution is a curve that looks like this:
Fitting curves such as these to the data we find the following fitted line plot:
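The notes do not say which transformation was used. One common choice with the required property, offered here purely as an assumption, is a logit-type transform that stretches the interval from 1 to 10 onto the whole real line, so that any straight line fitted on the transformed scale maps back to an S-shaped curve that stays between 1 and 10.

```python
import math

LOW, HIGH = 1.0, 10.0  # bounds of ES&H stated in the notes


def to_unbounded(score):
    """Logit-type transform; defined only for LOW < score < HIGH."""
    return math.log((score - LOW) / (HIGH - score))


def to_score(z):
    """Inverse transform: an S-shaped curve that stays inside the bounds."""
    return LOW + (HIGH - LOW) / (1.0 + math.exp(-z))


# Round trip on an arbitrary value inside the bounds.
print(to_score(to_unbounded(7.3)))
# The midpoint of the scale corresponds to z = 0.
print(to_score(0.0))
```

Scores of exactly 1 or 10 cannot be transformed this way and would need separate handling, for example a small adjustment before transforming.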