Understanding Dummy Variable Traps in Regression
The dummy variable trap is one of the most common problems encountered when running a linear regression. Typically the dependent variable is expected to be continuous, whereas the independent variables can be continuous or categorical. First we will look at what the dummy variable trap is. Then we will try to understand how the interpretation of dummy variables differs from that of continuous variables in a linear model.
Dummy variables, also called indicator variables, take the values 1 or 0 to mark the presence or absence of a particular category. A regression model can only use numeric variables. Therefore, if a variable is character-valued (categorical), we have to transform it into a quantitative one. Simply recoding the categories as arbitrary numbers (for example 1, 2, 3, 4) does not give a dummy variable; a dummy is created only when each category gets its own 0/1 indicator variable. Let us see what this means with an example.
Suppose we want to study what drives the price of a car, say the Scorpio, and the location or city is one of the attributes that would probably have an impact on the price. Suppose there are four cities under consideration (Mumbai, Chennai, Bangalore and Kolkata) and City is the name of this variable. The first step is to create four indicator variables, one each for Mumbai, Chennai, Bangalore and Kolkata. We then add them to the model as separate variables, but instead of adding all four we use only three. This is because the fourth city acts as the baseline: when the other three indicators are all 0, the car must be from the fourth city, so its indicator does not provide any incremental information to the model.
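The step above can be sketched in Python with pandas. This is only an illustration: the data frame name cars, the Price column and all the price values are made up, and Mumbai is chosen as the dropped city only to match the example used later.

```python
import pandas as pd

# Hypothetical data: the prices are invented purely for illustration.
cars = pd.DataFrame({
    "City": ["Mumbai", "Chennai", "Bangalore", "Kolkata", "Mumbai", "Bangalore"],
    "Price": [10.0, 10.2, 11.5, 9.1, 10.4, 11.1],
})

# One 0/1 indicator column per city (four columns in total).
all_dummies = pd.get_dummies(cars["City"], dtype=int)

# Keep only three of them; the dropped city (Mumbai here) becomes the baseline.
dummies = all_dummies.drop(columns="Mumbai")

print(dummies.columns.tolist())
```

A row for a Mumbai car has 0 in all three remaining columns, which is how the baseline city is still represented in the data.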
The obvious question is which variable to drop. The answer is any of them; the choice changes only the baseline against which the others are compared. For a continuous independent variable in Y = alpha + beta * X, we interpret the beta coefficient as follows: a one-unit change in the independent variable X is associated with a change of beta units in the dependent variable Y.
However, how do you interpret a categorical independent variable? If gender is your independent variable, it makes little sense to talk about a one-unit change in male!
The correct approach in this case is to interpret the coefficient relative to the baseline dummy, that is, the dummy that we did not add to the model. Going back to the example of the price and location of the Scorpio, let us assume that we dropped Mumbai from the model. Suppose that, after including Chennai, Bangalore and Kolkata, Chennai turns out to be insignificant (or does not stay in the model), Bangalore gets a positive coefficient and Kolkata gets a negative coefficient.
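With Mumbai as the dropped category, the model can be written out as below. The subscripted symbols (beta_C, beta_B, beta_K) and the error term are notation introduced here for illustration; Chennai, Bangalore and Kolkata stand for the 0/1 dummies.

```latex
\begin{align*}
\text{Price} &= \alpha + \beta_C\,\text{Chennai} + \beta_B\,\text{Bangalore} + \beta_K\,\text{Kolkata} + \varepsilon \\
E[\text{Price} \mid \text{Mumbai}]    &= \alpha \\
E[\text{Price} \mid \text{Chennai}]   &= \alpha + \beta_C \\
E[\text{Price} \mid \text{Bangalore}] &= \alpha + \beta_B \\
E[\text{Price} \mid \text{Kolkata}]   &= \alpha + \beta_K
\end{align*}
```

The intercept alpha is the expected price in the baseline city, and each beta is the difference between that city's expected price and the Mumbai price.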
A positive Bangalore coefficient means that, compared to Mumbai, the price is higher in Bangalore, which can be read as consumers preferring Bangalore. A negative Kolkata coefficient indicates that the price is lower in Kolkata, that is, Kolkata is less preferred than Mumbai. An insignificant Chennai coefficient simply means that there is no detectable difference between Chennai and Mumbai, so consumers treat them equally.
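A small numeric sketch may make this reading concrete. The prices below are invented, and plain least squares from NumPy is used as a stand-in for whatever regression software is actually at hand; it does not compute significance, only the coefficients.

```python
import numpy as np

# Hypothetical prices (arbitrary units), two cars per city.
prices = {
    "Mumbai": [10.0, 10.4],
    "Chennai": [10.1, 10.3],
    "Bangalore": [11.0, 11.6],
    "Kolkata": [9.0, 9.4],
}

baseline = "Mumbai"  # the dropped dummy
others = [c for c in prices if c != baseline]

rows, y = [], []
for city, values in prices.items():
    for p in values:
        # Intercept column plus one 0/1 indicator per non-baseline city.
        rows.append([1.0] + [1.0 if city == c else 0.0 for c in others])
        y.append(p)

X = np.array(rows)
y = np.array(y)

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept = coef[0]
effects = dict(zip(others, coef[1:]))

print("baseline (Mumbai) mean price:", round(intercept, 2))
for city, b in effects.items():
    print(city, "vs Mumbai:", round(b, 2))
```

In this made-up data the intercept equals the average Mumbai price, and each city's coefficient equals its average price minus the Mumbai average: positive for Bangalore, negative for Kolkata and essentially zero for Chennai.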
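If we use all four cities in the model along with the intercept, the four dummies always add up to 1, which is exactly the intercept column. Depending on the software, you will get an error, one variable will be dropped automatically, or the output may be unreliable. With only three cities we are still accounting for all the information; adding the fourth does not provide any incremental information to the model.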
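One way to see the redundancy is to check the rank of the design matrix. The sketch below uses NumPy with a made-up list of observed cities; the variable names are illustrative only.

```python
import numpy as np

cities = ["Mumbai", "Chennai", "Bangalore", "Kolkata"]
# Hypothetical sample: the city of each observed car.
observed = ["Mumbai", "Chennai", "Bangalore", "Kolkata", "Mumbai", "Bangalore"]

# 0/1 indicator columns for all four cities.
D = np.array([[1.0 if obs == c else 0.0 for c in cities] for obs in observed])
intercept = np.ones((len(observed), 1))

X_trap = np.hstack([intercept, D])       # intercept + all 4 dummies
X_ok = np.hstack([intercept, D[:, 1:]])  # intercept + 3 dummies (Mumbai dropped)

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)

print("columns:", X_trap.shape[1], "rank:", rank_trap)
print("columns:", X_ok.shape[1], "rank:", rank_ok)
```

With all four dummies the matrix has five columns but only rank 4, so the coefficients cannot be determined uniquely. After dropping one dummy, the number of columns and the rank agree.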
The dummy variable trap is a case of perfect multicollinearity. Some more reading on this concept: http://www.slideshare.net/Akramism/dummy-variable-28538000