Image source — tractorsupply.com
Hello, machine learning enthusiasts!
A warm welcome to my new story, in which I explain dummy variables (what they are and why we need them) along with an important related concept known as the dummy variable trap.
Let's first understand the concept of dummy variables.
Any real-world data set that you or I work on has a mix of different types of variables. These variables predominantly fall into two buckets:
- continuous variables: for example age, height, weight, salary, etc. These variables can take any numeric value within a range.
- categorical variables: for example gender, city, job role, car color, etc. These variables can take only a limited set of specific values.
I have created a sample dummy data set called “Boys health data” for demonstration purposes.
It contains data on three young boys from different cities in the USA, and my job is to train a linear regression model to predict the dependent feature (Weight in this case) using the independent features (Height, Age and City in this case).
A multiple linear regression model usually takes the form of an equation of a line as below:
y = m1*x1 + m2*x2 + m3*x3 + … + mn*xn + c
x1, x2, x3, …, xn = independent variables
m1, m2, m3, …, mn = coefficients for x1, x2, x3, …, xn respectively
c = intercept of the regression line
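To make the equation concrete, here is a minimal sketch in Python. The function name `predict` and the numbers passed to it are made up purely for illustration; they are not taken from the sample data.

```python
def predict(coefficients, values, intercept):
    """Evaluate y = m1*x1 + m2*x2 + ... + mn*xn + c for one data point."""
    if len(coefficients) != len(values):
        raise ValueError("need exactly one coefficient per independent variable")
    return sum(m * x for m, x in zip(coefficients, values)) + intercept


# Hypothetical coefficients, inputs and intercept, chosen only for illustration.
y = predict(coefficients=[2.0, 3.0], values=[10.0, 4.0], intercept=5.0)
print(y)  # 2.0*10.0 + 3.0*4.0 + 5.0 = 37.0
```

Note that every x has to be a number for this to work, which is exactly where categorical columns cause trouble.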
Next, I will plug in some made-up coefficients to create a multiple regression model for the sample data.
Weight = 0.4*Height+0.8*Age+0.9*City
Looking at the above equation, it is evident that the third term on the right-hand side does not make sense. “0.9*Chicago” or “0.9*Boston” is meaningless when estimating “Weight”, which is a continuous variable!
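Python agrees that this term is meaningless. In the short sketch below, the height, age and city values are hypothetical stand-ins for one row of the sample data:

```python
# Hypothetical values for one boy; not taken from the original table.
height, age, city = 150.0, 12.0, "Chicago"

try:
    weight = 0.4 * height + 0.8 * age + 0.9 * city
    failed = False
except TypeError as err:
    # A float cannot be multiplied by a string, so the equation cannot be evaluated.
    failed = True
    print("Cannot evaluate the equation:", err)
```

The numeric terms are fine; it is only the city name that stops the calculation.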
The solution to this problem is to use dummy variables. A dummy variable is an artificial variable that takes the value 0 or 1 to indicate the absence or presence of one category of an attribute with two or more distinct categories.
For example, if we create dummy variables for the City column in the above data, the modified data may look like this:
This data set can now be fed into the regression equation mentioned above.
Do you see a problem with the above training data? If not, please have a look at the equation below:
City_Chicago + City_Boston + City_Phoenix = 1
The above equation holds true for every data point in the training data. If we break the problem down further:
City_Chicago can be predicted by City_Boston and City_Phoenix
City_Boston can be predicted by City_Chicago and City_Phoenix
City_Phoenix can be predicted by City_Chicago and City_Boston
If two of the values are known, the third one can be predicted!
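This relationship is easy to check in code. The sketch below uses a hypothetical stand-in for the dummy-encoded city columns (one boy per city) and shows that each row sums to 1, so one column can be rebuilt from the other two:

```python
import pandas as pd

# Hypothetical stand-in for the dummy-encoded city columns (one boy per city).
dummies = pd.DataFrame({
    "City_Chicago": [1, 0, 0],
    "City_Boston": [0, 1, 0],
    "City_Phoenix": [0, 0, 1],
})

# Every row sums to 1 because each boy lives in exactly one city.
row_sums = dummies.sum(axis=1)
print(row_sums.tolist())

# So any one column can be rebuilt from the other two.
rebuilt_chicago = 1 - dummies["City_Boston"] - dummies["City_Phoenix"]
matches = bool((rebuilt_chicago == dummies["City_Chicago"]).all())
print(matches)
```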
In statistics, multicollinearity is a phenomenon in which one independent variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy.
Multicollinearity is a problem because independent variables should be independent. If the degree of correlation between variables is high enough, it can cause problems when you fit the model and interpret the results.
Though we get rid of the categorical variable by using dummy variables, we introduce multicollinearity into the data. This is the TRAP!
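One way to see the trap numerically is to look at the rank of the design matrix. The sketch below assumes the regression includes an intercept (the c term), which acts like a column of ones. The matrix values are hypothetical, not the original data.

```python
import numpy as np

# Hypothetical design matrix: an intercept column plus all three city dummies.
# Columns: intercept, City_Chicago, City_Boston, City_Phoenix
X_all = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
])

# The same matrix with the City_Chicago column removed.
X_dropped = np.delete(X_all, 1, axis=1)

rank_all = np.linalg.matrix_rank(X_all)
rank_dropped = np.linalg.matrix_rank(X_dropped)
print(rank_all, "independent columns out of", X_all.shape[1])
print(rank_dropped, "independent columns out of", X_dropped.shape[1])
```

With all three dummies the matrix has four columns but only three of them are linearly independent, so the coefficients cannot be pinned down uniquely. After one dummy is dropped, every remaining column carries its own information.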
The refined solution:
To avoid the TRAP, the solution is to create one fewer dummy variable than the number of levels in the categorical variable:
Dummy variable count = Category count - 1
In our case, category count = 3 (Chicago, Boston, Phoenix).
Hence, to avoid the TRAP, we should create 3 - 1 = 2 dummy variables.
Can I show you how to do it in Python? Yes, sure.
Creating dummy variables for the above data using the pandas get_dummies function:
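A minimal sketch of this step could look like the following. The numbers in the data frame are invented, since the original table is not reproduced here; only the column names (Height, Age, City, Weight) and the three cities come from the text.

```python
import pandas as pd

# Hypothetical stand-in for the "Boys health data"; the numbers are invented.
data = pd.DataFrame({
    "Height": [150, 142, 160],
    "Age": [12, 11, 13],
    "City": ["Chicago", "Boston", "Phoenix"],
    "Weight": [40, 36, 48],
})

# One 0/1 column per city; dtype=int gives 0/1 values instead of True/False.
ModifiedData = pd.get_dummies(data, columns=["City"], dtype=int)
print(ModifiedData)
```

The original City column is replaced by City_Boston, City_Chicago and City_Phoenix; pandas orders the new columns alphabetically by category.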
From this “ModifiedData” we can remove any one of the three columns (City_Chicago, City_Boston, City_Phoenix). Please note that it can be any one of them, picked at random :)
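Dropping one column by hand might look like the sketch below. The data frame is a hypothetical, already dummy-encoded stand-in for “ModifiedData”, and the choice of City_Phoenix is arbitrary.

```python
import pandas as pd

# Hypothetical dummy-encoded data standing in for "ModifiedData".
ModifiedData = pd.DataFrame({
    "Height": [150, 142, 160],
    "Age": [12, 11, 13],
    "Weight": [40, 36, 48],
    "City_Boston": [0, 1, 0],
    "City_Chicago": [1, 0, 0],
    "City_Phoenix": [0, 0, 1],
})

# Drop any one dummy column; City_Phoenix is an arbitrary choice.
ReducedData = ModifiedData.drop(columns=["City_Phoenix"])
print(list(ReducedData.columns))
```

A boy from Phoenix is now represented by City_Boston = 0 and City_Chicago = 0, so no information is lost.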
An alternative is to tell pandas directly that you want to drop the first dummy column from the data. We just need to set drop_first=True in the get_dummies function.
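As a sketch, again with invented numbers standing in for the sample data:

```python
import pandas as pd

# Hypothetical stand-in for the "Boys health data"; the numbers are invented.
data = pd.DataFrame({
    "Height": [150, 142, 160],
    "Age": [12, 11, 13],
    "City": ["Chicago", "Boston", "Phoenix"],
    "Weight": [40, 36, 48],
})

# drop_first=True removes the dummy for the first category, leaving 3 - 1 = 2 columns.
DroppedFirst = pd.get_dummies(data, columns=["City"], drop_first=True, dtype=int)
print(list(DroppedFirst.columns))
```

Here “first” means first in sorted order of the category values, so with these city names it is City_Boston that gets dropped, not the city in the first row.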
In this story, my intention was to make the concept of dummy variables and their hidden trap clear. Thank you for reading. Please feel free to share your feedback in the comments section.
You can also join my Facebook group, “Unfold Data Science”, here, where fellow data scientists and I discuss data science concepts and other queries or doubts related to the data science industry. This group is useful for data science aspirants as well.
Follow me on Quora for regular answers on data science here
Connect with me on LinkedIn here.
Anyone looking for guidance in data science can reach me on below mentioned email.