
Hello, machine learning enthusiasts!

A warm welcome to my new story, in which I explain dummy variables (what they are and why we need them) along with an important related concept known as the dummy variable trap.

Let's start by understanding the concept of dummy variables.

The problem:

Variables in a data set are broadly of two types:

  1. Continuous variables. Examples: age, height, weight, salary, etc. These variables can take any numeric value within a range.
  2. Categorical variables. Examples: gender, city, job role, car color, etc. These variables can take only a fixed set of values.

I have created a small sample data set called “Boys health data” for demonstration purposes.

It contains data on three young boys from different cities in the USA, and my job is to train a linear regression model to predict the dependent feature (Weight in this case) using the independent features (Height, Age and City in this case).

Boys health data

A multiple linear regression model usually takes the form of an equation of a line as below:

y = m1*x1 + m2*x2 + m3*x3 + … + mn*xn + c

Where

y=dependent variable

x1,x2,x3…xn = independent variables

m1,m2,m3…mn = coefficients for x1,x2,x3…xn respectively

c = intercept of regression line
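
To make the equation concrete, here is a minimal sketch that evaluates it with NumPy. The coefficients, the intercept and the input values are made up purely for illustration; they do not come from any fitted model.

```python
import numpy as np

# Hypothetical coefficients m1, m2 and intercept c (illustrative values only)
m = np.array([0.4, 0.8])
c = 2.0

# One hypothetical observation: x1 = 150.0, x2 = 12.0
x = np.array([150.0, 12.0])

# y = m1*x1 + m2*x2 + c
y = float(m @ x + c)
print(y)
```

The dot product `m @ x` is just the sum m1*x1 + m2*x2, so the same line works for any number of independent variables.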

Next, I will plug in some made-up coefficients and write a multiple regression model for the sample data.

Weight = 0.4*Height+0.8*Age+0.9*City

Looking at the above equation, it is evident that the third term on the right-hand side does not make sense. “0.9*Chicago” or “0.9*Boston” is meaningless when estimating “Weight”, which is a continuous variable! A regression model needs a number for every term.
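
The same problem shows up the moment you try this arithmetic in code. A tiny sketch in plain Python (the variable names are only for illustration):

```python
coefficient = 0.9
city = "Chicago"

try:
    # A float cannot be multiplied by a string, so this line raises TypeError
    term = coefficient * city
except TypeError as err:
    term = None
    print("Cannot compute 0.9 * City:", err)
```

Python refuses to multiply a decimal number by a piece of text, which is exactly why the City column has to be converted to numbers first.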

The Solution:

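A dummy variable is a 0/1 column that indicates whether a row belongs to a particular category. For example, if we create dummy variables for the City column in the above data, the modified data may look like this: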

This data set can now be fed into the regression equation mentioned above.
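
Here is a minimal sketch of the idea, building the dummy columns by hand. The Height and Age values and all the coefficients are made up for illustration; only the three city names come from the example.

```python
import pandas as pd

# Hypothetical values; the real table is not reproduced here
data = pd.DataFrame({
    "Height": [150, 160, 155],
    "Age": [12, 14, 13],
    "City": ["Chicago", "Boston", "Phoenix"],
})

# One 0/1 column per city: 1 if the boy lives there, 0 otherwise
for city in ["Chicago", "Boston", "Phoenix"]:
    data[f"City_{city}"] = (data["City"] == city).astype(int)

# Every term is now numeric, so the equation can be evaluated.
# All coefficients below are made up for illustration.
predicted = (0.4 * data["Height"] + 0.8 * data["Age"]
             + 0.9 * data["City_Chicago"]
             + 0.5 * data["City_Boston"]
             + 0.7 * data["City_Phoenix"])

print(data.drop(columns="City"))
print(predicted)
```

Each city now has its own coefficient, and for any given boy only the coefficient of his own city contributes to the prediction.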

The TRAP:

City_Chicago + City_Boston + City_Phoenix = 1

The above equation holds true for every data point in the training data, because each boy lives in exactly one city. If we break the problem down further:

City_Chicago can be predicted by City_Boston and City_Phoenix

City_Boston can be predicted by City_Chicago and City_Phoenix

City_Phoenix can be predicted by City_Chicago and City_Boston

If any two values are known, the third one can be determined exactly!
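
This is easy to check in code. The sketch below uses a made-up list of cities (five hypothetical rows) and confirms that the three dummy columns always add up to 1, so one column can be rebuilt from the other two.

```python
import pandas as pd

# Hypothetical city values, for illustration only
cities = pd.Series(["Chicago", "Boston", "Phoenix", "Boston", "Chicago"])

# Build the three 0/1 columns by hand
dummies = pd.DataFrame({
    f"City_{name}": (cities == name).astype(int)
    for name in ["Chicago", "Boston", "Phoenix"]
})

# Each row has exactly one 1, so every row sums to 1
row_sums = dummies.sum(axis=1)

# City_Chicago is fully determined by the other two columns
reconstructed = 1 - dummies["City_Boston"] - dummies["City_Phoenix"]

print(row_sums.tolist())
print((reconstructed == dummies["City_Chicago"]).all())
```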

About multicollinearity:

In statistics, multicollinearity is a phenomenon in which one independent variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy.

Multicollinearity is a problem because independent variables should be independent. If the degree of correlation between variables is high enough, it can cause problems when you fit the model and interpret the results.

Though we get rid of the categorical variable by using dummy variables, we introduce perfect multicollinearity into the data: the dummy columns always add up to 1, which is exactly the constant that the intercept c multiplies. This is the trap!
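
One way to see the problem numerically is to look at the rank of the input matrix. The following is a sketch with six made-up rows, not the original data: with an intercept column plus all three dummies, the matrix has four columns but only three of them are linearly independent.

```python
import numpy as np

# Hypothetical rows: 0 = Chicago, 1 = Boston, 2 = Phoenix
city_index = np.array([0, 1, 2, 1, 0, 2])

dummies = np.eye(3)[city_index]            # shape (6, 3), one 1 per row
intercept = np.ones((len(city_index), 1))  # column of ones for the intercept

X_full = np.hstack([intercept, dummies])            # intercept + 3 dummies
X_reduced = np.hstack([intercept, dummies[:, 1:]])  # intercept + 2 dummies

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print("columns:", X_full.shape[1], "rank:", rank_full)
print("columns:", X_reduced.shape[1], "rank:", rank_reduced)
```

When the rank is lower than the number of columns, ordinary least squares has no unique set of coefficients. Leaving one dummy column out brings the rank back in line with the column count.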

The refined solution:

Dummy variables count = Category count - 1

In our case, category count = 3 (Chicago, Boston, Phoenix).

Hence, to avoid the trap, we should create 3 - 1 = 2 dummy variables.

Can I show you how to do it in Python? Yes, sure.

Creating dummy variables for above data by using pandas get_dummies function:
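
The original code is not reproduced here, so the following is a minimal sketch of what this step could look like. The Height, Age and Weight values are made up; only the city names and the name “ModifiedData” come from the example.

```python
import pandas as pd

# Hypothetical "Boys health data"; all numeric values are made up
data = pd.DataFrame({
    "Height": [150, 160, 155],
    "Age": [12, 14, 13],
    "Weight": [40, 50, 45],
    "City": ["Chicago", "Boston", "Phoenix"],
})

# One 0/1 column per city; dtype=int gives 0/1 instead of True/False
ModifiedData = pd.get_dummies(data, columns=["City"], dtype=int)
print(ModifiedData)
```

The City column is replaced by City_Boston, City_Chicago and City_Phoenix, while the other columns are left untouched.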

From this “ModifiedData” we can remove any one of the three columns (City_Chicago, City_Boston, City_Phoenix). Please note that it can be any one of them, chosen at random :)
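
A minimal sketch of dropping one column by name, assuming a small made-up City column. City_Phoenix is picked arbitrarily here; any of the three would do.

```python
import pandas as pd

# Hypothetical data with only the City column, for illustration
data = pd.DataFrame({"City": ["Chicago", "Boston", "Phoenix"]})
ModifiedData = pd.get_dummies(data, columns=["City"], dtype=int)

# Drop any one dummy column; City_Phoenix is an arbitrary choice
FinalData = ModifiedData.drop(columns=["City_Phoenix"])
print(FinalData)
```

A boy from Phoenix is now represented by 0 in both remaining columns, so no information is lost.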

Alternate way:

An alternate way is to tell pandas directly that you want to drop the first dummy column from the data. We just need to set drop_first=True in the get_dummies function.
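
A minimal sketch with made-up values (only the city names come from the example):

```python
import pandas as pd

# Hypothetical data; Height and Age values are made up
data = pd.DataFrame({
    "Height": [150, 160, 155],
    "Age": [12, 14, 13],
    "City": ["Chicago", "Boston", "Phoenix"],
})

# drop_first=True removes the dummy column of the first category
encoded = pd.get_dummies(data, columns=["City"], drop_first=True, dtype=int)
print(encoded)
```

Note that “first” refers to the first category in sorted order, so here it is City_Boston that gets dropped, not City_Chicago.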

Conclusion:


Anyone looking for guidance in data science can reach me at the email address below.

Thank you

Aman(amanrai77@gmail.com)

I am a data scientist who helps businesses grow through machine learning consulting, along with data science initiatives such as mentoring and training.