# Dummy variable trap

*Image source — tractorsupply.com*

Hello, machine learning enthusiasts! A warm welcome to my new story, where I explain what dummy variables are and why we need them, along with an important concept known as the dummy variable trap.

Let's try to understand the concept of dummy variables first.

# The problem:

Any real-world data set that you or I get to work on has a mix of different types of variables. These predominantly fall into two buckets:

**Continuous variable**. Examples: *age, height, weight, salary*, etc. These variables can take any numeric value within a range; in principle, anywhere from minus infinity to plus infinity.

**Categorical variable**. Examples: *gender, city, job role, car color*, etc. These variables can take only specific values.

I have created a sample dummy data set called **“Boys health data”** for demonstration purposes.

The data covers three young boys from different cities in the USA, and my job is to train a linear regression model to predict the dependent feature (**Weight in this case**) using the independent features (*Height, Age and City in this case*).

A multiple linear regression model usually takes the form of an equation of a line as below:

y = m1x1 + m2x2 + m3x3 + … + mnxn + c

Where

y=dependent variable

x1,x2,x3…xn = independent variables

m1,m2,m3…mn = coefficients for x1,x2,x3…xn respectively

c = intercept of regression line
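To make the notation concrete, here is a minimal sketch that evaluates the equation for a single data point. The function name `predict` and all the numbers are made up for illustration; they are not fitted values.

```python
def predict(coefficients, features, intercept):
    """Evaluate y = m1*x1 + m2*x2 + ... + mn*xn + c for one data point."""
    if len(coefficients) != len(features):
        raise ValueError("need exactly one coefficient per feature")
    return sum(m * x for m, x in zip(coefficients, features)) + intercept


# Hypothetical coefficients and feature values, chosen only for illustration.
m = [2.0, 3.0]   # m1, m2
x = [10.0, 4.0]  # x1, x2
c = 5.0          # intercept

y = predict(m, x, c)  # 2*10 + 3*4 + 5
print(y)
```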

Next, I will plug in some **dummy coefficients** and write a multiple regression model for the sample data.

*Weight = 0.4 × Height + 0.8 × Age + 0.9 × City*

Observing the above equation, it is evident that the third term on the right-hand side does not make sense. *“0.9 × Chicago” or “0.9 × Boston” is meaningless when estimating “Weight”, which is a continuous variable!*
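The same point shows up in plain Python: a coefficient can scale a number, but multiplying it by a city name is not a defined operation. A small sketch, with a hypothetical height value:

```python
coefficient = 0.9

# A numeric feature can be scaled by a coefficient.
height_term = 0.4 * 150  # 150 is a made-up height value

# A text category cannot: Python raises TypeError for float * str.
try:
    city_term = coefficient * "Chicago"
except TypeError as error:
    city_term = None
    print(f"Cannot use a city name in the equation: {error}")
```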

# The Solution:

The solution to the above problem is to use dummy variables. A dummy variable is an artificial variable that takes the value 0 or 1 to mark the absence or presence of a category. A set of dummy variables can therefore represent an attribute with two or more distinct categories.

For example, if we create dummy variables for the City column in the above data, the modified data may look like this:

This data set can now be fed into the regression equation mentioned above.
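As a rough illustration of that step, the sketch below builds the indicator columns by hand and plugs them into the equation. The DataFrame `data` is only a stand-in for the “Boys health data”: every value in it, and every coefficient, is an assumption made for this example.

```python
import pandas as pd

# Hypothetical stand-in for the "Boys health data"; every value is made up.
data = pd.DataFrame({
    "Height": [150, 160, 155],
    "Age": [12, 14, 13],
    "City": ["Chicago", "Boston", "Phoenix"],
    "Weight": [40, 50, 45],
})

# Build one 0/1 indicator column per city by hand.
encoded = data.drop(columns="City")
for city in ["Chicago", "Boston", "Phoenix"]:
    encoded[f"City_{city}"] = (data["City"] == city).astype(int)

# Every term in the equation is now number * number.
# The coefficients are arbitrary placeholders, not fitted values.
estimate = (
    0.4 * encoded["Height"]
    + 0.8 * encoded["Age"]
    + 0.9 * encoded["City_Chicago"]
    + 0.5 * encoded["City_Boston"]
    + 0.1 * encoded["City_Phoenix"]
)

print(encoded)
print(estimate)
```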

# The TRAP:

Do you see a problem with the above training data? If not, please have a look at the equation below:

**City_Chicago + City_Boston + City_Phoenix = 1**

The above equation holds true for every data point in the training data. If we break the problem down further:

**City_Chicago** can be predicted from **City_Boston** and **City_Phoenix**

**City_Boston** can be predicted from **City_Chicago** and **City_Phoenix**

**City_Phoenix** can be predicted from **City_Chicago** and **City_Boston**

**If two values are known, the third one can be predicted!**
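This is easy to check. The sketch below uses three hypothetical rows, one boy per city, confirms that the dummy columns add up to 1 in every row, and rebuilds one column from the other two.

```python
import pandas as pd

# Hypothetical dummy columns for three boys, one from each city.
dummies = pd.DataFrame({
    "City_Chicago": [1, 0, 0],
    "City_Boston": [0, 1, 0],
    "City_Phoenix": [0, 0, 1],
})

# The three columns add up to 1 in every row.
row_sums = dummies.sum(axis=1)
print(row_sums.tolist())

# So any one column can be rebuilt exactly from the other two.
rebuilt_phoenix = 1 - dummies["City_Chicago"] - dummies["City_Boston"]
matches = bool((rebuilt_phoenix == dummies["City_Phoenix"]).all())
print(matches)
```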

About multicollinearity: In statistics, multicollinearity is a phenomenon in which one independent variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy.

Multicollinearity is a problem because independent variables should be independent. If the degree of correlation between variables is high enough, it can cause problems when you fit the model and interpret the results.

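Though we get rid of the categorical variable by using dummy variables, we introduce multicollinearity into the data. In fact, the three dummy columns always add up to 1, which is exactly the constant term behind the intercept c. With an intercept in the model, the collinearity is therefore perfect and the coefficients cannot be determined uniquely. *This is the TRAP!*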
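One way to see the trap numerically is to look at the rank of the design matrix. The sketch below uses NumPy and an assumed toy matrix with two hypothetical boys per city: an intercept column followed by the three city dummies.

```python
import numpy as np

# Design matrix: intercept column, then City_Chicago, City_Boston, City_Phoenix.
# The rows are hypothetical (two boys per city).
X = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
])

n_columns = X.shape[1]
rank = np.linalg.matrix_rank(X)

# rank < n_columns means one column is a linear combination of the others,
# so least squares has no unique set of coefficients.
print(f"columns: {n_columns}, rank: {rank}")
```

Here the rank comes out one lower than the number of columns, because the intercept column equals the sum of the three dummy columns.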

# The refined solution:

To avoid the **TRAP**, the solution is to create one dummy variable fewer than the number of levels in the categorical variable:

Dummy variable count = Category count - 1

In our case, category count = 3 (**Chicago, Boston, Phoenix**)

Hence, to avoid the **TRAP**, we should create 3 - 1 = 2 dummy variables. The dropped category becomes the baseline: a row with 0 in both remaining dummy columns belongs to it, so no information is lost.
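A small sketch of why two columns are enough: with Chicago as the baseline (an arbitrary choice for this example), every city can still be recovered from the two remaining dummies. The `cities` values and the `decode` helper are hypothetical.

```python
import pandas as pd

# Hypothetical City column.
cities = pd.Series(["Chicago", "Boston", "Phoenix", "Boston"], name="City")

# Keep dummies for Boston and Phoenix only; Chicago is the baseline.
two_dummies = pd.DataFrame({
    "City_Boston": (cities == "Boston").astype(int),
    "City_Phoenix": (cities == "Phoenix").astype(int),
})


def decode(row):
    """Recover the city name from the two remaining dummy columns."""
    if row["City_Boston"] == 1:
        return "Boston"
    if row["City_Phoenix"] == 1:
        return "Phoenix"
    return "Chicago"  # both dummies are 0, so this is the baseline category


recovered = two_dummies.apply(decode, axis=1)
print(recovered.tolist() == cities.tolist())
```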

## Can I show you how to do it in Python? Yes, sure.

Creating dummy variables for the above data using the pandas **get_dummies** function:
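A minimal sketch of that call, assuming a small DataFrame named `data` as a stand-in for the “Boys health data”. All the values are made up, and `dtype=int` is only there to get 0/1 columns instead of True/False.

```python
import pandas as pd

# Hypothetical stand-in for the "Boys health data"; every value is made up.
data = pd.DataFrame({
    "Height": [150, 160, 155],
    "Age": [12, 14, 13],
    "City": ["Chicago", "Boston", "Phoenix"],
    "Weight": [40, 50, 45],
})

# One indicator column per city, named City_<value>.
ModifiedData = pd.get_dummies(data, columns=["City"], dtype=int)
print(ModifiedData)
```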

From this **“ModifiedData”** we can remove any one of the three columns (City_Chicago, City_Boston, City_Phoenix). **Please note, it can be any column at random :)**
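For instance, dropping one dummy column by name could look like the sketch below. The values are hypothetical, `FinalData` is a name introduced only for this example, and City_Phoenix is an arbitrary pick.

```python
import pandas as pd

# Hypothetical dummy-encoded data; all values are made up.
ModifiedData = pd.DataFrame({
    "Height": [150, 160, 155],
    "Age": [12, 14, 13],
    "Weight": [40, 50, 45],
    "City_Boston": [0, 1, 0],
    "City_Chicago": [1, 0, 0],
    "City_Phoenix": [0, 0, 1],
})

# Drop any one dummy column; the dropped city becomes the baseline category.
FinalData = ModifiedData.drop(columns=["City_Phoenix"])
print(FinalData.columns.tolist())
```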

*Alternate way:*

An alternate way is to tell pandas directly that you want to drop the first dummy column from the data. We just need to set *drop_first=True* in the pandas **get_dummies** function.
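A short sketch of this option, again on made-up data. Note that “first” refers to the first category level, not the first row: for a text column pandas sorts the levels, so in this example the Boston dummy is the one that disappears.

```python
import pandas as pd

# Hypothetical stand-in for the "Boys health data"; every value is made up.
data = pd.DataFrame({
    "Height": [150, 160, 155],
    "Age": [12, 14, 13],
    "City": ["Chicago", "Boston", "Phoenix"],
    "Weight": [40, 50, 45],
})

# drop_first=True removes the dummy for the first (sorted) category level.
ModifiedData = pd.get_dummies(
    data, columns=["City"], drop_first=True, dtype=int
)
city_columns = [c for c in ModifiedData.columns if c.startswith("City_")]
print(city_columns)
```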

# Conclusion:

In this story, my intention was to make the concept of dummy variables and their hidden trap clear. Thank you for reading the story. Please feel free to share your feedback in the comments section.

You can also join my Facebook group “**Unfold Data Science**” here, where fellow data scientists and I discuss data science concepts and other queries and doubts related to the data science industry. This group is useful for data science aspirants as well.

Follow me on Quora for regular answers on data science here

Connect with me on LinkedIn here.

*Anyone looking for guidance in data science can reach me at the email address below.*

Thank you

Aman(amanrai77@gmail.com)