# Decoding R squared: back to basics

*Quite often I have seen data scientists with a decent amount of experience struggle to explain **“R squared”** for a regression model. The idea for this story came from one such experience recently. My intention is to make **“R squared”** absolutely clear for readers.*

*Let's refresh a few basics with question-answer pairs:*

- What is a regression model?
  *A regression model estimates a dependent variable based on one or more independent variable(s). In a regression model, the dependent variable is continuous in nature.*
- What are actual values?
  *True data value(s) of the dependent variable in training or test data. These are typically denoted by **“y”**.*
- What are predicted values?
  *Value(s) of the dependent variable predicted by the model, for example on test data. These are typically denoted by **“y hat”**.*
- What is regression error?
  *The difference between the actual value and the predicted value is called the **error** of the regression model. There are various metrics to represent this error; **“R squared”** is one of these metrics.*

Let's try to understand this with a simple example: estimating the **height (dependent variable)** of a person from the *weight of the person (independent variable)*. Check the data set below for demo purposes.

| Height | Weight |
| --- | --- |
| 175 | 78 |
| 172 | 82 |
| 155 | 62 |

# Training model:

*I will be using RStudio for the demo and will create the above dummy data in it:*

```r
# Create dummy data
mydata <- data.frame(
  Height = c(175, 172, 155),
  Weight = c(78, 82, 62)
)
```

*Let's train the linear regression model in RStudio using the **lm** function:*

```r
# Train linear model
linearMod <- lm(Height ~ Weight, data = mydata)
```

*Check the details of the model using the **summary** function:*

`summary(linearMod) #Check model summary`

In the summary output, the Multiple R-squared for the model is **0.8952**.

*We will try to reach this number, 0.8952, by manual calculation.*
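If you would rather read the value programmatically than from the printed summary, a minimal sketch follows. It rebuilds the same dummy data so it runs on its own; the variable name `model_r2` is only an illustrative choice.

```r
# Recreate the article's data and model so this snippet is self-contained
mydata <- data.frame(
  Height = c(175, 172, 155),
  Weight = c(78, 82, 62)
)
linearMod <- lm(Height ~ Weight, data = mydata)

# summary() on an lm object returns a list-like object;
# its r.squared element holds the multiple R-squared
model_r2 <- summary(linearMod)$r.squared
print(round(model_r2, 4))
```

Keeping the value in a variable makes it easy to compare against the manual calculation later.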

# Testing model:

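*Let us create a similar data set for the independent feature, **Weight** in this case, as **test data**. It is stored as a data frame with a `Weight` column, because `predict` matches new data to the model by column name:*

`mytestdata <- data.frame(Weight = c(78,82,62)) #create test data`

*Let us predict the dependent feature, **Height** in this case, to obtain the **predicted values**. The new data is passed through the `newdata` argument:*

`predict(linearMod, newdata = mytestdata) #Predict values for test`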

The output of the above command is a vector of predicted values for the 3 data points in the test data. The R console displays these values; rounded to two decimals, they are listed in the next section.

*So far, so good*. We now have two sets of values: actual values and predicted values.
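To see where the predicted values come from, the sketch below reproduces them by hand from the fitted intercept and slope. It assumes the test weights are held in a data frame with a `Weight` column; the names `predicted`, `coefs` and `manual` are illustrative.

```r
# Recreate the article's data and model so this snippet is self-contained
mydata <- data.frame(
  Height = c(175, 172, 155),
  Weight = c(78, 82, 62)
)
linearMod <- lm(Height ~ Weight, data = mydata)

# New observations go in a data frame whose column name matches the model formula
mytestdata <- data.frame(Weight = c(78, 82, 62))
predicted <- predict(linearMod, newdata = mytestdata)

# The same predictions by hand: intercept + slope * weight
coefs <- coef(linearMod)
manual <- coefs[["(Intercept)"]] + coefs[["Weight"]] * mytestdata$Weight

print(round(predicted, 2))
# TRUE when both routes agree up to floating-point tolerance
print(all.equal(unname(predicted), manual))
```

For a simple linear model, `predict` does no more than apply this straight-line equation to each new weight.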

# Calculate R square:

Actual values = [175,172,155]

Predicted values = [171.19,175.04,155.76]

mean of actual values = (175+172+155)/3=167.33

To calculate the **Total sum of squares (TSS)**, we take the sum of the squared differences between each actual value and the mean of the actual values. Hence TSS can be calculated as:

*Total sum of squares = (175-167.33)²+(172-167.33)²+(155-167.33)² = 232.66*
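The same TSS arithmetic can be sketched in R as a quick check. The variable names `actual` and `tss` are illustrative, and the result should agree with the hand-computed figure up to rounding.

```r
# Actual heights from the dummy data
actual <- c(175, 172, 155)

# Sum of squared differences between each actual value and their mean
tss <- sum((actual - mean(actual))^2)
print(tss)
```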

To calculate the **Residual sum of squares (RSS)**, we take the sum of the squared differences between the actual values and the predicted values. Hence RSS can be calculated as:

*Residual sum of squares = (175-171.19)²+(172-175.04)²+(155-155.76)² = 24.33*
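As a rough cross-check, the sketch below computes RSS twice: once from the two-decimal predicted values used in the hand calculation, and once from the model's unrounded residuals. The names `rss_rounded` and `rss_exact` are illustrative; expect the two results to differ slightly, because the hand calculation works from rounded inputs.

```r
actual <- c(175, 172, 155)

# RSS from the two-decimal predicted values used in the hand calculation
predicted_rounded <- c(171.19, 175.04, 155.76)
rss_rounded <- sum((actual - predicted_rounded)^2)

# RSS from the fitted model's unrounded residuals
mydata <- data.frame(Height = actual, Weight = c(78, 82, 62))
linearMod <- lm(Height ~ Weight, data = mydata)
rss_exact <- sum(residuals(linearMod)^2)

print(c(rounded = rss_rounded, exact = rss_exact))
```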

The mathematical formula for calculating R-squared is **1-(RSS/TSS)**, which is derived from the formula *(TSS-RSS)/TSS*. The meaning of this formula is explained below.

- **100% minus “unexplained percent by model”**. To put it in other words, the *percentage of TSS* **which is** *explained* **by the model is known as R-squared** for the model.
- **Residuals (RSS)** are the *error* **of the model and hence these are the** *unexplained* **part of the model. These are subtracted from 100%** to get R-squared.

*Please read the above two points again to understand the formula better.*

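**Putting the values into the above equation, R-squared = 1-(24.33/232.66) = 0.8954**

The small gap from the 0.8952 reported by R comes from rounding the intermediate values to two decimals in the hand calculation.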

# Comparison:

The **summary** function in R gave R-squared as 0.895 (to three decimal places).

**Manual calculation** also gave us R-squared as 0.895 (to three decimal places).

Hence we could calculate R-squared by hand and match it with the RStudio output.
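The whole comparison can be wrapped into a small helper, sketched below. The function `r_squared` is a hypothetical name, not part of base R; it applies 1-(RSS/TSS) to any pair of actual and predicted vectors and is checked here against the value reported by `summary`.

```r
# Hypothetical helper: R-squared as 1 - RSS/TSS
r_squared <- function(actual, predicted) {
  rss <- sum((actual - predicted)^2)     # unexplained variation
  tss <- sum((actual - mean(actual))^2)  # total variation around the mean
  1 - rss / tss
}

# Recreate the article's data and model so this snippet is self-contained
mydata <- data.frame(
  Height = c(175, 172, 155),
  Weight = c(78, 82, 62)
)
linearMod <- lm(Height ~ Weight, data = mydata)

# Unrounded fitted values avoid the rounding gap of the hand calculation
manual_r2 <- r_squared(mydata$Height, fitted(linearMod))
model_r2 <- summary(linearMod)$r.squared

# TRUE when the manual value matches the one reported by summary()
print(all.equal(manual_r2, model_r2))
```

Using the unrounded fitted values, the manual result and the `summary` value should agree to floating-point precision rather than only to three decimals.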

# Conclusion:

This story was intended to give you a clear idea of what R squared means for a regression model, and to help you be more confident in data science model building.

Thank you for reading the story. Please feel free to share your feedback in the comments section.

*Anyone looking for guidance in data science can reach me at the email address below.*

Thank you

Aman (amanrai77@gmail.com)