Quite often I have seen data scientists with a decent amount of experience struggle to explain “R squared” for a regression model. The idea for this story came from one such experience recently. My intention is to make “R squared” absolutely clear for readers.
Let's refresh with a few basic question-and-answer pairs:
- What is a regression model? A regression model estimates a dependent variable from one or more independent variables. In a regression model, the dependent variable is continuous.
- What are actual values? The true values of the dependent variable in the training or test data. These are typically denoted by “y”.
- What are predicted values? The values of the dependent variable that the model predicts, for example on test data. These are typically denoted by “y hat”.
- What is regression error? The difference between an actual value and the corresponding predicted value is the error of the regression model. Various metrics summarize this error, and “R squared” is one of them.
Let's try to understand this with a simple example: estimating the height of a person (dependent variable) from the person's weight (independent variable). The small data set created below is used for demo purposes.
I will use RStudio for the demo and create this dummy data there:
#create dummy data
mydata <- data.frame(
  Height = c(175,172,155),
  Weight = c(78,82,62)
)
Let's train the linear regression model in RStudio using the lm function:
#Train linear model
linearMod <- lm(Height ~ Weight , data=mydata)
Check the details of the model using the “summary” function:
summary(linearMod) #Check model summary
As observed, R-squared for the model is 0.8952.
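If you would rather read the number from code than from the printed summary, here is a minimal sketch. It assumes the same three-row dummy data as above; the variable name `r2` is my own, while `r.squared` is the component that `summary()` returns for an `lm` fit.

```r
# Recreate the dummy data from this story
mydata <- data.frame(
  Height = c(175, 172, 155),
  Weight = c(78, 82, 62)
)

# Fit the same simple linear regression
linearMod <- lm(Height ~ Weight, data = mydata)

# summary() returns a list; r.squared holds the multiple R-squared
r2 <- summary(linearMod)$r.squared
print(round(r2, 4))  # 0.8952
```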
We will try to reach this number, 0.8952, by manual calculation.
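Let us create a similar data set for the independent feature, Weight in this case, as test data. It needs to be a data frame with a Weight column so that predict can match it to the model:
mytestdata <- data.frame(Weight = c(78,82,62)) #create test data
Let us predict the dependent feature, Height in this case, to obtain the predicted values. Note that the argument is called newdata:
predict(linearMod, newdata = mytestdata) #Predict values for test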
The output of the above command is a vector of predicted values for the 3 data points in the test data, printed in the R console.
So far, so good. We now have two sets of values: actual values and predicted values.
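To see where these predictions come from, here is a small sketch that rebuilds them from the fitted intercept and slope. It assumes the same dummy data; the names `b`, `manual_pred` and `model_pred` are mine and only for illustration.

```r
# Same dummy data and model as in this story
mydata <- data.frame(Height = c(175, 172, 155), Weight = c(78, 82, 62))
linearMod <- lm(Height ~ Weight, data = mydata)

# coef() returns a named vector with "(Intercept)" and "Weight"
b <- coef(linearMod)

# Prediction by hand: intercept + slope * weight
manual_pred <- unname(b["(Intercept)"] + b["Weight"] * mydata$Weight)

# Prediction via predict(); newdata must be a data frame with a Weight column
model_pred <- unname(predict(linearMod, newdata = data.frame(Weight = mydata$Weight)))

print(round(manual_pred, 2))
print(isTRUE(all.equal(manual_pred, model_pred)))  # TRUE
```

Both routes should give the same three numbers, since predict for a simple linear model is just the fitted line evaluated at each weight.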
Calculate R square:
Actual values = [175,172,155]
Predicted values = [171.19,175.04,155.76]
Mean of actual values = (175+172+155)/3 = 167.33
To calculate the total sum of squares (TSS), we take the sum of squared differences between the individual actual values and their mean. Hence TSS can be calculated as:
Total sum of squares = (175 - 167.33)² + (172 - 167.33)² + (155 - 167.33)² ≈ 232.66
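The same TSS calculation can be sketched in R as below. The variable names `actual` and `tss` are my own. Because R keeps the mean at full precision, the result can differ from the hand calculation in the second decimal place.

```r
# Actual heights from the dummy data
actual <- c(175, 172, 155)

# Total sum of squares: squared deviations from the mean, summed
tss <- sum((actual - mean(actual))^2)
print(round(tss, 2))  # 232.67
```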
To calculate the residual sum of squares (RSS), we take the sum of squared differences between the actual values and the predicted values. Hence RSS can be calculated as:
Residual sum of squares = (175 - 171.19)² + (172 - 175.04)² + (155 - 155.76)² ≈ 24.33
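Here is a hedged sketch of the RSS step in R, done two ways: once with the rounded predictions listed above, and once with the model's own full-precision residuals. The names `rss_rounded` and `rss_exact` are mine. The two results differ slightly, which is only a rounding effect.

```r
# Same dummy data and model as in this story
mydata <- data.frame(Height = c(175, 172, 155), Weight = c(78, 82, 62))
linearMod <- lm(Height ~ Weight, data = mydata)

actual <- mydata$Height

# RSS from the rounded predictions used in the hand calculation
predicted_rounded <- c(171.19, 175.04, 155.76)
rss_rounded <- sum((actual - predicted_rounded)^2)

# RSS from the model's residuals at full precision
rss_exact <- sum(residuals(linearMod)^2)

print(c(rounded = rss_rounded, exact = rss_exact))
```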
The mathematical formula for R-squared is 1-(RSS/TSS), which is the same as (TSS-RSS)/TSS. The meaning of this formula is explained below.
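Written out in standard notation, the formula and its two ingredients look like this. The index i, the count n and the symbol for the mean are my notation, chosen to match the “y” and “y hat” convention used earlier in this story.

```latex
\[
R^2 \;=\; 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}
    \;=\; \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}},
\qquad
\mathrm{TSS} = \sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2,
\qquad
\mathrm{RSS} = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2
\]
```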
- R-squared is 100% minus the percentage left unexplained by the model. In other words, the percentage of TSS that is explained by the model is the R-squared of the model.
- The residuals (RSS) are the error of the model, so they are the part the model leaves unexplained. RSS as a share of TSS is subtracted from 100% to get R-squared.
Please read the above two points again to understand the formula better.
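The “explained” and “unexplained” shares can be checked directly with a small sketch. It assumes the same dummy data and a linear model with an intercept, where the two shares add up to 1. The name `ess` (explained sum of squares) and the other variable names are my own.

```r
# Same dummy data and model as in this story
mydata <- data.frame(Height = c(175, 172, 155), Weight = c(78, 82, 62))
linearMod <- lm(Height ~ Weight, data = mydata)

actual <- mydata$Height
fitted_vals <- fitted(linearMod)

tss <- sum((actual - mean(actual))^2)       # total variation
rss <- sum((actual - fitted_vals)^2)        # unexplained by the model
ess <- sum((fitted_vals - mean(actual))^2)  # explained by the model

unexplained_share <- rss / tss
explained_share <- ess / tss

print(round(c(explained = explained_share, unexplained = unexplained_share), 4))
```

The explained share printed here is the R-squared of the model, and the unexplained share is what gets subtracted from 100%.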
Putting the values into the above equation, R square = 1-(24.33/232.66) = 0.8954. The small difference from 0.8952 comes from rounding the mean and the predicted values to two decimals.
The summary function in R gave R-squared as 0.895 (to three decimal places).
Manual calculation also gave us R-squared as 0.895 (to three decimal places).
Hence we could calculate R square by hand and match it with the RStudio output.
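To tie it all together, here is a minimal end-to-end sketch that repeats the manual calculation at full precision and compares it with what summary reports. It assumes the same dummy data; the names `r2_manual` and `r2_summary` are mine.

```r
# Same dummy data and model as in this story
mydata <- data.frame(Height = c(175, 172, 155), Weight = c(78, 82, 62))
linearMod <- lm(Height ~ Weight, data = mydata)

actual <- mydata$Height
predicted <- fitted(linearMod)  # full-precision predictions on the training data

tss <- sum((actual - mean(actual))^2)  # total sum of squares
rss <- sum((actual - predicted)^2)     # residual sum of squares

r2_manual <- 1 - rss / tss
r2_summary <- summary(linearMod)$r.squared

print(round(c(manual = r2_manual, summary = r2_summary), 4))
print(isTRUE(all.equal(r2_manual, r2_summary)))  # TRUE
```

Without the intermediate rounding, the manual value and the summary value agree to machine precision.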
This story was intended to give you a clear-cut idea of what R squared is for a regression model and to help you be more confident in data science model building.
Thank you for reading the story. Please feel free to share your feedback in the comments section.