Fine-Tuning your Linear Regression Model


Linear regression is a statistical technique that models a linear relationship between x (input) and y (output), hence the name. The equation for univariate regression can be given as

y = Beta1 + Beta2 * x

where y is the output (target, dependent variable), x is the input (feature, independent variable), and Beta1 and Beta2 are the intercept and slope of the best-fit line, respectively. Beta1 and Beta2 are also known as the regression coefficients.

The task is to find the regression coefficients such that the line best fits the given data. Linear regression makes several assumptions about the data, which makes it restrictive: it fails to build a good model on datasets that do not satisfy them. It is therefore imperative to check these assumptions, and to correct for violations, before relying on the model.
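For the univariate case, the best-fit coefficients have a simple closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The snippet below is a minimal sketch of that calculation; the function name fit_line and the data points are made up for illustration.

```python
from statistics import mean

def fit_line(x, y):
    """Return (beta1, beta2): intercept and slope of the least-squares line."""
    x_bar, y_bar = mean(x), mean(y)
    # Slope = sum of cross-deviations / sum of squared deviations of x
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta2 = sxy / sxx
    # The fitted line always passes through the point (x_bar, y_bar)
    beta1 = y_bar - beta2 * x_bar
    return beta1, beta2

# Made-up points that lie exactly on y = 1 + 2x
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
beta1, beta2 = fit_line(x, y)
print("intercept:", beta1, "slope:", beta2)
```

With noisy data the same function returns the line that minimises the sum of squared residuals.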


Let us consider an example where we are trying to predict the sales of a company based on its marketing spends in various media like TV, Radio and Newspapers. The dataset is shown below:

Here the columns TV, Radio and Newspaper are the inputs (independent variables) and Sales is the output (dependent variable). We will try to fit a linear regression to the above dataset. Below is the Python code for it:
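The article's original code and dataset are not reproduced here, so the following is only a minimal sketch of such a fit. It assumes NumPy is available and uses synthetic data: the column order (TV, Radio, Newspaper), the "true" coefficients and the noise level are all made up for illustration. A library estimator such as scikit-learn's LinearRegression would do the same job through its fit and predict methods.

```python
import numpy as np

# Synthetic stand-in for the advertising data (NOT the article's dataset).
rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0, 100, size=(n, 3))        # columns: TV, Radio, Newspaper
true_coef = np.array([0.05, 0.2, 0.0])      # assumed for illustration only
y = 3.0 + X @ true_coef + rng.normal(0, 1.0, size=n)   # Sales

# Add a column of ones so the intercept is estimated together with the slopes.
X_design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, coef = beta[0], beta[1:]

# Predictions for the rows the model was fitted on
y_pred = X_design @ beta

print("intercept:", round(float(intercept), 3))
print("coefficients (TV, Radio, Newspaper):", np.round(coef, 3))
```

Because the data is synthetic, the recovered coefficients should land close to the assumed ones; on real data there is no such ground truth to compare against.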

Once the linear regression model has been fitted to the data, we use the predict function to see how well the model predicts sales for the given marketing spends.

When we apply the regression equation to the given data, there will be a difference between the observed values of y and the predicted values of y. These differences are referred to as residuals:

Residual e = Observed value – Predicted Value



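The score function reports how well the model fits the data. For regression models (for example in scikit-learn) the score is the coefficient of determination, R^2: the share of the variance in the target that the model explains. It is not a classification-style accuracy. When computed on held-out data, it indicates how well the model predicts new data points.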
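To make the link between residuals and the score concrete, here is a small sketch that computes both by hand, assuming the score is R^2 (as it is for scikit-learn regressors). The observed and predicted values are made up.

```python
def residuals(y_true, y_pred):
    """Residual e = observed value - predicted value, per observation."""
    return [obs - pred for obs, pred in zip(y_true, y_pred)]

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum(e ** 2 for e in residuals(y_true, y_pred))
    ss_tot = sum((obs - y_bar) ** 2 for obs in y_true)
    return 1 - ss_res / ss_tot

# Made-up observed and predicted sales
y_true = [10.0, 12.0, 15.0, 19.0]
y_pred = [11.0, 12.0, 14.0, 19.0]

res = residuals(y_true, y_pred)
r2 = r_squared(y_true, y_pred)
print("residuals:", res)
print("R^2:", round(r2, 4))
```

An R^2 of 1 means the predictions match the observations exactly; a value near 0 means the model does no better than always predicting the mean.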

Assumptions for Linear Regression

1. Linearity



Linear regression can capture only linear relationships, so there is an underlying assumption that the features and the target are linearly related. Plotting a scatterplot of every individual feature against the dependent variable and checking each for linearity is tedious. Instead, we can check directly by plotting the actual target values from the dataset against the values predicted by our linear model. If the points follow a roughly straight line, we can take the linearity assumption to be reasonable.

2. Normality check for Residuals

To test the residuals for normality, we can use the Anderson-Darling test.


Each test will return at least two things:

Statistic: A quantity calculated by the test that can be interpreted in the context of the test via comparing it to critical values from the distribution of the test statistic.

p-value: Used to interpret the test, in this case whether the sample was drawn from a Gaussian distribution.

Here H0 is that the sample was drawn from a normal distribution.

If p-value <= alpha (0.05): reject H0 => not normally distributed

If p-value > alpha (0.05): fail to reject H0 => consistent with a normal distribution

Since our p-value of 2.88234545e-09 is <= 0.05, we reject H0 in favour of the alternate hypothesis, which tells us that the data is not normally distributed. To bring the data closer to a normal distribution, we can apply log, square root or power transformations.
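The test itself is not shown above, so here is a self-contained sketch of an Anderson-Darling normality check using only the standard library. It estimates the mean and standard deviation from the sample and converts the statistic to an approximate p-value with the D'Agostino-Stephens formulas; the function name and both samples are illustrative. Libraries such as statsmodels (normal_ad) and SciPy (anderson) provide tested implementations, which are the better choice in practice.

```python
import math
from statistics import NormalDist, mean, stdev

def anderson_darling_normal(sample):
    """Return (A2, approximate p-value) for H0: the sample is Gaussian."""
    n = len(sample)
    mu, sigma = mean(sample), stdev(sample)
    z = sorted((v - mu) / sigma for v in sample)
    std_normal = NormalDist()
    total = 0.0
    for i in range(n):
        # log F(z_(i)) + log(1 - F(z_(n+1-i))), using 1 - F(z) = F(-z)
        total += (2 * i + 1) * (
            math.log(std_normal.cdf(z[i]))
            + math.log(std_normal.cdf(-z[n - 1 - i]))
        )
    a2 = -n - total / n
    # Small-sample adjustment, then a piecewise p-value approximation
    # (intended for moderate values of the statistic)
    a2_adj = a2 * (1 + 0.75 / n + 2.25 / n ** 2)
    if a2_adj < 0.2:
        p = 1 - math.exp(-13.436 + 101.14 * a2_adj - 223.73 * a2_adj ** 2)
    elif a2_adj < 0.34:
        p = 1 - math.exp(-8.318 + 42.796 * a2_adj - 59.938 * a2_adj ** 2)
    elif a2_adj < 0.6:
        p = math.exp(0.9177 - 4.279 * a2_adj - 1.38 * a2_adj ** 2)
    else:
        p = math.exp(1.2937 - 5.709 * a2_adj + 0.0186 * a2_adj ** 2)
    return a2, p

n = 100
# Illustrative samples: evenly spaced normal quantiles vs. exponential quantiles
normal_like = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
skewed = [-math.log(1 - (i + 0.5) / n) for i in range(n)]

stat_normal, p_normal = anderson_darling_normal(normal_like)
stat_skewed, p_skewed = anderson_darling_normal(skewed)

alpha = 0.05
for name, p in [("normal-like", p_normal), ("skewed", p_skewed)]:
    verdict = "reject H0 (not normal)" if p <= alpha else "fail to reject H0"
    print(f"{name}: p = {p:.3g} -> {verdict}")
```

In the article's workflow the sample passed to such a function would be the model's residuals.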

To figure out which transformation suits our data, we must try all of them and check which one gives the best accuracy. I have used a power transformation for this dataset.
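As a rough illustration of how these transformations change the shape of a distribution, the sketch below applies log, square root and a Box-Cox power transform to a right-skewed synthetic sample and compares the skewness. Note the swap: skewness is used here as a quick screening measure, not the accuracy comparison described above, and the Box-Cox lambda of 0.25 is an arbitrary assumption (a fitted transformer would estimate it from the data).

```python
import math
from statistics import NormalDist, mean

def skewness(values):
    """Third standardized moment; 0 for a symmetric distribution."""
    m = mean(values)
    m2 = mean([(v - m) ** 2 for v in values])
    m3 = mean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5

def box_cox(values, lam):
    """Box-Cox power transform for strictly positive values."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

# Right-skewed synthetic sample: exp of evenly spaced normal quantiles
n = 200
raw = [math.exp(NormalDist().inv_cdf((i + 0.5) / n)) for i in range(n)]

candidates = {
    "raw": raw,
    "log": [math.log(v) for v in raw],
    "sqrt": [math.sqrt(v) for v in raw],
    "box-cox (lambda=0.25)": box_cox(raw, 0.25),  # lambda is hypothetical
}
skews = {name: skewness(vals) for name, vals in candidates.items()}
for name, s in skews.items():
    print(f"{name:>22}: skewness = {s:+.3f}")
```

This sample was built so that the log transform makes it symmetric; on real data the best transformation has to be found by trying the candidates.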

After applying the transformation, we can once again check for normality.

Since 0.10111624927223171 > 0.05, we fail to reject H0, which means the data is consistent with a normal distribution. The regplot shows the same.

3. Multicollinearity



Multicollinearity refers to correlation between the independent variables. It is considered a disturbance in the data and, if present, weakens the statistical power of the regression model. Pair plots and heat maps help in identifying highly correlated features.

Why should multicollinearity be avoided in linear regression?

The interpretation of a regression coefficient is that it represents the mean change in the target for each unit change in a feature when you hold all of the other features constant. However, when features are correlated, a change in one feature in turn shifts other features. The stronger the correlation, the more difficult it is to change one feature without changing another. It becomes difficult for the model to estimate the relationship between each feature and the target independently because the features tend to change in unison.


The Variance Inflation Factor (VIF) is a measure of collinearity among the predictor variables in a multiple regression. For a given predictor, it is the ratio of the variance of its coefficient in the full model to the variance of that coefficient if the predictor were fit alone.

VIF_i = 1 / (1 - R_i^2), where R_i^2 is the R^2 obtained by regressing the ith predictor on all the other predictors.

VIF measures how much the variance of an estimated regression coefficient increases because the predictors are correlated. The higher the VIF of the ith regressor, the more strongly it is correlated with the other variables.

As a rule of thumb, a VIF <= 4 suggests no multicollinearity, whereas a value >= 10 implies serious multicollinearity.

Since none of the VIF values is greater than 10, we find no serious multicollinearity among the features and hence retain all 3 of them.
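The VIF calculation can be sketched directly from the formula: regress each feature on the remaining ones, take the R^2 of that fit, and compute 1 / (1 - R^2). The snippet assumes NumPy and uses synthetic columns in which Newspaper is deliberately built to be correlated with Radio, so the numbers say nothing about the article's dataset. statsmodels ships a ready-made variance_inflation_factor function for real use.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress feature j on the other features (with an intercept)
        design = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ beta
        r2 = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic features; newspaper is deliberately correlated with radio
rng = np.random.default_rng(1)
tv = rng.uniform(0, 300, 200)
radio = rng.uniform(0, 50, 200)
newspaper = 0.9 * radio + rng.normal(0, 2, 200)

vifs = vif(np.column_stack([tv, radio, newspaper]))
for name, value in zip(["TV", "Radio", "Newspaper"], vifs):
    print(f"{name:>9}: VIF = {value:.2f}")
```

In this constructed example TV stays near 1 while Radio and Newspaper exceed 10, which is the pattern that would justify dropping or combining one of the two correlated features.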

4. Autocorrelation



Autocorrelation refers to the degree of correlation between the values of the same variables across different observations in the data.  The concept of autocorrelation is most often discussed in the context of time series data in which observations occur at different points in time (e.g., air temperature measured on different days of the month).  For example, one might expect the air temperature on the 1st day of the month to be more similar to the temperature on the 2nd day compared to the 31st day.  If the temperature values that occurred closer together in time are, in fact, more similar than the temperature values that occurred farther apart in time, the data would be autocorrelated.


However, autocorrelation can also occur in cross-sectional data when the observations are related in some other way.  In a survey, for instance, one might expect people from nearby geographic locations to provide more similar answers to each other than people who are more geographically distant.  Similarly, students from the same class might perform more similarly to each other than students from different classes.  Thus, autocorrelation can occur if observations are dependent in aspects other than time. 

Autocorrelation can cause problems in conventional analyses (such as ordinary least squares regression) that assume independence of observations. In a regression analysis, autocorrelation of the regression residuals can also occur if the model is incorrectly specified.  For example, if you are attempting to model a simple linear relationship but the observed relationship is non-linear (i.e., it follows a curved or U-shaped function), then the residuals will be autocorrelated.

How to detect Autocorrelation

Autocorrelation can be tested with the help of the Durbin-Watson test. The null hypothesis of the test is that there is no serial correlation. The Durbin-Watson test statistic is defined as:

DW = sum_{t=2..T} (e_t - e_{t-1})^2 / sum_{t=1..T} e_t^2

where e_t is the residual of the t-th observation and T is the number of observations.

The DW statistic always lies between 0 and 4. DW = 2 implies no autocorrelation, 0 < DW < 2 implies positive autocorrelation, and 2 < DW < 4 indicates negative autocorrelation.

The DW value for our model is around 2, which implies that there is no autocorrelation.

The presence of autocorrelation implies that there is some information that our model fails to capture.
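The Durbin-Watson statistic is simple enough to compute by hand: it is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below does exactly that on two made-up residual sequences, one that flips sign at every step and one that drifts steadily.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    num = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    den = sum(e ** 2 for e in residuals)
    return num / den

# Made-up residuals, in observation order
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # neighbours disagree
trending = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]      # neighbours are similar

dw_alt = durbin_watson(alternating)
dw_trend = durbin_watson(trending)
print("alternating:", round(dw_alt, 3), "(above 2: negative autocorrelation)")
print("trending:   ", round(dw_trend, 3), "(below 2: positive autocorrelation)")
```

The residuals must be passed in their natural order (for example, by time), since shuffling them changes the statistic.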

5. Homoscedasticity



Homoscedasticity describes a situation in which the variance of the error term (that is, the “noise” or random disturbance in the relationship between the features and the target) is the same across all values of the independent variables. A scatter plot of residual values vs predicted values is a good way to check for homoscedasticity. There should be no clear pattern in the distribution; if there is a specific pattern, the data is heteroskedastic.

Generally, non-constant variance arises in the presence of outliers or extreme leverage values. These values get too much weight and thereby disproportionately influence the model’s performance.

The leftmost graph shows no definite pattern, i.e. constant variance among the residuals. The middle graph shows a specific pattern where the error increases and then decreases with the predicted values, violating the constant variance rule. The rightmost graph also exhibits a specific pattern, where the error decreases with the predicted values, depicting heteroskedasticity.

From the above plot we can infer a U-shaped pattern, hence the data is heteroskedastic.
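When a plot is not at hand, a rough numeric stand-in is to sort the residuals by predicted value, split them into a few bins, and compare the residual variance per bin. This is only an informal check rather than a formal test, and the function name, the number of bins and both residual series below are assumptions made for illustration.

```python
from statistics import pvariance

def residual_variance_by_bin(predicted, residuals, n_bins=3):
    """Sort residuals by predicted value and return the variance in each bin."""
    ordered = [r for _, r in sorted(zip(predicted, residuals))]
    size = len(ordered) // n_bins
    bins = [ordered[i * size:(i + 1) * size] for i in range(n_bins - 1)]
    bins.append(ordered[(n_bins - 1) * size:])   # last bin takes the remainder
    return [pvariance(b) for b in bins]

predicted = [float(i) for i in range(1, 31)]

# Constant spread: residuals alternate between +1 and -1
constant_spread = [1.0 if i % 2 == 0 else -1.0 for i in range(30)]
# Growing spread: residual size is proportional to the predicted value
growing_spread = [0.1 * p * (1 if i % 2 == 0 else -1) for i, p in enumerate(predicted)]

homo_var = residual_variance_by_bin(predicted, constant_spread)
hetero_var = residual_variance_by_bin(predicted, growing_spread)
print("constant spread:", [round(v, 3) for v in homo_var])
print("growing spread: ", [round(v, 3) for v in hetero_var])
```

Similar variances across bins are consistent with homoscedasticity; a clear trend or a U shape across bins points to heteroskedasticity.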

How to handle Heteroskedasticity

Redefine the variables

Weighted regression

Transform the dependent variable

Even after transforming the dependent variable, the accuracy remains the same for this data.

The coefficients and intercept for our final model are:

The equation now gets transformed as:

sales= 0.2755*TV + 0.6476*Radio + 0.00856*Newspaper – 0.2567
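The equation can be wrapped in a small function so that predictions are easy to reproduce by hand. The coefficients and intercept are the ones quoted above; the input values in the example are hypothetical and stand for already-transformed spends, since the model was fitted on power-transformed data.

```python
# Coefficients and intercept quoted in the article's final equation
COEF = {"TV": 0.2755, "Radio": 0.6476, "Newspaper": 0.00856}
INTERCEPT = -0.2567

def predict_sales(tv, radio, newspaper):
    """Evaluate the fitted equation on (already transformed) inputs."""
    return (
        COEF["TV"] * tv
        + COEF["Radio"] * radio
        + COEF["Newspaper"] * newspaper
        + INTERCEPT
    )

# Hypothetical transformed inputs, for illustration only
example = predict_sales(tv=10.0, radio=5.0, newspaper=8.0)
print(round(example, 5))
```

Feeding raw spends straight into this function would give misleading results, because the coefficients apply to the transformed scale.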

Question 1: My company is currently spending 100$, 48$ and 85$ (in thousands) on advertising in TV, Radio and Newspaper. What will my sales be in the next quarter? I want to improve sales to 16 (million$).

Create the test data and transform the input using the same power transformation that we applied to satisfy the test for normality.

Manually, by substituting the data points in the linear equation we get the sales to be

The prediction by our linear model is

  1. How much do I need to invest in TV advertisement to improve sales to 20M?

Target – 20 million

Current sales – 16.58

Difference = 3.42

Holding the other inputs fixed, the amount to be added to the (transformed) TV input is the difference divided by the TV coefficient: 3.42/0.2755 = 12.413
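The arithmetic behind this step follows from the linearity of the model: raising one input by d raises the prediction by coefficient * d, so the required d is the sales gap divided by that coefficient. A short sketch using the figures quoted above (the function name is made up):

```python
def extra_input_needed(target, current, coefficient):
    """Increase in one (transformed) input needed to close the sales gap,
    holding the other inputs fixed."""
    return (target - current) / coefficient

target_sales = 20.0      # million $
current_sales = 16.58    # current prediction quoted in the article
tv_coefficient = 0.2755

tv_delta = extra_input_needed(target_sales, current_sales, tv_coefficient)
new_sales = current_sales + tv_coefficient * tv_delta

print("extra transformed TV input:", round(tv_delta, 3))
print("sales after the increase:", round(new_sales, 2))
```

The same function works for the other channels by passing their coefficients, for example 0.6476 for Radio.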

The new equation is:

We can see that the sales have now reached 20 million$.

Since we have applied a power transformation, to get back the original data we have to apply an inverse power transformation
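The article does not show which power transformer was fitted, so the following only illustrates the idea of inverting one. It assumes a plain Box-Cox transform with a hypothetical lambda of 0.5 and a hypothetical increase on the transformed scale; with a fitted library transformer, its own inverse method should be used instead.

```python
import math

def box_cox(x, lam):
    """Box-Cox transform of a single positive value."""
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam

def inverse_box_cox(y, lam):
    """Undo box_cox: map a transformed value back to the original scale."""
    return math.exp(y) if lam == 0 else (lam * y + 1) ** (1 / lam)

lam = 0.5                 # hypothetical; use the lambda fitted on the training data
spend = 100.0             # original scale (thousand $)
transformed = box_cox(spend, lam)

# Hypothetical increase applied on the transformed scale
new_transformed = transformed + 2.0
new_spend = inverse_box_cox(new_transformed, lam)

print("transformed spend:", transformed)
print("new spend on the original scale:", new_spend)
```

Because the transform is non-linear, a fixed step on the transformed scale corresponds to different amounts on the original scale depending on the starting spend.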

They will have to invest 177.48 (thousand$) in TV advertisement to increase their sales to 20M

2. How much do I need to invest in Radio advertisement to improve sales to 20M?

Target – 20 million

Current sales – 16.58

Difference = 3.42

Holding the other inputs fixed, the amount to be added to the (transformed) Radio input is the difference divided by the Radio coefficient: 3.42/0.6476 = 5.28

The new equation is:

We can see that the sales have now reached 20 million$.

Since we have applied a power transformation, to get back the original data we have to apply an inverse power transformation

They will have to invest 73.76 (thousand$) in Radio advertisement to increase their sales to 20M

Similarly, you can compute the required spend for Newspaper and figure out which medium needs the lowest marketing spend while still achieving the sales target of 20 (million$).