5 Super Tips to Improve Your Linear Regression Models
Fun fact: did you know that the first published picture of a regression line came from a lecture presented by Sir Francis Galton in 1877? It is also said that he was the one who coined the term "regression". Galton was a pioneer in the application of statistical methods to measurements in many branches of science. He spent years studying data on the relative sizes of parents and their offspring in various species of plants and animals. One of his most famous observations was that a larger-than-average parent tends to produce a larger-than-average child, but the child is likely to be less large than the parent in terms of its relative position within its own generation. This effect is now known as regression toward the mean.
Today, linear regression models are widely used by data scientists everywhere for all kinds of problems. In this blog post I am going to let you in on a few quick tips that you can use to improve your linear regression models.
- Fit many models
First, build simple models. Using many independent variables does not necessarily make your model good. Next, try building many regression models with different combinations of variables. You can then take an ensemble of all these models, for example by averaging their predictions. This might help you arrive at a good model.
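Here is a minimal sketch of the idea, using only the Python standard library. The data, the variable names (`ad_spend`, `store_size`, `sales`) and the helper functions are all made up for illustration. To keep it short, each candidate model uses a single predictor; in practice you would probably fit multi-variable combinations with a regression package and compare the models on held-out data rather than on the training data.

```python
from statistics import mean

# Toy data, invented for illustration: two candidate predictors and a response.
data = {
    "ad_spend":   [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "store_size": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
}
sales = [3.1, 3.9, 7.2, 6.8, 11.1, 10.9]


def fit_simple_ols(x, y):
    """Return (intercept, slope) of a one-predictor least-squares fit."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def predict(model, x):
    intercept, slope = model
    return [intercept + slope * xi for xi in x]


def rmse(y, y_hat):
    """Root mean squared error between actual and predicted values."""
    return mean((a - b) ** 2 for a, b in zip(y, y_hat)) ** 0.5


# Fit one simple model per candidate predictor.
models = {name: fit_simple_ols(x, sales) for name, x in data.items()}
predictions = {name: predict(models[name], data[name]) for name in data}

# Ensemble: average the predictions of all candidate models, row by row.
ensemble = [mean(row) for row in zip(*predictions.values())]

for name, y_hat in predictions.items():
    print(f"{name:>10}: RMSE = {rmse(sales, y_hat):.3f}")
print(f"{'ensemble':>10}: RMSE = {rmse(sales, ensemble):.3f}")
```

Averaging is only one way to combine models; a weighted average based on validation error is another common choice.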
- Exploratory analysis
The key step to getting a good model is exploratory data analysis.
- It’s important to understand the relationship between your dependent variable and each independent variable, and whether that relationship is linear. Only then can you afford to use those variables in your model and expect a good output.
- It’s also important to check for and treat extreme values, or outliers, in your variables. Outliers can skew the fit, which is one reason your predicted values may be off.
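Both checks above can be sketched in a few lines. The example below is an illustration only: the data is invented, the last `y` value is a planted outlier, and the 1.5 × IQR rule is one common convention for flagging outliers, not the only one. A scatter plot would normally accompany the correlation check.

```python
from statistics import mean, quantiles


def pearson_r(x, y):
    """Pearson correlation: a rough measure of linear association."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Invented data: y is roughly 2 * x, except for the planted outlier at the end.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 40.0]

flagged = iqr_outliers(y)
print("flagged outliers:", flagged)
print(f"correlation with outlier:    {pearson_r(x, y):.3f}")
print(f"correlation without outlier: {pearson_r(x[:-1], y[:-1]):.3f}")
```

Whether to drop, cap, or keep a flagged value depends on why it is extreme, so treat the flag as a prompt to investigate rather than an automatic deletion.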
- Graphing the relevant variables
Are you sure you really want to make those quantile-quantile plots, influence diagrams, and all the other things that spew out of a statistical regression package? What are you going to do with all that? Just forget about it and focus on the simple plots that help us understand a model.
- Some simple measures for judging your model are R-squared, adjusted R-squared, the coefficient values, and their p-values.
- You can also plot the residuals, check for heteroscedasticity, and plot the actual values against the model’s predicted values.
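As a rough sketch of these diagnostics, the example below computes R-squared, adjusted R-squared and the residuals for a one-predictor model on invented data. The heteroscedasticity check here is deliberately crude (it compares the average residual size for low and high fitted values) and is my own simplification; a residual plot or a formal test from a statistics package would be the usual route. P-values are left out because they need a t-distribution, which regression packages report for you.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Return (intercept, slope) of a one-predictor least-squares fit."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def r_squared(y, y_hat):
    """Share of the variance in y explained by the fitted values."""
    y_bar = mean(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """Penalise R-squared for p predictors (intercept not counted)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Invented data for illustration.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.3, 4.1, 6.2, 7.8, 10.5, 11.6, 14.9, 15.2, 19.0, 19.1]

intercept, slope = fit_simple_ols(x, y)
y_hat = [intercept + slope * xi for xi in x]
residuals = [a - b for a, b in zip(y, y_hat)]

r2 = r_squared(y, y_hat)
adj_r2 = adjusted_r_squared(r2, n=len(y), p=1)
print(f"R-squared = {r2:.3f}, adjusted R-squared = {adj_r2:.3f}")

# Crude heteroscedasticity check: order residuals by fitted value and
# compare the mean absolute residual in the lower and upper halves.
ordered = [r for _, r in sorted(zip(y_hat, residuals))]
half = len(ordered) // 2
spread_low = mean(abs(r) for r in ordered[:half])
spread_high = mean(abs(r) for r in ordered[half:])
print(f"mean |residual|: low fitted = {spread_low:.3f}, high fitted = {spread_high:.3f}")
```

If the residual spread grows noticeably with the fitted values, a log transformation of the response is one thing to try.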
Consider transforming every variable in sight:
- Logarithms of all-positive variables (primarily because this leads to multiplicative models on the original scale, which often makes sense);
- Standardizing based on the scale or potential range of the data (so that coefficients can be more directly interpreted and scaled);
- Transforming before multilevel modelling (thus attempting to make coefficients more comparable, thus allowing more effective second-level regressions, which in turn improve partial pooling).
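The first two transformations can be illustrated with a short sketch. All numbers and names below are invented. The toy `revenue` series grows by exactly 50% per period, so fitting a straight line to its logarithm recovers a multiplicative factor on the original scale. Standardizing here means subtracting the mean and dividing by one standard deviation, which is one common choice among several.

```python
import math
from statistics import mean, stdev


def fit_simple_ols(x, y):
    """Return (intercept, slope) of a one-predictor least-squares fit."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def standardize(values):
    """Centre on the mean and scale by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Log transform: revenue grows by a factor of 1.5 per period (invented data).
period = [0, 1, 2, 3, 4]
revenue = [100.0, 150.0, 225.0, 337.5, 506.25]
log_revenue = [math.log(v) for v in revenue]

intercept, slope = fit_simple_ols(period, log_revenue)
# On the original scale the model is revenue = exp(intercept) * exp(slope) ** period.
growth_factor = math.exp(slope)
print(f"each period multiplies revenue by about {growth_factor:.2f}")

# Standardizing: put a skewed, all-positive variable on a comparable scale.
income = [20_000, 35_000, 50_000, 120_000, 400_000]
z_log_income = standardize([math.log(v) for v in income])
print([round(z, 2) for z in z_log_income])
```

A coefficient on a standardized predictor then reads as the change in the response per one standard deviation of that predictor.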
Apart from transformations, creating new variables out of existing variables is also very helpful. For example, a retailer that records marketing costs and in-store costs can create a new variable: total cost = marketing cost + in-store costs.
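In code, a derived variable like this is a one-liner per record. The store records and field names below are hypothetical.

```python
# Hypothetical retailer records; field names and figures are made up.
stores = [
    {"store": "A", "marketing_cost": 1200.0, "in_store_cost": 800.0},
    {"store": "B", "marketing_cost": 950.0, "in_store_cost": 1100.0},
    {"store": "C", "marketing_cost": 400.0, "in_store_cost": 650.0},
]

# Derive the new variable from the two existing ones.
for row in stores:
    row["total_cost"] = row["marketing_cost"] + row["in_store_cost"]

for row in stores:
    print(row["store"], row["total_cost"])
```

One caution: if you put total cost into a model together with both of its components, the three predictors are perfectly collinear, so use either the total or the parts.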
The goal is to create models that could make sense (and can then be fit and compared to data) and that include all relevant information.
Don’t get hung up on whether a coefficient “should” vary by group. Just allow it to vary in the model, and then, if the estimated scale of variation is small, maybe you can ignore it if that would be more convenient.
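A quick, informal way to look at how much a coefficient varies by group is to fit a separate slope for each group and see how spread out the slopes are. The sketch below does that on invented data with hypothetical group names. Note that this is a simplification: it fits each group on its own with no pooling, whereas a multilevel model would estimate the group-level variation and the slopes together.

```python
from statistics import mean, stdev


def fit_simple_ols(x, y):
    """Return (intercept, slope) of a one-predictor least-squares fit."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Invented (group, x, y) rows; each group follows its own straight line.
rows = [
    ("north", 1, 3.0), ("north", 2, 5.0), ("north", 3, 7.0),
    ("south", 1, 2.6), ("south", 2, 4.7), ("south", 3, 6.8),
    ("east", 1, 3.9), ("east", 2, 5.8), ("east", 3, 7.7),
]

# Collect x and y values per group.
by_group = {}
for group, x, y in rows:
    xs, ys = by_group.setdefault(group, ([], []))
    xs.append(x)
    ys.append(y)

# One slope per group, fitted independently (no pooling).
slopes = {g: fit_simple_ols(xs, ys)[1] for g, (xs, ys) in by_group.items()}
spread = stdev(list(slopes.values()))

for g, s in slopes.items():
    print(f"{g:>6}: slope = {s:.2f}")
print(f"standard deviation of slopes = {spread:.2f}")
```

If the spread is small relative to the slopes themselves, a single shared coefficient may be a convenient approximation.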