Step by step linear regression example

2021-07-25 · 6 min read · Linear Regression ·

In this article we will try out linear regression on Boston data set. As we go forward in the course, we shall include more complexities.

As a first step, let us check for outliers and high leverage points.

dt %>% 
  pivot_longer(cols = c(1:14)) %>% 
  ggplot(aes(x=value))+
  geom_boxplot()+
  facet_wrap(~name)

It looks like some of the variables have substantial number of outliers. Well, in this case, we will not consider them as outliers. Outliers are usually just a few points away from the data. In our case, it is not so. Moreover, removing these will remove a lot of data. As we go forward, we shall try to find out ways to tackle this. For now, we will keep the variable.

Next we will look for collinearity.

psych::corPlot(dt[1:13])

Some heavy correlations are noticed. We can remove nox,age, dis and tax.

dt.new<-dt %>% 
  dplyr::select(-c(nox,dis,tax,age))

psych::corPlot(dt.new)

How about multicollinearity? We will calculate the variation inflation factor for that

fit1<-lm(medv~., data=dt.new)
car::vif(fit1)

##     crim       zn    indus     chas       rm      rad  ptratio    black 
## 1.764273 1.522811 2.296931 1.051265 1.759611 2.501243 1.515956 1.336424 
##    lstat 
## 2.461788

As a thumb rule, vif <=4 is acceptable. All the values are within acceptable range.

Now let us work on improving the model.

summary(fit1)

## 
## Call:
## lm(formula = medv ~ ., data = dt.new)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.3213  -2.9803  -0.8391   1.6229  29.0234 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.821352   4.353376   3.175 0.001592 ** 
## crim        -0.074061   0.034909  -2.122 0.034369 *  
## zn          -0.004990   0.011961  -0.417 0.676710    
## indus       -0.024030   0.049941  -0.481 0.630612    
## chas         3.141969   0.912558   3.443 0.000624 ***
## rm           4.535619   0.426793  10.627  < 2e-16 ***
## rad          0.096209   0.041060   2.343 0.019518 *  
## ptratio     -0.951894   0.128565  -7.404 5.71e-13 ***
## black        0.010732   0.002863   3.749 0.000198 ***
## lstat       -0.521405   0.049670 -10.497  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.08 on 496 degrees of freedom
## Multiple R-squared:  0.7003, Adjusted R-squared:  0.6949 
## F-statistic: 128.8 on 9 and 496 DF,  p-value: < 2.2e-16

Seems like zn and indus do not add value to the model.

fit2<-update(fit1,~. -zn-indus)
summary(fit2)

## 
## Call:
## lm(formula = medv ~ crim + chas + rm + rad + ptratio + black + 
##     lstat, data = dt.new)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.3932  -3.0253  -0.8268   1.5912  28.9649 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.320540   4.239152   3.142 0.001776 ** 
## crim        -0.074051   0.034786  -2.129 0.033764 *  
## chas         3.119025   0.900617   3.463 0.000580 ***
## rm           4.547522   0.423947  10.727  < 2e-16 ***
## rad          0.090037   0.038412   2.344 0.019471 *  
## ptratio     -0.942444   0.124456  -7.572 1.79e-13 ***
## black        0.010822   0.002849   3.798 0.000164 ***
## lstat       -0.524916   0.046935 -11.184  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.071 on 498 degrees of freedom
## Multiple R-squared:  0.7002, Adjusted R-squared:  0.6959 
## F-statistic: 166.1 on 7 and 498 DF,  p-value: < 2.2e-16

Now, if we want to improve this model further, we may want to include some non-linear transformation. That the data is non-linear can be verified from residual plot

ggplot(fit2, aes(x = .fitted, y = .resid)) +
  geom_point()+
    labs(
    title ="Residual Plot",
    subtitle = "Funnel not observed, but pattern observed",
    y="Residuals",
    x="fitted Values"
  ) +
  theme_ft_rc()

Before try transformations, we may want to explore which variable is perhaps best suitable for transformation.

dt.new %>% 
  dplyr::select(-c(zn,indus)) %>% 
  pivot_longer(cols=c(crim:lstat)) %>% 
  ggplot(aes(x=medv, y=value))+
  geom_point()+
  facet_wrap(~name, scales="free")

It seems there is some scope in lstat and I will give a try in crim, ptratio and black as well.

fit3<-lm(medv~chas+rm+rad+crim+ptratio+black+poly(lstat,2), data=dt.new)
summary(fit3)

## 
## Call:
## lm(formula = medv ~ chas + rm + rad + crim + ptratio + black + 
##     poly(lstat, 2), data = dt.new)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.6624  -2.5796  -0.5902   2.0314  27.0008 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       8.045020   3.617897   2.224 0.026619 *  
## chas              3.215190   0.819548   3.923 9.97e-05 ***
## rm                3.832808   0.392046   9.776  < 2e-16 ***
## rad               0.120163   0.035076   3.426 0.000664 ***
## crim             -0.123788   0.032024  -3.865 0.000126 ***
## ptratio          -0.748036   0.114832  -6.514 1.79e-10 ***
## black             0.009204   0.002598   3.543 0.000432 ***
## poly(lstat, 2)1 -94.386333   6.925106 -13.630  < 2e-16 ***
## poly(lstat, 2)2  49.729480   4.865249  10.221  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.615 on 497 degrees of freedom
## Multiple R-squared:  0.7522, Adjusted R-squared:  0.7482 
## F-statistic: 188.6 on 8 and 497 DF,  p-value: < 2.2e-16

After few attempts, I selected the above model

Next we check the residual plot again. The pattern is less visible (compared to last residual plot). Heteroscedasticity is also not observed.

ggplot(fit3, aes(x = .fitted, y = .resid)) +
  geom_point()+
    labs(
    title ="Residual Plot",
    subtitle = "No pattern or funnel observed",
    y="Residuals",
    x="fitted Values"
  ) +
  theme_ft_rc()

Before we end this section, we also check for correlations in error terms. Looks like there could be a slight correlation which can be ignored.

ggplot(data.frame(resi=fit3$residuals, ind=as.integer(names(fit3$residuals))), aes(x=ind,y=resi))+
  geom_line(aes(x=ind,y=resi))+
  geom_point() +
      labs(
    title ="Residual Plot",
    subtitle = "Possibility of slight correlation in error terms are observed",
    y="Residuals",
    x="Index"
  ) +
  theme_ft_rc()

So, our final model is as follows

summary(fit3)

## 
## Call:
## lm(formula = medv ~ chas + rm + rad + crim + ptratio + black + 
##     poly(lstat, 2), data = dt.new)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.6624  -2.5796  -0.5902   2.0314  27.0008 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       8.045020   3.617897   2.224 0.026619 *  
## chas              3.215190   0.819548   3.923 9.97e-05 ***
## rm                3.832808   0.392046   9.776  < 2e-16 ***
## rad               0.120163   0.035076   3.426 0.000664 ***
## crim             -0.123788   0.032024  -3.865 0.000126 ***
## ptratio          -0.748036   0.114832  -6.514 1.79e-10 ***
## black             0.009204   0.002598   3.543 0.000432 ***
## poly(lstat, 2)1 -94.386333   6.925106 -13.630  < 2e-16 ***
## poly(lstat, 2)2  49.729480   4.865249  10.221  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.615 on 497 degrees of freedom
## Multiple R-squared:  0.7522, Adjusted R-squared:  0.7482 
## F-statistic: 188.6 on 8 and 497 DF,  p-value: < 2.2e-16

Later in the course, we shall try to improve it further.