L9 Linear Regression

Introduction

Regression analysis is a technique for investigating and modelling the relationship between variables like X and Y, using X to estimate Y. In these cases, we refer to X as the explanatory or independent variable. It is also sometimes referred to as a predictor. Y is referred to as the response or dependent variable.

Regression models are used for two primary purposes:

To understand how certain explanatory variables affect the response variable. This aim is typically known as estimation, since the primary focus is on estimating the unknown parameters of the model.
To predict the response variable for new values of the explanatory variables. This aim is referred to as prediction.

This course focuses on the estimation aim.

Example 9.1 (Concrete Data: Flow on Water)

Recall that we first saw this dataset in L3 Exploring Quantitative Data.

The fitted regression model here estimates the relationship between the output of the flow test and the amount of water used to create the concrete. Note the trend in the scatterplot. In this topic, we will figure out how to estimate this line.

Example 9.2 (Bike Rental Data)

In L6 Introduction to SAS we encountered data on bike rentals. Here we attempt to model the number of registered users as a function of the number of casual users.

The trendline does appear to differ depending on whether or not the day is a working day.

Simple Linear Regression

Formal Set-up

The simple linear regression model is applicable when we have observations $(X_{i}, Y_{i})$ for n individuals. For now, let’s assume both the X and Y variables are quantitative.

Equation 9.1

The simple linear regression model is given by

Y_{i} = β_{0} + β_{1} X_{i} + e_{i} (9.1)

where:

$β_{0}$ is the intercept term
$β_{1}$ is the slope, and
$e_{i}$ is an error term, specific to each individual in the dataset

Equation 9.2

$β_{0}$ and $β_{1}$ are unknown constants that need to be estimated from the data. There is an implicit assumption in the formulation of the model that there is a linear relationship between $Y_{i}$ and $X_{i}$ . In terms of distributions, we assume that the $e_{i}$ are iid Normal.

e_{i} \sim N (0, σ^{2}), i = 1, \dots, n (9.2)

The constant variance assumption is also referred to as homoscedasticity. The validity of the above assumptions will have to be checked after the model is fitted. All in all, the assumptions imply that:

$E (Y_{i} ∣ X_{i}) = β_{0} + β_{1} X_{i}$ , for $i = 1, \dots, n$
$Var (Y_{i} ∣ X_{i}) = Var (e_{i}) = σ^{2}$ , for $i = 1, \dots, n$
The $Y_{i}$ 's are independent
The $Y_{i}$ 's are Normally distributed

Estimation

Equation 9.3

Before deploying or using the model, we need to estimate optimal values to use for the unknown $β_{0}$ and $β_{1}$ . We shall introduce the method of Ordinary Least Squares (OLS) for the estimation. Let us define the error Sum of Squares to be:

S S_{E} = S (β_{0}, β_{1}) = \sum_{i = 1}^{n} (Y_{i} - β_{0} - β_{1} X_{i})^{2} (9.3)

Then the OLS estimates of $β_{0}$ and $β_{1}$ are given by

\underset{β_{0}, β_{1}}{\arg\min} \sum_{i = 1}^{n} (Y_{i} - β_{0} - β_{1} X_{i})^{2}

The minimisation above can be carried out analytically, by taking partial derivatives with respect to the two parameters and setting them to 0.

\frac{\partial S}{\partial β_{0}} = - 2 \sum_{i = 1}^{n} (Y_{i} - β_{0} - β_{1} X_{i}) = 0
\frac{\partial S}{\partial β_{1}} = - 2 \sum_{i = 1}^{n} X_{i} (Y_{i} - β_{0} - β_{1} X_{i}) = 0

Solving and simplifying, we arrive at the following:

\hat{β}_{1} = \frac{\sum_{i = 1}^{n} (X_{i} - \bar{X}) (Y_{i} - \bar{Y})}{\sum_{i = 1}^{n} (X_{i} - \bar{X})^{2}}
\hat{β}_{0} = \bar{Y} - \hat{β}_{1} \bar{X}

where $\bar{Y} = (1/ n) \sum_{i = 1}^{n} Y_{i}$ and $\bar{X} = (1/ n) \sum_{i = 1}^{n} X_{i}$ .

If we define the following sums:

S_{X Y} = \sum_{i = 1}^{n} X_{i} Y_{i} - \frac{(\sum_{i = 1}^{n} X_{i}) (\sum_{i = 1}^{n} Y_{i})}{n}
S_{XX} = \sum_{i = 1}^{n} X_{i}^{2} - \frac{(\sum_{i = 1}^{n} X_{i})^{2}}{n}

then a form convenient for computation of $\hat{β_{1}}$ is

\hat{β_{1}} = \frac{S _{X Y}}{S _{XX}}
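The closed-form estimates can be checked numerically. The following is a minimal sketch on synthetic data (the data, the seed and the variable names are assumptions for illustration, not the concrete dataset). It computes $S_{XY}$ and $S_{XX}$ in their computational forms and cross-checks the result against NumPy's degree-1 polynomial fit.

```python
import numpy as np

# Synthetic data, assumed purely for illustration.
rng = np.random.default_rng(0)
n = 50
x = rng.uniform(160, 240, size=n)
y = -60 + 0.5 * x + rng.normal(0, 10, size=n)

# Computational forms of the sums S_XY and S_XX.
s_xy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
s_xx = np.sum(x ** 2) - np.sum(x) ** 2 / n

# OLS estimates of the slope and the intercept.
beta1_hat = s_xy / s_xx
beta0_hat = y.mean() - beta1_hat * x.mean()

# Cross-check with NumPy's least-squares line fit (returns slope, intercept).
slope_np, intercept_np = np.polyfit(x, y, deg=1)
print(beta0_hat, beta1_hat)
print(intercept_np, slope_np)
```

The two printed pairs should agree up to floating-point rounding.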

Once we have the estimates, we can use Equation 9.1 to compute fitted values for each observation. These correspond to our best guess of the mean of the distributions from which the observations arose:

\hat{Y_{i}} = \hat{β_{0}} + \hat{β_{1}} X_{i}, i = 1, \dots, n

Equation 9.4

As always, we can form residuals as the deviations from fitted values.

r_{i} = Y_{i} - \hat{Y_{i}} (9.4)

Residuals are our best guess at the unobserved error terms $e_{i}$ . Squaring the residuals and summing over all observations, we can arrive at the following decomposition, which is very similar to the one in the ANOVA model:

\underbrace{\sum_{i = 1}^{n} (Y_{i} - \bar{Y})^{2}}_{S S_{T}} = \underbrace{\sum_{i = 1}^{n} (Y_{i} - \hat{Y}_{i})^{2}}_{S S_{R es}} + \underbrace{\sum_{i = 1}^{n} (\hat{Y}_{i} - \bar{Y})^{2}}_{S S_{R e g}}

where:

$S S_{T}$ is known as the total sum of squares
$S S_{R es}$ is known as the residual sum of squares
$S S_{R e g}$ is known as the regression sum of squares

In our model, recall from Equation 9.2 that we had assumed equal variance for all our observations. We can estimate $σ^{2}$ with

\hat{σ^{2}} = \frac{S S _{R es}}{n - 2}
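As a quick numerical illustration of the decomposition and of the variance estimate, here is a small sketch on synthetic data (all names and values are assumptions made for the example). It fits the line by OLS, computes the three sums of squares, and confirms that the total equals the residual part plus the regression part.

```python
import numpy as np

# Synthetic data, assumed purely for illustration.
rng = np.random.default_rng(1)
n = 40
x = rng.normal(10, 2, size=n)
y = 3 + 1.5 * x + rng.normal(0, 1, size=n)

# OLS fit using the centred formulas.
s_xx = np.sum((x - x.mean()) ** 2)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / s_xx
beta0_hat = y.mean() - beta1_hat * x.mean()
y_hat = beta0_hat + beta1_hat * x

# Sums of squares from the decomposition.
ss_t = np.sum((y - y.mean()) ** 2)
ss_res = np.sum((y - y_hat) ** 2)
ss_reg = np.sum((y_hat - y.mean()) ** 2)

# Estimate of the error variance, with n - 2 degrees of freedom.
sigma2_hat = ss_res / (n - 2)
print(ss_t, ss_res + ss_reg, sigma2_hat)
```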

Equation 9.5 & 9.6

Our distributional assumptions lead to the following for our estimates $\hat{β_{0}}$ and $\hat{β_{1}}$ :

\hat{β_{0}} \sim N (β_{0}, σ^{2} (1/ n + \bar{X}^{2} / S_{XX})) (9.5)
\hat{β_{1}} \sim N (β_{1}, σ^{2} / S_{XX}) (9.6)

The above are used to construct confidence intervals for $β_{0}$ and $β_{1}$ , based on t-distributions.

Hypothesis Test for Model Significance

This is to test if the coefficient $β_{1}$ is significantly different from 0. It is essentially a test of whether it was worthwhile to use a regression model of the form in Equation 9.1 instead of a simple mean to represent the data.
The null and alternative hypotheses are:

H_{0} : β_{1} = 0
H_{1} : β_{1} \neq 0

Equation 9.7

The test statistic is

F_{0} = \frac{S S _{R e g} /1}{S S _{R es} / ( n - 2 )} (9.7)

Equation 9.8

Under the null hypothesis, $F_{0} \sim F_{1, n - 2}$
It is also possible to perform this same test as a t-test, using the result earlier. The statement of the hypotheses is equivalent to the F-test. The test statistic:

T_{0} = \frac{\hat{β}_{1}}{\sqrt{\hat{σ}^{2} / S_{XX}}} (9.8)

Under $H_{0}$ , the distribution of $T_{0}$ is $t_{n - 2}$ . This t-test and the earlier F-test in this section are identical. It can be proved that $F_{0} = T_{0}^{2}$ ; the obtained p-values will be identical.
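The identity $F_{0} = T_{0}^{2}$ can be seen numerically. The sketch below uses synthetic data (an assumption for illustration) and computes $T_{0}$ as the slope estimate divided by its estimated standard error, $\sqrt{\hat{σ}^{2} / S_{XX}}$, which follows from the sampling distribution of $\hat{β_{1}}$.

```python
import numpy as np

# Synthetic data, assumed purely for illustration.
rng = np.random.default_rng(11)
n = 35
x = rng.uniform(0, 10, size=n)
y = 1 + 0.8 * x + rng.normal(0, 2, size=n)

# OLS fit.
s_xx = np.sum((x - x.mean()) ** 2)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / s_xx
beta0_hat = y.mean() - beta1_hat * x.mean()
y_hat = beta0_hat + beta1_hat * x

# Sums of squares and the variance estimate.
ss_res = np.sum((y - y_hat) ** 2)
ss_reg = np.sum((y_hat - y.mean()) ** 2)
sigma2_hat = ss_res / (n - 2)

# F statistic and t statistic for H0: beta_1 = 0.
f0 = (ss_reg / 1) / (ss_res / (n - 2))
t0 = beta1_hat / np.sqrt(sigma2_hat / s_xx)
print(f0, t0 ** 2)
```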

Coefficient of Determination, $R^{2}$

The coefficient of determination $R^{2}$ is defined as

R^{2} = 1 - \frac{S S _{R es}}{S S _{T}} = \frac{S S _{R e g}}{S S _{T}}

It can be interpreted as the proportion of variation in $Y_{i}$ that is explained by the inclusion of $X_{i}$ . Since $0 \leq S S_{R es} \leq S S_{T}$ , we can easily prove that $0 \leq R^{2} \leq 1$ . The larger the value of $R^{2}$ , the more closely the fitted model follows the observed data.

When we get to the case of multiple linear regression, take note that simply including more variables in the model can never decrease $R^{2}$ , and will typically increase it. This is undesirable; it is preferable to have a parsimonious model (one that uses the minimum number of parameters necessary to explain a given phenomenon) that explains the response variable well.

Example 9.3 (Concrete Data Model)

In this example, we focus on the estimation of the model parameters for the two variables we introduced in Example 9.1 (Concrete Data: Flow on Water).

R code

concrete <- read.csv("data/concrete+slump+test/slump_test.data")
names(concrete)[c(1,11)] <- c("id", "Comp.Strength")
lm_flow_water <- lm(FLOW.cm. ~ Water, data=concrete)
summary(lm_flow_water)

Python code

import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
 
concrete = pd.read_csv("../data/concrete+slump+test/slump_test.data")
concrete.rename(columns={'No':'id', 
                         'Compressive Strength (28-day)(Mpa)':'Comp_Strength',
                         'FLOW(cm)': 'Flow'},
                inplace=True)
lm_flow_water = ols('Flow ~ Water', data=concrete).fit()
print(lm_flow_water.summary())

SAS output

From the output, we note that the estimated model for Flow (Y) against Water (X) is:

Y = - 58.73 + 0.55 X

The estimates are $\hat{β_{0}} = - 58.73$ and $\hat{β_{1}} = 0.55$ . This is the precise equation that was plotted in Figure 9.1. The $R^{2}$ was labelled as “Multiple R-squared” in the R output. The value is 0.3995, which means that about 40% of the variation in Y is explained by X.

A simple interpretation of the model is as follows:

For every 1 unit increase in Water, there is an average associated increase in Flow rate of 0.55 units.

To obtain confidence intervals for the parameters, we can use the following code in R. The Python summary already contains the confidence intervals.

R code

confint(lm_flow_water)

We can read off that the 95% Confidence Intervals are:

for $β_{0}$ : (-85.08, -32.37)
for $β_{1}$ : (0.42, 0.68)

Example 9.4 (Bike Data F-test)

We shall fit a simple linear regression model to the bike data, constrained to the non-working days.
Take note that in this example, in the R and Python output, we print an analysis of variance table instead of using the summary() methods. The latter provides coefficient estimates, but the former output only returns a sum-of-squares breakdown.

R code

bike2 <- read.csv("data/bike2.csv")
bike2_sub <- bike2[bike2$workingday == "no", ]
lm_reg_casual <- lm(registered ~ casual, data=bike2_sub)
anova(lm_reg_casual)

Python code

bike2 = pd.read_csv("../data/bike2.csv")
bike2_sub = bike2[bike2.workingday == "no"]
 
lm_reg_casual = ols('registered ~ casual', bike2_sub).fit()
anova_tab = sm.stats.anova_lm(lm_reg_casual)
anova_tab

SAS output

The output above includes the sums of squares that we need to perform the F-test outlined in Hypothesis Test for Model Significance. From the output table, we can see that $S S_{R e g}$ = 237654556 and $S S_{R es}$ = 147386970. The value of $F_{0}$ for this dataset is 369.25. The p-value is extremely small ( $2 \times 1 0^{- 16}$ ), indicating strong evidence against $H_{0}$ , i.e. against the hypothesis that $β_{1} = 0$ .

If you observe carefully in Example 9.3 (Concrete Data Model) , the output from R contains both the t-test for significance of $β_{1}$ and the F-test statistic based on sum-of-squares. The p-value in both cases is 8.10 x 10^-13.

In linear regression, we almost always wish to use the model to understand what the mean of future observations would be. In the concrete case, we may wish to use the model to understand how the Flow test output values change as the amount of Water in the mixture changes. This is because, based on our formulation

E (Y ∣ X) = β_{0} + β_{1} X

After estimating the parameters, we would have:

\widehat{E} (Y ∣ X) = \hat{β_{0}} + \hat{β_{1}} X

Thus we can vary the values of X to study how much the estimated mean of Y changes. Here is how we can do so for the concrete data model.

Example 9.5

R code

new_df <- data.frame(Water = seq(160, 240, by = 5))
conf_intervals <- predict(lm_flow_water, new_df, interval="conf")
 
plot(concrete$Water, concrete$FLOW.cm., ylim=c(0, 100),
     xlab="Water", ylab="Flow", main="Confidence Bands for Flow vs. Water")
abline(lm_flow_water, col="red")
lines(new_df$Water, conf_intervals[,"lwr"], col="red", lty=2)
lines(new_df$Water, conf_intervals[,"upr"], col="red", lty=2)
legend("bottomright", legend=c("Fitted line", "Lower/Upper CI"), 
       lty=c(1,2), col="red")

Python code

new_df = sm.add_constant(pd.DataFrame({'Water' : np.linspace(160,240, 10)}))
 
predictions_out = lm_flow_water.get_prediction(new_df)
 
ax = concrete.plot(x='Water', y='Flow', kind='scatter', alpha=0.5 )
ax.set_title('Confidence Bands for Flow vs. Water');
ax.plot(new_df.Water, predictions_out.conf_int()[:, 0].reshape(-1), 
        color='blue', linestyle='dashed');
ax.plot(new_df.Water, predictions_out.conf_int()[:, 1].reshape(-1), 
        color='blue', linestyle='dashed');
ax.plot(new_df.Water, predictions_out.predicted, color='blue');

SAS output

The fitted line is the straight line formed using $\hat{β_{0}}$ and $\hat{β_{1}}$ . The dashed lines are 95% Confidence Intervals for E(Y|X), for varying values of X. They are formed by joining up the lower bounds and upper bounds separately. Notice how the limits get wider the further away we are from $\overset{ˉ}{X} \approx 200$ .
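The widening of the bands can be traced to the standard error of the fitted mean at a point $x_{0}$, which for simple linear regression is $\hat{σ} \sqrt{1/ n + (x_{0} - \bar{X})^{2} / S_{XX}}$. The sketch below evaluates this quantity on a grid, using synthetic data rather than the concrete dataset (data, seed and names are assumptions for illustration).

```python
import numpy as np

# Synthetic data, assumed purely for illustration.
rng = np.random.default_rng(2)
n = 60
x = rng.uniform(160, 240, size=n)
y = -60 + 0.5 * x + rng.normal(0, 12, size=n)

# OLS fit and residual standard deviation.
x_bar = x.mean()
s_xx = np.sum((x - x_bar) ** 2)
beta1_hat = np.sum((x - x_bar) * (y - y.mean())) / s_xx
beta0_hat = y.mean() - beta1_hat * x_bar
sigma_hat = np.sqrt(np.sum((y - beta0_hat - beta1_hat * x) ** 2) / (n - 2))

# Standard error of the estimated mean response on a grid of x values.
x_new = np.linspace(160, 240, 9)
se_mean = sigma_hat * np.sqrt(1 / n + (x_new - x_bar) ** 2 / s_xx)
for x0, se in zip(x_new, se_mean):
    print(f"x = {x0:5.1f}   se of fitted mean = {se:.3f}")
```

The standard error is smallest at the grid point closest to the sample mean of x and grows in both directions, which is why the interval limits fan out.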

Multiple Linear Regression

Formal Setup

When we have more than 1 explanatory variable, we turn to multiple linear regression, a generalised version of what we have been dealing with so far. We would still have observed information from n individuals, but for each one, we now observe a vector of values:

Y_{i}, X_{1, i}, X_{2, i}, \dots, X_{p - 1, i}, X_{p, i}

Equation 9.9

In other words, we observe p independent variables and 1 response variable for each individual in our dataset. The analogous equation to Equation 9.1 is

Y_{i} = β_{0} + β_{1} X_{1, i} + \dots + β_{p} X_{p, i} + e_{i} (9.9)

It is easier to write things with matrices for multiple linear regression:

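Y = \begin{bmatrix} Y_{1} \\ Y_{2} \\ \vdots \\ Y_{n} \end{bmatrix}, \quad X = \begin{bmatrix} 1 & X_{1, 1} & X_{2, 1} & \dots & X_{p, 1} \\ 1 & X_{1, 2} & X_{2, 2} & \dots & X_{p, 2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{1, n} & X_{2, n} & \dots & X_{p, n} \end{bmatrix}, \quad β = \begin{bmatrix} β_{0} \\ β_{1} \\ \vdots \\ β_{p} \end{bmatrix}, \quad e = \begin{bmatrix} e_{1} \\ e_{2} \\ \vdots \\ e_{n} \end{bmatrix}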

With the above matrices, we can re-write Equation 9.9 as

Y = X β + e

We retain the same distributional assumptions as in the Formal Set-up for simple linear regression.

Estimation

As in the Estimation section for simple linear regression, we can define $S S_{E}$ to be

S S_{E} = S (β_{0}, β_{1}, \dots, β_{p}) = \sum_{i = 1}^{n} (Y_{i} - β_{0} - β_{1} X_{1, i} - \dots - β_{p} X_{p, i})^{2} (9.10)

Minimising the above cost function leads to the OLS estimates:

\hat{β} = (X^{'} X)^{- 1} X^{'} Y

The fitted values can be computed with

\hat{Y} = X \hat{β} = X (X^{'} X)^{- 1} X^{'} Y

Residuals are obtained as

r = Y - \hat{Y}

Finally, since the model in Equation 9.9 has $p + 1$ coefficients, we estimate $σ^{2}$ using

\hat{σ}^{2} = \frac{S S_{Res}}{n - p - 1} = \frac{r^{'} r}{n - p - 1}
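The matrix expressions translate almost directly into NumPy. The following is a sketch on synthetic data (the design matrix, true coefficients and names are assumptions for illustration). It solves the normal equations for $\hat{β}$, forms fitted values and residuals, and compares against NumPy's least-squares routine.

```python
import numpy as np

# Synthetic design with an intercept column and two predictors (assumed).
rng = np.random.default_rng(3)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(0, 0.3, size=n)

# OLS estimate from the normal equations (X'X) beta = X'Y.
# Solving the linear system avoids forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Fitted values and residuals.
Y_hat = X @ beta_hat
r = Y - Y_hat

# Cross-check with NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat)
print(beta_lstsq)
```

In practice a least-squares solver such as `np.linalg.lstsq` is usually preferred over the normal equations when the columns of X are nearly collinear.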

Coefficient of Determination

In the case of multiple linear regression, $R^{2}$ is calculated exactly as in simple linear regression, and its interpretation remains the same:

R^{2} = 1 - \frac{S S _{R es}}{S S _{T}}

However, note that $R^{2}$ can be inflated simply by adding more terms to the model (even insignificant terms). Thus, we use the adjusted $R^{2}$ , which penalises the model for each additional term:

R_{a d j}^{2} = 1 - \frac{S S_{R es} / (n - p - 1)}{S S_{T} / (n - 1)}
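To see the penalty at work, the sketch below fits two models to synthetic data (an assumption for illustration): one with a genuine predictor, and one that also includes a variable unrelated to the response. The helper `r2_and_adjusted` is a hypothetical name, and `k` counts every estimated coefficient, intercept included, so the residual degrees of freedom are `n - k`.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Return (R^2, adjusted R^2) for an OLS fit; X must include the intercept column."""
    n, k = X.shape  # k = number of estimated coefficients, intercept included
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ beta_hat) ** 2)
    ss_t = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_t
    r2_adj = 1 - (ss_res / (n - k)) / (ss_t / (n - 1))
    return r2, r2_adj

# Synthetic data, assumed purely for illustration.
rng = np.random.default_rng(4)
n = 80
x1 = rng.normal(size=n)
noise_var = rng.normal(size=n)  # unrelated to y by construction
y = 2 + 3 * x1 + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([X_small, noise_var])

r2_small, r2_adj_small = r2_and_adjusted(X_small, y)
r2_big, r2_adj_big = r2_and_adjusted(X_big, y)
print(r2_small, r2_adj_small)
print(r2_big, r2_adj_big)
```

The plain $R^{2}$ of the larger model is never below that of the smaller one, whereas the adjusted version can move in either direction.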

Hypothesis Tests

The F-test in the multiple linear regression helps determine if our regression model provides any advantage over the simple mean model. The null and alternative hypotheses are:

H_{0} : β_{1} = β_{2} = \dots = β_{p} = 0
H_{1} : β_{j} \neq 0 \text{ for at least one } j \in \{1, 2, \dots, p\}

Equation 9.11

The test statistic is

F_{1} = \frac{S S _{Reg} / p}{S S _{Res} / ( n - p - 1 )} (9.11)

Under the null hypothesis, $F_{1} \sim F_{p, n - p - 1}$ .
It is also possible to test for the significance of individual $β$ terms, using a $t$ -test. The output is typically given for all the coefficients in a table. The statement of the hypotheses pertaining to these tests is:

H_{0} : β_{j} = 0
H_{1} : β_{j} \neq 0

However, note that these $t$ -tests are partial: each one should be interpreted as a test of the contribution of $β_{j}$ , given that all other terms are already in the model.

Example 9.6 (Concrete Data Multiple Linear Regression).

In this second model for concrete, we add a second predictor variable, Slag. The updated model is

Y = β_{0} + β_{1} X_{1} + β_{2} X_{2} + e

where $X_{1}$ corresponds to Water, and $X_{2}$ corresponds to Slag.

R code

lm_flow_water_slag <- lm(FLOW.cm. ~ Water + Slag, data=concrete)
summary(lm_flow_water_slag)

Python code

lm_flow_water_slag = ols('Flow ~ Water + Slag', data=concrete).fit()
print(lm_flow_water_slag.summary())

SAS output

The F-test is now concerned with the hypotheses:

H_{0} : β_{1} = β_{2} = 0
H_{1} : β_{1} \neq 0 \text{ or } β_{2} \neq 0

From the output above, we can see that $F_{1} = 49.17$ , with a corresponding p-value of $1.3 \times 1 0^{- 15}$ . The individual t-tests for the coefficients all indicate significant differences from 0. The final estimated model can be written as

Y = - 50.27 + 0.54 X_{1} - 0.09 X_{2}

Notice that the coefficients have changed slightly from the model in Example 9.3 (Concrete Data Model). Notice also that we have an improved $R^{2}$ of 0.50. However, as we pointed out earlier, we should be using the adjusted $R^{2}$ , which adjusts for the additional variable included. This value is 0.49.

While we seem to have found a better model than before, we still have to assess if all the assumptions listed in Formal Set-up have been met. We shall do so in the subsequent sections.

Indicator Variables

Including a Categorical Variable

The explanatory variables in a linear regression model do not need to be continuous. Categorical variables can also be included in the model. In order to include them, they have to be coded using dummy variables.

For instance, suppose that we wish to include gender in a model as $X_{3}$ . There are only two possible genders in our dataset: Female and Male. We can represent $X_{3}$ as an indicator variable, with

X_{3, i} = \begin{cases} 1 & \text{individual } i \text{ is male} \\ 0 & \text{individual } i \text{ is female} \end{cases}

The model (without subscripts for the $n$ individuals) is then:

Y = β_{0} + β_{1} X_{1} + β_{2} X_{2} + β_{3} X_{3} + e

For females, the value of $X_{3}$ is 0. Hence the model reduces to

Y = β_{0} + β_{1} X_{1} + β_{2} X_{2} + e

On the other hand, for males, the model reduces to

Y = (β_{0} + β_{3}) + β_{1} X_{1} + β_{2} X_{2} + e

The difference between the two models is in the intercept. The other coefficients remain the same.

In general, if the categorical variable has $a$ levels, we will need $a - 1$ columns of indicator variables to represent it. This is in contrast to machine learning models which use one-hot encoding. The latter encoding results in columns that are linearly dependent if we include an intercept term in the model.
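The linear dependence is easy to demonstrate. In the sketch below, a made-up categorical variable with three levels (an assumption for illustration) is encoded both ways. With an intercept, the full one-hot design loses rank because its indicator columns sum to the intercept column, whereas dropping one level gives a full-rank design.

```python
import numpy as np

# A small categorical variable with three levels (made up for illustration).
levels = np.array(["a", "b", "c", "a", "c", "b", "a", "c"])
cats = np.unique(levels)

# One-hot encoding: one indicator column per level.
one_hot = (levels[:, None] == cats[None, :]).astype(float)
intercept = np.ones((len(levels), 1))

# Intercept plus all indicator columns: 4 columns in total.
X_onehot = np.hstack([intercept, one_hot])
# Intercept plus indicators for all but the first level: 3 columns in total.
X_dummy = np.hstack([intercept, one_hot[:, 1:]])

rank_onehot = np.linalg.matrix_rank(X_onehot)
rank_dummy = np.linalg.matrix_rank(X_dummy)
print(rank_onehot, X_onehot.shape[1])
print(rank_dummy, X_dummy.shape[1])
```

A rank smaller than the number of columns means $X^{'} X$ is singular, so the OLS solution is not unique.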

Example 9.7 (Bike Data Working Day)

In this example, we shall improve on the simple linear regression model from Example 9.4 (Bike Data F-test).

R code

lm_reg_casual2 <- lm(registered ~ casual + workingday, data=bike2)
summary(lm_reg_casual2)

Python code

lm_reg_casual2 = ols('registered ~ casual + workingday', bike2).fit()
print(lm_reg_casual2.summary())

SAS output

The estimated model is now

Y = 605 + 1.72 X_{1} + 2330 X_{2}

But $X_{2} = 1$ for working days and $X_{2} = 0$ for non-working days. This results in two separate models for the two types of days:

Y = \begin{cases} 605 + 1.72 X_{1}, & \text{for non-working days} \\ 2935 + 1.72 X_{1}, & \text{for working days} \end{cases}

We can plot the two models on the scatterplot to see how they work better than the original model.

The dashed line corresponds to the earlier model, from Example 9.4 (Bike Data F-test). With the new model, we have fitted separate intercepts for the two types of day, but the same slope. The benefit of fitting the model in this way, instead of splitting the data into two portions and fitting a different model on each one, is that we use the entire dataset to estimate the variability.

If we wish to fit separate intercepts and slopes, we need to include an interaction term.

Interaction Term

A more complex model arises from an interaction between two terms. Here, we shall consider an interaction between a continuous variable and a categorical explanatory variable. Suppose that we have three predictors: height ( $X_{1}$ ), weight ( $X_{2}$ ) and gender ( $X_{3}$ ). As spelt out in Section 9.5.1, we should use indicator variables to represent $X_{3}$ in the model.
If we were to include an interaction between gender and weight, we would be allowing for males and females to have separate coefficients for $X_{2}$ . Here is what the model would appear as:

Y = β_{0} + β_{1} X_{1} + β_{2} X_{2} + β_{3} X_{3} + β_{4} X_{2} X_{3} + e

Remember that $X_{3}$ will be 1 for males and 0 for females. The simplified equation for males would be:

Y = (β_{0} + β_{3}) + β_{1} X_{1} + (β_{2} + β_{4}) X_{2} + e

For females, it would be:

Y = β_{0} + β_{1} X_{1} + β_{2} X_{2} + e

Both the intercept and coefficient of $X_{2}$ are different now. Recall that in Including a Categorical Variable, only the intercept term was different.
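A simplified sketch of this idea follows, with one continuous predictor `x` and one indicator `g` (the extra predictor $X_{1}$ is left out for brevity; data and names are assumptions for illustration). With the interaction included, the combined coefficients reproduce the intercept and slope obtained by fitting each group separately.

```python
import numpy as np

# Synthetic data, assumed purely for illustration.
rng = np.random.default_rng(5)
n = 120
x = rng.uniform(0, 10, size=n)
g = (rng.uniform(size=n) < 0.5).astype(float)  # indicator variable: 1 or 0
y = 1 + 2 * x + 3 * g + 1.5 * x * g + rng.normal(0, 0.5, size=n)

# Design matrix: intercept, x, indicator, and the interaction x * g.
X = np.column_stack([np.ones(n), x, g, x * g])
b0, b1, b2, b3 = np.linalg.lstsq(X, y, rcond=None)[0]

# Separate straight-line fits within each group (polyfit returns slope, intercept).
slope1, intercept1 = np.polyfit(x[g == 1], y[g == 1], deg=1)
slope0, intercept0 = np.polyfit(x[g == 0], y[g == 0], deg=1)

print("group 0:", b0, b1)
print("group 1:", b0 + b2, b1 + b3)
```

The coefficient estimates match the separate fits, but the single model pools all observations when estimating $σ^{2}$.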

Example 9.8 (Bike Data Working Day)

Finally, we include an interaction in the model, resulting in separate intercepts and slopes.

R code

lm_reg_casual3 <- lm(registered ~ casual * workingday, data=bike2)
summary(lm_reg_casual3)

Python code

lm_reg_casual3 = ols('registered ~ casual * workingday', bike2).fit()
print(lm_reg_casual3.summary())

SAS output

Notice that $R_{a d j}^{2}$ has increased from 50.8% to 60.7%. The estimated models for each type of day are:

Y = \begin{cases} 1362 + 1.16 X_{1}, & \text{for non-working days} \\ 2168 + 2.97 X_{1}, & \text{for working days} \end{cases}

Here is a visualisation of the lines that have been estimated for each sub-group of day. This is the image that we had earlier on in Example 9.2 (Bike Rental Data).

Residual Diagnostics

Recall from Equation 9.4 that residuals are computed as

r_{i} = Y_{i} - \hat{Y_{i}}

Residual analysis is a standard approach for identifying how we can improve a model. In the case of linear regression, we can use the residuals to assess if the distributional assumptions hold. We can also use residuals to identify influential points that are masking the general trend of other points. Finally, residuals can provide direction on how to improve the model.

Standardised Residuals

It can be shown that the variance of the residuals is in fact not constant. Let us define the hat-matrix as

H = X (X^{'} X)^{- 1} X^{'}

The diagonal values of $H$ will be denoted $h_{ii}$ , for $i = 1, \dots, n$ . It can then be shown that

Var (r_{i}) = σ^{2} (1 - h_{ii}), Cov (r_{i}, r_{j}) = - σ^{2} h_{ij}

As such, we use the standardised residuals when checking if the assumption of Normality has been met.

r_{i, std} = \frac{r_{i}}{\hat{σ} \sqrt{1 - h_{ii}}}

If the model fits well, standardised residuals should look similar to a $N (0, 1)$ distribution. In addition, large values of the standardised residual indicate potential outlier points.

By the way, $h_{ii}$ is also referred to as the leverage of a point. It is a measure of the potential influence of a point (on the parameters, and future predictions). $h_{ii}$ is a value between 0 and 1. The $h_{ii}$ sum to the number of estimated coefficients, which is $p + 1$ for the model in Equation 9.9, so the average $h_{ii}$ is $(p + 1) / n$ . We consider points for which $h_{ii} > 2 (p + 1) / n$ to be high leverage points.
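The quantities in this section can be computed directly from the design matrix. The sketch below uses synthetic data (an assumption for illustration). Here `k` denotes the number of columns of X, that is, the number of estimated coefficients including the intercept, and the leverages sum to `k`.

```python
import numpy as np

# Synthetic design with an intercept and two predictors (assumed).
rng = np.random.default_rng(6)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=n)
k = X.shape[1]

# Hat matrix and leverages (its diagonal entries).
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

# Raw residuals, residual standard deviation, standardised residuals.
r = y - H @ y
sigma_hat = np.sqrt(r @ r / (n - k))
r_std = r / (sigma_hat * np.sqrt(1 - h))

# Indices of points whose leverage exceeds twice the average leverage.
high_leverage = np.where(h > 2 * k / n)[0]
print(h.sum(), k)
print(high_leverage)
```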

Normality

Example 9.9 (Concrete Data Normality Check)

R code

r_s <- rstandard(lm_flow_water_slag)
hist(r_s)
qqnorm(r_s)
qqline(r_s)

Python code

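# Internally studentised residuals, r_i / (sigma_hat * sqrt(1 - h_ii)), as returned by R's rstandard()
r_s = pd.Series(lm_flow_water_slag.get_influence().resid_studentized_internal)
r_s.hist()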

While it does appear that we have slightly left-skewed data, the departure from Normality seems to arise mostly from a thinner tail on the right.

shapiro.test(r_s)
##
## Shapiro-Wilk normality test
##
## data: r_s
## W = 0.97223, p-value = 0.02882
ks.test(r_s, "pnorm")
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: r_s
## D = 0.08211, p-value = 0.491
## alternative hypothesis: two-sided

At the 5% level, the two Normality tests do not agree on the result either. In any case, we should keep in mind where Normality is needed most: in the hypothesis tests. The estimated model is still valid: it is still the best-fitting model according to the least-squares criterion.

Scatterplots

To understand the model fit better, a set of scatterplots is typically made. These are plots of standardised residuals (on the y-axis) against:

fitted values
explanatory variables, one at a time
potential variables

Residuals are meant to contain only the information that our model cannot explain. Hence, if the model is good, the residuals should only contain random noise. There should be no apparent pattern to them. If we find such a pattern in one of the above plots, we would have some clue as to how we could improve the model.

We typically inspect the plots for the following patterns:

(left to right)

A pattern like this is ideal. Residuals are randomly distributed around zero; there is no pattern or trend in the plot.
The second plot is something rarely seen. It would probably appear if we were to plot residuals against a new variable that is not currently in the model. If we observe this plot, we should then include this variable in the model.
This plot indicates we should include a quadratic term in the model.
The wedge (or funnel) shape indicates that we do not have homoscedasticity. The solution to this is either a transformation of the response or weighted least squares.

Example 9.10 (Concrete Data Residual Plots)

R code

opar <- par(mfrow=c(1,3))
plot(x=fitted(lm_flow_water_slag), r_s, main="Fitted")
plot(x=concrete$Water, r_s, main="X1")
plot(x=concrete$Slag, r_s, main="X2")
par(opar)

SAS Plots

While the plots of residuals versus explanatory variables look satisfactory, the plot of the residuals versus fitted values appears to have a funnel shape. Coupled with the observations about the deviations from Normality of the residuals from the model in Example 9.6 (Concrete Data Multiple Linear Regression), we might want to try a square transform of the response.

Influential Points

The influence of a point on the inference can be judged by how much the inference changes with and without the point. For instance, to assess if point $i$ is influential on coefficient $j$ :

Estimate the model coefficients with all the data points.
Leave out the observation $(Y_{i}, X_{i})$ and re-estimate the model coefficients. Repeat for each observation, one at a time.
Compare $\hat{β}_{j}$ from step 2 with the original estimate from step 1.

While the above method assesses influence on parameter estimates, Cook’s distance performs a similar iteration to assess the influence on the fitted values. Cook’s distance values greater than 1 indicate possibly influential points.
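The leave-one-out idea can be sketched as follows, on synthetic data (an assumption for illustration). Cook's distance for point $i$ is taken here in its usual form: the squared shift in the fitted values when point $i$ is dropped, divided by `k` times $\hat{σ}^{2}$, where `k` is the number of estimated coefficients. The loop is then compared with the well-known shortcut based on residuals and leverages, which avoids refitting.

```python
import numpy as np

# Synthetic data, assumed purely for illustration.
rng = np.random.default_rng(7)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([2.0, 1.0]) + rng.normal(size=n)
k = X.shape[1]

def fit(X_mat, y_vec):
    """OLS coefficients via NumPy's least-squares solver."""
    return np.linalg.lstsq(X_mat, y_vec, rcond=None)[0]

# Full-data fit, residuals and variance estimate.
beta_full = fit(X, y)
resid = y - X @ beta_full
sigma2_hat = resid @ resid / (n - k)

# Leave each observation out in turn and measure the shift in all fitted values.
cooks_loo = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i = fit(X[keep], y[keep])
    shift = X @ (beta_full - beta_i)
    cooks_loo[i] = shift @ shift / (k * sigma2_hat)

# Shortcut using leverages h_ii, with no refitting required.
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
cooks_formula = resid ** 2 * h / (k * sigma2_hat * (1 - h) ** 2)

print(np.max(cooks_loo), np.argmax(cooks_loo))
```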

Example 9.11 (Concrete Data Influential Points)

R code

infl <- influence.measures(lm_flow_water_slag)
summary(infl)

The set of 6 points above appear to be influencing the covariance matrix of the parameter estimates greatly. To proceed, we would typically leave these observations out one-at-a-time to study the impact on our eventual decision.
