MA3518

Part 5 - Multiple Linear Regression


Introduction

Last time we discussed simple linear regression, a model that predicts the value of a dependent variable from a single independent variable. Often in the real world, however, we want to model our dependent variable using multiple independent variables.

Mathematically, this is represented as,

$$ Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \ldots + \beta_p x_{pi} + e_i $$

To obtain a “good” fit, we minimize the residual sum of squares (RSS). However, since we now have multiple coefficients, we need to minimize the RSS with respect to each of them.

$$ RSS = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_{1i} - \ldots - b_p x_{pi})^2 $$

Setting each partial derivative to zero gives,

$$ \begin{align*} \frac{\partial RSS}{\partial b_0} &= -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_{1i} - b_2 x_{2i} - \ldots - b_p x_{pi}) = 0 \newline \frac{\partial RSS}{\partial b_1} &= -2 \sum_{i=1}^{n} x_{1i} (y_i - b_0 - b_1 x_{1i} - b_2 x_{2i} - \ldots - b_p x_{pi}) = 0 \newline & \vdots \newline \frac{\partial RSS}{\partial b_p} &= -2 \sum_{i=1}^{n} x_{pi} (y_i - b_0 - b_1 x_{1i} - b_2 x_{2i} - \ldots - b_p x_{pi}) = 0 \end{align*} $$

Matrix Form

Using the matrix formulation is more convenient,

$$ \mathbf{Y} = \begin{bmatrix} y_1 \newline y_2 \newline \vdots \newline y_n \end{bmatrix}, \mathbf{X} = \begin{bmatrix} 1 & x_{11} & \ldots & x_{1p} \newline 1 & x_{21} & \ldots & x_{2p} \newline \vdots & \vdots & \ddots & \vdots \newline 1 & x_{n1} & \ldots & x_{np} \end{bmatrix}, \beta = \begin{bmatrix} \beta_0 \newline \beta_1 \newline \vdots \newline \beta_p \end{bmatrix}, \mathbf{e} = \begin{bmatrix} e_1 \newline e_2 \newline \vdots \newline e_n \end{bmatrix} $$

Thus, the matrix form of the equation is,

$$ \mathbf{Y} = \mathbf{X} \beta + \mathbf{e} $$

Now, we can write the RSS, denoted as $RSS(\beta)$, in matrix form as,

$$ RSS(\beta) = (\mathbf{Y} - \mathbf{X} \beta)^T (\mathbf{Y} - \mathbf{X} \beta) $$

Setting the derivative of $RSS(\beta)$ to zero gives the normal equations $\mathbf{X}^T \mathbf{X} \hat{\beta} = \mathbf{X}^T \mathbf{Y}$, so the least squares estimator of $\beta$ is,

$$ \hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y} $$
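As a quick sanity check, here is a small R sketch (on simulated data, not any course data set) that computes $\hat{\beta}$ directly from the matrix formula and compares it with lm(),

# Illustrative sketch on simulated data
set.seed(1)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                      # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # (X'X)^{-1} X'Y
fit <- lm(y ~ x1 + x2)
cbind(beta_hat, coef(fit))                 # the two columns agree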

Since multiple linear regression is meant to generalize the simple case, simple linear regression ($p = 1$) should fall out as a special case. Let's verify this.

$$ \mathbf{X}^T \mathbf{X} = \begin{bmatrix} 1 & 1 & \ldots & 1 \newline x_1 & x_2 & \ldots & x_n \end{bmatrix} \begin{bmatrix} 1 & x_1 \newline 1 & x_2 \newline \vdots & \vdots \newline 1 & x_n \end{bmatrix} = \begin{bmatrix} n & \sum_{i=1}^{n} x_i \newline \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 \end{bmatrix} = n \begin{bmatrix} 1 & \bar{x} \newline \bar{x} & \frac{1}{n} \sum_{i=1}^{n} x_i^2 \end{bmatrix} $$

The inverse of this matrix is, $$ (\mathbf{X}^T \mathbf{X})^{-1} = \frac{1}{n \left( \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar{x}^2 \right)} \begin{bmatrix} \frac{1}{n} \sum_{i=1}^{n} x_i^2 & -\bar{x} \newline -\bar{x} & 1 \end{bmatrix} = \frac{1}{S_{xx}} \begin{bmatrix} \frac{1}{n} \sum_{i=1}^{n} x_i^2 & -\bar{x} \newline -\bar{x} & 1 \end{bmatrix} $$

$$ \mathbf{X}^T \mathbf{Y} = \begin{bmatrix} 1 & 1 & \ldots & 1 \newline x_1 & x_2 & \ldots & x_n \end{bmatrix} \begin{bmatrix} y_1 \newline y_2 \newline \vdots \newline y_n \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{n} y_i \newline \sum_{i=1}^{n} x_i y_i \end{bmatrix} $$

Which finally means that $\hat{\beta}$ is,

$$ \begin{align*} \hat{\beta} &= \frac{1}{S_{xx}} \begin{bmatrix} \frac{1}{n} \sum_{i=1}^{n} x_i^2 & -\bar{x} \newline -\bar{x} & 1 \end{bmatrix} \begin{bmatrix} \sum_{i=1}^{n} y_i \newline \sum_{i=1}^{n} x_i y_i \end{bmatrix} \newline &= \frac{1}{S_{xx}} \begin{bmatrix} \frac{1}{n} \sum_{i=1}^{n} y_i \sum_{i=1}^{n} x_i^2 - \bar{x} \sum_{i=1}^{n} x_i y_i \newline \sum_{i=1}^{n} x_i y_i - \bar{x} \sum_{i=1}^{n} y_i \end{bmatrix} \newline &= \begin{bmatrix} \frac{\bar{y} \left[ \sum_{i=1}^n x_i^2 - n\bar{x}^2 \right] - \bar{x} \left[ \sum_{i=1}^n x_i y_i - n\bar{x}\bar{y} \right]}{S_{xx}} \newline \frac{S_{xy}}{S_{xx}} \end{bmatrix} \newline &= \begin{bmatrix} \bar{y} - \frac{S_{xy}}{S_{xx}} \bar{x} \newline \frac{S_{xy}}{S_{xx}} \end{bmatrix} \end{align*} $$

So we recover exactly the $\hat{\beta}_0$ and $\hat{\beta}_1$ from simple linear regression.

Let’s take a look at some properties of the multiple linear regression model.

$$ \begin{align*} \mathbb{E}(\hat{\beta} | \mathbf{X}) &= \beta \newline \text{Var}(\hat{\beta} | \mathbf{X}) &= \sigma^2 (\mathbf{X}^T \mathbf{X})^{-1} \newline S^2 &= \frac{RSS}{n-p-1} = \frac{\sum_{i=1}^{n} \hat{e}_i^2}{n-p-1} \end{align*} $$
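Continuing the simulated sketch above, we can check that vcov() on a fitted model is exactly $S^2 (\mathbf{X}^T \mathbf{X})^{-1}$,

s2 <- summary(fit)$sigma^2                    # S^2 = RSS / (n - p - 1)
max(abs(vcov(fit) - s2 * solve(t(X) %*% X)))  # essentially zero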

Inference

For inference on a single coefficient, we use the $t$-statistic,

$$ T_i = \frac{\hat{\beta_i} - \beta_i}{SE(\hat{\beta_i})} \sim t_{n-p-1} $$
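These $t$-statistics and their $p$-values are exactly what summary() reports; as a small sketch on the simulated fit above,

se <- sqrt(diag(vcov(fit)))   # standard errors of the coefficients
t_stat <- coef(fit) / se      # test statistics for H0: beta_i = 0
p_val <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)
cbind(t_stat, p_val)          # matches summary(fit)$coefficients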

An analysis of variance (ANOVA) approach to inference is also possible,

$$ H_0 : \beta_1 = \ldots = \beta_p = 0 \newline H_1 : \text{At least one } \beta_j \neq 0 $$

Let’s recall some important definitions,

$$ \begin{align*} SST = S_{yy} &= \sum_{i=1}^{n} (y_i - \bar{y})^2 \newline RSS &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \newline SS_{\text{reg}} &= \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \newline SST &= SS_{\text{reg}} + RSS \end{align*} $$

The test statistic is defined as,

$$ F = \frac{SS_{reg}/p}{RSS/(n-p-1)} \sim F_{p, n-p-1} $$

where we reject $H_0$ if $F$ is too large.
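The overall F-test is part of the standard summary() output; as a sketch, we can pull it out and compute its p-value explicitly for the simulated fit above,

fs <- summary(fit)$fstatistic                                  # c(value, numdf, dendf)
pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail = FALSE)  # p-value of the overall F-test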

Recall our analysis of variance table,

| Source of Variation | Degrees of Freedom (df) | Sum of Squares (SS) | Mean Square (MS) | F-statistic |
|---|---|---|---|---|
| Regression | $p$ | $SS_{\text{reg}}$ | $SS_{\text{reg}}/p$ | $F = \frac{SS_{\text{reg}}/p}{RSS/(n - p - 1)}$ |
| Residual | $n - p - 1$ | $RSS$ | $S^2 = \frac{RSS}{n - p - 1}$ | |
| Total | $n - 1$ | $SST = S_{yy}$ | | |

The coefficient of determination and adjusted coefficient of determination are defined as,

$$ \begin{align*} R^2 &= \frac{SS_{\text{reg}}}{SST} = 1 - \frac{RSS}{SST} \newline R^2_{\text{adj}} &= 1 - \frac{RSS/(n-p-1)}{SST/(n-1)} \end{align*} $$

The adjusted coefficient of determination exists because adding variables (even irrelevant ones) never decreases $R^2$. Thus, to compare models with different numbers of variables, we use $R^2_{\text{adj}}$ instead.
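Both quantities are reported by summary(); as a sketch, here they are computed directly from RSS and SST for the simulated fit above,

rss <- sum(residuals(fit)^2)
sst <- sum((y - mean(y))^2)
c(R2 = 1 - rss / sst,
  R2_adj = 1 - (rss / df.residual(fit)) / (sst / (length(y) - 1)))
# compare with summary(fit)$r.squared and summary(fit)$adj.r.squared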

We can also test a subset of coefficients using ANOVA.

$$ H_0 : \beta_1 = \ldots = \beta_k = 0 \newline H_1 : \text{At least one } \beta_j \neq 0 $$

Then the test statistic is,

$$ \begin{align*} F &= \frac{(RSS(\text{reduced}) - RSS(\text{full}))/(df_{\text{reduced}} - df_{\text{full}})}{RSS(\text{full})/df_{\text{full}}} \newline &= \frac{(RSS(\text{reduced}) - RSS(\text{full}))/k}{RSS(\text{full})/(n-p-1)} \end{align*} $$

since $df_{\text{reduced}} - df_{\text{full}} = (n - (p + 1 - k)) - (n - (p + 1)) = k$.
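In R, this partial F-test is what anova(reduced, full) computes; a minimal sketch on the simulated fit above, using the model without x2 as a hypothetical reduced model,

fit_reduced <- lm(y ~ x1)   # hypothetical reduced model, dropping x2
anova(fit_reduced, fit)     # F = (RSS_reduced - RSS_full)/k over RSS_full/(n - p - 1)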

Example

Let’s look at a simple example and the corresponding R code.

Data from surveys of customers of 168 Italian restaurants in the target area are available, with the following variables: $$ \begin{align*} Y &= \text{Price} = \text{the price (in \$US) of a dinner (including 1 drink \& a tip)} \newline x_1 &= \text{Food} = \text{customer rating of the food (out of 30)} \newline x_2 &= \text{Décor} = \text{customer rating of the decor (out of 30)} \newline x_3 &= \text{Service} = \text{customer rating of the service (out of 30)} \newline x_4 &= \text{East} = \text{dummy variable} = 1 \text{ if the restaurant is east of Fifth Avenue, 0 if it is west} \end{align*} $$

So, let’s first develop a regression model that predicts the price of a dinner.

# Assume data is loaded into `nyc`.
attach(nyc)
m1 <- lm(Price ~ Food + Decor + Service + East)
summary(m1)
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|)
> (Intercept)  -24.244248   4.758009  -5.095 9.70e-07 ***
> Food          1.553922   0.372672   4.170 4.98e-05 ***
> Decor         1.897926   0.219100   8.662 4.79e-15 ***
> Service       0.005318   0.399493   0.013 0.9894
> East          2.003432   0.961794   2.083 0.0388 *
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

So, our initial regression model is,

$$ \text{Price} = -24.244248 + 1.553922 \text{Food} + 1.897926 \text{Decor} + 0.005318 \text{Service} + 2.003432 \text{East} $$

We can see that Décor has the largest effect on the price (note that Food, Décor and Service are each measured on the same 0 to 30 scale so it is meaningful to compare the regression coefficients).

We can also observe that, if a new restaurant wants to choose a location, it should be east of Fifth Avenue, since the coefficient of our dummy variable is statistically significantly greater than 0.
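As an illustration, predict() makes the East effect concrete for a hypothetical new restaurant (not part of the survey data) with identical ratings on both sides of Fifth Avenue,

new_rest <- data.frame(Food = 20, Decor = 20, Service = 20, East = c(1, 0))
predict(m1, newdata = new_rest)  # east vs west; prices differ by the East coefficient, about $2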

Interaction Effects

Suppose we want to model $Y$ based on a continuous predictor $x$ and a dummy variable $d$. Compared to using the single variable $x$, the additional effect of $d$ could be one of the following (see the R sketch after this list),

  1. $d$ has no effect on $Y$, that is, $Y = \beta_0 + \beta_1 x + e$.
  2. $d$ produces only an additive change in $Y$, that is, $$ Y = \beta_0 + \beta_1 x + \beta_2 d + e = \begin{cases} \beta_0 + \beta_1 x + e & d = 0 \newline \beta_0 + \beta_2 + \beta_1 x + e & d = 1 \end{cases} $$
  3. $d$ produces an additive change in $Y$ and also changes the size of the effect of $x$ on $Y$, that is, $$ Y = \beta_0 + \beta_1 x + \beta_2 d + \beta_3 xd + e = \begin{cases} \beta_0 + \beta_1 x + e & d = 0 \newline \beta_0 + \beta_2 + (\beta_1 + \beta_3) x + e & d = 1 \end{cases} $$
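As a minimal R sketch (on simulated data with made-up coefficients), these three cases correspond to the following model formulas,

set.seed(2)
x <- runif(100); d <- rbinom(100, 1, 0.5)
Y <- 1 + 2 * x + 0.5 * d + 1.5 * x * d + rnorm(100, sd = 0.2)

lm(Y ~ x)      # case 1: d ignored
lm(Y ~ x + d)  # case 2: additive shift beta_2
lm(Y ~ x * d)  # case 3: expands to x + d + x:d, so beta_3 is the x:d coefficient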

Let’s go back to our example again.

After you present your model to the management team, the debate focuses on the statistical insignificance of the Service variable. The team feels that the model may be too simple to reflect the reality of an Italian restaurant in Manhattan. In particular, there is general consensus amongst the team that restaurants east of Fifth Avenue are very different from those on the west side. As such, you have been asked to consider different models for the east and west sides.

So, we can write the model as,

$$ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 \text{East} + \beta_5 x_1 \text{East} + \beta_6 x_2 \text{East} + \beta_7 x_3 \text{East} + e $$

In R, we fit this model as follows,

m2 <- lm(Price ~ Food + Decor + Service + East + Food:East + Decor:East + Service:East)
summary(m2)
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|)
> (Intercept) -26.9305   8.5195  -3.161 0.00189 **
> Food          0.9986   0.5742   1.739 0.08398 .
> Decor         1.8824   0.3005   6.264 3.45e-09 ***
> Service       0.7566   0.6489   1.166 0.24536
> East          5.5702  10.3285   0.539 0.59044
> Food:East     1.2679   0.7815   1.622 0.10673
> Decor:East   -0.2791   0.4618  -0.604 0.54646
> Service:East -1.2842   0.8230  -1.560 0.12071
> Residual standard error: 5.747 on 157 degrees of freedom
> (3 observations deleted due to missingness)
> Multiple R-squared: 0.6381, Adjusted R-squared: 0.622
> F-statistic: 39.55 on 7 and 157 DF, p-value: < 2.2e-16

If we also run an ANOVA (partial F) test, comparing the full interaction model against the reduced model without the insignificant Service term,

m1 <- lm(Price ~ Food + Decor + East)  # reduced model, dropping Service
anova(m1, m2)
> Analysis of Variance Table
> Model 1: Price ~ Food + Decor + East
> Model 2: Price ~ Food + Decor + Service + East + Food:East + Decor:East + Service:East
>   Res.Df    RSS Df Sum of Sq      F Pr(>F)
> 1    161 5338.9
> 2    157 5186.1  4    152.82 1.1566 0.3322

Since the F-test is not significant ($p = 0.3322$), we are happy to adopt the reduced model.

Diagnostics Summary

The main tools we will use to validate regression assumptions are,

  1. Plots involving standardized residuals and/or fitted values.
  2. Determine leverage points.
  3. Determine which (if any) of the data points are outliers.
  4. Assess the effect of each predictor variable, having adjusted for the effects of the other predictor variables, using added variable plots.
  5. Assess the extent of collinearity among the predictor variables using variance inflation factors.
  6. Examine whether the assumptions of normality and constant variance of the errors are reasonable.

Leverage Points and Residuals

Recall from simple linear regression that leverage points are points with extreme values of the predictor variable.

$$ \mathbf{\hat{Y}} = \mathbf{X} \hat{\beta} = \mathbf{X}(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y} = \mathbf{H} \mathbf{Y}, $$

where $\mathbf{H} = \mathbf{X}(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T$ (also called the hat matrix).

Let $h_{ij}$ be the $(i, j)$-entry of $\mathbf{H}$, then,

$$ \hat{Y_i} = h_{ii} Y_i + \sum_{j \neq i} h_{ij} Y_j $$

The rule of thumb for flagging leverage points is,

$$ h_{ii} > 2 \times \text{average}(h_{ii}) = 2 \times \frac{p+1}{n} $$
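In R, the diagonal entries $h_{ii}$ are available via hatvalues(); a sketch of this rule of thumb applied to the restaurant model m1 from the example,

h <- hatvalues(m1)                  # diagonal entries h_ii of the hat matrix
p <- length(coef(m1)) - 1           # number of predictors
which(h > 2 * (p + 1) / length(h))  # observations flagged as leverage points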

The residuals are defined as,

$$ \mathbf{\hat{e}} = \mathbf{Y} - \mathbf{\hat{Y}} = (\mathbf{I} - \mathbf{H}) \mathbf{Y} $$

We can show that,

$$ \text{Var}(\mathbf{\hat{e}} | \mathbf{X}) = \sigma^2(\mathbf{I} - \mathbf{H}), $$

and that the standardized residuals are,

$$ r_i = \frac{\hat{e_i}}{s \sqrt{1 - h_{ii}}} $$

where,

$$ s = \sqrt{\frac{\sum_{i=1}^n \hat{e}_i^2}{n - p - 1}} $$

Any pattern in a residual plot indicates that an incorrect model has been fit, but the pattern in general does not provide direct information on how the model is misspecified.
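As a sketch, standardized residuals are available via rstandard(); plotting them against the fitted values of, say, the restaurant model m1 is a quick way to look for such patterns,

plot(fitted(m1), rstandard(m1),
     xlab = "Fitted values", ylab = "Standardized residuals")
abline(h = c(-2, 0, 2), lty = 2)  # rough reference lines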

Added Variable Plots

Assume we originally have the model,

$$ \mathbf{Y} = \mathbf{X} \beta + \mathbf{e}. $$

With an additional predictor, we now consider,

$$ \mathbf{Y} = \mathbf{X} \beta + \mathbf{Z} \alpha + \mathbf{e}, $$

where,

$$ \mathbf{Z} = \begin{bmatrix} z_1 \newline z_2 \newline \vdots \newline z_n \end{bmatrix} $$

The procedure is as follows,

  1. Perform the regression $$ \mathbf{Y} = \mathbf{X} \beta + \mathbf{e} $$ to get the residuals $\mathbf{e}_{\mathbf{Y}.\mathbf{X}}$.

  2. Perform the regression $$ \mathbf{Z} = \mathbf{X} \beta + \mathbf{e} $$ to get the residuals $\mathbf{e}_{\mathbf{Z}.\mathbf{X}}$.

  3. Plot $\mathbf{e}_{\mathbf{Y}.\mathbf{X}}$ (on the $y$-axis) against $\mathbf{e}_{\mathbf{Z}.\mathbf{X}}$ (on the $x$-axis).
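The car package (also used below for the inverse response plot) automates this three-step procedure; a minimal sketch, assuming a fitted model such as m1 from the example,

library(car)  # avPlots() comes from the car package
avPlots(m1)   # one added-variable plot per predictor in the model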

Transformations

We can transform only $Y$ using the inverse response plot.

Assume the true model is actually,

$$ Y = g(\beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p + e), $$

the inverse model is thus,

$$ g^{-1}(Y) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p + e $$

To use the inverse response plot, an important assumption is that the predictors are pairwise linearly related.

Example: Defective rates.

We want to develop a model for the number of defectives based on $x_1$: temperature, $x_2$: density, and $x_3$: production rate.

To check the pairwise linear relationships, we can use a scatterplot matrix.

# Scatterplot matrix of the response and the predictors (data assumed attached)
pairs(~ Defective + Temperature + Density + Rate)

To get the inverse response plot, we can use the following code.

library(car)  # inverseResponsePlot() comes from the car package
m1 <- lm(Defective ~ Temperature + Density + Rate)
inverseResponsePlot(m1)

Collinearity of Predictors

When highly correlated predictor variables are included, they effectively carry very similar information about the response variable.

Thus, it is difficult for least squares to distinguish their separate effects on the response variable.

A common symptom is that some of the coefficients in the regression model have the opposite sign to what we expect.

Consider the multiple regression model,

$$ Y = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p + e. $$

Let $R_j^2$ be the coefficient of determination $R^2$ obtained when regressing $x_j$ on the other predictors.

Then it can be shown that,

$$ \text{Var}(\hat{\beta_j}) = \frac{1}{1 - R_j^2} \frac{\sigma^2}{(n - 1) S_{x_j}^2} $$

$\frac{1}{1 - R_j^2}$ is called the variance inflation factor.

A rough guide for identifying large VIF is to use the cut-off value 5.
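The car package computes these directly; a minimal sketch, assuming a fitted lm object such as the defectives model m1 above,

library(car)  # vif() comes from the car package
vif(m1)       # values above roughly 5 suggest problematic collinearity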

What do you do when collinearity exists?

  1. Do nothing but be careful of interpretation.
  2. Remove highly correlated variables (keep only one of them).