Part 5 - Regression

The Regression Task

Definition of the regression task,

Given a feature vector xRN\mathbf{x} \in \mathbb{R}^N, predict its corresponding output vale yRy \in \mathbb{R}.

Example: Curve Fitting

Given MM points sampled from some underlying curve (assuming no measurement noise), the goal is to find that curve.

But let’s first think about this problem.

  • Would a small handful of points work well, or them or the better?
  • Given that the number of points are fixed, do their locations matter?
  • Given a set of MM points, is curve fitting unique?
  • Lastly, how do we determine a “seemingly best” curve, compared to other possibilities?

I would say the answer to 1-3 are quite obvious, but the fourth one is the deal-breaker.

Answers

  • We always like more samples, if possible.
  • Samples need to be representative or informative.
  • Estimating continuous model from discrete data is an ill-posed problem and in most cases cannot be certain.
  • You need to have some criteria to choose your preferred model.

The Regression Learning Problem

Definition of the regression learning problem,

Given a data set of example pairs D={(x(i),y(i)),i=1,,M}\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)}), i = 1, \ldots, M \} where x(i)RN\mathbf{x}^{(i)} \in \mathbb{R}^N is a feature vector and y(i)Ry^{(i)} \in \mathbb{R} is the output, learn a function f:RNRf : \mathbb{R}^N \mapsto \mathbb{R} that accurately predicts yy for any feature vector x\mathbf{x}.

Okay, so we have our goal defined. Let’s define our error.

Definition of error measure, in this case the mean squared error (MSE),

Given a data set of example pairs D={(x(i),y(i)),i=1,,M}\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)}), i = 1, \ldots, M \} and a function f:RNRf : \mathbb{R}^N \mapsto \mathbb{R}, the MSE of ff on D\mathcal{D} is defined as,

MSE(D,f)=1Mi=1M(y(i)f(x(i)))2MSE(\mathcal{D}, f) = \frac{1}{M} \sum_{i=1}^{M} (y^{(i)} - f(\mathbf{x}^{(i)}))^2

Linear Regression

If we are in 11-D, the output yy is a linear function of input feature xx. I.e.,

y=wx+by = wx + b

where ww is the slope and bb is the intercept.

Let’s generalize this to NN dimensions.

In the NN-D case, the output yy is a linear combination of NN input features x1,x2,,xNx_1, x_2, \ldots, x_N. I.e.,

y=w1x1+w2x2++wNxN+w0y = w_1x_1 + w_2x_2 + \ldots + w_Nx_N + w_0

Or equivalently,

y=wTx+w0=i=1Nwixi+w0=i=0Nwixi  x0=1y = \mathbf{w}^T \mathbf{x} + w_0 = \sum_{i=1}^{N} w_i x_i + w_0 = \sum_{i=0}^{N} w_i x_i \ \bigg\rvert \ x_0 = 1

xRN\mathbf{x} \in \mathbb{R}^{N} is the vector of input values.

wRN\mathbf{w} \in \mathbb{R}^{N} are the weights of the linear function and w0w_0 is the intercept (bias term).

Ordinary Least Squares (OLS)

The linear function has form f(x)=wTx+bf(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b.

How do we estimate the parameters (w,b)(\mathbf{w}, b) from the data?

Ordinary Least Squares (OLS) selects the linear regression parameters to minimize the MSE on the training data set.

w,b=argminw,b1Mi=1M(y(i)wTx(i)b)2\mathbf{w}^{\star}, b^{\star} = \underset{\mathbf{w}, b}{\arg \min} \frac{1}{M} \sum_{i=1}^{M} (y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)} - b)^2
Solving OLS For One Feature

Let’s solve this for one feature, i.e., take the derivate of the objective with respect to ww and set it to zero.

w1Mi=1M(y(i)wx(i)b)2=021Mi=1M(y(i)wx(i)b)(x(i))=0(i=1M(x(i))2)w+(i=1Mx(i))b=i=1My(i)x(i)\frac{\partial}{\partial w} \frac{1}{M} \sum_{i=1}^{M} (y^{(i)} - w x^{(i)} - b)^2 = 0 \newline 2 \frac{1}{M} \sum_{i=1}^{M} (y^{(i)} - w x^{(i)} - b)(-x^{(i)}) = 0 \newline \left( \sum_{i=1}^{M} (x^{(i)})^2 \right) w + \left( \sum_{i=1}^{M} x^{(i)} \right) b = \sum_{i=1}^{M} y^{(i)} x^{(i)}

Now, let’s do the same for the intercept bb.

b1Mi=1M(y(i)wx(i)b)2=021Mi=1M(y(i)wx(i)b)(1)=0(i=1Mx(i))w+Mb=i=1My(i)\frac{\partial}{\partial b} \frac{1}{M} \sum_{i=1}^{M} (y^{(i)} - w x^{(i)} - b)^2 = 0 \newline 2 \frac{1}{M} \sum_{i=1}^{M} (y^{(i)} - w x^{(i)} - b)(-1) = 0 \newline \left( \sum_{i=1}^{M} x^{(i)} \right) w + M b = \sum_{i=1}^{M} y^{(i)}

We can now write the two equations in matrix form,

[i=1M(x(i))2i=1Mx(i)i=1Mx(i)M][wb]=[i=1My(i)x(i)i=1My(i)]\begin{bmatrix} \sum_{i=1}^{M} (x^{(i)})^2 & \sum_{i=1}^{M} x^{(i)} \newline \sum_{i=1}^{M} x^{(i)} & M \end{bmatrix} \begin{bmatrix} w \newline b \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{M} y^{(i)} x^{(i)} \newline \sum_{i=1}^{M} y^{(i)} \end{bmatrix}

Solving for the parameters,

[wb]=[i=1M(x(i))2i=1Mx(i)i=1Mx(i)M]1[i=1My(i)x(i)i=1My(i)]\begin{bmatrix} w \newline b \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{M} (x^{(i)})^2 & \sum_{i=1}^{M} x^{(i)} \newline \sum_{i=1}^{M} x^{(i)} & M \end{bmatrix}^{-1} \begin{bmatrix} \sum_{i=1}^{M} y^{(i)} x^{(i)} \newline \sum_{i=1}^{M} y^{(i)} \end{bmatrix}
General OLS Solution

In matrix notation, XRM×(N+1)\mathbf{X} \in \mathbb{R}^{M \times (N+1)} with one data case x(i)RN+1\mathbf{x}^{(i)} \in \mathbb{R}^{N+1} per row.

X=[(x(1))T(x(2))T(x(M))T],\mathbf{X} = \begin{bmatrix} -- & (\mathbf{x}^{(1)})^T & -- \newline -- & (\mathbf{x}^{(2)})^T & -- \newline & \vdots & \newline -- & (\mathbf{x}^{(M)})^T & -- \end{bmatrix},

where x(i)RN+1\mathbf{x}^{(i)} \in \mathbb{R}^{N+1} and we have defined x0(i)=1x_0^{(i)} = 1.

yRM\mathbf{y} \in \mathbb{R}^{M} is a column vector containing the corresponding outputs,

y=[y(1)y(2)y(M)]\mathbf{y} = \begin{bmatrix} y^{(1)} \newline y^{(2)} \newline \vdots \newline y^{(M)} \end{bmatrix}

and wRN+1\mathbf{w} \in \mathbb{R}^{N+1}, where w0=bw_0 = b.

This yields the general solution for w=RN+1\mathbf{w} = \in \mathbb{R}^{N+1},

w=argminw1Mi=1M(y(i)(x(i))Tw)2=argminw1M(yXw)T(yXw)\begin{align*} \mathbf{w}^{\star} & = \underset{\mathbf{w}}{\arg \min} \frac{1}{M} \sum_{i=1}^{M} (y^{(i)} - (\mathbf{x}^{(i)})^T \mathbf{w})^2 \newline & = \underset{\mathbf{w}}{\arg \min} \frac{1}{M} (\mathbf{y} - \mathbf{X} \mathbf{w})^T (\mathbf{y} - \mathbf{X} \mathbf{w}) \end{align*}

Taking the derivative with respect to w\mathbf{w} and setting it to zero,

w1M(yXw)T(yXw)=0w1M(yT(Xw)T)(yXw)=0w1M(yTyyTXw(Xw)Ty+(Xw)TXw)=0w1M(yTyyTXwwTXTy+wTXTXw)=0(AB)T=BTATw1M(yTyyTXwyTXw+wTXTXw)=0(AB)T=BTATw1M(yTy2yTXw+wTXTXw)=0\begin{align*} \frac{\partial}{\partial \mathbf{w}} \frac{1}{M} (\mathbf{y} - \mathbf{X} \mathbf{w})^T (\mathbf{y} - \mathbf{X} \mathbf{w}) & = 0 \newline \frac{\partial}{\partial \mathbf{w}} \frac{1}{M} (y^T - (\mathbf{X} \mathbf{w})^T)(\mathbf{y} - \mathbf{X} \mathbf{w}) & = 0 \newline \frac{\partial}{\partial \mathbf{w}} \frac{1}{M} (y^T y - y^T \mathbf{X} \mathbf{w} - (\mathbf{X} \mathbf{w})^T \mathbf{y} + (\mathbf{X} \mathbf{w})^T \mathbf{X} \mathbf{w}) & = 0 \newline \frac{\partial}{\partial \mathbf{w}} \frac{1}{M} (y^T y - y^T \mathbf{X} \mathbf{w} - \mathbf{w}^T \mathbf{X}^T \mathbf{y} + \mathbf{w}^T \mathbf{X}^T \mathbf{X} \mathbf{w}) & = 0 \bigg\rvert (AB)^T = B^T A^T \newline \frac{\partial}{\partial \mathbf{w}} \frac{1}{M} (y^T y - y^T \mathbf{X} \mathbf{w} - y^T \mathbf{X}\mathbf{w} + \mathbf{w}^T \mathbf{X}^T \mathbf{X} \mathbf{w}) & = 0 \bigg\rvert (AB)^T = B^T A^T \newline \frac{\partial}{\partial \mathbf{w}} \frac{1}{M} (y^T y - 2 y^T \mathbf{X} \mathbf{w} + \mathbf{w}^T \mathbf{X}^T \mathbf{X} \mathbf{w}) & = 0 \newline \end{align*}

Now, the first term is a constant, so we can ignore it. For the second term, we can use the identity,

wATw=A\frac{\partial}{\partial \mathbf{w}} A^T \mathbf{w} = A

For the third term, we can use the identity,

wwTAw=2Aw\frac{\partial}{\partial \mathbf{w}} \mathbf{w}^T A \mathbf{w} = 2 A \mathbf{w}

If AA is symmetric, i.e., A=ATA = A^T.

Which yields,

2M(XTy+XTXw)=0XTXw=XTyw=(XTX)1XTy\begin{align*} \frac{2}{M} (-\mathbf{X}^T \mathbf{y} + \mathbf{X}^T \mathbf{X} \mathbf{w}) & = 0 \newline \mathbf{X}^T \mathbf{X} \mathbf{w} & = \mathbf{X}^T \mathbf{y} \newline \mathbf{w}^{\star} & = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} \end{align*}

Connection to Probabilistic Models

The same solution can be derived as the MLE of the parameters under a conditional Gaussian model.

Assumption, the output (or the observation) y\mathbf{y} is from a deterministic function with additive Gaussian noise,

y=wTx+ϵ, where p(ϵ,σ2)=N(ϵ;0,σ2I)\mathbf{y} = \mathbf{w}^T \mathbf{x} + \mathbf{\epsilon}, \ \text{where} \ p(\mathbf{\epsilon}, \sigma^2) = \mathcal{N}(\mathbf{\epsilon}; 0, \sigma^2 \mathbf{I})

This implies that,

p(yX;w,σ2)=p(yXw;σ2)=N(yXw;0,σ2I)=N(y;Xw,σ2I)p(\mathbf{y} | \mathbf{X}; \mathbf{w}, \sigma^2) = p(\mathbf{y} - \mathbf{X} \mathbf{w}; \sigma^2) = \mathcal{N}(\mathbf{y} - \mathbf{X} \mathbf{w}; 0, \sigma^2 \mathbf{I}) = \mathcal{N}(\mathbf{y}; \mathbf{X} \mathbf{w}, \sigma^2 \mathbf{I})

Or equivalently,

p(yX;w,σ2)=1(2πσ2)Mexp(12σ2(yXw)T(yXw))p(\mathbf{y} | \mathbf{X}; \mathbf{w}, \sigma^2) = \frac{1}{\sqrt{(2\pi \sigma^2)^{M}}} \exp \left( -\frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X} \mathbf{w})^T (\mathbf{y} - \mathbf{X} \mathbf{w}) \right)

Let’s maximize the log likelihood,

argmaxw,σ2logp(yX;w,σ2)=argmaxw,σ2M2log(2π)M2log(σ2)12σ2(yXw)T(yXw)=argminw,σ2M2log(σ2)+12σ2(yXw)T(yXw)\begin{align*} \underset{\mathbf{w}, \sigma^2}{\arg \max} \log p(\mathbf{y} | \mathbf{X}; \mathbf{w}, \sigma^2) & = \underset{\mathbf{w}, \sigma^2}{\arg \max} -\frac{M}{2} \log(2\pi) - \frac{M}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X} \mathbf{w})^T (\mathbf{y} - \mathbf{X} \mathbf{w}) \newline & = \underset{\mathbf{w}, \sigma^2}{\arg \min} \frac{M}{2} \log(\sigma^2) + \frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X} \mathbf{w})^T (\mathbf{y} - \mathbf{X} \mathbf{w}) \end{align*}

Note that solving for w\mathbf{w} is independent of σ2\sigma^2.

w=argminw12(yXw)T(yXw)\mathbf{w}^{\star} = \underset{\mathbf{w}}{\arg \min} \frac{1}{2} (\mathbf{y} - \mathbf{X} \mathbf{w})^T (\mathbf{y} - \mathbf{X} \mathbf{w})

More on Linear Regression

Take a closer look at the closed-form solution,

w=(XTX)1XTy\mathbf{w}^{\star} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}

We call X=(XTX)1XT\mathbf{X}^{\dagger} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T the pseudo-inverse (generalization of inverse to non-square matrix). See for a square invertible matrix X\mathbf{X}, we have X=X1(XT)1XT=X1\mathbf{X}^{\dagger} = \mathbf{X}^{-1} (\mathbf{X}^T)^{-1} \mathbf{X}^T = \mathbf{X}^{-1}.

What if XTXR(N+1)×(N+1)\mathbf{X}^T \mathbf{X} \in \mathbb{R}^{(N+1) \times (N+1)} is non-invertible?

  • If M<N+1M < N + 1, add more data or enforce regularization.
  • If MN+1M \geq N + 1, remove redundant features or enforce regularization.

What if NN is too large? We can use gradient descent,

w(t+1)=w(t)αXt(Xw(t)y)\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \alpha \mathbf{X}^t (\mathbf{X} \mathbf{w}^{(t)} - \mathbf{y})

What if MM is too large? Use mini-batch gradient descent.

w(t+1)=w(t)α(x,y)Bx(xTw(t)y)\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \alpha \sum_{(\mathbf{x}, y) \in \mathcal{B}} \mathbf{x} (\mathbf{x}^T \mathbf{w}^{(t)} - y)

Regularized Linear Regression

As we have discussed previously, regularization is a technique to prevent overfitting. But also to obtain numerically more stable solutions (e.g., when XTX\mathbf{X}^T\mathbf{X} is not invertible), and enforce the desired parameter space as a prior (e.g., select a subset of features that are good for prediction by enforcing the weights of irrelevant features to zero).

But the price we pay for having regularization is that the regularization term will bias the least squares estimate.

Ridge Regression

Let’s add regularization to OLS,

minw12Xwy22+α2w22\underset{\mathbf{w}}{\min} \frac{1}{2} \Vert \mathbf{X} \mathbf{w} - \mathbf{y} \Vert_2^2 + \frac{\alpha}{2} \Vert \mathbf{w} \Vert_2^2

The first term is the data-fit term, we sum the squared error of the prediction.

The second term is the regularization term, w22=j=1N+1wj2\Vert \mathbf{w} \Vert_2^2 = \sum_{j=1}^{N+1} w_j^2 penalizes large weights.

α\alpha is the hyperparameter that controls the amount of “shrinkage”. Larger α\alpha means more shrinkage, α=0\alpha = 0 is OLS.

This also has a probabilistic interpretation,

p(wX;y,α)p(yw,X;σ2)p(w;α)p(\mathbf{w} | \mathbf{X}; \mathbf{y}, \alpha) \propto p(\mathbf{y} | \mathbf{w}, \mathbf{X}; \sigma^2) p(\mathbf{w}; \alpha)
But, Why?

Why should we penalize large weights in the first place?

Equivalently, why are smaller weights (i.e., weights that are close to zero) more preferred than larger weights?

  • Reason 1: Smaller weights are more robust to perturbations of input features.

    • Consider f1(x)=2x+0.1f_1(x) = 2x + 0.1 and f2(x)=100x98f_2(x) = 100x-98.
    • If we add a small perturbation to xx+0.1x \rightarrow x + 0.1.
    • Then, Δf1(x)=0.2\Delta f_1(x) = 0.2 and Δf2(x)=10\Delta f_2(x) = 10.
  • Reason 2

    • There are better chances to zero out some input features xx that are redundant or uninformative, leading to more accurate prediction of yy.

Ridge Regression Closed-Form Solution

The closed-form solution for Ridge regression is,

w=(XTX+αI)1XTy\mathbf{w}^{\star} = (\mathbf{X}^T \mathbf{X} + \alpha \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y}

The term “ridge regression” comes from the closed-form solution, where a “ridge” is added to the diagonal of XTX\mathbf{X}^T \mathbf{X}.

Lasso Regression

With ridge regression, some weights are small but still non-zero. These are less important, but somehow still necessary.

To get a better shrinkage to zero, we can change the regularization term to encourge more weights to be zero.

The Lasso regression is defined as,

minw12Xwy22+αw1\underset{\mathbf{w}}{\min} \frac{1}{2} \Vert \mathbf{X} \mathbf{w} - \mathbf{y} \Vert_2^2 + \alpha \Vert \mathbf{w} \Vert_1

Least Absolute Shrinkage and Selection Operator, or LASSO.

Keeps the same data-fit term, but changes the regularization term. We take the sum of absolute weight values w1=j=1Nwj\Vert \mathbf{w} \Vert_1 = \sum_{j=1}^{N} |w_j|.

When a weight is close to zero, the regularization term will force it to be equal to zero.

This is a convex optimization problem, with no closed-form solution in general. However, it can be solved efficiently using an algorithm called least angle regression.

Why Shrinkage?

Under the orthogonal design (XTX=I)(\mathbf{X}^T \mathbf{X} = \mathbf{I}), we have,

Linear Regression: minw12Xwy22\underset{\mathbf{w}}{\min} \frac{1}{2} \Vert \mathbf{X} \mathbf{w} - \mathbf{y} \Vert_2^2,

wL=(XTX)1XTy=XTy\mathbf{w}_L = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} = \mathbf{X}^T \mathbf{y}

Ridge Regression: minw12Xwy22+α2w22\underset{\mathbf{w}}{\min} \frac{1}{2} \Vert \mathbf{X} \mathbf{w} - \mathbf{y} \Vert_2^2 + \frac{\alpha}{2} \Vert \mathbf{w} \Vert_2^2,

wRR=(XTX+αI)1XTy=XTy1+α\mathbf{w}_{RR} = (\mathbf{X}^T \mathbf{X} + \alpha \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y} = \frac{\mathbf{X}^T \mathbf{y}}{1 + \alpha}

Which is just

wL1+α\frac{\mathbf{w}_L}{1 + \alpha}

Lasso Regression: minw12Xwy22+αw1\underset{\mathbf{w}}{\min} \frac{1}{2} \Vert \mathbf{X} \mathbf{w} - \mathbf{y} \Vert_2^2 + \alpha \Vert \mathbf{w} \Vert_1,

wLR=argminw12Xwy22+αw1\mathbf{w}_{LR} = \underset{\mathbf{w}}{\arg \min} \frac{1}{2} \Vert \mathbf{X} \mathbf{w} - \mathbf{y} \Vert_2^2 + \alpha \Vert \mathbf{w} \Vert_1

Which is just,

sign(wL)max(wLα,0)sign(\mathbf{w}_L) \max(|\mathbf{w}_L - \alpha|, 0)

Linear Regression Summary

  • OLS needs at least MN+1M \geq N + 1 data cases to learn a model with a NN dimensional feature vector.
  • Ridge and Lasso work when XTX\mathbf{X}^T \mathbf{X} is close to singular.
    • E.g., caused by co-linear features (xiaxj+b)(x_i \approx ax_j + b).
  • MSE objective function for OLS, Ridge, and Lasso is sensitive to noise and outliers.
    • Regularization (Ridge and Lasso) can prevent very large weights, and reduce the possibility of overfitting to outliers.
  • All fail when output yy non-linearly depends on input features x\mathbf{x}.