Given a feature vector x∈RN, predict its corresponding output vale y∈R.
Example: Curve Fitting
Given M points sampled from some underlying curve (assuming no measurement noise), the goal is to find that curve.
But let’s first think about this problem.
Would a small handful of points work well, or them or the better?
Given that the number of points are fixed, do their locations matter?
Given a set of M points, is curve fitting unique?
Lastly, how do we determine a “seemingly best” curve, compared to other possibilities?
I would say the answer to 1-3 are quite obvious, but the fourth one is the deal-breaker.
Answers
We always like more samples, if possible.
Samples need to be representative or informative.
Estimating continuous model from discrete data is an ill-posed problem and in most cases cannot be certain.
You need to have some criteria to choose your preferred model.
The Regression Learning Problem
Definition of the regression learning problem,
Given a data set of example pairs D={(x(i),y(i)),i=1,…,M} where x(i)∈RN is a feature vector and y(i)∈R is the output, learn a function f:RN↦R that accurately predicts y for any feature vector x.
Okay, so we have our goal defined. Let’s define our error.
Definition of error measure, in this case the mean squared error (MSE),
Given a data set of example pairs D={(x(i),y(i)),i=1,…,M} and a function f:RN↦R, the MSE of f on D is defined as,
MSE(D,f)=M1i=1∑M(y(i)−f(x(i)))2
Linear Regression
If we are in 1-D, the output y is a linear function of input feature x.
I.e.,
y=wx+b
where w is the slope and b is the intercept.
Let’s generalize this to N dimensions.
In the N-D case, the output y is a linear combination of N input features x1,x2,…,xN.
I.e.,
y=w1x1+w2x2+…+wNxN+w0
Or equivalently,
y=wTx+w0=i=1∑Nwixi+w0=i=0∑Nwixix0=1
x∈RN is the vector of input values.
w∈RN are the weights of the linear function and w0 is the intercept (bias term).
Ordinary Least Squares (OLS)
The linear function has form f(x)=wTx+b.
How do we estimate the parameters (w,b) from the data?
Ordinary Least Squares (OLS) selects the linear regression parameters to minimize the MSE on the training data set.
w⋆,b⋆=w,bargminM1i=1∑M(y(i)−wTx(i)−b)2
Solving OLS For One Feature
Let’s solve this for one feature, i.e., take the derivate of the objective with respect to w and set it to zero.
We call X†=(XTX)−1XT the pseudo-inverse (generalization of inverse to non-square matrix).
See for a square invertible matrix X, we have X†=X−1(XT)−1XT=X−1.
What if XTX∈R(N+1)×(N+1) is non-invertible?
If M<N+1, add more data or enforce regularization.
If M≥N+1, remove redundant features or enforce regularization.
What if N is too large? We can use gradient descent,
w(t+1)=w(t)−αXt(Xw(t)−y)
What if M is too large? Use mini-batch gradient descent.
w(t+1)=w(t)−α(x,y)∈B∑x(xTw(t)−y)
Regularized Linear Regression
As we have discussed previously, regularization is a technique to prevent overfitting.
But also to obtain numerically more stable solutions (e.g., when XTX is not invertible), and enforce the desired parameter space as a prior (e.g., select a subset of features that are good for prediction by enforcing the weights of irrelevant features to zero).
But the price we pay for having regularization is that the regularization term will bias the least squares estimate.
Ridge Regression
Let’s add regularization to OLS,
wmin21∥Xw−y∥22+2α∥w∥22
The first term is the data-fit term, we sum the squared error of the prediction.
The second term is the regularization term, ∥w∥22=∑j=1N+1wj2 penalizes large weights.
α is the hyperparameter that controls the amount of “shrinkage”. Larger α means more shrinkage, α=0 is OLS.
This also has a probabilistic interpretation,
p(w∣X;y,α)∝p(y∣w,X;σ2)p(w;α)
But, Why?
Why should we penalize large weights in the first place?
Equivalently, why are smaller weights (i.e., weights that are close to zero) more preferred than larger weights?
Reason 1: Smaller weights are more robust to perturbations of input features.
Consider f1(x)=2x+0.1 and f2(x)=100x−98.
If we add a small perturbation to x→x+0.1.
Then, Δf1(x)=0.2 and Δf2(x)=10.
Reason 2
There are better chances to zero out some input features x that are redundant or uninformative, leading to more accurate prediction of y.
Ridge Regression Closed-Form Solution
The closed-form solution for Ridge regression is,
w⋆=(XTX+αI)−1XTy
The term “ridge regression” comes from the closed-form solution, where a “ridge” is added to the diagonal of XTX.
Lasso Regression
With ridge regression, some weights are small but still non-zero. These are less important, but somehow still necessary.
To get a better shrinkage to zero, we can change the regularization term to encourge more weights to be zero.
The Lasso regression is defined as,
wmin21∥Xw−y∥22+α∥w∥1
Least Absolute Shrinkage and Selection Operator, or LASSO.
Keeps the same data-fit term, but changes the regularization term.
We take the sum of absolute weight values ∥w∥1=∑j=1N∣wj∣.
When a weight is close to zero, the regularization term will force it to be equal to zero.
This is a convex optimization problem, with no closed-form solution in general.
However, it can be solved efficiently using an algorithm called least angle regression.
Why Shrinkage?
Under the orthogonal design (XTX=I), we have,
Linear Regression: wmin21∥Xw−y∥22,
wL=(XTX)−1XTy=XTy
Ridge Regression: wmin21∥Xw−y∥22+2α∥w∥22,
wRR=(XTX+αI)−1XTy=1+αXTy
Which is just
1+αwL
Lasso Regression: wmin21∥Xw−y∥22+α∥w∥1,
wLR=wargmin21∥Xw−y∥22+α∥w∥1
Which is just,
sign(wL)max(∣wL−α∣,0)
Linear Regression Summary
OLS needs at least M≥N+1 data cases to learn a model with a N dimensional feature vector.
Ridge and Lasso work when XTX is close to singular.
E.g., caused by co-linear features (xi≈axj+b).
MSE objective function for OLS, Ridge, and Lasso is sensitive to noise and outliers.
Regularization (Ridge and Lasso) can prevent very large weights, and reduce the possibility of overfitting to outliers.
All fail when output ynon-linearly depends on input features x.