In this part we will introduce linear models for classification including logistic regression, linear discriminant analysis (LDA), and quadratic discriminant analysis (QDA).
The Classification Problem
Intuition (Classification Problem)
In classification, the goal is, given a training set D:={(x1,y1=,…,(xN,yN)}, with yi∈{C1,…,Ck}, assign (classify) a new example x to one of the classes Ci.
Usually, for convenience yi∈{1,…,K}.
More mathematically, we can think of it as dividing the input space into decision regions Ri,i∈[K] (points in Ri are assigned to class Ci).
Thus, for linear models, the decision boundaries (surfaces) are linear functions of x (i.e., (d−1)-dimensional hyperplanes).
Intuition (Modeling Approaches)
There are three main approaches to modeling classification problems.
Discriminative Deterministic Models: Model directly the deterministic mapping between x and y via a parametrized function μ(x,w) (discriminant function).
However, a more powerful approach is to model p(Ck∣x) in an inference stage, then use this distribution to make optimal decisions.
Discriminative Probabilistic Models: Model p(Ck∣x) via a parametrized model.
Generative Probabilistic Models: Model p(x,y) specifying p(y) (p(Ck)) and p(x∣y) (p(x∣Ck)). Then,
p(Ck∣x)=p(x)p(x∣Ck)p(Ck)Intuition (Generalized Linear Models and Activation Functions)
Recall, in the linear regression setting, μ(x,w) is a linear function of w, where in the simplest case we had μ(x,w)=wTx+w0.
However, in classification, we need to ensure that the output is a discrete range or the posterior probabilities (i.e., [0,1] and sum to one).
Thus, generalized linear models are models where μ(x,w)=a(wTx), where a(⋅) is an activation function that maps the output of the linear function to the desired range.
For example, in binary classification, using a discriminative deterministic model, the simplest form would be,
y^(x,w)=sign(wTx+w0)
where sign(⋅) is the sign function, i.e.,
{x∈C1,x∈C2,if μ(x,w)≥0if μ(x,w)<0
In the more general multi-class setting, a single K-class discriminant with K linear functions,
μk(x)=wkTx+wk,0
and thus,
y^(x,W)=kargmaxμk(x)Note
One can interpret μk(x) as p(y=k∣x).
Further, the decision boundary between Ck and Cj is given by the hyperplane,
{x:(wk−wj)Tx+(wk,0−wj,0)=0}
Linear (Probabilistic) Models for Classification
As we have discussed, discriminative probabilistic models aim at modeling p(Ck∣x) directly.
Which are potentially more powerful than deterministic models since it allows us to model sources of uncertainty in the label assignment to input variables.
We will consider linear decision boundaries (in the input space) but also applying a nonlinear transformation ϕ=(ϕ0,…,ϕM) to x, then work with ϕ(x) instead of x.
For clarity, decision boundaries linear in the feature space ϕ(x) are also called linear models, but they are nonlinear in the input space x.
Thus, inspired by the linear regression model, we consider p(y=C1∣x) of the form,
p(y=C1∣x,w)=f(wTx),0≤f(⋅)≤1
A popular choice is the logit (or logistic sigmoid) function,
f(x)=1+exp(x)exp(x)=1+exp(−x)1=:σ(x)Logistic Sigmoid FunctionExample 1 (Logistic Regression for Binary Classification)
In the binary classification setting, one can derive,
However, this has no closed-form solution, but NLL has unique minimum unless classes are perfectly separated by a hyperplane.
The NLL function is also concave and the derivatives and Hessian can easily be computed ⇒ We can use iterative optimization methods such as gradient descent or Newton-Raphson to find w⋆.
Recall (Maximum A Posteriori Learning)
ML can lead to overfitting when too many parameters are used compared to training examples. That is, a large value of ∥w∥ a manifestation of overfitting.
By introducing a prior on w that gives lower probability to larger values, e.g., w∼N(0,α−1I), we can use MAP learning to mitigate overfitting.
Further, recall that the MAP criterion is defined as,
wMAP:=wargmaxLD(w)+Nλ∥w∥2.Note
Note that ML learning will only return a single w. Thus, to deal with uncertainty in estimating w, we need to determine the posterior distribution p(w∣D).
Bayesian Logistic Regression
Intuition (Bayesian Logistic Regression)
We want to compute the full posterior p(w∣D). However, exact Bayesian inference for logistic regression is not possible, as exact evaluation of p(w∣D) is intractable.
Thus, this is more complex than normal linear regression models (cannot integrate exactly over w since posterior is not Gaussian).
Thus, we need to introduce some approximations such as Laplace approximation.
Definition 1 (The Laplace Approximation)
The Laplace approximation aimds to find a Gaussian approximation to a posterior distribution.
In the case of a single continuous random variable z,
p(z)=Z1f(z)
where Z (unknown) is the normalization constant (Z=∫f(z)dz).
Thus, the goal is to find a Gaussian approximation q(z) centered on a mode of p(z).
Find a mode of p(z) (i.e., p′(z0)=0)
Note
Since the logarithm of a Gaussian is a quadratic function, we can equivalently find the mode of logp(z).
Taylor series expansion of logf(z) around z0 up to second order,
Thus, the Laplace approximation of p(w∣D) is given by,
q(w)=N(w∣wMAP,SN)
Ultimately, we want to predict y⇒ Want to learn a posterior predictive distribution p(y∣D,x)=p(y∣D,ϕ(x)).
p(y∣D,x)=∫posterior dist. of wp(w∣D)p(y∣x,w)dw.
Since p(y∣x,w) is associated with each value of w weighted by the posterior belief,
p(w∣D)=p(yD∣xD)p(w)p(yD∣xD,w).
Given ϕ (or x), we can forecast y via the predictive distribution.
Definition 2 (Predictive Distribution)
The predictive distribution for class C1, p(y=C1∣D,ϕ) is given by,
p(y=C1∣D,ϕ)=∫p(C1∣ϕ,w)p(w∣D)dw.
Since p(y=C1∣ϕ,w)=σ(wTϕ) and p(w∣D)≈q(w)=N(w∣wMAP,SN), we obtain,
p(C1∣D,ϕ)≈∫σ(wTϕ)q(w)dw=∫σ(a)N(a∣μa,σa2)da
where,
aμaσa2=wTϕ=E[a]=wMAPTϕ=Var[a]=ϕTSNϕ
However, this integral cannot be evaluated analytically. But we can exploit that σ(a) is similar to the probit function Φ(a) (CDF of standard normal distribution) to obtain an approximation,
In multi-class classification with K classes, we can generalize logistic regression using the softmax function,
p(y=i∣x=x)=∑j=1Kexp(wjTx)exp(wiTx)Multi-Class Linear and Quadratic Regression
Generative Probabilistic Models for Classification
Intuition (Generative Probabilistic Models for Classification)
As we have discussed, making more assumptions about the data by considering distribution of covariates x, we may suffer from bias when our model is incorrectly selected.
The capability to capture properties of x can improve learning if p(x∣y) has significant structure.
Since different models for p(x∣Ci) lead to different generative models, generative models are typically defined by assuming,
x∣y=y∼exponential(ηy)
i.e., the exponential family,
p(x∣η)=Z(η)1h(x)exp(ηTu(x))
For example, Gaussian Discriminant Analysis (GDA), the class-conditional density is modeled by a multivariate Gaussian.
Definition 3 (Linear Discriminant Analysis (LDA))
in LDA, the principle is that x∈Rd is Gaussian distributed for each class Ck, with covariance matrix Σk=Σ (shared across classes),
Thus, fitting an LDA model to data is equivalent to estimating the class means μk, the shared covariance matrix Σ, and the class priors p(Ck).
As the name suggests, LDA gives rise to linear decision boundaries between classes.
Example 2 (LDA Decision Boundary in the Two-Class Case)logp(y=Cj∣x)p(y=Ci∣x)=logp(x∣Cj)p(y=Cj)p(x∣Ci)p(y=Ci)=logp(x∣Cj)p(x∣Ci)+logp(y=Cj)p(y=Ci)=−21μiTΣ−1μi+21μjTΣ−1μj+xTΣ−1(μi−μj)+logp(y=Cj)p(y=Ci)=w0+wTx
Two alternative ways to realize quadratic decision boundaries are:
Apply QDA in the original d-dimensional space x=(x1,…,xd).
Apply LDA in an augmented spasce of dimension d2+1, x′=(x1,…,xd,x12,…,xd2,x1x2,…,xd−1xd).
Since, linear functions in the augmented space corresponds to quadratic functions in the original d-dimensional space.