Part 4 & 5 - Linear Models for Classification

Introduction

In this part we will introduce linear models for classification including logistic regression, linear discriminant analysis (LDA), and quadratic discriminant analysis (QDA).

The Classification Problem

Intuition (Classification Problem)

In classification, the goal is, given a training set D{(x1,y1=,,(xN,yN)}\mathcal{D} \coloneqq \{(\mathbf{x}_1, y_1=, \ldots, (\mathbf{x}_N , y_N)\}, with yi{C1,,Ck}y_i \in \{C_1, \ldots, C_k\}, assign (classify) a new example x\mathbf{x} to one of the classes CiC_i. Usually, for convenience yi{1,,K}y_i \in \{1, \ldots, K\}.

More mathematically, we can think of it as dividing the input space into decision regions Ri,i[K]\mathcal{R}_i, i \in [K] (points in Ri\mathcal{R}_i are assigned to class CiC_i). Thus, for linear models, the decision boundaries (surfaces) are linear functions of x\mathbf{x} (i.e., (d1)(d - 1)-dimensional hyperplanes).

Intuition (Modeling Approaches)

There are three main approaches to modeling classification problems.

Discriminative Deterministic Models: Model directly the deterministic mapping between x\mathbf{x} and yy via a parametrized function μ(x,w)\mu(\mathbf{x}, \mathbf{w}) (discriminant function).

However, a more powerful approach is to model p(Ckx)p(C_k \mid \mathbf{x}) in an inference stage, then use this distribution to make optimal decisions.

Discriminative Probabilistic Models: Model p(Ckx)p(C_k \mid \mathbf{x}) via a parametrized model.

Generative Probabilistic Models: Model p(x,y)p(\mathbf{x}, y) specifying p(y)p(y) (p(Ck)p(C_k)) and p(xy)p(\mathbf{x} \mid y) (p(xCk)p(\mathbf{x} \mid C_k)). Then,

p(Ckx)=p(xCk)p(Ck)p(x)p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k) p(C_k)}{p(\mathbf{x})}
Intuition (Generalized Linear Models and Activation Functions)

Recall, in the linear regression setting, μ(x,w)\mu(\mathbf{x}, \mathbf{w}) is a linear function of w\mathbf{w}, where in the simplest case we had μ(x,w)=wTx+w0\mu(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T \mathbf{x} + w_0. However, in classification, we need to ensure that the output is a discrete range or the posterior probabilities (i.e., [0,1][0, 1] and sum to one). Thus, generalized linear models are models where μ(x,w)=a(wTx)\mu(\mathbf{x}, \mathbf{w}) = a(\mathbf{w}^T \mathbf{x}), where a()a(\cdot) is an activation function that maps the output of the linear function to the desired range.

For example, in binary classification, using a discriminative deterministic model, the simplest form would be,

y^(x,w)=sign(wTx+w0)\hat{y}(\mathbf{x}, \mathbf{w}) = \mathrm{sign}(\mathbf{w}^T \mathbf{x} + w_0)

where sign()\mathrm{sign}(\cdot) is the sign function, i.e.,

{xC1,if μ(x,w)0xC2,if μ(x,w)<0\begin{cases} \mathbf{x} \in C_1, & \text{if } \mu(\mathbf{x}, \mathbf{w}) \geq 0 \newline \mathbf{x} \in C_2, & \text{if } \mu(\mathbf{x}, \mathbf{w}) < 0 \end{cases}

In the more general multi-class setting, a single KK-class discriminant with KK linear functions,

μk(x)=wkTx+wk,0\mu_k(\mathbf{x}) = \mathbf{w}_k^T \mathbf{x} + w_{k, 0}

and thus,

y^(x,W)=argmaxk μk(x)\hat{y}(\mathbf{x}, \mathbf{W}) = \underset{k}{\arg\max} \ \mu_k(\mathbf{x})
Note

One can interpret μk(x)\mu_k(\mathbf{x}) as p(y=kx)p(y = k \mid \mathbf{x}). Further, the decision boundary between CkC_k and CjC_j is given by the hyperplane,

{x:(wkwj)Tx+(wk,0wj,0)=0}\{\mathbf{x} : (\mathbf{w}_k - \mathbf{w}_j)^T \mathbf{x} + (w_{k, 0} - w_{j, 0}) = 0\}

Linear (Probabilistic) Models for Classification

As we have discussed, discriminative probabilistic models aim at modeling p(Ckx)p(C_k \mid \mathbf{x}) directly. Which are potentially more powerful than deterministic models since it allows us to model sources of uncertainty in the label assignment to input variables.

We will consider linear decision boundaries (in the input space) but also applying a nonlinear transformation ϕ=(ϕ0,,ϕM)\boldsymbol{\phi} = (\phi_0, \ldots, \phi_M) to x\mathbf{x}, then work with ϕ(x)\boldsymbol{\phi}(\mathbf{x}) instead of x\mathbf{x}. For clarity, decision boundaries linear in the feature space ϕ(x)\boldsymbol{\phi}(\mathbf{x}) are also called linear models, but they are nonlinear in the input space x\mathbf{x}.

Thus, inspired by the linear regression model, we consider p(y=C1x)p(\mathsf{y} = C_1 \mid \mathbf{x}) of the form,

p(y=C1x,w)=f(wTx),0f()1p(\mathsf{y} = C_1 \mid \mathbf{x}, \mathbf{w}) = f(\mathbf{w}^T \mathbf{x}), \quad 0 \leq f(\cdot) \leq 1

A popular choice is the logit (or logistic sigmoid) function,

f(x)=exp(x)1+exp(x)=11+exp(x)σ(x)f(x) = \frac{\exp(x)}{1 + \exp(x)} = \frac{1}{1 + \exp(-x)} \eqqcolon \sigma(x)
Logistic Sigmoid Function
Logistic Sigmoid Function
Example 1 (Logistic Regression for Binary Classification)

In the binary classification setting, one can derive,

p(y=C1ϕ,x,w)=p(y=C1ϕ(x),w)=exp(wTϕ(x))1+exp(wTϕ(x))=11+exp(wTϕ(x))σ(wTϕ(x))\begin{align*} p(\mathsf{y} = C_1 \mid \boldsymbol{\phi}, \mathbf{x}, \mathbf{w}) & = p(\mathsf{y} = C_1 \mid \boldsymbol{\phi}(\mathbf{x}), \mathbf{w}) \newline & = \frac{\exp(\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}))}{1 + \exp(\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}))} \newline & = \frac{1}{1 + \exp(-\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}))} \eqqcolon \sigma(\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})) \newline \end{align*}

For an MM-dimensional feature space ϕ\boldsymbol{\phi}, i.e., MM adjustable parameters.

Thus, for inference, the misclassification error is minimized by,

{xC1,if p(C1ϕ(x),w)>12xC2,if p(C1ϕ(x),w)<12\begin{cases} \mathbf{x} \in C_1, & \text{if } p(C_1 \mid \boldsymbol{\phi}(\mathbf{x}), \mathbf{w}) > \frac{1}{2} \newline \mathbf{x} \in C_2, & \text{if } p(C_1 \mid \boldsymbol{\phi}(\mathbf{x}), \mathbf{w}) < \frac{1}{2} \end{cases}

Or equivalently,

{xC1,if wTϕ(x)>0xC2,if wTϕ(x)<0\begin{cases} \mathbf{x} \in C_1, & \text{if } \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) > 0 \newline \mathbf{x} \in C_2, & \text{if } \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) < 0 \end{cases}

Thus, the decision boundary is given by the hyperplane wTϕ(x)=0\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) = 0.

As previously, for learning we use ML learning,

w=argmaxw p(yDxD,w)\mathbf{w}^{\star} = \underset{\mathbf{w}}{\arg\max} \ p(y_{\mathcal{D}} \mid x_{\mathcal{D}}, \mathbf{w})

where the likelihood function is defined as,

p(yDxD,w)i=1Np(yixi,w)=i=1Np(yiϕ(xi),w)=i=1Np(y=C1ϕ(xi),w)yi(1p(y=C1ϕ(xi),w))1yi\begin{align*} p(y_{\mathcal{D}} \mid x_{\mathcal{D}}, \mathbf{w}) & \coloneqq \prod_{i=1}^{N} p(y_i \mid \mathbf{x}_i, \mathbf{w}) = \prod_{i=1}^{N} p(y_i \mid \boldsymbol{\phi}(\mathbf{x}_i), \mathbf{w}) \newline & = \prod_{i=1}^{N} p(\mathsf{y} = C_1 \mid \boldsymbol{\phi}(\mathbf{x}_i), \mathbf{w})^{y_i} (1 - p(\mathsf{y} = C_1 \mid \boldsymbol{\phi}(\mathbf{x}_i), \mathbf{w}))^{1 - y_i} \newline \end{align*}

Since log()log(\cdot) is a monotonic function, we can equivalently maximize the log-likelihood, which will yield us the negative log-likelihood,

log(p(yDxD,w))=i=1N(yilogp(y=C1ϕ(xi),w)+(1yi)log(1p(y=C1ϕ(xi),w)))-\log(p(y_{\mathcal{D}} \mid x_{\mathcal{D}}, \mathbf{w})) = -\sum_{i=1}^{N} \left(y_i \log p(\mathsf{y} = C_1 \mid \boldsymbol{\phi}(\mathbf{x}_i), \mathbf{w}) + (1 - y_i) \log (1 - p(\mathsf{y} = C_1 \mid \boldsymbol{\phi}(\mathbf{x}_i), \mathbf{w})) \right)

Thus, the ML estimate is given by,

w=argminw log(p(yDxD,w))=argminw i=1N(yilog(p(y=C1ϕ(xi),w))+(1yi)log(1p(y=C1ϕ(xi),w)))\begin{align*} \mathbf{w}^{\star} & = \underset{\mathbf{w}}{\arg\min} \ -\log(p(y_{\mathcal{D}} \mid x_{\mathcal{D}}, \mathbf{w})) \newline & = \underset{\mathbf{w}}{\arg\min} \ -\sum_{i=1}^{N} \left(y_i \log(p(\mathsf{y} = C_1 \mid \boldsymbol{\phi}(\mathbf{x}_i), \mathbf{w})) + (1 - y_i) \log(1 - p(\mathsf{y} = C_1 \mid \boldsymbol{\phi}(\mathbf{x}_i), \mathbf{w})) \right) \newline \end{align*}

Which is the cross-entropy loss function.

However, this has no closed-form solution, but NLL has unique minimum unless classes are perfectly separated by a hyperplane. The NLL function is also concave and the derivatives and Hessian can easily be computed \Rightarrow We can use iterative optimization methods such as gradient descent or Newton-Raphson to find w\mathbf{w}^{\star}.

Recall (Maximum A Posteriori Learning)

ML can lead to overfitting when too many parameters are used compared to training examples. That is, a large value of w\Vert \mathbf{w} \Vert a manifestation of overfitting. By introducing a prior on w\mathbf{w} that gives lower probability to larger values, e.g., wN(0,α1I)\mathbf{w} \sim \mathcal{N}(0, \alpha^{-1} \mathbf{I}), we can use MAP learning to mitigate overfitting. Further, recall that the MAP criterion is defined as,

wMAPargmaxw LD(w)+λNw2.\mathbf{w}_{\text{MAP}} \coloneqq \underset{\mathbf{w}}{\arg\max} \ L_{\mathcal{D}}(\mathbf{w}) + \frac{\lambda}{N} \Vert \mathbf{w} \Vert^2.
Note

Note that ML learning will only return a single w\mathbf{w}. Thus, to deal with uncertainty in estimating w\mathbf{w}, we need to determine the posterior distribution p(wD)p(\mathbf{w} \mid \mathcal{D}).

Bayesian Logistic Regression

Intuition (Bayesian Logistic Regression)

We want to compute the full posterior p(wD)p(\mathbf{w} \mid \mathcal{D}). However, exact Bayesian inference for logistic regression is not possible, as exact evaluation of p(wD)p(\mathbf{w} \mid \mathcal{D}) is intractable.

p(wD)p(w)p(yDxD,w)=p(w)i=1Np(yiϕ(xi),w)\begin{align*} p(\mathbf{w} \mid \mathcal{D}) & \propto p(\mathbf{w}) p(y_{\mathcal{D}} \mid x_{\mathcal{D}}, \mathbf{w}) \newline & = p(\mathbf{w}) \prod_{i=1}^{N} p(y_i \mid \boldsymbol{\phi}(\mathbf{x}_i), \mathbf{w}) \newline \end{align*}

Thus, this is more complex than normal linear regression models (cannot integrate exactly over w\mathbf{w} since posterior is not Gaussian). Thus, we need to introduce some approximations such as Laplace approximation.

Definition 1 (The Laplace Approximation)

The Laplace approximation aimds to find a Gaussian approximation to a posterior distribution. In the case of a single continuous random variable z\mathsf{z},

p(z)=1Zf(z)p(z) = \frac{1}{Z} f(z)

where ZZ (unknown) is the normalization constant (Z=f(z) dzZ = \int f(z) \ dz). Thus, the goal is to find a Gaussian approximation q(z)q(z) centered on a mode of p(z)p(z).

  1. Find a mode of p(z)p(z) (i.e., p(z0)=0p'(z_0) = 0)
Note

Since the logarithm of a Gaussian is a quadratic function, we can equivalently find the mode of logp(z)\log p(z).

  1. Taylor series expansion of logf(z)\log f(z) around z0z_0 up to second order,
logf(z)logf(z0)+(zz0)ddzlogf(z)z=z0+(zz0)212d2dz2logf(z)z=z0=logf(z0)+(zz0)212d2dz2logf(z)z=z0=logf(z0)12A(zz0)2\begin{align*} \log f(z) & \approx \log f(z_0) + (z - z_0) \frac{d}{dz} \log f(z) \bigg|_{z = z_0} + (z - z_0)^2 \frac{1}{2} \frac{d^2}{dz^2} \log f(z) \bigg|_{z = z_0} \newline & = \log f(z_0) + (z - z_0)^2 \frac{1}{2} \frac{{d^2}}{dz^2} \log f(z) \bigg|_{z = z_0} \newline & = \log f(z_0) - \frac{1}{2} A(z - z_0)^2 \newline \end{align*}

where A=d2dz2logf(z)z=z0A = -\frac{d^2}{dz^2} \log f(z) \big|_{z = z_0}. Thus, we obtain,

f(z)f(z0)exp(A2(zz0)2)f(z) \approx f(z_0) \exp \left(-\frac{A}{2} (z - z_0)^2 \right)

and the normalized distribution as,

q(z)=(A2π)1/2exp(A2(zz0)2)=N(zz0,A1)\begin{align*} q(z) & = \left(\frac{A}{2 \pi}\right)^{1/2} \exp \left(-\frac{A}{2} (z - z_0)^2 \right) \newline & = \mathcal{N}(z \mid z_0, A^{-1}) \newline \end{align*}

In higher dimensions, p(z)=1Zf(z)p(\mathbf{z}) = \frac{1}{Z} f(\mathbf{z}).

  1. We find a mode of p(z)p(\mathbf{z}) (i.e., f(z=0)\nabla f(\mathbf{z} = 0)).

  2. Taylor series expansion of logf(z)\log f(\mathbf{z}) around centered on mode z0z_0,

logf(z)logf(z0)12(zz0)TA(zz0)\log f(\mathbf{z}) \approx \log f(\mathbf{z}_0) - \frac{1}{2} (\mathbf{z} - \mathbf{z}_0)^T \mathbf{A} (\mathbf{z} - \mathbf{z}_0)

with A\mathbf{A} the d×dd \times d Hessian matrix (evaluated at z=z0\mathbf{z} = \mathbf{z}_0),

A=logf(z)z=z0=[2logf(z)z122logf(z)z1zd2logf(z)zdz12logf(z)zd2]\mathbf{A} = -\nabla \nabla \log f(\mathbf{z}) \bigg|_{\mathbf{z} = \mathbf{z}_0} = \begin{bmatrix} \frac{\partial^2 \log f(\mathbf{z})}{\partial z_1^2} & \cdots & \frac{\partial^2 \log f(\mathbf{z})}{\partial z_1 \partial z_d} \newline \vdots & \ddots & \vdots \newline \frac{\partial^2 \log f(\mathbf{z})}{\partial z_d \partial z_1} & \cdots & \frac{\partial^2 \log f(\mathbf{z})}{\partial z_d^2} \newline \end{bmatrix}

Thus, we obtain,

f(z)f(z0)exp(12(zz0)TA(zz0))f(\mathbf{z}) \approx f(\mathbf{z}_0) \exp \left(-\frac{1}{2} (\mathbf{z} - \mathbf{z}_0)^T \mathbf{A} (\mathbf{z} - \mathbf{z}_0) \right)

and the normalized distribution as,

q(z)=A1/2(2π)d/2exp(12(zz0)TA(zz0))=N(zz0,A1)\begin{align*} q(\mathbf{z}) & = \frac{|\mathbf{A}|^{1/2}}{(2 \pi)^{d/2}} \exp \left(-\frac{1}{2} (\mathbf{z} - \mathbf{z}_0)^T \mathbf{A} (\mathbf{z} - \mathbf{z}_0) \right) \newline & = \mathcal{N}(\mathbf{z} \mid \mathbf{z}_0, \mathbf{A}^{-1}) \newline \end{align*}
Intuition (Bayesian Logistic Regression (Continued))

Thus, we can use a Laplace approximation, but we also need to select a prior for w\mathbf{w} \Rightarrow a Gaussian prior,

p(w)=N(wm0,S0)(N(w0,α1I))p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0) \quad \left(\mathcal{N}(\mathbf{w} \mid 0, \alpha^{-1} \mathbf{I})\right)

Thus, we again, define the log-likelihood and proceed as follows,

  1. We maximize p(wD)p(\mathbf{w} \mid \mathcal{D}) (MAP solution wMAP\mathbf{w}_{\text{MAP}}) \Rightarrow defines the mean of the Gaussian.

  2. Covariance is given by,

SN1=logp(wD)=S01+i=1Nyi(1yi)ϕ(xi)ϕ(xi)T\mathbf{S}_N^{-1} = -\nabla \nabla \log p(\mathbf{w} \mid \mathcal{D}) = \mathbf{S}_0^{-1} + \sum_{i=1}^{N} y_i (1 - y_i) \boldsymbol{\phi}(\mathbf{x}_i) \boldsymbol{\phi}(\mathbf{x}_i)^T

Thus, the Laplace approximation of p(wD)p(\mathbf{w} \mid \mathcal{D}) is given by,

q(w)=N(wwMAP,SN)q(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{w}_{\text{MAP}}, \mathbf{S}_N)

Ultimately, we want to predict yy \Rightarrow Want to learn a posterior predictive distribution p(yD,x)=p(yD,ϕ(x))p(y \mid \mathbf{D}, \mathbf{x}) = p(y \mid \mathbf{D}, \boldsymbol{\phi}(\mathbf{x})).

p(yD,x)=p(wD)posterior dist. of wp(yx,w) dw.p(y \mid \mathcal{D}, \mathbf{x}) = \int \underbrace{p(\mathbf{w} \mid \mathcal{D})}_{\text{posterior dist. of } \mathbf{w}} p(y \mid \mathbf{x}, \mathbf{w}) \ d\mathbf{w}.

Since p(yx,w)p(y \mid \mathbf{x}, \mathbf{w}) is associated with each value of w\mathbf{w} weighted by the posterior belief,

p(wD)=p(w)p(yDxD,w)p(yDxD).p(\mathbf{w} \mid \mathcal{D}) = \frac{p(\mathbf{w}) p(y_{\mathcal{D}} \mid x_{\mathcal{D}}, \mathbf{w})}{p(y_{\mathcal{D}} \mid x_{\mathcal{D}})}.

Given ϕ\boldsymbol{\phi} (or x\mathbf{x}), we can forecast yy via the predictive distribution.

Definition 2 (Predictive Distribution)

The predictive distribution for class C1C_1, p(y=C1D,ϕ)p(y = C_1 \mid \mathcal{D}, \boldsymbol{\phi}) is given by,

p(y=C1D,ϕ)=p(C1ϕ,w)p(wD) dw.p(y = C_1 \mid \mathcal{D}, \boldsymbol{\phi}) = \int p(C_1 \mid \boldsymbol{\phi}, \mathbf{w}) p(\mathbf{w} \mid \mathcal{D}) \ d\mathbf{w}.

Since p(y=C1ϕ,w)=σ(wTϕ)p(\mathsf{y} = C_1 \mid \boldsymbol{\phi}, \mathbf{w}) = \sigma(\mathbf{w}^T \boldsymbol{\phi}) and p(wD)q(w)=N(wwMAP,SN)p(\mathbf{w} \mid \mathcal{D}) \approx q(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{w}_{\text{MAP}}, \mathbf{S}_N), we obtain,

p(C1D,ϕ)σ(wTϕ)q(w) dw=σ(a)N(aμa,σa2) da\begin{align*} p(C_1 \mid \mathcal{D}, \boldsymbol{\phi}) & \approx \int \sigma(\mathbf{w}^T \boldsymbol{\phi}) q(\mathbf{w}) \ d\mathbf{w} \newline & = \int \sigma(a) \mathcal{N}(a \mid \mu_a, \sigma_a^2) \ da \newline \end{align*}

where,

a=wTϕμa=E[a]=wMAPTϕσa2=Var[a]=ϕTSNϕ\begin{align*} a & = \mathbf{w}^T \boldsymbol{\phi} \newline \mu_a & = \mathbb{E}[a] = \mathbf{w}_{\text{MAP}}^T \boldsymbol{\phi} \newline \sigma_a^2 & = \mathrm{Var}[a] = \boldsymbol{\phi}^T \mathbf{S}_N \boldsymbol{\phi} \newline \end{align*}

However, this integral cannot be evaluated analytically. But we can exploit that σ(a)\sigma(a) is similar to the probit function Φ(a)\Phi(a) (CDF of standard normal distribution) to obtain an approximation,

Φ(a)aN(x0,1) dx\Phi(a) \coloneqq \int_{-\infty}^{a} \mathcal{N}(x \mid 0, 1) \ dx

σ(a)\sigma(a) is well approximated by Φ(λa)\Phi(\lambda a) with λ=π/8\lambda = \sqrt{\pi / 8},

Φ(λa)N(aμ,σ2) da=Φ(μ(λ2+σ2)1/2)\int \Phi(\lambda a) \mathcal{N}(a \mid \mu, \sigma^2) \ da = \Phi \left( \frac{\mu}{\left(\lambda^{-2} + \sigma^2\right)^{1/2}} \right)

Thus,

p(C1D,ϕ)σ(a)N(aμa,σa2) daΦ(μa(λ2+σa2)1/2)σ(μa(1+πσa28)1/2)\begin{align*} p(C_1 \mid \mathcal{D}, \boldsymbol{\phi}) & \approx \int \sigma(a) \mathcal{N}(a \mid \mu_a, \sigma_a^2) \ da \newline & \approx \Phi \left( \frac{\mu_a}{\left(\lambda^{-2} + \sigma_a^2\right)^{1/2}} \right) \newline & \approx \sigma \left( \frac{\mu_a}{\left(1 + \frac{\pi \sigma_a^2}{8}\right)^{1/2}} \right) \newline \end{align*}

Multi-Class Classification

Intuition (Multi-Class Logistic Regression)

In multi-class classification with KK classes, we can generalize logistic regression using the softmax function,

p(y=ix=x)=exp(wiTx)j=1Kexp(wjTx)p(\mathsf{y} = i \mid \mathsf{\mathbf{x}} = \mathbf{x}) = \frac{\exp(\mathbf{w}_i^T \mathbf{x})}{\sum_{j=1}^{K} \exp(\mathbf{w}_j^T \mathbf{x})}
Multi-Class Linear and Quadratic Regression
Multi-Class Linear and Quadratic Regression

Generative Probabilistic Models for Classification

Intuition (Generative Probabilistic Models for Classification)

As we have discussed, making more assumptions about the data by considering distribution of covariates x\mathbf{x}, we may suffer from bias when our model is incorrectly selected. The capability to capture properties of x\mathbf{x} can improve learning if p(xy)p(\mathbf{x} \mid y) has significant structure.

Since different models for p(xCi)p(\mathbf{x} \mid C_i) lead to different generative models, generative models are typically defined by assuming,

xy=yexponential(ηy)\mathbf{x} \mid \mathsf{y} = y \sim \mathrm{exponential}(\eta_y)

i.e., the exponential family,

p(xη)=1Z(η)h(x)exp(ηTu(x))p(\mathbf{x} \mid \boldsymbol{\eta}) = \frac{1}{Z(\boldsymbol{\eta})} h(\mathbf{x}) \exp(\boldsymbol{\eta}^T \mathbf{u}(\mathbf{x}))

For example, Gaussian Discriminant Analysis (GDA), the class-conditional density is modeled by a multivariate Gaussian.

Definition 3 (Linear Discriminant Analysis (LDA))

in LDA, the principle is that xRd\mathbf{x} \in \mathbb{R}^d is Gaussian distributed for each class CkC_k, with covariance matrix Σk=Σ\mathbf{\Sigma}_k = \mathbf{\Sigma} (shared across classes),

xy=CkN(xμk,Σ)\mathbf{x} \mid \mathsf{y} = C_k \sim \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \mathbf{\Sigma})

or, equivalently,

p(xy=Ck)=1(2π)d/2Σ1/2exp(12(xμk)TΣ1(xμk))p(\mathbf{x} \mid \mathsf{y} = C_k) = \frac{1}{(2 \pi)^{d/2} |\mathbf{\Sigma}|^{1/2}} \exp \left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) \right)

Thus, fitting an LDA model to data is equivalent to estimating the class means μk\boldsymbol{\mu}_k, the shared covariance matrix Σ\mathbf{\Sigma}, and the class priors p(Ck)p(C_k). As the name suggests, LDA gives rise to linear decision boundaries between classes.

Example 2 (LDA Decision Boundary in the Two-Class Case)logp(y=Cix)p(y=Cjx)=logp(xCi)p(y=Ci)p(xCj)p(y=Cj)=logp(xCi)p(xCj)+logp(y=Ci)p(y=Cj)=12μiTΣ1μi+12μjTΣ1μj+xTΣ1(μiμj)+logp(y=Ci)p(y=Cj)=w0+wTx\begin{align*} \log \frac{p(\mathsf{y} = C_i \mid \mathbf{x})}{p(\mathsf{y} = C_j \mid \mathbf{x})} & = \log \frac{p(\mathbf{x} \mid C_i) p(\mathsf{y} = C_i)}{p(\mathbf{x} \mid C_j) p(\mathsf{y} = C_j)} \newline & = \log \frac{p(\mathbf{x} \mid C_i)}{p(\mathbf{x} \mid C_j)} + \log \frac{p(\mathsf{y} = C_i)}{p(\mathsf{y} = C_j)} \newline & = -\frac{1}{2} \boldsymbol{\mu}_i^T \mathbf{\Sigma}^{-1} \boldsymbol{\mu}_i + \frac{1}{2} \boldsymbol{\mu}_j^T \mathbf{\Sigma}^{-1} \boldsymbol{\mu}_j + \mathbf{x}^T \mathbf{\Sigma}^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j) + \log \frac{p(\mathsf{y} = C_i)}{p(\mathsf{y} = C_j)} \newline & = w_0 + \mathbf{w}^T \mathbf{x} \newline \end{align*}

where,

w=Σ1(μiμj)w0=12μiTΣ1μi+12μjTΣ1μj+logp(y=Ci)p(y=Cj)\begin{align*} \mathbf{w} & = \mathbf{\Sigma}^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j) \newline w_0 & = -\frac{1}{2} \boldsymbol{\mu}_i^T \mathbf{\Sigma}^{-1} \boldsymbol{\mu}_i + \frac{1}{2} \boldsymbol{\mu}_j^T \mathbf{\Sigma}^{-1} \boldsymbol{\mu}_j + \log \frac{p(\mathsf{y} = C_i)}{p(\mathsf{y} = C_j)} \newline \end{align*}

Thus, the decision boundaries are points for which p(y=Cix)=p(y=Cjx)p(\mathsf{y} = C_i \mid \mathbf{x}) = p(\mathsf{y} = C_j \mid \mathbf{x}) \Rightarrow {x:w0+wTx=0}\{\mathbf{x} : w_0 + \mathbf{w}^T \mathbf{x} = 0\}, which is a hyperplane.

Definition 4 (Quadratic Discriminant Analysis (QDA))

In QDA, the principle is that xRd\mathbf{x} \in \mathbb{R}^d is Gaussian distributed for each class CkC_k, with covariance matrix Σk\mathbf{\Sigma}_k (different across classes),

xy=CkN(xμk,Σk)\mathbf{x} \mid \mathsf{y} = C_k \sim \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \mathbf{\Sigma}_k)

or, equivalently,

p(xy=Ck)=1(2π)d/2Σk1/2exp(12(xμk)TΣk1(xμk))p(\mathbf{x} \mid \mathsf{y} = C_k) = \frac{1}{(2 \pi)^{d/2} |\mathbf{\Sigma}_k|^{1/2}} \exp \left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^T \mathbf{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) \right)

Thus, the discriminant functions are,

logp(y=Cix)p(y=Cjx)=12logΣi12(xμi)TΣi1(xμi)+logp(Ci)p(Cj)\log \frac{p(\mathsf{y} = C_i \mid \mathbf{x})}{p(\mathsf{y} = C_j \mid \mathbf{x})} = -\frac{1}{2} \log |\mathbf{\Sigma}_i| - \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^T \mathbf{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) + \log \frac{p(C_i)}{p(C_j)}

The decision boundaries are quadratic functions of x\mathbf{x} (since covariance matrices differ across classes).

Intuition (GDA Learning)

In the two-class case, we need to fit model and determine values (μ1,μ2,Σ,π)(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \mathbf{\Sigma}, \mathbf{\pi}),

(μ1,μ2,Σ,π)=argmax(μ1,μ2,Σ,π) p(Dμ1,μ2,Σ,π)(\boldsymbol{\mu}_1^{\star}, \boldsymbol{\mu}_2^{\star}, \mathbf{\Sigma}^{\star}, \pi^{\star}) = \underset{(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \mathbf{\Sigma}, \pi)}{\arg\max} \ p(\mathcal{D} \mid \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \mathbf{\Sigma}, \pi)

I will not show all the derivations (I’ve already done this once), but the ML estimates are given by,

π=N1Nμ1=1N1i=1Nyixiμ2=1N2i=1N(1yi)xiΣ=N1NS1+N2NS2\begin{align*} \pi^{\star} & = \frac{N_1}{N} \newline \boldsymbol{\mu}_1^{\star} & = \frac{1}{N_1} \sum_{i=1}^{N} y_i \mathbf{x}_i \newline \boldsymbol{\mu}_2^{\star} & = \frac{1}{N_2} \sum_{i=1}^{N} (1 - y_i) \mathbf{x}_i \newline \mathbf{\Sigma}^{\star} & = \frac{N_1}{N} \mathbf{S}_1 + \frac{N_2}{N} \mathbf{S}_2 \newline \end{align*}

where,

N1=i=1NyiN2=i=1N(1yi)S1=1N1i=1;yi=C1N(xiμ1)(xiμ1)TS2=1N2i=1;yi=C2N(xiμ2)(xiμ2)T\begin{align*} N_1 & = \sum_{i=1}^{N} y_i \newline N_2 & = \sum_{i=1}^{N} (1 - y_i) \newline \mathbf{S}_1 & = \frac{1}{N_1} \sum_{i=1; \newline y_i = C_1}^{N} (\mathbf{x}_i - \boldsymbol{\mu}_1)(\mathbf{x}_i - \boldsymbol{\mu}_1)^T \newline \mathbf{S}_2 & = \frac{1}{N_2} \sum_{i=1; \newline y_i = C_2}^{N} (\mathbf{x}_i - \boldsymbol{\mu}_2)(\mathbf{x}_i - \boldsymbol{\mu}_2)^T \newline \end{align*}

and in the mutli-class case,

πk=NkNμk=1Nki=1NI(yi=Ck)xiΣ=k=1KNkNSk\begin{align*} \pi_k^{\star} & = \frac{N_k}{N} \newline \boldsymbol{\mu}_k^{\star} & = \frac{1}{N_k} \sum_{i=1}^{N} \mathbb{I}(y_i = C_k) \mathbf{x}_i \newline \mathbf{\Sigma}^{\star} & = \sum_{k=1}^{K} \frac{N_k}{N} \mathbf{S}_k \newline \end{align*}

Multi-Class LDA Multi-Class QDA

Note (GDA)

Two alternative ways to realize quadratic decision boundaries are:

  1. Apply QDA in the original dd-dimensional space x=(x1,,xd)\mathbf{x} = (x_1, \ldots, x_d).

  2. Apply LDA in an augmented spasce of dimension d2+1d^2 + 1, x=(x1,,xd,x12,,xd2,x1x2,,xd1xd)\mathbf{x}^{\prime} = (x_1, \ldots, x_d, x_1^2, \ldots, x_d^2, x_1 x_2, \ldots, x_{d-1} x_d). Since, linear functions in the augmented space corresponds to quadratic functions in the original dd-dimensional space.