Part 2 - Generative VS. Discriminative Models

The Bayes Optimal Classifier

Recall from the previous part, that the best classifier, in theory, is the Bayes Optimal Classifier. However, it is not useful in reality due to the difficulty of estimating $p(\mathbf{x} | y = c)$ in practice.

Naive Bayes Classifier

The Naive Bayes classifier is a simplified version of the Bayes Optimal Classifier.

It has the Naive Bayes assumption, which assumes that all of the feature dimensions are statistically independant given the value of the class variable, for example,

p(x_1, x_2 | y) = p(x_1 | y) p(x_2 | y)

It also accumulates evidence from each feature dimension,

\log p(x_1, x_2 | y) = \log p(x_1 | y) + \log p(x_2 | y)

This allows us to model each dimension of the observation with a simple univariate distribution.

Therefore, the general form for the classification is,

f_{NB}(\mathbf{x}) = \underset{c \in \{1, \ldots, C\}}{\arg\max} p(y = c) \prod_{j = 1}^{N} p(x_j | y = c)

Gaussian Classifier

If we treat each dimension as an independent Gaussian and fit the Gaussian parameters with MLE, we have a Gaussian classifier.

But what is the geometry of the decision boundary, let’s take a look at the 2D case.

Geometric Interpretation

We can rewrite and simplify our equation,

\begin{align*} f_{NB}(\mathbf{x}) &= \underset{c \in \{1, 2\}}{\arg\max} p(y = c) \prod_{j = 1}^{2} p(x_j | y = c) \newline &= \underset{c \in \{1, 2\}}{\arg\max} \log p(y = c) + \sum_{j = 1}^{2} \log p(x_j | y = c) \newline &= \underset{c \in \{1, 2\}}{\arg\max} \log p(y = c) + \sum_{j = 1}^{2} \log \mathcal{N}(x_j; \mu_{j | c}, \sigma_{j | c}^2) \newline &= \underset{c \in \{1, 2\}}{\arg\max} \log p(y = c) + \sum_{j = 1}^{2} \log \left[ \frac{1}{\sqrt{2\pi\sigma_{j | c}^2}} \exp\left(-\frac{(x_j - \mu_{j | c})^2}{2\sigma_{j | c}^2}\right) \right] \newline &= \underset{c \in \{1, 2\}}{\arg\max} \log p(y = c) + \sum_{j = 1}^{2} \log \left[\left(2\pi\sigma_{j | c}^2\right)^{-\frac{1}{2}} \exp\left(-\frac{(x_j - \mu_{j | c})^2}{2\sigma_{j | c}^2}\right) \right] \newline &= \underset{c \in \{1, 2\}}{\arg\max} \log p(y = c) + \sum_{j = 1}^{2} \left( -\frac{1}{2} \log(2\pi\sigma_{j | c}^2) - \frac{(x_j - \mu_{j | c})^2}{2\sigma_{j | c}^2} \right) \end{align*}

Okay, so the decision boundary consists of the set of points $(x_1, x_2)$ where,

\log p(y = 1) + \sum_{j = 1}^2 \log p(x_j | y = 1) = \log p(y = 2) + \sum_{j = 1}^2 \log p(x_j | y = 2)

Which means,

\log p(y = 1) + \sum_{j = 1}^2 \left(-\frac{1}{2} \log(2\pi\sigma_{j | 1}^2) - \frac{(x_j - \mu_{j | 1})^2}{2\sigma_{j | 1}^2}\right) - \newline \log p(y = 2) + \sum_{j = 1}^2 \left(-\frac{1}{2} \log(2\pi\sigma_{j | 2}^2) - \frac{(x_j - \mu_{j | 2})^2}{2\sigma_{j | 2}^2}\right) \newline = 0

So it is a quadratic function of $(x_1, x_2)$ with the form,

\sum_{j = 1}^2 (a_j x_j^2 + b_j x_j) + c = 0

Naive Bayes for Boolean Vectors

In the case we are working with boolean vectors, we can simplify the Naive Bayes classifier.

We still model each feature independently, but we can use the Bernoulli distribution to model the feature.

If there is a presence of a feature, we model it as,

p(x_j = 1 | y = c) = \varphi_{j | c}

and in the absence of a feature, we model it as,

p(x_j = 0 | y = c) = 1 - \varphi_{j | c}

Therefore, the MLE parameters are $\varphi_{j | c} = M_{j | c} / M_c$ , where, $M_{j | c}$ is the number of training examples in class $c$ that contain feature $x_j$ . $M_c$ is the number of training examples in class $c$ .

We therefore get class-conditional distributions,

p(x_1, \ldots, x_N | y = c) = \prod_{j = 1}^{N} p(x_j | y = c) \newline \log p(x_1, \ldots, x_N | y = c) = \sum_{j = 1}^{N} \log p(x_j | y = c)

So, for an input example $\mathbf{x}$ , the log-probabilities of features present given $y = c$ adds.

Smoothing

Some features may not be present in any training examples for a given class, which means, $M_{j | c} = 0$ and thus $\varphi_{j | c} = 0$ .

Which is problematic, since it may appear in a real-world example.

To solve this we need to smooth our MLE, by adding a smoothing parameter $\alpha$ , which can be interpreted as a “pseudo-count”.

So, our new parameter is,

\varphi_{j | c} = \frac{M_{j | c} + \alpha}{M_c + N\alpha}

If $\alpha = 1$ , then we have Laplace Smoothing (also called additive smoothing).

In general, smoothing helps to prevent overfitting of the parameters.

Multinomial Naive Bayes

What if $x_j$ has a finite set of categories?

x_j \in \chi_j = \{1, \ldots, | \chi_j |\}

Then we need to use a multinomial (or categorical) distribution as class-conditional,

p(x_j = k | y = c) = \varphi_{x, j | c} \newline \sum_{x \in \chi_j} \varphi_{x, j | c} = 1

Our MLE parameter, $\varphi_{x, j | c} = M_{x, j | c} / M_{j | c}$ , where, $M_{x, j | c}$ is the number of training examples in class $c$ that have feature $x_j = x$ . $M_{j | c}$ is the number of training examples in class $c$ .

Linear Discriminant Analysis (LDA)

LDA is a classification technique that dates back to the 1930s, with origins from Fisher.

It can be interpreted as a approximation to the Bayes optimal classifier, but for real-valued data.

Instead of a product of independent Gaussians, LDA assumes that $p(\mathbf{x} | y = c) = \mathcal{N}(\mathbf{x}; \mathbf{\mu}_c, \mathbf{\Sigma})$ . A multivariate Gaussian with a shared covariance matrix $\mathbf{\Sigma}$ .

Which equates to,

p(\mathbf{x} | y = c) = \frac{1}{|(2\pi)^N \mathbf{\Sigma}}|^{\frac{1}{2}} exp\left(-\frac{1}{2} (\mathbf{x} - \mathbf{\mu}_c)^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \mathbf{\mu}_c)\right)

The classification function is then,

f_{LDA}(\mathbf{x}) = \underset{c \in \{1, \ldots, C\}}{\arg\max} p(y = c) p(\mathbf{x} | y = c)

Training LDA

As with Naive Bayes, the LDA parameters are learned using MLE.

Our class probabilities are,

p(y = c) = \frac{1}{M} \sum_{i = 1}^{M} \mathbb{I}(y^{(i)} = c) = \frac{M_c}{M}

Class means are,

\mathbf{\mu_c} = \frac{\sum_{i = 1}^M \mathbb{I}(y^{(i)} = c) \mathbf{x}^{(i)}}{\sum_{i = 1}^M \mathbb{I}[y^{(i)} = c]} = \frac{\sum_{i = 1}^{M_c} \mathbf{x}^{(i)}}{M_c}

where $\mathbf{x}^{(i)}$ belongs to class $c$ .

The shared covariance matrix is,

\mathbf{\Sigma} = \frac{1}{M} \sum_{i = 1}^M \left(\mathbf{x}^{(i)} - \mathbf{\mu_{y^{(i)}}}\right) \left(\mathbf{x}^{(i)} - \mathbf{\mu_{y^{(i)}}}\right)^T

Geometric Interpretation

Let’s take a look at the geometry of the decision boundry for LDAs.

The decision boundary is the set of points $\mathbf{x}$ where,

\log p(y = 1) - \frac{1}{2} \log |(2\pi)^2 \mathbf{\Sigma}| - \frac{1}{2}(\mathbf{x} - \mathbf{\mu}_1)^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \mathbf{\mu}_1) \newline - \log p(y = 2) - \frac{1}{2} \log |(2\pi)^2 \mathbf{\Sigma}| - \frac{1}{2}(\mathbf{x} - \mathbf{\mu}_2)^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \mathbf{\mu}_2) = 0

By cancelling out the common terms, we get,

\log \frac{p(y = 1)}{p(y = 2)} - \frac{1}{2} \mathbf{\mu}_1^T \mathbf{\Sigma}^{-1} \mathbf{\mu}_1 + \frac{1}{2} \mathbf{\mu}_2^T \mathbf{\Sigma}^{-1} \mathbf{\mu}_2 + (\mathbf{\mu}_1 - \mathbf{\mu}_2)^T \mathbf{\Sigma}^{-1} \mathbf{x} = 0

This shows that the decision boundary is linear in $\mathbf{x}$ .

Naive Bayes vs LDA

So, which is better?

Let’s break it down.

Storage
- Naive Bayes: $\mathcal{O}(N)$ parameters
- LDA: $\mathcal{O}(N^2)$ parameters
Speed The quadratic dependence on N makes LDA slower than NB during learning and classification.
Interpretability
- Both models have good interpretability since the parameters of $p(\mathbf{x} | y = c)$ corresponds to class conditional averages.
Data
- LDA will generally need more data than NB since it needs to estimate $\mathcal{O}(N^2)$ parameters in the shared covariance matrix.

Summary of Generative Models

So, what we have been looking at can also be called generative classification models.

They estimate a probability distribution of features from each class. Given a feature, predict the class with the largest posterior probability.

Advantages
- Works with small amounts of data.
- Works with multiple classes.
Disadvantages
- Accuracy depends on selecting an appropriate probability distribution.
- If the probability distribution can’t model the data, the accuracy will be poor.

Discriminative Models

Discriminative models directly model the posterior probability $p(y | \mathbf{x})$ .

Let’s start with the simplest discriminative model, the linear classifier.

Linear Classifier

For a linear classifier we have this “setup”. Observation (feature vectors) $\mathbf{x} \in \mathbb{R}^N$ . Class $y \in \{-1, +1\}$ .

Our goal is to calculate a linear function of the feature vector $\mathbf{x}$ ,

f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b = \sum_{j = 1}^N w_j x_j + b

where $\mathbf{w} \in \mathbb{R}^N$ is the weight vector and $b \in \mathbb{R}$ is the bias term.

From this value generated by the function, if $f(\mathbf{x}) > 0$ , we predict class $+1$ , otherwise we predict class $-1$ , or just $sign(f(\mathbf{x}))$ .

Geometric Interpretation

As the name suggests, the linear classifier seperates the feature space into 2 half-spaces.

Each space corresponds to feature values belonging to class $+1$ and class $-1$ .

The class boundary is normal (orthogonal or perpendicular since we are in 2D) to $\mathbf{w}$ . This boundary is also called the separating hyperplane.

Separating Hyperplane

In an $N$ -dimensional feature space, the parameters are $\mathbf{w} \in \mathbb{R}^N$ .

The equation $\mathbf{w}^T \mathbf{x} + b = 0$ defines an $N-1$ -dimensional linear surface.

For $N = 2$ , $\mathbf{w}$ defines a 1D line.
For $N = 3$ , $\mathbf{w}$ defines a 2D plane.
For $N = n$ , $\mathbf{w}$ defines a hyperplane.

Training a Linear Classifier

But how do we set these classifier parameters ( $\mathbf{w}$ , $b$ )?

Simple, learn them from the data. But depending on which approach we take, we get different classifiers.

Let’s take a look at the probabilistic approach.

Logistic Regression

If we take the probabilistic route, we need to map the function value $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$ to a probability (value between 0 and 1).

Luckily, the sigmoid function does exactly this. It maps any real number to the interval $[0, 1]$ ,

\sigma(z) = \frac{1}{1 + e^{-z}}, \text{ for } z \in \mathbb{R}

So, given a feature vector $\mathbf{x}$ , the probability of a class is,

p(y = +1 | \mathbf{x}) = \sigma(f(\mathbf{x})) \newline p(y = -1 | \mathbf{x}) = 1 - \sigma(f(\mathbf{x})) = \sigma(-f(\mathbf{x})) \ | \ \text{since } \sigma(-z) = 1 - \sigma(z)

or simply,

p(y | \mathbf{x}) = \sigma(y f(\mathbf{x})).

So the important thing here is that we are direcrtly modeling the class posterior probability, not the class-conditional probability.

Learning the Parameters

Given the data set $\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i = 1}^M$ , we can learn the parameters ( $\mathbf{w}$ , $b$ ) using MLE.

Maximize the conditional log likelihood of the data $\mathcal{D}$ ,

\begin{align*} (\mathbf{w}^\star, b^\star) &= \underset{\mathbf{w}, b}{\arg\max} \sum_{i = 1}^M \log p(y^{(i)} | \mathbf{x}^{(i)}, \mathbf{w}; b) \newline &= \underset{\mathbf{w}, b}{\arg\max} -\frac{1}{M} \sum_{i = 1}^M \log (1 + \exp(-y^{(i)} (\mathbf{w}^T \mathbf{x}^{(i)} + b))) \end{align*}

To prevent overfitting, add a prior distribution on $\mathbf{w}$ , for example, a Gaussian distribution with variance $C/2$ .

p(\mathbf{w}) \propto \exp\left(-\frac{1}{C} \mathbf{w}^T \mathbf{w}\right)

Now maximize,

\begin{align*} (\mathbf{w}^\star, b^\star) &= \underset{\mathbf{w}, b}{\arg\max} \frac{1}{M} \sum_{i = 1}^M \log p(y^{(i)} | \mathbf{x}^{(i)}, \mathbf{w}; b) \newline &= \underset{\mathbf{w}, b}{\arg\max} \frac{1}{M} \sum_{i = 1}^M \log \frac{p(\mathbf{w}, y^{(i)} | \mathbf{x}^{(i)}; b)}{p(y^{(i)} | \mathbf{x}^{(i)})} \newline &= \underset{\mathbf{w}, b}{\arg\max} \frac{1}{M} \sum_{i = 1}^M \log \frac{p(\mathbf{w}) p(y^{(i)} | \mathbf{x}^{(i)}, \mathbf{w}; b)}{p(y^{(i)} | \mathbf{x}^{(i)})} \newline &= \underset{\mathbf{w}, b}{\arg\max} \log p(\mathbf{w}) + \frac{1}{M} \sum_{i = 1}^M \log p(y^{(i)} | \mathbf{x}^{(i)}, \mathbf{w}; b) \newline \end{align*}

which equivalently,

(\mathbf{w}^\star, b^\star) = \underset{\mathbf{w}, b}{\arg\min} \frac{1}{C} \mathbf{w}^T \mathbf{w} + \frac{1}{M} \sum_{i = 1}^M \log (1 + \exp(-y^{(i)} (\mathbf{w}^T \mathbf{x}^{(i)} + b)))

The first term is the regularization term. Note that $\mathbf{w}^T \mathbf{w} = \sum_{j = 1}^N w_j^2$ . Therefore, this is a penalty term so $\mathbf{w}$ doesn’t get too large. $C$ is the regularization hyperparameter. Larger $C$ allow large values in $\mathbf{w}$ , smaller $C$ penalize large values in $\mathbf{w}$ .

The second term is the data-fit term. This is what wants to make $(\mathbf{w}, b)$ fit the data.

If we define $z^{(i)} = y^{(i)} f(\mathbf{x}^{(i)})$ , we have a interesting observation:

$z^{(i)} > 0$ when sample $i$ is correctly classified.
$z^{(i)} < 0$ when sample $i$ is incorrectly classified.
$z^{(i)} = 0$ when sample $i$ is on the decision boundary.

Logistic Loss Function

This leads us to define the logistic loss function,

L(z) = log(1 + \exp(-z))

\underset{\mathbf{w}, b}{\arg\max} \log p(\mathbf{w}) + \frac{1}{M} \sum_{i = 1}^M \log p(y^{(i)} | \mathbf{x}^{(i)}, \mathbf{w}; b) \newline

No-closed-form solution to solve for $(\mathbf{w}, b)$ , so we need to use numerical optimization. For example, gradient descent.

$\mathbf{w} \leftarrow \mathbf{w} - \eta \frac{\partial \ell}{\partial \mathbf{w}}$
$\ell$ is the objective function.
$\eta$ is the learning rate.

Cross-Validation

We need to use cross-validation on training set to select the best value of $C$ .

Run many experiments on the traning set to see which parameters work on different versions of the data.

First, partition the data into folds of training and validation data. Try a range of $C$ values on each fold (as validation set). Simply, pick the value that works best over all folds.

Cross-Validation Algorithm

Procedure
1. Select a range of $C$ values to try.
2. Repeat $K$ $K$ rounds:
  1. Split the training set into training data and validation data.
  2. Learn a classifier for each value of $C$ .
  3. Record the accuracy on the validation data for each $C$ .
3. Select the value of that has the highest average accuracy over all $K$ rounds.
4. Retrain the classifier using all data and the selected $C$ .

Multiclass Classification

So far, we have only learned a classifier for 2 classes $y \in \{-1, +1\}$ , also called a binary classifier.

For more than 2 classes, a good stragety is to simply, split the problem into several binary classifier problems.

One-vs-Rest

In this strategy, when we train, for each class, train a classifier for that class versus the other classes.

So for example, if we have 3 classes then we train 3 binary classifiers, $1$ vs $\{2, 3\}$ , $2$ vs $\{1, 3\}$ , and $3$ vs $\{1, 2\}$ .

When predicitng, calculate the probablity for each binary classifier. Select the class with the highest probablity.

Multiclass Logistic Regression

Another way to get a multiclass classifier is to define a multiclass objective function.

One weight vector $\mathbf{w}_c$ for each class $c$ . We omit the bias $b_c$ for each class for notational simplicity.

So, when we had the binary classifier we used the sigmoid function, for a multiclass objective, we need to use the softmax function. Define probabilities with the softmax function,

p(y = c | x) = \frac{\exp(\mathbf{w}_c^T \mathbf{x})}{exp(\mathbf{w}_1^T \mathbf{x}) + \ldots + \exp(\mathbf{w}_C^T \mathbf{x})}

Here we need to estimate the $\{w_c\}_{c = 1}^C$ parameters using MLE as before.

Geometry of Multiclass Logistic Regression

In logistic regressino, it is explcitily designed to have a linear decision boundary in the binary case.

In the multiclass case, the decision boundary is piece-wise linear.

Logistic Regression VS. NB and LDA

Speed
- Learning logistic regression requires iterative numerical optimization, which will be slower than NB and LDA.
Storage
- The model requires $\mathcal{O}(N)$ parameters, the same order as NB but much less than LDA’s $\mathcal{O}(N^2)$ .
- Interpretability
  - The “importance” of feature $x_j$ can be understood in terms of the corresponding learned weight $w_j$ .

Logistic Regression Summary

Classifier
- Linear function $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$ .
- Given a feature vector $\mathbf{x}$ $x$ the probability of a class is
  - $p(y = +1 | \mathbf{x}) = \sigma(f(\mathbf{x})), p(y = -1 | \mathbf{x}) = \sigma(-f(\mathbf{x}))$ .
  - Sigmod function $\sigma(z) = \frac{1}{1 + e^{-z}}$ .
- Logistic loss function $L(z) = \log(1 + \exp(-z))$ .
Training
- Maximize the likelihood of the training data
- Use regularization to prevent overfitting.
  - Use cross-validation to pick the regularization parameter $C$ .
Classification
- Given a new sample $\mathbf{x}^\star$ , pick class with the highest $p(y | \mathbf{x}^\star)$ .

y^\star = \begin{cases} +1, & \frac{p(y = +1 | \mathbf{x}^\star)}{p(y = -1 | \mathbf{x}^\star)} > 1 \newline -1, & \text{otherwise} \end{cases}

or simply,

y^\star = \begin{cases} +1, & f(\mathbf{x}^\star) > 0 \newline -1, & \text{otherwise} \end{cases}