Recall from the previous part, that the best classifier, in theory, is the Bayes Optimal Classifier.
However, it is not useful in reality due to the difficulty of estimating p(x∣y=c) in practice.
Naive Bayes Classifier
The Naive Bayes classifier is a simplified version of the Bayes Optimal Classifier.
It has the Naive Bayes assumption, which assumes that all of the feature dimensions are statistically independant given the value of the class variable, for example,
p(x1,x2∣y)=p(x1∣y)p(x2∣y)
It also accumulates evidence from each feature dimension,
logp(x1,x2∣y)=logp(x1∣y)+logp(x2∣y)
This allows us to model each dimension of the observation with a simple univariate distribution.
Therefore, the general form for the classification is,
fNB(x)=c∈{1,…,C}argmaxp(y=c)j=1∏Np(xj∣y=c)
Gaussian Classifier
If we treat each dimension as an independent Gaussian and fit the Gaussian parameters with MLE, we have a Gaussian classifier.
But what is the geometry of the decision boundary, let’s take a look at the 2D case.
So it is a quadratic function of (x1,x2) with the form,
j=1∑2(ajxj2+bjxj)+c=0
Naive Bayes for Boolean Vectors
In the case we are working with boolean vectors, we can simplify the Naive Bayes classifier.
We still model each feature independently, but we can use the Bernoulli distribution to model the feature.
If there is a presence of a feature, we model it as,
p(xj=1∣y=c)=φj∣c
and in the absence of a feature, we model it as,
p(xj=0∣y=c)=1−φj∣c
Therefore, the MLE parameters are φj∣c=Mj∣c/Mc, where,
Mj∣c is the number of training examples in class c that contain feature xj.
Mc is the number of training examples in class c.
So, for an input example x, the log-probabilities of features present given y=c adds.
Smoothing
Some features may not be present in any training examples for a given class, which means, Mj∣c=0 and thus φj∣c=0.
Which is problematic, since it may appear in a real-world example.
To solve this we need to smooth our MLE, by adding a smoothing parameterα, which can be interpreted as a “pseudo-count”.
So, our new parameter is,
φj∣c=Mc+NαMj∣c+α
If α=1, then we have Laplace Smoothing (also called additive smoothing).
In general, smoothing helps to prevent overfitting of the parameters.
Multinomial Naive Bayes
What if xj has a finite set of categories?
xj∈χj={1,…,∣χj∣}
Then we need to use a multinomial (or categorical) distribution as class-conditional,
p(xj=k∣y=c)=φx,j∣cx∈χj∑φx,j∣c=1
Our MLE parameter, φx,j∣c=Mx,j∣c/Mj∣c, where,
Mx,j∣c is the number of training examples in class c that have feature xj=x.
Mj∣c is the number of training examples in class c.
Linear Discriminant Analysis (LDA)
LDA is a classification technique that dates back to the 1930s, with origins from Fisher.
It can be interpreted as a approximation to the Bayes optimal classifier, but for real-valued data.
Instead of a product of independent Gaussians, LDA assumes that p(x∣y=c)=N(x;μc,Σ).
A multivariate Gaussian with a shared covariance matrix Σ.
Which equates to,
p(x∣y=c)=∣(2π)NΣ1∣21exp(−21(x−μc)TΣ−1(x−μc))
The classification function is then,
fLDA(x)=c∈{1,…,C}argmaxp(y=c)p(x∣y=c)
Training LDA
As with Naive Bayes, the LDA parameters are learned using MLE.
The first term is the regularization term. Note that wTw=∑j=1Nwj2.
Therefore, this is a penalty term so w doesn’t get too large. C is the regularization hyperparameter.
Larger C allow large values in w, smaller C penalize large values in w.
The second term is the data-fit term. This is what wants to make (w,b) fit the data.
If we define z(i)=y(i)f(x(i)), we have a interesting observation:
z(i)>0 when sample i is correctly classified.
z(i)<0 when sample i is incorrectly classified.
z(i)=0 when sample i is on the decision boundary.
Logistic Loss Function
This leads us to define the logistic loss function,
No-closed-form solution to solve for (w,b), so we need to use numerical optimization.
For example, gradient descent.
w←w−η∂w∂ℓ
ℓ is the objective function.
η is the learning rate.
Cross-Validation
We need to use cross-validation on training set to select the best value of C.
Run many experiments on the traning set to see which parameters work on different versions of the data.
First, partition the data into folds of training and validation data.
Try a range of C values on each fold (as validation set).
Simply, pick the value that works best over all folds.
Cross-Validation Algorithm
Procedure
Select a range of C values to try.
Repeat K rounds:
Split the training set into training data and validation data.
Learn a classifier for each value of C.
Record the accuracy on the validation data for each C.
Select the value of that has the highest average accuracy over all K rounds.
Retrain the classifier using all data and the selected C.
Multiclass Classification
So far, we have only learned a classifier for 2 classes y∈{−1,+1}, also called a binary classifier.
For more than 2 classes, a good stragety is to simply, split the problem into several binary classifier problems.
One-vs-Rest
In this strategy, when we train, for each class, train a classifier for that class versus the other classes.
So for example, if we have 3 classes then we train 3 binary classifiers, 1 vs {2,3}, 2 vs {1,3}, and 3 vs {1,2}.
When predicitng, calculate the probablity for each binary classifier. Select the class with the highest probablity.
Multiclass Logistic Regression
Another way to get a multiclass classifier is to define a multiclass objective function.
One weight vector wc for each class c. We omit the bias bc for each class for notational simplicity.
So, when we had the binary classifier we used the sigmoid function, for a multiclass objective, we need to use the softmax function.
Define probabilities with the softmax function,
p(y=c∣x)=exp(w1Tx)+…+exp(wCTx)exp(wcTx)
Here we need to estimate the {wc}c=1C parameters using MLE as before.
Geometry of Multiclass Logistic Regression
In logistic regressino, it is explcitily designed to have a linear decision boundary in the binary case.
In the multiclass case, the decision boundary is piece-wise linear.
Logistic Regression VS. NB and LDA
Speed
Learning logistic regression requires iterative numerical optimization, which will be slower than NB and LDA.
Storage
The model requires O(N) parameters, the same order as NB but much less than LDA’s O(N2).
Interpretability
The “importance” of feature xj can be understood in terms of the corresponding learned weight wj.
Logistic Regression Summary
Classifier
Linear function f(x)=wTx+b.
Given a feature vector x the probability of a class is
p(y=+1∣x)=σ(f(x)),p(y=−1∣x)=σ(−f(x)).
Sigmod function σ(z)=1+e−z1.
Logistic loss function L(z)=log(1+exp(−z)).
Training
Maximize the likelihood of the training data
Use regularization to prevent overfitting.
Use cross-validation to pick the regularization parameter C.
Classification
Given a new sample x⋆, pick class with the highest p(y∣x⋆).