CS4487

Part 6 - Robust, Non-Linear Regression and Clustering


Outliers

Outliers in the data can dominate the squared-error term.

The regression function will try to reduce large prediction errors for outliers, at the expense of worsening predictions for other points.

RANSAC

RANdom SAmple Consensus attempts to robustly fit a regression model in the presence of corrupted data (outliers).

RANSAC works with any regression model. The idea is to split the data into inliers (good data) and outliers (bad data), and then learn the model only from the inliers.

Random Sampling

As the name suggests, we repeatedly sample at random. Each iteration proceeds as follows:

  1. Randomly sample a subset of points from the data. Typically, just enough to learn the regression model.
  2. Fit a model to the subset.
  3. Label every data point as an inlier or outlier by computing its residual (prediction error) and comparing it to a threshold. The set of inliers is called the consensus set.
  4. Save the model with the highest number of inliers.

Finally, we use the largest consensus set to learn the final model.
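As an illustration, a minimal sketch with scikit-learn's `RANSACRegressor` wrapping ordinary linear regression (the synthetic data and parameter values are assumptions for demonstration, not prescribed by the notes):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

rng = np.random.default_rng(0)

# Synthetic 1-D data: y = 2x + 1 plus small noise, with 10% gross outliers.
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X.ravel() + 1 + rng.normal(0, 0.5, size=100)
y[:10] += 30

# Repeatedly fit on minimal random subsets and keep the model
# with the largest consensus (inlier) set.
ransac = RANSACRegressor(LinearRegression(),
                         min_samples=2,         # just enough points to fit a line
                         residual_threshold=2,  # inlier threshold on |residual|
                         max_trials=100,        # number of iterations T
                         random_state=0)
ransac.fit(X, y)

print("slope:", ransac.estimator_.coef_, "intercept:", ransac.estimator_.intercept_)
print("inliers found:", ransac.inlier_mask_.sum())
```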

How to Choose Parameters?

With more iterations, we increase the chance of finding the correct function, i.e., the probability that at least one sampled subset contains only inliers.

Let $T$ be the number of iterations (what we would like to determine).

Let $s$ be the number of points (just enough) to fit the model.

Let $e$ be the proportion of outliers (therefore, $1-e$ is the proportion of inliers).

Therefore,

$p(\text{training subset with all inliers}) = (1-e)^s$.

$p(\text{training subset with at least one outlier}) = 1 - (1-e)^s$.

$p(\text{all } T \text{ training repeats have outliers}) = (1 - (1-e)^s)^T$.

Let $\delta$ be the probability of success (so probability of failure is $1-\delta$).

Finally, we want,

$$ (1 -(1 -e)^s)^T < 1 - \delta \newline T > \frac{\log(1 - \delta)}{\log(1 - (1 - e)^s)} $$

The inlier threshold is typically set to the median absolute deviation of $y$.
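A quick numerical check of the bound on $T$ (plain Python; the example values of $s$, $e$, and $\delta$ are chosen only for illustration):

```python
import math

def ransac_iterations(s, e, delta):
    """Smallest T with probability >= delta of drawing at least one
    all-inlier subset of size s, given outlier proportion e."""
    return math.ceil(math.log(1 - delta) / math.log(1 - (1 - e) ** s))

# Fitting a line (s = 2) with 30% outliers and 99% success probability.
print(ransac_iterations(s=2, e=0.3, delta=0.99))  # -> 7
```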

Non-Linear Regression

So far we have only considered linear regression in the form $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$.

Similar to classification, we can do non-linear regression by performing feature mapping $\phi(\mathbf{x})$, followed by linear regression on the new feature vector.

Polynomial Regression

Let’s consider the $p$-th order polynomial regression.

$$ f(x) = w_0 + w_1 x + w_2 x^2 + \ldots + w_p x^p $$

We can collect the terms into a vector,

$$ f(x) = \begin{bmatrix} w_0 & w_1 & w_2 & \ldots & w_p \end{bmatrix} \begin{bmatrix} 1 \newline x \newline x^2 \newline \vdots \newline x^p \end{bmatrix} = \mathbf{w}^T \phi(x) $$

The model is now linear in the weights $\mathbf{w}$, so we can use the same linear regression machinery.
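A short sketch using scikit-learn's `PolynomialFeatures` as the feature map $\phi$, followed by ordinary linear regression (the degree and the synthetic data are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(x).ravel() + rng.normal(0, 0.1, size=80)

# phi(x) = [x, x^2, x^3]; LinearRegression supplies the bias term w_0.
model = make_pipeline(PolynomialFeatures(degree=3, include_bias=False),
                      LinearRegression())
model.fit(x, y)
print(model.predict([[0.5]]))
```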

Select Degree Using Cross-Validation

Minimizing the MSE on the training set will overfit: a more complex function always achieves a lower MSE on the training set.

Therefore, we use cross-validation to select the model; the parameter to tune is the polynomial degree in the feature-transformation step, as in the sketch below.
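One possible way to select the degree, using `GridSearchCV` on the same kind of pipeline (the degree grid, fold count, and data are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(x).ravel() + rng.normal(0, 0.1, size=80)

# Cross-validate over the polynomial degree, scoring by held-out MSE.
search = GridSearchCV(
    make_pipeline(PolynomialFeatures(include_bias=False), LinearRegression()),
    param_grid={"polynomialfeatures__degree": list(range(1, 11))},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(x, y)
print("selected degree:", search.best_params_["polynomialfeatures__degree"])
```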

Kernel Trick for Machine Learning Methods

Recall the kernel trick,

$$ \mathcal{K}(\mathbf{x}, \mathbf{x}^{\prime}) = \phi(\mathbf{x})^T \phi(\mathbf{x}^{\prime}) $$

Many learning methods depend on computing inner products between features (e.g., linear regression).

For those methods, we can use a kernel function in place of inner product (i.e., “kernelizing” the methods), and thus introduce non-linearity.

Instead of first transforming the original features into the new feature space and then computing the inner product, we can compute the inner product in the new feature space directly through the kernel function.

Kernel Ridge Regression

Firstly, recall ridge regression,

$$ \underset{\mathbf{w}}{\min} \frac{1}{2} \Vert \mathbf{X} \mathbf{w} - \mathbf{y} \Vert_2^2 + \frac{\alpha}{2} \Vert \mathbf{w} \Vert_2^2 $$

$$ \mathbf{w}^{\star} = (\mathbf{X}^T \mathbf{X} + \alpha \mathbf{I}_{N+1})^{-1} \mathbf{X}^T \mathbf{y} = \mathbf{X}^T (\mathbf{X} \mathbf{X}^T + \alpha \mathbf{I}_M)^{-1} \mathbf{y}, $$

The second equality is obtained using the matrix identity,

$$ (\mathbf{P}^{-1} + \mathbf{B}^T \mathbf{R}^{-1} \mathbf{B})^{-1} \mathbf{B}^T \mathbf{R}^{-1} = \mathbf{P} \mathbf{B}^T (\mathbf{B} \mathbf{P} \mathbf{B}^T + \mathbf{R})^{-1} $$

In our case, $\mathbf{P} = \frac{1}{\alpha} \mathbf{I}_{N+1}$, $\mathbf{B} = \mathbf{X}$, and $\mathbf{R} = \mathbf{I}_M$.

This equation can be rewritten as,

$$ \mathbf{w}^{\star} = \sum_{i=1}^M \lambda_i \mathbf{x}^{(i)} $$

with $\lambda = (\mathbf{X} \mathbf{X}^T + \alpha \mathbf{I}_M)^{-1} \mathbf{y}$.

I.e., the solution $\mathbf{w}^{\star}$ must lie in the span of the data cases.

We can arrive at the same solution using Lagrange duality, similarly to the SVM.

For any new point $\mathbf{x}^{\star}$, we make a prediction by,

$$ y^{\star} = (\mathbf{x}^{\star})^T \mathbf{w} = (\mathbf{x}^{\star})^T \mathbf{X}^T (\mathbf{X} \mathbf{X}^T + \alpha \mathbf{I}_M)^{-1} \mathbf{y} = (\mathbf{X} \mathbf{x}^{\star})^T (\mathbf{X} \mathbf{X}^T + \alpha \mathbf{I}_M)^{-1} \mathbf{y} $$

Kernelizing the method, we have,

$$ y^{\star} = (\mathbf{x}^{\star})^T \mathbf{w} = (\mathbf{k}^{\star})^T (\mathbf{K} + \alpha \mathbf{I}_M)^{-1} \mathbf{y} $$

where $\mathbf{K}$ is the $M \times M$ kernel matrix with entries $K_{ij} = \mathcal{K}(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = \phi(\mathbf{x}^{(i)})^T \phi(\mathbf{x}^{(j)})$,

and $\mathbf{k}^{\star}$ is the vector with entries $k^{\star}_i = \mathcal{K}(\mathbf{x}^{(i)}, \mathbf{x}^{\star})$, the kernel values between $\mathbf{x}^{\star}$ and all training points $\mathbf{x}^{(i)}$.

Compared with the dual form of ridge regression, no extra computational burden is incurred, but the model now performs (much more powerful) non-linear regression.
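A small NumPy sketch that mirrors these formulas with an RBF kernel (the kernel choice, bandwidth $\gamma$, regularization $\alpha$, and data are assumptions for illustration):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))                 # training inputs (M x N)
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=100)

alpha = 0.1
K = rbf_kernel(X, X)                                  # M x M kernel matrix
dual_coef = np.linalg.solve(K + alpha * np.eye(len(X)), y)  # (K + alpha I)^{-1} y

# Prediction: y* = (k*)^T (K + alpha I)^{-1} y
X_new = np.array([[0.5], [1.5]])
k_star = rbf_kernel(X, X_new)                         # column i holds k* for X_new[i]
print(k_star.T @ dual_coef)
```

For comparison, `sklearn.kernel_ridge.KernelRidge(alpha=0.1, kernel='rbf', gamma=1.0)` should produce the same predictions, since it solves the same linear system.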

Support Vector Regression

Borrowing ideas from classification, suppose we form a “tube” around the regression function.

If a point is inside, then it is “correctly” predicted. If a point is outside, then it is “incorrectly” predicted.

We can allow some points to lie outside the “tube”; the penalty for such points is controlled by the hyperparameter $C$.

SVR (Primal Form, Soft-Margin)

$$ \underset{\mathbf{w}, b}{\min} \frac{1}{2} \Vert \mathbf{w} \Vert_2^2 + C \sum_{i=1}^M \vert y^{(i)} - (\mathbf{w}^T \mathbf{x}^{(i)} + b) \vert_{\epsilon} $$

Epsilon-insensitive error, $$ \vert z \vert_{\epsilon} = \begin{cases} 0, & |z| \leq \epsilon \newline |z| - \epsilon, & |z| > \epsilon \end{cases} $$

Equivalently, we have the SVR objective function,

$$ \begin{aligned} \underset{\mathbf{w}, b, \mathbf{\xi}, \mathbf{\xi}^{\star}}{\min} & \quad \frac{1}{2} \Vert \mathbf{w} \Vert_2^2 + C \sum_{i=1}^M (\xi_i + \xi_i^{\star}) \newline \text{subject to} & \quad y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)} - b \leq \epsilon + \xi_i, \quad i = 1, \ldots, M \newline & \quad \mathbf{w}^T \mathbf{x}^{(i)} + b - y^{(i)} \leq \epsilon + \xi_i^{\star}, \quad i = 1, \ldots, M \newline & \quad \xi_i, \xi_i^{\star} \geq 0, \quad i = 1, \ldots, M \end{aligned} $$

Kernel SVR

SVR can also be kernelized through Lagrange duality, turning the linear regression into non-linear regression.

SVR (dual form),

$$ \begin{aligned} \underset{\alpha, \alpha^{\star}}{\max} & \quad -\frac{1}{2} \sum_{i,j=1}^M (\alpha_i - \alpha_i^{\star})(\alpha_j - \alpha_j^{\star})(\mathbf{x}^{(i)})^T \mathbf{x}^{(j)} \newline & \quad - \epsilon \sum_{i=1}^M (\alpha_i + \alpha_i^{\star}) + \sum_{i=1}^M y^{(i)}(\alpha_i - \alpha_i^{\star}) \newline \text{subject to} & \quad \sum_{i=1}^M (\alpha_i - \alpha_i^{\star}) = 0 \newline & \quad 0 \leq \alpha_i, \alpha_i^{\star} \leq C, \quad i = 1, \ldots, M \end{aligned} $$

where $\alpha_i$ and $\alpha_i^{\star}$ are dual variables.
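In practice the dual problem is handled by an off-the-shelf solver; a brief example with scikit-learn's `SVR` and an RBF kernel (hyperparameter values and data are illustrative):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)

# epsilon sets the tube width; C penalizes points that fall outside the tube.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=1.0)
svr.fit(X, y)

print("support vectors:", len(svr.support_))
print(svr.predict([[0.5]]))
```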

Regression Summary

  • Regression Task

    • Observation $\mathbf{x}$, typically a real vector of feature values, $\mathbf{x} \in \mathbb{R}^N$.
    • $y \in \mathbb{R}$, a real-valued target variable.
    • Goal: given an observation $\mathbf{x}$, predict the corresponding target variable $y$.
  • Ordinary Least Squares

    • Function: Linear.
    • Training: Minimize Squared Error.
    • Pros: Closed-form solution.
    • Cons: Sensitive to outliers and overfitting.
  • Ridge Regression

    • Function: Linear.
    • Training: Minimize Squared Error with $\Vert \mathbf{w} \Vert_2^2$ regularization term.
    • Pros: Closed-form solution, shrinkage to prevent overfitting.
    • Cons: Sensitive to outliers.
  • LASSO

    • Function: Linear.
    • Training: Minimize Squared Error with $\Vert \mathbf{w} \Vert_1$ regularization term.
    • Pros: Feature selection (by forcing weights to zero).
    • Cons: Sensitive to outliers.
  • RANSAC

    • Function: Same as the base model.
    • Training: Randomly sample subset of training data and fit model, keep model with most inliers.
    • Pros: Ignore outliers.
    • Cons: Requires enough iterations to find good consensus set.
  • Kernel Ridge Regression

    • Function: Non-linear (kernel function).
    • Training: Apply “kernel trick” to ridge regression.
    • Pros: Non-linear regression, closed-form solution.
    • Cons: Requires calculating the kernel matrix $(\mathcal{O}(M^2))$; cross-validation is needed to select hyperparameters.
  • Kernel Support Vector Regression

    • Function: Non-linear (kernel function).
    • Training: Minimize the epsilon-insensitive error.
    • Pros: Non-linear regression, faster predictions than kernel ridge regression.
    • Cons: Requires calculating the kernel matrix $(\mathcal{O}(M^2))$; training is iterative (slow), and cross-validation is needed to select hyperparameters.
  • Feature Normalization

    • Feature normalization is typically required for regression methods with regularization.
    • Makes ordering of weights more interpretable (LASSO, RR).
  • Output transformations

    • Sometimes the output values $y$ have a large dynamic range (e.g., $10^{-1}$ to $10^5$).
      • Large output values will have large errors, which will dominate the training error.
    • In this case, it is better to transform the output values using the logarithm function
      • $\hat{y} = \log_{10}(y)$.
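For example (a trivial sketch with made-up values; remember to invert the transform when reporting predictions):

```python
import numpy as np

y = np.array([0.5, 12.0, 3_400.0, 98_000.0])  # targets spanning several orders of magnitude
y_log = np.log10(y)                           # train the regressor on y_log instead of y
y_back = 10 ** y_log                          # invert the transform after predicting
```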

Supervised Learning VS. Unsupervised Learning

  • Supervised learning considers input-output pairs $(\mathbf{x}, y)$.
    • Learn a mapping $f$ from input to output.
    • Classification: output $y \in \{-1, 1\}$ (binary), or $y \in \{1, 2, \ldots, K\}$ (multi-class).
    • Regression: output $y \in \mathbb{R}$.
    • “Supervised” here means that the algorithm is learning the mapping that we want.
  • Unsupervised learning only considers the input data $\mathbf{x}$.
    • There is no output value.
    • Goal: try to discover inherent properties in the data.
      • Density estimation
      • Clustering
      • Dimensionality reduction
      • Manifold embedding

Density Estimation

From $\mathcal{D} = \{\mathbf{x}^{(i)}\}_{i=1}^M$, estimate a probability distribution $p(\mathbf{x})$.

This is the mother of all unsupervised learning problems. This is a key technique underpinning AIGC (Artificial Intelligence Generated Content).

This can be conditional, i.e., to estimate $p(\mathbf{x} | y)$. (Here, we don’t learn a mapping to predict $y$ from $\mathbf{x}$, we just use $y$ as conditioning).

Clustering

Find clusters of similar items in the data.

Find a representative item that “summarizes” all items in the cluster.

A typical example: grouping iris flowers by their measurements (sepal width and petal length).

Dimensionality Reduction

Transform high-dimensional vectors into low-dimensional vectors. Dimensions in the low-dim data may have semantic meaning.

A typical example is document analysis. In the high-dimensional space, each document is represented by a bag-of-words vector.

In the low-dimensional space, each dimension represents similarity to a topic.

Manifold Embedding

Project high-dimensional vectors into 2- or 3-dimensional space for visualization. Points in the low-dimensional space should have pair-wise distances similar to those in the high-dimensional space.

A typical example is visualizing a collection of handwritten digit images.

Defining a Clustering

Suppose we have $M$ data points $\mathcal{D} = \{\mathbf{x}^{(i)}\}_{i=1}^M$, where $\mathbf{x}^{(i)} \in \mathbb{R}^N$.

A clustering of the $M$ points into $K$ clusters is a partitioning of $\mathcal{D}$ into $K$ mutually disjoint groups $\mathcal{C} = \{\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_K\}$ such that $\mathcal{C}_1 \cup \mathcal{C}_2 \cup \ldots \cup \mathcal{C}_K = \mathcal{D}$.

These groups are called clusters, so $K$ is the number of clusters.

Each data point is assigned a cluster index $(y \in \{1, \ldots, K\})$.

Exhaustive Clustering

Suppose we have a function $f(\mathcal{C})$ that takes a clustering $\mathcal{C}$ of the data set $\mathcal{D}$ as input and returns a score, with lower scores indicating a better clustering.

The optimal clustering according to $f$ is simply,

$$ \underset{\mathcal{C}}{\arg \min} f(\mathcal{C}) $$

But what’s the complexity of exhaustive clustering?

Number of Clusterings

The total number of clusterings of a data set with $M$ elements is the Bell number $B_M$, where $B_0 = 1$ and,

$$ B_{M + 1} = \sum_{k=0}^M \binom{M}{k} B_k $$

The first few Bell numbers are: 1, 1, 2, 5, 15, 52, 203, 877, 4140, 21147, 115975, $\ldots$.
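A small Python sketch of the recurrence, reproducing these values (for illustration only):

```python
from math import comb

def bell_numbers(n):
    """Return [B_0, ..., B_n] using B_{m+1} = sum_k C(m, k) * B_k."""
    B = [1]  # B_0 = 1
    for m in range(n):
        B.append(sum(comb(m, k) * B[k] for k in range(m + 1)))
    return B

print(bell_numbers(10))  # [1, 1, 2, 5, 15, 52, 203, 877, 4140, 21147, 115975]
```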

The complexity of exhaustive clustering scales with $B_M$, so it is computationally intractable for general scoring functions.

We will need either approximation algorithms or scoring functions with special properties.

K-Means Clustering

Assume we have $K$ clusters, each cluster is represented by a cluster center, $\mathbf{c}_j \in \mathbb{R}^N, j \in \{1, \ldots, K\}$.

Assign each data point to the closest cluster center, according to Euclidean distance $\Vert \mathbf{x} - \mathbf{c}_j \Vert_2$.

K-Means Clustering Problem

But, how do we pick the cluster centers?

Again, assume there are $K$ clusters. We pick the cluster centers that minimize the total squared distance from each point to its assigned cluster center,

$$ \underset{\mathbf{c_1}, \ldots, \mathbf{c_K}}{\min} \sum_{i=1}^M \Vert \mathbf{x}^{(i)} - \mathbf{c}_{z^{(i)}} \Vert_2^2 $$

where $z^{(i)}$ is the index of the closest cluster center to $\mathbf{x}^{(i)}$,

$$ z^{(i)} = \underset{j=\{1, \ldots, K\}}{\arg \min} \Vert \mathbf{x}^{(i)} - \mathbf{c}_j \Vert_2^2 $$

This gives us the solution if the assignments $\{z^{(i)}\}_{i=1}^M$ are already known:

Let $C_j$ be the set of points assigned to cluster $j$,

$$ C_j = \{\mathbf{x}^{(i)} | z^{(i)} = j\} $$

Each cluster center is then the mean of the points in its cluster,

$$ \mathbf{c_j} = \frac{1}{\vert C_j \vert} \sum_{\mathbf{x}^{(i)} \in C_j} \mathbf{x}^{(i)} $$

Chicken and Egg Problem

Cluster assignment of each point depends on the cluster centers.

Location of cluster center depends on which points are assigned to it.

How do we solve this?

K-Means Algorithm

Pick initial cluster centers (randomly or using some heuristic).

Then we repeat,

  1. Assignment step: Calculate the assignment $z^{(i)}$ for each point $\mathbf{x}^{(i)}$, i.e., the closest cluster center under Euclidean distance,

$$ z^{(i)} = \underset{j=\{1, \ldots, K\}}{\arg \min} \Vert \mathbf{x}^{(i)} - \mathbf{c}_j \Vert_2 $$

  2. Update step: Calculate each cluster center as the average of the points assigned to cluster $j$,

$$ \mathbf{c_j} = \frac{\sum_{i=1}^M \mathbb{I}[z^{(i)} = j] \mathbf{x}^{(i)}}{\sum_{i=1}^M \mathbb{I}[z^{(i)} = j]} $$

This procedure will eventually converge.
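A compact NumPy sketch of these two steps, initialized with randomly chosen data points (a minimal illustration under these assumptions, not an optimized implementation):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Alternate assignment and update steps until the centers stop moving."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: index of the closest center for every point.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # M x K
        z = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        new_centers = np.array([X[z == j].mean(axis=0) if np.any(z == j) else centers[j]
                                for j in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return z, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in ([0, 0], [4, 4], [0, 4])])
z, centers = kmeans(X, K=3)
print(centers)
```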

The K-means Objective

$K$-means attempts to minimize the sum of within-cluster variation over all clusters (also called the within-cluster sum of squares),

$$ \underset{\mathbf{z}, \{\mathbf{c}_j\}_{j=1}^K}{\min} \ell(\mathbf{z}, \{\mathbf{c}_j\}_{j=1}^K) = \underset{\mathbf{z}, \{\mathbf{c}_j\}_{j=1}^K}{\min} \sum_{i=1}^M \Vert \mathbf{x}^{(i)} - \mathbf{c}_{z^{(i)}} \Vert_2^2 $$

where $\mathbf{z} = [z^{(1)}, z^{(2)}, \ldots, z^{(M)}]^T$.

$K$-means is exactly coordinate descent on $\ell$: the assignment step minimizes $\ell$ with respect to $\mathbf{z}$ while holding $\{\mathbf{c}_j\}$ fixed, and the update step minimizes $\ell$ with respect to $\{\mathbf{c}_j\}$ while holding $\mathbf{z}$ fixed.

Thus, $\ell$ is monotonically decreasing. As $\ell$ is also lower bounded by 0, the value of $\ell$ must converge.

Note that $K$-means generally has many local optima, each corresponding to a different clustering of the data. Finding the global optimum is not computationally tractable.

Thus, the final result can be highly sensitive to initialization: poor initial cluster centers can yield a suboptimal clustering.

Solution to Initialization

It is common to perform multiple random restarts of the algorithm and take the clustering with the best (lowest) objective value.

Common initialization strategies include,

  • Setting the initial centers to be randomly selected data points.
  • Setting the initial partition to a random partition.
  • Selecting centers using a “furthest-first”-style heuristic (more formally known as $K$-means++).

It often helps to initially run with $K \log(K)$ clusters, merge clusters to get down to $K$, and then run the algorithm from that initialization.

Choosing $K$

Clustering results depend on the number of clusters $K$ used. We do not typically know this information beforehand.

A widely used heuristic is the elbow method: run $K$-means for a range of $K$ values, plot the within-cluster sum of squares as a function of $K$, and choose the “elbow” point, where the rate of decrease slows down.
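A sketch of the elbow method with scikit-learn's `KMeans`, whose `inertia_` attribute is the within-cluster sum of squares (the data and range of $K$ are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.6, size=(100, 2)) for loc in ([0, 0], [5, 5], [0, 5])])

# Within-cluster sum of squares for each candidate K.
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # look for the K where the decrease flattens out
```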

Circular Clusters

One problem with $K$-means is that it assumes that each cluster has a circular shape.

Since it is based on the Euclidean distance to each center, $K$-means cannot handle skewed (elliptical) clusters.

Gaussian Mixture Models (GMM)

A multivariate Gaussian can model a cluster with an elliptical shape. The ellipse shape is controlled by the covariance matrix of the Gaussian and the location of the cluster is controlled by the mean.

A Gaussian mixture model is a weighted sum of Gaussians,

$$ p(\mathbf{x}) = \sum_{j=1}^K \pi_j \mathcal{N}(\mathbf{x}; \mathbf{\mu}_j, \mathbf{\Sigma}_j) $$

Each Gaussian represents one elliptical cluster:

  • $\mathbf{\mu}_j$ is mean of $j$-th cluster (the location).
  • $\mathbf{\Sigma}_j$ is the covariance matrix of $j$-th cluster (the ellipse shape).
  • $\pi_j$ is the prior weight of $j$-th cluster (how likely the cluster is).

Clustering with GMM

Given a data set $\mathcal{D} = \{\mathbf{x}^{(i)}\}_{i=1}^M$, learn a GMM using maximum likelihood estimation,

$$ \underset{\pi, \mathbf{\mu}, \mathbf{\Sigma}}{\max} \sum_{i=1}^M \log \sum_{j=1}^K \pi_j \mathcal{N}(\mathbf{x}^{(i)}; \mathbf{\mu_j}, \mathbf{\Sigma_j}) $$

While we can do this directly using gradient-based optimization, it is often faster to use a special algorithm called Expectation-Maximization (EM).
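In practice one typically uses a library implementation of EM; a brief example with scikit-learn's `GaussianMixture` (the synthetic data and settings are assumptions for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two elongated (elliptical) clusters.
X = np.vstack([rng.multivariate_normal(m, C, size=100)
               for m, C in [([0, 0], [[1, 0.8], [0.8, 1]]),
                            ([5, 0], [[1, -0.6], [-0.6, 1]])]])

gmm = GaussianMixture(n_components=2, covariance_type="full", n_init=5, random_state=0)
labels = gmm.fit_predict(X)        # hard cluster labels
resp = gmm.predict_proba(X[:3])    # "soft" memberships z_j^{(i)}
print(gmm.weights_)
print(gmm.means_)
```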

Expectation-Maximization (EM) for GMM

EM results in an algorithm similar to $K$-means.

  1. E-step: Calculate cluster memberships with “soft” assignments: a data point can have a fractional contribution to different clusters. The contribution of point $i$ to cluster $j$ is the posterior probability that $\mathbf{x}^{(i)}$ belongs to cluster $j$, computed using Bayes’ rule,

$$ \begin{aligned} z_j^{(i)} = p(z^{(i)} = j | \mathbf{x}^{(i)}) & = \frac{p(\mathbf{x}^{(i)}, z^{(i)} = j)}{p(\mathbf{x}^{(i)})} \newline & \quad = \frac{p(\mathbf{x}^{(i)} | z^{(i)} = j) p(z^{(i)} = j)}{\sum_{k=1}^K p(\mathbf{x}^{(i)} | z^{(i)} = k) p(z^{(i)} = k)} \newline & \quad = \frac{\pi_j \mathcal{N}(\mathbf{x}^{(i)}; \mathbf{\mu_j}, \mathbf{\Sigma_j})}{\sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{x}^{(i)}; \mathbf{\mu_k}, \mathbf{\Sigma_k})} \end{aligned} $$

  2. M-step: Update each Gaussian cluster (mean, covariance, and weight) using “soft” weighting. The “soft” count of points in cluster $j$ is,

$$ M_j = \sum_{i=1}^M z_j^{(i)} $$

Weight,

$$ \pi_j = \frac{M_j}{M} $$

Mean,

$$ \mathbf{\mu_j} = \frac{1}{M_j} \sum_{i=1}^M z_j^{(i)} \mathbf{x}^{(i)} $$

Covariance,

$$ \mathbf{\Sigma_j} = \frac{1}{M_j} \sum_{i=1}^M z_j^{(i)} (\mathbf{x}^{(i)} - \mathbf{\mu_j})(\mathbf{x}^{(i)} - \mathbf{\mu_j})^T $$
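A minimal NumPy/SciPy sketch of one EM iteration following these update equations (the initialization, data, and stopping rule are assumptions; a practical implementation would also monitor the log-likelihood and guard against degenerate covariances):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, Sigma):
    """One E-step and M-step of EM for a GMM with K components on data X (M x N)."""
    M, K = len(X), len(pi)
    # E-step: responsibilities z[i, j] = p(z^(i) = j | x^(i)).
    dens = np.column_stack([pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                            for j in range(K)])
    z = dens / dens.sum(axis=1, keepdims=True)
    # M-step: soft counts, then weights, means, and covariances.
    Mj = z.sum(axis=0)
    pi = Mj / M
    mu = (z.T @ X) / Mj[:, None]
    Sigma = np.stack([(z[:, j, None] * (X - mu[j])).T @ (X - mu[j]) / Mj[j]
                      for j in range(K)])
    return pi, mu, Sigma, z

# Toy run: two clusters, identity initial covariances, random initial means.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (100, 2)), rng.normal([5, 5], 1, (100, 2))])
pi = np.array([0.5, 0.5])
mu = X[rng.choice(len(X), 2, replace=False)]
Sigma = np.stack([np.eye(2), np.eye(2)])
for _ in range(20):
    pi, mu, Sigma, z = em_step(X, pi, mu, Sigma)
print(pi)
print(mu)
```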

GMM: A Special Case

Suppose we fix $\pi_j = \frac{1}{K}$ and $\mathbf{\Sigma}_j = \mathbf{I}$. In this case we have,

$$ p(\mathbf{x}^{(i)} | z^{(i)} = j) = \mathcal{N}(\mathbf{x}^{(i)}; \mathbf{\mu}_j, \mathbf{I}) = \frac{1}{(2\pi)^{N/2}} \exp \left( -\frac{1}{2} \Vert \mathbf{x}^{(i)} - \mathbf{\mu}_j \Vert_2^2 \right) $$

We obtain a special case of the EM algorithm for GMM,

$$ z_j^{(i)} = \frac{\exp \left( -\frac{1}{2} \Vert \mathbf{x}^{(i)} - \mathbf{\mu_j} \Vert_2^2 \right)}{\sum_{k=1}^K \exp \left( -\frac{1}{2} \Vert \mathbf{x}^{(i)} - \mathbf{\mu_k} \Vert_2^2 \right)} $$

$$ \mathbf{\mu_j} = \frac{1}{M_j} \sum_{i=1}^M z_j^{(i)} \mathbf{x}^{(i)} $$

This is often referred to as soft $K$-means.

Covariance Matrix

The covariance matrix is an $N \times N$ matrix, e.g., for $N = 3$,

$$ \begin{bmatrix} a_{11} & a_{12} & a_{13} \newline a_{21} & a_{22} & a_{23} \newline a_{31} & a_{32} & a_{33} \end{bmatrix} $$

For high-dimensional data, it can be very large and thus requires a lot of data to learn effectively.

One solution is to use a diagonal covariance matrix ($N$ parameters) or a spherical covariance matrix (1 parameter),

$$ \begin{bmatrix} a_{11} & 0 & 0 \newline 0 & a_{22} & 0 \newline 0 & 0 & a_{33} \end{bmatrix} \quad \begin{bmatrix} a & 0 & 0 \newline 0 & a & 0 \newline 0 & 0 & a \end{bmatrix} $$
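In scikit-learn's `GaussianMixture`, these restricted forms correspond to the `covariance_type` option (shown here without fitting; illustrative only):

```python
from sklearn.mixture import GaussianMixture

# 'full'      : one unrestricted N x N covariance per cluster
# 'diag'      : one diagonal covariance per cluster (N parameters each)
# 'spherical' : one shared variance per cluster (1 parameter each)
gmm_diag = GaussianMixture(n_components=3, covariance_type="diag")
gmm_spherical = GaussianMixture(n_components=3, covariance_type="spherical")
# Call .fit(X) on your data as usual.
```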

Trade-Offs

The original $K$-means algorithm performs hard assignments during clustering, and implicitly assumes that all clusters have equal prior weight (an equal expected number of points) as well as a unit covariance matrix.

GMM for clustering relaxes all of these assumptions. The objective still has multiple local optima.

EM can also be used with any component densities/distributions to customize the model to a given data set.

As with $K$-means, initialization is important for GMM.