Overview

Part 12 - Unsupervised Learning II

SSY316
Date: December 12, 2025
Last modified: December 13, 2025

Introduction

Last time we discussed the clustering objective and had a look at two approaches, namely K-means and Gaussian Mixture Models (GMMs). In this part, we will continue our discussion on unsupervised learning by exploring dimensionality reduction and introducing Principal Component Analysis (PCA) as a popular method for this task.

Dimensionality Reduction

Intuition: Dimensionality Reduction

The goal of dimensionality reduction is to learn a mapping from the high-dimensional (visible) space $\mathbf{x} \in \mathbb{R}^d$ to a low-dimensional (latent) space $\mathbf{z} \in \mathbb{R}^m$ with $m \ll d$.

Principal Component Analysis (PCA)

Intuition: Principal Component Analysis (PCA)

PCA finds a linear (and, as we will formalize later, orthogonal) projection of high-dimensional data $\mathbf{x} \in \mathbb{R}^d$ to a low-dimensional subspace $\mathbf{z} \in \mathbb{R}^m$ such that $\mathbf{z}$ is a good approximation of the data (i.e., retains as much valuable information as possible).

PCA accomplishes this by finding the directions (principal components) that capture the maximum variance in the data.

The intuition is that our data may appear high-dimensional, but often only a few degrees of freedom (latent factors) explain most of the variability in the data. For example, in face recognition, a few latent factors (lighting, pose, identity) describe most of the variability.

Derivation: Principal Component Analysis (PCA)

Thus, a good approximation of the data is one for which the reconstruction error made by the projection is minimal, which means,

  • We project $\mathbf{x}$ onto $\mathbf{z}$.
  • We want our approximation $\hat{\mathbf{x}}$ (arising from the projection onto $\mathbf{z}$) to be as close as possible to the original data $\mathbf{x}$.
  • Distortion measure: We can take the squared distance between our original data point $\mathbf{x}_i$ and its (approximate) reconstruction $\hat{\mathbf{x}}_i$, averaged over all data points, $$ \min \mathcal{L} \coloneqq \frac{1}{N} \sum_{i = 1}^N \Vert \mathbf{x}_i - \hat{\mathbf{x}}_i \Vert^2 $$

But how can we find a linear subspace that minimizes the reconstruction error? Firstly, we have to choose a 1-dimensional subspace, which can be specified by a unit vector $\mathbf{u}$.

Secondly, we can project the data points onto it via $z = \mathbf{u}^T \mathbf{x}$. Lastly, the coordinate in the original space that represents the projection onto $\mathbf{u}$ is given by $(\mathbf{u}^T \mathbf{x}) \mathbf{u}$.

Thus, we can write the reconstruction error as, $$ \mathcal{L}(\mathbf{u}) = \frac{1}{N} \sum_{i = 1}^N \Vert \mathbf{x}_i - \hat{\mathbf{x}}_i \Vert^2, \quad \hat{\mathbf{x}}_i = (\mathbf{u}^T \mathbf{x}_i) \mathbf{u} $$

Now, we can investigate the geometry of this projection. Let $d$ be the distance of $\mathbf{x}$ from the origin (not to be confused with the dimensionality $d$), and let $r$ and $v$ be the distance of $\mathbf{x}$ from $\hat{\mathbf{x}}$ and the distance of $\hat{\mathbf{x}}$ from the origin, respectively. Using the Pythagorean theorem, we have, $$ r^2 + v^2 = d^2, $$ where $r^2$ contributes to the reconstruction error, while $v^2$ contributes to the variance of the projected data within the subspace.

This means that minimizing the reconstruction error is equivalent to maximizing the variance of the projected data!
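As a sanity check of this equivalence, the following NumPy sketch (the toy data and names are illustrative assumptions, not from the lecture) projects centered 2-D data onto a few unit directions and verifies that reconstruction error and projected variance always sum to the same total, so minimizing one maximizes the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Centered 2-D toy data with most variance along the first axis.
X = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])
X -= X.mean(axis=0)

def recon_error_and_variance(X, u):
    """Reconstruction error and projected variance for a unit direction u."""
    u = u / np.linalg.norm(u)
    z = X @ u                      # projections z_i = u^T x_i
    X_hat = np.outer(z, u)         # reconstructions (u^T x_i) u
    err = np.mean(np.sum((X - X_hat) ** 2, axis=1))
    var = np.mean(z ** 2)          # variance of the (zero-mean) projections
    return err, var

total = np.mean(np.sum(X ** 2, axis=1))  # average squared distance from the origin
for angle in [0.0, 0.5, 1.0, 1.5]:
    u = np.array([np.cos(angle), np.sin(angle)])
    err, var = recon_error_and_variance(X, u)
    # Pythagorean relation: err + var == total for every direction u,
    # so minimizing err is the same as maximizing var.
    print(f"angle={angle:.1f}  err={err:.3f}  var={var:.3f}  err+var={err + var:.3f}")
```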

Note

However, note that this can be a double-edged sword. PCA can be “misled” by directions in which variance is high because of the measurement scale → Always standardize the data first.
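A minimal preprocessing sketch, assuming NumPy and made-up feature scales:

```python
import numpy as np

# Two features measured on very different scales (e.g., metres vs. millimetres).
rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(0, 1, 1000), rng.normal(0, 1000, 1000)])

# Standardize: zero mean, unit variance per feature, so no feature dominates
# the covariance matrix purely because of its measurement unit.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```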

Definition: Principal Component Analysis (PCA)

Formally, principal components are a sequence of $m$ (unit) vectors where the $i$-th vector is the direction of a line that best fits the data while being orthogonal to the first $i - 1$ vectors.

Thus, the best-fitting line is equivalent to the direction along which the data shows maximum variance.

  1. 1st PC: Direction in space along which projections have largest variance.

  2. 2nd PC: Direction maximizing variance among all directions orthogonal to 1st PC.

  3. $k$-th PC: Variance-maximizing direction orthogonal to previous $k - 1$ components.
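To connect this sequential definition with computation, here is a minimal NumPy sketch (the function name and the deflation approach are my illustration, not from the lecture) that builds the components one at a time: each step takes the variance-maximizing direction of the covariance that remains after removing the variance already captured.

```python
import numpy as np

def principal_components_greedy(X, m):
    """Greedy construction following the definition above: each new direction
    maximizes the projected variance subject to being orthogonal to the
    directions already found (implemented via deflation of the covariance)."""
    X = X - X.mean(axis=0)
    S = X.T @ X / len(X)                        # empirical covariance matrix
    components = []
    for _ in range(m):
        eigvals, eigvecs = np.linalg.eigh(S)    # ascending eigenvalues
        u = eigvecs[:, -1]                      # current variance-maximizing direction
        components.append(u)
        S = S - eigvals[-1] * np.outer(u, u)    # remove the captured variance
    return np.array(components)
```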

Derivation: Reconstruction Error and Variance

Let $\{\mathbf{x}_i \in \mathbb{R}^d\}$ be the high-dimensional observations. Further, let $\mathbf{u}_j \in \mathbb{R}^d$ denote the $j$-th principal direction, and let $z_i = \mathbf{u}_1^T \mathbf{x}_i$ be the projection of $\mathbf{x}_i$ onto the first direction $\mathbf{u}_1$.

Then, we can find the best 1-dimensional subspace, spanned by $\mathbf{u}_1$, by minimizing, $$ \begin{align*} \mathcal{L}(\mathbf{u}_1) & = \frac{1}{N} \sum_{i = 1}^N \Vert \mathbf{x}_i - \hat{\mathbf{x}}_i \Vert^2 \newline & = \frac{1}{N} \sum_{i = 1}^N \Vert \mathbf{x}_i - z_i \mathbf{u}_1 \Vert^2 \newline & = \frac{1}{N} \sum_{i = 1}^N \left(\mathbf{x}_i - z_i \mathbf{u}_1\right)^T \left(\mathbf{x}_i - z_i \mathbf{u}_1\right) \newline & = \frac{1}{N} \sum_{i = 1}^N \left(\mathbf{x}_i^T \mathbf{x}_i - 2 z_i \mathbf{u}_1^T \mathbf{x}_i + z_i^2 \mathbf{u}_1^T \mathbf{u}_1\right) \newline & = \frac{1}{N} \sum_{i = 1}^N \left(\mathbf{x}_i^T \mathbf{x}_i - 2 z_i \mathbf{u}_1^T \mathbf{x}_i + z_i^2\right) \newline & = \frac{1}{N} \sum_{i = 1}^N \left(\mathbf{x}_i^T \mathbf{x}_i - \left(\mathbf{u}_1^T \mathbf{x}_i\right)^2\right) \newline \end{align*} $$ using $\mathbf{u}_1^T \mathbf{u}_1 = 1$ and $z_i = \mathbf{u}_1^T \mathbf{x}_i$ in the last two steps.

Thus, minimizing $\mathcal{L}(\mathbf{u}_1)$ is equivalent to maximizing, $$ \frac{1}{N} \sum_{i = 1}^N \left(\mathbf{u}_1^T \mathbf{x}_i\right)^2 $$ which is the sample mean of $z_i^2$. Assuming the data has been centered (zero mean), this is exactly the sample variance of $z$, since, $$ \mathrm{Var}(z) \approx \frac{1}{N} \sum_{i = 1}^N (\mathbf{u}_1^T \mathbf{x}_i)^2 - \left(\frac{1}{N} \sum_{i = 1}^N \mathbf{u}_1^T \mathbf{x}_i\right)^2 = \frac{1}{N} \sum_{i = 1}^N (\mathbf{u}_1^T \mathbf{x}_i)^2 $$ where the second term vanishes because $\frac{1}{N} \sum_{i = 1}^N \mathbf{x}_i = \mathbf{0}$.

This means that for the first principal component, $$ \begin{align*} \mathbf{u}_1^{\star} & = \underset{\mathbf{u}_1}{\arg\min} \ \mathcal{L}(\mathbf{u}_1) \newline & = \underset{\mathbf{u}_1}{\arg\max} \ \mathrm{Var}(z) \end{align*} $$

We can use vector-matrix notation by introducing the empirical covariance matrix $\mathbf{S}$ of the centered data, defined as, $$ \mathbf{S} \coloneqq \frac{1}{N} \sum_{i = 1}^N \mathbf{x}_i \mathbf{x}_i^T $$ Thus, our objective is to choose a direction that maximizes the projected variance → we need to maximize $\mathbf{u}_1^T \mathbf{S} \mathbf{u}_1$ with respect to $\mathbf{u}_1$. However, to prevent $\Vert \mathbf{u}_1 \Vert \to \infty$, we add the constraint $\Vert \mathbf{u}_1 \Vert^2 = \mathbf{u}_1^T \mathbf{u}_1 = 1$, $$ \mathbf{u}_1^{\star} = \underset{\mathbf{u}_1}{\arg\max} \ \mathbf{u}_1^T \mathbf{S} \mathbf{u}_1 + \lambda_1 (1 - \mathbf{u}_1^T \mathbf{u}_1) $$ where $\lambda_1$ is a Lagrange multiplier.

By taking the derivative with respect to $\mathbf{u}_1$ and setting it to zero, we have, $$ \mathbf{S} \mathbf{u}_1 = \lambda_1 \mathbf{u}_1 $$ which corresponds to the eigenvector equation of the matrix $\mathbf{S}$.

Recall: Eigenvalues and Eigenvectors

Let $\mathbf{A}$ be a (real-valued) $n \times n$ matrix. Then $\lambda \in \mathbb{R}$ and a nonzero vector $\mathbf{v}$ are said to be an eigenvalue and corresponding eigenvector of $\mathbf{A}$ if, $$ \mathbf{A} \mathbf{v} = \lambda \mathbf{v} $$ The linear transformation of $\mathbf{A}$ on the eigenvector $\mathbf{v}$ only changes its magnitude by the factor $\lambda$ (i.e., scaling) but not its direction.

An $n \times n$ matrix $\mathbf{A}$ with $n$ linearly independent eigenvectors $\mathbf{v}_i$ can be factorized as, $$ \mathbf{A} = \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^{-1} $$ where $\mathbf{V} = (\mathbf{v}_1, \ldots, \mathbf{v}_n)$ is the $n \times n$ matrix whose columns are the eigenvectors of $\mathbf{A}$ and $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ is the diagonal matrix whose entries are the corresponding eigenvalues.
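A quick NumPy check of these two facts on a small symmetric matrix (chosen arbitrarily for illustration, since the covariance matrix relevant here is symmetric):

```python
import numpy as np

# Eigendecomposition of a real symmetric matrix: A = V diag(lambda) V^{-1}.
rng = np.random.default_rng(3)
B = rng.normal(size=(4, 4))
A = B + B.T                                # make A symmetric

eigvals, V = np.linalg.eigh(A)             # columns of V are eigenvectors
Lambda = np.diag(eigvals)

print(np.allclose(A @ V[:, 0], eigvals[0] * V[:, 0]))    # A v = lambda v
print(np.allclose(A, V @ Lambda @ np.linalg.inv(V)))     # A = V Lambda V^{-1}
```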

Thus, $\mathbf{u}_1$ is an eigenvector of $\mathbf{S}$. By left-multiplying the equation above by $\mathbf{u}_1^T$ and using the fact that $\mathbf{u}_1^T \mathbf{u}_1 = 1$, we have, $$ \begin{align*} \mathbf{u}_1^T \mathbf{S} \mathbf{u}_1 & = \lambda_1 \mathbf{u}_1^T \mathbf{u}_1 \newline & = \lambda_1 \end{align*} $$ Thus, the projected variance $\mathrm{Var}(z) \approx \mathbf{u}_1^T \mathbf{S} \mathbf{u}_1 = \lambda_1$ is maximized when we set $\mathbf{u}_1$ equal to the eigenvector with the largest eigenvalue $\lambda_1$.
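The following NumPy sketch (toy data assumed, names illustrative) verifies this numerically: the top eigenvector attains projected variance $\lambda_1$, and no randomly drawn unit direction exceeds it.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 5)) @ rng.normal(size=(5, 5))
X -= X.mean(axis=0)
S = X.T @ X / len(X)                        # empirical covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)        # ascending eigenvalues
u1, lam1 = eigvecs[:, -1], eigvals[-1]

# u1^T S u1 equals the largest eigenvalue ...
print(np.isclose(u1 @ S @ u1, lam1))

# ... and no other unit direction achieves a larger projected variance.
U = rng.normal(size=(10000, 5))
U /= np.linalg.norm(U, axis=1, keepdims=True)
print(np.all(np.einsum('id,de,ie->i', U, S, U) <= lam1 + 1e-9))
```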

We can define additional principal components in an incremental fashion by choosing each new direction to be that which maximizes the projected variance amongst all possible directions orthogonal to those already considered.

The optimal projection onto an $m$-dimensional space is defined by the $m$ eigenvectors $\mathbf{u}_1, \ldots, \mathbf{u}_m$ of the data covariance matrix $\mathbf{S}$ corresponding to the $m$ largest eigenvalues $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_m$.
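Putting the pieces together, here is a minimal PCA sketch via eigendecomposition of the covariance matrix; the function name and toy data are illustrative assumptions, not from the lecture.

```python
import numpy as np

def pca(X, m):
    """Project X onto the m-dimensional principal subspace via the
    eigendecomposition route described above."""
    mu = X.mean(axis=0)
    Xc = X - mu                                  # center the data
    S = Xc.T @ Xc / len(Xc)                      # empirical covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)         # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:m]        # indices of the m largest eigenvalues
    U = eigvecs[:, order]                        # d x m matrix of principal directions
    Z = Xc @ U                                   # low-dimensional coordinates
    return Z, U, eigvals[order]

# Usage: reduce 10-dimensional toy data to 2 dimensions.
rng = np.random.default_rng(5)
X = rng.normal(size=(300, 10)) @ rng.normal(size=(10, 10))
Z, U, top_eigvals = pca(X, m=2)
print(Z.shape, U.shape)                          # (300, 2) (10, 2)
```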

Probabilistic PCA

Intuition: Probabilistic PCA

Probabilistic PCA (PPCA) is a probabilistic reformulation of PCA that models the data generation process using a latent variable model. It introduces an explicit $m$-dimensional latent variable $\mathbf{z}$ corresponding to the principal-component subspace.

It defines the probabilistic model, $$ \begin{align*} p(\mathbf{z}) & = \mathcal{N}(\mathbf{z} \mid \mathbf{0}, \mathbf{I}), \newline p(\mathbf{x} \mid \mathbf{z}) & = \mathcal{N}(\mathbf{x} \mid \mathbf{W} \mathbf{z} + \boldsymbol{\mu}, \sigma^2 \mathbf{I}), \end{align*} $$ where $\mathbf{W}$ is a $d \times m$ matrix with columns spanning a linear subspace within the data space that corresponds to the principal subspace.

Note

Here, $\mathbf{W}, \boldsymbol{\mu}, \sigma^2$ are the model parameters to be learned from data.
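To make the generative model concrete, here is a small NumPy sketch that samples from it, with $\mathbf{W}$, $\boldsymbol{\mu}$, and $\sigma^2$ chosen arbitrarily for illustration; the sample covariance of $\mathbf{x}$ should approach the marginal covariance derived next.

```python
import numpy as np

# Sample from the PPCA generative model:
#   z ~ N(0, I_m),   x | z ~ N(W z + mu, sigma^2 I_d).
rng = np.random.default_rng(6)
d, m, N = 5, 2, 5000
W = rng.normal(size=(d, m))
mu = rng.normal(size=d)
sigma = 0.1

Z = rng.normal(size=(N, m))                           # latent variables
X = Z @ W.T + mu + sigma * rng.normal(size=(N, d))    # observations

# The sample covariance of x approaches C = W W^T + sigma^2 I for large N.
C = W @ W.T + sigma**2 * np.eye(d)
print(np.max(np.abs(np.cov(X.T, bias=True) - C)))     # small deviation
```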

By maximum-likelihood (ML) learning, we have, $$ \begin{align*} \{\mathbf{W}, \boldsymbol{\mu}, \sigma^{2}\}^{\star} & \coloneqq \underset{\{\mathbf{W}, \boldsymbol{\mu}, \sigma^2\}}{\arg\max} \ \log p(\mathcal{D} \mid \mathbf{W}, \boldsymbol{\mu}, \sigma^2) \newline & = \underset{\{\mathbf{W}, \boldsymbol{\mu}, \sigma^2\}}{\arg\max} \ \sum_{n = 1}^N \log p(\mathbf{x}_n \mid \mathbf{W}, \boldsymbol{\mu}, \sigma^2) \end{align*} $$

To evaluate this likelihood, we need the marginal $p(\mathbf{x})$, $$ \begin{align*} p(\mathbf{x}) & = \int p(\mathbf{x} \mid \mathbf{z}) p(\mathbf{z}) \ d\mathbf{z} \newline & = \int \mathcal{N}(\mathbf{x} \mid \mathbf{W} \mathbf{z} + \boldsymbol{\mu}, \sigma^2 \mathbf{I}) \mathcal{N}(\mathbf{z} \mid \mathbf{0}, \mathbf{I}) \ d\mathbf{z} \newline & = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \mathbf{C}) \end{align*} $$ where $\mathbf{C} = \mathbf{W} \mathbf{W}^T + \sigma^2 \mathbf{I}$.
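Given the marginal, the log-likelihood to be maximized can be evaluated directly; a sketch assuming NumPy/SciPy, with an illustrative function name.

```python
import numpy as np
from scipy.stats import multivariate_normal

def ppca_log_likelihood(X, W, mu, sigma2):
    """Marginal log-likelihood sum_n log N(x_n | mu, C), C = W W^T + sigma^2 I."""
    d = X.shape[1]
    C = W @ W.T + sigma2 * np.eye(d)
    return multivariate_normal(mean=mu, cov=C).logpdf(X).sum()

# Example usage (with X, W, mu, sigma as in the sampling sketch above):
# ll = ppca_log_likelihood(X, W, mu, sigma**2)
```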

Once we fit the model, we can compute the probabilistic latent embeddings using $p(\mathbf{z} \mid \mathbf{x})$, $$ p(\mathbf{z} \mid \mathbf{x}) \coloneqq \frac{p(\mathbf{x} \mid \mathbf{z}) p(\mathbf{z})}{p(\mathbf{x})} = \mathcal{N}(\mathbf{z} \mid \mathbf{M}^{-1} \mathbf{W}^T (\mathbf{x} - \boldsymbol{\mu}), \sigma^{2} \mathbf{M}^{-1}), $$ where $\mathbf{M} = \mathbf{W}^T \mathbf{W} + \sigma^2 \mathbf{I}$.
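A corresponding sketch of the posterior computation (NumPy assumed, names illustrative):

```python
import numpy as np

def ppca_posterior(x, W, mu, sigma2):
    """Posterior p(z | x) = N(z | M^{-1} W^T (x - mu), sigma^2 M^{-1}),
    with M = W^T W + sigma^2 I."""
    m = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(m)
    mean = np.linalg.solve(M, W.T @ (x - mu))     # M^{-1} W^T (x - mu)
    cov = sigma2 * np.linalg.inv(M)
    return mean, cov
```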

PCA corresponds to the posterior mean $\mathbb{E}[\mathbf{z} \mid \mathbf{x}]$ (using ML learning) in the limiting case $\sigma^2 \to 0$.