Overview

Part 12 - Unsupervised Learning I

SSY316
Date: December 11, 2025
Last modified: December 13, 2025
11 min read
SSY316_9

Introduction

In this and the final part(s), we will explore unsupervised learning methods in probabilistic machine learning. Unsupervised learning involves learning patterns from unlabeled data, which is a common scenario in real-world applications.

The Unsupervised Learning Problem

Intuition: The Unsupervised Learning Problem

Unsupervised learning is the branch of machine learning that operates on unlabeled data sets $\mathcal{D} \coloneqq \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$. The goal is to find meaningful structure in the data, characterized by the presence of hidden (latent) variables that help us explain that structure.

We can (somewhat) formally define this as follows: given the data set $\mathcal{D} \coloneqq \{\mathbf{x}_n\}_{n = 1}^N \sim_\text{i.i.d.} p(\mathbf{x})$, learn some useful properties of $p(\mathbf{x})$. Thus, the key difference from supervised learning is that we want to model $p(\mathbf{x})$ (unsupervised) instead of $p(y \mid \mathbf{x})$ (supervised).

Example

Some common unsupervised learning tasks include:

  • Discovering clusters: Clustering data into groups:
    • Assume presence of an unobserved label $y_n$ (hidden, latent variable) associated to each $\mathbf{x}_n$.
    • Goal: Recovering labels $y_n$ for all $\mathbf{x}_n \in \mathcal{D}$.
  • Dimensionality reduction: Reduce dimensionality of data by projecting it to lower dimensional subspace which captures its essence.
    • Data may appear high-dimensional, but there may be only a few degrees of variability (latent factors).
    • Thus, it may be possible to identify independent explanatory factors, making the data easier to visualize, analyze, and use for downstream tasks.
  • Generation of new samples: Learn a model that produces samples approximately distributed according to $p(\mathbf{x})$.
    • Examples: computer graphics for gaming; software that can produce artificial scenes based on a given description.

Clustering

We will discuss clustering in more detail, as it is one of the most fundamental unsupervised learning tasks. We will first go through $K$-means clustering, a simple and popular clustering algorithm. Then, we will discuss Gaussian Mixture Models (GMMs), which provide a more powerful, probabilistic approach to clustering.

$K$-Means Clustering

Intuition: $K$-Means Clustering

Given a data set $\mathcal{D} \coloneqq \{\mathbf{x}_n\}_{n = 1}^N$, we want to partition it into $K$ clusters.

  • Here, cluster indices are encoded by categorical variables $\mathbf{y}_n$ via one-hot encoding.
  • $y_{nk}$ is the $k$-th component of $\mathbf{y}_n$.
  • $y_{nk} = 1$ if $\mathbf{x}_n$ assigned to cluster $k$, $y_{nk} = 0$ otherwise.

But how should we perform the clustering?

A “cluster” should comprise points that are close to each other → cluster together points that are mutually close in Euclidean distance (other distance metrics are possible as well). $K$-means represents all points in the same cluster by a prototype $\boldsymbol{\mu}_k$ → the $\{\boldsymbol{\mu}_k\}$ represent the centers of the clusters.

Thus, the goal is to find an assignment of data points to clusters and centers $\{\boldsymbol{\mu}_k\}$ such that all points within a given cluster can be quantized to $\boldsymbol{\mu}_k$ with minimum quadratic loss, $$ \begin{align*} \{\mathbf{y}_n^\star\}, \{\boldsymbol{\mu}_k^\star\} & = \underset{\{\mathbf{y}_n\}, \{\boldsymbol{\mu}_k\}}{\arg\min} \ \sum_{n = 1}^N \sum_{k = 1}^K y_{nk} d(\mathbf{x}_n, \boldsymbol{\mu}_k) \newline & = \underset{\{\mathbf{y}_n\}, \{\boldsymbol{\mu}_k\}}{\arg\min} \ \sum_{n = 1}^N \sum_{k = 1}^K y_{nk} \Vert \mathbf{x}_n - \boldsymbol{\mu}_k \Vert^2, \end{align*} $$ where $d(\cdot, \cdot)$ is a general distance metric; if a metric other than the squared Euclidean distance is used, the algorithm is called $K$-medoids.

Thus, $K$-means solves the minimization problem by alternately optimizing $\mathbf{y}_n$ and $\boldsymbol{\mu}_k$.

Algorithm: $K$-Means Clustering
    1. Initialization: Initialize $\{\boldsymbol{\mu}_k^{(0)}\}$ (e.g., randomly select $K$ data points as initial cluster centers).
  • For $\ell = 1, \ldots$:
    1. Assignment step: For fixed $\{\boldsymbol{\mu}_k^{(\ell - 1)}\}$, solve the optimization problem for each data point $\mathbf{x}_n$, $$ y_{nk}^{(\ell)} = \begin{cases} 1, & \text{for } k = \arg \min_j \Vert \mathbf{x}_n - \boldsymbol{\mu}_j^{(\ell - 1)} \Vert^2 \newline 0, & \text{otherwise} \end{cases} $$
    2. Refitting step: For fixed $\{\mathbf{y}_n^{(\ell)}\}$, solve the optimization problem for each cluster center $\boldsymbol{\mu}_k$, $$ \boldsymbol{\mu}_k^{(\ell)} = \frac{\sum_{n = 1}^N y_{nk}^{(\ell)} \mathbf{x}_n}{\sum_{n = 1}^N y_{nk}^{(\ell)}} $$
  • Repeat until convergence (i.e., cluster assignments do not change or the change in the loss function is below a threshold).
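
To make the alternating updates concrete, here is a minimal NumPy sketch of the algorithm above (not a reference implementation from the course); the function name `kmeans` and the toy two-blob data are my own choices, and the sketch assumes squared Euclidean distance with randomly chosen data points as initial centers.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal K-means: alternate the assignment and refitting steps."""
    rng = np.random.default_rng(seed)
    N, _ = X.shape
    # Initialization: randomly pick K data points as initial cluster centers.
    mu = X[rng.choice(N, size=K, replace=False)].copy()
    labels = np.full(N, -1)
    for it in range(max_iter):
        # Assignment step: assign each point to its closest center (squared Euclidean).
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # shape (N, K)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments unchanged -> converged
        labels = new_labels
        # Refitting step: move each center to the mean of its assigned points.
        for k in range(K):
            if np.any(labels == k):  # guard against empty clusters
                mu[k] = X[labels == k].mean(axis=0)
    return labels, mu

# Toy example: two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
labels, mu = kmeans(X, K=2)
print(mu)  # centers should be close to [0, 0] and [5, 5]
```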

Gaussian Mixture Models (GMMs)

Intuition: Gaussian Mixture Models (GMMs)

In the $K$-means case we (try to) place a hyper-sphere at the center of each cluster. But if the clusters are far from spherical, $K$-means will not perform well.

The Gaussian Mixture Model (GMM) is a probabilistic model that assumes the data is generated from a mixture of several Gaussian distributions. Formally, we have, $$ p(\mathbf{x}) = \sum_{k = 1}^K \pi_k \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \quad \sum_{k = 1}^K \pi_k = 1, $$ where $\pi_k$ are the mixing coefficients (the proportion of the data explained by component $k$) and $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ are the Gaussian component densities.

Thus, the goal is to assign each data point to a cluster based on the likelihood of belonging to one of the Gaussian components.
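
To see the generative view concretely (first draw a component from the mixing coefficients, then draw $\mathbf{x}$ from the corresponding Gaussian), here is a small sketch with arbitrary, made-up parameters; the names `pis`, `mus`, and `Sigmas` are purely illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Illustrative 2-component GMM in 2D (parameters chosen arbitrarily).
pis = np.array([0.3, 0.7])                                          # mixing coefficients (sum to 1)
mus = np.array([[0.0, 0.0], [4.0, 4.0]])                            # component means
Sigmas = np.stack([np.eye(2), np.array([[2.0, 0.8], [0.8, 1.0]])])  # component covariances

# Generative process: draw z ~ Categorical(pi), then x ~ N(mu_z, Sigma_z).
N = 500
z = rng.choice(len(pis), size=N, p=pis)
X = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])

# Evaluate the mixture density p(x) = sum_k pi_k * N(x | mu_k, Sigma_k) at one point.
x0 = np.array([1.0, 1.0])
p_x0 = sum(pi * multivariate_normal(mu, Sigma).pdf(x0)
           for pi, mu, Sigma in zip(pis, mus, Sigmas))
print(p_x0)
```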

Derivation: Gaussian Mixture Models (GMMs)

Firstly, since we have $K$ clusters, we introduce a hidden (categorical) random variable $\mathbf{\mathsf{z}} \in \{1, \ldots, K\}$ representing which Gaussian generated our observation $\mathbf{x}$. We can write $p(\mathbf{x})$ as, $$ \begin{align*} p(\mathbf{x}) & = \sum_{k = 1}^K p(\mathbf{x}, \mathbf{\mathsf{z}} = k) \newline & = \sum_{k = 1}^K p(\mathbf{\mathsf{z}} = k) p(\mathbf{x} \mid \mathbf{\mathsf{z}} = k) \newline & = \sum_{k = 1}^K \pi_k \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k). \end{align*} $$ $\mathbf{\mathsf{z}}$ can be represented by a one-hot vector, which means, $$ \begin{align*} p(\mathbf{\mathsf{z}} = k) & = p(z_k = 1) = \pi_k, \newline p(\mathbf{x} \mid \mathbf{\mathsf{z}} = k) & = p(\mathbf{x} \mid z_k = 1) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k). \end{align*} $$ Thus, we can write, $$ \begin{align*} p(\mathbf{\mathsf{z}}) & = \prod_{k = 1}^K \pi_k^{z_k}, \newline p(\mathbf{x} \mid \mathbf{\mathsf{z}}) & = \prod_{k = 1}^K \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_k}. \end{align*} $$ Since we are interested in the posterior, $$ \begin{align*} p(z_k = 1 \mid \mathbf{x}) & = \frac{p(z_k = 1) p(\mathbf{x} \mid z_k = 1)}{p(\mathbf{x})} \newline & = \frac{\pi_k \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j = 1}^K \pi_j \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}, \newline \end{align*} $$ where $p(z_k = 1 \mid \mathbf{x})$ is the posterior probability (responsibility) that the data point $\mathbf{x}$ belongs to cluster $k$.
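
Numerically, the posterior above (the responsibility of component $k$ for a point) is just a normalization over components. Below is a small sketch assuming toy parameters like those used earlier; `responsibilities` is a name I made up, not a library function.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, Sigmas):
    """Soft assignments p(z_k = 1 | x_n) for every data point and component."""
    K = len(pis)
    # Unnormalized: pi_k * N(x_n | mu_k, Sigma_k), for every n and k.
    r = np.column_stack([
        pis[k] * multivariate_normal(mus[k], Sigmas[k]).pdf(X)
        for k in range(K)
    ])  # shape (N, K)
    # Normalize over components so every row sums to 1.
    return r / r.sum(axis=1, keepdims=True)

# Toy parameters and a few query points.
pis = np.array([0.3, 0.7])
mus = np.array([[0.0, 0.0], [4.0, 4.0]])
Sigmas = np.stack([np.eye(2), np.eye(2)])
X = np.array([[0.1, -0.2], [3.9, 4.2], [2.0, 2.0]])
print(responsibilities(X, pis, mus, Sigmas))
# Rows sum to 1; the second point is almost certainly from component 2.
```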

Note

In the $K$-means case, we hard assign each data point to a cluster. In GMMs, we soft assign each data point to clusters based on the posterior probabilities.
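
If a hard partition is still needed after fitting a GMM, one can simply take the most probable component for each point, as in the tiny sketch below (made-up responsibility values); note that this discards the uncertainty carried by the soft assignments.

```python
import numpy as np

# Responsibilities for three points and two components (e.g. computed as above).
r = np.array([[0.90, 0.10],
              [0.20, 0.80],
              [0.55, 0.45]])

# K-means-style hard assignment: keep only the most responsible component.
hard = r.argmax(axis=1)
print(hard)  # [0 1 0] -- the graded information in the third row (0.55 vs 0.45) is lost
```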

Intuition: Learning Model Parameters in GMMs

We want to learn $\boldsymbol{\theta} = \{\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}\}$ from the data set $\mathcal{D} \coloneqq \{\mathbf{x}_n\}_{n = 1}^N$. Using MLE, we find that, $$ \begin{align*} \boldsymbol{\theta}^{\star} & = \underset{\boldsymbol{\theta}}{\arg\max} \ \log p(\mathcal{D} \mid \boldsymbol{\theta}) \newline & = \underset{\boldsymbol{\theta}}{\arg\max} \ \log \left(\prod_{n = 1}^N p(\mathbf{x}_n \mid \boldsymbol{\theta})\right) \newline & = \underset{\boldsymbol{\theta}}{\arg\max} \ \sum_{n = 1}^N \log p(\mathbf{x}_n \mid \boldsymbol{\theta}) \newline & = \underset{\boldsymbol{\theta}}{\arg\max} \ \sum_{n = 1}^N \log \left(\sum_{k = 1}^K \pi_k \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\right). \end{align*} $$ However, this optimization problem does not have a closed-form solution. We can use the Expectation-Maximization (EM) algorithm to iteratively find an approximate solution.
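
To see the difficulty: the sum over $k$ sits inside the logarithm, so setting the gradient to zero does not decouple the parameters. We can still evaluate the objective for any candidate $\boldsymbol{\theta}$, as in this sketch (the toy data and the helper name `gmm_log_likelihood` are my own).

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """log p(D | theta) = sum_n log sum_k pi_k * N(x_n | mu_k, Sigma_k)."""
    per_component = np.column_stack([
        pis[k] * multivariate_normal(mus[k], Sigmas[k]).pdf(X)
        for k in range(len(pis))
    ])  # shape (N, K)
    return np.log(per_component.sum(axis=1)).sum()

# Toy data: two blobs; compare a sensible theta against a poor one.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(4.0, 1.0, (100, 2))])
eye2 = np.stack([np.eye(2), np.eye(2)])
good = gmm_log_likelihood(X, np.array([0.5, 0.5]),
                          np.array([[0.0, 0.0], [4.0, 4.0]]), eye2)
bad = gmm_log_likelihood(X, np.array([0.5, 0.5]),
                         np.array([[10.0, 10.0], [-10.0, -10.0]]), eye2)
print(good > bad)  # True: the well-placed means have a much higher log-likelihood
```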

Derivation: Expectation-Maximization (EM) for GMMs

The EM algorithm gets around this problem by using a common trick in machine learning: introduce a new distribution $q_n(\mathbf{z}_n)$ over the hidden variables $\mathbf{\mathsf{z}}_n$ and multiply and divide by it inside the marginalization, $$ \begin{align*} \ell(\boldsymbol{\theta}) & \coloneqq \sum_{n = 1}^N \log p(\mathbf{x}_n \mid \boldsymbol{\theta}) = \sum_{n = 1}^N \log \sum_{\mathbf{z}_n} q_n(\mathbf{z}_n) \frac{p(\mathbf{x}_n, \mathbf{z}_n \mid \boldsymbol{\theta})}{q_n(\mathbf{z}_n)} \newline & = \sum_{n = 1}^N \log \mathbb{E}_{q_n(\mathbf{z}_n)} \left[\frac{p(\mathbf{x}_n, \mathbf{z}_n \mid \boldsymbol{\theta})}{q_n(\mathbf{z}_n)}\right] \newline \end{align*} $$ Using Jensen’s inequality (i.e., $f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]$ for concave $f$, here $f = \log$), we have, $$ \begin{align*} \ell(\boldsymbol{\theta}) & \geq \sum_{n = 1}^N \mathbb{E}_{q_n} \left[\log \frac{p(\mathbf{x}_n, \mathbf{z}_n \mid \boldsymbol{\theta})}{q_n(\mathbf{z}_n)}\right] \newline & = \sum_{n = 1}^N \left(\mathbb{E}_{q_n} [\log p(\mathbf{x}_n, \mathbf{z}_n \mid \boldsymbol{\theta})] - \sum_{\mathbf{z}_n} q_n(\mathbf{z}_n) \log q_n(\mathbf{z}_n)\right) \newline & = \sum_{n = 1}^N \left(\mathbb{E}_{q_n} [\log p(\mathbf{x}_n, \mathbf{z}_n \mid \boldsymbol{\theta})] + \mathsf{H}(q_n)\right) \newline & = \mathcal{L}(\boldsymbol{q}, \boldsymbol{\theta}). \end{align*} $$ $\mathcal{L}(\boldsymbol{q}, \boldsymbol{\theta})$ is a lower bound on $\log p(\mathcal{D} \mid \boldsymbol{\theta})$ (i.e., the evidence lower bound, ELBO).

Thus, we can maximize $\mathcal{L}(\boldsymbol{q}, \boldsymbol{\theta})$ instead of $\ell(\boldsymbol{\theta})$; maximizing the ELBO will force the log-likelihood $\ell(\boldsymbol{\theta})$ to increase as well. The EM algorithm alternates between two steps. Firstly, the E-step. Each term $n$ of $\mathcal{L}(\boldsymbol{q}, \boldsymbol{\theta}^{(\ell - 1)})$ can be written as, $$ \begin{align*} \mathcal{L}_n(q_n, \boldsymbol{\theta}^{(\ell - 1)}) & = \mathbb{E}_{q_n} \left[\log \frac{p(\mathbf{x}_n, \mathbf{z}_n \mid \boldsymbol{\theta}^{(\ell - 1)})}{q_n(\mathbf{z}_n)}\right] \newline & = \sum_{\mathbf{z}_n} q_n(\mathbf{z}_n) \log \frac{{p(\mathbf{x}_n, \mathbf{z}_n \mid \boldsymbol{\theta}^{(\ell - 1)})}}{q_n(\mathbf{z}_n)} \newline & = \sum_{\mathbf{z}_n} q_n(\mathbf{z}_n) \log \frac{p(\mathbf{z}_n \mid \mathbf{x}_n, \boldsymbol{\theta}^{(\ell - 1)}) p(\mathbf{x}_n \mid \boldsymbol{\theta}^{(\ell - 1)})}{q_n(\mathbf{z}_n)} \newline & = \sum_{\mathbf{z}_n} q_n(\mathbf{z}_n) \log \frac{p(\mathbf{z}_n \mid \mathbf{x}_n, \boldsymbol{\theta}^{(\ell - 1)})}{q_n(\mathbf{z}_n)} + \sum_{\mathbf{z}_n} q_n(\mathbf{z}_n) \log p(\mathbf{x}_n \mid \boldsymbol{\theta}^{(\ell - 1)}) \newline & = - \ \mathrm{KL}(q_n(\mathbf{z}_n) \mid \mid p(\mathbf{z}_n \mid \mathbf{x}_n, \boldsymbol{\theta}^{(\ell - 1)})) + \log p(\mathbf{x}_n \mid \boldsymbol{\theta}^{(\ell - 1)}). \end{align*} $$ Thus, the E-step fixes $\boldsymbol{\theta}$ and optimizes $q_n$ → for a fixed $\boldsymbol{\theta}$ $(\boldsymbol{\theta}^{(\ell - 1)})$, we maximize $\mathcal{L}(\boldsymbol{q}, \boldsymbol{\theta}^{(\ell - 1)})$ by setting $q_n^{\star}(\mathbf{z}_n) = p(\mathbf{z}_n \mid \mathbf{x}_n, \boldsymbol{\theta}^{(\ell - 1)})$ for each $n$, which makes each KL term zero and the bound tight.

Secondly, the M-step. Here, we fix $q_n(\mathbf{z}_n)$ and optimize $\boldsymbol{\theta}$, $$ \begin{align*} \mathcal{L}(\boldsymbol{q}^{(\ell)}, \boldsymbol{\theta}) & = \sum_{n = 1}^N \left(\mathbb{E}_{q_n^{(\ell)}} [\log p(\mathbf{x}_n, \mathbf{z}_n \mid \boldsymbol{\theta})] + \mathsf{H}(q_n^{(\ell)})\right) \newline & = \sum_{n = 1}^N \mathbb{E}_{q_n^{(\ell)}} [\log p(\mathbf{x}_n, \mathbf{z}_n \mid \boldsymbol{\theta})] + \text{const.}, \end{align*} $$ where $q_n^{(\ell)}(\mathbf{z}_n) = p(\mathbf{z}_n \mid \mathbf{x}_n, \boldsymbol{\theta}^{(\ell - 1)})$ from the E-step and the entropy terms do not depend on $\boldsymbol{\theta}$. We see that this is the expected complete-data log-likelihood. Thus, $$ \boldsymbol{\theta}^{(\ell)} = \underset{\boldsymbol{\theta}}{\arg\max} \ \sum_{n = 1}^N \mathbb{E}_{q_n^{(\ell)}} [\log p(\mathbf{x}_n, \mathbf{z}_n \mid \boldsymbol{\theta})] $$ Let’s observe what we have derived. EM is based on the observation that solving $\boldsymbol{\theta}^{\star} = \underset{\boldsymbol{\theta}}{\arg\max} \ \log p(\mathcal{D} \mid \boldsymbol{\theta})$ would be easy if we knew the latent variables $\mathbf{z}_1, \ldots, \mathbf{z}_N$ that were implicitly used to generate the data samples $\mathbf{x}_1, \ldots, \mathbf{x}_N$. Thus, we consider the maximization of the complete-data log-likelihood, $$ \ell_c(\boldsymbol{\theta}) \coloneqq \log p(\mathcal{D}, \mathbf{Z} \mid \boldsymbol{\theta}). $$ However, as we do not have values for $\mathbf{Z}$, we consider its expectation, $$ \begin{align*} \mathbb{E}_{\mathbf{\mathsf{Z}} \sim p(\mathbf{Z} \mid \mathcal{D}, \boldsymbol{\theta}^{(\ell - 1)})} [\ell_c(\boldsymbol{\theta})] & = \mathbb{E}_{\mathbf{\mathsf{Z}} \sim p(\mathbf{Z} \mid \mathcal{D}, \boldsymbol{\theta}^{(\ell - 1)})} [\log p(\mathcal{D}, \mathbf{Z} \mid \boldsymbol{\theta})] \newline & = \mathbb{E}_{\mathbf{\mathsf{Z}} \sim p(\mathbf{Z} \mid \mathcal{D}, \boldsymbol{\theta}^{(\ell - 1)})} \left[\sum_{n = 1}^N \log p(\mathbf{x}_n, \mathbf{z}_n \mid \boldsymbol{\theta})\right] \newline & = \sum_{n = 1}^N \mathbb{E}_{\mathbf{\mathsf{z}}_n \sim p(\mathbf{z}_n \mid \mathbf{x}_n, \boldsymbol{\theta}^{(\ell - 1)})} [\log p(\mathbf{x}_n, \mathbf{z}_n \mid \boldsymbol{\theta})] \newline \end{align*} $$ We do not know $\mathbf{Z}$, so we average it out under the current model $\boldsymbol{\theta}^{(\ell - 1)}$. In practice, we define the function, $$ Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(\ell - 1)}) \coloneqq \sum_{n = 1}^N \mathbb{E}_{q_n^{(\ell)}} [\log p(\mathbf{x}_n, \mathbf{z}_n \mid \boldsymbol{\theta})], $$ which (up to the constant entropy terms) equals the ELBO and thus lower bounds the log-likelihood we actually want to maximize.

If we now optimize over $\boldsymbol{\theta}$, we get a better lower bound, $$ \mathcal{L}(\boldsymbol{q}^{(\ell)}, \boldsymbol{\theta}^{(\ell - 1)}) \leq \mathcal{L}(\boldsymbol{q}^{(\ell)}, \boldsymbol{\theta}^{(\ell)}) \leq \ell(\boldsymbol{\theta}^{(\ell)}) = \log p(\mathcal{D} \mid \boldsymbol{\theta}^{(\ell)}) $$ Finally, we can iterate between the E-step and the M-step, and the lower bound (and hence the log-likelihood) will never decrease, until convergence.

Algorithm: Expectation-Maximization (EM) for GMMs
    1. Initialization: Initialize $\boldsymbol{\theta}^{(0)}$.
  • For $\ell = 1, \ldots$:
    1. E-step: Evaluate $p(\mathbf{Z} \mid \mathcal{D}, \boldsymbol{\theta}^{(\ell - 1)})$ and compute, $$ Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(\ell - 1)}) = \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathcal{D}, \boldsymbol{\theta}^{(\ell - 1)}) \log p(\mathcal{D}, \mathbf{Z} \mid \boldsymbol{\theta}) $$
    2. M-step: Solve the optimization problem, $$ \boldsymbol{\theta}^{(\ell)} = \underset{\boldsymbol{\theta}}{\arg\max} \ Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(\ell - 1)}) $$
    3. If the convergence criterion is not satisfied, return to the E-step.
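
For GMMs specifically, the E-step reduces to computing the responsibilities $r_{nk} = p(z_{nk} = 1 \mid \mathbf{x}_n, \boldsymbol{\theta}^{(\ell - 1)})$, and maximizing $Q$ in the M-step gives the standard closed-form updates $\pi_k = N_k / N$, $\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_n r_{nk} \mathbf{x}_n$, and $\boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_n r_{nk} (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^\top$, with $N_k = \sum_n r_{nk}$. The sketch below is a minimal implementation under those standard updates (not the course's reference code); `em_gmm`, the fixed iteration count, and the small covariance regularizer are my own choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """Minimal EM for a Gaussian mixture with the standard closed-form M-step."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialization: uniform weights, random data points as means, identity covariances.
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)].copy()
    Sigmas = np.stack([np.eye(D) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: responsibilities r_{nk} proportional to pi_k * N(x_n | mu_k, Sigma_k).
        r = np.column_stack([
            pis[k] * multivariate_normal(mus[k], Sigmas[k]).pdf(X)
            for k in range(K)
        ])
        r /= r.sum(axis=1, keepdims=True)
        # M-step: closed-form updates of pi, mu, Sigma given the responsibilities.
        Nk = r.sum(axis=0)                  # effective number of points per component
        pis = Nk / N
        mus = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pis, mus, Sigmas

# Toy example: fit a 2-component GMM to two blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(5.0, 1.0, (200, 2))])
pis, mus, Sigmas = em_gmm(X, K=2)
print(pis)  # roughly [0.5, 0.5]
print(mus)  # roughly [0, 0] and [5, 5]
```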