Eigenvectors
Let’s do a quick linear algebra recap.
Assume $\mathbf{A} \in \mathbb{R}^{N \times N}, \mathbf{v} \in \mathbb{C}^{N \times 1}$, and $\lambda \in \mathbb{C}$.
If $\mathbf{A} \mathbf{v} = \lambda \mathbf{v}$, then $\mathbf{v}$ is a right eigenvector of $\mathbf{A}$ with eigenvalue $\lambda$.
If $\mathbf{A}^T \mathbf{v} = \lambda \mathbf{v}$, then $\mathbf{v}$ is a left eigenvector of $\mathbf{A}$ with eigenvalue $\lambda$. (Equivalently, $\mathbf{v}^T \mathbf{A} = \lambda \mathbf{v}^T$.)
If $\mathbf{A}$ is symmetric, i.e., $\mathbf{A} = \mathbf{A}^T$, then its left and right eigenvectors coincide and share the same eigenvalues.
Example
$$ \begin{bmatrix} 2 & 1 \newline 1 & 2 \end{bmatrix} \begin{bmatrix} 1 \newline 1 \end{bmatrix} = \begin{bmatrix} 3 \newline 3 \end{bmatrix} = 3 \begin{bmatrix} 1 \newline 1 \end{bmatrix} $$
$$ \begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} 2 & 1 \newline 1 & 2 \end{bmatrix} = 3 \begin{bmatrix} 1 & 1 \end{bmatrix} $$
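As a quick numerical check of this example (a minimal NumPy sketch; the matrix and vector are the ones above):

```python
import numpy as np

# The symmetric matrix and the candidate eigenvector from the example above.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
v = np.array([1.0, 1.0])

print(A @ v)  # [3. 3.] -> A v = 3 v, so v is a right eigenvector with eigenvalue 3
print(v @ A)  # [3. 3.] -> v^T A = 3 v^T, so v is also a left eigenvector (A is symmetric)

# np.linalg.eig recovers both eigenvalues of A, 3 and 1 (ordering may vary).
eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)
```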
Linear Independence
Linear independence is arguably the most important concept in linear algebra.
A set of vectors $\{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_N\}$ is linearly independent if the vector equation,
$$ a_1 \mathbf{v}_1 + a_2 \mathbf{v}_2 + \ldots + a_N \mathbf{v}_N = \mathbf{0}, $$
has only the trivial solution $a_1 = a_2 = \ldots = a_N = 0$.
$\{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_N\}$ is linearly dependent if there exist numbers $a_1, a_2, \ldots, a_N$, not all equal to zero, such that,
$$ a_1 \mathbf{v}_1 + a_2 \mathbf{v}_2 + \ldots + a_N \mathbf{v}_N = \mathbf{0}. $$
Assuming $a_1 \neq 0$, we have $\mathbf{v}_1 = -\frac{a_2}{a_1} \mathbf{v}_2 - \ldots - \frac{a_N}{a_1} \mathbf{v}_N$, i.e., $\mathbf{v}_1$ is a linear combination of the remaining vectors.
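As a small illustration (a sketch of my own; the vectors are arbitrary), linear independence can be checked numerically by stacking the vectors as columns and computing the matrix rank:

```python
import numpy as np

# Columns of each matrix are the vectors v_1, v_2, v_3.
V_indep = np.array([[1.0, 0.0, 1.0],
                    [0.0, 1.0, 1.0],
                    [0.0, 0.0, 1.0]])
V_dep = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0],
                  [0.0, 0.0, 0.0]])  # third column = first column + second column

# Full column rank <=> the only solution of a_1 v_1 + a_2 v_2 + a_3 v_3 = 0 is a = 0.
print(np.linalg.matrix_rank(V_indep))  # 3 -> linearly independent
print(np.linalg.matrix_rank(V_dep))    # 2 -> linearly dependent
```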
Some Matrices
An $N \times N$ square matrix $\mathbf{A}$ is invertible if there exists an $N \times N$ square matrix $\mathbf{B}$ such that,
$$ \mathbf{A} \mathbf{B} = \mathbf{B} \mathbf{A} = \mathbf{I}_N. $$
Equivalently, the columns/rows of $\mathbf{A}$ are linearly independent.
A square matrix $\mathbf{Q}$ is orthogonal if its columns and rows are orthogonal unit vectors (orthonormal vectors),
$$ \mathbf{Q} \mathbf{Q}^T = \mathbf{Q}^T \mathbf{Q} = \mathbf{I}. $$
A square matrix $\mathbf{A}$ is diagonalizable if there exists an invertible matrix $\mathbf{P}$ and a diagonal matrix $\mathbf{D}$ such that,
$$ \mathbf{P}^{-1} \mathbf{A} \mathbf{P} = \mathbf{D}, $$
or equivalently,
$$ \mathbf{A} = \mathbf{P} \mathbf{D} \mathbf{P}^{-1}. $$
Real symmetric matrices are diagonalizable by orthogonal matrices; this is guaranteed by the spectral theorem.
Eigendecomposition
Let $\mathbf{V} \in \mathbb{R}^{N \times N}$ be a matrix whose columns $\mathbf{v}_i$ are $N$ linearly independent eigenvectors of $\mathbf{A}$, and let $\mathbf{\Lambda}$ be the corresponding diagonal matrix of eigenvalues, such that $\Lambda_{ii} = \lambda_i$.
Then,
$$ \begin{align*} \mathbf{A} \mathbf{V} & = \mathbf{V} \mathbf{\Lambda} \newline \mathbf{A} & = \mathbf{V} \mathbf{\Lambda} \mathbf{V}^{-1} \newline \mathbf{V}^{-1} \mathbf{A} \mathbf{V} & = \mathbf{\Lambda} \end{align*} $$
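These three equivalent identities can be verified numerically (a minimal sketch; the example matrix is arbitrary but has distinct eigenvalues, hence linearly independent eigenvectors):

```python
import numpy as np

# An arbitrary matrix with distinct eigenvalues (5 and 2), hence diagonalizable.
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigvals, V = np.linalg.eig(A)  # columns of V are the eigenvectors v_i
Lam = np.diag(eigvals)         # Lambda with Lambda_ii = lambda_i

print(np.allclose(A @ V, V @ Lam))                 # A V = V Lambda
print(np.allclose(A, V @ Lam @ np.linalg.inv(V)))  # A = V Lambda V^{-1}
print(np.allclose(np.linalg.inv(V) @ A @ V, Lam))  # V^{-1} A V = Lambda
```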
Eigendecomposition of Symmetric Matrix
If $\mathbf{A}$ is real symmetric, we can choose $N$ orthonormal eigenvectors, so that $\Vert \mathbf{v}_i \Vert_2^2 = 1$ and $\mathbf{v}_i^T \mathbf{v}_j = 0$ for $i \neq j$, along with $N$ real eigenvalues $\lambda_i \in \mathbb{R}$.
As a result, we have,
$$ \begin{align*} \mathbf{A} & = \mathbf{V} \mathbf{\Lambda} \mathbf{V}^T = \sum_{i = 1}^N \lambda_i \mathbf{v_i} \mathbf{v_i}^T \newline \mathbf{V}^T \mathbf{A} \mathbf{V} & = \mathbf{\Lambda} \end{align*} $$
Principal Component Analysis
Principal Component Analysis (PCA) is an unsupervised method. Given a data matrix $\mathbf{X} \in \mathbb{R}^{M \times N}$ whose rows are the data points, the goal of PCA is to identify the directions of maximum variance contained in the data.
We choose basis vectors that point along the directions of maximum variance (longest extent) of the data; these basis vectors are called principal components (PCs).
Sample Variance in a Given Direction
Let $\mathbf{v} \in \mathbb{R}^N$ such that $\Vert \mathbf{v} \Vert_2^2 = \mathbf{v}^T \mathbf{v} = 1$.
The variance in the direction $\mathbf{v}$ is given by the expression,
$$ \frac{1}{M} \sum_{i = 1}^M (\mathbf{v}^T \mathbf{x}^{(i)} - \mu)^2, $$
where $\mu = \frac{1}{M} \sum_{i = 1}^M \mathbf{v}^T \mathbf{x}^{(i)}$ is the mean of the projected data.
Pre-Centering
Under the assumption that the data are pre-centered, i.e., $\frac{1}{M} \sum_{i = 1}^M \mathbf{x}^{(i)} = \mathbf{0}$ (so that the projected mean $\mu = \frac{1}{M} \sum_{i = 1}^M \mathbf{v}^T \mathbf{x}^{(i)} = 0$), this expression simplifies to,
$$ \begin{align*} \frac{1}{M} \sum_{i = 1}^M (\mathbf{v}^T \mathbf{x}^{(i)})^2 & = \frac{1}{M} \sum_{i = 1}^M \left(\mathbf{v}^T \mathbf{x}^{(i)}\right) \left(\mathbf{v}^T \mathbf{x}^{(i)}\right)^T \newline & = \frac{1}{M} \sum_{i = 1}^M \left(\mathbf{v}^T \mathbf{x}^{(i)}\right) \left((\mathbf{x}^{(i)})^T \mathbf{v}\right) \newline & = \frac{1}{M} \mathbf{v}^T \left(\sum_{i = 1}^M \mathbf{x}^{(i)} (\mathbf{x}^{(i)})^T\right) \mathbf{v} \newline & = \frac{1}{M} \mathbf{v}^T \mathbf{X}^T \mathbf{X} \mathbf{v}. \end{align*} $$
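The final identity can be confirmed on synthetic centered data (a minimal sketch; the data matrix and the direction $\mathbf{v}$ are randomly generated):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 500, 3
X = rng.normal(size=(M, N))
X = X - X.mean(axis=0)      # pre-center the data (rows are examples)

v = rng.normal(size=N)
v = v / np.linalg.norm(v)   # unit-norm direction

# Direct sample variance of the projections v^T x^(i).
var_direct = np.mean((X @ v) ** 2)

# Quadratic-form expression (1/M) v^T X^T X v.
var_quadratic = v @ (X.T @ X) @ v / M

print(np.allclose(var_direct, var_quadratic))  # True
```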
The Direction of Maximum Variance
Suppose we want to identify the direction $\mathbf{v}_1$ of maximum variance given the data matrix $\mathbf{X}$.
We can formulate this optimization problem as follows,
$$ \begin{align*} \underset{\mathbf{v}}{\max} & \quad \frac{1}{M} \mathbf{v}^T \mathbf{X}^T \mathbf{X} \mathbf{v} \newline \text{subject to} & \quad \Vert \mathbf{v} \Vert_2 = 1. \end{align*} $$
Letting $\mathbf{\Sigma} = \frac{1}{M} \mathbf{X}^T \mathbf{X}$, we form the Lagrangian,
$$ L(\mathbf{v}, \nu) = \mathbf{v}^T \mathbf{\Sigma} \mathbf{v} + \nu (1 - \Vert \mathbf{v} \Vert_2^2). $$
Taking the derivative of $L(\mathbf{v}, \nu)$ with respect to $\mathbf{v}$ and setting it to zero, we have,
$$ \begin{align*} \frac{\partial L(\mathbf{v}, \nu)}{\partial \mathbf{v}} & = 2 \mathbf{\Sigma} \mathbf{v} - 2 \nu \mathbf{v} = 0 \newline \mathbf{\Sigma} \mathbf{v} & = \nu \mathbf{v}. \end{align*} $$
Since $\mathbf{v} \neq \mathbf{0}$, $\mathbf{v}$ must be an eigenvector of $\mathbf{\Sigma}$ with eigenvalue $\nu$.
Assuming $\{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_N\}$ are the eigenvectors of $\mathbf{\Sigma}$, corresponding to eigenvalues $\sigma_1 \geq \ldots \geq \sigma_N$, respectively, we have,
$$ \begin{align*} \mathbf{v}^{\star} & = \mathbf{v}_1 \newline p^{\star} & = \mathbf{v}_1^T \mathbf{\Sigma} \mathbf{v}_1 = \mathbf{v}_1^T \sigma_1 \mathbf{v}_1 = \sigma_1 \mathbf{v}_1^T \mathbf{v}_1 = \sigma_1, \end{align*} $$
where $p^{\star}$ denotes the optimal value of the objective.
$K$ Largest Directions of Variance
Suppose instead of just the direction of maximum variance, we want the $K$ largest directions of variance that are all mutually orthogonal.
Finding the second-largest direction of variance corresponds to solving the problem,
$$ \begin{align*} \underset{\mathbf{v}}{\max} & \quad \frac{1}{M} \mathbf{v}^T \mathbf{X}^T \mathbf{X} \mathbf{v} \newline \text{subject to} & \quad \Vert \mathbf{v} \Vert_2 = 1, \newline & \quad \mathbf{v}^T \mathbf{v}_1 = 0. \end{align*} $$
We form the Lagrangian,
$$ L(\mathbf{v}, \nu, \lambda) = \mathbf{v}^T \mathbf{\Sigma} \mathbf{v} + \nu (1 - \Vert \mathbf{v} \Vert_2^2) + \lambda \mathbf{v}^T \mathbf{v}_1. $$
Taking the derivative of $L(\mathbf{v}, \nu, \lambda)$ with respect to $\mathbf{v}$ and setting it to zero, we have,
$$ \frac{\partial L(\mathbf{v}, \nu, \lambda)}{\partial \mathbf{v}} = 2 \mathbf{\Sigma} \mathbf{v} - 2 \nu \mathbf{v} + \lambda \mathbf{v}_1 = 0. $$
If we left-multiply both sides by $\mathbf{v}_1^T$, we have,
$$ \begin{align*} 2 \mathbf{v}_1^T \mathbf{\Sigma} \mathbf{v} - 2 \nu \mathbf{v}_1^T \mathbf{v} + \lambda \mathbf{v}_1^T \mathbf{v}_1 & = 0 \newline 2 (\mathbf{\Sigma} \mathbf{v}_1)^T \mathbf{v} - 0 + \lambda & = 0 \newline 2 \sigma_1 \mathbf{v}_1^T \mathbf{v} - 0 + \lambda & = 0 \newline \lambda & = 0, \end{align*} $$
where we used the symmetry of $\mathbf{\Sigma}$, the eigenvalue relation $\mathbf{\Sigma} \mathbf{v}_1 = \sigma_1 \mathbf{v}_1$, and the constraint $\mathbf{v}_1^T \mathbf{v} = 0$.
Therefore, we arrive at the eigenvalue equation again,
$$ \mathbf{\Sigma} \mathbf{v} = \nu \mathbf{v}. $$
Since $\mathbf{v}$ must again be an eigenvector of $\mathbf{\Sigma}$, and it must be orthogonal to $\mathbf{v}_1$, the maximum is attained at $\mathbf{v}^{\star} = \mathbf{v}_2$, the eigenvector corresponding to the second-largest eigenvalue $\sigma_2$.
In general, the top $K$ directions of variance $\mathbf{v}_1, \ldots, \mathbf{v}_K$ are given by the $K$ eigenvectors corresponding to the $K$ largest eigenvalues of $\frac{1}{M} \mathbf{X}^T \mathbf{X}$.
PCA can also be derived by picking the principal vectors that minimize the approximation error arising from projecting the data onto the $K$-dimensional subspace spanned by these vectors,
$$ \underset{\mathbf{V}}{\min} \ \frac{1}{M} \sum_{i = 1}^M \Vert \mathbf{x}^{(i)} - \mathbf{V} \mathbf{V}^T \mathbf{x}^{(i)} \Vert_2^2, $$
where $\mathbf{V} \in \mathbb{R}^{N \times K}$ has the principal vectors as its orthonormal columns, so that $\mathbf{V} \mathbf{V}^T \mathbf{x}^{(i)}$ is the projection of $\mathbf{x}^{(i)}$ onto the subspace.
Dimensionality Reduction with PCA
The informal algorithm can be described as follows,
- Subtract the mean of the data.
- The first PC $\mathbf{v}_1$ is the direction that explains the most variance of the data.
- The second PC $\mathbf{v}_2$ is the direction perpendicular to $\mathbf{v}_1$ that explains the most variance.
- The third PC $\mathbf{v}_3$ is the direction perpendicular to $\{\mathbf{v}_1, \mathbf{v}_2\}$ that explains the most variance.
- $\ldots$
The formal algorithm can be described as follows,
- Data preprocessing: Compute $\mathbf{\mu} = \frac{1}{M} \sum_i \mathbf{x}^{(i)}$ and replace each $\mathbf{x}^{(i)}$ with $\mathbf{x}^{(i)} - \mathbf{\mu}$.
- Given pre-processed data matrix $\mathbf{X} \in \mathbb{R}^{M \times N}$, compute the sample covariance matrix $\mathbf{\Sigma} = \frac{1}{M} \mathbf{X}^T \mathbf{X}$.
- Compute the $K$ leading eigenvectors $\mathbf{v}_1, \ldots, \mathbf{v}_K$ of $\mathbf{\Sigma}$ where $\mathbf{v}_i \in \mathbb{R}^N$.
- Stack the eigenvectors together into an $N \times K$ matrix $\mathbf{V}$ where each column $i$ of $\mathbf{V}$ corresponds to $\mathbf{v}_i$.
- Project the matrix $\mathbf{X}$ into the rank-$K$ subspace of maximum variance by computing the matrix product $\mathbf{Z} = \mathbf{X} \mathbf{V}$.
- To reconstruct $\mathbf{X}$ given $\mathbf{Z}$ and $\mathbf{V}$, we use $\mathbf{\hat{X}} = \mathbf{Z} \mathbf{V}^T$ (adding $\mathbf{\mu}$ back if the reconstruction is needed in the original, uncentered coordinates).
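A direct translation of these steps into NumPy might look as follows (a sketch under the conventions above, with rows of $\mathbf{X}$ as examples; the function names are my own, and the reconstruction adds the mean back to return to the original coordinates):

```python
import numpy as np

def pca_fit(X, K):
    """Return the data mean, the N x K matrix of leading eigenvectors, and their eigenvalues."""
    mu = X.mean(axis=0)                       # data preprocessing: compute the mean
    Xc = X - mu                               # center the data
    Sigma = Xc.T @ Xc / X.shape[0]            # sample covariance, (1/M) X^T X
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # ascending eigenvalues (Sigma is symmetric)
    order = np.argsort(eigvals)[::-1][:K]     # indices of the K largest eigenvalues
    return mu, eigvecs[:, order], eigvals[order]

def pca_project(X, mu, V):
    return (X - mu) @ V                       # Z = X V (rank-K projection)

def pca_reconstruct(Z, mu, V):
    return Z @ V.T + mu                       # X_hat = Z V^T, plus the mean to undo centering

# Example usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
mu, V, lam = pca_fit(X, K=2)
Z = pca_project(X, mu, V)
X_hat = pca_reconstruct(Z, mu, V)
print(Z.shape, X_hat.shape)                   # (200, 2) (200, 5)
```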
How to Choose the Number of PCs?
We have two methods to set the number of components $K$.
- Preserve some percentage of the variance, e.g., 95%.
- Whatever works well for our final task (e.g., classification, regression).
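The first criterion can be implemented by accumulating eigenvalues until the desired fraction of the total variance is retained (a small sketch; `choose_k` is my own helper name and assumes the eigenvalues are sorted in descending order):

```python
import numpy as np

def choose_k(eigvals, threshold=0.95):
    """Smallest K whose leading eigenvalues retain `threshold` of the total variance."""
    ratio = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.argmax(ratio >= threshold)) + 1

# Example: the first four components retain 95% of the variance.
print(choose_k(np.array([5.0, 3.0, 1.0, 0.5, 0.5]), threshold=0.95))  # 4
```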
Connection to SVD
We have seen that the minimum Frobenius norm linear dimensionality reduction problem could be solved using the rank-$K$ SVD of $\mathbf{X}$,
$$ \begin{align*} \underset{\mathbf{U}, \mathbf{S}, \mathbf{V}}{\arg \min} & \quad \Vert \mathbf{X} - \mathbf{U} \mathbf{S} \mathbf{V}^T \Vert_F^2, \end{align*} $$
where $\mathbf{U} \in \mathbb{R}^{M \times K}$, $\mathbf{S} \in \mathbb{R}^{K \times K}$, and $\mathbf{V} \in \mathbb{R}^{N \times K}$.
The matrix $\mathbf{Z} = \mathbf{U} \mathbf{S}$ gives the optimal rank-$K$ representation of $\mathbf{X}$ with respect to Frobenius norm minimization.
If we let $K = N$ then $\mathbf{X} = \mathbf{U} \mathbf{S} \mathbf{V}^T$ and $\mathbf{X}^T \mathbf{X} = \mathbf{V} \mathbf{S}^T \mathbf{U}^T \mathbf{U} \mathbf{S} \mathbf{V}^T$.
Due to the orthonormality of the columns of $\mathbf{U}$, i.e., $\mathbf{U}^T \mathbf{U} = \mathbf{I}$, we get $\mathbf{X}^T \mathbf{X} = \mathbf{V} \mathbf{S}^2 \mathbf{V}^T$.
This means that the right singular vectors of $\mathbf{X}$ are exactly the eigenvectors of $\mathbf{X}^T \mathbf{X}$.
We can also see that the eigenvalues of $\mathbf{X}^T \mathbf{X}$ are the squares of the diagonal elements of $\mathbf{S}$.
This means that the $K$ largest singular values and $K$ largest eigenvalues correspond to the same $K$ basis vectors.
According to PCA, the projection operation is $\mathbf{Z} = \mathbf{X} \mathbf{V}$, therefore,
$$ \mathbf{Z} = \mathbf{X} \mathbf{V} = (\mathbf{U} \mathbf{S} \mathbf{V}^T)(\mathbf{V}) = \mathbf{U} \mathbf{S}. $$
Finally, note that if the decompositions are based only on the $K$ leading principal vectors, the projections $\mathbf{Z} = \mathbf{X} \mathbf{V}$ and $\mathbf{Z} = \mathbf{U} \mathbf{S}$ are still identical.
These manipulations show that PCA on $\mathbf{X}^T \mathbf{X}$ and SVD on $\mathbf{X}$ identify exactly the same subspace and result in exactly the same projection of the data into that subspace.
As a result, generic linear dimensionality reduction simultaneously minimizes the Frobenius norm of the reconstruction error of $\mathbf{X}$ and maximizes the retained variance in the learned subspace.
Both SVD and PCA provide the same description of generic linear dimensionality reduction, an orthogonal basis for exactly the same optimal linear subspace.
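This equivalence can be checked numerically (a small sketch on synthetic, centered data; signs of individual components may flip between the two routes, so the comparison uses absolute values):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X = X - X.mean(axis=0)                      # centered data matrix

# PCA route: eigenvectors of X^T X.
eigvals, V_pca = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]
eigvals, V_pca = eigvals[order], V_pca[:, order]

# SVD route: right singular vectors of X.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

print(np.allclose(eigvals, S ** 2))         # eigenvalues of X^T X = squared singular values

Z_pca = X @ V_pca                           # PCA projection Z = X V
Z_svd = U * S                               # SVD projection Z = U S (column-wise scaling)
print(np.allclose(np.abs(Z_pca), np.abs(Z_svd)))  # identical up to per-component sign
```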
When Does PCA Fail?
The primary motivation behind PCA is to decorrelate the dataset, i.e., remove second-order dependencies.
If higher-order dependencies exist between the features in the data, PCA may be insufficient to reveal all of the structure in the data.
PCA requires each component to be perpendicular to the previous ones; this requirement can be overly stringent, since the data might be arranged along non-orthogonal axes.
Kernel Principal Component Analysis
Kernel PCA (KPCA) is a non-linear extension of PCA that can be used to extract non-linear structure in the data.
Limitations of Linear Dimensionality Reduction
What if the data “lives” on a non-flat surface?
PCA cannot capture the curvature of the data.
As usual, feature mapping can be used to overcome this.
Feature Mapping
If we apply a high-dimensional feature transformation to the data $\mathbf{x}^{(i)} \rightarrow \phi(\mathbf{x}^{(i)})$, we can project the high-dimensional data to a linear surface, i.e., run PCA on $\phi(\mathbf{X})$. In the original space, the projection will be non-linear.
Feature Mapping + SVD
Given a data set $\mathbf{X} \in \mathbb{R}^{M \times N}$ and a feature mapping function $\phi : \mathbb{R}^N \mapsto \mathbb{R}^L$ for $L > N$, we obtain the following SVD-based algorithm,
- Compute $\mathbf{U}, \mathbf{S}, \mathbf{V} = \text{SVD}(\phi(\mathbf{X}))$.
- Return $\mathbf{Z} = \mathbf{U} \mathbf{S}$.
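As a concrete instance (a sketch with an explicit quadratic feature map of my own choosing; the mapped data are centered first, consistent with the pre-centering assumption used throughout):

```python
import numpy as np

def phi(X):
    """Explicit quadratic feature map: [x1, x2, x1^2, x2^2, x1*x2]."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 ** 2, x2 ** 2, x1 * x2])

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))

Phi = phi(X)
Phi = Phi - Phi.mean(axis=0)              # center in the feature space
U, S, Vt = np.linalg.svd(Phi, full_matrices=False)
Z = U * S                                 # non-linear embedding of the data
print(Z.shape)                            # (50, 5)
```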
Feature Mapping + PCA
Given a data set $\mathbf{X} \in \mathbb{R}^{M \times N}$ and a feature mapping function $\phi : \mathbb{R}^N \mapsto \mathbb{R}^L$ for $L > N$, we obtain the following PCA-based algorithm,
- Compute $\mathbf{\Sigma} = \frac{1}{M} \sum_i (\phi(\mathbf{x}^{(i)}) - \mathbf{\mu}) (\phi(\mathbf{x}^{(i)}) - \mathbf{\mu})^T$, where $\mathbf{\mu} = \frac{1}{M} \sum_i \phi(\mathbf{x}^{(i)})$.
- Compute the $K$ leading eigenvectors $\mathbf{v}_1, \ldots, \mathbf{v}_K$ of $\mathbf{\Sigma}$ where $\mathbf{v}_j \in \mathbb{R}^L$ for $j = 1, \ldots, K$.
- Stack the eigenvectors together to form $\mathbf{V} = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_K]$, where $\mathbf{V} \in \mathbb{R}^{L \times K}$.
- Project the matrix $\phi(\mathbf{X})$ into the rank-$K$ subspace of maximum variance by computing the matrix product $\mathbf{Z} = \phi(\mathbf{X}) \mathbf{V}$.
Kernel PCA
As in classification, it becomes very expensive to use an explicit feature function to map data into a high-dimensional space.
In the basic SVD-based algorithm, there is no way to avoid this problem.
However, in the PCA-based algorithm, we are able to take advantage of the kernel trick,
$$ \mathcal{K}(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = \phi(\mathbf{x}^{(i)})^T \phi(\mathbf{x}^{(j)}). $$
Given $\phi : \mathbb{R}^N \mapsto \mathbb{R}^L$, we can compute the covariance matrix in the new feature space,
$$ \mathbf{\Sigma} = \frac{1}{M} \sum_{i = 1}^M \phi(\mathbf{x}^{(i)}) \phi(\mathbf{x}^{(i)})^T. $$
Note that we have assumed that the data is pre-centered in our new feature space. If this turns out to be false, we can center the data by subtracting the mean of the data in the new feature space.
The eigendecomposition of $\mathbf{\Sigma}$ is given by,
$$ \mathbf{\Sigma} \mathbf{v}_k = \frac{1}{M} \sum_{i = 1}^M \phi(\mathbf{x}^{(i)}) \phi(\mathbf{x}^{(i)})^T \mathbf{v}_k = \lambda_k \mathbf{v}_k, \quad \forall k = 1, \ldots, L. $$
It is not hard to see that $\mathbf{v}_k$ can be expressed as,
$$ \mathbf{v_k} = \sum_{j = 1}^M w_k^{(j)} \phi(\mathbf{x}^{(j)}), $$
where $w_k^{(j)} = \frac{1}{M \lambda_k} \phi(\mathbf{x}^{(j)})^T \mathbf{v}_k$.
So each eigenvector $\mathbf{v}_k$ is a linear combination of the mapped data points $\phi(\mathbf{x}^{(j)})$, with weights $w_k^{(j)}$ to be determined.
If we left-multiply both sides by $\phi(\mathbf{x}^{(i)})^T$, we have,
$$ \phi(\mathbf{x}^{(i)})^T \mathbf{v_k} = \sum_{j = 1}^M w_k^{(j)} \phi(\mathbf{x}^{(i)})^T \phi(\mathbf{x}^{(j)}) = M \lambda_k w_k^{(i)}. $$
Define the kernel matrix $\mathbf{K} \in \mathbb{R}^{M \times M}$ with entries $K_{ij} = \mathcal{K}(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = \phi(\mathbf{x}^{(i)})^T \phi(\mathbf{x}^{(j)})$.
Then,
$$ \sum_{j = 1}^M K_{ij} w_k^{(j)} = M \lambda_k w_k^{(i)}. $$
If we consider $i = 1, \ldots, M$, the above scalar equation becomes the $i$-th component of the following vector equation, $$ \mathbf{K} \mathbf{w}_k = M \lambda_k \mathbf{w}_k, $$
where $\mathbf{w}_k = [w_k^{(1)}, w_k^{(2)}, \ldots, w_k^{(M)}]^T$ is the $k$-th eigenvector of $\mathbf{K}$.
$M \lambda_k$ is the eigenvalue of $\mathbf{K}$, which is proportional to the eigenvalue $\lambda_k$ of the covariance matrix $\mathbf{\Sigma}$ in the feature space.
Therefore, performing PCA on $\mathbf{\Sigma}$ is equivalent to computing the eigendecomposition of $\mathbf{K}$.
For a new point $\mathbf{x}^{\star}$, the $k$-th kernel PC can be obtained by projecting $\phi(\mathbf{x}^{\star})$ onto the $k$-th eigenvector $\mathbf{v}_k$ of $\mathbf{\Sigma}$,
$$ \phi(\mathbf{x}^{\star})^T \mathbf{v}_k = \sum_{i = 1}^M w_k^{(i)} \phi(\mathbf{x}^{\star})^T \phi(\mathbf{x}^{(i)}) = \sum_{i = 1}^M w_k^{(i)} \mathcal{K}(\mathbf{x}^{\star}, \mathbf{x}^{(i)}). $$
Kernel PCA Algorithm
Given a data set $\mathbf{X} \in \mathbb{R}^{M \times N}$ and a kernel function $\mathcal{K}$, kernel PCA can be computed as follows,
- Compute $K_{ij} = \mathcal{K}(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$ for all $i, j$.
- Compute $\mathbf{K}^{\prime} = (\mathbf{I} - \mathbf{1}_M) \mathbf{K} (\mathbf{I} - \mathbf{1}_M)$, where $\mathbf{1}_M$ is an $M \times M$ matrix in which every entry is $\frac{1}{M}$. This zero-centers the data points in the feature space.
- Compute the $K$ leading eigenvectors $\mathbf{w}_1, \ldots, \mathbf{w}_K$ of $\mathbf{K}^{\prime}$ along with their eigenvalues $M \lambda_1, \ldots, M \lambda_K$.
- For a data point $\mathbf{x}$, compute the $k$-th component of the projected vector $\mathbf{z} \in \mathbb{R}^{K \times 1}$,
$$ z_k = \sum_{i = 1}^M w_k^{(i)} \mathcal{K}(\mathbf{x}, \mathbf{x}^{(i)}). $$
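A direct implementation of these steps with an RBF kernel might look as follows (a sketch; the kernel width `gamma` and the function names are my own choices, and the eigenvectors $\mathbf{w}_k$ are rescaled so that the implicit feature-space directions $\mathbf{v}_k$ have unit norm, a convention the notes above leave unspecified):

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def kernel_pca(X, n_components, gamma=1.0):
    M = X.shape[0]
    K = rbf_kernel(X, X, gamma)                        # step 1: kernel matrix
    C = np.eye(M) - np.full((M, M), 1.0 / M)
    Kc = C @ K @ C                                     # step 2: center in feature space
    eigvals, W = np.linalg.eigh(Kc)                    # step 3: eigendecomposition
    order = np.argsort(eigvals)[::-1][:n_components]   # leading eigenvalues M * lambda_k
    eigvals, W = eigvals[order], W[:, order]
    W = W / np.sqrt(eigvals)                           # rescale so ||v_k|| = 1 in feature space
    Z = Kc @ W                                         # step 4: projections of the training data
    return Z, W

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
Z, W = kernel_pca(X, n_components=2, gamma=0.5)
print(Z.shape)                                         # (100, 2)
```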
Summary
Kernel PCA uses the kernel trick to perform PCA in a high-dimensional feature space.
- Coefficients are based on a non-linear projection of the data.
- The type of projection is based on the kernel function selected.
Using an RBF kernel, KPCA can split the data into clusters.
Kernel PCA can provide an effective pre-processing step for clustering methods as well as linear classification and regression methods.
However, exact computation of kernel PCA can be expensive because the size of the matrix to be decomposed is $M \times M$.