Last time we discussed the clustering objective and had a look at two approaches, namely K-means and Gaussian Mixture Models (GMMs).
In this part, we will continue our discussion on unsupervised learning by exploring dimensionality reduction and introducing Principal Component Analysis (PCA) as a popular method for this task.
Dimensionality Reduction
Intuition (Dimensionality Reduction)
The goal of dimensionality reduction is to learn a mapping from the high-dimensional (visible) space x∈Rd to a low-dimensional (latent) space z∈Rm with m≪d.
Principal Component Analysis (PCA)
Intuition (Principal Component Analysis (PCA))
PCA finds a linear (and orthogonal, but more formally on this later) projection of high-dimensional data x∈Rd to a low-dimensional subspace z∈Rm such that z is a good approximation of the data (i..e, retains as much valuable information as possible).
PCA accomplishes this by finding the directions (principal components) that capture the maximum variance in the data.
The intuition is that, our complex data may appear high-dimensional, but there may only be a few degrees of variability (latent factors) that explain most of the variability in the data.
For example, in face recognition, there are only a few latent factors that are able to describe most of the variability (lightning, pose, identity).
Derivation (Principal Component Analysis (PCA))
Thus, we understand that a good approximation of the data → reconstruction error made by projection is minimum, which means,
We project x onto z.
We want our approximation x^ (arising from the projection onto z) to be as close as possible to the original data x.
Distortion measure: Thus, we can take the squared distance between our original data point xn and its (approximate) reconstruction x^, averaged over all data points,
minL:=N1i=1∑N∥xi−x^i∥2
But, how can we find a linear subspace that minimizes the reconstruction error?
Firstly, we have to choose a 1-dimensional subspace, which can be specified by a unit vector u.
Secondly, we can project 2-dimensional data points onto it by z=uTu.
Lastly, we can coordinate in the original space that represents the projection onto u by (uTx)u.
Thus, we can write the reconstruction error as,
L(u)=N1i=1∑N∥xi−x^i∥2,x^i=(uTxi)u
Now, we can investigate a few things; Firstly, let d be the distance of x from the origin.
Further, let r and v be the distance of x from x^ in the subspace and distance of x^ from the origin, respectively.
Using the Pythagorean theorem, we have,
r2+v2=d2,
r2 contributes to the reconstruction error, while v2 contributes to the variance of the projected data within the subspace.
This means that, minimizing the reconstruction error is equivalent to maximizing the variance of the projected data!
Note
However, note that this can be a double-edged sword. PCA can be “misled” by directions in which variance is high because of the measurement scale → Always standardize the data first.
Definition 1 (Principal Component Analysis (PCA))
Formally, principal components are a sequence of m (unit) vectors where the i-th vector is the direction of a line that best fits the data while being orthogonal to the first i−1 vectors.
Thus, the best-fitting line is equivalent with the direction along which the data shows maximum variance.
1st PC: Direction in space along which projections have largest variance.
2nd PC: Direction maximizing variance among all directions orthogonal to 1st PC.
k-th PC: Variance-maximizing direction orthogonal to previous k−1 components.
Derivation (Reconstruction Error and Variance)
Let {xi∈Rd} be the high-dimensional observations. Further, let uj∈Rd and zi=u1Txi be the j-th principal direction and the projection of xi onto uj, respectively.
Then, we can consider the best 1-dimensional space u1 as,
Which means that for the first principal component,
u1⋆=u1argminL(u1)=u1argmaxVar(z)
We can use vector-matrix notation by introducing the empirical covariance matrix S defined as,
S:=N1i=1∑NxixiT
Thus, our objective is to choose a direction that maximizes variance → Need to maximize variance u1TSu1 with respect to u1.
However, we need to prevent ∥u1∥→∞, we can add the constraint that ∥u∥=u1Tu=1,
u1⋆=u1argmaxu1TSu1+λ1(1−u1Tu1)
where λ1 is a Lagrange multiplier.
By taking the derivative with respect to u1 and setting it to zero, we have,
Su1=λ1u1
which corresponds to the eigenvector equation of the matrix S.
Recall (Eigenvalues and Eigenvectors)
Let A be a (real-valued) n×n matrix. Then λ∈R and a nonzero vector v are said to be an eigenvalue and corresponding eigenvector of A if,
Av=λv
The linear transformation of A on the eigenvector v only changes its magnitude by the factor λ (i.e., scaling) but not its direction.
A n×n matrix A with n linearly independent eigenvectors vi can be factorized as,
A=VΛV−1
where V=(v1,…,vn) is the n×n matrix whose columns are the eigenvectors of A and Λ=diag(λ1,…,λn) is the diagonal matrix whose entries are the corresponding eigenvalues.
Thus, u1 is an eigenvector of S. By left-multiplying the equation above by u1T and using the fact that u1Tu1=1, we have,
u1TSu1=λ1u1Tu1=λ1
Thus, Var[z1≈u1TSu1 is maximized when we set u1 equal to the eigenvector with the largest eigenvalue λ1.
We can define additional principal components in an incremental fashion by choosing each new direction to be that which maximizes the projected variance amongst all possible directions orthogonal to those already considered.
The optimal projection onto a m-dimensional space is defined by the m eigenvectors u1,…,um of the data covariance matrix S corresponding to the m largest eigenvalues λ1≥λ2≥…≥λm.
Probabilistic PCA
Intuition (Probabilistic PCA)
Probabilistic PCA (PPCA) is a probabilistic reformulation of PCA that models the data generation process using a latent variable model.
It introduces an explicit m-dimensional latent variable z corresponding to the principal-component subspace.
It defines the probabilistic model,
p(z)p(x∣z)=N(z∣0,I),=N(x∣Wz+μ,σ2I),
where W is a d×m matrix with columns spanning a linear subspace within the data space that corresponds to the principal subspace.
Note
Here, W,μ,σ2 are the model parameters to be learned from data.