Dimension Reduction by PCA

We live in the age of “big data”. Voluminous data collections are mined for information using mathematical techniques. Problems in high dimensions are hard to solve — this is called “the curse of dimensionality”. Dimension reduction is essential in big data science. Many sophisticated techniques have been developed to reduce dimensions and reveal the information buried in mountains of data.

Correlated-Variables

We will look at the method of principal component analysis (PCA). This method identifies a small number of new variables called principal components that allow us to spot patterns. It also allows us to visualize the data in a simple two-dimensional diagram that often illustrates the core factors of a problem.

PCA has many uses: an intriguing application in genetics, showing how DNA can be used to infer an individual’s geographic origin with remarkable accuracy will be described next week [Reference to follow]. Here we look at the underlying mathematics.

The Maths of Principal Component Analysis (PCA)

For a collection of points in a high-dimensional space, we can find a `line of best fit’, PC1, that minimizes the average squared distance from the points to the line. We can then choose another best-fitting line, PC2, perpendicular to the first, giving the plane that best fits the data. Continuing, we get an orthogonal basis of uncorrelated basis vectors, called principal components. Principal Component Analysis (PCA) can be done by calculating the covariance matrix ${\mathbf{C}}$ of the data and performing an eigenvalue decomposition of ${\mathbf{C}}$ . it may also be done by Singular Value Decomposition (SVD).

We begin by considering how to diagonalize the covariance matrix. Suppose that ${\mathbf{X}}$ is a vector of measurements. Normally, the components of ${\mathbf{X}}$ are not independent. For example, the temperatures at adjacent locations are often close to equal value. The covariance matrix is

$\displaystyle \mathbf{C}_X = \langle ( \mathbf{X} - \mathbf{\bar X} ) (\mathbf{X} - \mathbf{\bar X})^\mathrm{T} \rangle$

For simplicity, we move the origin so that ${\mathbf{\bar X} = \mathbf{0}}$ . Then ${\mathbf{C}_X = \langle \mathbf{X} \mathbf{X}^\mathrm{T} \rangle}$ . We wish to find a linear combination of the variables, ${\mathbf{Y} = \mathbf {M X}}$ , such that the components of ${\mathbf{Y}}$ are independent or uncorrelated, that is ${\mathbf{C}_Y = \langle \mathbf{Y} \mathbf{Y}^\mathrm{T} \rangle}$ is diagonal. Now,

$\displaystyle \mathbf{C}_Y = \langle \mathbf{Y} \mathbf{Y}^\mathrm{T} \rangle = \langle \mathbf{(MX)} \mathbf{(MX)}^\mathrm{T} \rangle = \langle \mathbf{M} (\mathbf{XX}^\mathrm{T})\mathbf{M}^\mathrm{T} \rangle$

But the matrix ${\mathbf{C}_X = \mathbf{XX}^\mathrm{T}}$ is symmetric and has eigen-structure

$\displaystyle \mathbf{C}_X \mathbf{E} = \mathbf{E}\boldsymbol{\Lambda} \quad\mbox{or}\quad \mathbf{C}_X = \mathbf{E}\boldsymbol{\Lambda}\mathbf{E}^\mathrm{T}$

with orthogonal eigenvectors ${\mathbf{E}^\mathrm{T}\mathbf{E} = \mathbf{I}}$ . Thus, if we choose ${\mathbf{M} = \mathbf{E}^\mathrm{T}}$ , or ${\mathbf{Y} = \mathbf{E}^\mathrm{T}\mathbf{X}}$ , then ${\mathbf{C}_Y = \boldsymbol{\Lambda}}$ , and the components of ${\mathbf{C}_Y}$ are uncorrelated. The columns of ${\mathbf{E}^\mathrm{T}}$ are called the principal components.

We assume the eigenvalues ${\lambda_k}$ are arranged in order of decreasing magnitude. If some of the eigenvalues are zero, we can drop them, decreasing the dimension of ${\mathbf{Y}}$ . Even if this is not the case, we can drop eigenvalues of small magnitude, keeping only the most important elements. Thus, PCA takes the covariance matrix and discards the most insignificant components. Frequently, the essential information is captured by the first two principal components.

Singular Value Decomposition (SVD)

The fundamental theorem of linear algebra concerns matrix mappings between vector spaces. It may be stated concretely in terms of the singular value decomposition of a matrix. Any ${m\times n}$ matrix ${\mathbf{X}\ :\ \mathbf{R}^{n} \longrightarrow \mathbf{R}^{m}}$ may be written as the product of three components:

$\displaystyle \mathbf{X} = \mathbf{U}{\boldsymbol{\Sigma}}\mathbf{V}^\mathrm{T} \,.$

Here, ${\mathbf{U}}$ and ${\mathbf{V}}$ are respectively ${m\times m}$ and ${n\times n}$ orthogonal matrices, and ${\boldsymbol{\Sigma}}$ is an ${m\times n}$ matrix with the same dimensions as ${\mathbf{X}}$ . The matrix ${\boldsymbol{\Sigma}}$ is diagonal, and its nonzero elements ${\sigma_k}$ are called the singular values of ${\mathbf{X}}$ . The number ${r}$ of nonzero singular values is the rank of ${\mathbf{X}}$ .

FourSubSpaces — The four fundamental subspaces involved in the action of the matrix ${\mathbf{A}}$ .

The four fundamental subspaces involved in the action of the matrix ${\mathbf{A}}$ .

Suppose now that our data matrix ${\mathbf{X}}$ has singular value decomposition ${\mathbf{X} = \mathbf{U}{\boldsymbol{\Sigma}}\mathbf{V}^\mathrm{T}}$ . Then the correlation matrix is

$\displaystyle \mathbf{X} \mathbf{X}^\mathrm{T} = (\mathbf{U}{\boldsymbol{\Sigma}}\mathbf{V}^\mathrm{T})(\mathbf{V}{\boldsymbol{\Sigma}}\mathbf{U}^\mathrm{T}) = \mathbf{U}{\boldsymbol{\Sigma}}(\mathbf{V}^\mathrm{T}\mathbf{V}){\boldsymbol{\Sigma}}\mathbf{U}^\mathrm{T} = \mathbf{U}{\boldsymbol{\Sigma}}^2\mathbf{U}^\mathrm{T}$

We see the correspondence with the above approach: ${\mathbf{E} = \mathbf{U} }$ and ${\boldsymbol{\Lambda} = \boldsymbol{\Sigma}^2}$ . There are efficient algorithms to calculate the SVD of ${\mathbf{X}}$ without having to form the covariance matrix ${\mathbf{XX}^\mathrm{T}}$ . Thus, SVD is the standard route to calculating the principal components analysis of a data matrix.

The SVD is reviewed by Gilbert Strang (1993), who describes how, associated with each matrix, there are four fundamental subspaces, two of ${\mathbf{R}^{n}}$ and two of ${\mathbf{R}^{m}}$ . The picture he uses to illustrate the action of ${\mathbf{A}}$ on those subspaces is reproduced above.

SVD has a wide range of applications in pure and applied mathematics and statistics. An application in compression of fingerprint images is described in an earlier post to this blog.

Sources

Gilbert Strang (1993): The Fundamental Theorem of Linear Algebra. Amer. Math. Mon, 100, 848–855.

Related

Published by thatsmaths