Section 14.3 Principal Component Analysis
Principal Component Analysis (PCA) is an exploratory method used across many scientific disciplines to analyze high-dimensional data. Its goal is to reduce the dimensionality of a dataset while preserving as much information as possible. PCA relies on several concepts from linear algebra studied previously in this textbook, including Matrices, Eigenvectors and Eigenvalues and Diagonalization.
The dimension \(p\) of a dataset corresponds to the number of variables measured in a study. Each of these variables is called a feature. If the experiment is repeated \(n\) times, each resulting observation is referred to as an instance. When \(p\) is large, the resulting data cloud becomes difficult or impossible to visualize directly. The goal of PCA is to summarize the information contained in the data in a form that is easier to understand and visualize. For example, if the number of variables can be reduced to two or three, the data can be represented and visualized in a low-dimensional space.
Dimensionality reduction is achieved by transforming a large set of correlated variables into a smaller set of uncorrelated ones called principal components. This transformation is performed in a way that preserves as much of the variation in the original dataset as possible.
Essentially, PCA consists of finding a new basis for \(\mathbb R^p\) such that, when the data are expressed in this basis, their projection onto the first few components captures the maximum possible variance. The variance of the data quantifies the dispersion of data points around their mean. This new basis is given by the eigenvectors of the covariance matrix of the data. Each vector in the new basis defines a principal component. The principal components are ordered such that just the first few capture most of the variation present in all the original variables.
The following section presents the steps of the PCA algorithm used to construct this basis.
Subsection 14.3.1 Algorithm
The algorithm is described in two stages. First, we present the computation of the principal components. We then describe how the original data are projected onto a two-dimensional space for visualization.
Subsubsection 14.3.1.1 Calculation of Principal Components
-
Organize the data.We organize the data in a matrix \(A\) with \(n\) rows for the instances and \(p\) columns for to the features. Write it as a matrix:\begin{equation*} A= \begin{bmatrix} x_1^1 & x_1^2 & \dots &x_1^p \\ \vdots & \vdots\\ x_n^1 & x_n^2 & \dots & x_n^p \end{bmatrix} \end{equation*}
-
Find the mean of each feature.Let \(\bar x\) be the vector whose components are the means of each of the \(p\) features: \(\bar x = (\overline {x^1}, ... , \overline {x^p})\) where each \(\overline {x^j} = \frac{1}{n} \bigg ( x^j_1 + x^j_2 + \dots + x^j_n \bigg )\text{.}\)
-
Center the data.Create a \(n \times p\) matrix, \(X\text{,}\) with rows of mean centered data using the following formula:\begin{equation*} X= \begin{bmatrix} x_1^1 - \overline{x^1} & x_1^2 - \overline{x^2} & \dots &x_1^p- \overline{x^p} \\ \vdots & \vdots\\ x_n^1- \overline{x^1} & x_n^2- \overline{x^2} & \dots & x_n^p- \overline{x^p} \end{bmatrix} \end{equation*}
-
Find the covariance matrix of the centered data.The covariance matrix is a square matrix with variances in the diagonal and covariances in the non-diagonal components. The covariance matrix of centered data can be computed with the formula:\begin{equation*} C = \frac{1}{n-1} X^T X\text{.} \end{equation*}
-
Calculate the eigenvectors and the eigenvalues.Since the matrix \(C\) is symmetric, it admits a basis of orthonormal eigenvectors. Find eigenvectors of \(C\) and their associated eigenvalues. Then normalize the eigenvectors.
-
Order eigenvectors.Arrange the normalized eigenvectors in a list \(v_1, v_2, \dots, v_p\text{,}\) ordered by decreasing eigenvalues. That is, such that the corresponding eigenvalues satisfy \(\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p\text{.}\)
The principal components are given by the eigenvectors that form our new basis, \(B= \{ v_1, v_2, \cdots, v_p \}
\text{.}\)
At this stage, we have as many principal components as original variables. However, the components are now ranked according to the amount of variance they explain. To reduce the dimensionality of the data, we can retain only the first few principal components, as they capture most of the variance present in the original dataset.
Subsubsection 14.3.1.2 Two-dimensional Visualization
To obtain a graphical representation of the data in a plot whose axes correspond to the first two principal components:
-
Express each instance \(x_i\) in the new basis:\begin{equation*} x_i = a_{i1}v_1 + a_{i2}v_2 + \cdots + a_{in} v_p \end{equation*}where \(v_1, v_2, \ldots, v_p\) are the principal components (eigenvectors), and \(a_{i1}, a_{i2}, \ldots, a_{ip}\) are the coordinates of \(x_i\) in the new basis \(B\text{.}\)
-
Project each instance in the basis \(B\) onto the first two coordinates:\begin{equation*} (a_{i1}, a_{i2}) \end{equation*}
To perform the first operation, we multiply each data vector by the change-of-basis matrix. To perform the second, we retain only the first two coordinates of the resulting vector. These two operations can be combined by multiplying the data by the matrix formed by the first two rows of the change-of-basis matrix.
Plotting the resulting vector in a coordinate system whose axes correspond to the first two principal components provides a two-dimensional representation of the original data that preserves as much variance as possible.
Subsection 14.3.2 City Colleges of Chicago Locations
One important application of PCA is as an exploratory tool for investigating a phenomenon within a population. Researchers begin by selecting a set of variables that they believe may be related to the phenomenon of interest. PCA produces a two-dimensional visualization that facilitates interpretation.
If the resulting plot reveals distinct clustersβfor example, one cluster containing individuals affected by the phenomenon and another containing the rest of the populationβthis suggests that the selected variables may be relevant for characterizing or explaining the phenomenon. Conversely, if the data points do not exhibit any meaningful separation, the chosen variables may provide limited insight into the phenomenon under study.
Another important aspect of PCA is its ability to summarize the information contained in a dataset through a smaller set of new variables that often capture the most significant patterns of variation in the data and may provide a more informative representation than the original variables.
Letβs investigate these two aspects using the following example. Our population consists of neighborhoods in Chicago. We know that some neighborhoods have a City Colleges of Chicago (CCC) campus, while others do not. We want to study this phenomenon and determine whether certain neighborhood characteristics (variables) influence the location of City Colleges of Chicago campuses.
We consider the following neighborhoods of Chicago as our instances: Dunning, Englewood, Loop, Uptown, Sauganash, Lincoln Park, Hyde Park, and Armour Square.
We know that Dunning, Englewood, Loop, and Uptown have a CCC campus whereas Sauganash, Lincoln Park, Hyde Park, and Armour Square do not.
We consider the following four features: number of CTA train stations, average household size (by number of people), percentage of foreign born residents, and percentage of population below poverty level.
| Neighborhood | CTA | Household | Foreign | Poverty |
|---|---|---|---|---|
| Dunning | 0 | 6.8 | 31.9 | 8.8 |
| Englewood | 4 | 4.3 | 4.4 | 31.6 |
| Loop | 8 | 1.6 | 17.1 | 11.2 |
| Uptown | 3 | 3.2 | 21.6 | 15.3 |
| Sauganash | 0 | 8.4 | 20.2 | 5.9 |
| Lincoln Park | 2 | 3.6 | 13.4 | 8.3 |
| Hyde Park | 0 | 3.5 | 21.6 | 21.7 |
| Armour Square | 2 | 2.3 | 35.5 | 32.3 |
For the first part of the algorithm, we will calculate the principal components.
-
We organize the data in a matrix \(A\text{:}\)
-
We find now the mean of each feature, \(\bar x\text{:}\)
-
Next, we create a matrix of mean centered data, \(X\text{:}\)
-
We compute the covariance matrix.
-
We calculate the eigenvectors and eigenvalues by using
eigenvectors_right. Recall that the output ofeigenvectors_rightis a list with elements of the form \((e,[v_1,...,v_k],m)\) where \(e\) is the eigenvalue , \([v_1,...,v_k]\) are the corresponding eigenvectors, and \(m\) is the algebraic multiplicity of the eigenvalue. -
We order the eigenvectors by decreasing eigenvalues:
Since the principal components are the directions of the eigenvectors, we need to retrieve the eigenvectors from the previous list.
These four vectors form the new basis of principal components, \(B = \{v_1, v_2, v_3, v_4\}\text{.}\)
For the second part of the algorithm, we express each point in the original data set in the new basis and then project on to the first two coordinates.
To achieve that, we will compute the change-of-basis matrix as described in previous sections. Recall that a vector \(u\) in the standard basis \(S\) can be written in a base \(B\) by multiplying by that change-of-basis matrix:
\begin{equation*}
[u]_B = P_{B \leftarrow S} u
\end{equation*}
This matrix is the inverse of \(P_{S \leftarrow B}\) which is just the matrix whose columns are the vectors in the base \(B\text{.}\)
Letβs start by finding the matrix \(V = P_{S \leftarrow B}\) by writing the eigenvectors as columns:
The change of basis matrix \(C = P_{B \leftarrow S}\) is just the inverse of this matrix:
To obtain the coordinates of a vector in the new basis, it suffices to multiply the vector by this matrix. Since we are only interested in the first two coordinates of the result (the first two principal components), we can first remove the remaining rows of the matrix and multiply the vector only by the first two rows. This directly yields the projection of the vector onto the subspace spanned by the first two principal components.
Letβs start by finding the submatrix formed by the first two rows:
Now we can multiply this matrix by each of the vectors in the original data:
We then obtained the coordinates of each data point in the new basis, corresponding to their projections onto the subspace spanned by the first two principal components. We are now ready to visualize these points in the resulting two-dimensional subspace.
Finally, we add labels and color (blue meaning with a CCC campus and red meaning without) to each data point to visualize what instance each point represents:
Since the neighborhoods with and without a CCC campus appear to be mixed together, the variables selected may not explain the campus locations very well. You can try using different variables by replacing one of the original variables with another and then repeating the analysis. If the resulting plot shows neighborhoods with and without a campus separating into two distinct clusters, you may have identified a set of variables whose first two principal components do a good job of explaining the locations of the CCC campuses.
In terms of data summarization, we can see that the first principal component can be expressed in terms of the original variables as:
In this case, most of the weight in the linear combination is assigned to the last variable: the percentage of people living below the poverty line.
If we project the data onto the x-axis (the first principal component), we observe that Englewood and Armour Square in the southern part of the city cluster together, while neighborhoods Dunning and Sauganash in the northern part also cluster together. These two groups are located far apart along the first principal component, indicating a strong separation according to this measure.
Similarly, the second principal component can be expressed in terms of the original variables as:
In this case, most of the weight in the linear combination is assigned to the third variable: the percentage of foreign-born residents.
If we project the data onto the y-axis (the second principal component), we observe that Englewood and Armour Square are now the farthest apart according to this measure. Englewood has the lowest proportion of foreign born residents, while Armour Square has the highest.
Aside
Subsection 14.3.3 Three Mice Strains
We will now consider an example in which the data are already two-dimensional, but PCA demonstrates how cluster structure can be preserved when reducing the data to one dimension. Although the data may be projected onto any axis, the first principal component axis captures the greatest amount of variation and therefore provides the best preservation of the original cluster separation.
The following dataset shows three distinct clusters corresponding to different mouse strains.
| Mouse ID. | Mouse Length (cm) | Mouse Weight (g) |
|---|---|---|
| 1 | 5.44 | 35.46 |
| 2 | 6.35 | 36.77 |
| 3 | 6.82 | 36.49 |
| 4 | 6.75 | 34.12 |
| 5 | 6.21 | 33.95 |
| 6 | 6.67 | 35.71 |
| 7 | 7.12 | 33.85 |
| 8 | 6.39 | 36.22 |
| 9 | 9.82 | 44.71 |
| 10 | 10.56 | 46.24 |
| 11 | 10.35 | 45.11 |
| 12 | 9.21 | 46.35 |
| 13 | 10.76 | 44.89 |
| 14 | 5.34 | 43.05 |
| 15 | 7.25 | 43.44 |
| 16 | 5.61 | 43.4 |
| 17 | 5.05 | 42.52 |
| 18 | 5.15 | 42.16 |
We organize the data in a matrix:
We plot the two-dimensional data using
scatter_plot and observe the three clusters corresponding to different mouse strains.
If we project the data onto the x-axis, the separation between the clusters is largely lost.
However, if we first apply PCA and then project the data onto the first principal component, the cluster structure is preserved, and the three groups remain clearly separated. Letβs apply first PCA to the dataset:
We observe that the projection onto the first principal component retains most of the variation of the data showing the three original clusters.
