
Principal component analysis


Sometimes you have to deal with a lack of information, but in some fields, like image storage and processing or bioinformatics, there are just too many features to consider every single property. This is where dimensionality reduction techniques come in handy. One of them is principal component analysis, or PCA. PCA is widely used in image compression, noise filtering, data visualization, and much more.

Imagine you want to analyze the employment rate of junior data scientists and understand why some employees are more successful than others. Maybe you can divide them into some groups? You have lots of features: the number of GitHub contributions, completed tasks, projects, time spent on the educational platform, gender, age, and others. You can guess that some of them are strongly correlated, like the number of completed tasks and the time spent on them. PCA could help you reduce the feature set to a smaller one.

The PCA algorithm

The PCA algorithm is used to reduce the number of properties describing the data. We transform the dataset by replacing features with their linear combinations. Imagine that you have a 3D model of an object and you want to show it to your friend, but you're allowed to use only one photo. You take a screenshot from the best angle, the one that captures as much as possible. Let's have a look at the photo: it's a plane, so we can imagine the xOy grid on it and trace how every point of our 3D model is projected, as well as its new coordinates. PCA works in the same way: we choose principal components, which are the direction vectors of this new plane.

To complete the PCA algorithm, we need to take the following steps:

  1. Standardization

  2. Identifying principal components (or PCs)

  3. Projecting our data into the PC subspace

  4. Restoring data (if needed)

When you do standardization, you assume that all features are equally important and will make an equivalent contribution to the final result.
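As a rough sketch (assuming the data is a NumPy array `X` with one row per object and one column per feature), standardization boils down to subtracting the mean and dividing by the standard deviation of each feature:

```python
import numpy as np

def standardize(X):
    """Center each feature at 0 and scale it to unit variance (z-scores)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # population standard deviation, as sklearn's StandardScaler uses
    return (X - mean) / std

# Hypothetical toy data: 5 objects, 3 features on very different scales
X = np.array([[170.0, 65.0, 1.2],
              [180.0, 80.0, 0.9],
              [160.0, 55.0, 1.5],
              [175.0, 70.0, 1.1],
              [165.0, 60.0, 1.3]])

X_std = standardize(X)
print(X_std.mean(axis=0))  # approximately 0 for every feature
print(X_std.std(axis=0))   # approximately 1 for every feature
```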

But how do you find the principal components and how many of them do you need? As we've already mentioned, our goal is to choose the best point of view to keep as much information as we can. In terms of math, we need to maximize variance. So, first of all, let's calculate our data's variability and see if some of the features are strongly correlated. We can do it with the help of a covariance matrix.

Covariance matrix

Covariance is a measure of how strongly two variables vary together. You can read more about covariance and the covariance matrix. There are some important properties of the covariance matrix (it is denoted as \Sigma):

  • It is an n \times n square matrix for an n-dimensional space (a dataset with n features for each object).

  • The \Sigma_{ij} element is equal to \Sigma_{ji} due to the symmetry of the covariance formula.

Diagonal elements of the covariance matrix show how much each feature spreads out around its mean, while the off-diagonal elements measure the relationships between pairs of variables.
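For instance (a sketch with stand-in data; in practice `X_std` would be your standardized dataset), the covariance matrix can be computed either from the definition or with `np.cov`:

```python
import numpy as np

rng = np.random.default_rng(0)
X_std = rng.standard_normal((100, 3))             # stand-in for standardized data

X_centered = X_std - X_std.mean(axis=0)           # make sure each feature has zero mean
n = X_centered.shape[0]

cov_manual = X_centered.T @ X_centered / (n - 1)  # definition of the covariance matrix
cov_numpy = np.cov(X_std, rowvar=False)           # rows are observations, columns are features

print(np.allclose(cov_manual, cov_numpy))         # True
print(cov_numpy.shape)                            # (3, 3): an n_features x n_features matrix
```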

Due to the spectral theorem, this kind of matrix can be rewritten as a diagonal matrix in the eigenvector basis, with real eigenvalues on the diagonal:

\begin{bmatrix} \lambda_{1} & & \\ & \ddots & \\ & & \lambda_{n} \end{bmatrix}

We assume that these eigenvalues \lambda_i are sorted in descending order.
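In NumPy this diagonalization is a single call (a sketch; `np.linalg.eigh` is meant for symmetric matrices and returns real eigenvalues in ascending order, so we reverse them to match the convention above):

```python
import numpy as np

# A small symmetric covariance matrix, used purely for illustration
cov = np.array([[2.0, 0.8, 0.3],
                [0.8, 1.0, 0.2],
                [0.3, 0.2, 0.5]])

eigenvalues, eigenvectors = np.linalg.eigh(cov)   # ascending eigenvalues
order = np.argsort(eigenvalues)[::-1]             # indices that sort them in descending order
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]             # column i is the eigenvector for lambda_i

print(eigenvalues)                                # lambda_1 >= lambda_2 >= lambda_3
# In the eigenvector basis the covariance matrix is diagonal:
print(np.round(eigenvectors.T @ cov @ eigenvectors, 6))
```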

Finding the principal component

There are two ways to describe mathematically how to find the principal components. As you already know, we project our data onto the PCs. Let's consider the two-dimensional case.

Geometrically, we need to choose a line that will:

  • have the biggest variance of the projected data (red segment);

  • have the minimal distance between the points and their projections (blue segment).

A 2D plot with x_1 on the X axis and x_2 on the Y axis, showing a scatter plot and two lines corresponding to the components

Here, x_1 and x_2 are two features of the dataset. A' is the projection of A onto the PC.

We can reformulate the first condition as "let's find the line that contains as much information as possible". So, our goal is to maximize the variance of the projections. After standardization, we calculate the covariance matrix. Then we need to find a unit vector w (which means that ||w|| = w^\top w = 1); this vector is our PC. The variance of the projection onto this vector is calculated as follows:

\mathrm{Var}(Xw) = \frac{w^\top X^\top X w}{n-1} = w^\top \Sigma w,

where \Sigma is the covariance matrix defined above.
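As a quick sanity check (a sketch with synthetic, centered data), this identity can be verified numerically:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal((200, 4))
X = X - X.mean(axis=0)                      # center the data so the formula applies

w = rng.standard_normal(4)
w = w / np.linalg.norm(w)                   # make w a unit vector

cov = X.T @ X / (X.shape[0] - 1)            # covariance matrix
var_of_projection = np.var(X @ w, ddof=1)   # Var(Xw) computed directly
quadratic_form = w @ cov @ w                # w^T Sigma w

print(np.isclose(var_of_projection, quadratic_form))  # True
```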

You can read the proof here.

Details

w^\top \Sigma w has the same value in every basis, so let's do a magic trick here. We've already mentioned that the covariance matrix is diagonal in the eigenvector basis, so the expression above can be calculated as \sum \lambda_i w_i^2 (a sum of the eigenvalues with some weights).

Since w is a unit vector, to maximize this expression we need to choose the weights as (1, 0, 0, 0, …), where 1 corresponds to the biggest eigenvalue. Given that we are in the eigenvector basis now, this means that w is the eigenvector e_1.

Since the principal components are linear combinations of the given features, PCA won't spot nonlinear correlations.

This algorithm looks at the whole picture of the data, which can be either a merit or a drawback depending on your goal: you'll find global properties, but lose local variance between neighbors. Also, outliers will significantly affect the result.
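Putting steps 2-4 of the algorithm together, here is a minimal sketch (not a production implementation) that finds the eigenvectors of the covariance matrix, projects the standardized data onto the first k of them, and restores an approximation of the original data:

```python
import numpy as np

def pca_project(X_std, k):
    """Project standardized data onto its first k principal components."""
    cov = np.cov(X_std, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]          # sort eigenvalues in descending order
    W = eigenvectors[:, order[:k]]                 # n_features x k matrix of the top PCs
    return X_std @ W, W

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)       # step 1: standardization

projected, W = pca_project(X_std, k=2)             # steps 2-3: find the PCs and project
restored = projected @ W.T                         # step 4: approximate restoration
print(projected.shape, restored.shape)             # (100, 2) (100, 5)
```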

Choosing the number of PCs

Let's consider the 3D model example again. Imagine you've created your dream house in a special application and now you want to describe it. With the first line (or PC) you could tell the size of your house or the width of the windows, but you lose information about ceiling height. Then you add another line, measuring how tall all parts are. And you still can't explain everything about the house because without the third dimension (depth in this example) you can't say how big your yard is.

To measure the amount of information you keep from the original object, you can calculate the variance explained by each new PC. The total variability is equal to the sum of the eigenvalues. If we take only one PC, it would explain \frac{\lambda_1}{\sum\limits_{i=1}^{n} \lambda_i} of the total variability. If you take two PCs, you calculate the explained variability for each one and sum them up. In most cases (for example, image compression) you can just select the percentage of variance that the PCs should explain (usually 80-90%) and take the required number of components from the beginning.
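In code this is just a cumulative sum over the sorted eigenvalues (a sketch with made-up eigenvalues):

```python
import numpy as np

eigenvalues = np.array([2.9, 1.1, 0.6, 0.25, 0.1, 0.05])   # illustrative, already sorted

explained = eigenvalues / eigenvalues.sum()    # share of variance explained by each PC
cumulative = np.cumsum(explained)              # share explained by the first k PCs together

k = np.argmax(cumulative >= 0.90) + 1          # the smallest k that keeps at least 90%
print(np.round(explained, 3))                  # [0.58, 0.22, 0.12, 0.05, 0.02, 0.01]
print(np.round(cumulative, 3))                 # [0.58, 0.80, 0.92, 0.97, 0.99, 1.00]
print(f"{k} components keep at least 90% of the variance")
```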

You can put the percentages of explained variance on a graph; this is called a scree plot. Some researchers prefer to choose the number of PCs by looking at it and seeing where the curve becomes almost horizontal, which corresponds to a small amount of newly explained variance. It's like finding the "elbow" point on the graph.

Look at the examples below. These are two ways to draw scree plots. The first one shows each new component on its own, and the second one shows the cumulative sum of components; it is a bit harder to implement, but it makes it clear when you hit "90% of variance explained".

Scree plots with Variance explained on the Y axis and the Principal Component count on the X axis
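Both versions can be drawn with matplotlib; here is a sketch reusing the illustrative eigenvalues from the previous snippet:

```python
import numpy as np
import matplotlib.pyplot as plt

eigenvalues = np.array([2.9, 1.1, 0.6, 0.25, 0.1, 0.05])
explained = eigenvalues / eigenvalues.sum()
components = np.arange(1, len(eigenvalues) + 1)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.plot(components, explained, "o-")             # scree plot: one point per component
ax1.set_xlabel("Principal component")
ax1.set_ylabel("Variance explained")

ax2.plot(components, np.cumsum(explained), "o-")  # cumulative version
ax2.axhline(0.9, linestyle="--", color="gray")    # the "90% of variance explained" line
ax2.set_xlabel("Principal component")
ax2.set_ylabel("Cumulative variance explained")

plt.tight_layout()
plt.show()
```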

The main difficulty in the whole PCA algorithm is defining the correct number of PCs. There is no clear recipe for how to do it: you need to analyze your data and keep in mind that this process is subjective.

Example: the Breast Cancer dataset

Let's apply what you've learned and try to visualize some real-world data! We will use the Breast Cancer dataset. It has 32 properties of cell nuclei from 569 people and provides information about the diagnosis (whether the neoplasm was malignant or benign). For now, let's consider only 4 features: mean radius, area, smoothness, and symmetry.

First of all, we standardize our data. Intuitively, you can say that radius and area correlate with each other, and so do smoothness and symmetry. To test our assumption, let's look at the covariance matrix:

             radius      smoothness  area        symmetry
radius       1.001761    0.170882    0.989095    0.148001
smoothness   0.170882    1.001761    0.177340    0.558757
area         0.989095    0.177340    1.001761    0.151559
symmetry     0.148001    0.558757    0.151559    1.001761

Indeed, there are two pairs of highly correlated features: radius/area and smoothness/symmetry. You can see that after standardization all properties have the same variance.
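Here is a sketch of how this matrix could be reproduced with scikit-learn's copy of the dataset (the column names are assumed to be `mean radius`, `mean smoothness`, `mean area`, `mean symmetry`; the diagonal values slightly above 1 most likely appear because standardization divides by the population standard deviation while `np.cov` uses the n - 1 denominator):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer(as_frame=True)
features = ["mean radius", "mean smoothness", "mean area", "mean symmetry"]

X_std = StandardScaler().fit_transform(data.data[features])  # standardize the 4 features
cov = np.cov(X_std, rowvar=False)                            # 4 x 4 covariance matrix

print(pd.DataFrame(cov, index=features, columns=features).round(6))
```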

The next step is diagonalization: computing the eigenvectors and eigenvalues. After this transformation, our matrix will look like this:

\begin{bmatrix} 2.164 & 0 & 0 & 0 \\ 0 & 1.387 & 0 & 0 \\ 0 & 0 & 0.443 & 0 \\ 0 & 0 & 0 & 0.013 \end{bmatrix}

Since our task is to visualize the data, the number of components is fixed at two, but let's still calculate the percentage of explained variance for each component.

The first PC explains \frac{\lambda_1}{\sum\limits_{i=1}^{n} \lambda_i} = \frac{2.164}{2.164 + 1.387 + 0.443 + 0.013} = \frac{2.164}{4.007} = 0.54 of the total variance.

The first and the second PCs together explain

\frac{\lambda_1 + \lambda_2}{\sum\limits_{i=1}^{n} \lambda_i} = \frac{2.164 + 1.387}{2.164 + 1.387 + 0.443 + 0.013} = \frac{3.551}{4.007} = 0.89

of the variance.

We can sum up our calculations in a table, like this:

Number of PCs          1    2    3     4
Explained variance, %  54   89   99.7  100

As you can see, the first two components are enough to save almost 90% of the variance.
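The same numbers can be obtained with scikit-learn's PCA, which exposes the per-component shares through its `explained_variance_ratio_` attribute (a sketch continuing the four-feature example):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

data = load_breast_cancer(as_frame=True)
features = ["mean radius", "mean smoothness", "mean area", "mean symmetry"]
X_std = StandardScaler().fit_transform(data.data[features])

pca = PCA(n_components=4).fit(X_std)
print(np.round(pca.explained_variance_ratio_, 3))             # share of variance per PC
print(np.round(np.cumsum(pca.explained_variance_ratio_), 3))  # cumulative: roughly 0.54, 0.89, 0.997, 1.0
```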

Let's plot the results and see how the different cell types are spread in the plot. Blue dots correspond to malignant neoplasms and red dots to benign cells. From this simple example, we can get a very important insight: cancerous cells show large variability in their appearance. And it is true, since cancerous cells lose the mechanisms that control their features and division.

Scatter plot of the chosen features with PC1 on the X axis and PC2 on the Y axis
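A sketch of how such a plot could be produced (in scikit-learn's encoding of this dataset, target 0 is malignant and 1 is benign; the colors below follow the description above):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

data = load_breast_cancer(as_frame=True)
features = ["mean radius", "mean smoothness", "mean area", "mean symmetry"]
X_std = StandardScaler().fit_transform(data.data[features])

pcs = PCA(n_components=2).fit_transform(X_std)    # project onto the first two PCs
malignant = (data.target == 0).to_numpy()         # 0 = malignant, 1 = benign

plt.scatter(pcs[malignant, 0], pcs[malignant, 1], c="blue", s=10, label="malignant")
plt.scatter(pcs[~malignant, 0], pcs[~malignant, 1], c="red", s=10, label="benign")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.show()
```

Note that the axes may come out mirrored compared to the figure, since the sign of each eigenvector is arbitrary. Replacing the four selected columns with the full feature table gives the plot discussed in the next subsection.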

If we consider all features

If we hadn't picked out 4 features for the analysis and had instead reduced all 32 features to just 2 principal components, we'd have gotten a different picture. As you can see, here we can separate the two groups more clearly. You can also see that most benign cells form compact clusters, while malignant ones tend to vary. For such complex objects as neoplasms, we should take lots of data into account in order to draw correct conclusions.

Scatter plot of the full dataset with PC1 on the X axis and PC2 on the Y axis

Conclusion

PCA is a broadly used unsupervised learning algorithm. It finds applications in feature extraction, pattern detection, image compression, visualization, and other areas. The main idea is to reduce the number of features while preserving as much information as possible. To achieve this goal, we project our data onto a subspace spanned by the eigenvectors of the covariance matrix, thereby maximizing the variance of the projected data.
