
Principal component analysis in scikit-learn


Principal component analysis, or PCA, is a dimensionality reduction algorithm based on matrix decomposition: it projects the data onto a subspace spanned by eigenvectors. It is broadly used for noise filtering, pattern detection, image compression, and much more. Today we will use it to visualize a dataset with 13 features in a single plot.

Data preprocessing

In this topic, we'll consider the Wine recognition dataset. It contains 13 numeric features of drinks (e.g. alcohol, color intensity, concentrations of organic compounds) from 3 different classes.

Let's load the dataset from sklearn.datasets and save the input features to X and the class labels to y:

from sklearn.datasets import load_wine
# we can do this in two steps with the as_frame parameter
data = load_wine(as_frame=True)
X = data.data
y = data.target

# or in one step with the return_X_y parameter:
X, y = load_wine(return_X_y=True)

After loading the dataset, we need to standardize the features with StandardScaler(). We do this to make sure that different features have the same impact on the result:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)                   # learn the mean and standard deviation of each feature
X_scaled = scaler.transform(X)  # rescale the features to zero mean and unit variance
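
As a quick sanity check (not part of the standard workflow, just a way to verify the scaling), each scaled feature should now have zero mean and unit variance:

import numpy as np

# each column of X_scaled should have mean ~0 and standard deviation ~1
print(np.round(X_scaled.mean(axis=0), 2))
print(np.round(X_scaled.std(axis=0), 2))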

Parameters of PCA

Before running the PCA algorithm, we need to determine its hyperparameters:

  • copy (True by default) – if False, PCA is applied to the input data in place, which saves memory, but then you have to use .fit_transform() instead of calling .fit() and .transform() separately.
  • whiten (False by default) – if True, the transformed components are rescaled to have unit variance; this discards the relative variance scales of the components but can help some downstream estimators.
  • svd_solver ('auto' by default) – 'auto', 'full', 'arpack', or 'randomized'. With 'auto', the solver is chosen for you: for big matrices (larger than 500x500, when relatively few components are requested) it'll run 'randomized'; otherwise it runs 'full' and returns only the n_components you set.
  • n_components – int, float, or 'mle'. By default, the parameter is set to the minimum of the number of features and the number of samples, so all components are kept. If you want to keep fewer, pass the number of PCs as an integer.

The svd_solver and n_components parameters are strongly related. Here are the main scenarios for their combinations (a short sketch of the corresponding calls follows the list):

  • svd_solver = 'full', n_components is unset – you'll run the exact algorithm, and PCA will return all principal components.
  • svd_solver = 'full', n_components is an integer – the same exact algorithm calculates all the PCs, but PCA will return only the number of components you set.
  • svd_solver = 'arpack', n_components is an integer – only the PCs you ask for are calculated.
  • svd_solver = 'full', n_components is a float between 0 and 1 – PCA treats this number as the fraction of variance to be explained and returns the minimum set of PCs that explains it.
  • svd_solver = 'full' or 'auto' and n_components = 'mle' (stands for maximum likelihood estimation) – you don't specify an exact number of components; sklearn chooses it automatically. This setting is often used for noise filtering in data preprocessing.
  • svd_solver = 'randomized' and n_components is an integer – you'll use an approximate, randomized algorithm, which comes in handy for large datasets.
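
To make these combinations more concrete, here is a minimal sketch of the corresponding constructor calls (the particular numbers, such as 0.95 and 2, are only illustrative choices):

from sklearn.decomposition import PCA

# exact SVD, keep all principal components
pca_full = PCA(svd_solver='full')

# keep the smallest number of components that explains at least 95% of the variance
pca_95 = PCA(n_components=0.95, svd_solver='full')

# let sklearn choose the number of components via maximum likelihood estimation
pca_mle = PCA(n_components='mle', svd_solver='full')

# approximate randomized SVD with a fixed number of components, faster on large data
pca_rand = PCA(n_components=2, svd_solver='randomized', random_state=42)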

Transforming features

Let's run the PCA algorithm. Firstly, we need to import PCA from the sklearn.decomposition module:

from sklearn.decomposition import PCA

If we set the svd_solver parameter to 'full' (or choose n_components equal to the number of features), we'll be able to draw a scree plot and see the contribution of each component to the explained variance:

pca = PCA(svd_solver='full')

When it's all set, let's transform the scaled features with the .fit_transform() method:

X_pca = pca.fit_transform(X_scaled)

Let's look at what part of the initial variance is explained by each component with the .explained_variance_ratio_ attribute of PCA. You'll get an array of values in descending order:

print(pca.explained_variance_ratio_)

#[0.36198848 0.1920749  0.11123631 0.0706903  0.06563294 0.04935823
#0.04238679 0.02680749 0.02222153 0.01930019 0.01736836 0.01298233
#0.00795215]

This is an inconvenient way to inspect the values, so let's put them on a plot.

Explained variance plots

Firstly, let's create the x-axis values of the plot – an array of integers from 1 to the number of the last PC. The .n_components_ attribute returns the number of components.

import numpy as np
PC_values = np.arange(1, pca.n_components_ + 1)

We're ready to draw a plot. We pass the x and y values and the format string (the fmt parameter) to the plt.plot() function – 'o' stands for small dots as markers and '-' for a solid line between data points.

import matplotlib.pyplot as plt

plt.plot(PC_values, pca.explained_variance_ratio_, 'o-')

plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')

plt.show()

After running the code above, you'll get this plot:

Variance explained by each principal component

But for choosing the number of PCs you often need to look at the cumulative variance, so let's make another plot. The x-axis values are ready, so let's prepare the y-axis values with the np.cumsum() function, which returns the cumulative sum of array elements.

total_explained = np.cumsum(pca.explained_variance_ratio_)

Let's draw the plot as we did before:

plt.plot(PC_values, total_explained, 'o-')
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Cumulative variance')
plt.show()

After running the code above, you'll get the plot below. Notice that all PCs together explain all (1.0, or 100%) of the variance.

Plot of cumulative variance explained by principal components

Visualization

Our goal is data visualization. PCA is broadly used for this task because we can perceive only two or three dimensions, while datasets usually have many more features (or dimensions).

To make further work more convenient, let's put the data into a data frame with three columns: the value of the first PC, the value of the second PC, and the class of the wine.

import pandas as pd
df = pd.DataFrame({"PC1": X_pca[:,0],
                   "PC2": X_pca[:,1],
                   "wine": data['target']})

Let's draw a scatterplot with the seaborn module. We pass the names of the columns with the x and y values, the column used for coloring (hue), and the data frame.

import seaborn as sns

sns.scatterplot(x='PC1', y='PC2', hue='wine', data=df)

With this code, you'll get the graph below. You can see that the data points form clusters that highly correlate with their classes. Because of this, PCA is often used before clustering.

Scatter plot of data in principal component subspace. Dots form clusters on a plot.

But how much of the initial variance is preserved with only two principal components? Let's find out:

total_variance = np.cumsum(pca.explained_variance_ratio_) 
print('Variance explained by 2 PCs:', total_variance[1])

#Variance explained by 2 PCs: 0.554063383569353

In other words, we keep 55.4% of the variance of the 13-dimensional dataset in only 2 dimensions.

Conclusion

Let's quickly summarize the topic by going over the main takeaways; a minimal end-to-end recap follows the list:

  • PCA is broadly used for visualization, noise filtering, pattern detection, image compression, and other tasks.
  • To set up the PCA algorithm, you need to choose the number of principal components (n_components) and the solver (svd_solver).
  • After setting up PCA, apply the .fit_transform() method to the preprocessed (scaled) features.
  • Use the .explained_variance_ratio_ attribute to see the contribution of each component to the explained variance.
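
As a recap, here is a minimal end-to-end sketch that puts the main steps together, assuming the same Wine dataset and two components for plotting, as above:

from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# load the data, standardize the features, and project them onto two principal components
X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)                          # (178, 2)
print(pca.explained_variance_ratio_.sum())  # roughly 0.55, as we saw above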