Principal component analysis, or PCA, is a dimensionality reduction algorithm based on matrix decomposition: it projects the data onto a subspace spanned by eigenvectors. It is broadly used in noise filtering, detecting patterns, image compression, and much more. Today we will use it to visualize a dataset with 13 features in one plot.
Data preprocessing
In this topic, we'll consider the Wine recognition dataset. It contains 13 numeric features of drinks (e.g. alcohol, color intensity, concentrations of organic compounds) from 3 different classes.
Let's load the dataset from sklearn.datasets and save the input features to X and the class labels to y:
from sklearn.datasets import load_wine
#we can do this in two steps with the as_frame parameter
data = load_wine(as_frame=True)
X = data.data
y = data.target
#or in one step with the return_X_y parameter:
X, y = load_wine(return_X_y=True)
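As a quick check of what we've just loaded: the Wine recognition dataset contains 178 samples and 13 features.
#a quick sanity check of the loaded data
print(X.shape)  #(178, 13)
print(y.shape)  #(178,)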
After loading the dataset, we need to standardize the features with StandardScaler(). We do this to make sure that different features have the same impact on the result:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
#learn each feature's mean and standard deviation from the data
scaler.fit(X)
X_scaled = scaler.transform(X)
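To make sure the scaling worked as expected, we can check that every feature now has zero mean and unit variance (rounding hides tiny floating-point errors):
import numpy as np

#each column should have mean ~0 and standard deviation ~1 after scaling
print(X_scaled.mean(axis=0).round(2))
print(X_scaled.std(axis=0).round(2))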
Parameters of PCA
Before running the PCA algorithm, we need to determine its hyperparameters:
- copy (True by default) – if False, PCA is done directly on the input data, which saves space, but you need to use .fit_transform() (not a combination of .fit() and .transform()).
- whiten (False by default) – if True, the resulting components are scaled to have unit variance; this removes information about the relative scale of the components but can help some downstream models.
- svd_solver ('auto' by default) – 'auto', 'full', 'arpack', or 'randomized'. By default, this parameter is set to 'auto': for big matrices (more than 500x500) it'll run 'randomized', for smaller ones – 'full', returning only the n_components you set.
- n_components – int, float, or 'mle'. By default, the parameter is set to the minimum of the number of features and the number of samples, so all components are kept. If you want to keep fewer, you can pass the number of PCs as an integer.
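As a small illustration of the first two flags (the pca_demo object below is created only for demonstration and isn't used in the rest of the topic):
from sklearn.decomposition import PCA

#illustrative only: PCA that works on the input array in place and whitens the output;
#with copy=False you should call .fit_transform() instead of .fit() followed by .transform()
pca_demo = PCA(copy=False, whiten=True)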
The svd_solver and n_components parameters are strongly related. Here are some scenarios for their combinations (a short sketch of these calls follows the list):
- svd_solver = 'full', n_components is unset – you'll run the algorithm with maximum accuracy, and PCA will return all principal components.
- svd_solver = 'full', n_components is an integer – you'll run the same algorithm and calculate all the PCs, but PCA will return only the number of components you set.
- svd_solver = 'arpack', n_components is an integer – you'll calculate only the PCs you set.
- svd_solver = 'full', n_components is a float between 0 and 1 – PCA will treat this number as the share of variance to be explained and return the minimum set of PCs that explains it.
- svd_solver = 'full' or 'auto' and n_components = 'mle' (stands for maximum likelihood estimation) – you don't choose the exact number yourself; sklearn will pick the number of components automatically. This method is used for noise filtering in data preprocessing.
- svd_solver = 'randomized' and n_components is an integer – you'll use an approximate probabilistic algorithm, which may come in handy for large datasets.
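Here is the sketch mentioned above – a few of these combinations written out as constructor calls, continuing with the PCA import from the previous sketch (the variable names and concrete numbers are only examples):
#exact algorithm, all principal components are kept
pca_full = PCA(svd_solver='full')
#exact algorithm, keep the smallest number of PCs that explains at least 95% of the variance
pca_95 = PCA(svd_solver='full', n_components=0.95)
#let sklearn choose the number of components with maximum likelihood estimation
pca_mle = PCA(svd_solver='full', n_components='mle')
#approximate randomized algorithm with a fixed number of components
pca_random = PCA(svd_solver='randomized', n_components=2)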
Transforming features
Let's run the PCA algorithm. Firstly, we need to import PCA from the sklearn.decomposition module:
from sklearn.decomposition import PCA
If we set the svd_solver parameter to 'full' (or choose n_components equal to the number of features), we'll be able to draw a scree plot and see the contribution of each component to the explained variance:
pca = PCA(svd_solver='full')
When it's all set, let's transform the scaled features with the .fit_transform() method.
X_pca = pca.fit_transform(X_scaled)
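Since we kept all the components, the transformed array has as many columns as the original data – 13 – but each column is now a principal component:
print(X_pca.shape)  #(178, 13)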
Let's look at what part of the initial variance is explained by each component using the .explained_variance_ratio_ attribute of PCA. You'll get an array with descending values:
print(pca.explained_variance_ratio_)
#[0.36198848 0.1920749 0.11123631 0.0706903 0.06563294 0.04935823
#0.04238679 0.02680749 0.02222153 0.01930019 0.01736836 0.01298233
#0.00795215]
This is an inconvenient way to represent the data, so let's put these values on a plot.
Explained variance plots
Firstly, let's create the x-axis of the plot – an array of integers from 1 to the number of the last PC. The .n_components_ attribute returns the number of components.
import numpy as np
PC_values = np.arange(1, pca.n_components_ + 1)
We're ready to draw a plot. We pass x and y values and a format string (the fmt parameter) to the plt.plot() function – 'o' stands for circle markers and '-' for a solid line between data points.
import matplotlib.pyplot as plt
plt.plot(PC_values, pca.explained_variance_ratio_, 'o-')
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')
plt.show()
After running the code above, you'll get this plot:
But to choose the number of PCs, you often need to look at the cumulative variance, so let's draw another plot. Values for the x-axis are ready, so let's prepare the y values with the np.cumsum() function, which returns the cumulative sum of array elements.
total_explained = np.cumsum(pca.explained_variance_ratio_)
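This array also gives a quick way to choose the number of components for a target share of explained variance; the n_components_90 name and the 0.90 threshold below are just an example:
#the first index where the cumulative variance reaches 90%, plus one to get a count
n_components_90 = np.argmax(total_explained >= 0.90) + 1
print(n_components_90)  #8 for this dataset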
Let's draw the plot the same way as before:
plt.plot(PC_values, total_explained, 'o-')
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Cumulative variance')
plt.show()
After running the code above, you'll get the plot below. Notice that all the PCs together explain all (1.0, or 100%) of the variance.
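We can check it with the last element of the cumulative sum – it equals 1 up to floating-point error:
print(total_explained[-1])  #~1.0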
Visualization
Our goal is data visualization. PCA is broadly used for this task because we can perceive only two or three dimensions, but datasets usually have many more features (or dimensions).
To make further work more convenient, let's put the data into a data frame with three columns: the value of the first PC, the value of the second PC, and the class of wine.
import pandas as pd
df = pd.DataFrame({"PC1": X_pca[:,0],
"PC2": X_pca[:,1],
"wine": data['target']})
Let's draw a scatterplot with the seaborn module. We pass the names of the columns with the x and y values, the column for color coding (hue), and the data frame.
import seaborn as sns
sns.scatterplot(x='PC1', y='PC2', hue='wine', data=df)
plt.show()
With this code, you'll get the graph below. You can see that data points form clusters that highly correlate with their classes. Because of this, PCA is often used before clustering.
But how much of the initial variance is preserved with only two principal components? Let's find out:
#we've already computed the cumulative sum above, so we can reuse it
print('Variance explained by 2 PCs:', total_explained[1])
#Variance explained by 2 PCs: 0.554063383569353
In other words, we keep 55.4% of the information about the 13-dimensional dataset in only 2 dimensions.
Conclusion
Let's quickly summarize the topic by going over the main takeaways:
- PCA is broadly used for visualization, feature selection, detecting patterns, image compression, and other tasks.
- To set up the PCA algorithm, you need to determine the number of principal components and the solving method.
- After setting up PCA, apply the .fit_transform() method to the preprocessed features.
- Use the .explained_variance_ratio_ attribute to see the contribution of each component to the explained variance.