As you have seen in the corresponding topic, SVM is an algorithm used mainly in classification. The algorithm's goal is to find the optimal hyperplane that can separate data points of different classes. In this topic, we will go through a hands-on classification example with SVM using scikit-learn.
Available SVM implementations
scikit-learn comes with three SVM implementations for classification: SVC, NuSVC, and LinearSVC. In this topic, we will use the svm.SVC class to classify the data. Below is a brief comparison of the available implementations:
| SVM implementation | Description | Kernel | Notes |
| --- | --- | --- | --- |
| SVC | The most general implementation, which can apply the kernel trick. The main tunable parameter is C, which controls the trade-off between a wide margin and the misclassification of training samples. | Radial Basis Function (RBF) by default, accepts any custom kernel | Will be slow for large datasets |
| NuSVC | Similar to SVC, but requires tuning the nu parameter instead of C. nu is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors. | Radial Basis Function (RBF) by default, accepts any custom kernel | Offers a more flexible approach with respect to the number of support vectors |
| LinearSVC | Similar to SVC, but a more efficient implementation for high-dimensional data, with only a single (linear) kernel type. | Only linear kernel | Can handle both sparse and dense datasets |
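To make the comparison above concrete, here is a minimal sketch that fits all three classes through the same interface. The synthetic dataset from make_classification, the variable names, and the raised max_iter for LinearSVC are assumptions made only for this illustration. They are not part of the example that follows.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, NuSVC, LinearSVC

# Assumption: a small synthetic binary problem, used only for illustration.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

# All three classes expose the same fit/predict/score interface.
candidates = {
    "SVC": SVC(),                             # tuned mainly through C
    "NuSVC": NuSVC(),                         # tuned mainly through nu
    "LinearSVC": LinearSVC(max_iter=10_000),  # linear kernel only
}

scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_train, y_train)
    scores[name] = estimator.score(X_test, y_test)
    print(f"{name}: {scores[name]:.3f}")
```

The exact accuracies depend on the data, so the sketch should be read as a demonstration of the shared API and not as a benchmark.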
Data preparation
In this topic, we will use the breast cancer dataset provided by scikit-learn. We first load the dataset and split it into the training and testing datasets:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=27)
The dataset contains 30 features (stored in X), describing various characteristics of the cell nuclei, such as their radius, area, symmetry, etc. y, the target variable, contains only two classes represented by 0s and 1s. Label 0 indicates the tumor is malignant, and 1 indicates the tumor is benign.
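The description above can be checked directly on the dataset object. The sketch below loads the data as a Bunch instead of using return_X_y=True so that the feature and class names are available. The variable names are chosen only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

print(X.shape)                  # (n_samples, n_features)
print(data.feature_names[:5])   # names of the first few features
print(data.target_names)        # index 0 is 'malignant', index 1 is 'benign'

# Number of samples per class, keyed by class name.
class_counts = dict(zip(data.target_names, np.bincount(y)))
print(class_counts)
```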
SVM is affected by the feature scales, so we perform standardization to bring all features to the same scale:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Fitting SVC
Initializing the default SVC object is straightforward:
from sklearn.svm import SVC
model = SVC()
Below is a review of the main parameters of SVC:
- C (float, default=1.0): the regularization parameter, which controls how strongly misclassified training examples are penalized. In SVM, the objective is to find a hyperplane with the largest possible margin that also classifies as many samples correctly as possible, and these two goals create a bias-variance trade-off. When the value of C is large, the regularization strength is reduced, and a smaller-margin hyperplane will be chosen if that hyperplane is better at classifying the training samples. However, bigger values might lead to overfitting. On the other hand, a small value of C will result in a larger-margin hyperplane, even if that hyperplane misclassifies more samples, which can result in underfitting. Setting a smaller value of C is recommended if there are a lot of outliers present.
- kernel ('linear'/'poly'/'rbf'/'sigmoid'/'precomputed'/callable, default='rbf'): the kernel function to be used. Passing a callable lets you supply a custom kernel function.
- degree (int, default=3, only used if kernel='poly'): the degree of the polynomial kernel function.
- gamma ('scale'/'auto'/float, default='scale', only used if kernel is 'rbf', 'poly', or 'sigmoid'): the kernel coefficient. Higher values of gamma shrink the region that each training point influences, so the points closest to the decision boundary dominate. This results in a tighter decision boundary and can lead to poorer generalization and overfitting. Smaller values have the opposite effect: they simplify the decision boundary, which tends to improve generalization. The default value ('scale') is computed as 1 / (n_features * X.var()).
- shrinking (bool, default=True): whether to use the shrinking heuristic, which can speed up training but may hurt the precision.
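The effect of C and the formula behind gamma='scale' can be observed with a short experiment. The sketch below is only an illustration: the chosen C values and all variable names are assumptions, and the exact numbers it prints depend on the data split.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=27
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Sweep C (illustrative values) and record how the fit changes.
results = []
for C in [0.01, 1.0, 100.0]:
    clf = SVC(C=C).fit(X_train, y_train)
    results.append({
        "C": C,
        "n_support": int(clf.n_support_.sum()),  # total number of support vectors
        "train_acc": clf.score(X_train, y_train),
        "test_acc": clf.score(X_test, y_test),
    })
    print(results[-1])

# gamma='scale' should match an explicit 1 / (n_features * X.var()).
manual_gamma = 1.0 / (X_train.shape[1] * X_train.var())
pred_scale = SVC(gamma="scale").fit(X_train, y_train).predict(X_test)
pred_manual = SVC(gamma=manual_gamma).fit(X_train, y_train).predict(X_test)
same_predictions = np.array_equal(pred_scale, pred_manual)
print(same_predictions)
```

A widening gap between train_acc and test_acc as C grows would be a sign of the overfitting described above.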
Then, we fit the model on the training set:
model.fit(X_train, y_train)
Once SVC is fitted, we can move to the performance evaluation.
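As an alternative to scaling and fitting in separate steps, the two can be chained in a scikit-learn pipeline, so the scaler is always fitted on the training data only. This is a sketch of that option and not the approach used in the rest of this topic. The variable names are chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=27
)

# fit() standardizes the training data and then fits SVC on the result;
# score() and predict() reuse the training statistics for new data.
pipeline = make_pipeline(StandardScaler(), SVC())
pipeline.fit(X_train, y_train)

pipeline_score = pipeline.score(X_test, y_test)
print(pipeline_score)
```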
Inspecting the model
Let's quickly check the mean accuracy of the default SVC on the holdout set:
model.score(X_test, y_test)
# 0.9840425531914894
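For a classifier, score returns the fraction of correctly predicted samples. The sketch below recomputes that value by hand and with accuracy_score, and prints a per-class report. The data preparation is repeated so that the block runs on its own, and the class names passed to classification_report follow the label meaning described earlier.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=27
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = SVC().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Three equivalent ways to obtain the mean accuracy.
manual_accuracy = np.mean(y_pred == y_test)
metric_accuracy = accuracy_score(y_test, y_pred)
model_accuracy = model.score(X_test, y_test)
print(manual_accuracy, metric_accuracy, model_accuracy)

# Precision, recall, and F1 per class (0 = malignant, 1 = benign).
print(classification_report(y_test, y_pred, target_names=["malignant", "benign"]))
```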
We can also build a confusion matrix as follows (here, the positive class is label 1, benign):
from sklearn.metrics import confusion_matrix
import seaborn as sns
import numpy as np
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
group_names = ['True Negative', 'False Positive', 'False Negative', 'True Positive']
group_counts = cm.flatten()
labels = [f'{v1}\n{v2}' for v1, v2 in zip(group_names, group_counts)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(cm, annot=labels, fmt='')
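The four cells of the matrix can also be unpacked into individual counts and turned into metrics without any plotting. This is a minimal sketch with label 1 (benign) as the positive class. The data preparation is repeated so that the block is self-contained.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=27
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

y_pred = SVC().fit(X_train, y_train).predict(X_test)

# For binary labels {0, 1}, ravel() yields the cells in this order.
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
print(tn, fp, fn, tp)
print(precision, recall)
```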
Hyperparameter tuning with GridSearchCV
Judging by the test score, it looks like we got lucky with the default hyperparameter set. Let's see whether grid search selects a different hyperparameter configuration. Grid search splits the training dataset into multiple folds and evaluates each combination of hyperparameters via cross-validation.
from sklearn.model_selection import GridSearchCV
kernel = ['rbf', 'linear', 'rbf', 'poly']
gamma = ['scale', 'auto']
C = [0.1, 1.0, 10, 100, 1000]
param_grid = {"kernel": kernel, "gamma": gamma, "C": C}
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_search.fit(X_train, y_train)
grid_search.best_params_
The last line returns {'C': 0.1, 'gamma': 'scale', 'kernel': 'linear'}. The default model used {'C': 1.0, 'gamma': 'scale', 'kernel': 'rbf'}.
grid_search.score(X_test, y_test)
This returns approximately 0.968, which is lower than the score of the default configuration. We can inspect the results further with grid_search.cv_results_:
param_C param_gamma param_kernel mean_test_score rank_test_score
1 0.1 scale linear 0.966029 1
5 0.1 auto linear 0.966029 1
...
8 1.0 scale rbf 0.963397 3
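A table like the one above can be produced by loading cv_results_ into a pandas DataFrame. The sketch below assumes pandas is installed and uses a reduced grid so that it runs quickly. Its row indices and scores will therefore differ from the table shown.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=27
)
X_train = StandardScaler().fit_transform(X_train)

# Assumption: a smaller grid than in the text, chosen only for speed.
param_grid = {
    "kernel": ["rbf", "linear"],
    "gamma": ["scale", "auto"],
    "C": [0.1, 1.0, 10],
}
grid_search = GridSearchCV(estimator=SVC(), param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

# cv_results_ is a dict of arrays, one entry per parameter combination.
columns = ["param_C", "param_gamma", "param_kernel",
           "mean_test_score", "rank_test_score"]
summary = pd.DataFrame(grid_search.cv_results_)[columns]
summary = summary.sort_values("rank_test_score")
print(summary.head())
```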
We can see that the mean_test_score (the mean score over the cross-validation folds) is only slightly higher for the chosen configuration than for the default one. The difference is negligible, although the defaults usually do not land this close to the best configuration. The lower score on the test set is not a contradiction: grid search ranks configurations by their scores on the cross-validation splits of the training set (so SVC was refit multiple times on different subsets), and the configuration that wins on those splits is not guaranteed to be the best on the holdout set. If you look further into grid_search.cv_results_, you can see how the scores vary across the individual cross-validation splits. In any case, tuning hyperparameters directly against the test set is likely to lead to overfitting on the test data, so a technique such as grid search with cross-validation should be applied instead.
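The variation across splits can be seen directly by scoring the default SVC on each cross-validation fold of the training set. The sketch below uses cross_val_score with a pipeline so that the scaler is refit inside every fold. The five-fold setting and the variable names are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=27
)

# One accuracy value per fold; the test set is never touched here.
pipeline = make_pipeline(StandardScaler(), SVC())
fold_scores = cross_val_score(pipeline, X_train, y_train, cv=5)

print(fold_scores)
print(fold_scores.mean(), fold_scores.std())
```

If the standard deviation across folds is larger than the difference between two configurations, their ranking should not be treated as meaningful.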
Conclusion
In summary, in this topic, we learned:
- how to use the sklearn.svm.SVC class;
- how to perform SVC's hyperparameter tuning.