KNeighborsClassifier in scikit-learn

KNN models every data object as a point in a feature space, where each coordinate corresponds to the value of a specific feature. The underlying idea is that nearby points tend to be similar, so we can classify a point by taking the most common class label among its K closest neighbors. Here, K is a hyperparameter.
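To make this concrete, here is a minimal NumPy sketch of the voting idea (the points, labels, and query are made up for illustration; scikit-learn's implementation is far more optimized):

import numpy as np

# A few made-up 2D training points and their class labels
points = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.1], [2.9, 3.3], [0.9, 1.1]])
labels = np.array([0, 0, 1, 1, 0])

query = np.array([1.1, 0.9])
K = 3

# Euclidean distance from the query to every training point
distances = np.linalg.norm(points - query, axis=1)

# Indices of the K closest points, then a majority vote over their labels
nearest = np.argsort(distances)[:K]
print(np.bincount(labels[nearest]).argmax())

# >>> 0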

We are going to learn how to use the KNN Classifier from the scikit-learn library.

Loading data

We'll use the Iris dataset, which contains 150 samples of iris flowers belonging to 3 different subspecies: Iris Setosa, Iris Versicolour, and Iris Virginica. Every sample has 4 features: sepal length, sepal width, petal length, and petal width (all measured in centimeters). Our task is to classify each flower into one of the 3 subspecies.
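For reference, the feature names and the subspecies names are available on the dataset object returned by load_iris (a quick aside; the rest of the walkthrough only needs the raw arrays):

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.feature_names)
print(iris.target_names)

# >>> ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
# >>> ['setosa' 'versicolor' 'virginica']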

First, let's load the data and split it into train and test sets:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=98)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

# >>> ((112, 4), (112,), (38, 4), (38,))

We see that there are 112 points for training and 38 for testing. We also fix random_state so that the seed of the split is fixed and the experiments are reproducible.
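As a quick sanity check (not part of the original snippet), we can count how many samples of each class landed in each split; passing stratify=y to train_test_split would keep the 50/50/50 class balance proportional in both sets:

import numpy as np

# Samples per class in the training and test sets
print(np.bincount(y_train), np.bincount(y_test))

# A stratified split keeps the class proportions (nearly) equal in both sets
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=98, stratify=y)
print(np.bincount(y_tr), np.bincount(y_te))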

Fitting a model

To train the model, we'll use the KNeighborsClassifier class from sklearn.

Let's create a K-Nearest Neighbors model.

from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)

# >>> KNeighborsClassifier()

After the model is fitted, here are some of the attributes that could be accessed:

  • classes_: The class labels known to the classifier. In our case, those are 0, 1, and 2.

  • effective_metric_: The distance metric actually used to find the closest points, determined after .fit(). It could be euclidean or manhattan, among others.

You can read about all the other attributes of the model in the scikit-learn documentation.
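For instance, with the model fitted above (n_samples_fit_ is another documented attribute, shown here for illustration):

print(clf.classes_)
print(clf.effective_metric_)
print(clf.n_samples_fit_)

# >>> [0 1 2]
# >>> euclidean
# >>> 112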

Making predictions

Let's pick a point from the dataset, and compare the predicted class with the original class:

sample_point = X_train[0].reshape(1, 4)
sample_y = y_train[0]
print(clf.predict(sample_point), sample_y)

# >>> [2] 2

The predicted label and the actual label are the same. That alone doesn't tell us how good our model is: this is the prediction for a single point, and the model could still misclassify all the other points.
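We can also inspect the predicted class probabilities with predict_proba; for KNN, these are simply the fractions of the K neighbors that belong to each class:

# One row per query point; the columns follow the order of clf.classes_
print(clf.predict_proba(sample_point))

Since the predicted label is 2, at least three of the five default neighbors belong to class 2, so the last probability in the row is the largest.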

To evaluate it, let's use the .score() method, which computes the accuracy by default:

train_accuracy = clf.score(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)

print(f"Accuracy on the training set: {round(train_accuracy, 4)}")
print(f"Accuracy on the testing set: {round(test_accuracy, 4)}")

# >>> Accuracy on the training set: 0.9821
# >>> Accuracy on the testing set: 0.9211

We see 98% accuracy on the training set and 92% on the test set, which is quite good. We could use other classification metrics, but for this dataset there is no real need.
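Still, if a more detailed breakdown were needed, a confusion matrix and a per-class report are easy to obtain (an extra step, not part of the original walkthrough):

from sklearn.metrics import classification_report, confusion_matrix

y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))       # rows are true classes, columns are predicted classes
print(classification_report(y_test, y_pred))  # per-class precision, recall, and F1-score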

Visualizing the decision regions

We can visualize the decision regions produced by the model. You might need to install the mlxtend package before running the following code.

from mlxtend.plotting import plot_decision_regions

import matplotlib.pyplot as plt

# Selecting two of the four features (sepal length and petal length) for a 2D plot
X_train_plot = X_train[:, [0, 2]]


# Training a classifier on the two selected features
model = KNeighborsClassifier(n_neighbors=8)
model.fit(X_train_plot, y_train)

# Plotting decision regions
plot_decision_regions(X_train_plot, y_train, clf=model, legend=2)

# Adding axes annotations
plt.xlabel('sepal length [cm]')
plt.ylabel('petal length [cm]')
plt.title('KNN on Iris Dataset')

plt.show()

The 3 decision regions produced by KNN on the two selected features

Notice that we had to select and fix two specific features to be able to visualize the points on a 2D plane.

Playing with hyperparameters

We created a KNeighborsClassifier object with the default hyperparameters. However, the parameters can be fine-tuned, which could lead to performance improvement.

Here are the main hyperparameters to tune:

  • n_neighbors (int, default: 5): The number of neighbors to take into account.

If we pick too few neighbors, the model will overfit: the decision boundary follows the noise in the training data, so it looks great on the training set but doesn't generalize. Picking too many neighbors leads to underfitting, where the boundary becomes overly smooth and the model performs poorly even on the training set.

Let's find the best number of neighbors by fitting models with different values of n_neighbors:

for k in [2, 3, 5, 8, 13]:
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)

    train_score = round(clf.score(X_train, y_train), 4)
    test_score = round(clf.score(X_test, y_test), 4)

    print(f"k = {k} -- Training score: {train_score} -- Testing score: {test_score}")


# >>> k = 2 -- Training score: 0.9821 -- Testing score: 0.9211
# >>> k = 3 -- Training score: 0.9732 -- Testing score: 0.9211
# >>> k = 5 -- Training score: 0.9821 -- Testing score: 0.9211
# >>> k = 8 -- Training score: 0.9821 -- Testing score: 0.9474
# >>> k = 13 -- Training score: 0.9821 -- Testing score: 0.9211

We can see that considering the 8 closest neighbors gives the best score on the test set.
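Instead of a manual loop over k, a cross-validated grid search is a more robust way to tune this hyperparameter, since it never uses the test set during the search. Here is a minimal sketch with GridSearchCV:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {"n_neighbors": [2, 3, 5, 8, 13]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)                     # the value of n_neighbors picked by cross-validation
print(round(search.best_score_, 4))            # its mean cross-validated accuracy
print(round(search.score(X_test, y_test), 4))  # accuracy of the refitted model on the test set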

The next hyperparameter is the distance metric; depending on the data, one metric might perform better than another.

  • metric (str or callable, default: 'minkowski'): The metric used to compute the distance between two points. A callable here means any callable Python object, e.g., a user-defined function or a class method, which can be convenient for further customization.

A minor note on the distance metrics

Euclidean distance is the most widely used metric in KNN. However, if the dataset contains outliers, the Manhattan distance might work better: Euclidean distance squares the differences, so it amplifies the influence of outlying values, while Manhattan distance treats them linearly. The Minkowski distance generalizes both, reducing to Manhattan for p=1 and to Euclidean for p=2.
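As a small illustration (with two made-up points), the three distances can be computed directly with SciPy; a user-defined callable is also accepted as a metric, shown here purely as an example:

import numpy as np
from scipy.spatial import distance
from sklearn.neighbors import KNeighborsClassifier

a, b = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.5])

print(distance.euclidean(a, b))       # sqrt(3^2 + 2^2 + 0.5^2) ≈ 3.64
print(distance.cityblock(a, b))       # |3| + |2| + |0.5| = 5.5 (Manhattan)
print(distance.minkowski(a, b, p=1))  # Minkowski with p=1 equals Manhattan
print(distance.minkowski(a, b, p=2))  # Minkowski with p=2 equals Euclidean

# A callable metric works too (slower, since it cannot be vectorized internally)
clf_chebyshev = KNeighborsClassifier(metric=lambda u, v: np.abs(u - v).max())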

Let's find the best metric by fitting several instances with different metrics:

for metric in ['euclidean', 'manhattan', 'cosine']:
    clf = KNeighborsClassifier(metric=metric)
    clf.fit(X_train, y_train)

    train_score = round(clf.score(X_train, y_train), 4)
    test_score = round(clf.score(X_test, y_test), 4)

    print(f"metric = {metric} -- Training score: {train_score} -- Testing score: {test_score}")


# >>> metric = euclidean -- Training score: 0.9821 -- Testing score: 0.9211
# >>> metric = manhattan -- Training score: 0.9821 -- Testing score: 0.9211
# >>> metric = cosine -- Training score: 0.9732 -- Testing score: 0.9474

We see that the cosine metric performs slightly better on the test set.

There is another hyperparameter to tune:

  • weights ({'uniform', 'distance'}, callable, or None; default: 'uniform'): How to weight each neighbor's vote when making a prediction.

'uniform' gives every neighbor the same weight; 'distance' weights each neighbor by the inverse of its distance, so closer neighbors of a query point have a greater influence than neighbors that are further away.

for weight in ['uniform', 'distance']:
    clf = KNeighborsClassifier(weights=weight)
    clf.fit(X_train, y_train)

    train_score = round(clf.score(X_train, y_train), 4)
    test_score = round(clf.score(X_test, y_test), 4)

    print(f"weights = {weight} -- Training score: {train_score} -- Testing score: {test_score}")



# >>> weights = uniform -- Training score: 0.9821 -- Testing score: 0.9211
# >>> weights = distance -- Training score: 1.0 -- Testing score: 0.9211

In this case, the two options give the same accuracy on the test set. The perfect training score with distance weighting is expected: when scoring on the training set, each point is its own nearest neighbor at distance zero, so it receives all the weight.
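weights also accepts a callable that maps an array of neighbor distances to an array of weights of the same shape, which allows custom schemes. Here is a sketch with a Gaussian-style weighting (purely an illustration, not something tuned for this dataset):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def gaussian_weights(distances):
    # distances has shape (n_queries, n_neighbors); return weights of the same shape
    return np.exp(-distances ** 2)

clf_gauss = KNeighborsClassifier(weights=gaussian_weights)
clf_gauss.fit(X_train, y_train)
print(round(clf_gauss.score(X_test, y_test), 4))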

Conclusions

In today's topic, we got acquainted with KNeighborsClassifier in scikit-learn. Here is a quick recap of what we have covered:

  • How to train the KNN Classifier.

  • How the distance metric determines which neighbors count as the closest.

  • How to choose the number of neighbors to take into account.
