Naive Bayes classifiers are based on Bayes' theorem combined with the “naive” assumption of conditional independence between every pair of features given the class label. Although in real-world scenarios features are rarely truly independent, Naive Bayes is still an effective method for classification. In this topic, we will learn how to use the Naive Bayes classification algorithm in the scikit-learn library.
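Concretely, for a class label y and features x1, …, xn, the independence assumption means the posterior probability factorizes as
P(y | x1, …, xn) ∝ P(y) · P(x1 | y) · … · P(xn | y)
and the predicted class is the one that maximizes this product. This is the general form of the rule, not something specific to scikit-learn.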
Loading data
To create the model, we will first prepare the dataset. We will use the Wine dataset. It has three classes of wine. Each sample has 13 different features: alcohol, malic_acid, ash, color_intensity, and so on. Our task is to classify the samples into three classes based on these features.
First, let's load the data and split it into training and testing sets.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)
X_train.shape, y_train.shape, X_test.shape, y_test.shape
# ((133, 13), (133,), (45, 13), (45,))
Our training dataset has 133 samples, while we will use 45 samples for testing.
Fitting a model
To train the model, we will first experiment with the Gaussian Naive Bayes algorithm for classification. We will create an instance of the GaussianNB class from the scikit-learn library. Then we will use the .fit() method to train the model on our training data.
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train)
After the model has been fitted, let's take a look at some attributes that can be accessed:
- .class_prior_ — the probability of each class the model has seen.
- .n_features_in_ — the number of features seen during training.
- .var_ — the variance of each feature per class.
- .theta_ — the mean of each feature per class.
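For example, you can inspect them on the fitted model (the exact values depend on the train/test split; .var_ assumes scikit-learn 1.0 or newer, where it replaced the older .sigma_ attribute):
print(model.class_prior_)    # prior probability of each of the three classes
print(model.n_features_in_)  # 13: the number of features seen during .fit()
print(model.theta_.shape)    # (3, 13): per-class mean of every feature
print(model.var_.shape)      # (3, 13): per-class variance of every feature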
Making predictions
Let's first take one sample from the dataset and compare the predicted class with the original class. To predict the label, we first expand the dimension of our sample to make it a 2D array. Then we use the .predict() method on this sample.
import numpy as np
sample_point = np.expand_dims(X_train[0], 0)
sample_y = y_train[0]
print(model.predict(sample_point), sample_y)
# [1] 1
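Since Naive Bayes is a probabilistic classifier, you can also inspect the estimated class probabilities for this sample with the .predict_proba() method (the exact numbers depend on the split):
print(model.predict_proba(sample_point))
# one row per sample, with a probability for each of the three classes; the values sum to 1
The class returned by .predict() above is simply the one with the highest probability in this row.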
As we can see, our model has correctly classified the first sample. However, it has already seen this sample during training, so we need a better way to evaluate performance. Let's calculate the accuracy score using the .score() method.
test_accuracy = model.score(X_test, y_test)
test_accuracy
# 0.9777777777777777
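For a classifier, .score() returns the mean accuracy on the given data, so it matches accuracy_score from sklearn.metrics computed on the same predictions:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, model.predict(X_test))
# 0.9777777777777777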
Our model has a very high accuracy score on the test dataset. However, as we have three classes, we should understand on which classes the model has lower performance. We can get the overall classification report by first getting the predictions on the test dataset and then creating the classification report.
from sklearn.metrics import classification_report
predicted_test = model.predict(X_test)
print(classification_report(y_test, predicted_test))
# precision recall f1-score support
# 0 1.00 0.95 0.97 19
# 1 0.93 1.00 0.96 13
# 2 1.00 1.00 1.00 13
# accuracy 0.98 45
# macro avg 0.98 0.98 0.98 45
# weighted avg 0.98 0.98 0.98 45
Playing with hyperparameters
All the hyperparameters can be found in each model in the scikit-learn library. We have created the Gaussian Naive Bayes (GaussianNB) model with the default parameters. While Gaussian Naive Bayes (GNB) is relatively simple and doesn't have many hyperparameters to tune compared to some other algorithms, there are still a few aspects you can consider.
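You can list an estimator's hyperparameters with the .get_params() method; for GaussianNB there are only two (the defaults shown here are from recent scikit-learn versions):
print(GaussianNB().get_params())
# {'priors': None, 'var_smoothing': 1e-09}
Let's look at both of them.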
- Prior Probabilities (priors)
In some cases, you might have prior knowledge about the distribution of classes in your dataset. The priors parameter allows you to specify the prior probabilities for each class.
In our dataset, the distribution of labels is imbalanced. We can count the samples per class with the numpy.unique() function.
import numpy as np
unique, counts = np.unique(y, return_counts=True)
classes_distribution = dict(zip(unique, counts))
print(classes_distribution)
# {0: 59, 1: 71, 2: 48}
As we can see, class 2 is the least frequent in the dataset. We can pass prior probabilities to the model through the priors parameter, for example to give more weight to this class. The value passed to priors must contain one probability per class (so its length equals the number of classes), and the probabilities must sum to 1.
model = GaussianNB(priors=[0.1, 0.3, 0.6])
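Alternatively, if you want the priors to simply reflect the observed class frequencies, you can derive them from the counts computed above (a small sketch):
empirical_priors = counts / counts.sum()  # roughly [0.33, 0.40, 0.27] for this dataset
model = GaussianNB(priors=empirical_priors)
model.fit(X_train, y_train)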
- Handling Near-Zero Variances (var_smoothing)
Gaussian Naive Bayes assumes that the features within each class follow a Gaussian distribution with a specific mean and variance. However, when a feature has very low variance (or is constant) within a class, the estimated variance can become extremely small or even zero, which leads to numerical instability during calculations. The var_smoothing parameter adds a small value (a portion of the largest feature variance) to all variances to prevent this.
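If you are unsure which value to use, one option is a small grid search over a few candidates (a sketch using GridSearchCV with its default 5-fold cross-validation; the candidate values below are arbitrary):
from sklearn.model_selection import GridSearchCV
param_grid = {'var_smoothing': [1e-9, 1e-6, 1e-3, 1e-1]}  # arbitrary candidate values
search = GridSearchCV(GaussianNB(), param_grid, scoring='accuracy')
search.fit(X_train, y_train)
print(search.best_params_)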
Alternatively, you can simply set the value when creating the model:
model = GaussianNB(var_smoothing=1e-3)
Types of Naive Bayes Classifiers
One of the most common types of Naive Bayes classifiers is MultinomialNB. It is used in the same way as GaussianNB. This type of model is designed for count-based data. You can use it with TF-IDF vectors or word counts of text for text classification tasks such as sentiment analysis.
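As a minimal sketch of that workflow (the tiny corpus and its sentiment labels below are made up purely for illustration):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# hypothetical mini-corpus with made-up labels: 1 = positive, 0 = negative
texts = ["great wine, loved it", "terrible taste", "loved the aroma", "not good at all"]
labels = [1, 0, 1, 0]

text_clf = make_pipeline(CountVectorizer(), MultinomialNB())
text_clf.fit(texts, labels)
print(text_clf.predict(["loved it"]))
# [1] (classified as positive on this toy data)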
In the scikit-learn library, there are a variety of different models for Naive Bayes classifiers. Let's understand the main differences between them.
| Naive Bayes Classifier Name | Description | Assumption | Suitable For |
|---|---|---|---|
| Gaussian Naive Bayes | Assumes a Gaussian (normal) distribution of features within each class. | Features follow a continuous Gaussian distribution. | Numeric features that are approximately normally distributed. |
| Multinomial Naive Bayes | Designed for discrete count-based data (e.g., text data represented by word frequency). | Features are counts of occurrences in different classes. | Text classification, document categorization, sentiment analysis. |
| Complement Naive Bayes | An adaptation of Multinomial Naive Bayes that uses statistics from the complement of each class. | Features are counts of occurrences; the class distribution may be imbalanced. | Text classification with imbalanced datasets. |
| Categorical Naive Bayes | Handles categorical features that take on discrete values without assuming any specific distribution. | Features are categorical variables. | Classification of categorical data, recommendation systems with categorical features. |
| Bernoulli Naive Bayes | Assumes binary or boolean features, indicating the presence or absence of a feature. | Features are binary (0/1) indicators. | Text classification, binary data classification. |
You can see that the main differences are data-dependent, so you should first understand what type of data you have and then experiment with different models.
Conclusion
Naive Bayes is a powerful and surprisingly effective classification algorithm, especially for tasks involving text or discrete data. By implementing Naive Bayes using scikit-learn, you can quickly build and train models for various classification tasks. Remember that while it's "naive" in its assumptions, it often delivers impressive results in practice.