Naive Bayes classifier

Sometimes you don't get lucky with the weather, especially if you really want to visit a ski resort but could only get a vacation in mid-season. Fortunately, your friend was there just before you and collected some weather observations. These data should help you plan snowboarding days based on the forecast. In the table below, the column "Snowboarding" is the label, and "Precipitation", "Temperature", and "Wind" are the features.

Day | Precipitation | Temperature | Wind   | Snowboarding
1   | No            | 5-10 °C     | Calm   | No
2   | No            | 5-10 °C     | Breeze | No
3   | No            | 0-5 °C      | Calm   | Yes
4   | Rain          | 0-5 °C      | Calm   | No
5   | Snow          | -10-0 °C    | Calm   | Yes
6   | Snow          | -10-0 °C    | Breeze | No
7   | No            | -10-0 °C    | Breeze | Yes
8   | Rain          | 0-5 °C      | Calm   | No
9   | No            | -10-0 °C    | Calm   | Yes
10  | Rain          | 0-5 °C      | Calm   | Yes

Bayes classifier

To decide whether to book an excursion or to try your luck in the mountains, you can use the maximum a posteriori probability (MAP) rule:

\hat{y} = f(\mathbf{x}) = \mathrm{argmax}_y\, P(Y = y | \mathbf{X} = \mathbf{x})

MAP tells us to predict the most probable label y from a set Y given the data point x from a data distribution X. In our case, Y is the set of possible outcomes {Yes, No} describing whether we go snowboarding or not. An example of x would be the data entry (Precipitation = Rain, Temperature = 0-5 °C, Wind = Calm).
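
In code, the MAP rule is just an argmax over per-label posteriors. Here is a minimal Python sketch of that selection step; the posterior values are made up purely for illustration:

posteriors = {"Yes": 0.7, "No": 0.3}              # hypothetical values of P(Y = y | X = x)
prediction = max(posteriors, key=posteriors.get)  # pick the label with the highest posterior
print(prediction)                                 # -> Yes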

Unfortunately, we do not have direct access to the conditional posterior probability P(Y = y | X = x) from the data. This is where Bayes' rule helps us. In its general form, Bayes' rule shows how to calculate the conditional probability of an event B given an event A if we know the conditional probability of A given B and the overall probabilities of A and B:

P(B|A) = \frac{P(A|B)\, P(B)}{P(A)}

Here, we take Y = y as B and X = x as A to rewrite our MAP rule:

\hat{y} = f(\mathbf{x}) = \mathrm{argmax}_y\, P(Y = y | \mathbf{X} = \mathbf{x}) = \mathrm{argmax}_y\, \frac{P(\mathbf{X} = \mathbf{x} | Y = y)\, P(Y = y)}{P(\mathbf{X} = \mathbf{x})}

For a particular data entry x, the denominator P(X = x) is the same for every label under the argmax, so we can discard it, rewrite the rule once more, and obtain the Bayes classifier:

\hat{y} = f(\mathbf{x}) = \mathrm{argmax}_y\, P(\mathbf{X} = \mathbf{x} | Y = y)\, P(Y = y) = \mathrm{argmax}_y\, P(X_1 = x_1, \dots, X_n = x_n | Y = y)\, P(Y = y)

Let's simplify our data to a single column to see how it works in practice.

Day | Precipitation | Snowboarding
1   | No            | No
2   | No            | No
3   | No            | Yes
4   | Rain          | No
5   | Snow          | Yes
6   | Snow          | No
7   | No            | Yes
8   | Rain          | No
9   | No            | Yes
10  | Rain          | Yes

The forecast for tomorrow says there will be a clear sky, so to decide whether to plan snowboarding we need to compare two expressions: P(Precipitation = No | Snowboarding = Yes) P(Snowboarding = Yes) versus P(Precipitation = No | Snowboarding = No) P(Snowboarding = No).

First, let's calculate the priors from the data. They tell us how likely the different labels are "by default".

\begin{aligned}
P(\text{Snowboarding = No}) &= \frac{5}{10} = 0.5 \\
P(\text{Snowboarding = Yes}) &= \frac{5}{10} = 0.5
\end{aligned}

Next, the likelihoods:

\begin{aligned}
P(\text{Precipitation = No} | \text{Snowboarding = No}) &= \frac{2}{5} = 0.4 \\
P(\text{Precipitation = No} | \text{Snowboarding = Yes}) &= \frac{3}{5} = 0.6
\end{aligned}

If we combine them, we see that it's more likely to go snowboarding in clear weather than not:

\begin{aligned}
P(\text{Precipitation = No} | \text{Snowboarding = Yes})\, P(\text{Snowboarding = Yes}) &> P(\text{Precipitation = No} | \text{Snowboarding = No})\, P(\text{Snowboarding = No}) \\
0.6 \times 0.5 &> 0.4 \times 0.5 \\
0.3 &> 0.2
\end{aligned}
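
The same comparison can be scripted. Below is a minimal Python sketch that recomputes the prior and the single-feature likelihood directly from the simplified table (the tuples mirror its rows):

# Each tuple is (Precipitation, Snowboarding), copied from the table above.
data = [
    ("No", "No"), ("No", "No"), ("No", "Yes"), ("Rain", "No"), ("Snow", "Yes"),
    ("Snow", "No"), ("No", "Yes"), ("Rain", "No"), ("No", "Yes"), ("Rain", "Yes"),
]

def score(label):
    rows = [precip for precip, snow in data if snow == label]
    prior = len(rows) / len(data)              # P(Snowboarding = label)
    likelihood = rows.count("No") / len(rows)  # P(Precipitation = No | Snowboarding = label)
    return likelihood * prior

print(score("Yes"), score("No"))  # ~0.3 vs ~0.2 -> predict "Yes"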

Let's go back to our original data and try to make a prediction for the full forecast for tomorrow: (Precipitation = No, Temperature = 0-5 °C, Wind = Breeze).

Day | Precipitation | Temperature | Wind   | Snowboarding
1   | No            | 5-10 °C     | Calm   | No
2   | No            | 5-10 °C     | Breeze | No
3   | No            | 0-5 °C      | Calm   | Yes
4   | Rain          | 0-5 °C      | Calm   | No
5   | Snow          | -10-0 °C    | Calm   | Yes
6   | Snow          | -10-0 °C    | Breeze | No
7   | No            | -10-0 °C    | Breeze | Yes
8   | Rain          | 0-5 °C      | Calm   | No
9   | No            | -10-0 °C    | Calm   | Yes
10  | Rain          | 0-5 °C      | Calm   | Yes

It immediately turns out that we cannot do this with our Bayes classifier, because we would need to evaluate P(Precipitation = No, Temperature = 0-5 °C, Wind = Breeze | Snowboarding = Yes) versus P(Precipitation = No, Temperature = 0-5 °C, Wind = Breeze | Snowboarding = No). For both of these expressions we simply do not have enough data! Since there is not a single entry (Precipitation = No, Temperature = 0-5 °C, Wind = Breeze) in the table, it is impossible to estimate the joint probability of these events from counts.
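
You can confirm this directly by counting the matching rows. In the minimal Python sketch below (rows copied from the table above), not a single row has exactly tomorrow's feature combination, so a frequency estimate of the joint likelihood would be zero for both labels:

# Each row is (Precipitation, Temperature, Wind, Snowboarding).
rows = [
    ("No", "5-10 °C", "Calm", "No"),     ("No", "5-10 °C", "Breeze", "No"),
    ("No", "0-5 °C", "Calm", "Yes"),     ("Rain", "0-5 °C", "Calm", "No"),
    ("Snow", "-10-0 °C", "Calm", "Yes"), ("Snow", "-10-0 °C", "Breeze", "No"),
    ("No", "-10-0 °C", "Breeze", "Yes"), ("Rain", "0-5 °C", "Calm", "No"),
    ("No", "-10-0 °C", "Calm", "Yes"),   ("Rain", "0-5 °C", "Calm", "Yes"),
]

tomorrow = ("No", "0-5 °C", "Breeze")
matches = [row for row in rows if row[:3] == tomorrow]
print(len(matches))  # 0 -- no data to estimate the joint likelihood from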

Naive Bayes classifier

We can solve our problem of insufficient data by naively assuming that all the features in the dataset are independent given the label (X_1 \perp X_2 \perp \dots \perp X_n | Y). This assumption is not guaranteed to be true (for example, snow usually does not fall when it is 5-10 °C outside). However, it is an approximation we are willing to make in order to rewrite the MAP classification rule once more and simplify the calculations:

\begin{aligned}
\hat{y} = f(\mathbf{x}) &= \mathrm{argmax}_y\, P(\mathbf{X} = \mathbf{x} | Y = y)\, P(Y = y) = \mathrm{argmax}_y\, P(X_1 = x_1, \dots, X_n = x_n | Y = y)\, P(Y = y) \\
&= \mathrm{argmax}_y\, P(X_1 = x_1 | Y = y)\, P(X_2 = x_2 | Y = y) \dots P(X_n = x_n | Y = y)\, P(Y = y) \\
&= \mathrm{argmax}_y\, P(Y = y) \prod_{i=1}^{n} P(X_i = x_i | Y = y)
\end{aligned}

Now, we can obtain P(X_i = x_i | Y = y) from the data:

\begin{aligned}
P(\text{Precipitation = No} | \text{Snowboarding = No}) &= \frac{2}{5} = 0.4 \\
P(\text{Precipitation = No} | \text{Snowboarding = Yes}) &= \frac{3}{5} = 0.6 \\
P(\text{Temperature = 0-5 °C} | \text{Snowboarding = No}) &= \frac{2}{5} = 0.4 \\
P(\text{Temperature = 0-5 °C} | \text{Snowboarding = Yes}) &= \frac{2}{5} = 0.4 \\
P(\text{Wind = Breeze} | \text{Snowboarding = No}) &= \frac{2}{5} = 0.4 \\
P(\text{Wind = Breeze} | \text{Snowboarding = Yes}) &= \frac{1}{5} = 0.2
\end{aligned}

Finally, we substitute all the likelihoods into the product expression (the priors were already calculated above):

\begin{aligned}
P(\text{Snowboarding = Yes}) \prod_{i=1}^{n} P(X_i = x_i | \text{Snowboarding = Yes}) &< P(\text{Snowboarding = No}) \prod_{i=1}^{n} P(X_i = x_i | \text{Snowboarding = No}) \\
0.5 \times 0.6 \times 0.4 \times 0.2 &< 0.5 \times 0.4 \times 0.4 \times 0.4 \\
0.024 &< 0.032
\end{aligned}

Given all the data, we are more likely to skip snowboarding in a cold breeze, even under a clear sky.
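
Here is a minimal Python sketch that puts the priors and the per-feature likelihoods together for both labels; it uses the rows of the table above and reproduces the numbers from the comparison:

# Each row is (Precipitation, Temperature, Wind, Snowboarding).
rows = [
    ("No", "5-10 °C", "Calm", "No"),     ("No", "5-10 °C", "Breeze", "No"),
    ("No", "0-5 °C", "Calm", "Yes"),     ("Rain", "0-5 °C", "Calm", "No"),
    ("Snow", "-10-0 °C", "Calm", "Yes"), ("Snow", "-10-0 °C", "Breeze", "No"),
    ("No", "-10-0 °C", "Breeze", "Yes"), ("Rain", "0-5 °C", "Calm", "No"),
    ("No", "-10-0 °C", "Calm", "Yes"),   ("Rain", "0-5 °C", "Calm", "Yes"),
]

def naive_bayes_score(x, label):
    labelled = [row for row in rows if row[3] == label]
    score = len(labelled) / len(rows)  # prior P(Snowboarding = label)
    for i, value in enumerate(x):      # one likelihood factor per feature
        score *= sum(1 for row in labelled if row[i] == value) / len(labelled)
    return score

x = ("No", "0-5 °C", "Breeze")
print(naive_bayes_score(x, "Yes"))  # ~0.024
print(naive_bayes_score(x, "No"))   # ~0.032 -> predict "No"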

Laplace smoothing

Although we have already reduced our data requirements, there can still be a problem: if any of the likelihoods P(X_i = x_i | Y = y) is 0, the whole product turns to 0. We can avoid this by introducing Laplace smoothing into the probability calculation. In its simplest form, which adds one pseudo-count for an event occurring and one for it not occurring, it looks like this:

\frac{a}{b} \rightarrow \frac{a+1}{b+2}

Intuitively, such smoothing means adding a pseudo-count for every possible event. Let's update the likelihoods:

\begin{aligned}
P(\text{Precipitation = No} | \text{Snowboarding = No}) &= \frac{3}{7} \\
P(\text{Precipitation = No} | \text{Snowboarding = Yes}) &= \frac{4}{7} \\
P(\text{Temperature = 0-5 °C} | \text{Snowboarding = No}) &= \frac{3}{7} \\
P(\text{Temperature = 0-5 °C} | \text{Snowboarding = Yes}) &= \frac{3}{7} \\
P(\text{Wind = Breeze} | \text{Snowboarding = No}) &= \frac{3}{7} \\
P(\text{Wind = Breeze} | \text{Snowboarding = Yes}) &= \frac{2}{7}
\end{aligned}
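
The same add-one idea is easy to express in code. The helper below is a minimal sketch of the a/b -> (a+1)/(b+2) rule; the last line shows how it keeps a zero count from wiping out the whole product:

def smoothed_likelihood(count, total):
    # Laplace smoothing in its simplest form: a/b -> (a+1)/(b+2)
    return (count + 1) / (total + 2)

print(smoothed_likelihood(2, 5))  # P(Precipitation = No | Snowboarding = No)  = 3/7
print(smoothed_likelihood(3, 5))  # P(Precipitation = No | Snowboarding = Yes) = 4/7
print(smoothed_likelihood(1, 5))  # P(Wind = Breeze | Snowboarding = Yes)      = 2/7
print(smoothed_likelihood(0, 5))  # an unseen event now gets 1/7 instead of 0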

Conclusion

  • A Bayes classifier is generally impossible to build because real data are finite and rarely contain every combination of feature values.

  • Naive Bayes assumes conditional independence of the features given the label.

  • Laplace smoothing adds pseudo-counts so that events unseen in the data do not zero out the whole product of likelihoods.
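
In practice you rarely implement all of this by hand. Below is a hedged sketch of how the same table could be classified with scikit-learn's CategoricalNB (assuming scikit-learn is installed); its alpha parameter plays the role of Laplace smoothing, although scikit-learn normalizes by the number of categories of each feature, so the exact probabilities differ slightly from the simplified form above:

from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Features (Precipitation, Temperature, Wind) and labels, copied from the table above.
X = [
    ["No", "5-10 °C", "Calm"],     ["No", "5-10 °C", "Breeze"],
    ["No", "0-5 °C", "Calm"],      ["Rain", "0-5 °C", "Calm"],
    ["Snow", "-10-0 °C", "Calm"],  ["Snow", "-10-0 °C", "Breeze"],
    ["No", "-10-0 °C", "Breeze"],  ["Rain", "0-5 °C", "Calm"],
    ["No", "-10-0 °C", "Calm"],    ["Rain", "0-5 °C", "Calm"],
]
y = ["No", "No", "Yes", "No", "Yes", "No", "Yes", "No", "Yes", "Yes"]

encoder = OrdinalEncoder()        # map category strings to integer codes
model = CategoricalNB(alpha=1.0)  # alpha=1.0 corresponds to add-one smoothing
model.fit(encoder.fit_transform(X), y)

tomorrow = encoder.transform([["No", "0-5 °C", "Breeze"]])
print(model.predict(tomorrow))    # expected: ['No'], matching the result above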
