Naive Bayes classifier

Sometimes you don't get lucky with the weather, especially if you really want to visit a ski resort but could only get a vacation in mid-season. Fortunately, your friend was there just before you and collected some weather observations. These data should help you plan snowboarding days based on the forecast. In the table below, the column "Snowboarding" is the label, and "Precipitation", "Temperature", and "Wind" are the features.

Day | Precipitation | Temperature | Wind   | Snowboarding
1   | No            | 5-10 °C     | Calm   | No
2   | No            | 5-10 °C     | Breeze | No
3   | No            | 0-5 °C      | Calm   | Yes
4   | Rain          | 0-5 °C      | Calm   | No
5   | Snow          | -10-0 °C    | Calm   | Yes
6   | Snow          | -10-0 °C    | Breeze | No
7   | No            | -10-0 °C    | Breeze | Yes
8   | Rain          | 0-5 °C      | Calm   | No
9   | No            | -10-0 °C    | Calm   | Yes
10  | Rain          | 0-5 °C      | Calm   | Yes

Bayes classifier

To decide whether to book an excursion or to try your luck in the mountains, you can use the maximum a posteriori probability (MAP) rule:

\hat{y} = f(\mathbf{x}) = \mathrm{argmax}_y\, P(Y = y | \mathbf{X} = \mathbf{x})

MAP tells us to predict the most probable label y from a set Y given the data point x from a data distribution X. In our case, Y is the set of possible outcomes {Yes, No} describing whether we go snowboarding or not. An example of x would be the data entry (Precipitation = Rain, Temperature = 0-5 °C, Wind = Calm).
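
In code, the MAP rule is just an argmax over per-label posteriors. Here is a minimal Python sketch of that selection step; the posterior values are made up purely for illustration:

posteriors = {"Yes": 0.7, "No": 0.3}              # hypothetical values of P(Y = y | X = x)
prediction = max(posteriors, key=posteriors.get)  # pick the label with the highest posterior
print(prediction)                                 # -> Yes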

Unfortunately, we do not have direct access to the conditional posterior probability P(Y = y | X = x) from the data. This is where Bayes' rule helps us. In its general form, Bayes' rule shows how to calculate the conditional probability of an event B given an event A if we know the conditional probability of A given B and the overall probabilities of A and B:

P(B|A) = \frac{P(A|B)\, P(B)}{P(A)}

Here, we take Y = y as B and X = x as A to rewrite our MAP rule:

\hat{y} = f(\mathbf{x}) = \mathrm{argmax}_y\, P(Y = y | \mathbf{X} = \mathbf{x}) = \mathrm{argmax}_y\, \frac{P(\mathbf{X} = \mathbf{x} | Y = y)\, P(Y = y)}{P(\mathbf{X} = \mathbf{x})}

For a particular data entry x, the denominator P(X = x) is the same for every label under the argmax, so we can discard it, rewrite the rule once more, and obtain the Bayes classifier:

\hat{y} = f(\mathbf{x}) = \mathrm{argmax}_y\, P(\mathbf{X} = \mathbf{x} | Y = y)\, P(Y = y) = \mathrm{argmax}_y\, P(X_1 = x_1, \dots, X_n = x_n | Y = y)\, P(Y = y)

Let's simplify our data to a single column to see how it works in practice.

Day | Precipitation | Snowboarding
1   | No            | No
2   | No            | No
3   | No            | Yes
4   | Rain          | No
5   | Snow          | Yes
6   | Snow          | No
7   | No            | Yes
8   | Rain          | No
9   | No            | Yes
10  | Rain          | Yes

The forecast for tomorrow says there will be a clear sky, so to decide whether to plan snowboarding we need to compare two expressions: P(Precipitation = No | Snowboarding = Yes) P(Snowboarding = Yes) versus P(Precipitation = No | Snowboarding = No) P(Snowboarding = No).

First, let's calculate the priors from the data. They tell us how likely the different labels are "by default".

\begin{aligned}
P(\text{Snowboarding = No}) &= \frac{5}{10} = 0.5 \\
P(\text{Snowboarding = Yes}) &= \frac{5}{10} = 0.5
\end{aligned}

Next, the likelihoods:

\begin{aligned}
P(\text{Precipitation = No} | \text{Snowboarding = No}) &= \frac{2}{5} = 0.4 \\
P(\text{Precipitation = No} | \text{Snowboarding = Yes}) &= \frac{3}{5} = 0.6
\end{aligned}

If we combine them, we see that it's more likely to go snowboarding in clear weather than not:

\begin{aligned}
P(\text{Precipitation = No} | \text{Snowboarding = Yes})\, P(\text{Snowboarding = Yes}) &> P(\text{Precipitation = No} | \text{Snowboarding = No})\, P(\text{Snowboarding = No}) \\
0.6 \times 0.5 &> 0.4 \times 0.5 \\
0.3 &> 0.2
\end{aligned}
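
The same comparison can be scripted. Below is a minimal Python sketch that recomputes the prior and the single-feature likelihood directly from the simplified table (the tuples mirror its rows):

# Each tuple is (Precipitation, Snowboarding), copied from the table above.
data = [
    ("No", "No"), ("No", "No"), ("No", "Yes"), ("Rain", "No"), ("Snow", "Yes"),
    ("Snow", "No"), ("No", "Yes"), ("Rain", "No"), ("No", "Yes"), ("Rain", "Yes"),
]

def score(label):
    rows = [precip for precip, snow in data if snow == label]
    prior = len(rows) / len(data)              # P(Snowboarding = label)
    likelihood = rows.count("No") / len(rows)  # P(Precipitation = No | Snowboarding = label)
    return likelihood * prior

print(score("Yes"), score("No"))  # ~0.3 vs ~0.2 -> predict "Yes"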

Let's go back to our original data and try to make a prediction for the full forecast for tomorrow: (Precipitation = No, Temperature = 0-5 °C, Wind = Breeze).

Day | Precipitation | Temperature | Wind   | Snowboarding
1   | No            | 5-10 °C     | Calm   | No
2   | No            | 5-10 °C     | Breeze | No
3   | No            | 0-5 °C      | Calm   | Yes
4   | Rain          | 0-5 °C      | Calm   | No
5   | Snow          | -10-0 °C    | Calm   | Yes
6   | Snow          | -10-0 °C    | Breeze | No
7   | No            | -10-0 °C    | Breeze | Yes
8   | Rain          | 0-5 °C      | Calm   | No
9   | No            | -10-0 °C    | Calm   | Yes
10  | Rain          | 0-5 °C      | Calm   | Yes

It immediately turns out that we cannot do this with our Bayes classifier, because we would need to evaluate P(Precipitation = No, Temperature = 0-5 °C, Wind = Breeze | Snowboarding = Yes) versus P(Precipitation = No, Temperature = 0-5 °C, Wind = Breeze | Snowboarding = No). For both of these expressions we simply do not have enough data! Since there is not a single entry (Precipitation = No, Temperature = 0-5 °C, Wind = Breeze) in the table, it is impossible to estimate the joint probability of these events from counts.
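
You can confirm this directly by counting the matching rows. In the minimal Python sketch below (rows copied from the table above), not a single row has exactly tomorrow's feature combination, so a frequency estimate of the joint likelihood would be zero for both labels:

# Each row is (Precipitation, Temperature, Wind, Snowboarding).
rows = [
    ("No", "5-10 °C", "Calm", "No"),     ("No", "5-10 °C", "Breeze", "No"),
    ("No", "0-5 °C", "Calm", "Yes"),     ("Rain", "0-5 °C", "Calm", "No"),
    ("Snow", "-10-0 °C", "Calm", "Yes"), ("Snow", "-10-0 °C", "Breeze", "No"),
    ("No", "-10-0 °C", "Breeze", "Yes"), ("Rain", "0-5 °C", "Calm", "No"),
    ("No", "-10-0 °C", "Calm", "Yes"),   ("Rain", "0-5 °C", "Calm", "Yes"),
]

tomorrow = ("No", "0-5 °C", "Breeze")
matches = [row for row in rows if row[:3] == tomorrow]
print(len(matches))  # 0 -- no data to estimate the joint likelihood from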

Naive Bayes classifier

We can solve our problem of insufficient data by naively assuming that all the features in the dataset are independent given the label (X_1 \perp X_2 \perp \dots \perp X_n | Y). This assumption is not guaranteed to be true (for example, snow usually does not fall when it is 5-10 °C outside). However, it is an approximation we are willing to make in order to rewrite the MAP classification rule once more and simplify the calculations:

\begin{aligned}
\hat{y} = f(\mathbf{x}) &= \mathrm{argmax}_y\, P(\mathbf{X} = \mathbf{x} | Y = y)\, P(Y = y) = \mathrm{argmax}_y\, P(X_1 = x_1, \dots, X_n = x_n | Y = y)\, P(Y = y) \\
&= \mathrm{argmax}_y\, P(X_1 = x_1 | Y = y)\, P(X_2 = x_2 | Y = y) \dots P(X_n = x_n | Y = y)\, P(Y = y) \\
&= \mathrm{argmax}_y\, P(Y = y) \prod_{i=1}^{n} P(X_i = x_i | Y = y)
\end{aligned}

Now, we can obtain P(X_i = x_i | Y = y) from the data:

\begin{aligned}
P(\text{Precipitation = No} | \text{Snowboarding = No}) &= \frac{2}{5} = 0.4 \\
P(\text{Precipitation = No} | \text{Snowboarding = Yes}) &= \frac{3}{5} = 0.6 \\
P(\text{Temperature = 0-5 °C} | \text{Snowboarding = No}) &= \frac{2}{5} = 0.4 \\
P(\text{Temperature = 0-5 °C} | \text{Snowboarding = Yes}) &= \frac{2}{5} = 0.4 \\
P(\text{Wind = Breeze} | \text{Snowboarding = No}) &= \frac{2}{5} = 0.4 \\
P(\text{Wind = Breeze} | \text{Snowboarding = Yes}) &= \frac{1}{5} = 0.2
\end{aligned}

Finally, we substitute all the likelihoods into the product expression (the priors were already calculated above):

\begin{aligned}
P(\text{Snowboarding = Yes}) \prod_{i=1}^{n} P(X_i = x_i | \text{Snowboarding = Yes}) &< P(\text{Snowboarding = No}) \prod_{i=1}^{n} P(X_i = x_i | \text{Snowboarding = No}) \\
0.5 \times 0.6 \times 0.4 \times 0.2 &< 0.5 \times 0.4 \times 0.4 \times 0.4 \\
0.024 &< 0.032
\end{aligned}

Given all the data, we are more likely to skip snowboarding in a cold breeze, even under a clear sky.
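
Here is a minimal Python sketch that puts the priors and the per-feature likelihoods together for both labels; it uses the rows of the table above and reproduces the numbers from the comparison:

# Each row is (Precipitation, Temperature, Wind, Snowboarding).
rows = [
    ("No", "5-10 °C", "Calm", "No"),     ("No", "5-10 °C", "Breeze", "No"),
    ("No", "0-5 °C", "Calm", "Yes"),     ("Rain", "0-5 °C", "Calm", "No"),
    ("Snow", "-10-0 °C", "Calm", "Yes"), ("Snow", "-10-0 °C", "Breeze", "No"),
    ("No", "-10-0 °C", "Breeze", "Yes"), ("Rain", "0-5 °C", "Calm", "No"),
    ("No", "-10-0 °C", "Calm", "Yes"),   ("Rain", "0-5 °C", "Calm", "Yes"),
]

def naive_bayes_score(x, label):
    labelled = [row for row in rows if row[3] == label]
    score = len(labelled) / len(rows)  # prior P(Snowboarding = label)
    for i, value in enumerate(x):      # one likelihood factor per feature
        score *= sum(1 for row in labelled if row[i] == value) / len(labelled)
    return score

x = ("No", "0-5 °C", "Breeze")
print(naive_bayes_score(x, "Yes"))  # ~0.024
print(naive_bayes_score(x, "No"))   # ~0.032 -> predict "No"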

Laplace smoothing

Although we have already reduced our data requirements, there can still be a problem: if any of the likelihoods P(X_i = x_i | Y = y) is 0, the whole product turns to 0. We can avoid this by introducing Laplace smoothing into the probability calculation. In its simplest form, which adds one pseudo-count for an event occurring and one for it not occurring, it looks like this:

\frac{a}{b} \rightarrow \frac{a+1}{b+2}

Intuitively, such smoothing means adding a pseudo-count for every possible event. Let's update the likelihoods:

\begin{aligned}
P(\text{Precipitation = No} | \text{Snowboarding = No}) &= \frac{3}{7} \\
P(\text{Precipitation = No} | \text{Snowboarding = Yes}) &= \frac{4}{7} \\
P(\text{Temperature = 0-5 °C} | \text{Snowboarding = No}) &= \frac{3}{7} \\
P(\text{Temperature = 0-5 °C} | \text{Snowboarding = Yes}) &= \frac{3}{7} \\
P(\text{Wind = Breeze} | \text{Snowboarding = No}) &= \frac{3}{7} \\
P(\text{Wind = Breeze} | \text{Snowboarding = Yes}) &= \frac{2}{7}
\end{aligned}
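
The same add-one idea is easy to express in code. The helper below is a minimal sketch of the a/b -> (a+1)/(b+2) rule; the last line shows how it keeps a zero count from wiping out the whole product:

def smoothed_likelihood(count, total):
    # Laplace smoothing in its simplest form: a/b -> (a+1)/(b+2)
    return (count + 1) / (total + 2)

print(smoothed_likelihood(2, 5))  # P(Precipitation = No | Snowboarding = No)  = 3/7
print(smoothed_likelihood(3, 5))  # P(Precipitation = No | Snowboarding = Yes) = 4/7
print(smoothed_likelihood(1, 5))  # P(Wind = Breeze | Snowboarding = Yes)      = 2/7
print(smoothed_likelihood(0, 5))  # an unseen event now gets 1/7 instead of 0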

Conclusion

  • A Bayes classifier is generally impossible to build because real data are finite and rarely contain every combination of feature values.

  • Naive Bayes assumes conditional independence of the features given the label.

  • Laplace smoothing adds pseudo-counts so that events unseen in the data do not zero out the whole product of likelihoods.
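
In practice you rarely implement all of this by hand. Below is a hedged sketch of how the same table could be classified with scikit-learn's CategoricalNB (assuming scikit-learn is installed); its alpha parameter plays the role of Laplace smoothing, although scikit-learn normalizes by the number of categories of each feature, so the exact probabilities differ slightly from the simplified form above:

from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Features (Precipitation, Temperature, Wind) and labels, copied from the table above.
X = [
    ["No", "5-10 °C", "Calm"],     ["No", "5-10 °C", "Breeze"],
    ["No", "0-5 °C", "Calm"],      ["Rain", "0-5 °C", "Calm"],
    ["Snow", "-10-0 °C", "Calm"],  ["Snow", "-10-0 °C", "Breeze"],
    ["No", "-10-0 °C", "Breeze"],  ["Rain", "0-5 °C", "Calm"],
    ["No", "-10-0 °C", "Calm"],    ["Rain", "0-5 °C", "Calm"],
]
y = ["No", "No", "Yes", "No", "Yes", "No", "Yes", "No", "Yes", "Yes"]

encoder = OrdinalEncoder()        # map category strings to integer codes
model = CategoricalNB(alpha=1.0)  # alpha=1.0 corresponds to add-one smoothing
model.fit(encoder.fit_transform(X), y)

tomorrow = encoder.transform([["No", "0-5 °C", "Breeze"]])
print(model.predict(tomorrow))    # expected: ['No'], matching the result above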
