
Let's imagine that we have several observations about cats and dogs, and we want to determine the type of animal given some features, for example, weight, height, and length. This is called a classification task, and one of the methods to solve it is logistic regression.

Furthermore, logistic regression not only predicts a class (cat or dog), it also tells us the probability that an observation belongs to that class.

In general, there are two types of classification problems: binary and multiclass. The first one has only two classes to identify, while the second one has at least three classes to choose from. In this topic, we will focus on binary classification.

Stating the problem

Consider the data first. There is a row vector \vec{x} = (x_1, x_2, ...), where x_i is a feature of an object. A single vector represents one object, and we call it an observation. For instance, a bulldog named Mike was recorded in the dataset. Recall that in the beginning, there was a fixed set of features: weight, height, and length. The following vector represents Mike: \vec{x}_{Mike} = (23, 50, 80), meaning that Mike's weight is 23 kg, his height is 50 cm, and his length is 80 cm. The class label or target is y, which equals 1 for dogs and 0 otherwise.

Secondly, we aim to predict how probable it is that a given observation belongs to class 1. We can build a linear regression model to predict that probability:

\Pr(y=1|\vec{x}) = w_0x_0 + w_1x_1 + w_2x_2 + w_3x_3 = \sum\limits_{i=0}^{3} w_i x_i

\Pr(y=1|\vec{x}) is the probability of class 1 given \vec{x}; \vec{w} = (w_0, w_1, w_2, w_3) are the coefficients of the model, also called weights. Note that x_0 corresponds to the intercept, so x_0 = 1, and w_0 is its coefficient. The problem is that such an equation can return any real number, while we need only numbers in the range from 0 to 1, since we're working with probabilities.
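To see the issue concretely, here is a minimal Python sketch (the weights and features are made up for illustration) showing the linear combination escaping the [0, 1] range:

```python
# Hypothetical weights and features, purely for illustration.
w = [3.0, 0.4, -0.12, -0.07]  # w[0] is the intercept coefficient
x = [1.0, 35.0, 70.0, 80.0]   # x[0] = 1 for the intercept

linear_output = sum(wi * xi for wi, xi in zip(w, x))
print(linear_output)  # 3.0 -- not a valid probability
```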

Sigmoid function

To transform linear regression into logistic regression, a logistic function is applied:

f(x) = \frac{L}{1+e^{-k(x-x_0)}}

Let's digress a little to explore this function.

Here is its plot:

A plot of the logistic function with different steepness values

Concerning the parameters:

  • L – the maximum value of the function
  • x_0 – the midpoint of the function
  • k – the growth rate, or the steepness of the curve

You can change these parameters depending on the task.

If we take L = 1, x_0 = 0, k = 1, we get the sigmoid function, which is the most commonly used logistic function:

\sigma(x) = \frac{1}{1+e^{-x}}
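As a quick sanity check, here is a minimal sketch of the sigmoid in plain Python (the function name is my own choice):

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))    # 0.5, the midpoint since x_0 = 0
print(sigmoid(10.0))   # ~0.99995, large inputs saturate towards 1
print(sigmoid(-10.0))  # ~0.00005, large negative inputs approach 0
```

Note that the output never reaches exactly 0 or 1, which is precisely the behavior we want for probabilities.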

Getting back to the classification problem, we substitute features into the sigmoid function:

  • In the case of a single feature

p(x) = \Pr(y=1|x) = \sigma(x, w) = \frac{1}{1+e^{-x \cdot w}}

  • In the case of multiple features

p(\vec{x}) = \Pr(y=1|\vec{x}) = \sigma(\vec{x}, \vec{w}) = \frac{1}{1+e^{-\vec{x} \cdot \vec{w}}} = \frac{1}{1+e^{-(w_0x_0 + w_1x_1 + w_2x_2 + ...)}}

Note that in binary classification problems it's enough to predict the probability of class 1. The probability of class 0 can be found as follows: \Pr(y=0|\vec{x}) = 1 - \Pr(y=1|\vec{x}).
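Here is how this prediction step could look in code, a sketch assuming four features with an intercept (Mike's feature vector from above is reused):

```python
import math

def predict_proba(x: list[float], w: list[float]) -> float:
    """P(y=1 | x): sigmoid of the dot product of features and weights."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

x = [1.0, 23.0, 50.0, 80.0]   # x[0] = 1 for the intercept, then weight/height/length
w = [3.0, 0.4, -0.12, -0.07]  # example weights; the toy example later uses the same values

p1 = predict_proba(x, w)
p0 = 1.0 - p1                 # P(y=0 | x) = 1 - P(y=1 | x)
print(round(p1, 3), round(p0, 3))  # 0.646 0.354
```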

Calculating the weights

As in linear regression, our task is to find the optimal weights \vec{w} = (w_0, w_1, w_2, ...) for our model. Unfortunately, there is no closed-form solution in logistic regression, so there is no formula to calculate the coefficients directly. Therefore, we apply numerical optimization methods to solve the problem.

Estimating the weights is precisely what model training means here. A single observation is not enough, so we introduce X and \vec{y}. Recall that X is a set of m objects \vec{x} with their features (characteristics of animals) and \vec{y} is the set of object labels y_i (the true types of animals).

The training set: features and target values

The most common approach is maximum likelihood estimation (MLE). The basic idea is that we have the likelihood function L(w_0, w_1, w_2, ..., w_n), which describes how probable it is that the model with coefficients w_i predicts the correct class label for each of our objects. The larger the value of L(w_0, w_1, w_2, ..., w_n), the better the estimate of \vec{w}. Here is the formal definition of the likelihood function:

L(w_0, w_1, w_2, ..., w_n) = \prod\limits_{j=1}^{m} \Pr(y = y_j | \vec{x} = \vec{x}_j) = \prod\limits_{j=1}^{m} \sigma(\vec{x}_j \cdot \vec{w})^{y_j} \cdot [1 - \sigma(\vec{x}_j \cdot \vec{w})]^{1-y_j}
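This product translates directly into code. Below is a sketch with predicted probabilities and labels taken from the toy example at the end of this topic:

```python
def likelihood(probs: list[float], labels: list[int]) -> float:
    """Product over observations of p^y * (1 - p)^(1 - y)."""
    result = 1.0
    for p, y in zip(probs, labels):
        result *= p**y * (1 - p)**(1 - y)
    return result

probs = [0.953, 0.646, 0.193, 0.014]  # predicted P(y=1) per observation
labels = [1, 1, 0, 0]                 # true class labels
print(round(likelihood(probs, labels), 3))  # ~0.49; higher is better
```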

The mathematical derivation is beyond the scope of this topic, but you can find it in the Stanford lecture notes.

To simplify the computations, the natural logarithm is applied to L(w_0, w_1, w_2, ..., w_n) and the result is multiplied by -1. We get a new function, called log-loss, which we wish to minimize by finding the optimal weights \vec{w} = (w_0, w_1, w_2, ..., w_n). Unlike the likelihood function, the lower the log-loss, the better the estimate of \vec{w}.

\text{log-loss}(w_0, w_1, w_2, ..., w_n) = -\frac{1}{m} \cdot \sum\limits_{i=1}^{m} \left( y_i \cdot \ln(p_i) + (1-y_i) \cdot \ln(1-p_i) \right) \to \min,

where p_i is the probability that the i-th observation belongs to class 1, as defined previously: p_i = \Pr(y=1|\vec{x}_i) = \sigma(\vec{x}_i, \vec{w}).
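In code, the log-loss is just as short. A sketch with the same example probabilities and labels as above:

```python
import math

def log_loss(probs: list[float], labels: list[int]) -> float:
    """Negative mean log-likelihood; lower is better."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(probs, labels))
    return -total / len(labels)

probs = [0.953, 0.646, 0.193, 0.014]
labels = [1, 1, 0, 0]
print(round(log_loss(probs, labels), 3))  # ~0.178
```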

Once we know the loss function, we apply the gradient descent algorithm to find the set of weights that minimizes it.
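For the curious, here is a minimal gradient descent sketch for logistic regression in plain Python. The learning rate and iteration count are arbitrary choices, and the update relies on the standard gradient of the log-loss, \frac{1}{m}\sum_i (p_i - y_i) \cdot \vec{x}_i:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def fit(X: list[list[float]], y: list[int],
        lr: float = 0.01, n_iter: int = 10_000) -> list[float]:
    """Gradient descent on log-loss. Each row of X must include x_0 = 1."""
    m, n = len(X), len(X[0])
    w = [0.0] * n                              # start from zero weights
    for _ in range(n_iter):
        grad = [0.0] * n
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            for j in range(n):
                grad[j] += (p - yi) * xi[j]    # per-sample gradient term for w_j
        w = [wj - lr * g / m for wj, g in zip(w, grad)]
    return w
```

Note that on perfectly separable toy data the weights can grow without bound, so the fixed iteration count also acts as an implicit stopping rule.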

Log-loss can also be used as a metric, but only for comparing models: the value of the function itself isn't easily interpretable.

If you wish to understand the derivation of the functions above and how gradient descent works, read a separate explanation of these topics as well.

Cut-off point

We are almost done, since we've learned to predict the probability of belonging to class 1. We now need a cut-off point to transform probabilities into class labels. The cut-off point is the threshold at which we decide whether an observation belongs to class 1 or not. Since we are solving binary classification, not belonging to class 1 means belonging to class 0.

Estimating the class of a binary classification problem with the predicted probabilities and cut-off values

We choose the cut-off point based on the particular data. Observations with probabilities greater than or equal to the cut-off are assigned to class 1, and the ones with probabilities below the cut-off are assigned to class 0.
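The rule fits in a few lines of Python; the default cut-off of 0.5 below mirrors the toy example that follows:

```python
def predict_class(p: float, cutoff: float = 0.5) -> int:
    """Turn a predicted probability into a class label."""
    return 1 if p >= cutoff else 0

print(predict_class(0.646))  # 1: probability above the cut-off
print(predict_class(0.193))  # 0: probability below the cut-off
print(predict_class(0.5))    # 1: the boundary case goes to class 1
```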

Toy example

Recall bulldog Mike and his features: \vec{x}_{Mike} = (23, 50, 80). Adding three more objects, for example, Rex, a German shepherd, Jess, a python, and Dave, a raccoon, we get the following data:

       intercept  weight  height  length  class
Mike       1         23      50      80      1
Rex        1         35      70      80      ?
Jess       1         12      25      89      ?
Dave       1          7      70      23      ?

The model was fitted earlier on the whole dataset, and the weights are as follows: \vec{w} = (3, 0.4, -0.12, -0.07).

Look at the example calculation for Mike:

\sigma(\vec{x}_{Mike}) = \frac{1}{1+e^{-(3 \cdot 1 + 0.4 \cdot 23 - 0.12 \cdot 50 - 0.07 \cdot 80)}} \approx 0.646

The cut-off point equals 0.5, so Mike belongs to class 1 (dog) with a predicted probability of 0.646.
Finally, we get:

       \sigma(\vec{x})  class
Rex         0.953          1
Mike        0.646          1
Jess        0.193          0
Dave        0.014          0
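Putting it all together, the short script below (reusing the earlier sketches) reproduces the whole table up to rounding:

```python
import math

def predict_proba(x: list[float], w: list[float]) -> float:
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

w = [3.0, 0.4, -0.12, -0.07]        # fitted weights from the example
animals = {                          # intercept, weight, height, length
    "Rex":  [1.0, 35.0, 70.0, 80.0],
    "Mike": [1.0, 23.0, 50.0, 80.0],
    "Jess": [1.0, 12.0, 25.0, 89.0],
    "Dave": [1.0,  7.0, 70.0, 23.0],
}

for name, x in animals.items():
    p = predict_proba(x, w)
    label = 1 if p >= 0.5 else 0     # cut-off point = 0.5
    print(f"{name}: {p:.3f} -> class {label}")
```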

Conclusion

  • Logistic regression is an algorithm for solving classification problems.
  • A logistic regression model is a linear regression model whose output is passed through a logistic function.
  • The most commonly used logistic function is the sigmoid function.
  • There is no closed-form solution for logistic regression.
  • The most commonly used method to find optimal weights is maximum likelihood estimation (MLE).
  • The cut-off point is used to decide the class label.