
Logistic regression in scikit-learn


As you already know, logistic regression is one of the methods for solving classification problems.

In this topic, we will look at the practical application of logistic regression using the sklearn library and one of its built-in datasets.

Data preparation

The sklearn library provides several datasets. In this topic we will use one of them, the iris flower dataset. Let's load it.

from sklearn.datasets import load_iris
data = load_iris()

The dataset contains observations about irises. Each observation has 4 features (sepal and petal length and width) and a label of the type of iris (0 - iris setosa, 1 - iris versicolor, 2 - iris virginica). Here are a few rows from the dataset.

+----+---------------------+--------------------+---------------------+--------------------+----------+
|    |   sepal length (cm) |   sepal width (cm) |   petal length (cm) |   petal width (cm) |   target |
|----+---------------------+--------------------+---------------------+--------------------+----------|
|  0 |                 5.1 |                3.5 |                 1.4 |                0.2 |        0 |
|  1 |                 4.9 |                3   |                 1.4 |                0.2 |        0 |
|  2 |                 4.7 |                3.2 |                 1.3 |                0.2 |        0 |
|  3 |                 4.6 |                3.1 |                 1.5 |                0.2 |        0 |
|  4 |                 5   |                3.6 |                 1.4 |                0.2 |        0 |
+----+---------------------+--------------------+---------------------+--------------------+----------+
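A preview like the table above can be produced with pandas. The sketch below is only one possible way to do it: it assumes pandas is installed and uses the as_frame=True option of load_iris, neither of which appears in the original text.

```python
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn to return pandas objects
# (assumption: pandas is installed; this option is not used in the text)
iris = load_iris(as_frame=True)

# iris.frame holds the four feature columns plus a 'target' column
df = iris.frame
print(df.head())

# Names of the classes, in label order (0, 1, 2)
print(list(iris.target_names))
```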

Our goal is to classify the type of iris flower using known data about its petal and sepal sizes.

First, let's save the features to X and the target variable to y.

X = data.data
y = data.target
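Before going further, it can help to check what X and y actually contain. A minimal sketch of such a check (the printing and the use of np.bincount are additions for illustration, not part of the original workflow):

```python
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()
X = data.data
y = data.target

# X is a 2D array: one row per observation, one column per feature
print(X.shape)
# y is a 1D array with one integer label per observation
print(y.shape)

# Column names and class names stored alongside the data
print(data.feature_names)
print(data.target_names)

# Number of observations per class label
class_counts = np.bincount(y)
print(class_counts)
```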

After that, we have to split our dataset into train and test sets. The train set will be used to fit the model, and the test set will be used to calculate the final metrics.
sklearn has a convenient function for this, train_test_split. It is located in the model_selection module of sklearn.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=27)

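The test_size parameter specifies the size of the test set. A float such as 0.33 is treated as a proportion of the dataset, while an integer is treated as an absolute number of observations. The function also has a similar parameter, train_size, which specifies the size of the train set.

The random_state parameter fixes the random shuffling that is applied before splitting. It is very useful when you want to repeat the split and get exactly the same train and test sets.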
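The effect of test_size and random_state can be checked directly. The sketch below repeats the same call twice and compares the results; the helper variable names (first, second, same_split) are made up for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Two calls with the same random_state
first = train_test_split(X, y, test_size=0.33, random_state=27)
second = train_test_split(X, y, test_size=0.33, random_state=27)

X_train, X_test, y_train, y_test = first

# Sizes of the two parts; together they cover the whole dataset
print(len(X_train), len(X_test))

# With a fixed random_state, both calls return identical arrays
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)
```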

Fitting

Now we are ready to create and fit the logistic regression model.

First, let's import it from the linear_model module of sklearn and create an instance of it.

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

Just like in the linear regression model, there is a fit() method here. The method takes in the features and the values of the targets.

model.fit(X_train, y_train)

This process may take a little time. After that we get a fitted model and can make a prediction.
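Once fitting has finished, the model stores some information about the training run. A short sketch of how this could be inspected, assuming the same data and split as above; max_iter=1000 is an assumption of this sketch (the text uses the default) and only gives the solver more iterations to converge.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=27
)

# max_iter=1000 is not in the original text; it is set here only
# to reduce the chance of a convergence warning
model = LogisticRegression(max_iter=1000)

# fit() trains the model in place and returns the same object
fitted = model.fit(X_train, y_train)
print(fitted is model)

# Class labels the model has seen during training
print(model.classes_)

# Number of iterations the solver actually used
print(model.n_iter_)
```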

Predicting and evaluating

Predictions are made with the predict() method of the trained model, which returns a class label for each object in the input.

y_pred = model.predict(X_test)

The logistic regression model in sklearn also has a method called predict_proba(), which returns the probability of each object belonging to each class.

y_proba = model.predict_proba(X_test)
print(y_proba[0]) 
# >>> [1.23015595e-04 1.46171445e-01 8.53705539e-01]
# or, rounded: [~0.0001 ~0.146 ~0.854]
print(y_pred[0]) 
# >>> 2

In the code above, we see that the first object of the test set has the highest probability of belonging to the third class, iris virginica (label 2), so the final result in y_pred is 2. The probabilities are ordered by class label: the first element of y_proba[0] is the probability of iris setosa (label 0), the second is the probability of iris versicolor (label 1), and the third is the probability of iris virginica (label 2).
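The relationship between predict() and predict_proba() described above can be verified for the whole test set. The following is a minimal sketch under the same data and split; max_iter=1000 and the helper variable names are assumptions added for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=27
)

# max_iter=1000 is an assumption of this sketch, not part of the text
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)

# One row per object, one column per class
print(y_proba.shape)

# The probabilities in every row add up to 1
rows_sum_to_one = np.allclose(y_proba.sum(axis=1), 1.0)
print(rows_sum_to_one)

# Taking the most probable class in each row reproduces predict()
labels_from_proba = model.classes_[np.argmax(y_proba, axis=1)]
matches_predict = np.array_equal(labels_from_proba, y_pred)
print(matches_predict)
```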

Then we can evaluate our results using popular classification metrics. Let's take precision and F1-score, for example.

The corresponding functions are implemented in the metrics module of sklearn. Let's import them.

from sklearn.metrics import precision_score, f1_score

Now let's apply these functions to y_test and y_pred to get the scores.

precision = precision_score(y_test, y_pred, average='macro')
print(precision) # 0.9470588235294118

f1 = f1_score(y_test, y_pred, average='macro')
print(f1) # 0.9457875457875456

The parameter average='macro' means that the metric is computed for each class separately and the unweighted mean of these per-class values is returned.
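To see what macro averaging does, the per-class values can be computed with average=None and averaged by hand. This is an illustrative sketch with the same data and split; max_iter=1000 is an assumption that is not in the original text.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=27
)

# max_iter=1000 is an assumption of this sketch
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# average=None returns one precision value per class
per_class = precision_score(y_test, y_pred, average=None)
print(per_class)

# average='macro' is the plain mean of the per-class values
macro = precision_score(y_test, y_pred, average='macro')
print(macro)
print(np.isclose(per_class.mean(), macro))
```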

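So on the test set, our model reaches a macro-averaged precision and F1-score of about 0.95. Note that these values are not the share of correct predictions; that quantity is measured by a different metric, accuracy.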
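The share of correct predictions is measured by accuracy, which is a different metric from macro precision or F1. A minimal sketch using accuracy_score and confusion_matrix from sklearn.metrics (both are real functions, but they are not used in the original text; max_iter=1000 is likewise an assumption):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=27
)

# max_iter=1000 is an assumption of this sketch
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Accuracy: fraction of test objects whose label was predicted correctly
accuracy = accuracy_score(y_test, y_pred)
manual_accuracy = np.mean(y_pred == y_test)
print(accuracy)

# Rows are true classes, columns are predicted classes;
# correct predictions lie on the diagonal
cm = confusion_matrix(y_test, y_pred)
print(cm)
```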

Inspecting a model

A fitted model has two attributes for accessing its coefficients: model.coef_, which gives the coefficients $w_1, \dots, w_n$, and model.intercept_, which gives the coefficient $w_0$.

In the case of multiclass classification, we get as many sets of coefficients as there are class labels in the dataset.

For example, our model has 3 sets of coefficients: coef_ has one row per class and one column per feature.

print(model.coef_)
# array([[-0.3822008 ,  0.86131077, -2.22946462, -0.96608885],
#        [ 0.36372752, -0.52482781, -0.08075919, -0.75884472],
#        [ 0.01847327, -0.33648296,  2.3102238 ,  1.72493357]])

print(model.intercept_)
# array([  8.59296994,   2.79215664, -11.38512658])

This is because each logistic regression equation corresponds to one class and can be viewed as dividing the space into a region with that class and a region with all other classes.

[Figure: a 2D plot of 3 classes separated by 3 interconnecting lines]
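The role of coef_ and intercept_ becomes clearer when the linear scores are computed by hand. The sketch below evaluates $w_0 + w_1 x_1 + \dots + w_n x_n$ for every class and compares the result with the model's decision_function() method; max_iter=1000 and the helper variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=27
)

# max_iter=1000 is an assumption of this sketch
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# coef_ has shape (n_classes, n_features), intercept_ has shape (n_classes,)
print(model.coef_.shape, model.intercept_.shape)

# One linear score per object and per class
scores = X_test @ model.coef_.T + model.intercept_

# The manual scores coincide with decision_function()
same_scores = np.allclose(scores, model.decision_function(X_test))
print(same_scores)

# The class with the largest score is the predicted label
labels_from_scores = model.classes_[np.argmax(scores, axis=1)]
same_labels = np.array_equal(labels_from_scores, model.predict(X_test))
print(same_labels)
```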

If necessary, the intercept $w_0$ can be removed from the model equation. To do this, specify fit_intercept=False when creating the instance.

model = LogisticRegression(fit_intercept=False)
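To check the effect of this option, one can fit such a model and look at its attributes. A small sketch under the same data and split as before; max_iter=1000 is an assumption added here and is not part of the original text.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=27
)

# max_iter=1000 is an assumption of this sketch
model = LogisticRegression(fit_intercept=False, max_iter=1000)
model.fit(X_train, y_train)

# Without an intercept, intercept_ is filled with zeros
print(model.intercept_)

# The other coefficients are still fitted, one row per class
print(model.coef_.shape)
```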

Conclusion

Logistic regression is a linear model, so you can find it in the sklearn.linear_model module.
To split data into train and test sets, you can use the train_test_split function from sklearn.model_selection.
To train a logistic regression model, use the fit(X_train, y_train) method.
You can use the predict(X_test) method of a fitted model to get predictions for the objects in X_test.
To get a model's coefficients, use model.intercept_ for the intercept $w_0$ and model.coef_ for the other coefficients $w_1, \dots, w_n$.
In the case of multiclass classification, we get as many sets of coefficients as there are class labels in the dataset.
