
How Data Pros Use PyCaret

PyCaret is an open-source, low-code machine learning library. It is designed to make machine learning more accessible to everyone, regardless of experience level. PyCaret simplifies the end-to-end machine learning workflow, making it easier to build, evaluate, and deploy models. As a model management tool, it automates data preprocessing, feature engineering, feature selection, model optimization, and other steps of the machine learning workflow. This article explores how PyCaret can help you with experiments and classification tasks, using the UCI Blood Donation dataset.

Getting Started with PyCaret

Before diving into the code, we must install PyCaret 3 and the necessary dependencies. Install the full version of PyCaret that comes with all the optional dependencies:

pip install pycaret[full]

The full version installs optional dependencies that not every project needs. Alternatively, you can install PyCaret with just the hard dependencies and add the optional extras you need as you work on the project:

# install pycaret with the hard dependencies
pip install pycaret

# install analysis extras
pip install pycaret[analysis]

# install models extras
pip install pycaret[models]

You can find instructions for installing other optional dependencies on the official PyCaret site. For this article, install the full version.

The blood donation dataset

The blood donation dataset is publicly available in the UC Irvine Machine Learning Repository. Each record describes a donor with four numeric attributes: recency (months since the last donation), frequency (total number of donations), monetary (total blood donated in c.c.), and time (months since the first donation), along with a binary target variable (0 = not donating, 1 = donating). You can import and load the dataset with the following code:

from pycaret.datasets import get_data

# load the blood donation dataset
df = get_data('blood')

# view the index of all built-in pycaret datasets
all_data = get_data('index')
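Before configuring anything, it is worth taking a quick look at the data and the class balance; this is what motivates the fix_imbalance setting used later in this article. A minimal sketch:

# inspect the first rows and the class distribution
print(df.head())
print(df['Class'].value_counts())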

Setting up, preprocessing, and EDA

The PyCaret functional API is arranged in modules. Before you can run any other function in a particular module, you must first initialize the model training environment and create the transformation pipeline with the setup function:

# import all functions in the classification module
from pycaret.classification import *

# get the numeric features
numeric_features = df.columns.drop('Class').tolist()

# initialize the classification environment with setup()
clf = setup(data=df, target='Class', train_size=0.8, session_id=1,
            fix_imbalance=True, numeric_features=numeric_features,
            preprocess=True, transformation=True,
            transformation_method='yeo-johnson', normalize=True,
)

The input variables are numeric, while the target variable is binary, as identified by the target parameter. The preprocess parameter is set to True so that PyCaret preprocesses the input, and transformation is also True so that the numeric features are transformed toward a Gaussian-like distribution with the Yeo-Johnson method.

As you can see from the exploratory data analysis, the dataset has more Class 0 entries. The fix_imbalance parameter is set to True to help synthetically create a balanced dataset with the default SMOTE resampler. You can switch to a resampler of your choice with the fix_imbalance_method parameter.
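As a sketch, assuming you want ADASYN instead of SMOTE, you can pass an imbalanced-learn resampler directly to fix_imbalance_method (the imbalanced-learn package ships with the full PyCaret install):

from imblearn.over_sampling import ADASYN

# re-initialize the environment with a different resampler
clf = setup(data=df, target='Class', train_size=0.8, session_id=1,
            fix_imbalance=True, fix_imbalance_method=ADASYN(),
            numeric_features=numeric_features)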

By specifying a handful of parameters, the setup function has already processed the dataset for you, something that would take many lines of code in a traditional workflow. After setting up the environment, you can perform exploratory data analysis with the eda function:

# perform exploratory data analysis
eda(display_format='bokeh')

You get an interactive dashboard for exploring the relationships between your features. If you prefer static plots, you can change the display format to svg.
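A minimal example:

# render the dashboard as static svg plots instead
eda(display_format='svg')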

With a few lines of code, you have performed data preprocessing and exploratory data analysis. Training different machine learning models and comparing their performance with PyCaret is even easier.

Train, compare, evaluate, predict, finalize, and save

With the compare_models function, you can train several classification models and compare them across evaluation metrics:

# train and compare models with their accuracy score
best = compare_models(n_select=1, sort='Accuracy')


# return a list of the top 3 models
top3 = compare_models(n_select=3, sort='Accuracy')

The n_select parameter defaults to 1, so compare_models returns the single trained model with the highest accuracy score. When n_select is set to a value greater than 1, compare_models returns a list of the top trained models.
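You can also restrict the comparison. As a sketch, compare_models accepts include and exclude lists of model IDs (the IDs come from the models() table shown later in this article):

# compare only a subset of estimators, sorted by AUC
subset_best = compare_models(include=['lr', 'rf', 'ada'], sort='AUC')

# or exclude slower estimators from the full comparison
fast_best = compare_models(exclude=['gbc', 'lightgbm'], sort='Accuracy')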

You can use the get_params() method to view the parameters of the model:

# prints the parameters of the best model
print(best.get_params())

# also prints the parameters of the best model
print(top3[0].get_params())

If you are interested in training a particular model, you can do it with the create_model function:

# train an AdaBoostClassifier
model_ada = create_model('ada')
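create_model also forwards keyword arguments to the underlying estimator, so you can set hyperparameters directly when training; a quick sketch:

# train AdaBoostClassifier with a custom number of estimators
model_ada_custom = create_model('ada', n_estimators=100)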

After training, the plot_model function helps you evaluate the model graphically:

# make an area under the curve plot
plot_model(model_ada, 'auc', verbose=False)


# make a confusion matrix plot
plot_model(model_ada, plot='confusion_matrix', verbose=False)


# make a classification report
plot_model(model_ada, plot='class_report', verbose=False)


# plot feature importance scores
plot_model(model_ada, 'feature', verbose=False)


# make a boundary plot
plot_model(model_ada, 'boundary', verbose=False)

Instead of making several plots, you can view all the plots available to you with the evaluate_model function in one line:

evaluate_model(model_ada)

After model evaluation, you can predict with the model:

# make predictions with the trained model
predict_ada = predict_model(model_ada, probability_threshold=0.5)

The probability_threshold parameter defaults to 0.5; it is the probability cutoff used to decide an entry's class. You can increase or decrease this value depending on your problem.
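For example, raising the threshold makes the model more conservative about predicting the positive class (donating), while lowering it does the opposite; a minimal sketch:

# stricter cutoff: fewer positive predictions, usually higher precision
predict_strict = predict_model(model_ada, probability_threshold=0.7)

# looser cutoff: more positive predictions, usually higher recall
predict_loose = predict_model(model_ada, probability_threshold=0.3)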

If you are satisfied with your predictions, you can finalize and save your model. Finalizing retrains the model on the complete dataset with the finalize_model function, and save_model saves the final pipeline:

# train model on the whole dataset
final_model = finalize_model(model_ada)


# save final model
save_model(final_model, 'clf_blood')
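Later, you can restore the saved pipeline with load_model and score new data; here the original dataframe stands in for unseen data:

# load the saved pipeline and make predictions
restored_model = load_model('clf_blood')
new_predictions = predict_model(restored_model, data=df)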

Changing the AdaBoostClassifier base estimator and tuning the model

PyCaret is interoperable with Scikit-Learn and other popular machine learning libraries. You can see all the classifiers available in PyCaret with the models function:

all_classifiers = models()

The AdaBoostClassifier comes from Scikit-Learn. Its documentation shows that the default base estimator is a DecisionTreeClassifier. You can change the default by specifying the base_estimator parameter when you create the AdaBoostClassifier. You cannot, however, use just any classifier as the base estimator: it must support sample weights, and it should be a weak learner:

# import the required classifiers
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron

# train AdaBoostClassifier with a GaussianNB weak learner
# (note: in scikit-learn >= 1.2, base_estimator is renamed to estimator)
nb_model_ada = create_model(AdaBoostClassifier(base_estimator=GaussianNB()))

# train AdaBoostClassifier with a Perceptron weak learner;
# Perceptron has no predict_proba, so the SAMME algorithm is required
per_model_ada = create_model(AdaBoostClassifier(base_estimator=Perceptron(), algorithm='SAMME'))

You can tune your model with the tune_model function. All you need to do is list the hyperparameters of the model for tuning:

import numpy as np

# specify model hyperparameters to tune
params = {
    'n_estimators': np.arange(50, 250, 50),
    'learning_rate': np.logspace(-3, 0, num=10)  # 10 values from 0.001 to 1.0
}


# tune AdaBoostClassifier with GaussianNB weak learner

nb_tuned_ada = tune_model(
    nb_model_ada, optimize='Accuracy', fold=5,
    tuner_verbose=False, custom_grid=params, n_iter=50
)


# tune AdaBoostClassifier with a Perceptron weak learner

per_tuned_ada = tune_model(
    per_model_ada, optimize='Accuracy', fold=5,
    tuner_verbose=False, custom_grid=params, n_iter=50
)

As shown before, you can evaluate and predict with your tuned models using the appropriate functions.
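For instance, the same evaluate_model and predict_model functions work on the tuned models:

# evaluate and predict with a tuned model, just like before
evaluate_model(nb_tuned_ada)
predict_nb_tuned = predict_model(nb_tuned_ada)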

Stacking classifiers

Stacking is an ensemble technique that uses the outputs of several classifiers to train another model, called a meta-estimator. Envision stacking as a two-layer model: the training data is fed to all the models in the first layer, and their outputs are used to train the meta-estimator in the second layer, which then makes the final prediction.

Let's see how stacking works with the top3 classifiers you already trained:

stack_models(top3)

The figure below shows how the stack_models function trains the default meta-model, LogisticRegression, on the outputs of the top3 classifiers:

[Figure: stacking classifier diagram]

You can change the meta-estimator from the default. In the following code, the meta-estimator is changed to an MLPClassifier:

# import the meta-model
from sklearn.neural_network import MLPClassifier


# change the meta-model from the default
stacked = stack_models(top3, meta_model=MLPClassifier())

# predict with the stacked model
stacked_predict = predict_model(stacked, data=df)

# evaluate the stack model with boundary plot
plot_model(stacked, 'boundary')

You can stack other models of your choice by placing the trained models in a list, as in the following example:

stack_lst = [
    create_model('ada'),
    create_model('lightgbm'),
    create_model('gbc')
]

new_stacked = stack_models(stack_lst, meta_model=MLPClassifier())
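A stacked model behaves like any other PyCaret model, so you can finalize and save it the same way; the file name here is arbitrary:

# train the stack on the whole dataset and save it
final_stacked = finalize_model(new_stacked)
save_model(final_stacked, 'stacked_blood')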

Conclusion

PyCaret is an exceptional tool that simplifies the complex process of machine learning. Its low-code approach allows beginners and experienced data scientists to navigate the entire workflow easily. PyCaret automates critical steps in the machine-learning process, from data preprocessing to fine-tuning, making it an efficient and time-saving choice. This article explored how PyCaret can be used for classification tasks with the UCI Blood Donation dataset, showcasing its user-friendly features.

You can do much more with PyCaret than we have discussed in this article. You can explore the PyCaret documentation for more. Learn about the other data science and machine learning libraries that will enable you to properly use PyCaret in our Introduction to Data Science, Machine Learning Algorithms from Scratch, and Pandas for Data Analysis tracks.
