
CatBoost


CatBoost is a gradient boosting framework with several improvements over classical implementations. One of them is efficient handling of categorical features: unlike traditional gradient boosting algorithms that require extensive preprocessing of categorical data, CatBoost uses Ordered Target Encoding (OTE), which encodes categorical variables automatically during training.

In this topic, we will briefly consider the theoretical foundations of the framework and see how to use CatBoost in practice.

The theoretical setup

One of the main motivations for CatBoost was to improve the handling of categorical features in gradient boosting models. Traditional methods often rely on techniques like one-hot encoding or label encoding, which can lead to several issues:

a. High dimensionality: One-hot encoding can result in a high-dimensional sparse feature space, which can slow down the training process and increase memory requirements.

b. Suboptimal ordering: Label encoding assumes an ordinal relationship between categories, which may not always be appropriate.

c. Information loss: These encoding methods fail to capture the relationship between categories and the target variable.

To address these issues, CatBoost introduces Ordered Target Encoding. Each categorical value is replaced with a numerical statistic (typically the mean of the target) computed from other instances that share the same category. To avoid target leakage, the statistic for a given instance is calculated only from the instances that precede it in a random permutation of the training data, and a prior is added to smooth the estimate for rare categories. This approach captures the relationship between categories and the target variable while greatly reducing the need for manual feature engineering.
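
To make the idea concrete, here is a minimal sketch of an ordered target statistic for a toy column. It is not CatBoost's internal code: it assumes the rows have already been randomly permuted and uses a smoothing prior with weight 1.

import pandas as pd

# Toy data: a single categorical column and a binary target,
# assumed to be already shuffled into a random order.
df = pd.DataFrame({
    "color":  ["red", "blue", "red", "red", "blue"],
    "target": [1, 0, 1, 0, 1],
})

prior = df["target"].mean()  # global prior used to smooth rare categories

encoded = []
for i in range(len(df)):
    history = df.iloc[:i]  # only the rows that come before row i
    same_cat = history[history["color"] == df.loc[i, "color"]]
    # smoothed mean of the target over previously seen rows of the same category
    encoded.append((same_cat["target"].sum() + prior) / (len(same_cat) + 1))

df["color_encoded"] = encoded
print(df)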

Another key motivation for CatBoost was to develop a more robust way of mitigating overfitting, a common issue in gradient boosting models, especially on complex or high-dimensional datasets. CatBoost employs a technique called Ordered Boosting. The training instances are randomly permuted, and the residual for each instance is computed with a model that was trained only on the instances preceding it in the permutation. Because an instance's own target never influences the model that scores it, the prediction shift (a subtle form of target leakage) inherent to classical gradient boosting is avoided, which leads to better generalization and reduced overfitting.
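
A drastically simplified sketch of the ordered-boosting idea is given below. It is not CatBoost's implementation; it only illustrates that the residual for an instance comes from a model that has never seen that instance.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=200)

perm = rng.permutation(len(X))   # random order of the training instances
X, y = X[perm], y[perm]

min_history = 20                 # we need some history before fitting anything
predictions = np.zeros(len(X))
for i in range(min_history, len(X)):
    # a model trained only on the instances preceding i in the permutation
    model = DecisionTreeRegressor(max_depth=2).fit(X[:i], y[:i])
    predictions[i] = model.predict(X[i:i + 1])[0]

# these residuals are not biased by an instance's own target
residuals = y[min_history:] - predictions[min_history:]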

CatBoost also incorporates random subsampling of the training data for each new tree. This technique, similar to bagging, helps to introduce additional randomness and diversity into the ensemble, further reducing the risk of overfitting.

In addition to subsampling instances, CatBoost can also perform random subsampling of features for each new tree. This approach, known as feature bagging, further enhances the diversity of the ensemble and prevents individual features from dominating the model.
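
If you want to control these behaviours yourself, CatBoost exposes them as constructor parameters; here is a small sketch, with values that are only illustrative:

from catboost import CatBoostClassifier

clf = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    bootstrap_type="Bernoulli",  # sample rows independently for each tree
    subsample=0.8,               # fraction of rows used per tree
    rsm=0.8,                     # fraction of features considered at each split (CPU training)
    random_seed=42,
)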

Example usage: getting the library and the data

As usual, you can install the CatBoost library as follows:

pip install catboost

We will work with the Amazon dataset (from the Amazon Employee Access Challenge), which is often used to demonstrate binary classification. The dataset can be loaded as follows:

from catboost.datasets import amazon

train_df, test_df = amazon()

Let's look at a summary of the dataframe:

train_df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32769 entries, 0 to 32768
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   ACTION            32769 non-null  int64
 1   RESOURCE          32769 non-null  int64
 2   MGR_ID            32769 non-null  int64
 3   ROLE_ROLLUP_1     32769 non-null  int64
 4   ROLE_ROLLUP_2     32769 non-null  int64
 5   ROLE_DEPTNAME     32769 non-null  int64
 6   ROLE_TITLE        32769 non-null  int64
 7   ROLE_FAMILY_DESC  32769 non-null  int64
 8   ROLE_FAMILY       32769 non-null  int64
 9   ROLE_CODE         32769 non-null  int64
dtypes: int64(10)
memory usage: 2.5 MB

There aren't any missing values. If you print the dataframe, the features may look numerical; in fact, they are categorical IDs with a high category count.

We can verify it as follows:

train_df.nunique()

Output:

ACTION                 2
RESOURCE            7518
MGR_ID              4243
ROLE_ROLLUP_1        128
ROLE_ROLLUP_2        177
ROLE_DEPTNAME        449
ROLE_TITLE           343
ROLE_FAMILY_DESC    2358
ROLE_FAMILY           67
ROLE_CODE            343
dtype: int64

Data preparation

First, we separate the features from the target (the ground-truth labels are stored in the ACTION column):

X = train_df.drop("ACTION", axis=1)
y = train_df["ACTION"]

Then, we declare the categorical features; since every column in this dataset is categorical, we simply pass all column indices:

cat_features = list(range(0, X.shape[1]))

Finally, we split the data into training and validation sets, as usual:

from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

Fitting the model

In this section, we will fit and evaluate the model.

from catboost import CatBoostClassifier

clf = CatBoostClassifier(
    iterations=5, 
    learning_rate=0.1, 
    #loss_function='CrossEntropy'
)


clf.fit(X_train, y_train, 
        cat_features=cat_features, 
        eval_set=(X_val, y_val), 
        verbose=False
)

We obtain the predicted class probabilities with

print(clf.predict_proba(X_val))
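
For a quick quantitative check of these probabilities, you can score the validation split with scikit-learn; ROC AUC on the positive-class probabilities is a common choice for this kind of binary task:

from sklearn.metrics import roc_auc_score

# probability of the positive class (ACTION == 1) for each validation instance
val_proba = clf.predict_proba(X_val)[:, 1]
print("Validation ROC AUC:", roc_auc_score(y_val, val_proba))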

We can also track additional metrics (here, AUC and accuracy) during training by passing them via the custom_loss parameter and enabling the interactive plot:

from catboost import CatBoostClassifier

clf = CatBoostClassifier(
    iterations=50,
    random_seed=42,
    learning_rate=0.5,
    custom_loss=['AUC', 'Accuracy']
)

clf.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_val, y_val),
    verbose=False,
    plot=True
)
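
The plot=True option draws an interactive training chart in Jupyter-like environments. If you prefer the raw numbers, the fitted model also exposes the per-iteration metric values; a short sketch (the exact key names can be checked by printing the dictionary):

# per-iteration metric values recorded during training
evals = clf.get_evals_result()
print(evals.keys())                    # typically 'learn' and 'validation'
print(evals["validation"]["AUC"][-1])  # AUC on the eval_set at the last iteration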

Conclusion

You are now familiar with the theoretical foundations of CatBoost and know how to run it in practice.
