CatBoost is a gradient boosting framework with several additional improvements. One of them is the ability to handle categorical features efficiently: unlike traditional gradient boosting implementations, which require extensive preprocessing of categorical data, CatBoost uses Ordered Target Encoding (also known as ordered target statistics), which automatically encodes categorical variables during training.
In this topic, we will briefly consider the theoretical foundations of the framework and see how to use CatBoost in practice.
The theoretical setup
One of the main motivations for CatBoost was to improve the handling of categorical features in gradient boosting models. Traditional methods often rely on techniques like one-hot encoding or label encoding, which can lead to several issues:
a. High dimensionality: One-hot encoding can result in a high-dimensional sparse feature space, which can slow down the training process and increase memory requirements.
b. Suboptimal ordering: Label encoding assumes an ordinal relationship between categories, which may not always be appropriate.
c. Information loss: These encoding methods fail to capture the relationship between categories and the target variable.
To address this, CatBoost introduces Ordered Target Encoding. This method replaces each categorical value with a numerical statistic, typically the mean of the target variable over instances that share that category. To avoid target leakage, the statistic for a given instance is computed only from the instances that precede it in a random permutation of the training data, so an instance's own label never contributes to its encoding. This captures the relationship between categories and the target variable while reducing the need for extensive feature engineering.
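To make this concrete, here is a minimal sketch of the idea in plain Python. It is not CatBoost's internal implementation; the function name, the prior of 0.5, and the exact smoothing formula are our own simplifications.
import numpy as np
import pandas as pd

def ordered_target_encode(categories: pd.Series, target: pd.Series, prior: float = 0.5) -> pd.Series:
    """Sketch of ordered target encoding: each row is encoded using only the
    rows that precede it in a random permutation, so its own target never
    leaks into its encoding."""
    perm = np.random.permutation(len(categories))
    sums, counts = {}, {}                          # running target sum / count per category
    encoded = pd.Series(index=categories.index, dtype=float)
    for i in perm:
        cat = categories.iloc[i]
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        encoded.iloc[i] = (s + prior) / (c + 1)    # smoothed mean over the "past" only
        sums[cat] = s + target.iloc[i]             # update the running statistics afterwards
        counts[cat] = c + 1
    return encoded
In practice you never call anything like this yourself: passing the categorical columns to CatBoost via cat_features is enough, and the library computes such statistics internally over several permutations.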
Another key motivation for CatBoost was a more robust way of mitigating overfitting, a common issue in gradient boosting models, especially on complex or high-dimensional datasets. CatBoost employs a technique called Ordered Boosting. In standard boosting, the residual for each training instance is computed by a model that was itself fitted on that instance, which biases the gradient estimates (the so-called prediction shift) and encourages overfitting. Ordered Boosting instead draws random permutations of the training data and computes the residual for each instance using a model trained only on the instances that precede it in the permutation, as sketched below. This leads to better generalization and reduced overfitting.
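The same principle can be illustrated with a deliberately naive toy function (one permutation, one model refit per instance, so it is far too slow to be practical and is not how CatBoost actually implements it): every residual comes from a model that has never seen that instance's target. Here fit_predict is a placeholder for any regressor, and X, y are assumed to be NumPy arrays.
import numpy as np

def ordered_residuals(X, y, fit_predict, seed=0):
    """Toy illustration of the 'ordered' principle behind Ordered Boosting."""
    rng = np.random.default_rng(seed)
    n = len(y)
    perm = rng.permutation(n)                 # random processing order of the instances
    residuals = np.zeros(n)
    for k, i in enumerate(perm):
        prev = perm[:k]                       # only instances that come "before" i
        pred = 0.0 if k == 0 else float(fit_predict(X[prev], y[prev], X[i:i + 1]))
        residuals[i] = y[i] - pred            # no self-leakage into the residual
    return residuals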
CatBoost also incorporates random subsampling of the training data for each new tree. This technique, similar to bagging, helps to introduce additional randomness and diversity into the ensemble, further reducing the risk of overfitting.
In addition to subsampling instances, CatBoost can also randomly subsample features when selecting splits. This approach, sometimes called feature bagging (in CatBoost, the random subspace method), further enhances the diversity of the ensemble and prevents individual features from dominating the model; the sketch below shows how these options appear in the API.
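These mechanisms are exposed directly through the training parameters. A rough illustration follows; the parameter names come from the CatBoost documentation, the values are illustrative rather than tuned, and some options (such as rsm) are CPU-only for most loss functions.
from catboost import CatBoostClassifier

clf = CatBoostClassifier(
    boosting_type='Ordered',     # ordered boosting instead of the classic 'Plain' scheme
    bootstrap_type='Bernoulli',  # sample rows for each tree
    subsample=0.8,               # fraction of instances used per tree
    rsm=0.8,                     # random subspace method: fraction of features considered per split
    iterations=100,
    learning_rate=0.1,
    random_seed=42,
)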
Example usage: getting the library and the data
As usual, you can install the CatBoost library with
pip install catboost
We will work with the Amazon dataset, which is typically used to demonstrate binary classification. The dataset can be loaded as
from catboost.datasets import amazon
train_df, test_df = amazon()
Let's look at the summary statistics:
train_df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32769 entries, 0 to 32768
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   ACTION            32769 non-null  int64
 1   RESOURCE          32769 non-null  int64
 2   MGR_ID            32769 non-null  int64
 3   ROLE_ROLLUP_1     32769 non-null  int64
 4   ROLE_ROLLUP_2     32769 non-null  int64
 5   ROLE_DEPTNAME     32769 non-null  int64
 6   ROLE_TITLE        32769 non-null  int64
 7   ROLE_FAMILY_DESC  32769 non-null  int64
 8   ROLE_FAMILY       32769 non-null  int64
 9   ROLE_CODE         32769 non-null  int64
dtypes: int64(10)
memory usage: 2.5 MB
There aren't any missing values. If you print the dataframe, the features may look numerical; in fact, they are categorical, with a high number of distinct categories.
We can verify it as follows:
train_df.nunique()
Output:
ACTION 2
RESOURCE 7518
MGR_ID 4243
ROLE_ROLLUP_1 128
ROLE_ROLLUP_2 177
ROLE_DEPTNAME 449
ROLE_TITLE 343
ROLE_FAMILY_DESC 2358
ROLE_FAMILY 67
ROLE_CODE 343
dtype: int64
Data preparation
First, we separate the features from the target (the ground-truth labels are stored in the ACTION column):
X = train_df.drop("ACTION", axis=1)
y = train_df["ACTION"]
Then, we declare categorical features:
cat_features = list(range(X.shape[1]))  # every column is categorical
Finally, we split the data into training and validation sets, as usual:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
Fitting the model
In this section, we will fit and evaluate the model.
from catboost import CatBoostClassifier

clf = CatBoostClassifier(
    iterations=5,        # a deliberately small number of trees for a quick first run
    learning_rate=0.1,
    # loss_function='CrossEntropy'
)
clf.fit(
    X_train, y_train,
    cat_features=cat_features,   # tell CatBoost which columns are categorical
    eval_set=(X_val, y_val),     # validation data evaluated during training
    verbose=False
)
We obtain the predictions with
print(clf.predict_proba(X_val))
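Since the target is binary, we can also score the fitted model on the held-out data ourselves. Below is a minimal sketch using scikit-learn's roc_auc_score; any other classification metric would work the same way.
from sklearn.metrics import roc_auc_score

# The probability of the positive class sits in the second column of predict_proba.
val_proba = clf.predict_proba(X_val)[:, 1]
print("Validation ROC AUC:", roc_auc_score(y_val, val_proba))
Also, we can look at the metrics with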
from catboost import CatBoostClassifier

clf = CatBoostClassifier(
    iterations=50,
    random_seed=42,
    learning_rate=0.5,
    custom_loss=['AUC', 'Accuracy']   # additional metrics to track during training
)
clf.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_val, y_val),
    verbose=False,
    plot=True                         # draws an interactive training chart (in Jupyter)
)
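The recorded metric values can also be retrieved programmatically after training. A small sketch using get_evals_result() follows; the exact dictionary keys, such as 'validation', may vary between CatBoost versions.
evals = clf.get_evals_result()
print(evals['validation']['AUC'][-3:])       # the last few validation AUC values
print(evals['validation']['Accuracy'][-3:])  # the last few validation accuracy values
Conclusion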
As a result, you are now familiar with the theoretical foundations of CatBoost and how to run it in practice.