Computer scienceData scienceInstrumentsPandasData preprocessing with pandas

Preprocessing categorical features with pandas

10 minutes read

Categorical data is a common form of data encountered in various machine learning and data analysis tasks. However, most machine learning algorithms require numerical data, making it essential to convert categorical features into a suitable numerical representation. An example of a categorical variable can be the payment method for customers. It can take the values "Credit Card", "Cash", "Online Payment", etc.

Label Encoding Methods

In this topic, we will take the example of payment methods and encode them using pandas and scikit-learn techniques. The main purpose of all the methods is to transform categorical labels into a unique numerical representation, where each unique label is mapped to a unique integer value.

payment_methods = [
    "Credit Card",
    "PayPal",
    "Credit Card",
    "Bank Transfer",
    "PayPal",
    "Credit Card",
]

scikit-learn's LabelEncoder

LabelEncoder() is a part of the scikit-learn library. The encoded labels of the LabelEncoder() range from 0 to (n_classes - 1), where n_classes is the number of unique categories in the categorical variable. To transform labels, we first initialize LabelEncoder() and then fit it to the categorical data. The .fit() method learns the unique categories present in the data. Then, we can transform our labels into integers. We can combine these two operations using the .fit_transform() method.

from sklearn.preprocessing import LabelEncoder


le = LabelEncoder()
le.fit(payment_methods)

encoded_labels = le.transform(payment_methods)
# array([1, 2, 1, 0, 2, 1])

Pandas factorize()

To use the pandas library for categorical encoding, we will first create a pandas DataFrame.

import pandas as pd

data = pd.DataFrame({'Payment Method': payment_methods})

The pandas factorize() function is used to encode categorical data. It returns a tuple containing two elements:

An array of integers representing the numerical labels of the categorical data.
An Index object that contains the unique categories in the data, along with their corresponding integer codes.

encoded_labels, unique_categories = pd.factorize(data['Payment Method'])
# encoded labels - [0 1 0 2 1 0]
# unique_categories - Index(['Credit Card', 'PayPal', 'Bank Transfer'], dtype='object')

Pandas get_dummies()

Let's add a new column to our DataFrame with the names of the clients.

data['Client Name'] = ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank']

Pandas get_dummies() function is used for one-hot encoding categorical data, converting categorical variables into a numerical representation. When we apply the get_dummies() to the categorical column with payment methods, it creates new binary columns for each unique payment method.

data = pd.get_dummies(data, columns = ['Payment Method'])

+----+---------------+--------------------------------+------------------------------+-------------------------+
|    | Client Name   |   Payment Method_Bank Transfer |   Payment Method_Credit Card |   Payment Method_PayPal |
|----+---------------+--------------------------------+------------------------------+-------------------------|
|  0 | Alice         |                              0 |                            1 |                       0 |
|  1 | Bob           |                              0 |                            0 |                       1 |
|  2 | Charlie       |                              0 |                            1 |                       0 |
|  3 | David         |                              1 |                            0 |                       0 |
|  4 | Eve           |                              0 |                            0 |                       1 |
|  5 | Frank         |                              0 |                            1 |                       0 |
+----+---------------+--------------------------------+------------------------------+-------------------------+

This approach is called one-hot encoding as we get binary features.

Target Encoding

Mean target encoding, also known as target mean encoding or target encoding, is a technique used to encode categorical data into numerical representations based on the target variable's mean value. It is commonly used in classification tasks when dealing with high-cardinality categorical variables.

Let's add a column for payment amount to our DataFrame.

data['Money'] = [110, 2220, 150, 56, 3000, 129]

To create target encoding, we first map the target column "Payment Method" to the mean values in the "Money" column. We then use this mapping to create a new encoded target column.

mean_amount_payment = data.groupby('Payment Method')['Money'].mean()
data['Mean_Payment_Method'] = data['Payment Method'].map(mean_amount_payment)

+----+------------------+---------------+---------+-----------------------+
|    | Payment Method   | Client Name   |   Money |   Mean_Payment_Method |
|----+------------------+---------------+---------+-----------------------|
|  0 | Credit Card      | Alice         |     110 |               129.667 |
|  1 | PayPal           | Bob           |    2220 |              2610     |
|  2 | Credit Card      | Charlie       |     150 |               129.667 |
|  3 | Bank Transfer    | David         |      56 |                56     |
|  4 | PayPal           | Eve           |    3000 |              2610     |
|  5 | Credit Card      | Frank         |     129 |               129.667 |
+----+------------------+---------------+---------+-----------------------+

Alternatively, you can perform the same operation in a single line:

data["Mean_Payment_Method"] = data.groupby('Payment Method')['Money'].transform("mean")

Mean target encoding helps to capture the relationship between the categorical feature and the target variable effectively. However, it risks overfitting, especially when applied to small datasets, or if certain categories have very few occurrences.

Frequency Encoding

Frequency encoding replaces each category with the number of times it appears in the dataset. To use frequency encoding, we first need to create a mapping between the target column and its frequencies using the .value_counts() method.

frequency_payments = data['Payment Method'].value_counts()

Then we use this mapping to encode our target column.

data['Frequency_Encoding'] = data['Payment Method'].map(frequency_payments)

+----+------------------+-----------------------------+
|    | Payment Method   |   Frequency_encoded_payment |
|----+------------------+-----------------------------|
|  0 | Credit Card      |                           3 |
|  1 | PayPal           |                           2 |
|  2 | Credit Card      |                           3 |
|  3 | Bank Transfer    |                           1 |
|  4 | PayPal           |                           2 |
|  5 | Credit Card      |                           3 |
+----+------------------+-----------------------------+

Frequency encoding can be effective for variables where the frequency of categories correlates with the target variable. However, we should pay attention to the distribution of frequencies. Rare categories with low frequencies may not have their significance adequately represented by their encoded values.

Ordinal Encoding

Ordinal, or rank encoding, is a technique that converts categorical data into numerical representations by assigning integers based on the categories' ordinal relationships. Thus, the first category in the dataset should be labeled as 1, the second as 2, and so on. To implement this in pandas, we can use the previously mentioned factorize() function and add 1 to the values, ensuring our ranking starts from 1.

data['Ordinal_Payment'] = pd.factorize(data['Payment Method'])[0] + 1

+----+------------------+-------------------+
|    | Payment Method   |   Ordinal_Payment |
|----+------------------+-------------------|
|  0 | Credit Card      |                 1 |
|  1 | PayPal           |                 2 |
|  2 | Credit Card      |                 1 |
|  3 | Bank Transfer    |                 3 |
|  4 | PayPal           |                 2 |
|  5 | Credit Card      |                 1 |
+----+------------------+-------------------+

Conclusion

Preprocessing categorical features is vital in machine learning. We've covered various encoding techniques such as label, one-hot, ordinal, frequency, and target encoding. These can be implemented using pandas mappings, factorize(), or get_dummies() functions. Each method has benefits and drawbacks, with the choice depending on factors like data type, cardinality, and the target variable relationship.

How did you like the theory?

Report a typo