Categorical features play an important role in various fields, such as machine learning and data analysis. These features represent qualitative or non-numerical data, such as gender, color, or product categories. However, before using categorical data in computational models, it is necessary to preprocess them into numerical values. This process is known as encoding.
In this topic, we will explore common techniques for preprocessing categorical features. Each approach has its advantages and disadvantages, which we will carefully examine. By understanding the strengths and limitations of these techniques, we can make informed decisions when dealing with categorical data.
Label encoding
Label encoding is a technique used to convert categorical data into numerical form. In this method, each unique category or label is assigned a unique numerical value. The numerical values are assigned in an ordered manner, starting from 0 or 1 and incrementing for each subsequent category.
Before Label Encoding:
| Category |
|---|
| Apple |
| Banana |
| Orange |
| Apple |
| Orange |
| Banana |
After Label Encoding:
| Category |
|---|
| 0 |
| 1 |
| 2 |
| 0 |
| 2 |
| 1 |
In this example, the original categorical values (Apple, Banana, Orange) are replaced with numerical labels using label encoding. Each unique category is assigned a unique numerical value, with Apple labeled as 0, Banana as 1, and Orange as 2. The dataset now consists of numerical labels that can be used for further analysis or modeling.
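The example above can be reproduced in a few lines of pandas. This is a minimal sketch using `pd.factorize`, which assigns integer codes in order of first appearance (which here happens to match the Apple=0, Banana=1, Orange=2 mapping shown in the table):

```python
import pandas as pd

# The fruit data from the example above
categories = pd.Series(["Apple", "Banana", "Orange", "Apple", "Orange", "Banana"])

# pd.factorize assigns an integer code to each category
# in order of first appearance: Apple -> 0, Banana -> 1, Orange -> 2
codes, uniques = pd.factorize(categories)

print(list(codes))    # [0, 1, 2, 0, 2, 1]
print(list(uniques))  # ['Apple', 'Banana', 'Orange']
```

Note that scikit-learn's `LabelEncoder` instead assigns codes in sorted (alphabetical) order, which gives the same result for this particular data.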
Advantages:
- Label encoding is simple and easy to implement.
- Label encoding does not add new columns, so the dimensionality of the dataset stays the same, making it memory-efficient.
Disadvantages:
- Label encoding may introduce unintended ordinality where none exists in the data.
- Algorithms that interpret numeric magnitude (for example, linear models or distance-based methods) may treat the assigned values as meaningful quantities, which can hurt performance.
One-Hot encoding
One-hot encoding is a technique used to convert categorical data into binary vectors. Each unique category is represented by a binary vector, where only one element is "hot" (1) and the rest are "cold" (0). This method creates new binary features, one for each category, and is particularly useful when there is no inherent order or hierarchy among the categories.
Before One-Hot Encoding:
| Category |
|---|
| Apple |
| Banana |
| Orange |
| Apple |
| Orange |
| Banana |
After One-Hot Encoding:
| Apple | Banana | Orange |
|---|---|---|
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
| 1 | 0 | 0 |
| 0 | 0 | 1 |
| 0 | 1 | 0 |
In this example, the original categorical values (Apple, Banana, Orange) are replaced with binary features using one-hot encoding. Each unique category has its binary feature, where the corresponding category is represented by a value of 1, and the other features have a value of 0. The dataset now consists of binary features that can be used for further analysis or modeling.
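The same transformation can be sketched with pandas' built-in `get_dummies`, which creates one binary column per unique category:

```python
import pandas as pd

# The fruit data from the example above
df = pd.DataFrame({"Category": ["Apple", "Banana", "Orange", "Apple", "Orange", "Banana"]})

# One binary (0/1) column per unique category value
one_hot = pd.get_dummies(df["Category"], dtype=int)

print(one_hot)
# Columns: Apple, Banana, Orange; one "hot" 1 per row
```

The number of columns in the result equals the number of unique categories, which is why dimensionality becomes a concern for high-cardinality features.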
Advantages:
- One-hot encoding preserves the individuality of each category without introducing any ordinality or hierarchy.
- One-hot encoding is suitable for situations where the number of categories is relatively small.
Disadvantages:
- One-hot encoding can lead to a significant increase in the dimensionality of the dataset, especially when there are many unique categories.
- It can lead to the "curse of dimensionality" problem, where the sparsity of data increases and computational resources become more demanding.
Ordinal encoding
Ordinal encoding is a technique used to convert categorical data into numerical form while preserving the order or rank among the categories. In this method, each unique category is assigned a numerical value based on its rank in a meaningful, predefined order (for example, Low &lt; Medium &lt; High), rather than the order in which it happens to appear in the data.
Before Ordinal Encoding:
| Category |
|---|
| Low |
| Medium |
| High |
| Low |
| High |
| Medium |
After Ordinal Encoding:
| Category |
|---|
| 1 |
| 2 |
| 3 |
| 1 |
| 3 |
| 2 |
In this example, the original categorical values (Low, Medium, High) are replaced with numerical values using ordinal encoding. The assigned numerical values reflect the order of the categories, with Low represented as 1, Medium as 2, and High as 3. The resulting dataset contains numerical representations that preserve the relative ranking among the categories.
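Because the order is domain knowledge rather than something pandas can infer, ordinal encoding is often just an explicit mapping. A minimal sketch for the Low/Medium/High example:

```python
import pandas as pd

df = pd.DataFrame({"Category": ["Low", "Medium", "High", "Low", "High", "Medium"]})

# The order must be supplied explicitly -- it encodes domain knowledge
order = {"Low": 1, "Medium": 2, "High": 3}
df["Encoded"] = df["Category"].map(order)

print(df["Encoded"].tolist())  # [1, 2, 3, 1, 3, 2]
```

scikit-learn's `OrdinalEncoder` offers the same idea via its `categories` parameter, which lets you pass the category order explicitly.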
Advantages:
- Ordinal encoding retains the ordinal relationship among the categories, which can be useful in certain analytical or modeling scenarios.
- It preserves the relative importance or precedence of the categories.
Disadvantages:
- It assumes an inherent order or ranking among the categories, which may not always be meaningful or accurate.
- May not be suitable for non-ordinal categorical features where the order is arbitrary.
Frequency encoding
Frequency encoding is a technique used to convert categorical data into numerical form based on the frequency of each category in the dataset. In this method, each unique category is assigned a numerical value representing the proportion or percentage of occurrences of that category in the dataset.
Before Frequency Encoding:
| Category |
|---|
| Apple |
| Banana |
| Orange |
| Apple |
| Orange |
| Banana |
| Apple |
After Frequency Encoding:
| Category |
|---|
| 0.429 |
| 0.286 |
| 0.286 |
| 0.429 |
| 0.286 |
| 0.286 |
| 0.429 |
In this example, the original categorical values (Apple, Banana, Orange) are replaced with numerical values using frequency encoding. Each category is assigned a numerical value representing the proportion of its occurrence in the dataset. Apple appears 3 times out of 7, so it is assigned a frequency value of 3/7 ≈ 0.429. Banana and Orange each appear 2 times out of 7, so they are assigned a frequency value of 2/7 ≈ 0.286. The resulting dataset reflects the varying frequencies of each category.
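In pandas, `value_counts(normalize=True)` computes exactly these proportions, and `map` applies them back to the column. A minimal sketch for the example above:

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["Apple", "Banana", "Orange", "Apple", "Orange", "Banana", "Apple"]
})

# Relative frequency of each category (proportions summing to 1)
freq = df["Category"].value_counts(normalize=True)
df["Encoded"] = df["Category"].map(freq)

print(df["Encoded"].round(3).tolist())
# [0.429, 0.286, 0.286, 0.429, 0.286, 0.286, 0.429]
```

Note that two different categories with the same count receive the same encoded value (here Banana and Orange both map to 2/7), which is one of the technique's drawbacks.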
Advantages:
- Frequency encoding captures the distribution and importance of each category based on their occurrence in the dataset.
- Frequency encoding can be effective for variables where the frequency of categories correlates with the target variable.
Disadvantages:
- It may not work well with rare categories that have low frequencies, as their encoded values may not adequately represent their significance.
- If frequencies are computed over the full dataset rather than the training split alone, information can leak between training and test data.
Target encoding
Target encoding is a technique used to convert categorical data into numerical form based on the target variable. It replaces each category with the average value of the target variable for that category.
Before target encoding:
| City | Sales ($) |
|---|---|
| London | 500 |
| Paris | 700 |
| London | 450 |
| Berlin | 600 |
| Paris | 800 |
After target encoding:
| City (encoded) | Sales ($) |
|---|---|
| 475 | 500 |
| 750 | 700 |
| 475 | 450 |
| 600 | 600 |
| 750 | 800 |
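In the table above, London is replaced by the mean of its sales (500 + 450) / 2 = 475, Paris by (700 + 800) / 2 = 750, and Berlin by 600. A minimal pandas sketch of this computation (in practice, the per-category means should be fitted on training folds only, to avoid the leakage discussed below):

```python
import pandas as pd

df = pd.DataFrame({
    "City":  ["London", "Paris", "London", "Berlin", "Paris"],
    "Sales": [500, 700, 450, 600, 800],
})

# Mean of the target per category, mapped back onto the column
city_means = df.groupby("City")["Sales"].mean()
df["City_encoded"] = df["City"].map(city_means)

print(df["City_encoded"].tolist())  # [475.0, 750.0, 475.0, 600.0, 750.0]
```

Libraries such as `category_encoders` provide target encoders with smoothing, which blends the per-category mean with the global mean to stabilize estimates for rare categories.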
Advantages:
- It captures the relationship between the categorical variable and the target variable more accurately.
- It provides a way to utilize the information from the target variable directly.
- It can be effective in improving the performance of machine learning models.
Limitations:
- Target encoding is prone to overfitting, especially when dealing with categories with very few instances.
- Target encoding is sensitive to imbalanced datasets, as rare categories might have unreliable target value representations.
Comparison and selection of encoding techniques
When choosing an encoding technique for categorical features, consider the following factors:
- Data type and cardinality: Determine if the feature is nominal (unordered categories) or ordinal (ordered categories). Also, consider the number of unique categories in the feature.
- Relationship between categories: Examine if there is an inherent order or hierarchy among the categories.
- Target variable relationship: Analyze if the categories' relationship with the target variable is important for the prediction task.
- Interpretability: Consider the ease of interpretation of the encoded values.
- Computational efficiency: Assess the computational resources required to implement the encoding technique.
The choice of encoding technique depends on the specific characteristics of the dataset and the goals of the analysis. Here are some recommendations:
- For nominal features with low cardinality, one-hot encoding is often a good choice.
- For ordinal features, ordinal encoding can capture the order or hierarchy.
- For high cardinality nominal features, consider target encoding, balancing the trade-off between dimensionality and capturing information.
- If interpretability is crucial, label encoding or ordinal encoding may be preferred.
- Experiment with different encoding techniques and evaluate their impact on model performance using appropriate validation methods.
Here is a table differentiating between the different encodings to help you choose:
| Encoding | Type of Variable | Supports High Cardinality | Handles Unseen Categories | Cons |
|---|---|---|---|---|
| Label Encoding | Nominal | Yes | No | May introduce unintended ordinality |
| One-Hot Encoding | Nominal | No | Yes | Increases dimensionality; Can be memory-intensive for high cardinality |
| Ordinal Encoding | Ordinal | Yes | Yes | Assumes an inherent order among the categories, which may not always be meaningful or accurate |
| Frequency Encoding | Nominal | Yes | Yes | May result in similar representations for categories with the same frequency |
| Target Encoding | Nominal | Yes | Yes | Prone to overfitting if not properly validated; Potential leakage if not used with caution |
Note: unseen categories are category values that the model encounters in the test data but that weren't present in the training set. For example, the training data contains "red", "green", and "blue", but the test set contains "red", "green", "blue", and "yellow" ("yellow" is the unseen category in this case).
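A quick sketch of why unseen categories matter: if the encoding mapping is built from the training data only (as it should be), any new category in the test data has no entry in the mapping and ends up as a missing value, which must then be handled explicitly.

```python
import pandas as pd

train = pd.Series(["red", "green", "blue"])
test = pd.Series(["red", "green", "blue", "yellow"])

# Build a label-encoding mapping from the training data only
mapping = {cat: code for code, cat in enumerate(pd.unique(train))}
encoded = test.map(mapping)

print(encoded.tolist())  # 'yellow' has no code, so it becomes NaN
```

Encoders that "handle unseen categories" typically offer a fallback, such as a reserved code or the global statistic (for example, scikit-learn's `OneHotEncoder(handle_unknown="ignore")` outputs all zeros for unknown values).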
Conclusion
In conclusion, preprocessing categorical features is an important step in machine learning and data analysis. Encoding techniques such as label encoding, one-hot encoding, ordinal encoding, frequency encoding, and target encoding offer different ways to convert categorical data into numerical form. Each technique has advantages and disadvantages, and the choice depends on factors such as data type, cardinality, relationship with the target variable, interpretability, and computational efficiency. It is recommended to carefully consider these factors and experiment with different techniques to select the most suitable encoding method for your specific dataset and prediction task.