
Preprocessing categorical features

10-minute read

Categorical features play an important role in various fields, such as machine learning and data analysis. These features represent qualitative or non-numerical data, such as gender, color, or product categories. However, before using categorical data in computational models, it is necessary to preprocess them into numerical values. This process is known as encoding.

In this topic, we will explore common techniques for preprocessing categorical features. Each approach has its advantages and disadvantages, which we will carefully examine. By understanding the strengths and limitations of these techniques, we can make informed decisions when dealing with categorical data.

Label encoding

Label encoding is a technique used to convert categorical data into numerical form. In this method, each unique category or label is assigned a unique numerical value. The numerical values are assigned in an ordered manner, starting from 0 or 1 and incrementing for each subsequent category.

Before and after label encoding:

Category   Encoded
Apple      0
Banana     1
Orange     2
Apple      0
Orange     2
Banana     1

In this example, the original categorical values (Apple, Banana, Orange) are replaced with numerical labels using label encoding. Each unique category is assigned a unique numerical value, with Apple labeled as 0, Banana as 1, and Orange as 2. The dataset now consists of numerical labels that can be used for further analysis or modeling.

Advantages:

  1. Label encoding is simple and easy to implement.
  2. Label encoding does not increase the dimensionality of the dataset, so it remains memory-efficient even for features with many unique categories.

Disadvantages:

  1. Label encoding may introduce unintended ordinality where none exists in the data.
  2. It may result in a bias towards the numerical values assigned, which can impact the performance of certain algorithms.
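
As a sketch, label encoding can be done with scikit-learn's `LabelEncoder`, using the fruit data from the example above (note that `LabelEncoder` assigns labels in alphabetical order, which happens to match the example):

```python
from sklearn.preprocessing import LabelEncoder

categories = ["Apple", "Banana", "Orange", "Apple", "Orange", "Banana"]

# Each unique category is mapped to an integer (alphabetical order: Apple=0, ...)
encoder = LabelEncoder()
encoded = encoder.fit_transform(categories)

print(encoded.tolist())           # [0, 1, 2, 0, 2, 1]
print(encoder.classes_.tolist())  # ['Apple', 'Banana', 'Orange']
```

`LabelEncoder` also provides `inverse_transform` to map the integers back to the original categories.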

One-Hot encoding

One-hot encoding is a technique used to convert categorical data into binary vectors. Each unique category is represented by a binary vector, where only one element is "hot" (1) and the rest are "cold" (0). This method creates new binary features, one for each category, and is particularly useful when there is no inherent order or hierarchy among the categories.

Before and after one-hot encoding:

Category   Apple   Banana   Orange
Apple      1       0        0
Banana     0       1        0
Orange     0       0        1
Apple      1       0        0
Orange     0       0        1
Banana     0       1        0

In this example, the original categorical values (Apple, Banana, Orange) are replaced with binary features using one-hot encoding. Each unique category has its binary feature, where the corresponding category is represented by a value of 1, and the other features have a value of 0. The dataset now consists of binary features that can be used for further analysis or modeling.

Advantages:

  1. One-hot encoding preserves the individuality of each category without introducing any ordinality or hierarchy.
  2. One-hot encoding is suitable for situations where the number of categories is relatively small.

Disadvantages:

  1. One-hot encoding can lead to a significant increase in the dimensionality of the dataset, especially when there are many unique categories.
  2. It can lead to the "curse of dimensionality" problem, where the sparsity of data increases and computational resources become more demanding.
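
A minimal sketch of one-hot encoding with pandas, mirroring the fruit example (pandas creates one binary column per unique category, in alphabetical order):

```python
import pandas as pd

df = pd.DataFrame({"Category": ["Apple", "Banana", "Orange",
                                "Apple", "Orange", "Banana"]})

# get_dummies creates one binary column per unique category
one_hot = pd.get_dummies(df["Category"], dtype=int)
print(one_hot)
```

For a machine learning pipeline, scikit-learn's `OneHotEncoder` offers the same transformation with the added benefit of remembering the fitted categories for later use on new data.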

Ordinal encoding

Ordinal encoding is a technique used to convert categorical data into numerical form while preserving the order or rank among the categories. In this method, each unique category is assigned a numerical value based on its rank in a meaningful order defined for the feature (for example, Low < Medium < High), rather than its order of appearance in the data.

Before and after ordinal encoding:

Category   Encoded
Low        1
Medium     2
High       3
Low        1
High       3
Medium     2

In this example, the original categorical values (Low, Medium, High) are replaced with numerical values using ordinal encoding. The assigned numerical values reflect the order of the categories, with Low represented as 1, Medium as 2, and High as 3. The resulting dataset contains numerical representations that preserve the relative ranking among the categories.

Advantages:

  1. Ordinal encoding retains the ordinal relationship among the categories, which can be useful in certain analytical or modeling scenarios.
  2. It preserves the relative importance or precedence of the categories.

Disadvantages:

  1. It assumes an inherent order or ranking among the categories, which may not always be meaningful or accurate.
  2. May not be suitable for non-ordinal categorical features where the order is arbitrary.
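
A sketch using scikit-learn's `OrdinalEncoder`; the category order must be supplied explicitly, otherwise it defaults to alphabetical order. Note that scikit-learn starts numbering at 0, whereas the example above starts at 1:

```python
from sklearn.preprocessing import OrdinalEncoder

data = [["Low"], ["Medium"], ["High"], ["Low"], ["High"], ["Medium"]]

# Pass the meaningful order explicitly: Low < Medium < High
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
encoded = encoder.fit_transform(data)

print(encoded.ravel().tolist())  # [0.0, 1.0, 2.0, 0.0, 2.0, 1.0]
```

Passing the order explicitly matters: with the default alphabetical ordering, "High" would come before "Low", silently reversing the intended ranking.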

Frequency encoding

Frequency encoding is a technique used to convert categorical data into numerical form based on the frequency of each category in the dataset. In this method, each unique category is assigned a numerical value representing the proportion or percentage of occurrences of that category in the dataset.

Before and after frequency encoding:

Category   Encoded
Apple      0.429
Banana     0.286
Orange     0.286
Apple      0.429
Orange     0.286
Banana     0.286
Apple      0.429

In this example, the original categorical values (Apple, Banana, Orange) are replaced with numerical values using frequency encoding. Each category is assigned a numerical value representing the proportion of its occurrences in the dataset. Apple appears 3 times out of 7, so it is assigned a frequency value of 3/7 ≈ 0.429. Banana and Orange each appear 2 times out of 7, so they are assigned a frequency value of 2/7 ≈ 0.286. The resulting dataset reflects the varying frequencies of each category.

Advantages:

  1. Frequency encoding captures the distribution and importance of each category based on their occurrence in the dataset.
  2. Frequency encoding can be effective for variables where the frequency of categories correlates with the target variable.

Disadvantages:

  1. It may not work well with rare categories that have low frequencies, as their encoded values may not adequately represent their significance.
  2. Frequency encoding may lead to overfitting if not properly handled, as it can inadvertently leak information about the target variable.
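
A minimal frequency-encoding sketch with pandas, reproducing the seven-row fruit example: `value_counts(normalize=True)` computes each category's proportion, and `map` substitutes it into the column:

```python
import pandas as pd

df = pd.DataFrame({"Category": ["Apple", "Banana", "Orange", "Apple",
                                "Orange", "Banana", "Apple"]})

# Map each category to its relative frequency in the column
freq = df["Category"].value_counts(normalize=True)
df["Category_freq"] = df["Category"].map(freq)

print(df["Category_freq"].round(3).tolist())
# [0.429, 0.286, 0.286, 0.429, 0.286, 0.286, 0.429]
```

On real projects, compute `freq` on the training split only and reuse it for the test split, so the encoding does not depend on data the model should not have seen.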

Target encoding

Target encoding is a technique used to convert categorical data into numerical form based on the target variable. It replaces each category with the average value of the target variable for that category.

Before target encoding:

City     Sales ($)
London   500
Paris    700
London   450
Berlin   600
Paris    800

After target encoding:

City (encoded)   Sales ($)
475              500
750              700
475              450
600              600
750              800

Each city is replaced with the mean of Sales for that city: London → (500 + 450) / 2 = 475, Paris → (700 + 800) / 2 = 750, and Berlin → 600.

Advantages:

  1. It captures the relationship between the categorical variable and the target variable more accurately.
  2. It provides a way to utilize the information from the target variable directly.
  3. It can be effective in improving the performance of machine learning models.

Limitations:

  1. Target encoding is prone to overfitting, especially when dealing with categories with very few instances.
  2. Target encoding is sensitive to imbalanced datasets, as rare categories might have unreliable target value representations.
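
A sketch of plain target encoding with pandas, using the city/sales example. This naive version computes the means on the full dataset for clarity; in practice, the means should come from the training data only (or be computed out-of-fold) to limit target leakage:

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["London", "Paris", "London", "Berlin", "Paris"],
    "Sales": [500, 700, 450, 600, 800],
})

# Replace each city with the mean Sales for that city
city_means = df.groupby("City")["Sales"].mean()
df["City_encoded"] = df["City"].map(city_means)

print(df["City_encoded"].tolist())  # [475.0, 750.0, 475.0, 600.0, 750.0]
```

Libraries such as category_encoders provide target encoders with built-in smoothing toward the global mean, which helps with the overfitting issue noted above for rare categories.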

Comparison and selection of encoding techniques:

When choosing an encoding technique for categorical features, consider the following factors:

  1. Data type and cardinality: Determine if the feature is nominal (unordered categories) or ordinal (ordered categories). Also, consider the number of unique categories in the feature.
  2. Relationship between categories: Examine if there is an inherent order or hierarchy among the categories.
  3. Target variable relationship: Analyze if the categories' relationship with the target variable is important for the prediction task.
  4. Interpretability: Consider the ease of interpretation of the encoded values.
  5. Computational efficiency: Assess the computational resources required to implement the encoding technique.

The choice of encoding technique depends on the specific characteristics of the dataset and the goals of the analysis. Here are some recommendations:

  1. For nominal features with low cardinality, one-hot encoding is often a good choice.
  2. For ordinal features, ordinal encoding can capture the order or hierarchy.
  3. For high cardinality nominal features, consider target encoding, balancing the trade-off between dimensionality and capturing information.
  4. If interpretability is crucial, label encoding or ordinal encoding may be preferred.
  5. Experiment with different encoding techniques and evaluate their impact on model performance using appropriate validation methods.

Here is a table differentiating between the different encodings to help you choose:

Encoding            Type of Variable   Supports High Cardinality   Handles Unseen Categories   Cons
Label Encoding      Nominal            Yes                         No                          May introduce unintended ordinality
One-Hot Encoding    Nominal            No                          Yes                         Increases dimensionality; memory-intensive for high cardinality
Ordinal Encoding    Ordinal            Yes                         Yes                         Assumes an inherent order that may not be meaningful or accurate
Frequency Encoding  Nominal            Yes                         Yes                         Categories with the same frequency get identical representations
Target Encoding     Nominal            Yes                         Yes                         Prone to overfitting if not properly validated; potential target leakage

Note: unseen categories are categories that appear in the test data but were not present in the training set. For example, if the training data contains "red", "green", and "blue", but the test set contains "red", "green", "blue", and "yellow", then "yellow" is the unseen category.

Remember that there is no one-size-fits-all encoding technique, and it's essential to evaluate and compare their performance on your specific dataset to make an informed decision.

Conclusion

In conclusion, preprocessing categorical features is an important step in machine learning and data analysis. Encoding techniques such as label encoding, one-hot encoding, ordinal encoding, frequency encoding, and target encoding offer different ways to convert categorical data into numerical form. Each technique has advantages and disadvantages, and the choice depends on factors such as data type, cardinality, relationship with the target variable, interpretability, and computational efficiency. It is recommended to carefully weigh these factors and experiment with different techniques to select the most suitable encoding method for your specific dataset and prediction task.
