Handling missing values

In the introductory topic about missing values, you saw how we can detect and delete them. In this topic, you are going to learn more advanced methods to handle missing values in your data.

Stating a problem

Why can't we always just delete NaNs (= missing values) and forget about them? Let's look at the dataset below.

        district  totsp  balconysp  dist2subway  price  security
0   Python-beach   77.0        5.8          NaN  142.0       NaN
1      East Java  100.0        8.6          0.9    NaN       NaN
2      East Java   64.0        NaN          0.7  120.0       0.0
3      East Java    NaN        4.1          0.7  110.0       1.0
4   Python-beach   75.0        NaN          1.6  143.0       NaN
5   Python-beach   60.0        4.2          NaN  155.0       NaN
6      East Java   71.0        NaN          0.3  170.0       NaN
7    Kotlin-side   75.0        4.1          0.8  122.0       1.0
8            NaN   76.0        5.1          1.2  131.0       NaN
9    Kotlin-side  134.0        8.8          NaN  730.0       NaN
10   Kotlin-side   58.0        NaN          2.8  108.0       NaN
11     East Java   79.0        6.1          NaN  130.0       1.0
12  Python-beach   79.0        NaN          1.1  118.0       NaN
13  Python-beach   80.0        5.8          NaN  160.0       NaN
14   Kotlin-side   63.0        3.4          NaN   95.0       NaN
15  Python-beach   85.0        4.8          1.3    NaN       NaN
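
If you want to run the snippets yourself, here is a minimal sketch that recreates this toy DataFrame (the values are copied from the printout above, and data is the name used throughout the topic):

import numpy as np
import pandas as pd

data = pd.DataFrame({
    'district': ['Python-beach', 'East Java', 'East Java', 'East Java',
                 'Python-beach', 'Python-beach', 'East Java', 'Kotlin-side',
                 np.nan, 'Kotlin-side', 'Kotlin-side', 'East Java',
                 'Python-beach', 'Python-beach', 'Kotlin-side', 'Python-beach'],
    'totsp': [77, 100, 64, np.nan, 75, 60, 71, 75,
              76, 134, 58, 79, 79, 80, 63, 85],
    'balconysp': [5.8, 8.6, np.nan, 4.1, np.nan, 4.2, np.nan, 4.1,
                  5.1, 8.8, np.nan, 6.1, np.nan, 5.8, 3.4, 4.8],
    'dist2subway': [np.nan, 0.9, 0.7, 0.7, 1.6, np.nan, 0.3, 0.8,
                    1.2, np.nan, 2.8, np.nan, 1.1, np.nan, np.nan, 1.3],
    'price': [142, np.nan, 120, 110, 143, 155, 170, 122,
              131, 730, 108, 130, 118, 160, 95, np.nan],
    'security': [np.nan, np.nan, 0, 1, np.nan, np.nan, np.nan, 1,
                 np.nan, np.nan, np.nan, 1, np.nan, np.nan, np.nan, np.nan],
})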

We apply the delete-all-rows-with-NaNs method:

data.dropna()

And that is what we are left with:

      district  totsp  balconysp  dist2subway  price  security
7  Kotlin-side   75.0        4.1          0.8  122.0       1.0

That happened because every row in our toy dataset contains at least one missing value. In reality, even a good dataset can have NaNs scattered across different columns in, say, 80% of its rows; dropna() would then delete 80% of the data. This method doesn't suit us, because along with the missing values we would lose a lot of valuable information.
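
Before deleting anything, it is worth counting the NaNs in each column; isna().sum() does exactly that:

data.isna().sum()

district        1
totsp           1
balconysp       5
dist2subway     6
price           2
security       12
dtype: int64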

Judging by these counts, it makes sense to drop the security column, which holds only 4 actual values in its 16 rows.

data.dropna(axis=1, thresh=10, inplace=True)

Recall the parameters: axis=1 deletes columns instead of rows, thresh=10 requires that a column have at least 10 non-NaN values to survive, and inplace=True applies the changes to the original DataFrame.

Here is the result:

        district  totsp  balconysp  dist2subway  price
0   Python-beach   77.0        5.8          NaN  142.0
1      East Java  100.0        8.6          0.9    NaN
2      East Java   64.0        NaN          0.7  120.0
3      East Java    NaN        4.1          0.7  110.0
4   Python-beach   75.0        NaN          1.6  143.0
5   Python-beach   60.0        4.2          NaN  155.0
6      East Java   71.0        NaN          0.3  170.0
7    Kotlin-side   75.0        4.1          0.8  122.0
8            NaN   76.0        5.1          1.2  131.0
9    Kotlin-side  134.0        8.8          NaN  730.0
10   Kotlin-side   58.0        NaN          2.8  108.0
11     East Java   79.0        6.1          NaN  130.0
12  Python-beach   79.0        NaN          1.1  118.0
13  Python-beach   80.0        5.8          NaN  160.0
14   Kotlin-side   63.0        3.4          NaN   95.0
15  Python-beach   85.0        4.8          1.3    NaN

Guessing missing data

In this section, we will get rid of NaNs without deleting any rows or columns. There are different methods of handling missing values in categorical and numerical features. In every case, we will use pandas.Series.fillna(), which takes a value and fills all the missing entries in a column with it.
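
Here is the idea on a tiny standalone Series before we apply it to our data (reusing the pandas and NumPy imports from above):

pd.Series([1.0, np.nan, 3.0]).fillna(0) # the NaN becomes 0.0

0    1.0
1    0.0
2    3.0
dtype: float64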

To move further, we need to know the data context. The dataset above contains information about flats in Hyperskill city:

  • district is the district where the flat is located

  • totsp is the total area of the apartment (m²)

  • balconysp is the area of the balcony in the apartment (m²)

  • dist2subway is the distance to the nearest subway station (km)

  • price is the cost of the flat in thousands of dollars

Here are some ways to handle missing values:

1) Fill NaNs with the most frequent value (the mode in the language of statistics) for categorical features:

mode_district = data['district'].mode()[0] # calculate the mode
data.fillna({'district': mode_district}, inplace=True) # replace NaNs with that mode
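
A small detail worth noting: .mode() returns a Series, because several values can be tied for the most frequent, so we take the first element with [0]. In our data, the most frequent district is Python-beach:

data['district'].mode()

0    Python-beach
dtype: object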

2) For numerical features, use the column average. In our dataset, we first process the totsp column:

mean_totsp = data['totsp'].mean() # calculate the average
data.fillna({'totsp': mean_totsp}, inplace=True) # replace NaNs with that average

We still have dist2subway left, the distance to the nearest subway station. Let's be more elegant here. Knowing the district of a flat helps us guess the distance more accurately than the average over all the data would. Let's fill the missing values with the average for the district where the given flat is located:

data["dist2subway"] = data.groupby("district", group_keys=False)["dist2subway"].apply(
    lambda x: x.fillna(x.mean())
)

It looks complicated, but let's take it step by step. First, data.groupby("district") groups the samples by district. Second, we take the dist2subway column for processing. Then we use the apply() method to apply the function inside it, the lambda x: ..., group-wise. The function receives the samples that belong to one district, calculates their average distance, and fills the missing values in that group with the result. Passing group_keys=False keeps the output aligned with the original index, so it can be assigned back to the column.
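
By the way, the same group-wise fill can be written with transform(), which computes each group's mean and aligns it back to the original index; many consider this spelling more idiomatic:

# an equivalent way to fill NaNs with per-district averages
group_means = data.groupby('district')['dist2subway'].transform('mean')
data['dist2subway'] = data['dist2subway'].fillna(group_means)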

3) Fill the missing values with the median for numerical features.
This is usually the way to go when a feature has outliers: they skew the average, so it no longer represents a typical value of the feature. Fortunately, outliers barely affect the median.
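
You can see this in our data: the flat priced at 730 drags the average far above a typical price, while the median barely moves:

data['price'].mean()   # about 173.9, inflated by the 730 outlier
data['price'].median() # 130.5, a typical price

So for the price column we use the median: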

median_price = data['price'].median() # calculate the median value
data.fillna({'price': median_price}, inplace=True) # replace NaNs with that value

4) Replace all NaNs with some fixed value.
In other words, try to guess a value that is actually meaningful; sometimes it's possible, sometimes not. In our toy dataset, we assume that a NaN in balconysp means the flat has no balcony, so its area equals zero. Replacing the missing values with 0 therefore makes sense.

data.fillna({'balconysp': 0}, inplace=True)

When we can't find a meaningful value, we fill NaNs with -1 to unite such observations into a separate group based on the absence of a value in the feature. Note that this trick works for both categorical and numerical features, as sketched below.
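
For illustration, had we kept the security column from the beginning of the topic, we could have united the flats with an unknown security status like this (a sketch only, since we dropped that column earlier):

data['security'] = data['security'].fillna(-1) # -1 marks "security status unknown"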

Finally, we end up with data free of missing values:

        district  totsp  balconysp  dist2subway  price
0   Python-beach   77.0        5.8         1.30  142.0
1      East Java  100.0        8.6         0.90  130.5
2      East Java   64.0        0.0         0.70  120.0
3      East Java   78.4        4.1         0.70  110.0
4   Python-beach   75.0        0.0         1.60  143.0
5   Python-beach   60.0        4.2         1.30  155.0
6      East Java   71.0        0.0         0.30  170.0
7    Kotlin-side   75.0        4.1         0.80  122.0
8   Python-beach   76.0        5.1         1.20  131.0
9    Kotlin-side  134.0        8.8         1.80  730.0
10   Kotlin-side   58.0        0.0         2.80  108.0
11     East Java   79.0        6.1         0.65  130.0
12  Python-beach   79.0        0.0         1.10  118.0
13  Python-beach   80.0        5.8         1.30  160.0
14   Kotlin-side   63.0        3.4         1.80   95.0
15  Python-beach   85.0        4.8         1.30  130.5
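
A quick check confirms there is nothing left to fill:

data.isna().sum().sum() # 0 missing values in the whole DataFrame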

Conclusion

In this topic, you explored different ways to calculate or guess a value for filling in missing data. Here is a cheat sheet:

[Figure: a cheat sheet on how to work with missing data, depending on the amount of missing values in the dataset]

This topic covers the basics of missing values and is a good starting point for exploring more sophisticated and intelligent methods of dealing with NaNs. As usual in data science, there is no universal algorithm: you try different approaches and keep track of your model's performance.
