In the introductory topic about missing values, you saw how we can detect and delete them. In this topic, you are going to learn more advanced methods to handle missing values in your data.
Stating a problem
Why can't we always just delete NaNs (= missing values) and forget about them? Let's look at the dataset below.
district totsp balconysp dist2subway price security
0 Python-beach 77.0 5.8 NaN 142.0 NaN
1 East Java 100.0 8.6 0.9 NaN NaN
2 East Java 64.0 NaN 0.7 120.0 0.0
3 East Java NaN 4.1 0.7 110.0 1.0
4 Python-beach 75.0 NaN 1.6 143.0 NaN
5 Python-beach 60.0 4.2 NaN 155.0 NaN
6 East Java 71.0 NaN 0.3 170.0 NaN
7 Kotlin-side 75.0 4.1 0.8 122.0 1.0
8 NaN 76.0 5.1 1.2 131.0 NaN
9 Kotlin-side 134.0 8.8 NaN 730.0 NaN
10 Kotlin-side 58.0 NaN 2.8 108.0 NaN
11 East Java 79.0 6.1 NaN 130.0 1.0
12 Python-beach 79.0 NaN 1.1 118.0 NaN
13 Python-beach 80.0 5.8 NaN 160.0 NaN
14 Kotlin-side 63.0 3.4 NaN 95.0 NaN
15 Python-beach 85.0 4.8 1.3 NaN NaN
We apply the delete-all-rows-with-NaNs method:
data.dropna()
And that is what we are left with:
district totsp balconysp dist2subway price security
7 Kotlin-side 75.0 4.1 0.8 122.0 1.0
That happened because every row in our toy dataset contains at least one missing value. In real data, even a good dataset can have NaNs scattered across different columns in, say, 80% of its rows, and dropna() would then delete 80% of the data. This method isn't suitable for us because, together with the missing values, we would lose a lot of valuable information.
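Before deleting anything, it helps to measure how much is actually missing. A minimal sketch with a small hypothetical frame (the values are illustrative, not the exact table above):

```python
import numpy as np
import pandas as pd

# Toy frame with NaNs scattered across different columns
df = pd.DataFrame({
    "totsp":    [77.0, np.nan, 64.0, 75.0],
    "price":    [142.0, 120.0, np.nan, 143.0],
    "security": [np.nan, np.nan, 0.0, np.nan],
})

nan_per_column = df.isna().sum()             # NaN count in each column
nan_share = df.isna().mean()                 # share of NaNs per column
rows_with_nan = df.isna().any(axis=1).sum()  # rows containing at least one NaN

print(rows_with_nan)   # 4 -- every row has a NaN, so dropna() would empty the frame
print(len(df.dropna()))  # 0
```

Looking at `isna().sum()` first tells you whether dropping rows is even an option.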
However, it makes sense to drop the security column, which consists almost entirely of missing values.
data.dropna(axis=1, thresh=10, inplace=True)
Recall the parameters: axis=1 deletes columns instead of rows, thresh=10 requires that a column has at least 10 non-NaNs to survive, and inplace=True saves the changes in the original DataFrame.
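The thresh logic can be checked on a tiny hypothetical frame: a column survives only if it has at least thresh non-NaN values.

```python
import numpy as np
import pandas as pd

# Hypothetical 4-row frame: 'security' has only one non-NaN value
df = pd.DataFrame({
    "totsp":    [77.0, 100.0, 64.0, np.nan],   # 3 non-NaN values
    "security": [np.nan, np.nan, 0.0, np.nan], # 1 non-NaN value
})

# Keep only columns with at least 3 non-NaN values
df.dropna(axis=1, thresh=3, inplace=True)
print(df.columns.tolist())  # ['totsp']
```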
Here is the result:
district totsp balconysp dist2subway price
0 Python-beach 77.0 5.8 NaN 142.0
1 East Java 100.0 8.6 0.9 NaN
2 East Java 64.0 NaN 0.7 120.0
3 East Java NaN 4.1 0.7 110.0
4 Python-beach 75.0 NaN 1.6 143.0
5 Python-beach 60.0 4.2 NaN 155.0
6 East Java 71.0 NaN 0.3 170.0
7 Kotlin-side 75.0 4.1 0.8 122.0
8 NaN 76.0 5.1 1.2 131.0
9 Kotlin-side 134.0 8.8 NaN 730.0
10 Kotlin-side 58.0 NaN 2.8 108.0
11 East Java 79.0 6.1 NaN 130.0
12 Python-beach 79.0 NaN 1.1 118.0
13 Python-beach 80.0 5.8 NaN 160.0
14 Kotlin-side 63.0 3.4 NaN 95.0
15 Python-beach 85.0 4.8 1.3 NaN
Guessing missing data
In this section, we will get rid of NaNs without deleting any rows or columns. Categorical and numerical features call for different strategies, but in every case we will use the fillna() method, which takes a replacement value (or a mapping of column names to values) and fills all the holes.
To move further, we need to know the data context. The dataset above contains information about flats in Hyperskill city:
- district is the district of the flat
- totsp is the total area of an apartment (m2)
- balconysp is the area of the balcony in the apartment (m2)
- dist2subway is the distance to the nearest subway station (km)
- price is the cost of a flat in thousands of dollars
Here are some ways to handle missing values:
1) Fill NaNs with the most frequent value (the mode in the language of statistics) for categorical features:
mode_district = data['district'].mode()[0] # calculate the mode
data.fillna({'district': mode_district}, inplace=True) # replace NaNs with that mode
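The mode-fill step can be reproduced end to end on a hypothetical column (the values below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing district
df = pd.DataFrame({"district": ["East Java", "East Java", "Kotlin-side", np.nan]})

mode_district = df["district"].mode()[0]             # most frequent value: 'East Java'
df.fillna({"district": mode_district}, inplace=True)  # replace NaNs with the mode

print(df["district"].tolist())  # ['East Java', 'East Java', 'Kotlin-side', 'East Java']
```

Note that mode() returns a Series (there can be ties), which is why we take element [0].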
2) For numerical features, use the column average. In our dataset, we first process the totsp column:
mean_totsp = data['totsp'].mean() # calculate the average
data.fillna({'totsp': mean_totsp}, inplace=True) # replace NaNs with that average
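A self-contained sketch of the mean fill, with hypothetical areas chosen so the arithmetic is easy to verify:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"totsp": [60.0, 80.0, np.nan, 100.0]})

mean_totsp = df["totsp"].mean()                 # (60 + 80 + 100) / 3 = 80.0; NaN is ignored
df.fillna({"totsp": mean_totsp}, inplace=True)  # the NaN becomes 80.0

print(df["totsp"].tolist())  # [60.0, 80.0, 80.0, 100.0]
```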
We still have dist2subway, the distance to the nearest subway station. Let's be more elegant here: knowing a flat's district helps us guess the distance more accurately than the average over all the data does. Let's fill each missing value with the average for the district where the given flat is located:
data["dist2subway"] = data.groupby("district", group_keys=False)["dist2subway"].apply(
lambda x: x.fillna(x.mean())
)
Looks complicated, but let's take it step by step. First, data.groupby("district") groups the rows by district. Second, we select the dist2subway column to process. Then apply() runs the function inside it, lambda x: ..., once per group: the function receives the rows that belong to one district, calculates their average distance, and fills that group's missing values with the average. group_keys=False keeps the result indexed like the original column, so it can be assigned back.
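The group-wise fill can be traced on a tiny hypothetical frame with two districts, so each filled value is easy to predict by hand:

```python
import numpy as np
import pandas as pd

# Hypothetical flats in two districts, with some missing distances
df = pd.DataFrame({
    "district":    ["A", "A", "A", "B", "B"],
    "dist2subway": [1.0, 3.0, np.nan, 0.5, np.nan],
})

# Fill each NaN with the mean distance of the same district:
# district A mean = (1 + 3) / 2 = 2.0, district B mean = 0.5
df["dist2subway"] = df.groupby("district", group_keys=False)["dist2subway"].apply(
    lambda x: x.fillna(x.mean())
)

print(df["dist2subway"].tolist())  # [1.0, 3.0, 2.0, 0.5, 0.5]
```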
3) Fill the missing values with a median value for numerical features.
This is usually the way to choose when a feature has outliers. They affect the average, so it no longer represents a typical value of this feature. Fortunately, outliers don't bother the median value.
median_price = data['price'].median() # calculate the median value
data.fillna({'price': median_price}, inplace=True) # replace NaNs with that value
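The effect of an outlier on the two statistics is easy to demonstrate with hypothetical prices that include one extreme value:

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one extreme outlier (730)
df = pd.DataFrame({"price": [95.0, 108.0, 120.0, np.nan, 730.0]})

mean_price = df["price"].mean()      # (95 + 108 + 120 + 730) / 4 = 263.25, dragged up
median_price = df["price"].median()  # (108 + 120) / 2 = 114.0, robust to the outlier
df.fillna({"price": median_price}, inplace=True)

print(df["price"].tolist())  # [95.0, 108.0, 120.0, 114.0, 730.0]
```

Filling with the mean here would plant an implausible price of 263.25 into the data; the median stays close to typical flats.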
4) Replace all NaNs with some value.
In other words, you might try and guess a value that is meaningful. Sometimes it's not possible. In our toy dataset, we suppose that NaN in balconysp means that there is no balcony in a flat, so its area equals zero. We replace missing values with 0, and it makes some sense.
data.fillna({'balconysp': 0}, inplace=True)
When we can't find a meaningful value, we fill NaNs with -1 to unite these observations in a separate group based on the absence of a value in some feature. Note that the method works for both categorical and numerical features.
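Both fills from this section can be done in one fillna() call with a column-to-value mapping; the frame below is a hypothetical two-column slice:

```python
import numpy as np
import pandas as pd

# balconysp: NaN plausibly means "no balcony", so 0 is meaningful.
# security: no meaningful guess, so use -1 as a sentinel for "value absent".
df = pd.DataFrame({
    "balconysp": [5.8, np.nan, 4.1],
    "security":  [np.nan, 0.0, np.nan],
})

df.fillna({"balconysp": 0, "security": -1}, inplace=True)

print(df["balconysp"].tolist())  # [5.8, 0.0, 4.1]
print(df["security"].tolist())   # [-1.0, 0.0, -1.0]
```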
Finally, we get a dataset without any missing values:
district totsp balconysp dist2subway price
0 Python-beach 77.0 5.8 1.30 142.0
1 East Java 100.0 8.6 0.90 130.5
2 East Java 64.0 0.0 0.70 120.0
3 East Java 78.4 4.1 0.70 110.0
4 Python-beach 75.0 0.0 1.60 143.0
5 Python-beach 60.0 4.2 1.30 155.0
6 East Java 71.0 0.0 0.30 170.0
7 Kotlin-side 75.0 4.1 0.80 122.0
8 Python-beach 76.0 5.1 1.20 131.0
9 Kotlin-side 134.0 8.8 1.80 730.0
10 Kotlin-side 58.0 0.0 2.80 108.0
11 East Java 79.0 6.1 0.65 130.0
12 Python-beach 79.0 0.0 1.10 118.0
13 Python-beach 80.0 5.8 1.30 160.0
14 Kotlin-side 63.0 3.4 1.80 95.0
15 Python-beach 85.0 4.8 1.30 130.5Conclusion
In this topic, you explored different ideas on how to calculate or guess a value to fill missing values.
The topic covers some basics about missing values. It is a good starting point for exploring more complicated and intelligent methods to deal with NaNs. As usual, in data science, there is no universal algorithm — you just try different approaches and keep track of your model's performance.