In this topic, you're going to learn the basics of missing values (also called NaNs). First, we will define what a missing value is and why it occurs. Second, we will walk through the pandas methods for detecting and counting missing values in a dataset. Then we will learn how to delete NaNs using pandas.
What is a missing value?
A missing value is the absence of data. It can occur for various reasons. For example, if you have a dataset that contains students' test answers, a student may have skipped some of the questions. A problem might also have occurred while the data was being collected from a website, so some features weren't recorded. We will use pandas to read and process the data. pandas marks a missing value as NaN, which means "Not a Number".
Despite the name, NaNs don't necessarily stand for missing numbers: they can stand for missing strings, dates, or values of any other data type.
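As a small illustration of the point above, here is a sketch with a made-up table. The column names score, name, and exam_date and all the values are hypothetical and are not part of the students dataset. Each column has one gap of a different type; depending on the column type and the pandas version, the gap may be displayed as NaN, None, or NaT.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: every column has exactly one missing entry.
toy = pd.DataFrame({
    "score": [4.5, np.nan, 3.0],        # a missing number
    "name": ["Ana", None, "Rui"],       # a missing string
    "exam_date": pd.to_datetime(["2021-01-10", None, "2021-01-12"]),  # a missing date
})

print(toy)

# isnull() (covered later in this topic) treats all three gaps the same way.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

Whatever marker is printed, pandas counts one missing value in each of the three columns.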
Why are NaNs a problem?
Firstly, most machine learning models accept only data without missing values, so trying to train one on an unprepared dataset with NaNs will typically raise an error.
Secondly, if a feature contains a lot of missing values (more than 60-70%), it will likely be useless.
Nevertheless, sometimes NaN occurs for a good reason. For example, there is a dataset with a wide range of various medical tests as features. Usually, doctors do not carry out all possible medical tests, but only those that can help diagnose a suspected disease. So there may be some missing values in some fields of the dataset, simply because a patient is not being tested for abnormalities in those medical parameters. In this case, the data is not literally "missing": NaNs just indicate that a patient's missing test values aren't relevant to the diagnosis and are most likely within the normal range.
Therefore, handling missing values is crucial for the data pre-processing stage. Let's start with a simple example:
We will use part of the dataset about Portuguese students:
- school is the name of the school;
- age is a student's age;
- famsize is the size of a student's family ('LE3' if less than or equal to 3, or 'GT3' if greater than 3);
- studytime is weekly study time (1 means less than 2 hours, 2 means 2 to 5 hours, 3 means 5 to 10 hours, and 4 means more than 10 hours).
How do we find them?
Let's have a look at several methods in pandas that work with NaNs. Our dataset is stored in data.
- pandas.DataFrame.isnull() to mark which value is NaN and which isn't:
    data.isnull()
It returns True if the value is missing, otherwise it returns False:
In the screenshot above, sample 244 has a missing value in the school, age, and studytime fields. At the same time, 244 doesn't have NaN in the famsize field.
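To try this without the original file, here is a minimal sketch on a hand-made DataFrame. The four rows are hypothetical values that only mimic the columns described earlier; row 1 is built to resemble the sample discussed above, with everything except famsize missing.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, made up for illustration only.
data = pd.DataFrame({
    "school": ["GP", None, "MS", "GP"],
    "age": [18, np.nan, 17, 16],
    "famsize": ["GT3", "LE3", None, "GT3"],
    "studytime": [2, np.nan, 3, 1],
})

# The mask has the same shape as data: True where a value is missing.
mask = data.isnull()
print(mask)

# Look at a single row of the mask.
print(mask.loc[1])
```

The mask keeps the row labels and column names of the original DataFrame, so it can be inspected with the usual indexing tools such as loc.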
- A combination of methods to calculate the proportion of NaNs per feature:
    data.isnull().sum() / data.shape[0]
Let's figure it out step by step. Firstly, isnull() marks which value is NaN and which isn't. Secondly, sum() calculates how many Trues (True = NaN) per feature there are. Then, / data.shape[0] divides the number of missing values per feature by the number of rows. Finally, we get a proportion of missing values per column.
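The steps above can be sketched one at a time on a small hand-made DataFrame. The values are hypothetical, and the intermediate variable names (is_missing, missing_count, missing_share) are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, made up for illustration only.
data = pd.DataFrame({
    "school": ["GP", None, "MS", "GP"],
    "age": [18, np.nan, 17, np.nan],
    "famsize": ["GT3", "LE3", None, "GT3"],
    "studytime": [2, np.nan, 3, 1],
})

is_missing = data.isnull()               # step 1: Boolean mask
missing_count = is_missing.sum()         # step 2: number of Trues per column
n_rows = data.shape[0]                   # number of rows in the DataFrame
missing_share = missing_count / n_rows   # step 3: proportion per column

print(missing_share)

# Taking the mean of the Boolean mask gives the same proportions in one call.
print(data.isnull().mean())
```

In this toy table, age has two gaps out of four rows, so its share is 0.5, while the other columns have one gap each and a share of 0.25.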
- Finally, this is how we can check which columns contain at least one missing value:
    data.isnull().any()
In this case, there are NaNs in each column, so we received True for all the features.
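For contrast, here is a sketch on hypothetical data where one column is complete, so any() returns False for it. Chaining a second any() is one possible way to reduce the per-column answers to a single yes or no for the whole DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, made up for illustration only.
data = pd.DataFrame({
    "school": ["GP", "MS", "GP"],      # no missing values
    "age": [18, np.nan, 17],
    "studytime": [2, 3, np.nan],
})

# One Boolean per column: True if the column has at least one NaN.
has_missing = data.isnull().any()
print(has_missing)

# A single Boolean for the whole DataFrame.
any_missing_at_all = data.isnull().any().any()
print(any_missing_at_all)
```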
How to deal with them?
As we mentioned above, you have to somehow get rid of missing values before training a model. There are many ways to tackle this problem. The simplest one is to delete them.
- We can drop all the rows that contain missing values with pandas.DataFrame.dropna(axis=0):
    data.dropna(axis=0)
- We can also drop the columns that contain missing values by setting axis=1:
    data.dropna(axis=1)
By default, dropna() returns a new DataFrame and leaves the original one unchanged. To save the changes in the original DataFrame, set the inplace=True parameter.
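A minimal sketch of both options on hypothetical data follows. The variable names rows_dropped, cols_dropped, and cleaned are introduced here for illustration; the copy is made only so that the inplace=True call does not alter the toy data used in the earlier lines.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, made up for illustration only.
data = pd.DataFrame({
    "school": ["GP", "MS", "GP", "MS"],
    "age": [18, np.nan, 17, 16],
    "famsize": ["GT3", "LE3", None, "GT3"],
    "studytime": [2, 3, 1, 4],
})

rows_dropped = data.dropna(axis=0)   # keeps only rows without any NaN
cols_dropped = data.dropna(axis=1)   # keeps only columns without any NaN

print(rows_dropped.shape)
print(cols_dropped.columns.tolist())
print(data.shape)                    # the original DataFrame is unchanged

# With inplace=True the DataFrame itself is modified and nothing is returned.
cleaned = data.copy()
cleaned.dropna(axis=0, inplace=True)
print(cleaned.shape)
```

Here two of the four rows are complete, so dropping rows leaves two rows, and only school and studytime survive when dropping columns. Dropping columns can discard a lot of data for the sake of a few gaps, so it is worth checking the shapes before and after.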
Conclusion
Here is a diagram that summarizes the main information about NaNs to keep in mind:
This topic covers the basics of missing values. It is a good starting point for the exploration of more intelligent ways to deal with NaNs. For example, filling NaNs with certain values is a popular approach in cases when we can't just delete the data. You will learn about it in future topics.