
Working with missing values


In this topic, you're going to learn the basics of missing values (also called NaNs). First, we will define what a missing value is and why it occurs. Second, we will walk through the pandas methods for detecting and counting missing values in a dataset. Finally, we will learn how to delete NaNs using pandas.

What is a missing value?

A missing value is the absence of data. It can occur for various reasons. For example, if you have a dataset that contains students' test answers, a student may have skipped some of the questions. A problem might also have occurred while the data was being collected from a website, so some features weren't recorded. We will use pandas to read and process the data. pandas marks a missing value as NaN, which stands for "Not a Number".

NaNs don't necessarily stand for missing numbers: they can stand for missing strings, dates, or values of any other data type.
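As a small illustration of this note, the sketch below builds three tiny Series with made-up values (the names numbers, strings, and dates are hypothetical) and checks that pandas reports the gap as missing in each of them, whatever the data type:

```python
import numpy as np
import pandas as pd

# Made-up values: each Series has one real entry and one missing entry
numbers = pd.Series([1.5, np.nan])                          # missing number
strings = pd.Series(["GP", None])                           # missing string
dates = pd.to_datetime(pd.Series(["2021-01-01", None]))     # missing date (shown as NaT)

# isnull() flags the missing entry in every case
print(numbers.isnull().tolist())
print(strings.isnull().tolist())
print(dates.isnull().tolist())
```

For dates, pandas displays the missing entry as NaT ("Not a Time"), but the detection methods treat it the same way as NaN.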

Why are NaNs a problem?

Firstly, most machine learning models only accept data without missing values. We will usually face an error when trying to use an unprepared dataset with NaNs.

Secondly, if a feature contains a lot of missing values (more than 60-70%), it will likely be useless.

Nevertheless, sometimes NaN occurs for a good reason. For example, there is a dataset with a wide range of various medical tests as features. Usually, doctors do not carry out all possible medical tests, but only those that can help diagnose a suspected disease. So there may be some missing values in some fields of the dataset, simply because a patient is not being tested for abnormalities in those medical parameters. In this case, the data is not literally "missing": NaNs just indicate that a patient's missing test values aren't relevant to the diagnosis and are most likely within the normal range.

Therefore, handling missing values is crucial for the data pre-processing stage. Let's start with a simple example:

A sample dataframe with a couple of NaN values

We will use part of the dataset about Portuguese students:

  • school is the name of the school;

  • age is a student's age;

  • famsize is the size of a student's family ('LE3' if less than or equal to 3, or 'GT3' if greater than 3);

  • studytime is weekly study time (1 means less than 2 hours, 2 means 2 to 5 hours, 3 means 5 to 10 hours, and 4 means more than 10 hours).
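To follow along without the original file, a small stand-in DataFrame with the same four columns can be built by hand. This is only a sketch: the values below are invented for illustration and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the students dataset (all values are made up)
data = pd.DataFrame(
    {
        "school": ["GP", "GP", np.nan, "MS", "MS"],
        "age": [18.0, 17.0, np.nan, 16.0, 19.0],
        "famsize": ["GT3", np.nan, "LE3", "GT3", "LE3"],
        "studytime": [2.0, 3.0, np.nan, np.nan, 1.0],
    }
)

print(data)
```

Every column contains at least one NaN, and the row with index 2 is missing everything except famsize.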

How do we find them?

Let's have a look at several methods in pandas that work with NaNs. Our dataset is stored in data.

  • pandas.DataFrame.isnull() to mark which value is NaN and which isn't:

data.isnull()

It returns True if the value is missing, otherwise it returns False:

The effect of calling data.isnull()

In the screenshot above, sample 244 has a missing value in the school, age, and studytime fields. At the same time, 244 doesn't have NaN in the famsize field.
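The same check can be tried on a small made-up DataFrame. The sketch below assumes invented values in place of the real dataset; it shows that isnull() returns a DataFrame of the same shape filled with True and False, and that isna() is another name for the same method.

```python
import numpy as np
import pandas as pd

# Made-up sample with the same columns as the students dataset
data = pd.DataFrame(
    {
        "school": ["GP", "GP", np.nan, "MS", "MS"],
        "age": [18.0, 17.0, np.nan, 16.0, 19.0],
        "famsize": ["GT3", np.nan, "LE3", "GT3", "LE3"],
        "studytime": [2.0, 3.0, np.nan, np.nan, 1.0],
    }
)

mask = data.isnull()   # True where a value is missing, False elsewhere
print(mask)

# One row of the mask: which fields are missing for the sample with index 2
print(mask.loc[2])

# isna() is an alias of isnull() and produces the same mask
same_mask = data.isna()
```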

  • A combination of methods to calculate a proportion of NaNs per feature:

data.isnull().sum() / data.shape[0]

Result of calculating a proportion of NaNs per feature

Let's figure it out step by step. Firstly, isnull() marks which values are NaN and which aren't. Secondly, sum() counts the True values (True = NaN) in each column. Then, / data.shape[0] divides the number of missing values in each column by the number of rows, since data.shape[0] is the row count. As a result, we get the proportion of missing values per column.

A scheme that shows the original df, the df after isnull() is called, sum() on the previous df, and data.shape[0]
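The steps above can be sketched on a small made-up DataFrame (the values are invented for illustration). The last line uses mean() on the boolean mask, which gives the same proportions in one call because True counts as 1 and False as 0.

```python
import numpy as np
import pandas as pd

# Made-up sample with 5 rows
data = pd.DataFrame(
    {
        "school": ["GP", "GP", np.nan, "MS", "MS"],
        "age": [18.0, 17.0, np.nan, 16.0, 19.0],
        "famsize": ["GT3", np.nan, "LE3", "GT3", "LE3"],
        "studytime": [2.0, 3.0, np.nan, np.nan, 1.0],
    }
)

counts = data.isnull().sum()            # number of NaNs in each column
n_rows = data.shape[0]                  # number of rows
proportions = counts / n_rows           # share of NaNs in each column
print(proportions)

# Equivalent shortcut: the mean of a boolean column is the share of True values
shortcut = data.isnull().mean()
```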

  • Finally, this is how we can check whether each column contains any missing values:

data.isnull().any()

Result of checking for missing values using data.isnull().any()

In this case, there are NaNs in each column, so we received True for all the features.
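A minimal sketch of this check on made-up data is shown below. Besides the per-column check from the text, it also illustrates two variations that may be handy: chaining any() twice to get a single answer for the whole DataFrame, and passing axis=1 to check rows instead of columns.

```python
import numpy as np
import pandas as pd

# Made-up sample: every column has at least one NaN
data = pd.DataFrame(
    {
        "school": ["GP", "GP", np.nan, "MS", "MS"],
        "age": [18.0, 17.0, np.nan, 16.0, 19.0],
        "famsize": ["GT3", np.nan, "LE3", "GT3", "LE3"],
        "studytime": [2.0, 3.0, np.nan, np.nan, 1.0],
    }
)

per_column = data.isnull().any()              # one True/False per column
print(per_column)

anywhere = bool(data.isnull().any().any())    # a single True/False for the whole DataFrame
per_row = data.isnull().any(axis=1)           # one True/False per row
```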

How to deal with them?

As we mentioned above, you have to somehow get rid of missing values before training a model. There are many ways to tackle this problem. The simplest one is to delete them.

  • We can drop all the rows which contain missing values with pandas.DataFrame.dropna(axis=0):

data.dropna(axis=0)

Original dataframe and the dataframe after calling data.dropna(axis=0)

  • We can also drop all the columns which contain missing values by setting axis=1:

data.dropna(axis=1)

Original dataframe and the dataframe after calling data.dropna(axis=1)

By default, dropna() returns a new DataFrame and leaves the original one unchanged. To save the changes in the original DataFrame, set the inplace=True parameter.
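The sketch below tries both calls on a small made-up DataFrame (the values are invented). It also shows that the original data stays untouched unless inplace=True is passed, in which case the method modifies the DataFrame and returns None.

```python
import numpy as np
import pandas as pd

# Made-up sample: every column has at least one NaN
data = pd.DataFrame(
    {
        "school": ["GP", "GP", np.nan, "MS", "MS"],
        "age": [18.0, 17.0, np.nan, 16.0, 19.0],
        "famsize": ["GT3", np.nan, "LE3", "GT3", "LE3"],
        "studytime": [2.0, 3.0, np.nan, np.nan, 1.0],
    }
)

rows_kept = data.dropna(axis=0)   # keeps only the rows without any NaN
cols_kept = data.dropna(axis=1)   # keeps only the columns without any NaN

print(rows_kept.shape)
print(cols_kept.shape)
print(data.shape)                 # the original DataFrame is unchanged

# With inplace=True the DataFrame itself is modified and nothing is returned
cleaned = data.copy()
result = cleaned.dropna(axis=0, inplace=True)
```

Because every column in this sample has at least one NaN, dropping columns leaves no columns at all, which is a reminder to check how much data a deletion removes before relying on it.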

Conclusion

Here is a diagram that summarizes the main information about NaNs to keep in mind:

A diagram that shows the methods to detect or delete the NaN values

This topic covers the basics of missing values. It is a good starting point for the exploration of more intelligent ways to deal with NaNs. For example, filling NaNs with certain values is a popular approach in cases when we can't just delete the data. You will learn about it in future topics.
