In the real world, the data provided to analysts often contains flaws such as missing values, duplicate records, out-of-bounds values, and other inaccuracies. Keeping an eye on data quality is important, as poor-quality data can paint a very different picture during analysis. When building predictive models, the use of poor-quality data can lead to inaccurate results. The process of dealing with such defects is called data cleaning. In this topic, we'll discuss how to identify defects in data and review basic practices for dealing with them.
Handling missing data
In real datasets, some columns are not always fully populated. There can be many reasons for this, from errors during data export to the simple inability to obtain a value. The goal of a data analyst is to identify missing values and decide what to do with them in a given case. There are essentially two ways to handle missing data:
- Deleting a column or row containing multiple missing values from a dataset.
- Imputation of missing values.
Keep in mind that these methods are appropriate when we do not have a significant number of missing values. If a large part of the dataset is missing, it is better to re-collect the information.
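Before choosing either approach, it helps to see how much is actually missing. Here is a minimal sketch using pandas; the file and column structure are illustrative assumptions, with the data loaded into a DataFrame named `df`:

```python
import pandas as pd

# Load the dataset (the file name is just an example)
df = pd.read_csv("clients.csv")

# Count missing values in each column
print(df.isna().sum())

# Share of missing values per column, as a percentage
print(df.isna().mean() * 100)
```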
Deleting rows or columns with missing values is the most common way to handle them. If a column contains numerous missing values and does not play a special role in the analysis or model building, you can simply get rid of it. The same applies to rows: if the dataset contains only a few rows with missing values, it is often easier to delete them. Most libraries and data analysis tools let you do this in a couple of lines of code or mouse clicks.
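For example, with pandas such deletion could look like the following sketch (the column names are assumptions for illustration):

```python
# Drop a column that is mostly empty and not needed for the analysis
df = df.drop(columns=["optional_comment"])

# Drop rows that have a missing value in a column we care about
df = df.dropna(subset=["age"])

# Or drop every row that contains at least one missing value
df = df.dropna()
```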
The second method is used when we cannot afford to delete certain rows or columns. When every row and column is important, we resort to data imputation. This process is more complicated than deletion and requires some knowledge of statistics, as we cannot simply replace missing values with random ones.
One way to impute is to look at the distribution of the features that have missing values. If a feature has a normal or binomial distribution, the missing values can be replaced by its mean or median, since for such distributions these values are close to the most typical values in the sample.
However, if the data is distributed in some other way, you will have to use other methods. For example, you can identify dependencies between attributes and fill in missing values based on them, provided there is a significant dependency. In more complex situations, you can even build a model and predict the missing values. In most cases, though, analysts are limited to the methods described above.
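A minimal sketch of the mean and median imputation described above, again using pandas with assumed column names:

```python
# Replace missing values in a roughly normally distributed column with its mean
df["income"] = df["income"].fillna(df["income"].mean())

# For skewed data, the median is usually a safer choice
df["age"] = df["age"].fillna(df["age"].median())
```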
Handling duplicates
When loading a large amount of data, repeated rows often appear. Therefore, it is a good idea to check the data for duplicates early on. Analysis based on data with many duplicates may paint a misleading picture, and a model built on such data will perform poorly when tested on real data.
The process of handling duplicates couldn't be easier: all the analyst has to do is remove them from the dataset. Most libraries allow doing this with a single function.
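In pandas, for instance, this is one call; a sketch, assuming the DataFrame `df` from above:

```python
# Check how many fully duplicated rows there are
print(df.duplicated().sum())

# Keep the first occurrence of each row and drop the rest
df = df.drop_duplicates()
```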
Handling outliers
There may also be situations where some values in the data are abnormally large or small, or otherwise stand out from the overall distribution. Such values are called outliers, and sometimes they also need to be dealt with. Outliers in quantitative data can usually be spotted on box plots or distribution charts.
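A quick way to eyeball outliers with pandas and matplotlib might look like this sketch (the column name is an assumption):

```python
import matplotlib.pyplot as plt

# A box plot marks points far outside the interquartile range as potential outliers
df["age"].plot(kind="box")
plt.show()
```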
However, such outliers may be in the sample for a variety of reasons. They can be errors, or they can be quite genuine data that you do not need to delete. For example, if we have a dataset containing the population of US cities, a metropolis such as New York would be considered an outlier. But there really are about 8.3 million people in NYC, so we can't remove this kind of outlier.
On the other hand, consider a dataset that contains the ages of clients of some service. If we come across a client aged 187, it is definitely a mistake, as no client can be that old. This type of outlier should be deleted from the dataset.
So, the main rule in outlier handling is to understand which values the analyzed data can naturally take. If a value is flagged as an outlier and you know the attribute simply cannot be that big or that small, the value should be removed.
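One common way to express such a rule in code is a filter based on domain knowledge or on the interquartile range. Here is a hedged pandas sketch; the thresholds and column names are illustrative assumptions:

```python
# Domain-knowledge rule: human ages above 120 are clearly data errors
df = df[df["age"] <= 120]

# A generic IQR-based filter for a numeric column
q1 = df["income"].quantile(0.25)
q3 = df["income"].quantile(0.75)
iqr = q3 - q1
df = df[(df["income"] >= q1 - 1.5 * iqr) & (df["income"] <= q3 + 1.5 * iqr)]
```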
Data formatting
Real data in large datasets may not always come in the expected format. For example, numeric values may be stored as strings in some rows, or a plain string may be used instead of a proper date type. Using the wrong data types and formats can lead to difficulties in analysis and to errors when building a machine-learning model. Therefore, it is very important to check the data formats and fix them if necessary.
Data formatting can include the following:
- Standardizing date formats. All dates in the dataset should be in one format for easier analysis or machine learning.
- Handling text capitalization and removing leading or trailing spaces. It is important to check that text is formatted consistently and that there are no extra spaces in the text data.
- Converting units of measurement. Sometimes values of the same attribute in different rows are measured in different units, so it is important to bring everything to the same units to make the analysis process more convenient.
- Dealing with numeric formatting. As mentioned earlier, numeric data may be stored as text, so it is necessary to convert all numeric values to a numeric type (all four steps are illustrated in the sketch after this list).
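A minimal pandas sketch of these formatting steps; the column names, the date column, and the inch-to-centimeter conversion are all illustrative assumptions:

```python
# Standardize dates: parse strings into a proper datetime type
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

# Normalize text: strip extra spaces and unify capitalization
df["city"] = df["city"].str.strip().str.title()

# Convert units: here, heights recorded in inches are converted to centimeters
df.loc[df["height_unit"] == "in", "height"] *= 2.54
df["height_unit"] = "cm"

# Fix numeric formatting: turn numbers stored as text into real numbers
df["price"] = pd.to_numeric(df["price"], errors="coerce")
```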
Conclusion
In this topic, we have discussed the main points of data cleaning and initial data processing: handling duplicates, missing values, and outliers. You also learned about the data formatting process and what it includes. Whatever your future goal is, this step is essential, so knowing data cleaning techniques is required. Enough theory for now, let's dive into some practice!