Preprocess the dataset

Imagine you are a data scientist. You need to find errors in a huge log file that looks like this:

2005-06-03-15 R02-M1-N0-C:J12-U11 RAS KERNEL INFO instruction cache parity error corrected
2005-06-03-16 R23-M0-NE-C:J05-U01 RAS KERNEL INFO 63543 double-hummer alignment exceptions
2005-06-04-20 R30-M0-N7-C:J08-U01 RAS KERNEL INFO CE sym 20, at 0x1438f9e0, mask 0x40
2005-06-05-00 R25-M0-N7-C:J02-U01 RAS KERNEL INFO generating core.2275

Before building a model, you need to preprocess the file to reduce its complexity. You've come up with a brilliant idea: find words that are similar but not identical, such as

2005-06-03-15

and

2005-06-04-20

and remove them from the dataset.

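You are going to measure the similarity between two words with the Levenshtein distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one word into the other. Calculate the distance between '2005-06-03-15' and '2005-06-04-20' to assess whether this approach is a good way to simplify the dataset.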

Enter a number
___
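
If you want to check a hand calculation, the distance can be computed with a short dynamic-programming routine. The following is a minimal sketch, not part of the original task: the function name `levenshtein` is hypothetical, and it assumes the standard unit cost for each insertion, deletion, and substitution.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    # previous[j] is the distance between the prefix of a processed
    # so far (initially empty) and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        # Turning the first i characters of a into an empty string
        # takes i deletions.
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


# Classic textbook pair, used here instead of the task's own words.
print(levenshtein("kitten", "sitting"))  # 3
```

The sketch keeps only two rows of the distance table at a time, so memory grows with the length of the second word rather than with the product of both lengths.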
