Imagine yourself as a data scientist. Your task is to find errors in a huge log file that looks like this:
2005-06-03-15 R02-M1-N0-C:J12-U11 RAS KERNEL INFO instruction cache parity error corrected
2005-06-03-16 R23-M0-NE-C:J05-U01 RAS KERNEL INFO 63543 double-hummer alignment exceptions
2005-06-04-20 R30-M0-N7-C:J08-U01 RAS KERNEL INFO CE sym 20, at 0x1438f9e0, mask 0x40
2005-06-05-00 R25-M0-N7-C:J02-U01 RAS KERNEL INFO generating core.2275
Before building a model, you need to preprocess the file to reduce its complexity. You've come up with a brilliant idea: find words that are similar but not identical, such as
2005-06-03-15
and
2005-06-04-20
and drop them from the dataset.
You are going to measure the similarity between two words using the Levenshtein distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. Calculate the distance between '2005-06-03-15' and '2005-06-04-20' to assess whether this approach is a good way to simplify the dataset.
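If you want to check your answer, here is a minimal sketch of the computation using the standard dynamic-programming (Wagner-Fischer) recurrence; the function name and the two example strings are illustrative only and not tied to any particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # prev[j] holds the distance from the current prefix of a to b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]

print(levenshtein('2005-06-03-15', '2005-06-04-20'))  # prints 3
```

The two timestamps share the prefix '2005-06-0', so the distance reduces to editing '3-15' into '4-20', which takes three substitutions.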