
The Jaccard similarity index


You've probably used Netflix to watch a movie or series at least once. And after you pressed the like button, the site most likely proposed similar content. The philosophy behind it is simple: if you liked that video, you'll probably like similar ones. This can be achieved with the Jaccard similarity index. In this topic, we will learn how to use this index in NLP applications.

The Jaccard similarity index explained

The Jaccard similarity index (JSI) is defined as the ratio of the size of the intersection of two sets to the size of their union. In JSI, repeating words carry no extra weight: if a word occurs more than once, all of its occurrences count as a single element.

The index is denoted by $J$ in the formula below:

$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$

According to the JSI formula, $0 \leq J(A, B) \leq 1$. The similarity ratio is $1$ if $A$ and $B$ contain exactly the same elements, and $0$ if they have no elements in common. The formula measures how similar two pieces of data are. For example, you can find the similarity ratio between two texts. You can also build customer profiles based on how similar the sets of products customers use are, which can make advertising considerably more effective.
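
As a quick sanity check, here is a minimal sketch with two toy sets (made up purely for illustration), using Python's built-in set operators:

a = {1, 2, 3}
b = {2, 3, 4}

# |A ∩ B| = 2 and |A ∪ B| = 4, so J(A, B) = 2 / 4 = 0.5
print(len(a & b) / len(a | b))  # 0.5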

Recommendation systems

Let's stick with movie recommendation systems. Imagine two people using a site for watching movies. The more similar the movies they watch, the higher the Jaccard similarity index. Let's examine an example:

[Venn diagram: User1's circle contains "Mermaids", User2's circle contains "Wings", and "Haven", "Limelight", and "Fear" lie in the intersection.]

As we can see in the diagram, the intersection contains 3 movies and the union contains 5, so the similarity index is $3 / 5 = 0.6$. In this case, once User1 has finished Fear, the site will suggest Wings. Let's take a look at the code:

def get_jaccard_similarity(first_user, second_user):
    # Movies both users watched...
    intersection = first_user.intersection(second_user)
    # ...and all distinct movies watched by either user
    union = first_user.union(second_user)
    return len(intersection) / len(union)


user_1 = {"Haven", "Limelight", "Fear", "Mermaids"}
user_2 = {"Haven", "Limelight", "Wings", "Fear"}

jaccard_similarity = get_jaccard_similarity(user_1, user_2)
print(jaccard_similarity)  # 0.6

We store the movies that each user watches in a set. Sets provide the intersection and union methods, which let us easily carry out the calculations. We base the recommendation only on which movies users watch; how many times a user watches a particular movie doesn't concern us. That's why we use the set collection: it doesn't allow duplicate elements.
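
One edge case worth keeping in mind: if both sets are empty, the union is also empty, and the division raises a ZeroDivisionError. Here is a hedged variant of our function that guards against that (returning 0.0 for two empty sets is just one possible convention; some definitions use 1.0 instead):

def get_jaccard_similarity_safe(first_user, second_user):
    union = first_user.union(second_user)
    if not union:
        # Both sets are empty: nothing to compare, so we return 0.0 by convention
        return 0.0
    intersection = first_user.intersection(second_user)
    return len(intersection) / len(union)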

Document analysis

Jaccard similarity is also widely used in NLP. Now, we will use it to find similarities between documents. We will analyze four sentences; the first step is tokenizing the text:

from nltk.tokenize import word_tokenize

def text_tokenize(text):
    tokenized_text = word_tokenize(text)
    # Keep only alphabetic tokens and convert them to lowercase
    return [word.lower() for word in tokenized_text if word.isalpha()]

In the code above, we used word_tokenize from the nltk.tokenize module. It extracts the words from a string; if there is a punctuation mark, the word and the punctuation mark are added to the list as separate items. After that, we keep only the tokens (words) that consist of alphabetic characters and turn them into lowercase, because we don't need punctuation marks and numbers for the analysis.
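
To see what the tokenizer actually produces, here is a quick check. Note that word_tokenize relies on NLTK's punkt model, which you may need to download first (the exact resource name can vary between NLTK versions):

import nltk

nltk.download("punkt")  # tokenizer model; needed once before calling word_tokenize

print(text_tokenize("The quick brown 23 fox walks over the lazy dog."))
# ['the', 'quick', 'brown', 'fox', 'walks', 'over', 'the', 'lazy', 'dog']

As you can see, the number 23 and the final period are filtered out. With tokenization in place, we can compute the Jaccard similarity of two texts: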

def get_jaccard_similarity(text_1, text_2):
    token_1 = text_tokenize(text_1)
    token_2 = text_tokenize(text_2)
    # Sets drop repeated words, so each distinct word counts once
    set_1 = set(token_1)
    set_2 = set(token_2)
    union = set_1.union(set_2)
    intersection = set_1.intersection(set_2)
    return len(intersection) / len(union)

First, we pass both documents to the text_tokenize function, which returns a list of tokens. The next step is to turn these token lists into sets; by doing this, we get rid of repeated words. Finally, we find the Jaccard similarity index by dividing the size of the intersection by the size of the union:

sentences = [
    "The quick brown 23 fox walks over the lazy dog.",
    "The quick red fox jumps over, 17 the lazy cat!",
    "The slow yellow fox walks over the lazy snake.",
    "The fast green fox jumps over the quick cat!",
]

for idx, sentence_1 in enumerate(sentences):
    for idy, sentence_2 in enumerate(sentences):
        if idx != idy:
            similarity = get_jaccard_similarity(sentence_1, sentence_2)
            info_string = f"Similarity rate of sentences {idx + 1} and {idy + 1}"
            print(f"{info_string}: {similarity}")

# Similarity rate of sentences 1 and 2: 0.45454545454545453
# Similarity rate of sentences 1 and 3: 0.45454545454545453
# Similarity rate of sentences 1 and 4: 0.3333333333333333
# Similarity rate of sentences 2 and 1: 0.45454545454545453
# Similarity rate of sentences 2 and 3: 0.3333333333333333
# Similarity rate of sentences 2 and 4: 0.6
# Similarity rate of sentences 3 and 1: 0.45454545454545453
# Similarity rate of sentences 3 and 2: 0.3333333333333333
# Similarity rate of sentences 3 and 4: 0.23076923076923078
# Similarity rate of sentences 4 and 1: 0.3333333333333333
# Similarity rate of sentences 4 and 2: 0.6
# Similarity rate of sentences 4 and 3: 0.23076923076923078

In the code above, we have compared the sentences with each other and found the similarity ratios.
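
Note that the nested loop compares every pair twice: Jaccard similarity is symmetric, so sentences 1 and 2 get the same score as sentences 2 and 1. If you only need each pair once, here is a sketch using itertools.combinations instead:

from itertools import combinations

# combinations(enumerate(sentences), 2) yields each unordered pair exactly once
for (idx, sentence_1), (idy, sentence_2) in combinations(enumerate(sentences), 2):
    similarity = get_jaccard_similarity(sentence_1, sentence_2)
    print(f"Similarity rate of sentences {idx + 1} and {idy + 1}: {similarity}")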

Conclusion

In this topic, we have discussed the Jaccard similarity index and ways to utilize it. We've also seen how to compare two sets for NLP applications. Finally, we have examined the similarity rates between several documents.

Now let's move on to the practice.
