
Nearly all NLP tasks require understanding a text. Text similarity means comparing texts and determining how close they are semantically. For example, take "I have no coffee left" and "My cup is already empty". Humans understand that these two sentences mean the same thing and are close to each other, even though they don't share a single word.

This topic will cover why text similarity is an essential concept in NLP.

Text similarity

Let's start with units that we can compare:

  • Text/document — several connected sentences. For example, an article or a post;

  • Sample — one phrase or sentence. For example, a short message or an expression;

  • Word — a sequence of letters. For example, car or any non-existent word like babwls.

Text similarity is an integral part of:

  1. Question answering and search engines — to understand similar queries from users and provide a uniform response. Text similarity helps to group queries close in meaning;

  2. Argument mining — to automatically detect arguments, relations, and their internal structure. Text similarity allows for finding the relationships between arguments;

  3. Clustering – to find specific groups in a set of texts or sentences. For example, all the articles on one website can be grouped into topics about science, news, sports, and so on. Text similarity helps to determine the closest samples.

There are many other tasks where text similarity techniques serve as an additional part of an algorithm. They can help choose the most appropriate job position for a CV, suggest the most similar products to a customer when the exact product they are shopping for is unavailable, and so on. You can always find a way to add text similarity to your solution, or even use it as an extra step of exploratory data analysis to understand how the texts in your data are connected.

Taxonomies of text similarity and approaches

We can divide text similarity techniques by the kind of information they operate on:

  • String-based — similarity measures operate on string sequences and character composition.

    • Character-based (sequence-based or edit distance, ED) — takes two strings of characters and calculates the edit distance between them. Character-based measures help recognize typographical errors but are useless for identifying rearranged terms. Metrics: Hamming distance, Levenshtein distance (sketched after this list), Jaro, N-gram, and others.

    • Term-based (token-based) – models each string as a set of tokens and uses the overlap of the two token sets to quantify their likeness. The token-based approach helps recognize term rearrangement by breaking the strings into tokens rather than comparing characters. Metrics: Jaccard similarity, Dice's coefficient, cosine similarity, Manhattan distance, and Euclidean distance.

  • Corpus-based – determines the similarity between two concepts based on the information extracted from a large corpus. Corpus-based measures are efficient when you need to analyze a specific corpus with certain concepts. For example, the similarity between the words a bug and a mistake should be close using IT-related corpus. Techniques: Hyperspace Analogue to Language (HAL), Latent Semantic Analysis (LSA), Explicit Semantic Analysis (ESA), Pointwise Mutual Information (PMI), Normalized Google Distance (NGD), and Extracting DIstributionally Similar words using CO-occurrence (DISCO).
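To make the character-based idea concrete, here is a minimal sketch of the Levenshtein distance in Python (the function name and the sample strings are only for illustration); it counts the insertions, deletions, and substitutions needed to turn one string into another:

def levenshtein(a: str, b: str) -> int:
    # previous row of the dynamic-programming table
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


print(levenshtein("coffee", "cofee"))    # 1 (a single typo)
print(levenshtein("coffee", "toffees"))  # 2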

Lexical similarity metrics

Two samples/texts/words can be analyzed in terms of similarity using two main approaches: lexical and semantic similarity.

Lexical similarity shows how close two samples are at the character/word/n-gram level. To measure lexical similarity, we don't need to transform our samples into vectors: we compare their tokens, characters, or n-grams directly.

Take these two samples:

  1. Anny rejected the invitation from Jessy.

  2. Jessy rejected the invitation from Anny.

In terms of lexical similarity, these two samples are very close.

We can use the Overlap coefficient, Levenshtein Distance, Longest Common Substring, Jaro, Needleman-Wunsch, or Jaccard similarity metric for such lexical similarity.

Jaccard similarity

The Jaccard similarity can be expressed as the number of common words over the total number of unique words in two texts or documents. The Jaccard similarity of two documents ranges from 0 to 1, where 0 signifies no similarity and 1 signifies complete overlap.

$$\text{Jaccard}(\text{doc}_1, \text{doc}_2) = \frac{|\text{doc}_1 \cap \text{doc}_2|}{|\text{doc}_1 \cup \text{doc}_2|} = \frac{|\text{doc}_1 \cap \text{doc}_2|}{|\text{doc}_1| + |\text{doc}_2| - |\text{doc}_1 \cap \text{doc}_2|}$$

Note that both $\text{doc}_1$ and $\text{doc}_2$ are sets (they contain only the unique words from their respective documents).

Let's calculate the Jaccard similarity for two examples:

doc_1 = "NLP is important part of AI"
doc_2 = "NLP is famous for AI chatbots"
words_doc1 = {nlp, is, important, part, of, ai}
words_doc2 = {nlp, is, famous, for, ai, chatbots}

Words in common: nlp, is, ai.

All words: nlp, is, important, part, of, ai, famous, for, chatbots.

Jaccard similarity = 3/9 = 1/3
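The same calculation takes only a few lines of Python. This is a minimal sketch: the helper function is hypothetical, and a simple lowercase whitespace split stands in for proper tokenization:

def jaccard_similarity(text_1: str, text_2: str) -> float:
    # turn each text into a set of unique lowercase tokens
    tokens_1 = set(text_1.lower().split())
    tokens_2 = set(text_2.lower().split())
    # shared words divided by all distinct words
    return len(tokens_1 & tokens_2) / len(tokens_1 | tokens_2)


doc_1 = "NLP is important part of AI"
doc_2 = "NLP is famous for AI chatbots"

print(jaccard_similarity(doc_1, doc_2))  # 0.3333333333333333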

Semantic similarity metrics

Semantic similarity shows how close two samples are in terms of their meaning. Measuring it requires vectorizing the samples, which is also called embedding; an embedding can represent a single word or a whole sentence. There are several approaches to preparing text embeddings: TF-IDF, Word2Vec, Transformer sentence embeddings, Bag-of-Words, FastText, and others. From here on, we assume that our samples have already been turned into embeddings. Take the same two samples again:

  1. Anny rejected the invitation from Jessy.

  2. Jessy rejected the invitation from Anny.

In terms of semantic similarity, these two samples have different meanings: the roles of Anny and Jessy are swapped.

We can use cosine similarity, Euclidean distance, pointwise mutual information, Word mover's distance, block distance, and simple matching coefficient for semantic similarity.

Cosine similarity

Cosine similarity measures the cosine of the angle between two n-dimensional vectors projected in a multi-dimensional space. For non-negative vectors, such as bag-of-words counts, it lies between 0 and 1. If the cosine similarity score is 1, the two vectors have the same orientation, while a value closer to 0 indicates that the two documents are less similar:

$$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

For example:

doc_1 = "NLP is important part of AI"
doc_2 = "NLP is famous for AI chatbots"

Here is their bag-of-words representation (created with scikit-learn's CountVectorizer()):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = [doc_1, doc_2]
# learn the shared vocabulary and count word occurrences per document
X = vectorizer.fit_transform(corpus)

doc_1_vector = X[0].toarray()[0]
doc_2_vector = X[1].toarray()[0]

# vocabulary (alphabetical): ai, chatbots, famous, for, important, is, nlp, of, part
# doc_1_vector = [1, 0, 0, 0, 1, 1, 1, 1, 1]
# doc_2_vector = [1, 1, 1, 1, 0, 1, 1, 0, 0]

We calculate the cosine similarity as follows:

import numpy as np
from numpy.linalg import norm

cosine_score = np.dot(doc_1_vector, doc_2_vector) / (
    norm(doc_1_vector) * norm(doc_2_vector)
)

print(cosine_score)
# 0.5000000000000001

We see that the cosine similarity here is 0.5.
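If you prefer not to compute it by hand, scikit-learn provides the same score through its cosine_similarity helper, applied directly to the document-term matrix X built above:

from sklearn.metrics.pairwise import cosine_similarity

# the [0, 1] entry is the similarity between doc_1 and doc_2
print(cosine_similarity(X)[0, 1])  # ≈ 0.5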

Euclidean distance

Euclidean distance is the length of the straight line segment connecting two points:

A visualization of the Euclidean distance between two points.

$$\text{euclidean}(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2}$$

The larger the distance between two vectors, the lower the similarity score, and vice versa. Since the Euclidean distance is not bounded to the range from 0 to 1, we can convert it into a normalized similarity score using Euler's number:

$$\frac{1}{e^{d}}$$

For our example documents, the Euclidean distance is √6 ≈ 2.45, which gives a similarity of about 0.086 after normalization.
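Reusing the bag-of-words vectors from the cosine similarity example, a short NumPy sketch of this calculation could look as follows:

import numpy as np
from numpy.linalg import norm

# doc_1_vector and doc_2_vector are the bag-of-words vectors built above
euclidean_distance = norm(doc_1_vector - doc_2_vector)
normalized_similarity = 1 / np.exp(euclidean_distance)

print(euclidean_distance)      # 2.449489742783178
print(normalized_similarity)   # ≈ 0.086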

Word Mover's Distance

The word mover's distance is the minimum cumulative distance that the words of one text document need to travel to match the point cloud of another text document. It operates on word2vec embeddings of words, so this distance allows us to compare samples that share no words at all, as in the example below.

Under the hood, each word is weighted by its count divided by the total number of words in its document, which normalizes the distance for documents of different lengths.

The distance can be visualized as illustrated in the following graph:

An example of the words of two documents in the embedding vector space.
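As a rough sketch of how this can look in practice with gensim (assuming gensim, its downloader data, and the optional optimal-transport dependency required for WMD are installed; the model name and the sentences are only for illustration):

import gensim.downloader as api

# a small set of pretrained word embeddings from gensim's downloader
model = api.load("glove-wiki-gigaword-50")

sample_1 = "i have no coffee left".split()
sample_2 = "my cup is already empty".split()

# a lower distance means the samples are semantically closer,
# even though they share no words
print(model.wmdistance(sample_1, sample_2))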

Conclusion

In this topic, we have looked at lexical and semantic similarity metrics: Jaccard similarity, cosine similarity, Euclidean distance, and the word mover's distance. These metrics are only the tip of the iceberg; there are many others. As you may have noticed, Jaccard similarity works at the lexical level, while the others can be computed on word embeddings. There is no universal rule for choosing a metric: try different ones depending on your data and its distribution.

Text similarity helps us understand the relatedness of two texts or sentences. It is useful for many NLP tasks and can also serve as an exploratory data analysis step to understand how the texts in your data are connected. It can be measured at the lexical or the semantic level.
