Sentence embedding is the ability to represent a sentence of any length as a vector without losing its meaning. Word embeddings can also be used to represent a sentence by grouping the embeddings of its individual words. Unfortunately, the result is not a single vector, and it loses one of the most important things for text analysis: the meaning of the whole phrase. In this topic, we will look at the difference between word and sentence embeddings, their uses, and different ways to implement sentence embeddings, including TF-IDF, Doc2Vec, and transformers. We will also consider additional techniques for getting an embedding for an entire text and for improving the accuracy of the resulting sentence embeddings.
Sentence embeddings definition
In more detail, a sentence embedding is a representation of a text (one or more sentences) as a vector that conveys the meaning of the entire text. A whole text consists of sentences, and each sentence is represented as a vector, so we can compare two sentences through their embeddings. If the sentences are similar in meaning, their vectors will also be similar (close in values). For example:
I would like to reschedule my flight due to health issues.
Due to illness, the flight date should be changed.
Suppose we have already obtained sentence embeddings for these sentences. Using a mathematical measure called cosine similarity, we get a coefficient of 0.6321178, which looks plausible because the sentences describe the same idea through synonyms and a different sentence construction. With the help of sentence embeddings, it is easy to capture the meaning of an entire sentence.
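As a quick illustration of the measure itself, here is a minimal sketch of computing cosine similarity with NumPy; the two short vectors below are made up for demonstration and stand in for real sentence embeddings, which are much longer.

import numpy as np

def cosine_similarity(a, b):
    # cosine similarity is the dot product divided by the product of the vector norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# toy vectors standing in for real sentence embeddings
embedding_1 = np.array([0.12, 0.83, -0.45, 0.31])
embedding_2 = np.array([0.10, 0.76, -0.20, 0.45])

print(cosine_similarity(embedding_1, embedding_2))  # values close to 1 mean similar meaning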
Sentence embeddings differ from word embeddings in that word embeddings represent each word as a separate vector. Based on the values in each vector, one can determine how close words are to each other in the semantic sense. A more helpful way to parse a text is often to create sentence embeddings, because they carry the semantic characteristics of the whole phrase rather than of words taken separately, as is the case with word embeddings. This is the main difference between word and sentence embeddings.
There are various applications of sentence embeddings. One of them is semantic textual similarity (STS). Apart from this, sentence embeddings are very useful for NLP in general, as they can also be used in other tasks, for example, question-answering dialogue systems, text classification, text paraphrasing, and so on. Sentence embeddings help NLP models understand the meaning of sentences.
In the next sections, we cover different methods of implementing sentence embeddings, from the simplest ones, such as TF-IDF and Doc2Vec, to well-known transformers like BERT. Each method has its own architecture, which implies its own advantages and disadvantages.
TF-IDF
The TF-IDF method represents a sentence as a vector of term frequency-inverse document frequency weights over the vocabulary. It calculates sentence embeddings easily and quickly, so it does not need enormous computing power, which is a huge advantage. However, it does not consider the order of words in a sentence, which can negatively affect how well the semantic structure of the text is captured. In such cases, other methods are better, but it all depends on the task and on the quality of sentence embeddings required.
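As a small illustration (not part of the original examples), scikit-learn's TfidfVectorizer can turn each sentence into a TF-IDF vector that serves as a simple sentence embedding; the two sentences below are reused from the earlier example.

from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ['I would like to reschedule my flight due to health issues.',
             'Due to illness, the flight date should be changed.']

vectorizer = TfidfVectorizer()
tfidf_vectors = vectorizer.fit_transform(sentences)  # one row per sentence

print(tfidf_vectors.shape)         # (number of sentences, vocabulary size)
print(tfidf_vectors.toarray()[0])  # TF-IDF sentence vector for the first sentence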
Doc2Vec
Let us consider one of the most popular methods, called Doc2Vec. Its name recalls the Word2Vec algorithm, which is no accident since Doc2Vec is based on it. The difference is that Word2Vec creates word embeddings while Doc2Vec, also known as Paragraph Vector (PV), creates sentence or document embeddings. The "PV" emphasizes that unlike Word2Vec, which focuses on individual words, Doc2Vec is designed to capture the overall semantic meaning of larger text units.
We have two options for Doc2Vec: DBOW (the Distributed Bag-Of-Words model) and DM (the Distributed Memory model). The idea of DBOW, or PV-DBOW, is very similar to the Skip-gram model of Word2Vec: the paragraph vector is trained to predict words sampled from the paragraph, without considering word order. The DM model, or PV-DM, incorporates more context: it combines the paragraph vector with the vectors of the words preceding the target word to predict that word. Because the model integrates both the surrounding words and the paragraph itself, it allows for a deeper understanding of the text's context.
In the diagram below, you can see the noticeable differences between the model architectures.
The primary architectural difference between these models is that DM considers the order of words within sentences, whereas DBOW does not. Consequently, DM is typically more memory-intensive and slower to process texts but can yield richer results due to its consideration of word order. However, the simpler DBOW model may perform better in certain scenarios, making it important to evaluate both models based on specific needs.
Implementing these models is straightforward. By setting the dm parameter, you can choose the model type: dm=1 activates the Distributed Memory model, while dm=0 uses the Distributed Bag of Words model. You can find the meaning of other model parameters in this Doc2vec paragraph embeddings article on the Gensim site.
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# example documents: each one is a list of tokens wrapped in a TaggedDocument with a unique tag
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(common_texts)]

# dm=1 selects the Distributed Memory model; dm=0 would select DBOW
model = Doc2Vec(documents, dm=1, vector_size=5, window=2, min_count=1, workers=4)
# continue training on the same corpus (the constructor already performs initial training)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

# infer an embedding for a new, tokenized sentence
vector = model.infer_vector(['the', 'weather', 'was', 'too', 'rainy', 'yesterday'])
print(vector)

The document vectors generated by either the DBOW or DM model (or a combination of both) can be used as features for classifiers in supervised learning tasks: the vectors are used to train classifiers such as logistic regression or neural networks, which then rely on the patterns in the vectors to predict labels for new documents. This effectively turns the unsupervised embeddings from Doc2Vec into predictive tools for tasks such as sentiment analysis and topic classification, which enhances their practical utility in a range of applications.
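As a rough sketch of that idea, assuming the Doc2Vec model and documents from the snippet above and some invented binary labels (they are placeholders, not real data), the inferred vectors can be fed into a scikit-learn classifier:

from sklearn.linear_model import LogisticRegression

# embed each training document with the Doc2Vec model from above
train_vectors = [model.infer_vector(doc.words) for doc in documents]
# hypothetical labels, one per document, purely for illustration
train_labels = [i % 2 for i in range(len(documents))]

clf = LogisticRegression(max_iter=1000)
clf.fit(train_vectors, train_labels)

# classify a new, unseen document by embedding it first
new_vector = model.infer_vector(['the', 'weather', 'was', 'too', 'rainy', 'yesterday'])
print(clf.predict([new_vector]))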
You can find detailed coverage of these methods in the Distributed Representations of Sentences and Documents paper.
BERT
The third algorithm is the BERT model. It is a state-of-the-art solution for many NLP tasks, including sentence embeddings. The model requires a lot of time and resources to train, but it does an excellent job with big corpora of text data. It can even build dependencies across the entire text, so the semantic structure of documents is preserved more accurately.
You can recall the Distributed Memory model, which also considers word order. Still, there is a notable difference: DM uses a so-called window and only takes neighboring words into account, while BERT has access to information about the entire text during training, which gives it an advantage over the Distributed Memory model.
The SBERT model is essentially a fine-tuned BERT used in a Siamese network structure, as shown in the diagram below. The mean strategy, that is, the average of all token vectors, is used as the pooling stage. Thanks to this architecture, SBERT is also a state-of-the-art solution. You can read more about it in the Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks article by Nils Reimers and Iryna Gurevych.
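To make the pooling stage more concrete, below is a minimal sketch of mean pooling over BERT token vectors using the Hugging Face transformers library and the bert-base-uncased checkpoint (both are assumptions for illustration; SBERT performs this kind of pooling internally):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
bert = AutoModel.from_pretrained('bert-base-uncased')

encoded = tokenizer('Due to illness, the flight date should be changed.', return_tensors='pt')
with torch.no_grad():
    token_vectors = bert(**encoded).last_hidden_state  # shape: (1, num_tokens, hidden_size)

# mean pooling: average the token vectors, ignoring padding via the attention mask
mask = encoded['attention_mask'].unsqueeze(-1)
sentence_vector = (token_vectors * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_vector.shape)  # (1, 768) for bert-base-uncased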
Below is a sample code where we load the all-MiniLM-L6-v2 model, which has already been pre-trained on a large amount of data. With the encode method, we can get sentence embeddings for our phrases.
from sentence_transformers import SentenceTransformer

# load a pre-trained SBERT model
model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = ['I would like to reschedule my flight due to health issues.',
             'Due to illness, the flight date should be changed.']
# encode returns one embedding per sentence
embeddings = model.encode(sentences)
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Average of sentence embeddings
We will also examine how you can further process the received sentence embeddings. For a whole text, we get as many embeddings as there are sentences, so how do we get one representative vector for the entire text? Suppose we already have sentence embeddings for a text obtained with the Doc2Vec method. The average of sentence embeddings is a reasonably simple technique where we calculate the mean of all the vectors. This way, we get one vector for the entire text, and it is worth noting that this approach is sometimes enough to get good results. Moreover, it is considered a safe method, since the average vectors of two independent texts will practically never coincide. Ben Coleman gives a more detailed explanation and proof of this in Why is it Okay to Average Embeddings?

Suppose vector1 is the sentence embedding for the first sentence and vector2 is the embedding for the second one. Let us calculate the average of the embeddings.
import numpy as np

# collect the sentence vectors and average them element-wise
all_vectors = list()
all_vectors.append(vector1)
all_vectors.append(vector2)
mean = np.array(all_vectors).mean(axis=0)
print(mean)

Weighted average of sentence embeddings

A weighted average of sentence embeddings helps when we want a more accurate measure of the impact of each word in a sentence, since Doc2Vec does not take word frequency into account the way TF-IDF does. The implementation differs from the previous method in that we multiply each added vector by the corresponding vector obtained with TF-IDF (tf_idf_vector1 and tf_idf_vector2).
import numpy as np

# weight each sentence vector by its TF-IDF vector before averaging
all_vectors = list()
all_vectors.append(vector1 * tf_idf_vector1)
all_vectors.append(vector2 * tf_idf_vector2)
mean = np.array(all_vectors).mean(axis=0)
print(mean)

Conclusion
In this topic, we've looked at sentence embeddings: how they differ from word embeddings, what they are used for, the approaches for extracting them, and how their quality can be improved. The approaches to sentence embeddings are quite different, though some share common features. They also differ in performance, memory costs, and computing power. Therefore, which approach is more effective depends on the task where the sentence embeddings are needed.