Word embeddings are a method for converting text into numerical representations. Essentially, they are vectors where semantically similar words are positioned closely in the vector space. This makes word embeddings applicable to various NLP tasks, such as text classification and language generation, by providing meaningful representations of words.
In this topic, we will explore the key concepts behind word embeddings.
Why do we need word embeddings?
There are several methods to represent text as numerical data; some of the oldest and simplest are bag-of-words (BoW) and TF-IDF. Let's take BoW as an example: it creates a vector whose length equals the vocabulary size, where each dimension holds the frequency of a word in a given text. BoW can be illustrated as follows:
Sample Documents:
Document 1: 'She drinks coffee.'
Document 2: 'He likes tea and coffee.'
Document 3: 'They prefer tea.'
Document 4: 'Tea or coffee?'
Step 1: Vocabulary and Indexing
{'she': 7, 'drinks': 2, 'coffee': 1, 'he': 3, 'likes': 4, 'tea': 8, 'and': 0, 'they': 9, 'prefer': 6, 'or': 5}
Step 2: BoW Representation:
            and  coffee  drinks  he  likes  or  prefer  she  tea  they
Document 1   0     1       1     0    0     0     0      1    0     0
Document 2   1     1       0     1    1     0     0      0    1     0
Document 3   0     0       0     0    0     0     1      0    1     1
Document 4   0     1       0     0    0     1     0      0    1     0
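For reference, the mapping in Step 1 and the matrix in Step 2 can be reproduced with scikit-learn's CountVectorizer; a minimal sketch, assuming scikit-learn is available:

```python
# A minimal sketch: CountVectorizer builds the word-to-index mapping (Step 1)
# and the document-term matrix (Step 2).
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "She drinks coffee.",
    "He likes tea and coffee.",
    "They prefer tea.",
    "Tea or coffee?",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(documents)

print(vectorizer.vocabulary_)  # word-to-index mapping, as in Step 1
print(bow.toarray())           # one row per document, one column per word
```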
Each row represents a document, and each column corresponds to a word in the vocabulary. However, BoW has several limitations:
It disregards word order.
The vectors are lengthy and grow larger with vocabulary size.
Vectors are not normalized.
Variations of similar words are not accounted for.
For instance, consider the sentences: "She drinks neither ___ nor tea for breakfast" and "___ is the most popular drink in Starbucks." You can guess that the missing word is "coffee" from the surrounding words alone. Word embeddings work on the same principle: a word's meaning is learned from the contexts it appears in.
Word embeddings are created by:
Initializing word vectors randomly.
Selecting a target word (e.g., "coffee").
Defining a window size to capture contextual words (with a size of 2, context might include "drinks neither" and "nor tea").
Feeding the model the context to predict the target word.
Repeating this process for all words.
SkipGram and CBOW
In the previous example, we tried to guess a word from its context; this method is called CBOW (Continuous Bag-of-Words). We can also do the opposite and predict the context from the word; this approach is called SkipGram.
So, for our example, if we take the word coffee and predict its context (drinks neither + nor tea), we use SkipGram. If we take the context (drinks neither + nor tea) and try to guess the word in between, we use CBOW.
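To make the difference concrete, here is a minimal sketch (not an actual training loop, and the pair format is only illustrative) of how the training pairs could be built for each architecture with a window size of 2:

```python
# Building (input, target) pairs for CBOW and SkipGram from one sentence.
sentence = "she drinks neither coffee nor tea for breakfast".split()
window = 2

cbow_pairs, skipgram_pairs = [], []
for i, word in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    cbow_pairs.append((context, word))                  # CBOW: context -> word
    skipgram_pairs += [(word, ctx) for ctx in context]  # SkipGram: word -> each context word

print(cbow_pairs[3])       # (['drinks', 'neither', 'nor', 'tea'], 'coffee')
print(skipgram_pairs[:2])  # [('she', 'drinks'), ('she', 'neither')]
```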
Subsampling
Many common words, such as "in," "the," and "a," carry little meaningful information when training word embeddings. These high-frequency words, often referred to as stop words, do not significantly help in understanding the meanings of other words in context. To address the imbalance caused by the overrepresentation of frequent words, we use a subsampling technique. In this approach, each word $w_i$ in the training set is discarded with a probability calculated using the formula:

$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$

Here:

$t$ is a chosen threshold (a small constant, typically around $10^{-5}$), and words with frequency greater than $t$ may be discarded.

$f(w_i)$ is the frequency of the word $w_i$ in the corpus.

This subsampling formula ensures that very frequent words (those with high $f(w_i)$) have a higher probability of being discarded, thereby reducing their dominance in the training data.
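As a small sketch, the formula can be transcribed directly into code; the threshold and word frequencies below are illustrative values, not taken from a real corpus:

```python
# A direct transcription of the formula above.
import math

def discard_probability(word_frequency, t=1e-5):
    """P(w_i) = 1 - sqrt(t / f(w_i)), clipped at 0 for rare words."""
    return max(0.0, 1 - math.sqrt(t / word_frequency))

print(discard_probability(0.05))   # frequent word like "the": discarded ~99% of the time
print(discard_probability(1e-6))   # rare word: never discarded (probability 0.0)
```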
By discarding these frequent but less informative words, we allow more meaningful words to have a greater impact during training. This leads to embeddings that are more representative of the semantic relationships between words.
Let's take our previous example with some changes: She drinks neither a cup of coffee nor a cup of tea for breakfast.
Without subsampling and using a context window size of 3 around the word "coffee", the context words are:
Before "coffee": "a," "cup," "of"
After "coffee": "nor," "a," "cup"
Many of these context words are common and have little semantic content.
With subsampling applied, frequent function words such as "a," "of," "nor," and "for" are likely to be discarded. The context then becomes:
Before "coffee": "drinks," "neither," "cup"
After "coffee": "cup," "tea," "breakfast"
This subsampled context includes more informative words that provide better insights into the meaning and usage of the word "coffee." By focusing on these meaningful words, we obtain more informative word embeddings.
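A small sketch of this effect is shown below; in practice each word is dropped at random according to its discard probability, but for illustration we simply assume the frequent function words were removed:

```python
# Context windows around "coffee" with and without subsampling.
sentence = "she drinks neither a cup of coffee nor a cup of tea for breakfast".split()
frequent = {"a", "of", "nor", "for"}   # assumed to be subsampled away

def context(words, target, window=3):
    i = words.index(target)
    return words[max(0, i - window):i], words[i + 1:i + 1 + window]

print(context(sentence, "coffee"))                                    # without subsampling
print(context([w for w in sentence if w not in frequent], "coffee"))  # with subsampling
```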
Negative sampling
When training word embeddings, our aim is to maximize the probability of observing a particular word given a specific context across the entire vocabulary. However, this poses a challenge because it requires normalizing over all possible words in the vocabulary for every update, which is computationally intensive.
To address this, we use negative sampling. Instead of considering the entire vocabulary, we select a small subset of words: some that are actual context words (positive samples) and some random words that do not belong to the context (negative samples). By doing this, we update the model's weights based on a limited number of words from both positive and negative contexts, reducing computational complexity.
Let's consider the target word "coffee" with the same sentence (She drinks neither a cup of coffee nor a cup of tea for breakfast).
Positive context words (words that actually appear near "coffee" in the text): "drinks," "neither," "nor," "tea"
Negative context words (random words from the corpus that are not related to "coffee"): "runs," "just," "teacher"
The negative samples are typically chosen according to the unigram distribution, meaning words are picked based on their overall frequency in the corpus. This approach ensures that more frequent words have a higher probability of being selected as negative samples, which helps the model learn to differentiate between relevant and irrelevant contexts.
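Below is a minimal sketch of drawing negative samples from a unigram distribution; the counts and sampled words are made up for illustration (the original Word2Vec implementation actually raises the counts to the power of 3/4 before sampling):

```python
# Unigram negative sampling for the target word "coffee".
import random

counts = {"she": 120, "drinks": 40, "coffee": 90, "nor": 300, "tea": 85,
          "runs": 30, "just": 200, "teacher": 25, "breakfast": 15}
target = "coffee"
positives = {"drinks", "neither", "nor", "tea"}   # true context words

candidates = [w for w in counts if w != target and w not in positives]
weights = [counts[w] for w in candidates]         # unigram frequencies
negatives = random.choices(candidates, weights=weights, k=3)
print(negatives)                                  # e.g. ['just', 'runs', 'she']
```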
By focusing on a small number of positive and negative samples, we can efficiently train the model to distinguish the target word from unrelated words. This helps to capture meaningful semantic relationships in the embeddings without the computational challenge of normalizing over the entire vocabulary.
Popular word embeddings
Word2Vec is a family of related models used to produce word embeddings; it covers both the CBOW and SkipGram architectures described above. The underlying idea of Word2Vec is that a word's meaning heavily depends on its context, so Word2Vec approximates the sense of a word through the vectors of its surrounding words. This way, we can capture synonyms and antonyms, as well as other linguistic phenomena and relations. A popular example is the analogy "man is to king as woman is to queen," which the learned vectors roughly express as king - man + woman ≈ queen.
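As a quick illustration, a toy Word2Vec model can be trained with the gensim library, assuming it is installed; the corpus below is far too small to produce meaningful vectors, so the output is only illustrative:

```python
# Training a tiny Word2Vec model on a toy corpus.
from gensim.models import Word2Vec

sentences = [
    ["she", "drinks", "coffee"],
    ["he", "likes", "tea", "and", "coffee"],
    ["they", "prefer", "tea"],
    ["tea", "or", "coffee"],
]

# sg=1 selects SkipGram, sg=0 selects CBOW
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["coffee"].shape)         # a dense 50-dimensional vector
print(model.wv.most_similar("coffee"))  # nearest neighbours in the (toy) vector space
```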
Still, there are some drawbacks of this approach:
It does not benefit from the information in the whole document, as it uses windows;
It does not capture the subword information;
It cannot handle out-of-vocabulary words (OOV);
It does not handle disambiguation when one word has different meanings.
Let's solve each problem with other models.
GloVe (Global Vectors) combines word prediction with global word statistics from the entire corpus. It constructs a large co-occurrence matrix of words and their contexts. GloVe seeks embeddings where the dot product of word vectors corresponds to their co-occurrence probabilities, similar to methods used in Latent Semantic Analysis (LSA). This approach produces embeddings that include information from the whole corpus. For example, if we train GloVe on a corpus about coding, the embedding of the word bug will reflect its meaning as a mistake in code, whereas training on a general corpus would mostly capture its meaning as an insect.
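As a rough sketch of what GloVe starts from, the snippet below builds co-occurrence counts with a window size of 1 on a toy corpus; GloVe itself then fits word vectors to such global statistics:

```python
# Counting word-word co-occurrences on a toy corpus.
from collections import Counter

corpus = [
    ["she", "drinks", "coffee"],
    ["he", "likes", "tea", "and", "coffee"],
]

cooc = Counter()
for sentence in corpus:
    for i, word in enumerate(sentence):
        neighbours = sentence[max(0, i - 1):i] + sentence[i + 1:i + 2]
        for ctx in neighbours:
            cooc[(word, ctx)] += 1

print(cooc[("tea", "and")])  # 1: "tea" and "and" occur next to each other once
```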
FastText is a SkipGram model that incorporates character-level information. It breaks down each word into a set of n-grams (subword units) and represents the word as the sum of these n-gram vectors. This method enables the model to generate embeddings for out-of-vocabulary (OOV) words and capture similarities between morphologically related words. For instance, beautiful can be represented by character n-grams such as "beau," "auti," and "iful," which places it close to other adjectives such as wonderful and, at the same time, to the related word beauty.
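A minimal sketch with gensim's FastText implementation, assuming gensim is installed, shows how an out-of-vocabulary word still receives a vector composed from its character n-grams:

```python
# FastText builds vectors for unseen words from shared character n-grams.
from gensim.models import FastText

sentences = [
    ["beauty", "is", "a", "beautiful", "word"],
    ["the", "view", "was", "wonderful"],
]

model = FastText(sentences, vector_size=50, window=3, min_count=1, min_n=3, max_n=5)

print("beautifully" in model.wv.key_to_index)  # False: never seen during training
print(model.wv["beautifully"].shape)           # (50,), built from shared n-grams
```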
Up to this point, we've focused on static embeddings, which assign a single representation to each word regardless of context. However, contextualized models generate different embeddings for a word depending on its surrounding words. These embeddings are trained primarily using two approaches:
Masked language modeling: Predicting a randomly masked word within a sentence (e.g., BERT).
Causal language modeling: Predicting the next word in a sequence (e.g., GPT).
Models like BERT and GPT handle both OOV words and word sense disambiguation by splitting words into subword tokens and generating context-dependent embeddings.
As an example, let's consider the word "Set":
As a Noun: "A set of instruments is on the table."
As a Verb: "Set higher goals; you can achieve everything."
Contextualized embeddings provide different representations for "set" in these sentences, reflecting its meaning in each context.
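As a hedged sketch using the Hugging Face transformers library with PyTorch (both assumed to be installed), we can compare BERT's contextual embeddings of "set" in the two sentences; the cosine similarity between them will be noticeably below 1:

```python
# Extracting the contextual embedding of "set" from each sentence with BERT.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence, word):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (sequence_length, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]                   # vector for the token "set"

noun_set = embedding_of("A set of instruments is on the table.", "set")
verb_set = embedding_of("Set higher goals; you can achieve everything.", "set")

# The two vectors differ because the surrounding context differs.
print(torch.cosine_similarity(noun_set, verb_set, dim=0).item())
```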
In the upcoming topics, we'll look at contextualized embeddings in more detail, as they are currently among the most widely used techniques in NLP.
Conclusion
We have covered the main concepts behind word embeddings: the techniques used to construct them, their main types, and the tricks that improve training. Let's quickly go through them once again:
Word embeddings represent words as vectors of numbers for text analysis;
Techniques such as subsampling and negative sampling improve the quality and efficiency of the training process;
There are several static word embedding models to choose from (Word2Vec, GloVe, FastText), but there are also contextualized embeddings, which generate dynamic word representations that change with the surrounding context, allowing the same word to have different meanings in different situations.