Contextualized embeddings

Context can change the way we understand words. In this topic, we will learn about contextualized embeddings in NLP.

Static and contextualized embeddings

Word2Vec, GloVe, and FastText are examples of non-contextualized, or static, embeddings: each word or token has exactly one fixed embedding. For instance, consider the word "bank" in the following two sentences: "I was driving along the river bank" and "I need to open a new bank account." With non-contextualized embeddings, both instances of "bank" would have the same embedding, even though the word means something different in each sentence.
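To make the lookup behaviour concrete, here is a minimal sketch of a static embedding table. The vectors and the embed_static helper are made up for illustration; real Word2Vec or GloVe vectors have hundreds of dimensions and are learned from data.

```python
# Toy static embedding table: one fixed vector per word (values are made up).
static_embeddings = {
    "river": [0.9, 0.1, 0.0],
    "bank": [0.5, 0.5, 0.2],
    "account": [0.0, 0.2, 0.9],
}

def embed_static(sentence: str) -> dict:
    """Look up each known word; the surrounding context is never consulted."""
    words = sentence.lower().split()
    return {w: static_embeddings[w] for w in words if w in static_embeddings}

river_vecs = embed_static("I was driving along the river bank")
fin_vecs = embed_static("I need to open a new bank account")

# The lookup ignores context, so "bank" maps to the same vector both times.
same_vector = river_vecs["bank"] == fin_vecs["bank"]
print(same_vector)  # True
```

Because the embedding is just a dictionary lookup, nothing in the sentence can change the result for "bank".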

In contrast, contextualized embeddings from models such as BERT or GPT provide different representations based on the surrounding context. This is the key distinction between contextualized and non-contextualized embeddings. Let's summarize other characteristics of these two approaches.

| Contextualized | Non-contextualized |
| --- | --- |
| A single token can have many embeddings, depending on the context in which it appears. | A single embedding represents a word regardless of its context. |
| Embeddings are generated from the entire sentence or text provided. | Embeddings are stored as key-value pairs, so you can retrieve an embedding without processing the entire sentence. |
| Typically, embeddings are created with attention over the entire sequence of words. | Embeddings are often trained with a window of only 3 to 5 words, so not all words in a sentence are related to each other. |
| Embeddings use the explicit index of each token within the sequence, so the model can learn each word's position relative to the others. | Embeddings do not account for word order. |
| Out-of-vocabulary words can still be represented effectively, because they are built from learned subword tokens. | Out-of-vocabulary words get useful representations only in certain static architectures, such as FastText. |

How contextualized embeddings are formed

Contextualized embeddings are usually produced by attention-based models such as BERT (Bidirectional Encoder Representations from Transformers). In these models, the attention mechanism builds each word's representation from its relationships with the other words in the sequence.

The model first splits the text into individual tokens. Each token is then mapped to an initial embedding, which typically combines a token embedding (representing the token's meaning) and a positional embedding (encoding the token's position in the sequence).
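The combination of the two embeddings can be sketched as a simple element-wise sum. The tables and the initial_embeddings helper below are hypothetical toy values, not taken from any real model, where both tables are learned during training.

```python
# Hypothetical toy tables; real models learn these values during training.
token_embeddings = {"river": [1.0, 0.0], "bank": [0.0, 1.0]}
position_embeddings = [[0.25, 0.25], [0.5, 0.5], [0.75, 0.75]]

def initial_embeddings(tokens):
    """Add the token vector and the vector for its position, element-wise."""
    return [
        [t + p for t, p in zip(token_embeddings[tok], position_embeddings[pos])]
        for pos, tok in enumerate(tokens)
    ]

vectors = initial_embeddings(["bank", "river", "bank"])

print(vectors[0])  # [0.25, 1.25]  "bank" at position 0
print(vectors[2])  # [0.75, 1.75]  "bank" at position 2
```

Even before any attention layer runs, the same token at two different positions already starts from two different vectors.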

These initial embeddings are then passed through a stack of layers, each containing a self-attention mechanism and a feed-forward neural network. Self-attention lets each token weigh the importance of every other token in the sequence, so the model can capture contextual relationships between words.
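The following is a deliberately simplified sketch of scaled dot-product self-attention. It assumes that queries, keys, and values are the input vectors themselves, whereas real Transformer layers apply learned projection matrices and use several attention heads. The 2-dimensional vectors are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Simplified self-attention: queries, keys, and values are the inputs."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Scaled dot-product score between this token and every token.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors.
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        )
    return outputs

# Made-up vectors for illustration only.
bank, river, account = [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]

bank_near_river = self_attention([river, bank])[1]
bank_near_account = self_attention([bank, account])[0]

print(bank_near_river)    # [1.5, 0.5]
print(bank_near_account)  # [0.5, 1.5]
```

The input vector for "bank" is identical in both calls, yet the output is pulled toward "river" in one case and toward "account" in the other. This is the effect that makes the embeddings contextual.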

BERT is pre-trained on large unlabeled text corpora using tasks designed for contextual understanding:

  1. In Masked Language Modeling (MLM), a portion of the input tokens (about 15% in the original BERT) is randomly selected and masked, typically by replacing each one with a special [MASK] token. The model is trained to predict the original vocabulary ID of each masked token from the context provided by the unmasked tokens. Because that context includes both the left and the right side of the masked position, BERT learns bidirectional context.

  2. With Next Sentence Prediction (NSP), the model is trained to predict whether one sentence follows another in the original text. This task helps BERT understand the relationships between sentences, leading to more coherent embeddings for longer text sequences.
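The masking step of MLM can be sketched as follows. This is a simplified illustration, not BERT's actual preprocessing code: the mask_tokens helper and its parameters are hypothetical, and the original BERT recipe also sometimes keeps a selected token unchanged or swaps it for a random one, which this sketch leaves out.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK].

    Returns the masked sequence and a dict {position: original token}
    holding the targets the model would be trained to predict.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels[i] = tok  # the model must recover this from context
        else:
            masked.append(tok)
    return masked, labels

tokens = "i need to open a new bank account".split()
# A higher rate than the default is used here so a short sentence
# is likely to get at least one masked token.
masked, labels = mask_tokens(tokens, mask_prob=0.3)

print(masked)
print(labels)
```

During pre-training, the loss is computed only at the positions stored in labels, so the model has to rely on the surrounding unmasked tokens on both sides.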

Through pre-training, BERT learns to generate embeddings that capture the semantic meaning of words and sentences in context.

Get contextualized embeddings

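We will obtain the embeddings with two models: all-MiniLM-L6-v2 through the Hugging Face inference API, and OpenAI's text-embedding-ada-002. Both examples are provided, so if you don't have an OpenAI API key, you can use Hugging Face's free API. Note that both models return a single embedding for the whole input text, so we will be comparing sentence-level embeddings.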

all-MiniLM-L6-v2 (Hugging Face):

import os

import requests
from dotenv import load_dotenv

# Read HUGGINGFACE_API_KEY from a local .env file, if present.
load_dotenv()

API_KEY = os.environ.get("HUGGINGFACE_API_KEY")
api_url = "https://api-inference.huggingface.co/pipeline/feature-extraction/sentence-transformers/all-MiniLM-L6-v2"
headers = {"Authorization": f"Bearer {API_KEY}"}

def get_embedding(text: str):
    """Return the 384-dimensional sentence embedding, or None on failure."""
    res = requests.post(api_url, headers=headers, json={"inputs": text, "options": {"wait_for_model": True}})
    if not res.ok:
        print(f"API request failed with status code: {res.status_code}")
        return None
    embedding = res.json()
    # all-MiniLM-L6-v2 produces 384-dimensional vectors.
    if isinstance(embedding, list) and len(embedding) == 384:
        return embedding
    print("Unexpected response format")
    return None

text-embedding-ada-002 (OpenAI):

import os
from typing import List, Optional

from dotenv import load_dotenv
from openai import OpenAI

# Read OPENAI_API_KEY from a local .env file, if present.
load_dotenv()

API_KEY = os.environ.get("OPENAI_API_KEY")
client_OA = OpenAI(api_key=API_KEY)

def get_embedding(text: str) -> Optional[List[float]]:
    """Return the embedding for the given text, or None on failure."""
    # Replace newlines with spaces before sending the text to the API.
    text = text.replace("\n", " ")
    try:
        response = client_OA.embeddings.create(input=[text], model="text-embedding-ada-002")
        return response.data[0].embedding
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return None

Let's get the embeddings for the two example sentences (with the HF API):

river_bank = "I was driving along the river bank"
fin_bank = "I need to open a new bank account"

river_embedding = get_embedding(river_bank)
fin_embedding = get_embedding(fin_bank)

Now we can compare the two embeddings with the cosine similarity function from the sentence-transformers library. First, install the library:

pip install -U sentence-transformers

Then compute the similarity score with the pytorch_cos_sim function:

from sentence_transformers.util import pytorch_cos_sim

print(pytorch_cos_sim(river_embedding, fin_embedding))
# tensor([[0.2078]])

Cosine similarity ranges from -1 (the vectors point in opposite directions) to 1 (the vectors point in the same direction), with 0 meaning the vectors are orthogonal, that is, unrelated. A score of about 0.2 therefore suggests that the two sentence embeddings are quite dissimilar overall, which matches the different meanings of the sentences.
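For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it in plain Python on made-up toy vectors; the cosine_similarity function is our own illustration, not the library's implementation.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors chosen to hit the three reference points of the scale.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])      # close to -1

print(same_direction, orthogonal, opposite)
```

Only the direction of the vectors matters: scaling a vector by a positive number changes its length but not its cosine similarity with any other vector.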

Conclusion

In this topic, we learned about contextualized embeddings, a technique that represents each word according to the context in which it appears, allowing models to distinguish between the different meanings of the same word.
