
Attention Mechanisms in Transformers


Attention mechanisms are at the heart of modern machine learning architectures, particularly in transformers, which have redefined natural language processing (NLP) and computer vision (CV). In this topic, we will explore the core principles of two key attention mechanisms, self-attention and multi-head attention, discussing how they work in the transformer architecture and why they are so effective at handling sequential data.

Problems with RNNs

Before the advent of transformers, sequence modeling and text generation primarily relied on recurrent neural networks (RNNs). However, RNNs faced several limitations:

  • RNNs process sequences step by step and struggle to retain and access information from earlier parts of long sequences, leading to a loss of context and relevance.

  • During training, the gradients used to update the model's parameters can become too small (vanishing gradients) or too large (exploding gradients), slowing learning or causing instability.

As a result, researchers sought more efficient architectures, leading to the development of the encoder-decoder (seq2seq) architecture in the 2010s, which later evolved into the transformer architecture:

Illustration for the seq2seq mechanism

Transformer

The seq2seq architecture addressed some of the RNN limitations by separating the processing of input and output sequences. The encoder extracts meaningful information from the input sequence, while the decoder generates the output sequence. However, early seq2seq models had a significant limitation: they relied solely on a fixed-size state vector produced after processing the entire input sequence. This made it difficult to retain and focus on relevant information from earlier parts of the sequence, especially for long inputs.

To address this, the attention mechanism was introduced, allowing the model to focus on specific parts of the input sequence while generating the output. However, this was still different from the transformer architecture. In seq2seq models with attention, the encoder and decoder remain separate, and attention helps the decoder access earlier information more effectively.

Transformers, on the other hand, alter this structure with self-attention layers that process all parts of a sequence at once. Since transformers eliminate recurrence, they introduce positional encoding to retain word order information. This architectural shift makes transformers significantly more efficient, as shown in the following diagram of the Transformer architecture:

The Transformer architecture

Input Sequence Processing

Consider an example input sequence such as the sentence "The cat sat on the mat".

The first step in processing this sequence involves tokenization, where the sentence is broken into individual tokens: "the", "cat", "sat", "on", "the", "mat". Each token is then mapped to its corresponding input ID, which represents its position in a predefined vocabulary. This vocabulary is a comprehensive set of tokens that the model recognizes, with each token assigned a unique ID. The input IDs provide a numerical representation of the tokens for the model to process. Next, these tokens are converted into a dense vector representation, known as word embeddings:

Illustration of the transformation from text to embeddings
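
To make the tokenization and input-ID steps concrete, here is a minimal Python sketch. The toy vocabulary and its IDs are invented for illustration only; real models use learned subword vocabularies with tens of thousands of entries.

```python
# A toy vocabulary: token -> input ID (hypothetical values for illustration).
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

sentence = "The cat sat on the mat"
tokens = sentence.lower().split()               # ["the", "cat", "sat", "on", "the", "mat"]
input_ids = [vocab[token] for token in tokens]  # [0, 1, 2, 3, 0, 4]

print(tokens)
print(input_ids)
```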

For example, let the embedding dimension be d_{\text{model}} = 512, so each word is represented as a vector x_i \in \mathbb{R}^{512}, where i denotes the position of the word in the sequence. Since the input sequence contains 6 tokens, the input matrix A will have dimensions (6, 512), with 6 rows (one for each token) and 512 columns (the embedding dimension). Each row A_i corresponds to the embedding vector for the i-th token in the sequence.

Next, we take the transpose of the input matrix A, denoted as A^T, which changes its shape to (512, 6). The transpose adjusts the shape of the input matrix, ensuring it aligns for the necessary matrix multiplication.

When performing matrix multiplication between A with shape (6, 512) and its transpose A^T with shape (512, 6), the result is a matrix of shape (6, 6):

The dimensionality illustration
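
A quick way to check these dimensions is with NumPy. In the sketch below, a random embedding table stands in for the learned embeddings; only the shapes matter here.

```python
import numpy as np

d_model = 512
input_ids = [0, 1, 2, 3, 0, 4]                 # IDs for "The cat sat on the mat"

embedding_table = np.random.randn(5, d_model)  # one row per entry of the toy 5-token vocabulary
A = embedding_table[input_ids]                 # input matrix: one embedding row per token

print(A.shape)          # (6, 512)
print(A.T.shape)        # (512, 6)
print((A @ A.T).shape)  # (6, 6) – one score per pair of tokens
```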

Importantly, the word embeddings used in the input matrix are not static. They change during training as the model learns. This dynamic adjustment allows the embeddings to better represent the context and relationships between tokens, enabling the model to refine its understanding of the sequence.

Positional Encoding

In transformer models, positional encoding is a mechanism that enables the model to capture the order of words in a sequence. While embeddings represent the semantic meaning of words, they do not inherently include any information about the position of the words in a sentence. Without positional encoding, the model would see the sequence as just a collection of words with no understanding of their order.

The main purpose of positional encoding is to add positional information to each word, helping the model understand the order of words in a sequence. This creates a pattern the model can learn, allowing it to recognize relationships between words and their context in the sentence.

To achieve this, the model combines two key components:

  • Word embedding (vector of size 512): Each word is converted into a dense embedding vector (e.g., of size 512) that represents its meaning.

  • Positional embedding (vector of size 512): A precomputed positional embedding (also of size 512) is added to the word embedding to encode the word's position in the sequence. These embeddings are computed once and reused across all sentences during training and inference, ensuring efficiency. Since positional encoding is independent of sentence content, it remains consistent across different sentences:

Positional encodings

The final input for each word is the sum of these two components, combining both meaning and positional information.

Positional encoding PE uses fixed sinusoidal functions to encode the positions of tokens in a sequence. For a given token position pos, embedding dimension index i, total dimensionality d_{\text{model}}, and a scaling constant of 10{,}000, the positional encoding is calculated as follows:

For even indices (2i):

\text{PE}(\text{pos}, 2i) = \sin\left(\frac{\text{pos}}{10000^{\frac{2i}{d_{\text{model}}}}}\right)

For odd indices (2i+1):

\text{PE}(\text{pos}, 2i+1) = \cos\left(\frac{\text{pos}}{10000^{\frac{2i}{d_{\text{model}}}}}\right)

Trigonometric functions, such as sine and cosine, create recognizable patterns that make it easier for the model to identify relative token positions. In this case, sinusoidal functions also provide a periodic structure that enables transformers to generalize to longer sequences and encode positional relationships smoothly.
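
The two formulas above translate directly into a short NumPy function. This is a minimal sketch for our 6-token example; the function name and structure are not taken from any particular library.

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings with base 10,000, as defined above."""
    positions = np.arange(seq_len)[:, np.newaxis]          # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]         # even dimension indices 2i
    angles = positions / np.power(10000, dims / d_model)   # pos / 10000^(2i / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even indices get sine
    pe[:, 1::2] = np.cos(angles)   # odd indices get cosine
    return pe

pe = positional_encoding(seq_len=6, d_model=512)
print(pe.shape)  # (6, 512) – added element-wise to the word embeddings
```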

Self-attention

Self-attention is the mechanism that allows the model to focus on all the words in a sequence when processing each word. This helps the model understand relationships between words, regardless of their distance in the sentence. Unlike RNNs, which process sequences one token at a time, self-attention considers the entire sequence at once.

Let's calculate self-attention step by step using the sequence "The cat sat on the mat" (6 tokens) with an embedding size of d_{\text{model}} = 512.

Step 1. Each token in the sequence is represented as a dense embedding vector of size 512. This gives us an input matrix A, which has dimensions (6, 512):

A = \begin{bmatrix} \text{Embedding (The)} \\ \text{Embedding (cat)} \\ \text{Embedding (sat)} \\ \text{Embedding (on)} \\ \text{Embedding (the)} \\ \text{Embedding (mat)} \end{bmatrix} \hspace{0.5cm} \text{(shape: } 6 \times 512\text{)}

Step 2. To calculate attention, the model needs to represent each token in terms of three vectors:

  • The query (Q) vector represents what the token is looking for in the sequence.

  • The key (K) vector represents what the token offers to other tokens for comparison.

  • The value (V) vector represents the actual information contained in the token.

These Q, K, and V matrices are derived from the input sequence by applying learned weight matrices W_Q, W_K, W_V (each of size 512 \times 512 in our case; in general, the size is d_\text{model} \times d_\text{model}) to the input matrix A:

Q = A \cdot W_Q, \quad K = A \cdot W_K, \quad V = A \cdot W_V

Since A is (6, 512) and W_Q, W_K, W_V are (512, 512), the resulting shapes of Q, K, and V are (6, 512):

Q = \begin{bmatrix} Q_{\text{The}} \\ Q_{\text{cat}} \\ Q_{\text{sat}} \\ Q_{\text{on}} \\ Q_{\text{the}} \\ Q_{\text{mat}} \end{bmatrix}, \quad K = \begin{bmatrix} K_{\text{The}} \\ K_{\text{cat}} \\ K_{\text{sat}} \\ K_{\text{on}} \\ K_{\text{the}} \\ K_{\text{mat}} \end{bmatrix}, \quad V = \begin{bmatrix} V_{\text{The}} \\ V_{\text{cat}} \\ V_{\text{sat}} \\ V_{\text{on}} \\ V_{\text{the}} \\ V_{\text{mat}} \end{bmatrix}
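
In code, steps 1 and 2 amount to three matrix multiplications. The sketch below uses random matrices as stand-ins for the learned weights W_Q, W_K, and W_V, purely to show the shapes involved.

```python
import numpy as np

d_model = 512
A = np.random.randn(6, d_model)          # stand-in for the (6, 512) embedding matrix

# Random stand-ins for the learned projection matrices (each 512 x 512).
W_Q = np.random.randn(d_model, d_model)
W_K = np.random.randn(d_model, d_model)
W_V = np.random.randn(d_model, d_model)

Q = A @ W_Q   # (6, 512)
K = A @ W_K   # (6, 512)
V = A @ W_V   # (6, 512)
print(Q.shape, K.shape, V.shape)
```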

Step 3. To compute the attention scores, the model measures the relevance between tokens in the sequence by taking the dot product of the query matrix Q and the transpose of the key matrix K:

\text{Scores} = Q \cdot K^T

Here, Q has dimensions (6, 512) and K^T (the transpose of K) has dimensions (512, 6), so the result of the matrix multiplication is a scores matrix with dimensions (6, 6). Each element \text{Scores}_{i,j} of this matrix represents the relevance of the i-th token (from Q) to the j-th token (from K) in the sequence:

\text{Scores} = \begin{bmatrix} \text{Score}_{\text{The, The}} & \text{Score}_{\text{The, cat}} & \cdots & \text{Score}_{\text{The, mat}} \\ \text{Score}_{\text{cat, The}} & \text{Score}_{\text{cat, cat}} & \cdots & \text{Score}_{\text{cat, mat}} \\ \vdots & \vdots & \ddots & \vdots \\ \text{Score}_{\text{mat, The}} & \text{Score}_{\text{mat, cat}} & \cdots & \text{Score}_{\text{mat, mat}} \end{bmatrix}

For example, if we focus on one specific calculation for the token "The" (from Q) and the token "cat" (from K), the score is calculated as:

\text{Score}_{\text{The, cat}} = Q_{\text{The}} \cdot K^T_{\text{cat}}

Step 4. To stabilize the calculated scores and avoid large values, the dot products are scaled by the square root of the embedding size (\sqrt{512} in our example). Then, the scores are passed through the softmax function to normalize them, so that each row of the score matrix sums to 1:

\text{Attention Scores} = \text{softmax}\left(\frac{\text{Scores}}{\sqrt{512}}\right)

The attention scores matrix has a shape of (6, 6), where each row corresponds to the attention distribution for one token in the sequence over all tokens (including itself):

For example, for the token "The," the attention scores might be:

\text{Attention Scores for "The"} = [0.4, 0.2, 0.1, 0.1, 0.1, 0.1]

This means the token "The" pays 40% attention to itself, 20% to "cat," and so on.
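
Steps 3 and 4 can be sketched in a few lines of NumPy. Because the queries and keys below are random stand-ins, the resulting attention values will differ from the illustrative 0.4/0.2/... numbers above, but each row still sums to 1.

```python
import numpy as np

Q = np.random.randn(6, 512)   # stand-in for the projected queries
K = np.random.randn(6, 512)   # stand-in for the projected keys

scores = Q @ K.T                              # (6, 6) raw relevance scores
scaled = scores / np.sqrt(512)                # scale by the square root of the embedding size
scaled -= scaled.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
attention_scores = np.exp(scaled) / np.exp(scaled).sum(axis=-1, keepdims=True)  # row-wise softmax

print(attention_scores.shape)        # (6, 6)
print(attention_scores.sum(axis=1))  # each row sums to 1
```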

Step 5. The attention scores are used to compute a weighted sum of the value (V) vectors. This gives the final attention output for each token.

For example, the attention output for "The" is computed as:

\text{Output}_{\text{The}} = 0.4 \cdot V_{\text{The}} + 0.2 \cdot V_{\text{cat}} + 0.1 \cdot V_{\text{sat}} + \dots

The result is an attention output matrix of shape (6, 512), where:

  • Each of the 6 rows corresponds to a token in the input sequence (e.g., "The," "cat," "sat," etc.).

  • Each row is a context-aware embedding of size 512, which combines information from all other tokens in the sequence.

These embeddings are no longer isolated representations of individual tokens; instead, they now encode information about their relationships with all other tokens in the sequence. This allows the model to understand context, such as dependencies and relationships, regardless of token position.
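
In code, step 5 is a single matrix multiplication: the (6, 6) attention scores are multiplied by the (6, 512) value matrix. The sketch below reuses the illustrative attention row for "The" with random stand-in value vectors.

```python
import numpy as np

# Illustrative attention row for "The" (from the example above) and stand-in value vectors.
attention_row_the = np.array([0.4, 0.2, 0.1, 0.1, 0.1, 0.1])
V = np.random.randn(6, 512)

output_the = attention_row_the @ V   # weighted sum of all value vectors -> shape (512,)
print(output_the.shape)

# Multiplying the full (6, 6) attention matrix by V would give the
# complete (6, 512) attention output described above.
```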

Self-attention has the following key properties:

  • Self-attention treats all words equally, without knowing their order. To account for word positions, we use positional encodings to label each word's position in the sentence.

  • It works directly with the meanings of the words (embeddings) and their positions (positional encodings). Beyond the learned projection matrices W_Q, W_K, and W_V, the attention computation itself (dot products, scaling, and softmax) introduces no additional parameters.

  • Each word tends to focus on itself the most, so the diagonal values in the attention table are usually the highest. For example, in "The cat sat on the mat," the word "The" will mostly focus on itself but will also consider nearby words such as "cat" or "sat" to capture their relationships.

Multi-head attention

Instead of computing a single attention score for each token pair in the sequence, multi-head attention splits the model's attention mechanism into multiple "heads". Each head computes its own Q, K, and V matrices from the original embedding matrix A and processes the input independently. This allows the model to capture different relationships or features across the sequence in parallel:

Multihead attention illustration

Multi-head attention can be calculated through the following steps:

Step 1. The input matrix A (the sequence of embeddings) is split into multiple smaller dimensions, one per head. For example, if the embedding size d is 512 and there are 8 heads, each head will work with embeddings of size

d_h = \frac{512}{8} = 64

Step 2. For each head, separate learned weight matrices W_Q, W_K, W_V are applied to compute Q, K, and V for that head.

Step 3. Each head computes attention independently using the self-attention formula:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

Here, Q, K, and V are specific to the head, and d_k is the dimension of Q or K for that head.

Step 4. The outputs from all heads are concatenated to form a single output matrix:

\text{MultiHead}(Q, K, V) = \text{concat}(\text{head}_1, \text{head}_2, \dots, \text{head}_n)W_o

Step 5. The concatenated outputs are linearly transformed to produce the final result of multi-head attention.
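
Putting the five steps together, here is a compact NumPy sketch of multi-head attention. The per-head weight matrices and the output projection W_o are random stand-ins for learned parameters, and the explicit loop over heads is for clarity; practical implementations usually use one large projection reshaped into heads.

```python
import numpy as np

def multi_head_attention(A: np.ndarray, n_heads: int = 8) -> np.ndarray:
    """A simplified multi-head attention sketch with random (untrained) weights."""
    seq_len, d_model = A.shape
    d_h = d_model // n_heads                        # per-head dimension, e.g. 512 / 8 = 64

    heads = []
    for _ in range(n_heads):
        # Per-head projections (random stand-ins for the learned W_Q, W_K, W_V).
        W_Q, W_K, W_V = (np.random.randn(d_model, d_h) for _ in range(3))
        Q, K, V = A @ W_Q, A @ W_K, A @ W_V         # each (seq_len, d_h)

        scores = Q @ K.T / np.sqrt(d_h)             # scaled dot-product scores
        scores -= scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        heads.append(weights @ V)                   # (seq_len, d_h) per head

    concat = np.concatenate(heads, axis=-1)         # (seq_len, d_model)
    W_o = np.random.randn(d_model, d_model)         # output projection (stand-in)
    return concat @ W_o                             # (seq_len, d_model)

out = multi_head_attention(np.random.randn(6, 512))
print(out.shape)  # (6, 512)
```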

In multi-head attention, the weight matrices (W_Q, W_K, W_V) adjust dynamically during training, allowing each head to focus on different aspects of the sequence, such as local dependencies, long-range connections, or syntactic and semantic relationships. This mechanism enables the model to capture diverse patterns and relationships, combining the outputs of all heads to form a richer, context-aware representation of the sequence.

By focusing on multiple aspects of the input, multi-head attention enhances the model's ability to learn both local patterns and long-range dependencies.

Conclusion

Attention mechanisms, especially self-attention and multi-head attention, enable models to effectively process sequential data. By focusing on the most relevant parts of the input, these mechanisms help analyze long-range dependencies, leading to better performance. Transformers, which use these attention mechanisms, have become the foundation of models in NLP and CV.
