Attention Mechanisms in Transformers

Attention Score

Consider a sequence of tokens: ["The", "cat", "sat", "on", "the", "mat"]. The attention score matrix has shape $(6, 6)$, where each row is the attention distribution for one token in the sequence: it shows how much attention that token pays to every token in the sequence, including itself. Because each row is a distribution, its entries sum to 1. For example, the attention scores for the token "The" might be:

$\text{Attention scores for "The"} = [0.4, 0.2, 0.1, 0.1, 0.1, 0.1]$

This means the token "The" pays 40% of its attention to itself, 20% to "cat", and 10% to each of the remaining four tokens.
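To make the shape and the meaning of a row concrete, here is a minimal sketch of scaled dot-product attention in plain Python. The 2-dimensional token vectors in `X` are made-up illustration values, and using the same vectors for queries, keys, and values (no learned projections) is a simplifying assumption, not how a trained transformer is configured. The helper names `softmax`, `weighted_sum`, and `attention` are hypothetical as well.

```python
import math

def softmax(xs):
    """Turn raw scores into a distribution whose entries sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, vectors):
    """Combine vectors using one row of attention scores as weights."""
    dim = len(vectors[0])
    return [sum(w * v[c] for w, v in zip(weights, vectors)) for c in range(dim)]

def attention(Q, K, V):
    """Scaled dot-product attention; returns (score matrix, outputs)."""
    d_k = len(K[0])
    scores, outputs = [], []
    for q in Q:
        # Compare this token's query with every key in the sequence.
        raw = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        row = softmax(raw)  # one row of the attention score matrix
        scores.append(row)
        outputs.append(weighted_sum(row, V))
    return scores, outputs

tokens = ["The", "cat", "sat", "on", "the", "mat"]

# Hypothetical 2-dimensional vectors, one per token (illustration only).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.0], [1.0, 0.0], [0.0, 0.5]]

# Assumption: Q = K = V = X, i.e. no learned projection matrices.
scores, outputs = attention(X, X, X)
print(len(scores), len(scores[0]))  # one row and one column per token

# Apply the example row from the text to the toy vectors.
the_row = [0.4, 0.2, 0.1, 0.1, 0.1, 0.1]
the_output = weighted_sum(the_row, X)
print(the_output)
```

The score matrix has one row and one column per token, and each output vector has the same dimensionality as the input vectors (2 in this toy setup).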

Which of the following statements are correct about how the attention mechanism works?

Select one or more options from the list

The attention scores for a token, such as "The", show how much attention it gives to each token in the sequence, including itself.

The attention scores are used to compute a weighted sum of the value (V) vectors, which gives the final attention output for each token.

The attention scores are directly used to calculate the weighted sum of the key vectors (K), which forms the final attention output for each token.

The attention scores for each token are calculated independently of the other tokens in the sequence.
