
Attention Mechanisms in Transformers

Theory

Attention Score


Consider a sequence of tokens: ["The", "cat", "sat", "on", "the", "mat"]. The attention score matrix has shape (6, 6), where each row holds the attention distribution for one token in the sequence, showing how much attention that token pays to every token in the sequence (including itself). For example, the attention scores for the token "The" might be:

\text{Attention Scores for "The"} = [0.4, 0.2, 0.1, 0.1, 0.1, 0.1]

This means the token "The" pays 40% attention to itself, 20% to "cat," and so on.
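To make the example concrete, here is a minimal NumPy sketch of how such a score matrix could be produced with scaled dot-product attention, softmax(QK^T / sqrt(d_k)). The token list matches the example above, but the query/key vectors, d_model, and the attention_scores helper are illustrative assumptions, not the implementation used by any particular library.

import numpy as np

def attention_scores(Q, K):
    # Scaled dot-product attention scores: softmax(Q K^T / sqrt(d_k))
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)
    # Softmax over each row so the scores for every query token sum to 1
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical random query/key vectors for the six tokens in the example
rng = np.random.default_rng(0)
tokens = ["The", "cat", "sat", "on", "the", "mat"]
d_model = 8
Q = rng.normal(size=(len(tokens), d_model))
K = rng.normal(size=(len(tokens), d_model))

scores = attention_scores(Q, K)
print(scores.shape)        # (6, 6)
print(scores[0].round(2))  # attention distribution for "The" over all six tokens

Each row of scores is a probability distribution, so the first row plays the same role as the example vector [0.4, 0.2, 0.1, 0.1, 0.1, 0.1] above (the actual numbers here depend on the random vectors).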

Which of the following statements are correct about how the attention mechanism works?
