Consider a sequence of six tokens: ["The", "cat", "sat", "on", "the", "mat"]. The attention scores matrix (after softmax) has a shape of 6 × 6, where each row is the attention distribution for one token in the sequence: it shows how much attention that token pays to every token in the sequence, including itself. Each row therefore sums to 1. For example, the attention scores for the token "The" might be:
[0.4, 0.2, …]
This means the token "The" pays 40% of its attention to itself, 20% to "cat," and the remaining 40% is spread across the other four tokens.
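The shape and the row-wise distribution described above can be checked with a small sketch. It assumes scaled dot-product attention with a softmax over each row; the query and key vectors are random placeholders and the dimension `d_k = 4` is a hypothetical choice, so the printed numbers will not match the example row for "The".

```python
import math
import random

tokens = ["The", "cat", "sat", "on", "the", "mat"]
d_k = 4  # hypothetical key/query dimension, chosen only for illustration

random.seed(0)


def random_vectors(n, d):
    """Placeholder vectors standing in for learned query/key projections."""
    return [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


def softmax(row):
    """Numerically stable softmax over one row of raw scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Return one attention distribution (row) per query token."""
    scale = math.sqrt(len(keys[0]))
    weights = []
    for q in queries:
        # Raw score of this query against every key, including its own.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights.append(softmax(scores))
    return weights


queries = random_vectors(len(tokens), d_k)
keys = random_vectors(len(tokens), d_k)
weights = attention_weights(queries, keys)

# One row per token, one column per token: a 6 x 6 matrix.
print(len(weights), "x", len(weights[0]))
# Row for "The": six non-negative values that sum to 1.
print([round(w, 3) for w in weights[0]], round(sum(weights[0]), 6))
```

Under these assumptions, every row has six non-negative entries that sum to 1, which is why a single row can be read as percentages of attention.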
Which of the following statements are correct about how the attention mechanism works?