The self-attention mechanism involves calculating attention scores between a [[query]] and all the [[key]]s in the sequence. These attention scores are then used to compute a weighted sum of the values, which becomes the output for that particular query.
In the case of a sentence, self-attention computes the relation between each word and all the other words in the sentence.
The self-attention mechanism involves three main components: [[query|queries]], [[key|keys]], and [[value|values]]. Each of these components is derived from the [[input embedding|input embeddings]] of the sequence elements (e.g., words in a sentence).
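How the three components are derived can be sketched in a few lines of NumPy. This is a minimal illustration, not a full implementation: the projection matrices `W_q`, `W_k`, `W_v` are random here, whereas in a trained model they are learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model = 4, 8                   # e.g. a 4-word sentence, 8-dim embeddings
X = rng.normal(size=(seq_len, d_model))   # input embeddings, one row per word

# Learned projection matrices (random here, purely for illustration)
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q = X @ W_q   # query vectors, one per position
K = X @ W_k   # key vectors
V = X @ W_v   # value vectors

print(Q.shape, K.shape, V.shape)  # each is (4, 8)
```

Each position in the sequence thus gets its own query, key, and value vector, all computed from the same input embedding.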
A step-by-step explanation of how self-attention works:
1. **Embeddings and linear projections:**
- Each word in the sequence is represented by a vector, often referred to as a [[input embedding|word embedding]].
- These embeddings are linearly transformed into three types of vectors: [[query]] vectors, [[key]] vectors, and [[value]] vectors. This transformation involves learned weight matrices.
2. **Calculating [[attention score|attention scores]]:**
	- For each [[query]] vector, the model calculates a set of [[attention score|attention scores]] by taking the [[dot product]] between the [[query]] vector and all [[key]] vectors in the sequence. In practice these scores are divided by the square root of the key dimension (scaled dot-product attention), which keeps the softmax in a well-behaved range.
	- The dot product measures the similarity between the [[query]] and each [[key]] in the learned representation space, giving higher scores to words that are more relevant to the query.
3. **Applying [[softmax function|softmax]] and weighted sum:**
- The attention scores for a specific query are passed through a [[softmax function]] to obtain a probability distribution over all words in the sequence.
- These probabilities indicate how much attention each word should receive from the query.
	- The [[value]] vectors are then weighted by these probabilities and summed, so the result blends the contributions of the different words in proportion to their relevance to the query.
4. **Output representation:**
- The weighted sum obtained from the previous step is the output of the self-attention mechanism for that specific [[query]].
- This output captures information from words that are most relevant to the query, considering their semantic relationships and positions in the sequence.
5. **Multi-head self-attention:**
- Transformer models often use multiple sets of query-key-value transformations and self-attention operations in parallel, called "heads."
- Each [[attention head]] learns different relationships between words, capturing different patterns and nuances within the data.
- The outputs from all heads are concatenated and linearly transformed to produce the final output of the self-attention mechanism.
By using self-attention, transformer models can effectively capture dependencies between words regardless of their distance in the sequence: every pair of positions interacts directly in a single attention step. This is in contrast to traditional sequential models like RNNs, which process the sequence one token at a time and can struggle to carry information across long ranges. Self-attention is one of the key innovations behind the success of transformers in natural language processing tasks such as machine translation, text generation, and sentiment analysis.
[[learnius/llms/2 LLMs and Transformers/attention]] < [[Hands-on LLMs]]/[[2 LLMs and Transformers]] > [[query]]