In addition to self-attention, the [[transformer decoder]] also uses an attention mechanism that attends to the encoded input sequence (the output of the encoder). This enables the decoder to incorporate information from the input sequence while generating the output sequence. The encoder-decoder attention is a key component of the [[transformer model]] architecture, particularly in sequence-to-sequence tasks such as machine translation and text summarization.

The encoder-decoder attention works as follows:

1. **Input from encoder:**
    - The encoded input sequence is provided as input to the decoder.
    - The encoder output is projected into a key vector and a value vector for each position in the input sequence.
2. **Query from decoder:**
    - The decoder generates a query vector for the current position during sequence generation.
3. **Calculating attention scores:**
    - The [[attention score|attention scores]] are calculated by taking the [[dot product]] between the [[query]] vector from the decoder and the [[key]] vectors from the encoder.
    - These attention scores reflect the relevance or similarity between the decoder's current position and each position in the input sequence.
4. **Weighted sum of values:**
    - The attention scores are scaled, normalized with a softmax, and then used to form a weighted sum of the [[value]] vectors from the encoder (see the code sketch at the end of this note).
    - This weighted sum provides the decoder with the information from the input sequence that is most relevant to the current step of generating the output sequence.

Mathematically, encoder-decoder attention can be written as:

$$
\text{Attention}(\text{Query}, \text{Key}, \text{Value}) = \text{Softmax}\left( \frac{\text{Query} \cdot \text{Key}^\top}{\sqrt{d_k}} \right) \cdot \text{Value}
$$

Where:

- $\text{Query}$ is the query vector generated by the decoder.
- $\text{Key}$ and $\text{Value}$ are the key and value vectors computed from the encoder output for each input position.
- $d_k$ is the dimension of the key vectors; dividing by $\sqrt{d_k}$ keeps the scores at a stable scale before the softmax.
- The dot product $\text{Query} \cdot \text{Key}^\top$ measures the similarity between the query and the key vectors.

The encoder-decoder attention mechanism allows the decoder to consider different parts of the input sequence while generating each token of the output sequence. This helps the decoder capture relevant information from the input and produce contextually appropriate outputs.

It's worth noting that encoder-decoder attention operates in conjunction with the self-attention mechanisms within both the encoder and the decoder.
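The formula maps directly onto a few lines of code. Below is a minimal NumPy sketch of single-head encoder-decoder attention; the names (`encoder_decoder_attention`, `W_q`, `W_k`, `W_v`) and the toy dimensions are illustrative assumptions rather than any particular library's API, and a real transformer layer would add multiple heads, an output projection, and batching.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encoder_decoder_attention(decoder_states, encoder_outputs, W_q, W_k, W_v):
    """Single-head encoder-decoder (cross-) attention.

    decoder_states:  (target_len, d_model) decoder hidden states
    encoder_outputs: (source_len, d_model) final encoder outputs
    W_q, W_k, W_v:   (d_model, d_k)        learned projection matrices (assumed names)
    """
    # 1. Queries come from the decoder; keys and values from the encoder.
    Q = decoder_states @ W_q           # (target_len, d_k)
    K = encoder_outputs @ W_k          # (source_len, d_k)
    V = encoder_outputs @ W_v          # (source_len, d_k)

    d_k = K.shape[-1]

    # 2./3. Scaled dot-product scores between each decoder query and every
    #       encoder key, turned into attention weights by the softmax.
    scores = Q @ K.T / np.sqrt(d_k)    # (target_len, source_len)
    weights = softmax(scores, axis=-1)

    # 4. Weighted sum of the encoder value vectors for each decoder position.
    return weights @ V                 # (target_len, d_k)

# Toy example: 5 source tokens, 3 target tokens, model width 8, head width 4.
rng = np.random.default_rng(0)
d_model, d_k, src_len, tgt_len = 8, 4, 5, 3
encoder_outputs = rng.normal(size=(src_len, d_model))
decoder_states = rng.normal(size=(tgt_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

context = encoder_decoder_attention(decoder_states, encoder_outputs, W_q, W_k, W_v)
print(context.shape)  # (3, 4): one context vector per decoder position
```

Unlike the decoder's masked self-attention, no causal mask is needed here: every decoder position may attend to every input position, because the entire source sequence is already available.

[[masked multi-head self-attention]] < [[Hands-on LLMs]]/[[2 LLMs and Transformers]] > [[transformer output layer]]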