Similar to the [[transformer encoder]], the [[transformer decoder]] has a [[multi-head self-attention]] mechanism. However, during decoding, the self-attention is masked so that the model only attends to tokens that have been generated up to the current position. This masking prevents the model from "cheating" by looking at future tokens when generating the current token.

[[transformer decoder]] < [[Hands-on LLMs]]/[[2 LLMs and Transformers]] > [[encoder-decoder attention]]
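
A minimal sketch of this causal masking for a single attention head, in PyTorch (the function name, shapes, and toy inputs are illustrative, not taken from the book):

```python
import torch
import torch.nn.functional as F

def masked_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: tensors of shape (seq_len, d_head).
    Position i may only attend to positions <= i.
    """
    seq_len, d_head = q.shape
    scores = q @ k.T / d_head ** 0.5               # (seq_len, seq_len)

    # Causal mask: True above the diagonal marks "future" positions.
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))

    weights = F.softmax(scores, dim=-1)            # row i sums to 1 over positions 0..i
    return weights @ v                             # (seq_len, d_head)

# Toy usage: 4 tokens, 8-dimensional head (self-attention, so q = k = v).
x = torch.randn(4, 8)
out = masked_self_attention(x, x, x)
```

Because the masked positions get a score of negative infinity, the softmax assigns them exactly zero weight; in [[multi-head self-attention]] the same mask is applied independently within every head.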