In transformer models, a "head" refers to one of several subcomponents that run in parallel within a [[multi-head self-attention]] layer. Transformers use multiple heads to capture different patterns and relationships in the input data.

The [[self-attention]] mechanism calculates attention scores between [[query]], [[key]], and [[value]] vectors. In [[multi-head self-attention]], multiple instances (heads) of the self-attention mechanism run in parallel. Each head has its own set of learned linear transformations (projection matrices) for [[query|queries]], [[key|keys]], and [[value|values]], which allows each head to focus on different aspects of the input sequence. For example, one head might attend to syntactic relationships while another attends to semantic relationships.

Using multiple heads lets the model learn diverse features and relationships from the input simultaneously, which benefits a wide range of tasks, including machine translation, text generation, and sentiment analysis. The number of heads is a hyperparameter that can be tuned: increasing it can improve the model's capacity to learn complex patterns, but it also increases the computational cost.
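As a concrete illustration, below is a minimal sketch of multi-head self-attention in PyTorch. The class name, the dimensions, and the choice of one combined linear layer per role (later split across heads) are assumptions for illustration, not taken from any particular library implementation.

```python
# Minimal multi-head self-attention sketch (names and sizes are illustrative
# assumptions, not a specific library's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Each head gets its own learned projections; here they are packed into
        # one linear layer per role (query/key/value) and split afterwards.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # recombines the heads

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, d_model = x.shape

        # Project, then reshape to (batch, heads, seq_len, d_head) so every
        # head attends over the sequence independently and in parallel.
        def split_heads(t: torch.Tensor) -> torch.Tensor:
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q = split_heads(self.w_q(x))
        k = split_heads(self.w_k(x))
        v = split_heads(self.w_v(x))

        # Scaled dot-product attention, computed per head.
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        weights = F.softmax(scores, dim=-1)
        context = weights @ v  # (batch, heads, seq_len, d_head)

        # Concatenate the heads back together and apply the output projection.
        context = context.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.w_o(context)

# Usage: 2 sequences of 5 tokens, model width 64, 8 heads of size 8 each.
x = torch.randn(2, 5, 64)
out = MultiHeadSelfAttention(d_model=64, num_heads=8)(x)
print(out.shape)  # torch.Size([2, 5, 64])
```

[[logit]] < [[Hands-on LLMs]]/[[2 LLMs and Transformers]] > [[multi-head self-attention]]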