A residual connection (also known as a skip connection or shortcut connection) is a mechanism used to address the vanishing gradient problem in deep neural networks and to improve training and convergence. Residual connections are a crucial component of the [[transformer model]] architecture and play a key role in making its very deep stacks of layers trainable.
The idea behind a residual connection is to add the original input (or a modified version of it) to the output of a deeper layer. This helps mitigate the degradation of gradient information as it flows backward through multiple layers during training. Residual connections enable the network to learn incremental changes rather than trying to learn the entire transformation from scratch.
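As a minimal sketch of this idea in PyTorch (the `Residual` wrapper class and the layer sizes are illustrative, not from any specific library), the sub-layer's output is simply added back to its input:

```python
import torch
import torch.nn as nn

# Minimal sketch: a residual (skip) connection around an arbitrary sub-layer.
# The sub-layer only has to learn the incremental change to apply to x,
# not the entire transformation from scratch.
class Residual(nn.Module):
    def __init__(self, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(x)  # skip connection: input added to sub-layer output

# Example usage with a hypothetical 512-dimensional linear sub-layer
block = Residual(nn.Linear(512, 512))
y = block(torch.randn(2, 10, 512))  # output has the same shape as the input
```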
The [[transformer model]] architecture applies a residual connection around each sub-layer of a block, starting with the [[multi-head self-attention]] sub-layer; each residual addition is followed by layer normalization.
The input to the multi-head self-attention sub-layer is added to that sub-layer's output (with dropout applied to the sub-layer output before the addition), and the sum is passed through layer normalization. Mathematically, the residual connection in the self-attention layer can be represented as:
$$
\text{Output} = \text{LayerNorm}(\text{Input} + \text{SelfAttention}(\text{Input}))
$$
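A compact sketch of this post-norm pattern around self-attention, using PyTorch's `nn.MultiheadAttention` (the dimensions, head count, and dropout rate below are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Sketch of the post-norm residual pattern around self-attention:
# Output = LayerNorm(Input + SelfAttention(Input))
class SelfAttentionBlock(nn.Module):
    def __init__(self, d_model: int = 512, num_heads: int = 8, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Self-attention: queries, keys, and values all come from x
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        # Residual connection, then layer normalization of the sum
        return self.norm(x + self.dropout(attn_out))

x = torch.randn(2, 16, 512)           # (batch, sequence, d_model)
print(SelfAttentionBlock()(x).shape)  # torch.Size([2, 16, 512])
```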
A residual connection is also applied around the [[feedforward neural network]] sub-layer that follows the attention sub-layer:
$$
\text{Output} = \text{LayerNorm}(\text{Input} + \text{FeedForward}(\text{Input}))
$$
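The same pattern can be sketched for the feed-forward sub-layer; again the hidden size `d_ff = 2048` and other hyperparameters are illustrative defaults, not prescribed values:

```python
import torch
import torch.nn as nn

# Sketch of the residual pattern around the position-wise feed-forward sub-layer:
# Output = LayerNorm(Input + FeedForward(Input))
class FeedForwardBlock(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Residual connection, then layer normalization of the sum
        return self.norm(x + self.dropout(self.ff(x)))

x = torch.randn(2, 16, 512)          # (batch, sequence, d_model)
print(FeedForwardBlock()(x).shape)   # torch.Size([2, 16, 512])
```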
Residual connections have been shown to improve the training and convergence of deep networks, enabling the successful training of extremely deep architectures such as those used in transformer models. They contribute to the stability and effectiveness of transformers in capturing complex patterns and relationships in sequential data.
[[multi-head self-attention]] < [[Hands-on LLMs]]/[[2 LLMs and Transformers]] > [[transformer encoder]]