Content-based attention is a type of attention mechanism commonly used in sequence-to-sequence models, particularly in tasks such as machine translation or text generation. It allows the model to focus on different parts of the input sequence when generating the corresponding output sequence.
In content-based attention, the attention weights are computed based on the content or meaning of the input sequence at each time step. It involves comparing the decoder state with the encoder hidden states to determine the relevance or importance of each encoder hidden state for the current decoding step.
Typically, content-based attention uses a similarity metric, such as the dot product or cosine similarity, to measure the similarity between the decoder state and each encoder hidden state. These similarity scores are then normalized to obtain attention weights, indicating how much each encoder hidden state contributes to generating the current output.
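As a rough illustration of this scoring-and-normalizing step, the following minimal NumPy sketch turns dot-product scores between a decoder state and the encoder hidden states into attention weights with a softmax. The shapes and variable names are made up for the example and do not correspond to any particular model.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical sizes: one decoder state and T_x encoder hidden states of size d.
d, T_x = 4, 5
rng = np.random.default_rng(0)
decoder_state = rng.normal(size=d)          # s_t
encoder_states = rng.normal(size=(T_x, d))  # h_1 ... h_{T_x}

# Dot-product similarity between the decoder state and each encoder hidden state,
# normalized with a softmax to obtain attention weights that sum to 1.
scores = encoder_states @ decoder_state     # shape (T_x,)
weights = softmax(scores)
print(weights)
```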
The purpose of content-based attention is to allow the model to focus on the most relevant parts of the input sequence during the generation process. By attending to different parts of the input, the model can effectively capture the dependencies and relationships between the input and output sequences, leading to improved translation or generation quality.
Content-based attention has been widely used in various sequence-to-sequence tasks and has proven to be effective in aligning input and output sequences, improving translation accuracy, and generating coherent and meaningful text.
Content-based attention was initially applied to machine translation with encoder-decoder models. In this framework, the encoder converts the input sequence $\mathbf{x}$ into a vector $\mathbf{c}$ using hidden states $\mathbf{h}$:
$
h_{t} = f (x_{t}, h_{t-1})
$
where $h_{t}$ is a hidden state at time $t$ and $f()$ is a non-linear function.
The output of the encoder is a non-linear function $q$ of the hidden states:
$
\mathbf{c} = q(\{h_{1}, \dots, h_{T_{x}}\})
$
The output of the encoder is referred to as the context vector.
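As a toy sketch of this encoder, the code below assumes a plain $\tanh$ recurrence for $f$ and takes $q$ to be simply the last hidden state, one common choice in early encoder-decoder models; all names and sizes are illustrative.

```python
import numpy as np

def encode(x_seq, W_xh, W_hh, b_h):
    """Toy encoder: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    d_h = W_hh.shape[0]
    h = np.zeros(d_h)
    hidden_states = []
    for x_t in x_seq:                      # x_seq has shape (T_x, d_x)
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hidden_states.append(h)
    return np.stack(hidden_states)         # annotations h_1 ... h_{T_x}, shape (T_x, d_h)

# Hypothetical sizes and random weights, for illustration only.
d_x, d_h, T_x = 3, 4, 6
rng = np.random.default_rng(1)
x = rng.normal(size=(T_x, d_x))
W_xh = rng.normal(size=(d_h, d_x))
W_hh = rng.normal(size=(d_h, d_h))
b_h = np.zeros(d_h)

H = encode(x, W_xh, W_hh, b_h)
c = H[-1]   # one choice of q({h_1, ..., h_{T_x}}): the last hidden state
```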
The decoder is trained to predict the next word $y_{t}$ given the context vector and previously predicted words. It defines a probability over the translation $\mathbf{y}=(y_{1},\dots,y_{T_{y}})$ by decomposing the joint probability into the ordered conditionals
$
p(\mathbf{y}) = \prod_{t=1}^{T_{y}} p(y_{t}|\{y_{1}, \dots, y_{t-1}\}, \mathbf{c})
$
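For example, for a three-word translation $\mathbf{y} = (y_{1}, y_{2}, y_{3})$ this factorization reads
$
p(\mathbf{y}) = p(y_{1}|\mathbf{c}) \, p(y_{2}|y_{1}, \mathbf{c}) \, p(y_{3}|y_{1}, y_{2}, \mathbf{c})
$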
Each conditional probability can be modeled with a [[recurrent neural network (RNN)]]:
$
p(y_{t}|\{y_{1},\dots,y_{t-1}\}, \mathbf{c}) = g(y_{t-1}, s_{t}, \mathbf{c})
$
where $s_{t}$ is the hidden state of the RNN $g()$.
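A minimal sketch of one such decoder step is shown below, assuming $g$ is a linear layer over $y_{t-1}$, $s_{t}$, and $\mathbf{c}$ followed by a softmax over the vocabulary; every name and size here is illustrative, not taken from a specific implementation.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def decoder_step(y_prev, s_prev, c, params):
    """One toy decoder step: update s_t, then compute p(y_t | y_<t, c) = g(y_{t-1}, s_t, c)."""
    W_ys, W_ss, W_cs, W_out = params
    s_t = np.tanh(W_ys @ y_prev + W_ss @ s_prev + W_cs @ c)   # new decoder state
    logits = W_out @ np.concatenate([y_prev, s_t, c])         # g combines y_{t-1}, s_t, and c
    return softmax(logits), s_t                               # distribution over the vocabulary

# Hypothetical sizes: output embedding d_y, decoder state d_s, context d_c, vocabulary V.
d_y, d_s, d_c, V = 3, 4, 4, 10
rng = np.random.default_rng(2)
params = (rng.normal(size=(d_s, d_y)),
          rng.normal(size=(d_s, d_s)),
          rng.normal(size=(d_s, d_c)),
          rng.normal(size=(V, d_y + d_s + d_c)))
p_y, s_t = decoder_step(rng.normal(size=d_y), np.zeros(d_s), rng.normal(size=d_c), params)
```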
The model proposed by Bahdanau et al. (2014) uses a bidirectional RNN as the encoder and a decoder that emulates searching through the source sentence while decoding the translation.
In this case the hidden states $h_{j}$ can be considered _annotations_ of the input words, and the context vector is a weighted sum of these annotations:
$
c_{i} = \sum_{j=1}^{T_{x}} \alpha_{ij} h_{j}
$
The weight $\alpha_{ij}$ of each annotation $h_{j}$ is computed by
$
\alpha_{ij} = \text{softmax}_{j}(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_{x}} \exp(e_{ik})}
$
where $e_{ij}$ is the score for how well the input around position $j$ matches the output around position $i$ (alignment model):
$
e_{ij} = a(s_{i-1}, h_{j})
$
The alignment model can be parametrized as a feedforward neural network jointly trained with the other components of the system:
$
e_{ij} = w^{T} \tanh(W s_{i-1} + V h_{j} + b)
$
where $w$ and $b$ are vectors, and $W$ and $V$ are matrices.
![[bahdanau-cont-based-att.png|300]]
[Bahdanau et al (2014)](https://arxiv.org/pdf/1409.0473)
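Putting the pieces together, here is a minimal NumPy sketch of this content-based (additive) attention step; the dimensions and random weights are illustrative, not the configuration used in the paper.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def additive_attention(s_prev, H, W, V_, w, b):
    """Content-based (additive) attention in the style of Bahdanau et al. (2014).

    s_prev : previous decoder state s_{i-1}, shape (d_s,)
    H      : encoder annotations h_1 ... h_{T_x}, shape (T_x, d_h)
    """
    # Alignment scores e_ij = w^T tanh(W s_{i-1} + V h_j + b), one per input position j.
    e = np.array([w @ np.tanh(W @ s_prev + V_ @ h_j + b) for h_j in H])
    alpha = softmax(e)   # attention weights alpha_ij, sum to 1 over j
    c = alpha @ H        # context vector c_i = sum_j alpha_ij h_j
    return c, alpha

# Hypothetical sizes: decoder state d_s, annotation d_h, attention dimension d_a.
d_s, d_h, d_a, T_x = 4, 6, 5, 7
rng = np.random.default_rng(3)
s_prev = rng.normal(size=d_s)
H = rng.normal(size=(T_x, d_h))
W = rng.normal(size=(d_a, d_s))
V_ = rng.normal(size=(d_a, d_h))
w = rng.normal(size=d_a)
b = np.zeros(d_a)

c_i, alpha_i = additive_attention(s_prev, H, W, V_, w, b)
```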
The content-based attention mechanism was later extended to [[location-sensitive attention|location-based attention]], where the model can explicitly use previous alignments when computing the next attention state.
## Reference
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. [arXiv:1409.0473](https://arxiv.org/pdf/1409.0473), September 2014.