Location-sensitive attention, also called location-based attention, is a variant of the attention mechanism commonly used in sequence-to-sequence models, particularly in speech recognition and speech synthesis. It extends the standard attention mechanism by incorporating additional information about the alignment between the input and output sequences.
In location-sensitive attention, the attention weights are computed not only from the current decoder state and the encoder hidden states, but also from the "locations" of the previous alignments, i.e. the attention weights produced at previous time steps.
The purpose of location-sensitive attention is to give the model a notion of its alignment history, allowing it to better capture temporal dependencies and align each generated output with the relevant part of the input. By conditioning on the previous alignment, the mechanism can shift its focus along the input sequence in a way that is consistent with where it attended at the previous step.
This addresses limitations of purely content-based attention, which assigns identical scores to similar or repeated input frames and can therefore skip or repeat parts of long input sequences. By encouraging the alignment to move forward consistently, location-sensitive attention yields more accurate alignments and, in speech synthesis, higher-quality output.
Location-based attention extends the [[content-based attention]] mechanism by including the previous alignment ($\alpha_{i-1}$) in the computation of the next attention weights.
To make the content-based model location-aware, $k$-dimensional feature vectors $f_{ij} \in \mathbb{R}^{k}$ are extracted for every position $j$ of the previous alignment $\alpha_{i-1}$ by convolving it with a matrix $F \in \mathbb{R}^{k \times r}$:
$
f_{i} = F \ast \alpha_{i-1}
$
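The sketch below shows this convolution step in NumPy; the function and variable names are illustrative assumptions (not from the paper's code), and "same" padding is assumed so that every input position $j$ receives a $k$-dimensional feature vector $f_{ij}$.

```python
import numpy as np

def location_features(alpha_prev: np.ndarray, F: np.ndarray) -> np.ndarray:
    """alpha_prev: previous alignment, shape (T_x,); F: filters, shape (k, r).
    Returns f: location features, shape (T_x, k), one k-vector per position j."""
    k, r = F.shape
    pad = r // 2
    # "same" padding (an assumption) so every position j gets a feature vector
    padded = np.pad(alpha_prev, (pad, r - 1 - pad))
    T_x = alpha_prev.shape[0]
    f = np.empty((T_x, k))
    for j in range(T_x):
        window = padded[j:j + r]   # r-wide slice of the previous alignment around j
        f[j] = F @ window          # f_{ij} = (F * alpha_{i-1}) evaluated at position j
    return f
```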
These additional vectors are then used by the scoring mechanism of the feedforward neural network that implements the alignment model:
$
e_{ij} = w^{T} \tanh(W s_{i} + V h_{j} + U f_{ij} + b)
$
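As a rough sketch of this scoring step (all weight names and dimensions below are assumptions for illustration), the energies $e_{ij}$ for one decoder step can be computed over all encoder positions at once:

```python
import numpy as np

def energies(s_i, H, f, W, V, U, w, b):
    """Additive, location-aware scoring (a sketch; shapes are assumptions).
    s_i: decoder state, (d_s,);  H: encoder states, (T_x, d_h);
    f: location features, (T_x, k);
    W: (d_a, d_s), V: (d_a, d_h), U: (d_a, k), w: (d_a,), b: (d_a,).
    Returns e_i: scores, shape (T_x,)."""
    hidden = np.tanh(W @ s_i + H @ V.T + f @ U.T + b)  # (T_x, d_a), broadcast over j
    return hidden @ w                                   # e_{ij} = w^T tanh(...)
```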
The context vector continues to be a weighted sum of the encoder hidden states $h_{j}$ (the _annotations_):
$
c_{i} = \sum_{j=1}^{T_{x}} \alpha_{ij} h_{j}
$
The weight $\alpha_{ij}$ of each annotation $h_{j}$ is computed by normalising the scores with a softmax over the input positions:
$
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_{x}} \exp(e_{ik})}
$
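Completing the sketch with the same illustrative names, the scores are normalised with a softmax over the input positions and the context vector is the resulting weighted sum of the encoder states:

```python
import numpy as np

def attend(e_i: np.ndarray, H: np.ndarray):
    """e_i: scores, (T_x,);  H: encoder states, (T_x, d_h).
    Returns (alpha_i, c_i): alignment (T_x,) and context vector (d_h,)."""
    e = e_i - e_i.max()                    # subtract max for a numerically stable softmax
    alpha_i = np.exp(e) / np.exp(e).sum()  # alpha_{ij} = exp(e_{ij}) / sum_k exp(e_{ik})
    c_i = alpha_i @ H                      # c_i = sum_j alpha_{ij} h_j
    return alpha_i, c_i
```

The alignment `alpha_i` returned here would then feed back into the convolution step above at the next decoder time step, which is what makes the mechanism location-sensitive.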
## Reference
J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-Based Models for Speech Recognition," in Proc. NIPS, 2015, pp. 577–585. [PDF](https://arxiv.org/pdf/1506.07503)