![[grave-icml-2006.png]]
The idea of using neural networks for speech recognition dates back to the late 1980s and early 1990s. Researchers such as Alex Waibel and Geoffrey Hinton were among the first to explore neural acoustic models, using architectures such as time-delay neural networks (TDNNs) and recurrent neural networks (RNNs), typically as components of hybrid systems built around hidden Markov models (HMMs).
In the late 2000s and early 2010s, researchers at Microsoft Research, including Li Deng, Dong Yu, Alex Acero, and Xuedong Huang, developed deep neural network (DNN) acoustic models for speech recognition. These models used a deep, hierarchical architecture to learn features at different levels of abstraction and showed significant improvements over the previous GMM-HMM systems.
The idea of training ANNs end-to-end for speech recognition was first made practical by Alex Graves and colleagues in 2006 ([Graves et al., ICML 2006](https://www.cs.toronto.edu/~graves/icml_2006.pdf)), who introduced connectionist temporal classification (CTC), a loss function that lets a network map an unsegmented input sequence to a label sequence without frame-level alignments.
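A minimal sketch of how CTC training looks in practice, assuming PyTorch and illustrative shapes (the vocabulary size, feature dimension, and layer sizes below are placeholders, not the setup from the paper):

```python
import torch
import torch.nn as nn

vocab_size = 29          # e.g. 26 letters + space + apostrophe + blank (index 0)
feat_dim = 80            # acoustic feature dimension (e.g. log-mel filterbanks)
T, N, U = 100, 4, 20     # input frames, batch size, target length

# A recurrent acoustic model that emits per-frame label probabilities.
rnn = nn.LSTM(feat_dim, 256, num_layers=2, bidirectional=True)
proj = nn.Linear(512, vocab_size)
ctc_loss = nn.CTCLoss(blank=0)

feats = torch.randn(T, N, feat_dim)                  # (time, batch, features)
targets = torch.randint(1, vocab_size, (N, U))       # label indices; 0 is reserved for blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), U, dtype=torch.long)

hidden, _ = rnn(feats)
log_probs = proj(hidden).log_softmax(dim=-1)         # (T, N, vocab_size)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```

The key point is that no per-frame alignment between `feats` and `targets` is needed; CTC marginalizes over all alignments internally.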
In 2014, researchers at Google achieved a breakthrough with the sequence-to-sequence (seq2seq) model, which pairs an RNN-based encoder with an RNN-based decoder to map an input sequence directly to an output sequence. Originally developed for machine translation, the approach was soon adapted to speech recognition, with the encoder consuming acoustic features and the decoder emitting the transcription directly. It showed significant improvements over previous methods and has since become a widely used framework for end-to-end speech recognition.
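A rough sketch of the encoder-decoder idea applied to speech, again assuming PyTorch; the class name, layer sizes, and teacher-forced decoding below are illustrative, not the original recipe:

```python
import torch
import torch.nn as nn

feat_dim, hidden, vocab_size = 80, 256, 30   # assumes a <sos>/<eos> token convention

class Seq2SeqASR(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, feats, prev_tokens):
        # Encode the whole utterance into a final hidden state,
        # then condition the decoder on it (no attention yet).
        _, state = self.encoder(feats)
        dec_in = self.embed(prev_tokens)
        dec_out, _ = self.decoder(dec_in, state)
        return self.out(dec_out)             # (batch, target_len, vocab_size)

model = Seq2SeqASR()
feats = torch.randn(4, 100, feat_dim)        # (batch, frames, features)
prev = torch.randint(0, vocab_size, (4, 20)) # teacher-forced previous tokens
logits = model(feats, prev)
```

Forcing the entire utterance through a single fixed-size encoder state is the bottleneck that attention (next paragraph) was introduced to remove.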
In 2015, attention-based models were first applied to ASR ([Chorowski et al., 2015](https://arxiv.org/abs/1506.07503)), letting the decoder learn which encoder frames to attend to at each output step instead of relying on a single fixed-size summary of the utterance.
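A sketch of content-based (additive) attention of the kind used in attention-based ASR, assuming PyTorch with illustrative dimensions; the location-aware terms from Chorowski et al. are omitted here:

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, attn_dim)
        self.w_dec = nn.Linear(dec_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, enc_states, dec_state):
        # enc_states: (batch, frames, enc_dim), dec_state: (batch, dec_dim)
        scores = self.v(torch.tanh(self.w_enc(enc_states) + self.w_dec(dec_state).unsqueeze(1)))
        weights = scores.squeeze(-1).softmax(dim=-1)           # one weight per input frame
        context = (weights.unsqueeze(-1) * enc_states).sum(1)  # weighted summary of the encoder
        return context, weights

attn = AdditiveAttention(enc_dim=512, dec_dim=256, attn_dim=128)
context, weights = attn(torch.randn(4, 100, 512), torch.randn(4, 256))
```

At each decoding step the decoder recomputes `weights` over all encoder frames, so the alignment between audio and text is learned jointly with recognition.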
Since then, researchers have continued to refine and improve end-to-end speech recognition models, using architectures such as convolutional neural networks (CNNs), Transformer networks, and attention mechanisms. Today, end-to-end models are the state of the art in speech recognition.