An end-to-end model is a type of machine learning model that is designed to directly map the input to the output, without relying on multiple stages or components used in the [[traditional pipeline model]]. In other words, an end-to-end model takes in raw input data (such as speech) and produces a desired output (such as transcription) in a single step.
The term "end-to-end" refers to the fact that the model spans the entire process, from input to output, without the need for any intermediate steps or representations. This approach can be contrasted with traditional pipeline models, which typically involve several stages of processing, each of which performs a specific subtask (such as feature extraction, normalization, or classification).
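The contrast can be made concrete with a toy sketch. Everything below is illustrative and hypothetical: the pipeline version chains hand-designed stages (feature extraction, normalization, classification), while the end-to-end version is a single parameterized mapping from raw input to output whose weights would be learned jointly.

```python
import numpy as np

# --- Pipeline approach: separate, hand-designed stages ---
def extract_features(raw_audio):
    # Hand-crafted features: mean energy per 4-sample frame.
    frames = raw_audio.reshape(-1, 4)
    return (frames ** 2).mean(axis=1)

def normalize(features):
    return (features - features.mean()) / (features.std() + 1e-8)

def classify(features):
    # A separately trained classifier would sit here; we just threshold.
    return (features > 0).astype(int)

def pipeline_model(raw_audio):
    # Output produced via explicit intermediate representations.
    return classify(normalize(extract_features(raw_audio)))

# --- End-to-end approach: one mapping, raw input straight to output ---
def end_to_end_model(raw_audio, weights):
    # A single model maps raw samples directly to labels;
    # "weights" would be learned jointly from (input, output) pairs.
    frames = raw_audio.reshape(-1, 4)
    return (frames @ weights > 0).astype(int)

raw = np.array([0.1, -0.2, 0.3, 0.5, -0.1, 0.0, 0.2, -0.4])
print(pipeline_model(raw))                 # three explicit stages
print(end_to_end_model(raw, np.ones(4)))   # one step
```

The point is structural, not the arithmetic: in the pipeline each stage is designed and tuned separately, while the end-to-end model exposes a single function whose internals are optimized as a whole.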
One of the advantages of end-to-end models is that they can often be trained more efficiently than pipeline models, as they require fewer separate training steps. They can also be more flexible and adaptable to different types of input and output, as they do not rely on handcrafted features or intermediate representations.
A common structured approach to implementing an end-to-end model is the [[sequence-to-sequence (seq2seq) model]], in which an encoder network compresses the input sequence into an intermediate representation and a decoder network generates the output sequence from it. Because the encoder and decoder are unrolled independently, seq2seq models naturally handle input and output sequences of different and varying lengths, which makes them well suited to tasks such as speech recognition and machine translation.
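A minimal sketch of the encoder-decoder idea, using a toy recurrent update with randomly initialized (not learned) parameters; all names and sizes here are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
H, D = 8, 3  # hidden size and feature size, chosen arbitrarily

# Randomly initialized parameters; in practice all are learned jointly,
# end to end, from (input sequence, output sequence) pairs.
W_in, W_h = rng.normal(size=(D, H)), rng.normal(size=(H, H))
W_dec, W_out = rng.normal(size=(H, H)), rng.normal(size=(H, D))

def encode(xs):
    # Fold a variable-length input sequence into one fixed-size state.
    h = np.zeros(H)
    for x in xs:
        h = np.tanh(x @ W_in + h @ W_h)
    return h

def decode(h, steps):
    # Unroll the decoder for any desired number of output steps.
    outputs = []
    for _ in range(steps):
        h = np.tanh(h @ W_dec)
        outputs.append(h @ W_out)
    return np.stack(outputs)

src = rng.normal(size=(5, D))   # input sequence of length 5
out = decode(encode(src), 7)    # output sequence of length 7
print(out.shape)                # (7, 3): lengths need not match
```

Note how the fixed-size encoder state decouples input length from output length; this is what lets one trained model transcribe or translate sequences of arbitrary length.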
Some important milestones in the development of end-to-end models for speech applications are:
- [[e2e speech recognition (2006)]]
- [[neural machine translation (2014)]]
- [[neural speech synthesis (2016)]]
- [[zero-shot voice cloning (2019)]]
- [[fine-tuning of large language models (2022)]]