A transformer model is a neural network that learns context in sequential data. It is particularly useful for [[natural language processing]] since the meaning of a sentence can be extracted from the relationships between the words.
The transformer model was introduced in a 2017 [paper](https://arxiv.org/abs/1706.03762) from Google and was first applied to machine translation. These models were labeled "foundation models" in a 2021 paper by Stanford researchers because the authors see them as a paradigm shift in AI.
The first step is to convert the sequence of words in the sentence into a sequence of [[token]]s. Then we compute an [[input embedding]] for each token. Each embedding vector can have hundreds or thousands of values.
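These two steps can be sketched in a few lines. The toy vocabulary, the embedding size, and the random initialization below are illustrative assumptions; a real model learns its embedding table during training:

```python
import numpy as np

# Toy vocabulary mapping words to token ids (an assumption for illustration).
vocab = {"the": 0, "cat": 1, "sat": 2}
d_model = 8  # real models use hundreds or thousands of dimensions

# One embedding vector per token in the vocabulary.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

sentence = ["the", "cat", "sat"]
token_ids = [vocab[w] for w in sentence]   # words -> token ids
embeddings = embedding_table[token_ids]    # token ids -> embedding vectors

print(embeddings.shape)  # (3, 8): one vector per token
```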
To keep track of the word order, transformers use a technique called [[positional encoding]]. This associates each word with a vector of values that depend on the position of the word in the sentence.
The [[positional encoding]] vector is added to the [[input embedding]] vector to produce a vector that characterizes the word and its location in the sentence.
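A common choice, used in the original transformer paper, is a sinusoidal positional encoding; the sketch below computes it and adds it to the embeddings. The sequence length and embedding size are illustrative:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: sine on even dims, cosine on odd dims."""
    pos = np.arange(seq_len)[:, None]   # token positions 0..seq_len-1
    i = np.arange(d_model)[None, :]     # embedding dimension indices
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions
    return pe

# Illustrative shapes: 3 tokens, 8-dimensional embeddings.
seq_len, d_model = 3, 8
embeddings = np.random.default_rng(0).normal(size=(seq_len, d_model))

# Each row now characterizes both the word and its position.
x = embeddings + positional_encoding(seq_len, d_model)
```

Because the encoding is simply added, the result keeps the same shape as the embeddings.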
To associate words that are somehow related, transformers use a mechanism called [[attention]].
When the association is between words within the same input sequence, this mechanism is called [[self-attention]].
To compute the similarity between the words of the input sequence, a set of [[query]] vectors is computed by applying a learned linear transformation to the sum of the word embedding and positional encoding vectors. A set of [[key]] vectors is computed in the same way, but with different weights.
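A minimal self-attention sketch of these steps, using scaled dot-product similarity between queries and keys. The weight matrices are learned in a real model; here they are random placeholders, and the shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 3, 8
x = rng.normal(size=(seq_len, d_model))  # embedding + positional encoding sums

# Learned in a real model; random placeholders here.
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q = x @ W_q  # queries: one per input token
K = x @ W_k  # keys: same computation, different weights
V = x @ W_v  # values

# Similarity of every query with every key, scaled by sqrt(d_model).
scores = Q @ K.T / np.sqrt(d_model)

# Softmax turns each row of scores into attention weights that sum to 1.
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
output = weights @ V  # each token's output mixes the values of related tokens
```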
[[large language model]] < [[Hands-on LLMs]]/[[2 LLMs and Transformers]] > [[token]]