Transformers have been used extensively for Natural Language Processing tasks, but they are not yet as tried and tested a method in Computer Vision.

Transformers are built on an attention-based architecture, also known as a sequence-to-sequence architecture. To understand the sequence-to-sequence (Seq2Seq) architecture, consider an NLP task such as translation, where these models are widely used.

In a typical translation task in NLP, viewed at a macro level, we take a sequence of words from one language and transform it into a sequence of words in another language. For such tasks, the order in which the words appear matters; this kind of data is called sequence-dependent data. Because the order of words is important for understanding the sentences, we generally prefer Long Short-Term Memory (LSTM) based models, which work well with this kind of data.

Seq2Seq models consist of an Encoder and a Decoder. The encoder takes an input sequence and maps it into a higher-dimensional space (an n-dimensional vector). This abstract vector is fed into the Decoder, which turns it into an output sequence. The output sequence can be in another language, symbols, a copy of the input, or anything else.

The Encoder and Decoder can be imagined as human translators who each speak only two languages. Their first language is their mother tongue, which differs between the two of them, and the second language is one they have in common. For example, to translate German to Spanish, the encoder converts German into the only other language the decoder understands besides Spanish, say English. The decoder, which understands English, then converts it into Spanish. Acting together, the Encoder and Decoder form a model that translates German to Spanish.

In the case of the encoder and decoder, we train them to become fluent in the common language (i.e., English in the above example) so that they can translate into and from their native languages.

A very basic choice for the Encoder and Decoder networks in a Seq2Seq model is a single LSTM for each of them.
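As an illustration, here is a minimal sketch of such a model, assuming PyTorch and made-up vocabulary and hidden sizes; the encoder compresses the source sequence into its final hidden state, which initialises the decoder.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hid_dim=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the source sentence into a fixed-size state (h, c).
        _, state = self.encoder(self.src_emb(src_ids))
        # Decode the target sentence conditioned on that state.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(dec_out)  # per-token logits over the target vocabulary
```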

Before diving into Transformers, there is another key concept we need to understand: Attention.

The attention mechanism looks at an input sequence and decides at each step which other parts of the sequence are important. This might sound trivial, but even when we read a sentence, we tend to pick out the words that are important for conveying its message.

This is similar to how the attention mechanism works on a sequence. In the human encoder and decoder example, adding the attention mechanism would mean that when converting from German to English, the translator at the encoder also writes down keywords that are important to the semantics of the sentence and passes them to the translator at the decoder’s end, in addition to the regular translation. These keywords make the decoder’s job easier, as it can identify which parts of the sentence are important and which keywords are crucial to convey the message.

Transformers

Now we can talk about Transformers. In the paper “Attention is all you need”, the authors introduce the architecture of the Transformer.

Most competitive neural sequence transduction models have an encoder-decoder structure, wherein the encoder maps an input sequence of symbol representations (x_1, \dots, x_n) to a sequence of continuous representations z = (z_1, \dots, z_n) . Given z, the decoder then generates an output sequence y = (y_1, \dots, y_m) of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.
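The auto-regressive generation step can be sketched as a simple loop; `model.decode`, `bos_token` and `eos_token` are hypothetical names standing in for whatever decoding interface a concrete model exposes.

```python
def greedy_decode(model, z, bos_token, eos_token, max_len=50):
    # Start from a beginning-of-sequence symbol and feed the model its own
    # previous outputs, one symbol at a time.
    y = [bos_token]
    for _ in range(max_len):
        next_token = model.decode(z, y).argmax()  # most likely next symbol
        y.append(next_token)
        if next_token == eos_token:
            break
    return y
```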

Transformers follow this architecture using stacked self-attention and point-wise, fully connected layers for both encoder and decoder.

The Transformer is an architecture for transforming one sequence into another with the help of two parts: an Encoder and a Decoder. It differs from the previously described sequence-to-sequence models in that it does not rely on any recurrent networks such as GRUs, LSTMs, etc.

The Architecture in depth

Source: Attention is all you need, Vaswani et al

The left part in the above architecture is the encoder and the one on the right is the decoder.

Let us consider the architecture as described by the authors of “Attention is all you need”.

Encoder

The encoder is a stack of N = 6 identical layers, each layer consisting of two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. A residual connection is used around each of the two sub-layers, followed by layer normalization; that is, the output of each sub-layer is LayerNorm(x + Sublayer(x)) , where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate the use of residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_{model} = 512 .
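A minimal sketch of one such encoder layer, assuming PyTorch and the paper's default sizes (d_model = 512, 8 heads, feed-forward width 2048), might look as follows; it illustrates the sub-layer structure, not the authors' implementation.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Sub-layer 1: multi-head self-attention, wrapped in residual + LayerNorm.
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Sub-layer 2: position-wise feed-forward, wrapped in residual + LayerNorm.
        return self.norm2(x + self.ff(x))
```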

Decoder

The decoder is also composed of N = 6 identical layers. In addition to the two sub-layers found in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. As in the encoder, the authors employ residual connections around each of the sub-layers, followed by layer normalization. The self-attention sub-layer in the decoder stack is also modified to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
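The masking itself can be sketched as a small matrix added to the attention scores before the softmax, assuming PyTorch; entries of -inf make future positions receive (effectively) zero attention weight.

```python
import torch

def causal_mask(seq_len):
    # Future positions (strictly above the diagonal) get -inf, the rest 0.
    mask = torch.full((seq_len, seq_len), float("-inf"))
    return torch.triu(mask, diagonal=1)

print(causal_mask(4))
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])
```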

Attention

As described above, an attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. The authors call their particular attention “Scaled Dot-Product Attention”.

 

Scaled Dot-Product Attention

Source: Attention is all you need, Vaswani et al

In Scaled Dot-Product Attention, the input consists of queries and keys of dimension d_k and values of dimension d_v . We compute the dot products of the query with all keys, divide each by \sqrt{d_k} , and apply a softmax function to obtain the weights on the values.

In practice, the attention function is computed on a set of queries simultaneously, packed together into a matrix Q . The keys and values are also packed together into matrices K and V . The matrix of outputs is computed as:

Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V
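As a sketch, this formula translates almost directly into NumPy, with illustrative shapes (a handful of queries and keys of dimension d_k = 64, values of dimension d_v = 32):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # query-key compatibilities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # weighted sum of the values

Q = np.random.randn(5, 64)   # 5 queries of dimension d_k = 64
K = np.random.randn(5, 64)
V = np.random.randn(5, 32)   # values of dimension d_v = 32
out = scaled_dot_product_attention(Q, K, V)            # shape (5, 32)
```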

The two most commonly used attention functions are additive attention and multiplicative (dot-product) attention.

The only difference between dot-product attention and the Scaled Dot-Product Attention used in this model is the scaling factor of \frac{1}{\sqrt{d_k}} . The advantage of dot-product attention over additive attention is that it is much faster and more space-efficient in practice, as it can be implemented using highly optimized matrix multiplication code. Additive attention, on the other hand, computes the compatibility function using a feed-forward network with a single hidden layer.
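For contrast, here is a small sketch of that additive compatibility function, where W1, W2 and v stand in for the learned parameters of the single-hidden-layer network (names and sizes are illustrative):

```python
import numpy as np

def additive_score(q, k, W1, W2, v):
    # score(q, k) = v^T tanh(W1 q + W2 k): a feed-forward net with one hidden layer
    return v @ np.tanh(W1 @ q + W2 @ k)

d_k, d_hidden = 64, 32
q, k = np.random.randn(d_k), np.random.randn(d_k)
W1, W2 = np.random.randn(d_hidden, d_k), np.random.randn(d_hidden, d_k)
v = np.random.randn(d_hidden)
print(additive_score(q, k, W1, W2, v))  # a single scalar compatibility score
```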

Multi-Head Attention 

Source: Attention is all you need, Vaswani et al.

The authors of the paper found that it was better to linearly project the queries, keys and values h times with different, learned linear projections to d_k, d_k and d_v dimensions respectively, instead of performing a single attention function with d_{model}-dimensional keys, values and queries.

On each of these projected versions of the queries, keys and values, the attention function is performed in parallel, yielding d_v -dimensional output values. These are concatenated and once again projected, resulting in the final values.

Using multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.

MultiHead(Q, K, V) = Concat(head_1, \dots, head_h)W^O

where head_i = Attention(QW^Q_i, KW^K_i, VW^V_i)
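Putting the two formulas together, a minimal NumPy sketch of multi-head attention could look like this, assuming d_model = 512 and h = 8 heads (so d_k = d_v = 64); the random matrices stand in for the learned projections W^Q_i, W^K_i, W^V_i and W^O:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(Q, K, V, h=8):
    d_model = Q.shape[-1]
    d_k = d_model // h
    rng = np.random.default_rng(0)
    heads = []
    for _ in range(h):
        W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))
        q, k, v = Q @ W_q, K @ W_k, V @ W_v           # project to d_k dimensions
        weights = softmax(q @ k.T / np.sqrt(d_k))      # scaled dot-product attention
        heads.append(weights @ v)                      # d_v-dimensional head output
    W_o = rng.standard_normal((h * d_k, d_model))
    return np.concatenate(heads, axis=-1) @ W_o        # concat heads, final projection

x = np.random.randn(10, 512)                           # 10 positions, d_model = 512
out = multi_head_attention(x, x, x)                    # self-attention: shape (10, 512)
```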