The Transformer has emerged as a highly effective neural network architecture for natural language processing tasks. Introduced in the seminal paper “Attention Is All You Need,” it has since become the state-of-the-art model for a range of NLP applications, and it has also been adapted for computer vision tasks, demonstrating its versatility across domains. In this post, we will dive into the details of the Transformer model, including its underlying principles and its applications in both language processing and computer vision.
Recap
The Transformer Model
The Transformer is a neural network architecture that I covered in detail in my last article, but let’s have a quick recap.
Transformers rely solely on self-attention to capture dependencies between different tokens in an input sequence. Unlike traditional recurrent or convolutional neural networks, the Transformer can process input sequences in parallel, making it highly efficient for processing long sequences. The self-attention mechanism allows the Transformer to selectively attend to different parts of the input sequence, creating a context for each token based on the entire input sequence.
The Transformer Architecture
The Transformer consists of two main components: the encoder and the decoder. The encoder takes an input sequence and produces a hidden representation for each token in the sequence, while the decoder takes the hidden representations from the encoder and generates a new sequence based on them. Both the encoder and decoder consist of multiple layers of the same basic building block, called the Transformer block, which contains self-attention and feedforward layers.
Transformer Block:
The Transformer block is the basic building block of the Transformer model, consisting of two sub-layers: the self-attention layer and the feedforward layer. The self-attention layer takes in a sequence of vectors and produces a sequence of the same length, where each vector is a weighted sum of all the input vectors, with the weights determined by their similarity to the current vector. The feedforward layer consists of two linear transformations with a ReLU activation function in between.
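As a rough sketch (not the exact implementation from the paper), a single Transformer block might look like the following in PyTorch. The dimensions, dropout rate, and the post-norm residual arrangement are illustrative assumptions chosen to mirror the original architecture:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One Transformer block: multi-head self-attention + position-wise feedforward,
    each wrapped in a residual connection and layer normalization."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),                 # ReLU between the two linear layers, as described above
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        # Self-attention: every token attends to every other token in the sequence.
        attn_out, _ = self.attn(x, x, x, attn_mask=attn_mask)
        x = self.norm1(x + self.drop(attn_out))
        # Position-wise feedforward applied independently to each token.
        x = self.norm2(x + self.drop(self.ff(x)))
        return x

# Example: a batch of 2 sequences, each with 10 tokens of dimension 512.
block = TransformerBlock()
out = block(torch.randn(2, 10, 512))
print(out.shape)  # torch.Size([2, 10, 512])
```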
Encoder:
The encoder takes an input sequence and passes it through a stack of N identical Transformer blocks, producing a sequence of hidden representations. The decoder generates the output sequence by passing it through a similar stack of N blocks, where each block uses a masked self-attention layer (so the decoder cannot look ahead at future tokens) together with a cross-attention layer that attends to the encoder’s hidden representations.
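The “no looking ahead” constraint is implemented with a causal attention mask. A minimal sketch of how such a mask can be built and passed to the block sketched above (the mask convention follows PyTorch’s boolean `attn_mask`, where True marks positions a token may not attend to):

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """Boolean mask where True marks future positions the decoder must not attend to.
    Shape: (seq_len, seq_len)."""
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

mask = causal_mask(5)
print(mask)
# Row i has True everywhere after column i, so token i can only attend to tokens 0..i.
# Passed as `attn_mask` to the attention layer, the True entries are masked out.
```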
Image worth 16x16 words?
Have you ever tried describing an image to someone? It can be challenging to capture the details, the colors, and the overall composition in words. This is why images are such a powerful medium for communication. However, understanding and processing images is still a difficult task for machines. Traditional approaches to image analysis relied heavily on hand-crafted features and domain-specific knowledge. But with the advent of deep learning, image analysis has seen a revolution.
One of the latest breakthroughs in this field is the paper “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale” by Dosovitskiy et al. In this paper, the authors propose a new way of processing images that is inspired by the success of transformer models in natural language processing.
In this blog post, we will dive into the details of the paper and explore how it works.
Introduction
The authors of the paper start by pointing out that the standard approach to image analysis involves converting the image into a fixed-length feature vector using convolutional neural networks (CNNs) and then using this feature vector for downstream tasks such as classification or object detection. However, this approach has several limitations. Firstly, a fixed-length feature vector discards much of the spatial relationship between different parts of the image. Secondly, it is difficult to vary the level of detail the feature vector captures. Finally, this approach bakes in significant domain-specific knowledge and is not easily transferable to new domains.
The authors propose a new approach to image analysis that is based on the transformer model, which has been successful in natural language processing tasks. In the transformer model, self-attention is used to capture the relationships between different words in a sentence. The authors propose to adapt this approach to images by dividing the image into a grid of patches and applying self-attention to the patches. In other words, the image is treated as a sequence of patches, and self-attention is used to capture the relationships between the patches.

Method
The authors propose a new architecture called the Vision Transformer (ViT), which consists of two main components: the patch embedding and the transformer encoder.
Patch Embedding
The first step in the ViT architecture is to divide the input image into a grid of non-overlapping patches. Each patch is then embedded into a fixed-size vector using a learnable linear projection. The resulting patch embeddings are arranged in order to form the input sequence (in the full model, a classification token is prepended and positional embeddings are added at this stage).
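A common way to implement this patch embedding is a single strided convolution whose kernel size and stride both equal the patch size, which is equivalent to flattening each patch and applying a shared linear projection. The sketch below assumes 224×224 RGB inputs, 16×16 patches, and a 768-dimensional embedding (the ViT-Base configuration); these are illustrative choices:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Splits an image into non-overlapping patches and linearly projects each patch."""

    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # kernel_size == stride == patch_size: one projection per non-overlapping patch.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (batch, 3, 224, 224) -> (batch, embed_dim, 14, 14)
        x = self.proj(x)
        # Flatten the 14x14 grid into a sequence: (batch, num_patches, embed_dim)
        return x.flatten(2).transpose(1, 2)

patches = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(patches.shape)  # torch.Size([1, 196, 768])
```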
Transformer Encoder
The second step in the ViT architecture is to apply a transformer encoder to the input sequence of patch embeddings. The transformer encoder consists of a stack of transformer blocks, where each block contains a multi-head self-attention mechanism and a feedforward neural network. The output of the transformer encoder is a sequence of feature vectors, where each vector corresponds to a patch in the input image.
The transformer encoder is trained in a supervised manner using a standard cross-entropy loss for classification tasks. During training, the model learns to predict the correct class label for a given input image.
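As a sketch of what that supervised training loop looks like (the `model`, `dataloader`, and optimizer here are placeholders rather than the paper’s exact setup):

```python
import torch
import torch.nn as nn

def train_one_epoch(model, dataloader, optimizer, device="cuda"):
    """One epoch of standard supervised training with cross-entropy loss."""
    criterion = nn.CrossEntropyLoss()
    model.train()
    for images, labels in dataloader:
        images, labels = images.to(device), labels.to(device)
        logits = model(images)            # (batch, num_classes)
        loss = criterion(logits, labels)  # cross-entropy against the true class labels
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```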
Encoder in ViT

Let’s talk a bit more about the encoder for Vision Transformers.
The encoder module in ViT is responsible for extracting features from the input image. It consists of a series of blocks, each of which contains a multi-head self-attention layer followed by a feed-forward layer. The self-attention layer is used to compute the attention weights between all pairs of image patches, allowing the model to identify important features in the image. The feed-forward layer then processes the output of the self-attention layer to produce a new set of features.
Each sub-layer in an encoder block is wrapped in layer normalization (applied before the sub-layer in ViT) and a residual connection, which help to mitigate the vanishing gradient problem during training. The residual connections allow gradients to flow back more easily through the network, which can speed up training and improve the quality of the learned features.
Within each block, the self-attention layer learns to attend to different parts of the input image based on the importance of each part, while the feedforward neural network layer applies non-linear transformations to the self-attention outputs. Let’s take a closer look at these two layers:
Self-Attention Layer:
The self-attention layer in ViT is very similar to the self-attention layer in the original Transformer model. It takes in a set of input embeddings (in the form of patches), and produces a set of output embeddings, where each output embedding attends to different parts of the input based on their importance.
The self-attention operation involves computing three linear projections of the input embeddings, followed by a dot product operation between the first two projections (query and key) to obtain a similarity score matrix. This similarity score matrix is then normalized using a softmax function, and is used to compute a weighted sum of the third projection (value) to obtain the final output embeddings. Mathematically, this can be represented as follows:
Self-Attention(X) = softmax\left(\frac{(XW_q)(XW_k)^T}{\sqrt{d_k}}\right)(XW_v)

Where X is the input embedding matrix; W_q, W_k, and W_v are the learned parameter matrices for the query, key, and value projections respectively (so Q = XW_q, K = XW_k, and V = XW_v); and d_k is the dimensionality of the key vectors.
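Written out directly from the formula above, a from-scratch single-head version of this operation might look as follows; the embedding and key dimensions are illustrative choices, not the paper’s exact values:

```python
import math
import torch
import torch.nn as nn

class SingleHeadSelfAttention(nn.Module):
    """Single-head self-attention: softmax(QK^T / sqrt(d_k)) V,
    with Q = X W_q, K = X W_k, V = X W_v."""

    def __init__(self, d_model=768, d_k=64):
        super().__init__()
        self.d_k = d_k
        self.W_q = nn.Linear(d_model, d_k, bias=False)
        self.W_k = nn.Linear(d_model, d_k, bias=False)
        self.W_v = nn.Linear(d_model, d_k, bias=False)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        Q, K, V = self.W_q(x), self.W_k(x), self.W_v(x)
        # Similarity scores between every pair of tokens, scaled by sqrt(d_k).
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)  # (batch, seq_len, seq_len)
        weights = scores.softmax(dim=-1)                        # each row sums to 1
        return weights @ V                                      # weighted sum of the values

attn = SingleHeadSelfAttention()
out = attn(torch.randn(2, 196, 768))
print(out.shape)  # torch.Size([2, 196, 64])
```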
Feedforward Neural Network Layer:
The feedforward neural network layer in ViT is a two-layer MLP that applies non-linear transformations to the self-attention output embeddings. Specifically, it applies a linear transformation followed by a GELU activation function, followed by another linear transformation. This can be represented mathematically as follows:
MLP(X) = Layer_2(GELU(Layer_1(X)))

where X is the input embedding matrix, Layer_1 and Layer_2 are linear transformations, and GELU is the GELU activation function.
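In code this is just two linear layers with a GELU in between; the hidden width of 4× the embedding dimension and the dropout are assumptions that follow common ViT configurations:

```python
import torch.nn as nn

def make_mlp(embed_dim=768, hidden_dim=3072, dropout=0.1):
    """Position-wise MLP used inside each ViT block: Linear -> GELU -> Linear."""
    return nn.Sequential(
        nn.Linear(embed_dim, hidden_dim),  # Layer_1
        nn.GELU(),
        nn.Dropout(dropout),
        nn.Linear(hidden_dim, embed_dim),  # Layer_2
        nn.Dropout(dropout),
    )
```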
Together, the self-attention layer and feedforward neural network layer form a transformer block in the ViT encoder. By stacking multiple transformer blocks on top of each other, the ViT encoder is able to learn hierarchical representations of the input image, which can be used for downstream tasks such as image classification.
Decoder (Classification Head):
After the input image has been processed by the encoder module, the resulting features are passed to the classification head. For image classification, ViT’s “decoder” is deliberately lightweight: a learnable classification token ([class]) is prepended to the sequence of patch embeddings, and the encoder output corresponding to that token is fed to an MLP head, followed by a softmax that produces the final class probabilities. There is no separate stack of transformer decoder blocks for this task.

Spatial information enters the model through positional embeddings, which are added to the patch embeddings before they reach the encoder. Because self-attention is permutation-invariant, the model would otherwise have no notion of where each patch sits in the image. ViT uses learnable 1D positional embeddings, one vector per position (including the classification token), each with the same dimension as the patch embeddings; the authors found these to work about as well as more elaborate 2D-aware variants.

With positional information injected at the input, every self-attention layer computes its queries, keys, and values from position-augmented embeddings, so the model can attend to different parts of the image based on both their content and their location. After self-attention, the output is fed through the feed-forward network, with layer normalization, dropout, and residual connections around each sub-layer, before being passed to the next block.

Overall, the combination of learnable positional embeddings and a simple classification head allows ViT to preserve the spatial structure of the input image while still leveraging the benefits of the transformer architecture. This has enabled effective image processing and state-of-the-art performance in many computer vision tasks.
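Putting the pieces together, here is a minimal end-to-end sketch of ViT for classification. The hyperparameters mirror the ViT-Base configuration but are illustrative, and the real model includes additional details (dropout, the exact head used during pre-training vs. fine-tuning) that are omitted here:

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Minimal ViT sketch: patch embedding + [class] token + learnable positional
    embeddings + pre-norm transformer encoder + linear head on the class token."""

    def __init__(self, img_size=224, patch_size=16, embed_dim=768,
                 depth=12, n_heads=12, num_classes=1000):
        super().__init__()
        # Patch embedding: a strided conv splits the image into patches and projects them.
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        num_tokens = (img_size // patch_size) ** 2 + 1  # patches + [class] token
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=n_heads, dim_feedforward=4 * embed_dim,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)    # (B, 196, 768)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)        # (B, 1, 768)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed  # add positional info
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])                             # classify from the [class] token

logits = MiniViT()(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000])
```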
Results
The authors evaluate the ViT model on several standard image classification benchmarks, including ImageNet, CIFAR-100, and VTAB. They compare the performance of the ViT model to state-of-the-art convolutional models based on ResNet (BiT) and EfficientNet (Noisy Student). The results show that, when pre-trained on sufficiently large datasets, ViT matches or outperforms these models across the benchmarks while requiring substantially fewer computational resources to pre-train.
The authors also analyze the role of self-attention in the ViT model by inspecting the learned attention patterns. They find that some attention heads attend across most of the image even in the lowest layers, suggesting that the ability to integrate information globally is important for the model’s performance.
Conclusion
In conclusion, the Vision Transformer (ViT) model has revolutionized the field of computer vision by effectively using transformers, a popular architecture in natural language processing, to analyze images. The key innovation in ViT is the use of patches as inputs, which allows for efficient processing of large images.
ViT’s encoder, built from self-attention blocks, enables the model to effectively capture spatial relationships and dependencies between patches. Learnable positional embeddings supply the spatial information, and a lightweight classification head turns the encoder output into the final prediction.
The performance of ViT has been demonstrated to be superior to traditional convolutional neural networks on various image classification tasks. However, there is still much to be explored in terms of the potential applications and improvements to the ViT model.
Overall, the success of ViT has opened up new avenues for research and development in computer vision, and has shown the power of using transformer-based models beyond natural language processing. As the field of computer vision continues to evolve, it will be exciting to see how ViT and other transformer-based models shape the future of image analysis.
Summary
- Conventional convolutional neural networks (CNNs) for image recognition are computationally expensive and require large amounts of training data.
- The authors propose a new approach to image recognition using transformers, which have been successful in natural language processing.
- The approach, called Vision Transformer (ViT), runs a transformer encoder over image patches and classifies the image with a lightweight MLP head.
- The image is divided into non-overlapping patches and flattened into a sequence, which is then fed into the encoder.
- The encoder consists of multiple transformer blocks, each consisting of multi-head self-attention and feed-forward layers.
- A learnable classification token is prepended to the input sequence; the encoder output at this token is passed to an MLP head that predicts the class probabilities.
- Positional embeddings are added to the input sequence to provide spatial information to the model.
- ViT achieves state-of-the-art performance on several image recognition benchmarks, including ImageNet, CIFAR-100, and VTAB.
- A follow-up work by Touvron et al., DeiT (Data-efficient Image Transformers), achieves strong performance without large-scale pre-training by distilling knowledge from a teacher network.
- ViT is computationally efficient to pre-train relative to CNNs of comparable accuracy, but it relies on large-scale pre-training data; trained on smaller datasets alone, it tends to underperform CNNs, whose built-in inductive biases help in that regime.
- The success of ViT suggests that transformers can be a powerful tool for image recognition tasks.