Transformers are a type of deep learning architecture, built primarily upon the self-attention module, that was originally proposed for sequence-to-sequence tasks (e.g., translating a sentence from one language to another). Recent deep learning research has achieved impressive results by adapting this architecture to computer vision tasks, such as image classification. Transformers applied in this domain are typically referred to (not surprisingly) as vision transformers.
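To make that core operation concrete, here is a minimal single-head self-attention sketch in PyTorch. The function name, shapes, and omission of masking and multi-head structure are simplifications for illustration, not the API of any particular library:

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention: every token attends to every other token.

    x: (batch, seq_len, dim) token embeddings
    w_q, w_k, w_v: (dim, dim) learned projection matrices (illustrative)
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.transpose(-2, -1)) / (k.shape[-1] ** 0.5)  # scaled dot products
    weights = F.softmax(scores, dim=-1)  # attention weights over the sequence
    return weights @ v                   # weighted combination of value vectors
```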
Wait … how can a language translation model be used for image classification? Good question. Although this post will explore this topic in depth, the basic idea, sketched in code after the list below, is to:
1. Convert an image into a sequence of flattened image patches
2. Pass the image patch sequence through the transformer model
3. Take the first element of the transformer’s output sequence and pass it through a classification module
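Here is a minimal ViT-style model in PyTorch that walks through these three steps. This is a sketch under simplifying assumptions: tiny, illustrative dimensions, PyTorch's built-in nn.TransformerEncoder in place of a from-scratch transformer, and the class name and arguments are hypothetical rather than taken from any library:

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal vision transformer sketch (all dimensions are illustrative)."""
    def __init__(self, img_size=32, patch=8, dim=64, n_classes=10):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.patch = patch
        self.embed = nn.Linear(3 * patch * patch, dim)  # project flattened patches
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))  # learnable [class] token
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))  # positional embeddings
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, n_classes)  # classification module

    def forward(self, x):  # x: (batch, 3, H, W)
        B = x.shape[0]
        # 1. convert the image into a sequence of flattened patches
        p = x.unfold(2, self.patch, self.patch).unfold(3, self.patch, self.patch)
        p = p.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, 3 * self.patch * self.patch)
        tokens = self.embed(p)
        tokens = torch.cat([self.cls_token.expand(B, -1, -1), tokens], dim=1) + self.pos
        # 2. pass the patch sequence through the transformer
        out = self.encoder(tokens)
        # 3. classify from the first element of the output sequence
        return self.head(out[:, 0])

vit = TinyViT()
logits = vit(torch.randn(4, 3, 32, 32))  # -> shape (4, 10)
```

Note that the "first element" in step three is not an image patch: following the original ViT, a learnable [class] token is prepended to the patch sequence, and its final representation is what the classification module sees.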
Compared to widely-used convolutional neural network (CNN) models, vision transformers lack the useful inductive biases (e.g., translation equivariance and locality) that convolutions provide. Nonetheless, these models have been found to perform quite well relative to popular CNN variants on image classification tasks, and recent advances have made them far more efficient in terms of both the data and computation they require. As such, vision transformers are now a viable and useful tool for deep learning practitioners.