Transformers are a type of deep learning architecture, built primarily upon the self-attention module, that was originally proposed for sequence-to-sequence tasks (e.g., translating a sentence from one language to another). Recent deep learning research has achieved impressive results by adapting this architecture to computer vision tasks, such as image classification. Transformers applied in this domain are typically referred to (not surprisingly) as vision transformers.
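To make that core operation concrete, here is a minimal single-head self-attention sketch in PyTorch. The function name, shapes, and omission of masking and multi-head structure are simplifications for illustration, not the API of any particular library:

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention: every token attends to every other token.

    x: (batch, seq_len, dim) token embeddings
    w_q, w_k, w_v: (dim, dim) learned projection matrices (illustrative)
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.transpose(-2, -1)) / (k.shape[-1] ** 0.5)  # scaled dot products
    weights = F.softmax(scores, dim=-1)  # attention weights over the sequence
    return weights @ v                   # weighted combination of value vectors
```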
Wait … how can a language translation model be used for image classification? Good question. Although this post will explore this topic in depth, the basic idea, sketched in code after the list below, is to:
1. Convert an image into a sequence of flattened image patches
2. Pass the image patch sequence through the transformer model
3. Take the first element of the transformer’s output sequence and pass it through a classification module
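Here is a minimal ViT-style model in PyTorch that walks through these three steps. This is a sketch under simplifying assumptions: tiny, illustrative dimensions, PyTorch's built-in nn.TransformerEncoder in place of a from-scratch transformer, and the class name and arguments are hypothetical rather than taken from any library:

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal vision transformer sketch (all dimensions are illustrative)."""
    def __init__(self, img_size=32, patch=8, dim=64, n_classes=10):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.patch = patch
        self.embed = nn.Linear(3 * patch * patch, dim)  # project flattened patches
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))  # learnable [class] token
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))  # positional embeddings
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, n_classes)  # classification module

    def forward(self, x):  # x: (batch, 3, H, W)
        B = x.shape[0]
        # 1. convert the image into a sequence of flattened patches
        p = x.unfold(2, self.patch, self.patch).unfold(3, self.patch, self.patch)
        p = p.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, 3 * self.patch * self.patch)
        tokens = self.embed(p)
        tokens = torch.cat([self.cls_token.expand(B, -1, -1), tokens], dim=1) + self.pos
        # 2. pass the patch sequence through the transformer
        out = self.encoder(tokens)
        # 3. classify from the first element of the output sequence
        return self.head(out[:, 0])

vit = TinyViT()
logits = vit(torch.randn(4, 3, 32, 32))  # -> shape (4, 10)
```

Note that the "first element" in step three is not an image patch: following the original ViT, a learnable [class] token is prepended to the patch sequence, and its final representation is what the classification module sees.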
Compared to widely-used convolutional neural network (CNN) models, vision transformers lack the useful inductive biases (e.g., translation equivariance and locality) that convolutions provide. Nonetheless, these models have been found to perform quite well relative to popular CNN variants on image classification tasks, and recent advances have made them far more efficient in terms of both the data and computation they require. As such, vision transformers are now a viable and useful tool for deep learning practitioners.