A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
Main authors:
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: There has been a recent explosion of computer vision models which perform many tasks and are composed of an image encoder (usually a ViT) and an autoregressive decoder (usually a Transformer). However, most of this work simply presents one system and its results, leaving many questions about the design decisions and trade-offs of such systems unanswered. In this work, we aim to provide such answers. We take a close look at autoregressive decoders for multi-task learning in multimodal computer vision, including classification, captioning, visual question answering, and optical character recognition. Through extensive systematic experiments, we study the effects of task and data mixture, training and regularization hyperparameters, conditioning type and specificity, modality combination, and more. Importantly, we compare these to well-tuned single-task baselines to highlight the cost incurred by multi-tasking. A key finding is that a small decoder learned on top of a frozen pretrained encoder works surprisingly well. We call this setup locked-image tuning with decoder (LiT-decoder). It can be seen as teaching a decoder to interact with a pretrained vision model via natural language.
DOI: 10.48550/arxiv.2303.17376
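The LiT-decoder setup described in the abstract lends itself to a compact sketch: a small autoregressive Transformer decoder cross-attends to patch tokens from a frozen, pretrained image encoder, and only the decoder-side parameters are trained. The following PyTorch code is a minimal illustration under stated assumptions, not the paper's implementation; the `DummyViT` stand-in, all layer sizes, and the vocabulary size are hypothetical choices for demonstration.

```python
# Minimal sketch of a LiT-decoder-style model: a small autoregressive
# decoder on top of a FROZEN pretrained image encoder. All dimensions
# and the stand-in encoder below are illustrative assumptions.
import torch
import torch.nn as nn


class LiTDecoder(nn.Module):
    def __init__(self, encoder: nn.Module, enc_dim: int = 768,
                 vocab_size: int = 32_000, d_model: int = 512,
                 n_layers: int = 4, n_heads: int = 8, max_len: int = 64):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad = False            # lock the image tower
        self.proj = nn.Linear(enc_dim, d_model)  # match decoder width
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images: torch.Tensor,
                tokens: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                  # encoder stays frozen
            memory = self.encoder(images)      # (B, num_patches, enc_dim)
        memory = self.proj(memory)
        T = tokens.size(1)
        x = self.embed(tokens) + self.pos[:, :T]
        causal = torch.triu(                   # standard causal mask
            torch.full((T, T), float('-inf'), device=tokens.device),
            diagonal=1)
        h = self.decoder(x, memory, tgt_mask=causal)
        return self.head(h)                    # next-token logits


# Usage with a stand-in encoder; a real setup would plug in a
# pretrained ViT producing patch tokens here.
class DummyViT(nn.Module):
    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return torch.randn(images.size(0), 196, 768, device=images.device)


model = LiTDecoder(DummyViT())
logits = model(torch.randn(2, 3, 224, 224),
               torch.randint(0, 32_000, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 32000])
```

In this sketch only the decoder, embeddings, projection, and output head receive gradients. Following the abstract's framing, task identity (classification vs. captioning vs. VQA vs. OCR) would be conveyed through the token sequence itself, e.g. a task-specific text prefix, so the decoder learns to interact with the pretrained vision model via natural language.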