Towards Models that Can See and Read
Saved in:
Main authors: |
Format: Article
Language: English
Subjects: |
Online access: Order full text
Abstract: Visual Question Answering (VQA) and Image Captioning (CAP), which are among the most popular vision-language tasks, have analogous scene-text versions that require reasoning from the text in the image. Despite their obvious resemblance, the two are treated independently and, as we show, yield task-specific methods that can either see or read, but not both. In this work, we conduct an in-depth analysis of this phenomenon and propose UniTNT, a Unified Text-Non-Text approach, which grants existing multimodal architectures scene-text understanding capabilities. Specifically, we treat scene-text information as an additional modality, fusing it with any pretrained encoder-decoder-based architecture via designated modules. Thorough experiments reveal that UniTNT leads to the first single model that successfully handles both task types. Moreover, we show that scene-text understanding capabilities can boost vision-language models' performance on general VQA and CAP by up to 2.69% and 0.6 CIDEr, respectively.
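The abstract describes fusing scene-text information as an additional modality into a pretrained encoder-decoder model via designated modules. The snippet below is a minimal sketch of that general idea, not the authors' UniTNT implementation: a small, hypothetical cross-attention block (illustrative names and dimensions) that lets image features from a pretrained encoder attend to embedded OCR tokens before they reach the decoder.

```python
# Minimal sketch (assumptions, not the paper's code): scene-text/OCR tokens are
# treated as an extra modality and fused into the visual stream of a pretrained
# encoder-decoder via a small trainable cross-attention module.
import torch
import torch.nn as nn


class SceneTextFusion(nn.Module):
    """Hypothetical fusion block: image features attend to OCR-token features."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_feats: torch.Tensor, ocr_feats: torch.Tensor) -> torch.Tensor:
        # Queries come from the visual stream; keys/values from the scene-text stream.
        fused, _ = self.cross_attn(image_feats, ocr_feats, ocr_feats)
        return self.norm(image_feats + fused)  # residual keeps the original visual signal


if __name__ == "__main__":
    batch, n_patches, n_ocr, dim = 2, 196, 20, 768
    image_feats = torch.randn(batch, n_patches, dim)  # output of a pretrained image encoder
    ocr_feats = torch.randn(batch, n_ocr, dim)        # embedded OCR tokens (extra modality)
    fused = SceneTextFusion(dim)(image_feats, ocr_feats)
    print(fused.shape)  # torch.Size([2, 196, 768]); passed to the text decoder unchanged
```

Because the fused output keeps the shape of the original image features, such a block could in principle be inserted into an existing encoder-decoder pipeline without altering the rest of the architecture, which matches the abstract's claim of working with "any pretrained encoder-decoder-based architecture".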
DOI: 10.48550/arxiv.2301.07389