VCR: Visual Caption Restoration
Format: Article
Language: English
Abstract: We introduce Visual Caption Restoration (VCR), a novel vision-language task that challenges models to accurately restore partially obscured texts using pixel-level hints within images. This task stems from the observation that text embedded in images is intrinsically different from common visual elements and natural language due to the need to align the modalities of vision, text, and text embedded in images. While numerous works have integrated text embedded in images into visual question-answering tasks, approaches to these tasks generally rely on optical character recognition or masked language modeling, thus reducing the task to mainly text-based processing. However, text-based processing becomes ineffective in VCR, as accurate text restoration depends on the combined information from the provided images, the context, and subtle cues from the tiny exposed areas of masked texts. We develop a pipeline to generate synthetic images for the VCR task using image-caption pairs, with adjustable caption visibility to control the task difficulty. With this pipeline, we construct a dataset for VCR called VCR-Wiki using images with captions from Wikipedia, comprising 2.11M English and 346K Chinese entities in both easy and hard split variants. Our results reveal that current vision-language models significantly lag behind human performance on the VCR task, and that merely fine-tuning the models on our dataset does not lead to notable improvements. We release VCR-Wiki and the data construction code to facilitate future research.
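The abstract describes a pipeline that renders captions into images and partially masks them, with a visibility parameter controlling difficulty. The following is a minimal sketch of how such a generator might look; `make_vcr_image`, its parameters, and the font path are illustrative assumptions, not the authors' released pipeline.

```python
# Hypothetical VCR-style synthetic-image generator (a sketch, not the
# authors' code). It renders a single-line caption beneath an image, then
# covers most of each masked word with a white box, leaving a thin strip
# of glyph pixels visible as the pixel-level hint. `visibility` controls
# how much of each masked word remains exposed, i.e., task difficulty.
from PIL import Image, ImageDraw, ImageFont

def make_vcr_image(image_path, caption, mask_every=2, visibility=0.3,
                   font_path="DejaVuSans.ttf", font_size=24):
    base = Image.open(image_path).convert("RGB")
    font = ImageFont.truetype(font_path, font_size)

    # Extend the canvas with a white strip below the image for the caption.
    strip_h = font_size * 2
    canvas = Image.new("RGB", (base.width, base.height + strip_h), "white")
    canvas.paste(base, (0, 0))
    draw = ImageDraw.Draw(canvas)

    x, y = 10, base.height + font_size // 2
    for i, word in enumerate(caption.split()):
        w = draw.textlength(word + " ", font=font)
        draw.text((x, y), word, fill="black", font=font)
        if i % mask_every == 1:  # mask every second word
            # Cover the top (1 - visibility) fraction of the word, so only
            # a sliver at the bottom of the glyphs survives as a hint.
            hidden_h = int(font_size * (1 - visibility))
            draw.rectangle([x, y, x + w, y + hidden_h], fill="white")
        x += w
    return canvas

# Example: a harder split would use a smaller `visibility`, e.g.
# make_vcr_image("photo.jpg", "A cat sits on a red sofa",
#                visibility=0.15).save("vcr_hard.png")
```

Masking only a fraction of each word's height (rather than the whole word, as in masked language modeling) is what forces a model to combine the image, the surrounding caption context, and the residual pixel cues, matching the task description above.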
DOI: 10.48550/arxiv.2406.06462