VL-GLUE: A Suite of Fundamental yet Challenging Visuo-Linguistic Reasoning Tasks
Format: Article
Language: English
Online access: Order full text
Abstract: Deriving inference from heterogeneous inputs (such as images, text, and audio) is an important skill for humans performing day-to-day tasks. A similar ability is desirable for the development of advanced Artificial Intelligence (AI) systems. While state-of-the-art models are rapidly closing the gap with human-level performance on diverse computer vision and NLP tasks separately, they struggle to solve tasks that require joint reasoning over visual and textual modalities. Inspired by GLUE (Wang et al., 2018), a multitask benchmark for natural language understanding, we propose VL-GLUE in this paper. VL-GLUE consists of over 100k samples spanning seven different tasks, which at their core require visuo-linguistic reasoning. Moreover, our benchmark comprises diverse image types (from synthetically rendered figures and day-to-day scenes to charts and complex diagrams) and includes a broad variety of domain-specific text (from cooking, politics, and sports to high-school curricula), demonstrating the need for multi-modal understanding in the real world. We show that this benchmark is quite challenging for existing large-scale vision-language models, and we encourage the development of systems that possess robust visuo-linguistic reasoning capabilities.
DOI: 10.48550/arxiv.2410.13666