MIBench: Evaluating Multimodal Large Language Models over Multiple Images
Format: Article
Language: English
Online Access: Order full text
Abstract: Built on the power of LLMs, numerous multimodal large language models (MLLMs)
have recently achieved remarkable performance on various vision-language tasks.
However, most existing MLLMs and benchmarks primarily focus on single-image
input scenarios, leaving the performance of MLLMs when handling realistic
multiple images underexplored. Although a few benchmarks consider multiple
images, their evaluation dimensions and samples are very limited. In this
paper, we propose a new benchmark, MIBench, to comprehensively evaluate the
fine-grained abilities of MLLMs in multi-image scenarios. Specifically, MIBench
categorizes the multi-image abilities into three scenarios: multi-image
instruction (MII), multimodal knowledge-seeking (MKS) and multimodal in-context
learning (MIC), and constructs 13 tasks with a total of 13K annotated samples.
During data construction, for MII and MKS, we extract correct options from
manual annotations and create challenging distractors to obtain multiple-choice
questions. For MIC, to enable an in-depth evaluation, we set four sub-tasks and
transform the original datasets into in-context learning formats. We evaluate
several open-source and closed-source MLLMs on the proposed MIBench. The
results reveal that although current models excel in single-image tasks, they
exhibit significant shortcomings when faced with multi-image inputs, such as
limited fine-grained perception, multi-image reasoning and in-context learning
abilities. The annotated data of MIBench is available at
https://huggingface.co/datasets/StarBottle/MIBench.
DOI: 10.48550/arxiv.2407.15272
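As a rough illustration of how the multiple-choice annotations described in the abstract could be consumed, the minimal sketch below loads the released data from the Hugging Face Hub and scores predicted option letters against the ground-truth answers. Only the repository name comes from the URL above; the split name and the field name "answer" are assumptions for illustration, not documented here.

```python
# Minimal sketch: load the MIBench annotations and compute multiple-choice
# accuracy for a set of model predictions.
# Assumptions (not specified in the record above): the split name "test" and
# the field name "answer" are illustrative placeholders.
from datasets import load_dataset


def score_predictions(samples, predictions):
    """Compare predicted option letters (e.g. 'A') with ground-truth answers."""
    if not samples:
        return 0.0
    correct = sum(
        1
        for sample, pred in zip(samples, predictions)
        if pred.strip().upper() == sample["answer"].strip().upper()
    )
    return correct / len(samples)


if __name__ == "__main__":
    # Repository name taken from the dataset URL in the abstract.
    ds = load_dataset("StarBottle/MIBench", split="test")  # split name assumed

    # A real evaluation would run an MLLM over each sample's images and
    # question; here dummy predictions stand in to show the scoring interface.
    dummy_predictions = ["A"] * len(ds)
    print(f"Accuracy: {score_predictions(ds, dummy_predictions):.3f}")
```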