MageBench: Bridging Large Multimodal Models to Agents
Main authors: , , , , , , , ,
Format: Article
Language: English
Subject terms:
Online access: Order full text
Summary: LMMs have shown impressive visual understanding capabilities and have the potential to be applied in agents, which demand strong reasoning and planning abilities. Nevertheless, existing benchmarks mostly assess their reasoning abilities on the language side, where the chain of thought is composed entirely of text. We consider the scenario where visual signals are continuously updated and required throughout the decision-making process. Such a vision-in-the-chain reasoning paradigm is better aligned with the needs of multimodal agents, yet it is rarely evaluated. In this paper, we introduce MageBench, a reasoning-capability-oriented multimodal agent benchmark that, while using lightweight environments, poses significant reasoning challenges and holds substantial practical value. The benchmark currently includes three types of environments: WebUI, Sokoban, and Football, comprising a total of 483 different scenarios. It thoroughly tests the agent's knowledge and engineering capabilities, visual intelligence, and interaction skills. The results show that only a few product-level models perform better than random acting, and all of them fall far short of human-level performance. More specifically, we find that current models severely lack the ability to revise their plans based on visual feedback, as well as visual imagination, interleaved image-text long-context handling, and other abilities. We hope that our work will provide optimization directions for LMMs from the perspective of acting as agents. We release our code and data at https://github.com/microsoft/MageBench.
DOI: 10.48550/arxiv.2412.04531
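The vision-in-the-chain setting described in the summary can be pictured as an observe-reason-act loop in which a fresh visual observation (a screenshot or rendered frame) is fed to the model at every step, so the reasoning trace is interleaved with images rather than being pure text. The sketch below is purely illustrative: the `env` and `agent` objects and the `plan_next_action` method are hypothetical placeholders, not MageBench's actual API, which is documented in the repository linked above.

```python
# Illustrative vision-in-the-chain agent loop.
# NOTE: `env` and `agent` are hypothetical stand-ins, not the MageBench API.
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Step:
    image: Any    # visual observation at this step (e.g., screenshot or frame)
    thought: str  # model's reasoning about the current observation
    action: str   # action chosen for the environment


def run_episode(env, agent, max_steps: int = 50) -> List[Step]:
    """Run one episode, interleaving new images with reasoning and actions."""
    history: List[Step] = []
    image = env.reset()  # initial visual observation
    for _ in range(max_steps):
        # The agent conditions on the full interleaved image-text history,
        # not only on a textual chain of thought.
        thought, action = agent.plan_next_action(image, history)
        history.append(Step(image=image, thought=thought, action=action))
        image, done = env.step(action)  # fresh visual feedback after acting
        if done:
            break
    return history
```

In this view, the abilities the summary highlights as weak (revising plans from visual feedback, handling long interleaved image-text histories) correspond to how well the agent uses the growing `history` together with each new `image`.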