Cantor: Inspiring Multimodal Chain-of-Thought of MLLM
Format: Article
Language: English
Online access: Order full text
Abstract: With the advent of large language models (LLMs) enhanced by the chain-of-thought (CoT) methodology, visual reasoning problems are usually decomposed into manageable sub-tasks and tackled sequentially with various external tools. However, such a paradigm faces the challenge of potential "determining hallucinations" in decision-making due to insufficient visual information, and the limitation of low-level perception tools that fail to provide the abstract summaries necessary for comprehensive reasoning. We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks. This paper delves into the realm of multimodal CoT to solve intricate visual reasoning tasks with multimodal large language models (MLLMs) and their cognitive capability. To this end, we propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture. Cantor first acts as a decision generator and integrates visual inputs to analyze the image and problem, ensuring a closer alignment with the actual context. Furthermore, Cantor leverages the advanced cognitive functions of MLLMs to perform as multifaceted experts for deriving higher-level information, enhancing the CoT generation process. Our extensive experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance across two complex visual reasoning datasets, without necessitating fine-tuning or ground-truth rationales. Project Page: https://ggg0919.github.io/cantor/
DOI: 10.48550/arxiv.2404.16033
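The abstract describes a two-stage perception-decision pipeline in which a single MLLM first serves as a decision generator grounded in the image and then plays multiple expert roles before synthesizing a final chain of thought. Below is a minimal, hypothetical sketch of that flow; the `mllm` callable, the expert role names, and all prompt wording are illustrative assumptions rather than the paper's actual prompts or implementation.

```python
# Hypothetical sketch of a Cantor-style perception-decision pipeline.
# `mllm` stands in for any multimodal LLM call (image + prompt -> text);
# expert roles and prompt wording here are assumptions, not the paper's code.
from typing import Callable, Dict

MLLMCall = Callable[[bytes, str], str]  # (image_bytes, prompt) -> model response


def decision_stage(mllm: MLLMCall, image: bytes, question: str) -> str:
    """Stage 1: the MLLM acts as a decision generator, grounding the plan in the image."""
    prompt = (
        "Analyze the image and the question, decompose the problem into sub-tasks, "
        "and assign each sub-task to an expert.\n"
        f"Question: {question}"
    )
    return mllm(image, prompt)


def execution_stage(mllm: MLLMCall, image: bytes, decision: str) -> Dict[str, str]:
    """Stage 2: the same MLLM plays each expert role to supply higher-level visual facts."""
    experts = ["TextIntel", "ObjectQuant", "VisionIQ", "ChartSense"]  # illustrative roles
    return {
        role: mllm(
            image,
            f"You are {role}. Using this plan:\n{decision}\n"
            "Report the information your role is responsible for.",
        )
        for role in experts
    }


def answer(mllm: MLLMCall, image: bytes, question: str) -> str:
    """Synthesize the experts' outputs into a final chain-of-thought answer."""
    decision = decision_stage(mllm, image, question)
    findings = execution_stage(mllm, image, decision)
    summary = "\n".join(f"{role}: {report}" for role, report in findings.items())
    prompt = (
        f"Question: {question}\nExpert findings:\n{summary}\n"
        "Reason step by step and give the final answer."
    )
    return mllm(image, prompt)


if __name__ == "__main__":
    # Stub MLLM so the sketch runs without any external model or service.
    def fake_mllm(image: bytes, prompt: str) -> str:
        return f"[stubbed response to: {prompt[:40]}...]"

    print(answer(fake_mllm, b"", "How many objects in the image are metallic?"))
```

The key property the sketch tries to capture is that no fine-tuning or external perception tools are involved: both the planning and the "expert" perception are prompts to the same multimodal model.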