MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
Format: Article
Language: English
Online access: Request full text
Abstract: We propose MM-REACT, a system paradigm that integrates ChatGPT with a pool of
vision experts to achieve multimodal reasoning and action. In this paper, we
define and explore a comprehensive list of advanced vision tasks that are
intriguing to solve, but may exceed the capabilities of existing vision and
vision-language models. To achieve such advanced visual intelligence, MM-REACT
introduces a textual prompt design that can represent text descriptions,
textualized spatial coordinates, and aligned file names for dense visual
signals such as images and videos. MM-REACT's prompt design allows language
models to accept, associate, and process multimodal information, thereby
facilitating the synergetic combination of ChatGPT and various vision experts.
Zero-shot experiments demonstrate MM-REACT's effectiveness in addressing the
specified capabilities of interest and its wide applicability to different
scenarios that require advanced visual understanding. Furthermore, we discuss
and compare MM-REACT's system paradigm with an alternative approach that
extends language models for multimodal scenarios through joint finetuning.
Code, demo, video, and visualization are available at
https://multimodal-react.github.io/
DOI: 10.48550/arxiv.2303.11381
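The abstract describes a prompt design in which images enter the conversation as file names and vision experts return textual observations (captions, OCR results, textualized box coordinates) that ChatGPT reasons over before deciding on the next action. The sketch below illustrates that reasoning-and-action loop under stated assumptions: the `<ACTION: ...>` tag format, the expert names, and the `call_llm` stub are hypothetical placeholders for illustration, not the authors' actual interface or code.

```python
# Minimal sketch of an MM-REACT-style reasoning-and-action loop.
# Assumptions: the "<ACTION: expert(arg)>" convention, the expert names,
# and call_llm are illustrative, not the paper's actual implementation.

import re
from typing import Callable, Dict

def call_llm(prompt: str) -> str:
    """Placeholder for a ChatGPT call; expected to return either a final
    answer or an action request such as '<ACTION: ocr(image_01.png)>'."""
    raise NotImplementedError("wire this to your chat model of choice")

# Hypothetical vision experts that return *textual* observations, e.g. captions
# or textualized spatial coordinates like "receipt: (x1=12, y1=8, x2=410, y2=630)".
VISION_EXPERTS: Dict[str, Callable[[str], str]] = {
    "caption": lambda path: f"caption({path}) -> 'a person holding a receipt'",
    "ocr": lambda path: f"ocr({path}) -> 'TOTAL $12.40'",
    "detect": lambda path: f"detect({path}) -> 'receipt: (12, 8, 410, 630)'",
}

ACTION_PATTERN = re.compile(r"<ACTION:\s*(\w+)\((.+?)\)>")

def mm_react(user_text: str, image_path: str, max_steps: int = 5) -> str:
    # The image is referenced by an aligned file name in the prompt, not by pixels.
    prompt = f"User: {user_text}\nImage: {image_path}\n"
    for _ in range(max_steps):
        reply = call_llm(prompt)
        match = ACTION_PATTERN.search(reply)
        if match is None:
            return reply  # no further expert needed: treat the reply as the final answer
        expert, arg = match.group(1), match.group(2)
        observation = VISION_EXPERTS[expert](arg)
        # Feed the expert's textual output back for another round of reasoning.
        prompt += f"Assistant: {reply}\nObservation: {observation}\n"
    return "Stopped after max_steps without a final answer."
```

Keeping every expert output textual is what lets an unmodified language model associate the observations with the user's question; the loop simply alternates between model reasoning and expert invocation until no further action is requested.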