InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language
Format: Article
Language: English
Summary: We present an interactive visual framework named InternGPT, or iGPT for short. The framework integrates chatbots with planning and reasoning capabilities, such as ChatGPT, with non-verbal instructions like pointing movements that enable users to directly manipulate images or videos on the screen. Pointing movements (including gestures, cursors, etc.) provide more flexibility and precision in vision-centric tasks that require fine-grained control, editing, and generation of visual content. The name InternGPT stands for **inter**action, **n**onverbal, and **chat**bots. Unlike existing interactive systems that rely on pure language, the proposed iGPT incorporates pointing instructions and thereby significantly improves both the efficiency of communication between users and chatbots and the accuracy of chatbots on vision-centric tasks, especially in complicated visual scenarios where the number of objects is greater than two. Additionally, iGPT uses an auxiliary control mechanism to improve the control capability of the LLM, and a large vision-language model termed Husky is fine-tuned for high-quality multi-modal dialogue (impressing ChatGPT-3.5-turbo with 93.89% GPT-4 quality). We hope this work can spark new ideas and directions for future interactive visual systems. The code is available at https://github.com/OpenGVLab/InternGPT.
DOI: 10.48550/arxiv.2305.05662
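The abstract describes pairing a language instruction with a pointing event (a click or gesture on the displayed image) so the system can resolve exactly which object the user means. The sketch below illustrates that pairing only; it is not the actual InternGPT API, and every name in it (PointingEvent, VisualInstruction, build_request) is a hypothetical placeholder introduced for illustration.

```python
# Hypothetical sketch of combining a pointing event with a language
# instruction, as described in the abstract. Not the InternGPT API.
from dataclasses import dataclass


@dataclass
class PointingEvent:
    """A click or gesture on the displayed image, in pixel coordinates."""
    x: int
    y: int


@dataclass
class VisualInstruction:
    """A user request that pairs language with a pointing event."""
    text: str             # e.g. "remove this object"
    point: PointingEvent  # disambiguates *which* object "this" refers to
    image_path: str


def build_request(instr: VisualInstruction) -> dict:
    # A purely language-driven system would need the user to describe the
    # target object in words; attaching coordinates lets a planner resolve
    # the referent directly, which is the gain claimed for crowded scenes.
    return {
        "image": instr.image_path,
        "prompt": instr.text,
        "point": (instr.point.x, instr.point.y),
    }


if __name__ == "__main__":
    request = build_request(
        VisualInstruction(
            text="remove this object",
            point=PointingEvent(x=412, y=230),
            image_path="street_scene.jpg",
        )
    )
    print(request)
```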