Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation
Format: Article
Language: English
Abstract: The increasing demand for intelligent systems capable of interpreting and reasoning about visual content requires the development of large Vision-and-Language Models (VLMs) that are not only accurate but also capable of explicit reasoning. This paper presents a novel approach to developing a VLM that can reason explicitly over visual content and textual instructions. We introduce a system that can ask questions to acquire necessary knowledge, thereby enhancing the robustness and explainability of the reasoning process. To this end, we developed a novel dataset, generated by a Large Language Model (LLM), designed to promote chain-of-thought reasoning combined with a question-asking mechanism. The dataset covers a range of tasks, from common ones such as caption generation to specialized VQA tasks that require expert knowledge. Using this dataset, we fine-tuned an existing VLM; this training enabled the model to generate questions and perform iterative reasoning during inference. The results demonstrate a step toward a more robust, accurate, and interpretable VLM, one that reasons explicitly and proactively seeks information when confronted with ambiguous visual input.
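The abstract describes training data in which LLM-generated chains of thought are interleaved with an explicit question-asking step. The record below is a purely illustrative sketch of what one such example might look like; the field names, the `THOUGHT:`/`QUESTION:`/`ANSWER:` tags, and the overall layout are assumptions, not the paper's published schema.

```python
# Hypothetical shape of one LLM-generated training example combining
# chain-of-thought reasoning with a question-asking step. All field names
# are illustrative assumptions; the paper's actual schema may differ.
example = {
    "image_id": "coco_000000123456",        # source image reference (assumed)
    "task": "vqa",                          # e.g. captioning or expert VQA
    "instruction": "What is the dog doing?",
    "reasoning": [
        "THOUGHT: The dog is running behind a group of sheep.",
        "QUESTION: What breed is the dog?",  # model asks for missing knowledge
        "ANSWER: It is a border collie.",    # reply supplied during generation
        "THOUGHT: Border collies are herding dogs.",
    ],
    "final_answer": "The dog is herding the sheep.",
}
```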
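At inference time, the abstract describes a model that alternates between reasoning and asking clarifying questions. The loop below is a minimal sketch of that idea, not the authors' implementation: `vlm_generate` and `answer_question` are stubs standing in for the fine-tuned VLM and its knowledge source, and the tag-based protocol and `max_rounds` cutoff are assumptions.

```python
# Sketch of an iterative reason-then-ask inference loop, with stubbed model
# calls so the example runs end to end. A real system would replace both
# stubs with calls to the fine-tuned VLM and an external knowledge source.

def vlm_generate(image, transcript):
    """Stub for the fine-tuned VLM: returns its next utterance given the
    image and the dialogue transcript so far."""
    if "ANSWER:" not in transcript:
        return "QUESTION: What breed is the dog in the foreground?"
    return "FINAL: The dog is a border collie, so it is herding the sheep."

def answer_question(question):
    """Stub knowledge source (a human, a retriever, or another model)."""
    return "ANSWER: It is a border collie."

def iterative_reasoning(image, instruction, max_rounds=3):
    transcript = instruction
    for _ in range(max_rounds):
        step = vlm_generate(image, transcript)
        transcript += "\n" + step
        if step.startswith("QUESTION:"):
            # The model asked for missing knowledge; feed the reply back in.
            transcript += "\n" + answer_question(step)
        else:
            break  # the model produced a final answer
    return transcript

print(iterative_reasoning(image=None, instruction="What is the dog doing?"))
```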
DOI: 10.48550/arxiv.2401.10005