Enhancing Multimodal Large Language Models with Multi-instance Visual Prompt Generator for Visual Representation Enrichment
Main authors: | , , , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Abstract: | Multimodal Large Language Models (MLLMs) have achieved state-of-the-art
performance in various visual language tasks by fusing visual representations
with LLMs through visual adapters. In this paper, we first establish that
adapters using query-based Transformers, such as Q-former, are simplified
Multi-instance Learning methods that do not consider instance
heterogeneity/correlation. We then propose a general component termed
Multi-instance Visual Prompt Generator (MIVPG) to incorporate enriched visual
representations into LLMs by taking advantage of instance correlation between
images or patches for the same sample. Quantitative evaluation on three public
vision-language (VL) datasets from different scenarios shows that the proposed
MIVPG improves on Q-former in the main VL tasks. |
---|---|
DOI: | 10.48550/arxiv.2406.02987 |