MGIMM: Multi-Granularity Instruction Multimodal Model for Attribute-Guided Remote Sensing Image Detailed Description

Bibliographic Details
Published in: IEEE Transactions on Geoscience and Remote Sensing, 2024, Vol. 62, pp. 1-13
Authors: Yang, Cong; Li, Zuchao; Zhang, Lefei
Format: Article
Language: English
Abstract: Generating detailed textual descriptions of remote sensing images is challenging because it requires capturing both global and local visual information. The complexity of backgrounds and the scale variations among targets make it difficult to align visual regions with their corresponding textual attributes. Furthermore, large multimodal models, while effective in general scenarios, struggle in remote sensing due to their lack of specialized knowledge and regional awareness. To address these issues, this article proposes an attribute-guided multi-granularity instruction multimodal model (MGIMM) for detailed description of remote sensing images. MGIMM guides the multimodal model to learn the consistency between visual regions and their corresponding text attributes (such as object names, colors, and shapes) through region-level instruction tuning. Then, with the multimodal model aligned on region attributes and guided by multi-granularity visual features, MGIMM perceives both region-level and global image information and leverages a large language model to generate comprehensive descriptions of remote sensing images. Because there is no standard benchmark for generating detailed descriptions of remote sensing images, we construct a dataset featuring 38,320 region-attribute pairs and 23,463 pairs of images with detailed descriptions. Comparisons with various advanced methods on this dataset demonstrate the effectiveness of MGIMM's region-attribute-guided learning approach. The code is available at https://github.com/yangcong356/MGIMM.git.
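
For readers who want a concrete picture of the two-granularity instruction data the abstract describes, here is a minimal, hypothetical Python sketch of how region-level and image-level instruction pairs might be assembled. The field names, prompt wording, and coordinate convention are illustrative assumptions, not the authors' actual schema; the linked repository contains the real implementation.

```python
# Hypothetical sketch of two-granularity instruction-tuning data; all field
# names, prompts, and the box coordinate convention are assumptions for
# illustration, not MGIMM's actual data format.
import json

def region_sample(image_id, box, attributes):
    """Stage 1 (assumed): region-level pair aligning a box with its attributes."""
    x1, y1, x2, y2 = box  # normalized corner coordinates in [0, 1]; convention assumed
    region_tag = f"<{x1:.2f},{y1:.2f},{x2:.2f},{y2:.2f}>"
    return {
        "image": image_id,
        "instruction": f"Describe the object in region {region_tag}.",
        "response": ", ".join(attributes),  # e.g., object name, color, shape
    }

def image_sample(image_id, description):
    """Stage 2 (assumed): image-level pair for the detailed global description."""
    return {
        "image": image_id,
        "instruction": "Provide a detailed description of this remote sensing image.",
        "response": description,
    }

samples = [
    region_sample("rs_0001.png", (0.10, 0.22, 0.34, 0.41),
                  ["airplane", "white", "swept-wing"]),
    image_sample("rs_0001.png",
                 "An airport scene with several white airplanes parked near a terminal."),
]
print(json.dumps(samples, indent=2))
```

Under this reading of the abstract, the model is first tuned on the region-level pairs to align visual regions with attribute text, and then on the image-level pairs so the large language model can compose the region and global information into a full description.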
ISSN: 0196-2892, 1558-0644
DOI: 10.1109/TGRS.2024.3497976