View-Based Knowledge-Augmented Multimodal Semantic Understanding for Optical Remote Sensing Images
Published in: IEEE Transactions on Geoscience and Remote Sensing, 2025, Vol. 63, pp. 1-33
Format: Article
Language: English
Online access: Order full text
Abstract: Optical remote sensing (RS) images serve as a pivotal source of geographic information. With the continuous development of deep learning technology, the public's evolving demands on multisource optical RS have shifted from the recognition and acquisition of explicit features to the comprehension and application of the fine-grained semantics and relationships implied in images. To address this challenge, we propose a semantic-augmented approach that integrates a multiview knowledge graph for comprehensive understanding of optical RS images (RSMVKF). The RSMVKF delves into structured representations of external knowledge from different human-like cognitive views and further explores the discovery of high-level features across multiple modalities and granularities. Specifically, the RSMVKF consists of two stages. First, we guide a large language model (LLM) to condense relevant knowledge from lengthy external knowledge passages and generate a view-level knowledge graph (RS-VKG). Then, an asymmetric multimodal contrastive network model (RS-M2CL) is designed to perform efficient semantic augmentation. In this way, two types of contrastive loss functions, cross-modal and cross-granularity, are adopted to improve the understanding of implicit knowledge. The experimental results demonstrate that the RSMVKF greatly improves performance on several perception and reasoning tasks over feature-rich optical RS imagery. In particular, on perception tasks such as fine-grained object detection and k-nearest neighbor (KNN) retrieval, the RSMVKF yields enhancements of 6.7% and 8.1%, respectively. In addition, on knowledge-driven reasoning tasks such as RS image captioning (RSCP), RS visual grounding (RSVG), and RS visual question answering (RSVQA), the RSMVKF demonstrates superior performance with margins of 8.9%, 5.3%, and 11.4%, respectively.
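The abstract's "two types of contrastive loss functions, cross-modal and cross-granularity" are not specified further in this record; a common form for such losses is a symmetric InfoNCE-style objective, where matched embedding pairs (e.g., an image patch and its knowledge-graph text) sit on the diagonal of a similarity matrix. The sketch below is an illustrative assumption, not the paper's actual RS-M2CL loss; the function name and `temperature` default are hypothetical.

```python
import numpy as np

def contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss between two embedding sets.

    a, b: (batch, dim) arrays from two modalities (or two granularities);
    matched pairs share the same row index. Illustrative only.
    """
    # L2-normalize rows so the dot product is a cosine similarity
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # (batch, batch) similarity matrix

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))           # positives on the diagonal

    # Average over both matching directions (a->b and b->a)
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

The same function could serve both loss types by feeding it image/text embeddings (cross-modal) or coarse/fine embeddings of the same modality (cross-granularity).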
ISSN: 0196-2892, 1558-0644
DOI: 10.1109/TGRS.2025.3532349