Injecting Linguistic Into Visual Backbone: Query-Aware Multimodal Fusion Network for Remote Sensing Visual Grounding


Full Description

Bibliographic Details
Published in: IEEE Transactions on Geoscience and Remote Sensing, 2024, Vol. 62, pp. 1-14
Main Authors: Li, Chongyang, Zhang, Wenkai, Bi, Hanbo, Li, Jihao, Li, Shuoke, Yu, Haichen, Sun, Xian, Wang, Hongqi
Format: Article
Language: English
Description
Abstract: The remote sensing visual grounding (RSVG) task focuses on accurately identifying and localizing specific targets in remote sensing (RS) images using descriptive query expressions. Existing methods extract visual and textual features independently, ignoring early complementary information between image and text. This leads to information loss and misalignment, limiting the model's ability to distinguish similar targets. To address this challenge, we propose the query-aware multimodal fusion network (QAMFN), which introduces an innovative query-guided visual attention (QGVA) mechanism in the early stages of the visual encoder. This mechanism integrates textual information during early visual feature extraction, thereby resolving the issue of missing image-text complementary information. By injecting textual information into the visual encoding process, QGVA ensures that the visual backbone focuses on local features highly relevant to the query. Additionally, to enhance the model's ability to integrate multimodal information and adapt to more complex RS images, we introduce the text-semantic attention-guided masking (TAM) module. TAM aggregates multimodal features processed by the backbones and filters out redundant information, producing high-quality fused features. Experiments demonstrate that our approach sets a new record on the DIOR-RSVG dataset, improving accuracy to 81.67% (an absolute increase of 4.98%).
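
The abstract describes QGVA only at a high level: textual features from the query expression are injected into the early stages of the visual encoder so that the backbone attends to query-relevant regions. The PyTorch snippet below is a minimal, hypothetical sketch of one way such an injection could be realized with cross-attention; the class name, projections, residual scheme, and all dimensions are assumptions for illustration and do not reflect the paper's actual implementation.

    # Hypothetical sketch of a query-guided visual attention (QGVA)-style block.
    # All names, dimensions, and the residual injection scheme are assumptions;
    # the paper's actual design is not detailed in this record.
    import torch
    import torch.nn as nn

    class QueryGuidedVisualAttention(nn.Module):
        """Cross-attention letting early visual features attend to text tokens."""

        def __init__(self, vis_dim: int, txt_dim: int, num_heads: int = 8):
            super().__init__()
            self.txt_proj = nn.Linear(txt_dim, vis_dim)   # align text dim to visual dim
            self.cross_attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(vis_dim)

        def forward(self, vis_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
            # vis_tokens: (B, N_v, vis_dim) flattened early-stage feature-map tokens
            # txt_tokens: (B, N_t, txt_dim) token embeddings of the query expression
            txt = self.txt_proj(txt_tokens)
            attended, _ = self.cross_attn(query=vis_tokens, key=txt, value=txt)
            # Residual injection keeps the backbone's original visual features intact
            return self.norm(vis_tokens + attended)

    if __name__ == "__main__":
        block = QueryGuidedVisualAttention(vis_dim=256, txt_dim=768)
        vis = torch.randn(2, 49, 256)   # e.g. a 7x7 early feature map, flattened
        txt = torch.randn(2, 20, 768)   # e.g. BERT-style embeddings of the query
        print(block(vis, txt).shape)    # torch.Size([2, 49, 256])

In an architecture like the one the abstract outlines, a block of this kind would presumably sit inside several early encoder stages rather than stand alone; it is shown in isolation here only to make the tensor shapes concrete.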
ISSN:0196-2892
1558-0644
DOI:10.1109/TGRS.2024.3450303