Remote Sensing Image-Text Retrieval With Implicit-Explicit Relation Reasoning
Published in: IEEE Transactions on Geoscience and Remote Sensing, 2024, Vol. 62, pp. 1-11
Main authors: , , , , , , ,
Format: Article
Language: English
Online access: Order full text
Summary: Remote sensing image-text retrieval (RSITR) has become a research hotspot in recent years owing to its wide range of applications. Existing methods, based on either local or global feature matching, overlook two problems specific to remote sensing (RS) images: visual deviation caused by sensing variation, and mismatches between geographically nearby images and their texts. This work notes that these problems limit retrieval accuracy for RSITR. To handle them, we present IERR, an implicit-explicit relation reasoning framework that learns relations between local visual-textual tokens and enhances global image-text matching without requiring additional prior supervision. Specifically, masked image modeling (MIM) and masked language modeling (MLM) are used for symmetric mask reasoning consistency alignment. Meanwhile, masked features (i.e., the implicit relation) and unmasked features (i.e., the explicit relation) are fed into a multimodal interaction encoder to enhance the representations of the textual-visual features. Extensive experimental results on the RSICD and RSITMD datasets demonstrate the superiority of IERR over 17 baselines.
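The symmetric masking design described in the summary can be illustrated with a minimal sketch. Everything below is an illustrative assumption rather than the authors' released implementation: the `IERRSketch` class, the shared learnable mask token, the single joint transformer used as the multimodal interaction encoder, and the linear MIM/MLM heads and dimensions are all hypothetical. The sketch only shows the idea that masked positions (the implicit relation) and unmasked positions (the explicit relation) are reasoned over jointly, with symmetric MIM and MLM reconstruction targets.

```python
import torch
import torch.nn as nn

class IERRSketch(nn.Module):
    """Minimal sketch of symmetric implicit-explicit mask reasoning.

    Hypothetical module: names, sizes, and structure are illustrative
    assumptions, not the paper's actual architecture.
    """

    def __init__(self, dim=512, num_layers=4, num_heads=8, vocab_size=30522):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.interaction_encoder = nn.TransformerEncoder(layer, num_layers)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.mim_head = nn.Linear(dim, dim)         # reconstruct masked visual features (MIM)
        self.mlm_head = nn.Linear(dim, vocab_size)  # predict masked word ids (MLM)

    def forward(self, img_tokens, txt_tokens, img_masked, txt_masked):
        # img_tokens: (B, Nv, dim) visual token features; img_masked: (B, Nv) bool
        # txt_tokens: (B, Nt, dim) textual token features; txt_masked: (B, Nt) bool
        # Masked positions receive a learnable mask token (implicit relation);
        # unmasked positions keep their original features (explicit relation).
        img_in = torch.where(img_masked.unsqueeze(-1),
                             self.mask_token.expand_as(img_tokens), img_tokens)
        txt_in = torch.where(txt_masked.unsqueeze(-1),
                             self.mask_token.expand_as(txt_tokens), txt_tokens)
        # Joint reasoning over the concatenated visual and textual sequences.
        fused = self.interaction_encoder(torch.cat([img_in, txt_in], dim=1))
        n = img_tokens.size(1)
        img_out, txt_out = fused[:, :n], fused[:, n:]
        # Symmetric reconstruction targets: MIM on vision, MLM on language,
        # whose agreement can serve as a consistency alignment signal.
        return self.mim_head(img_out), self.mlm_head(txt_out)


# Example usage with random features (batch of 2, 49 visual / 20 text tokens).
model = IERRSketch()
img = torch.randn(2, 49, 512)
txt = torch.randn(2, 20, 512)
img_m = torch.rand(2, 49) < 0.3   # mask ~30% of visual tokens
txt_m = torch.rand(2, 20) < 0.15  # mask ~15% of text tokens
mim_pred, mlm_logits = model(img, txt, img_m, txt_m)
print(mim_pred.shape, mlm_logits.shape)  # (2, 49, 512) and (2, 20, 30522)
```

The design choice worth noting is that the same interaction encoder sees both the masked (implicit) and unmasked (explicit) token streams, so reconstruction quality depends on cross-modal relations rather than on either modality alone; the paper's actual loss formulation is not reproduced here.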
ISSN: 0196-2892, 1558-0644
DOI: 10.1109/TGRS.2024.3466909