Generating the captions for remote sensing images: A spatial-channel attention based memory-guided transformer approach


Bibliographic Details
Published in: Engineering Applications of Artificial Intelligence, 2022-09, Vol. 114, p. 105076, Article 105076
Main Authors: Gajbhiye, Gaurav O.; Nandedkar, Abhijeet V.
Format: Article
Language: English
Online Access: Full text
Description
Abstract: Remote sensing image captioning (RSIC) is a cross-modal interaction task in artificial intelligence that produces automatic descriptions of the Earth's geological properties captured from an aerial view. Convolutional neural network (CNN) and recurrent neural network (RNN) based encoder–decoder methods are widely adopted for RSIC, but they have two main constraints: first, single-level static convolutional features are insufficient to capture inherent geographical characteristics; second, regressive time-step sequences are difficult to train. To address these challenges, a novel fully-attentive framework entitled Spatial-Channel Attention based MEmory-guided Transformer (SCAMET) is proposed, which calibrates multilevel visual attentive features and aligns them with linguistic information through a persistent memory. Here, a CNN is integrated with a Transformer to generate captions for remote sensing images. To comprehend deeper semantic knowledge of the multi-scale, multi-shape, multi-object content of remote sensing images, multi-attentive visual features are extracted by employing spatial and channel attention separately. To decode the multi-attentive features into a caption, this work proposes a memory-guided Transformer as the linguistic decoder. Specifically, learnable memory elements are incorporated into the multi-head attention block, which perceives the intrinsic associations within the visual multi-attentive features and reconciles them with linguistic information. Ablation studies are conducted on three public RSIC datasets, Sydney-captions, UCM-captions, and RSICD, to evaluate the performance of the proposed method. The quantitative and qualitative analyses reveal that the proposed method performs satisfactorily compared to state-of-the-art approaches. This work also proposes a "Weighted Mean Score" index to evaluate the conclusive performance of a model across all datasets by leveraging the global contribution of each test set. The implementation of the proposed work is available at: https://github.com/GauravGajbhiye/SCAMET_RSIC.

Highlights:
•Automatic caption generation for remote sensing images is a critical task in the remote sensing domain, since remote sensing images contain multi-spectral, multi-scale, and multi-object visual contents with various translations.
•To tackle this issue, a novel fully-attentive CNN-Transformer approach is proposed, integrating a multi-attentive visual encoder and a memory-guided Transformer based linguistic decoder.
•The multi-attentive encoder provides distinct visual features by paying overall att…
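The encoder applies spatial and channel attention separately to the CNN features. As an illustration of that general idea only (not the authors' implementation), the PyTorch sketch below shows a common SE/CBAM-style form of the two attention types; the module names, reduction ratio, and kernel size are hypothetical.

```python
# Illustrative sketch of separate channel and spatial attention over CNN
# feature maps, in the common SE/CBAM style; the paper's exact attention
# design may differ, so treat this as a generic example only.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):   # reduction ratio is illustrative
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                          # x: (b, c, h, w)
        w = self.mlp(x.mean(dim=(2, 3)))           # squeeze spatial dims -> (b, c)
        return x * torch.sigmoid(w)[:, :, None, None]

class SpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                          # x: (b, c, h, w)
        # Pool across channels, then learn a single-channel spatial mask.
        pooled = torch.cat([x.mean(1, keepdim=True),
                            x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))
```

Applying the two modules to the same backbone features would yield two distinct attentive feature streams, in the spirit of the multi-attentive encoder described in the abstract.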
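The decoder's central modification is the set of learnable memory elements inside the multi-head attention block. The following is a minimal sketch assuming the "persistent memory" takes the common form of learnable key/value slots appended to the projected keys and values, as in memory-augmented Transformers; the slot count and dimensions are illustrative, not taken from the paper.

```python
# Minimal sketch (not the authors' code) of memory-augmented multi-head
# attention: learnable key/value memory slots, shared across the batch,
# are concatenated to the projected keys and values before attention.
import math
import torch
import torch.nn as nn

class MemoryGuidedAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_memory=40):  # sizes are hypothetical
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # Learnable persistent memory slots.
        self.mem_k = nn.Parameter(torch.randn(1, n_memory, d_model) / math.sqrt(d_model))
        self.mem_v = nn.Parameter(torch.randn(1, n_memory, d_model) / math.sqrt(d_model))

    def forward(self, query, context):
        b = query.size(0)
        # Append memory slots to the keys/values derived from the visual context.
        k = torch.cat([self.k_proj(context), self.mem_k.expand(b, -1, -1)], dim=1)
        v = torch.cat([self.v_proj(context), self.mem_v.expand(b, -1, -1)], dim=1)
        q = self.q_proj(query)

        def split(x):  # (b, seq, d_model) -> (b, n_heads, seq, d_head)
            return x.view(b, -1, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_head), dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, -1, self.n_heads * self.d_head)
        return self.out_proj(out)
```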
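The "Weighted Mean Score" index aggregates per-dataset results by the global contribution of each test set. One plausible reading, sketched below, weights each dataset's metric by its share of the combined test sets; the scores and test-set sizes in the example are hypothetical, not results from the paper.

```python
# Sketch of the "Weighted Mean Score" idea described in the abstract, under
# the assumption that each dataset's metric is weighted by its share of the
# combined test sets (the authors' exact weighting may differ).
def weighted_mean_score(scores, test_set_sizes):
    """Per-dataset metric values weighted by relative test-set size."""
    total = sum(test_set_sizes)
    return sum(s * n / total for s, n in zip(scores, test_set_sizes))

# Example with hypothetical metric values on the three RSIC test sets.
wms = weighted_mean_score(
    scores=[2.40, 2.90, 0.90],       # Sydney-captions, UCM-captions, RSICD
    test_set_sizes=[58, 210, 1093],  # illustrative test-set sizes
)
print(f"Weighted Mean Score: {wms:.3f}")
```

Under this reading, the largest test set (RSICD in the example) dominates the index, which matches the stated goal of reflecting each test set's global contribution.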
ISSN: 0952-1976
1873-6769
DOI: 10.1016/j.engappai.2022.105076