A Mask-Guided Transformer Network with Topic Token for Remote Sensing Image Captioning

Remote sensing image captioning aims to describe the content of images using natural language. In contrast with natural images, the scale, distribution, and number of objects generally vary in remote sensing images, making it hard to capture global semantic information and the relationships between...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Remote sensing (Basel, Switzerland) Switzerland), 2022-06, Vol.14 (12), p.2939
Hauptverfasser: Ren, Zihao, Gou, Shuiping, Guo, Zhang, Mao, Shasha, Li, Ruimin
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Remote sensing image captioning aims to describe the content of images using natural language. In contrast with natural images, the scale, distribution, and number of objects generally vary in remote sensing images, making it hard to capture global semantic information and the relationships between objects at different scales. In this paper, in order to improve the accuracy and diversity of captioning, a mask-guided Transformer network with a topic token is proposed. Multi-head attention is introduced to extract features and capture the relationships between objects. On this basis, a topic token is added into the encoder, which represents the scene topic and serves as a prior in the decoder to help us focus better on global semantic information. Moreover, a new Mask-Cross-Entropy strategy is designed in order to improve the diversity of the generated captions, which randomly replaces some input words with a special word (named [Mask]) in the training stage, with the aim of enhancing the model’s learning ability and forcing exploration of uncommon word relations. Experiments on three data sets show that the proposed method can generate captions with high accuracy and diversity, and the experimental results illustrate that the proposed method can outperform state-of-the-art models. Furthermore, the CIDEr score on the RSICD data set increased from 275.49 to 298.39.
ISSN:2072-4292
2072-4292
DOI:10.3390/rs14122939