Geometry Attention Transformer with position-aware LSTMs for image captioning

Full description

Bibliographic details
Published in: Expert Systems with Applications, 2022-09, Vol. 201, p. 117174, Article 117174
Main authors: Wang, Chi; Shen, Yulin; Ji, Luping
Format: Article
Language: English
Subjects:
Online access: Full text
Description
Abstract: In recent years, Transformer architectures have been widely applied to image captioning with impressive performance. However, previous works often neglect the geometry and position relations among visual objects, although these relations are widely regarded as crucial for good captioning results. To further improve Transformer-based image captioning, this paper proposes an improved Geometry Attention Transformer (GAT) framework. To obtain geometric representation ability, two novel geometry-aware architectures are designed for the encoder and decoder of GAT: i) a geometry gate-controlled self-attention refiner, and ii) a group of position-LSTMs. The first explicitly incorporates relative spatial information into the image representations during encoding, and the second precisely informs the decoder of relative word positions when generating caption text. Image representations and spatial information are extracted by a pretrained Faster R-CNN network. An ablation study shows that these two optimization modules effectively improve captioning performance. Comparative experiments on the MS COCO and Flickr30K datasets also show that GAT often outperforms current state-of-the-art image captioning models.

Highlights:
• An improved image captioning model, GAT, is proposed on the Transformer framework.
• We design an encoder augmented with a gate-controlled geometry self-attention refiner (GSR).
• We reconstruct a decoder enhanced with groups of position-LSTMs.
• Ablation experiments and comparisons are performed on COCO and Flickr30K.
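The idea of gating relative spatial information into self-attention scores can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the pairwise box features (log center offsets and size ratios, as commonly used in geometry-aware attention), the projection weight `w_geo`, and the scalar `gate` are all hypothetical stand-ins for the paper's gate-controlled refiner.

```python
import numpy as np

def box_relation_features(boxes):
    """Pairwise relative-geometry features between bounding boxes.
    boxes: (N, 4) array of (cx, cy, w, h). Returns (N, N, 4)."""
    cx, cy, w, h = boxes.T
    dx = np.log(np.abs(cx[:, None] - cx[None, :]) / w[:, None] + 1e-3)
    dy = np.log(np.abs(cy[:, None] - cy[None, :]) / h[:, None] + 1e-3)
    dw = np.log(w[None, :] / w[:, None])   # relative width ratio
    dh = np.log(h[None, :] / h[:, None])   # relative height ratio
    return np.stack([dx, dy, dw, dh], axis=-1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def geometry_gated_attention(q, k, v, boxes, w_geo, gate=0.5):
    """Self-attention whose scores mix content similarity with a
    geometry bias, weighted by a gate in [0, 1] (illustrative only).
    q, k, v: (N, d); w_geo: (4,) projection of geometry features."""
    d = q.shape[-1]
    content = (q @ k.T) / np.sqrt(d)               # content-based scores
    geom = box_relation_features(boxes) @ w_geo    # (N, N) geometry bias
    scores = content + gate * geom                 # gated combination
    return softmax(scores) @ v
```

With `gate = 0`, this reduces to standard scaled dot-product self-attention; larger gate values let pairwise box geometry reweight which regions attend to each other.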
ISSN: 0957-4174
1873-6793
DOI: 10.1016/j.eswa.2022.117174