Re-Caption: Saliency-Enhanced Image Captioning Through Two-Phase Learning


Bibliographic Details
Published in: IEEE Transactions on Image Processing, 2020-01, Vol. 29, pp. 694-709
Authors: Zhou, Lian; Zhang, Yuejie; Jiang, Yu-Gang; Zhang, Tao; Fan, Weiguo
Format: Article
Language: English
Description
Abstract: Visual saliency and semantic saliency are important for image captioning. However, a single-phase image captioning model benefits little from limited saliency information when no saliency predictor is available. In this paper, a novel saliency-enhanced re-captioning framework based on two-phase learning is proposed to enhance single-phase image captioning. In the framework, both visual and semantic saliency cues are distilled from the first-phase model and fused into the second-phase model for model self-boosting. The visual saliency mechanism generates a saliency map and a saliency mask for an image without learning a separate saliency predictor. The semantic saliency mechanism sheds light on the properties of the words tagged as nouns in a caption. Moreover, a third type of saliency, sample saliency, is proposed to compute the saliency degree of each training sample, which helps make image captioning more robust. How to combine the three types of saliency for a further performance boost is also examined. The framework treats an image captioning model as a saliency extractor, which may benefit other captioning models and related tasks. Experimental results on both the Flickr30k and MSCOCO datasets show that the saliency-enhanced models obtain promising performance gains.
ISSN: 1057-7149
EISSN: 1941-0042
DOI: 10.1109/TIP.2019.2928144
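
Illustrative sketch (not from the paper): the abstract describes distilling visual and semantic saliency cues from a first-phase captioning model. The Python sketch below shows one plausible reading of those cues, assuming the first-phase model exposes per-word spatial attention weights and a POS-tagged decoded caption; the shapes, the averaging rule, and the mask threshold are illustrative assumptions, not the authors' method.

    # Minimal sketch of first-phase saliency cues (assumed interfaces, for illustration only).
    import numpy as np

    def visual_saliency(attention, grid_hw, keep_ratio=0.5):
        """Aggregate per-word attention weights into a saliency map and a binary mask.

        attention : (T, H*W) attention weights from a first-phase captioning model (assumed)
        grid_hw   : (H, W) spatial size of the feature grid
        keep_ratio: fraction of cells kept in the mask (assumed hyper-parameter)
        """
        saliency = attention.mean(axis=0)                  # average over caption words
        saliency /= saliency.sum() + 1e-8                  # renormalize to a distribution
        threshold = np.quantile(saliency, 1.0 - keep_ratio)
        mask = (saliency >= threshold).astype(np.float32)  # binary saliency mask
        return saliency.reshape(grid_hw), mask.reshape(grid_hw)

    def semantic_saliency(tokens, pos_tags):
        """Return the noun tokens of the first-phase caption as semantic saliency cues."""
        return [tok for tok, tag in zip(tokens, pos_tags) if tag.startswith("NN")]

    # Toy usage: random attention over a 7x7 grid and a POS-tagged caption.
    attn = np.random.rand(6, 49)
    sal_map, sal_mask = visual_saliency(attn, (7, 7))
    nouns = semantic_saliency(["a", "dog", "runs", "on", "the", "beach"],
                              ["DT", "NN", "VBZ", "IN", "DT", "NN"])
    print(sal_map.shape, int(sal_mask.sum()), nouns)       # (7, 7) 25 ['dog', 'beach']

In a two-phase setup of this kind, the map, mask, and noun cues produced by the first-phase model would be fed back as extra inputs or supervision when the second-phase model is trained, which is the self-boosting idea the abstract refers to.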