Image captioning via proximal policy optimization

Bibliographic Details
Published in: Image and Vision Computing, 2021-04, Vol. 108, p. 104126, Article 104126
Authors: Zhang, Le; Zhang, Yanshuo; Zhao, Xin; Zou, Zexiao
Format: Article
Language: English
Description
Abstract: Image captioning is the task of generating natural-language captions for images. Training typically consists of two phases: first minimizing the XE (cross-entropy) loss, and then applying RL (reinforcement learning) to optimize CIDEr scores. Although there are many innovations in neural architectures, fewer works address the RL phase. Motivated by the recent state-of-the-art X-Transformer architecture [Pan et al., CVPR 2020], we apply PPO (Proximal Policy Optimization) to it to establish a further improvement. However, naively applying a vanilla policy gradient objective with the clipping form of PPO does not improve the result, so we introduce certain modifications. We show that PPO is capable of enforcing trust-region constraints effectively. We also find experimentally that performance decreases when PPO is combined with dropout regularization, and we analyze a possible reason in terms of the KL-divergence of the RL policies. The baseline adopted in the RL policy gradient estimator is generally sentence-level, so all words in the same sentence share the same baseline value. We instead use a word-level baseline obtained via Monte-Carlo estimation, so that different words can have different baseline values. With these modifications, by fine-tuning a pre-trained X-Transformer, we train a single model achieving a competitive result of 133.3% on the MSCOCO Karpathy test set. Source code is available at https://github.com/lezhang-thu/xtransformer-ppo.
Highlights:
• Proximal policy optimization is capable of enforcing trust-region constraints.
• Performance decreases when combining dropout with proximal policy optimization.
• Word-level, rather than sentence-level, baselines are preferred in CIDEr optimization.
ISSN: 0262-8856, 1872-8138
DOI: 10.1016/j.imavis.2021.104126
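
Below is a minimal, hypothetical sketch of the two RL ingredients the abstract highlights: a word-level baseline estimated from Monte-Carlo rollouts and a PPO-style clipped surrogate evaluated per caption word. It is not the authors' released implementation (see the repository linked in the abstract); tensor shapes, function names, and the clipping value are illustrative assumptions.

import torch

def word_level_baseline(rollout_rewards):
    # rollout_rewards: (batch, K, T) CIDEr-derived rewards attributed to each
    # word of K Monte-Carlo rollouts. Averaging over K gives a per-word
    # baseline, so different words in a caption get different baseline values.
    return rollout_rewards.mean(dim=1)  # (batch, T)

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.1):
    # Clipped PPO surrogate computed per word (token) of the sampled caption.
    # new_log_probs, old_log_probs, advantages: (batch, T) tensors over tokens.
    ratio = torch.exp(new_log_probs - old_log_probs)  # pi_theta / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Maximize the surrogate, i.e. minimize its negation.
    return -torch.min(unclipped, clipped).mean()

# Usage sketch (all names hypothetical): per-word advantages are the sampled
# caption's per-word rewards minus the word-level baseline.
#   baseline = word_level_baseline(rollout_rewards)      # (batch, T)
#   advantages = (word_rewards - baseline).detach()
#   loss = ppo_clipped_loss(new_log_probs, old_log_probs, advantages)

In this sketch, each word's advantage uses its own baseline value, in contrast to a sentence-level baseline where every word in the caption would share one value; the clipping term is what enforces the trust-region-like constraint on the policy update.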