Effective Multimodal Encoding for Image Paragraph Captioning

Bibliographic Details
Published in: IEEE Transactions on Image Processing, 2022, Vol. 31, p. 1-1
Authors: Nguyen, Thanh-Son; Fernando, Basura
Format: Article
Language: English
Description
Abstract: In this paper, we present a regularization-based image paragraph generation method. We propose a novel multimodal encoding generator (MEG) that produces an effective multimodal encoding capturing not only an individual sentence but also visual and paragraph-sequential information. Using the encoding generated by MEG, we regularize a paragraph generation model, which improves the captioning model's results on all evaluation metrics. With MEG-based regularization, our paragraph generation model obtains state-of-the-art results on the Stanford paragraph dataset after further optimization with reinforcement learning. Moreover, we perform an extensive empirical analysis of the capabilities of the MEG encoding. A qualitative visualization based on t-distributed stochastic neighbor embedding (t-SNE) illustrates that the sentence encodings generated by MEG capture some level of semantic information. We also demonstrate that the MEG encoding captures meaningful textual and visual information by performing multimodal sentence retrieval and image instance retrieval given a paragraph query.
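
The abstract does not spell out how the MEG encoding enters the training objective, so the following is only a minimal sketch of the general idea of regularizing a paragraph captioner by pulling its sentence-level states toward a precomputed multimodal encoding. All names (RegularizedCaptionerLoss, sent_hidden, meg_encoding, lambda_reg) and the cosine-distance penalty are illustrative assumptions, not the paper's actual formulation.

    # Hypothetical sketch (PyTorch): word-level cross-entropy plus a regularizer
    # that aligns the captioner's sentence states with a precomputed multimodal
    # encoding. Shapes, names, and the cosine penalty are assumptions only; they
    # are not taken from the MEG paper.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RegularizedCaptionerLoss(nn.Module):
        def __init__(self, hidden_dim: int, enc_dim: int, lambda_reg: float = 0.1):
            super().__init__()
            # Project the captioner's sentence-level hidden state into the
            # space of the (frozen) multimodal encoding.
            self.proj = nn.Linear(hidden_dim, enc_dim)
            self.lambda_reg = lambda_reg

        def forward(self, logits, targets, sent_hidden, meg_encoding):
            # logits:       (batch, seq_len, vocab)     word predictions
            # targets:      (batch, seq_len)            gold word indices, 0 = padding
            # sent_hidden:  (batch, n_sents, hidden_dim) sentence-level states
            # meg_encoding: (batch, n_sents, enc_dim)    target multimodal encodings
            ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                 targets.reshape(-1), ignore_index=0)
            # Regularizer: cosine distance between projected sentence states
            # and their target multimodal encodings.
            sim = F.cosine_similarity(self.proj(sent_hidden), meg_encoding, dim=-1)
            reg = (1.0 - sim).mean()
            return ce + self.lambda_reg * reg

In such a setup the regularizer would typically carry a small weight relative to the generation loss and could be kept or dropped during a later reinforcement-learning fine-tuning stage; the paper itself should be consulted for the exact objective.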
ISSN: 1057-7149
1941-0042
DOI: 10.1109/TIP.2022.3211467