Hierarchical Photo-Scene Encoder for Album Storytelling
Main Authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Order full text |
Summary: | In this paper, we propose a novel model with a hierarchical photo-scene
encoder and a reconstructor for the task of album storytelling. The photo-scene
encoder contains two sub-encoders, namely the photo and scene encoders, which
are stacked together and behave hierarchically to fully exploit the structural
information of the photos within an album. Specifically, the photo encoder
generates a semantic representation for each photo while exploiting temporal
relationships among them. The scene encoder, relying on the obtained photo
representations, is responsible for detecting the scene changes and generating
scene representations. Subsequently, the decoder dynamically and attentively
summarizes the encoded photo and scene representations to generate a sequence
of album representations, based on which a story consisting of multiple
coherent sentences is generated. In order to fully extract the useful semantic
information from an album, a reconstructor is employed to reproduce the
summarized album representations based on the hidden states of the decoder. The
proposed model can be trained in an end-to-end manner, which results in
improved performance over state-of-the-art methods on the public visual
storytelling (VIST) dataset. Ablation studies further demonstrate the
effectiveness of the proposed hierarchical photo-scene encoder and
reconstructor. |
---|---|
DOI: | 10.48550/arxiv.1902.00669 |
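
The hierarchy described in the summary (photo representations first, then scene representations built on top of them, then an attentive summary over both) can be illustrated with a minimal sketch. This is not the authors' implementation: the running-average photo encoder, the cosine-similarity scene boundary test, the `carry` and `threshold` parameters, and all function names are assumptions chosen only to make the data flow concrete. The decoder that generates sentences and the reconstructor are omitted.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def photo_encoder(features, carry=0.5):
    """Blend each photo's features with a running state.

    Assumption: a toy stand-in for a recurrent encoder that captures
    temporal relationships among photos.
    """
    state = [0.0] * len(features[0])
    encoded = []
    for feat in features:
        state = [carry * s + (1.0 - carry) * f for s, f in zip(state, feat)]
        encoded.append(state)
    return encoded


def scene_encoder(photo_reprs, threshold=0.8):
    """Group consecutive photos into scenes and average each group.

    Assumption: a scene change is declared when the cosine similarity of
    neighbouring photo representations drops below `threshold`; a trained
    model would learn this boundary decision instead.
    """
    scenes = [[photo_reprs[0]]]
    for prev, cur in zip(photo_reprs, photo_reprs[1:]):
        if cosine(prev, cur) < threshold:
            scenes.append([])  # start a new scene
        scenes[-1].append(cur)
    return [[sum(col) / len(scene) for col in zip(*scene)] for scene in scenes]


def attend(query, reprs):
    """Softmax-weighted sum of representations, scored against a query."""
    scores = [dot(query, r) for r in reprs]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum((w / total) * r[i] for w, r in zip(weights, reprs))
        for i in range(len(query))
    ]


def summarize_album(features, query):
    """Encode photos, derive scenes, then attend over both levels."""
    photos = photo_encoder(features)
    scenes = scene_encoder(photos)
    return attend(query, photos + scenes), len(scenes)


# Hypothetical 2-dimensional photo features: two photos of one kind, then two of another.
album = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
summary, scene_count = summarize_album(album, query=[1.0, 1.0])
```

In this toy album the similarity between the second and third photo representations falls below the threshold, so the photos are split into two scenes, and the attention step produces one summary vector from the four photo representations and two scene representations.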